# [Hardware] I found my RC5 parallel core design notes

jbass at dmsd.com
Tue Sep 14 06:48:49 EDT 2004

```
Updated a few numbers for larger devices, and current dnet stats.

SBOX implemented as cascaded 26 bit LUT shifters would require
32*2=64 LUTs.  This is the same cost as the serial design per stage.

The equivalent functions as described in C for an unrolled loop are:

unsigned long *S, *L, *E;
S[1] = ROTL3(S[-26] + S[0] + L[0]); S++;
L[1] = ROTL(L[-2] + S[0] + L[0], S[0] + L[0]); L++;
E[1] = ROTL(E[-1] ^ E[0], E[0]) + S[0]; E++;

The S function requires two 32 bit adder stages, 32*2=64 LUTs. The
ROTL3 function is just a fixed rotate implemented by wiring. In
addition, the SBox can be implemented as thirty two 26 bit LUT based
shift registers, requiring another 64 LUTs.

The L function also requires two 32 bit adder stages, 32*2=64 LUTs.
The ROTL function, however, requires five 32 bit wide stages of 2:1
MUXs in a log shifter arrangement, for a total of 32*5=160 2:1 MUXs.
Worst case this is one LUT per MUX, but in reality the H, F5, F6,
and FX MUXs can also be used to reduce the number of LUTs, depending
on the device. For XC2V and XC2VP devices, only 32+16+16+16+16=96 LUTs
are required. Since Virtex devices do not have F7 MUXs, the six F7
MUXs must be replaced with 2:1 LUT MUXs, for a total of 96+6=102
LUTs for an XCV Virtex ROTL design.

The E function XOR is pretty much free for Virtex devices, but the
ROTL function has the same cost as above. If fully unrolled, only
the last 26 stages need an E function.

Unrolled there are 78 stages (actually a few less than that), where
most stages have a worst case path of 9 LUTs and 30 levels of ripple
carry chain delay, or about 20ns latency pipelined at the stage level.
Pipelined at the S, L add, and L ROTL level the latency is around
6-8ns, allowing around 125M keys/sec per engine, or about 2,500 blocks
per day per device (depending on device and speed grade).  Tighter
pipelining is possible, but will start to use more LUTs.

S+L stages are 64+64+64+96=288 LUTs for XC2V or XC2VP devices, and
64+64+64+102=294 LUTs for XCV devices. Unrolled across 78 stages,
XC2V and XC2VP devices must have 78*288=22,464 LUTs, and XCV devices
must have 78*294=22,932 LUTs, plus the controller state machine, to
solve one key per clock. Thus the smallest unrolled Virtex device is
an XCV1000, and the smallest unrolled Virtex II device is an XC2V3000.
Very rough performance numbers are in the +/- 30% range of:

Device            LUTs  RC5's      Keys/Sec     Blocks/Day

XCV1000         24,576    1     137,176,347      2,760
XCV1600         31,104    1     173,613,815      3,493

XC2V3000        28,672    1     163,373,219      3,287
XC2V4000        46,080    2     262,564,103      5,282
XC2V6000        67,584    3     385,094,017      7,747
XC2V8000        93,184    4     530,962,963     10,681
XC2V10000      122,880    5     700,170,940     14,085

Given that the fastest team currently knocks off 100,000 blocks per day,
that rate could be matched by around 7 of the largest and fastest
Xilinx devices. The entire DNet key rate could be matched by fewer than
250 of these largest and fastest devices. I believe the largest device
with the fastest speed grade may be able to do two or three times this
rate with better pipelining.

For smaller devices, a number of rolled RC5 engines, each with one
S, L, and E function, can be made to fit, at the cost of a more
complex controller state machine and lower performance. Each RC5
engine would then cost 64+64+64+96+96=384 LUTs for XC2V or XC2VP
devices, and 64+64+64+102+102=396 LUTs for XCV devices. Typical
solution time would be 78 clocks per key. Ball park performance for
various Xilinx devices (varying +/- 30% by speed grade) would be roughly:

Device             LUTs  RC5's     Keys/Sec     Blocks/Day

XCV50             1,536    4      6,365,190        128
XCV100            2,400    6      9,945,610        200
XCV200            4,704   12     19,493,395        392
XCV300            6,144   16     25,460,761        512
XCV400            9,600   24     39,782,440        800
XCV600           13,824   35     57,286,713      1,152
XCV800           18,816   48     77,973,582      1,569
XCV1000          24,576   62    101,843,046      2,049
XCV1600          31,104   79    128,895,105      2,593

XC2V40              512    1      2,188,034         44
XC2V80            1,024    3      4,376,068         88
XC2V250           3,072    8     13,128,205        264
XC2V500           6,144   16     26,256,410        528
XC2V1000         10,240   27     43,760,684        880
XC2V1500         15,360   40     65,641,026      1,320
XC2V2000         21,504   56     91,897,436      1,849
XC2V3000         28,672   75    122,529,915      2,465
XC2V4000         46,080  120    196,923,077      3,961
XC2V6000         67,584  176    288,820,513      5,810
XC2V8000         93,184  243    398,222,222      8,011
XC2V10000       122,880  320    525,128,205     10,564

Slightly denser and higher performance solutions are possible
by sharing one E function among every three RC5 engines.

Have fun,
John Bass
```
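
As a rough software model of the S, L, and E functions described in the post, the sketch below runs the 78 key-schedule steps of RC5-32/12 and feeds the E function only during the last 26 steps, where the final S words appear. It is an illustration, not the author's hardware design. The names `s_step`, `l_step`, `e_step`, and `rc5_pipeline` are hypothetical, and the key length `c` (in 32 bit words) is left as a parameter because the post does not state it; `c = 3` would correspond to a 72 bit key.

```c
#include <stdint.h>

#define RC5_T     26               /* S table words: 2 * (12 rounds + 1) */
#define RC5_STEPS (3 * RC5_T)      /* 78 key-schedule steps */
#define RC5_P32   0xB7E15163u      /* standard RC5 magic constants */
#define RC5_Q32   0x9E3779B9u

/* Rotate left by the low 5 bits of n. */
static uint32_t rotl32(uint32_t x, uint32_t n) {
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* S function: two adds, then a fixed rotate by 3 (wiring only in hardware). */
static uint32_t s_step(uint32_t s_old, uint32_t a, uint32_t b) {
    return rotl32(s_old + a + b, 3);
}

/* L function: two adds, then a data-dependent rotate. */
static uint32_t l_step(uint32_t l_old, uint32_t a, uint32_t b) {
    return rotl32(l_old + a + b, a + b);
}

/* E function: XOR, data-dependent rotate, then add of the fresh S word. */
static uint32_t e_step(uint32_t e_prev2, uint32_t e_prev1, uint32_t s) {
    return rotl32(e_prev2 ^ e_prev1, e_prev1) + s;
}

/* key holds c words, low word first; the caller must keep 1 <= c <= 8. */
void rc5_pipeline(const uint32_t *key, int c,
                  const uint32_t pt[2], uint32_t ct[2]) {
    uint32_t S[RC5_T], L[8];
    uint32_t a = 0, b = 0, e0 = 0, e1 = 0;
    int i = 0, j = 0;

    for (int k = 0; k < c; k++) L[k] = key[k];
    S[0] = RC5_P32;
    for (int k = 1; k < RC5_T; k++) S[k] = S[k - 1] + RC5_Q32;

    for (int k = 0; k < RC5_STEPS; k++) {
        a = S[i] = s_step(S[i], a, b);
        b = L[j] = l_step(L[j], a, b);
        if (k >= 2 * RC5_T) {
            /* Third pass: S[i] is final, so encryption can consume it now.
               The first two steps only add S[0] and S[1] to the plaintext. */
            uint32_t e = (i < 2) ? pt[i] + a : e_step(e0, e1, a);
            e0 = e1;
            e1 = e;
        }
        i = (i + 1) % RC5_T;
        j = (j + 1) % c;
    }
    ct[0] = e0;   /* A half of the ciphertext */
    ct[1] = e1;   /* B half of the ciphertext */
}
```

Calling `rc5_pipeline` with `c = 4` and an all-zero key and plaintext should reproduce the first RC5-32/12/16 test vector from Rivest's RC5 paper, which gives a quick check that the three functions are wired together correctly.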
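
The log shifter arrangement mentioned for ROTL can be pictured in C as five cascaded stages, each choosing between its input and a copy rotated by a fixed power of two. This is only a behavioral sketch under that reading of the post; the helper names `mux_stage` and `rotl_log` are made up for illustration, and nothing here models how the MUXs map onto LUTs or the F5/F6/FX resources.

```c
#include <stdint.h>

/* One 32 bit wide row of 2:1 MUXs: pass x through, or pass x rotated
   left by a fixed amount. 'fixed' is 1, 2, 4, 8, or 16, so neither
   shift below is by 0 or 32. */
static uint32_t mux_stage(uint32_t x, unsigned fixed, unsigned sel) {
    uint32_t rotated = (x << fixed) | (x >> (32u - fixed));
    return sel ? rotated : x;
}

/* Rotate left by the low 5 bits of 'amount' using 5 stages,
   i.e. 32 * 5 = 160 2:1 MUXs if built literally. */
uint32_t rotl_log(uint32_t x, uint32_t amount) {
    for (unsigned stage = 0; stage < 5; stage++) {
        x = mux_stage(x, 1u << stage, (amount >> stage) & 1u);
    }
    return x;
}
```

Each bit of the rotate amount drives the select line of one stage, which is why only the low five bits of the S+L sum matter for a 32 bit word.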
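
The LUT budgets and the blocks-per-day column can be re-derived with a few lines of arithmetic. The sketch below uses the per-function LUT counts from the post; the function names are hypothetical, and the block size of 2^32 keys is an assumption inferred from the tables (for example 137,176,347 keys/sec against 2,760 blocks/day), not something the post states.

```c
#include <stdint.h>

enum {
    ADD2_LUTS = 64,    /* two 32 bit adder stages */
    SBOX_LUTS = 64,    /* thirty two 26 bit LUT shift registers */
    ROTL_XC2V = 96,    /* ROTL on XC2V / XC2VP */
    ROTL_XCV  = 102,   /* ROTL on XCV (no F7 MUXs) */
    STAGES    = 78
};

/* One S+L stage: S adders + SBox + L adders + L ROTL. */
int sl_stage_luts(int rotl_luts) {
    return ADD2_LUTS + SBOX_LUTS + ADD2_LUTS + rotl_luts;
}

/* Fully unrolled key schedule, excluding the controller state machine. */
int unrolled_luts(int rotl_luts) {
    return STAGES * sl_stage_luts(rotl_luts);
}

/* Rolled engine: one S+L stage plus a second ROTL for the E function. */
int rolled_engine_luts(int rotl_luts) {
    return sl_stage_luts(rotl_luts) + rotl_luts;
}

/* Blocks per day, rounded to nearest, assuming 2^32 keys per block. */
uint64_t blocks_per_day(uint64_t keys_per_sec) {
    const uint64_t block = 1ull << 32;
    return (keys_per_sec * 86400ull + block / 2) / block;
}
```

With these definitions, `unrolled_luts(ROTL_XC2V)` and `unrolled_luts(ROTL_XCV)` give the 22,464 and 22,932 LUT totals, and `rolled_engine_luts` gives the 384 and 396 LUT per-engine figures used for the smaller devices.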