[Hardware] The market of ASICs (One GigaKey / Second?)

jbass at dmsd.com jbass at dmsd.com
Tue Aug 10 05:00:02 EDT 2004

"Dan Oetting" <dan_oetting at uswest.net> writes:
> The overall block diagram looks much like the parallel version. The 
> details for each iteration stage will look considerably different. The 
> adders are just a handful of gates and implement carry as a 1 bit time 
> delay feeding back into the adder. Rotates require a 32 bit time delay 
> and up to 64 bits of storage and 10 selector gates for the variable 
> rotate. To compensate for the rotate delays all the other terms carried 
> through the stage also need to be delayed. Each stage will contain 
> about 320 bits of fifo.
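
As a rough illustration of the bit-serial adder described above (carry held in a 1-bit delay and fed back into the adder), here is a behavioral sketch in Python. It is a software model, not HDL; the function name serial_add and the LSB-first bit order are assumptions made for this example, not something taken from either design.

```python
def serial_add(a, b, width=32):
    """Add two words one bit per clock, LSB first, modulo 2**width."""
    carry = 0      # models the 1-bit delay element (a single flip-flop)
    result = 0
    for t in range(width):                 # one loop pass = one bit time
        abit = (a >> t) & 1
        bbit = (b >> t) & 1
        sum_bit = abit ^ bbit ^ carry      # full-adder sum output
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # carry for next bit time
        result |= sum_bit << t
    return result                          # final carry is dropped, as in RC5's mod 2**32 adds


if __name__ == "__main__":
    print(hex(serial_add(0xFFFFFFFF, 1)))  # wraps around to 0x0
```

A 32-bit add takes 32 bit times this way, which is why the serial engine trades clocks per key for a much smaller adder.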

I went back and did a pencil-and-paper serial update of my original
folded 32-bit-wide FPGA RC5 engine design. It comes out pretty close to
yours, maybe a bit lighter. The serial design didn't really make much of
a difference in total performance per device, maybe another

The serial solution would then be (52 + 38 + 15)/2 = 52.5, or about 52 LUTs
per RC5 engine. In addition, the design would have to be wrapped with a
controlling state machine and initialization storage per FPGA.

Each engine would check a solution every 1248 clocks. An XC2VP70 contains
just over 65K LUTs, so it should hold about 1,270 RC5 engines and check a
solution roughly every clock at over 150 MHz, or roughly 153 MKeys/sec.

Currently d.net is solving 114,719 MKeys/sec (about 114.7 GKeys/sec). So it
would take 750 or so XC2VP70 FPGAs to match d.net's current performance.
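
The per-device arithmetic above can be checked with a short script. This is only a sketch built from the figures in the post; the constant and function names are made up for illustration, and the d.net rate is read as roughly 114.7 billion keys/sec, which is the reading consistent with the 750-device estimate.

```python
import math

# Figures from the post; names are illustrative only.
S_LUTS, KEYGEN_LUTS, ENCRYPT_LUTS = 52, 38, 15
LUTS_PER_ENGINE = (S_LUTS + KEYGEN_LUTS + ENCRYPT_LUTS) // 2  # 105/2, rounded down
CLOCKS_PER_KEY = 1248          # clocks for one serial engine to test one key
ENGINES_PER_FPGA = 1270        # the post's estimate for an XC2VP70
CLOCK_HZ = 150e6               # the post's "over 150 MHz", taken as exactly 150 MHz
DNET_KEYS_PER_SEC = 114.719e9  # assumption: d.net rate read as ~114.7 GKeys/sec


def device_keys_per_sec(engines, clock_hz, clocks_per_key=CLOCKS_PER_KEY):
    """Aggregate key rate of many serial engines running in parallel."""
    return engines * clock_hz / clocks_per_key


fpga_rate = device_keys_per_sec(ENGINES_PER_FPGA, CLOCK_HZ)
fpgas_needed = math.ceil(DNET_KEYS_PER_SEC / fpga_rate)

if __name__ == "__main__":
    print(f"{LUTS_PER_ENGINE} LUTs per engine")
    print(f"{fpga_rate / 1e6:.1f} MKeys/sec per FPGA")
    print(f"{fpgas_needed} FPGAs to match d.net")
```

With 1,270 engines each finishing a key every 1248 clocks, the device as a whole completes slightly more than one key per clock, which is where the "roughly every clock" figure comes from.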

I have 33M LUTs going into my design, which should net about 635K engines
at an average speed of about 120 MHz, for about 61,057,155,000 Keys/sec
while drawing about 17.5 kW. That comes to about $750/mo if I let it run
very long, which is highly unlikely, as the ROI is nearly zero.
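
For a sense of scale, the full-system numbers can be worked through the same way. This is a hedged back-of-the-envelope sketch: it assumes the $750/mo figure is the electricity bill, a 30-day month of continuous running, and the 52 LUTs per engine from earlier in the post. The variable names are illustrative.

```python
# Figures from the post; names are illustrative only.
TOTAL_LUTS = 33_000_000
LUTS_PER_ENGINE = 52
CLOCKS_PER_KEY = 1248
ENGINES = 635_000             # the post's rounded engine count
CLOCK_HZ = 120e6
POWER_KW = 17.5
MONTHLY_COST_USD = 750.0      # assumption: this is the power bill
HOURS_PER_MONTH = 24 * 30     # assumption: 30-day month, running nonstop

engines_from_luts = TOTAL_LUTS // LUTS_PER_ENGINE      # unrounded engine count
keys_per_sec = ENGINES * CLOCK_HZ / CLOCKS_PER_KEY     # aggregate key rate
energy_kwh = POWER_KW * HOURS_PER_MONTH                # energy used per month
implied_usd_per_kwh = MONTHLY_COST_USD / energy_kwh    # electricity price implied

if __name__ == "__main__":
    print(f"{engines_from_luts:,} engines from {TOTAL_LUTS:,} LUTs")
    print(f"{keys_per_sec / 1e9:.2f} GKeys/sec")
    print(f"{energy_kwh:,.0f} kWh/month, ${implied_usd_per_kwh:.3f}/kWh implied")
```

Under those assumptions the system lands at about 61 GKeys/sec, a bit over half of d.net's rate, at an implied price of roughly 6 cents per kWh.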

> The delay between rounds for the S values would need to be 1664 bits 
> (64*26). So the two delays totaling 3328 bits can be replaced by 
> regenerating 3 rounds as shown above using 320 bits totaling only 960 
> bits.

Every way I looked at regen with the folded design, it comes out more
expensive in several ways.

> The actual encryption adds about 160 fifo bits per stage. The S0 
> constants can be generated inline with a 1 bit adder per stage which is 
> probably cheaper than trying to clock the constants in from a 32 bit 
> register. The whole shebang will be around 54K fifo bits plus a few 
> logic gates. Clock rates of over 4 gig could probably be expected 
> (assuming the heat can be dissipated).

I ended up with 52 LUTs for the S vector, 38 LUTs for the keygen, and
15 LUTs for the encryption after factoring duplicate logic into the
controller, which is only needed once per chip.

