[Hardware] The market of ASICs (One GigaKey / Second?)
jbass at dmsd.com
Tue Aug 10 05:00:02 EDT 2004
"Dan Oetting" <dan_oetting at uswest.net> writes:
> The overall block diagram looks much like the parallel version. The
> details for each iteration stage will look considerably different. The
> adders are just a handful of gates and implement carry as a 1 bit time
> delay feeding back into the adder. Rotates require a 32 bit time delay
> and up to 64 bits of storage and 10 selector gates for the variable
> rotate. To compensate for the rotate delays all the other terms carried
> through the stage also need to be delayed. Each stage will contain
> about 320 bits of fifo.
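For anyone following along who hasn't built bit-serial logic, the adder described above can be modelled in a few lines of Python. This is only a sketch of the idea, not a description of either design: the function name `serial_add` and the LSB-first bit order are assumptions made for illustration. The one piece of state is the carry, held for one bit time and fed back in on the next clock.

```python
def serial_add(a, b, width=32):
    """Add two words one bit per clock, LSB first, modulo 2**width."""
    carry = 0   # the single flip-flop: carry delayed by one bit time
    result = 0
    for i in range(width):              # one loop pass = one clock
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry               # sum bit out of the full adder
        carry = (x & y) | (carry & (x ^ y))  # carry fed back next clock
        result |= s << i
    return result  # final carry is dropped, matching addition mod 2**32


print(hex(serial_add(0x12345678, 0x9ABCDEF0)))
```

The cost is time rather than gates: a 32 bit add takes 32 clocks, which is why the other terms in a stage have to be delayed to stay aligned.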
I went back and did a pencil-and-paper serial design, updating my
original folded 32 bit wide FPGA RC5 engine design, and it looks pretty
close to yours, maybe a bit lighter. The serial design didn't really make
much of a difference in total performance per device, maybe another
The serial solution would then be (52 + 38 + 15)/2 = 52.5, or about 52 LUTs
per RC5 engine.
In addition the design would have to be wrapped with a controlling
state machine and initialization storage per FPGA.
Each engine would check a solution every 1248 clocks. An XC2VP70 contains
just over 65K LUTs, so it should hold about 1,270 RC5 engines, checking a
solution roughly every clock at a speed of over 150 MHz, or roughly
153 MKeys/sec.
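The per-chip arithmetic can be checked with a short script. This is a sketch using only the figures quoted above; the variable names are made up for illustration, and the clock is taken as exactly 150 MHz even though the text says "over".

```python
LUTS_PER_ENGINE = 52      # serial engine size from the estimate above
CLOCKS_PER_KEY = 1248     # clocks for one engine to check one key
ENGINES_PER_CHIP = 1270   # engines claimed to fit in one XC2VP70
CLOCK_HZ = 150e6          # assumption: exactly 150 MHz

luts_used = ENGINES_PER_CHIP * LUTS_PER_ENGINE      # LUT budget consumed
keys_per_clock = ENGINES_PER_CHIP / CLOCKS_PER_KEY  # a little over 1
chip_keys_per_sec = keys_per_clock * CLOCK_HZ

print(f"{luts_used} LUTs used")
print(f"{keys_per_clock:.3f} keys per clock")
print(f"{chip_keys_per_sec / 1e6:.1f} MKeys/sec per chip")
```

With about as many engines as clocks per key, the chip as a whole finishes roughly one key per clock, so the chip rate is close to the clock rate.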
Currently d.net is solving 114,719 MKeys/sec (about 114.7 GKeys/sec). So it
would take 750 or so XC2VP70 FPGAs to match d.net's current performance.
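A quick sanity check on the chip count, as a sketch. It reads the d.net figure as 114,719 MKeys/sec, which is an assumption, though it is the reading under which the count of 750 works out with 153 MKeys/sec per chip. Variable names are illustrative only.

```python
import math

DNET_KEYS_PER_SEC = 114_719e6   # assumption: 114,719 MKeys/sec
CHIP_KEYS_PER_SEC = 153e6       # per-XC2VP70 estimate from above

# Round up: a fractional chip still has to be a whole device.
chips_needed = math.ceil(DNET_KEYS_PER_SEC / CHIP_KEYS_PER_SEC)
print(chips_needed)
```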
I have 33M LUTs going into my design, which should net about 635K engines
at an average speed of about 120 MHz, for about 61,057,155,000 Keys/sec
while burning about 17.5 kW. That's $750/mo if I let it run very long,
which is highly unlikely, as the ROI is nearly zero.
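The same arithmetic scales to the full design. This sketch uses only the numbers above, except for the 30-day month, which is an assumption. The electricity rate is not stated anywhere here; it is backed out from the $750/mo figure. Names are illustrative.

```python
TOTAL_LUTS = 33e6          # LUTs in the whole design
LUTS_PER_ENGINE = 52
CLOCKS_PER_KEY = 1248
CLOCK_HZ = 120e6           # average clock across the design
POWER_KW = 17.5            # continuous draw
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month
MONTHLY_COST_USD = 750

engines = TOTAL_LUTS / LUTS_PER_ENGINE               # roughly 635K
keys_per_sec = engines / CLOCKS_PER_KEY * CLOCK_HZ   # roughly 61 GKeys/sec
kwh_per_month = POWER_KW * HOURS_PER_MONTH           # energy, not power
implied_usd_per_kwh = MONTHLY_COST_USD / kwh_per_month

print(f"{engines:,.0f} engines")
print(f"{keys_per_sec / 1e9:.2f} GKeys/sec")
print(f"{kwh_per_month:,.0f} kWh/month, implying ${implied_usd_per_kwh:.3f}/kWh")
```

The small gap between this result and the Keys/sec figure quoted above comes from rounding the engine count before multiplying.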
> The delay between rounds for the S values would need to be 1664 bits
> (64*26). So the two delays totaling 3228 bits can be replaced by
> regenerating 3 rounds as shown above using 320 bits totaling only 960
Every way I looked at regen with the folded design, it comes out more
expensive in several ways.
> The actual encryption add about 160 fifo bits per stage. The S0
> constants can be generated inline with a 1 bit adder per stage which is
> probably cheaper than trying to clock the constants in from a 32 bit
> register. The whole shebang will be around 54K fifo bits plus a few
> logic gates. Clock rates of over 4 gig could probably be expected
> (assuming the heat can be dissipated).
I ended up with 52 LUTs for the S vector, 38 LUTs for the keygen, and
15 LUTs for the encryption after factoring duplicate logic into the
controller, which is only needed once per chip.