[Hardware] The market of ASICs (One GigaKey / Second?)
jbass at dmsd.com
Tue Aug 10 05:00:02 EDT 2004
"Dan Oetting" <dan_oetting at uswest.net> writes:
> The overall block diagram looks much like the parallel version. The
> details for each iteration stage will look considerably different. The
> adders are just a handful of gates and implement carry as a 1 bit time
> delay feeding back into the adder. Rotates require a 32 bit time delay
> and up to 64 bits of storage and 10 selector gates for the variable
> rotate. To compensate for the rotate delays all the other terms carried
> through the stage also need to be delayed. Each stage will contain
> about 320 bits of fifo.
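For anyone following along who hasn't built bit-serial arithmetic, the
adder described above (one full adder plus a carry delayed by one bit
time) can be modelled in a few lines of Python. This is only a sketch:
the function name is made up for illustration and it models the
behaviour, not the gate count of either design.

```python
MASK32 = 0xFFFFFFFF

def serial_add(a: int, b: int, width: int = 32) -> int:
    """Add two words one bit per clock, LSB first, modulo 2**width."""
    carry = 0    # the single flip-flop holding the 1-bit-delayed carry
    result = 0
    for t in range(width):                 # one iteration per bit time
        abit = (a >> t) & 1
        bbit = (b >> t) & 1
        sum_bit = abit ^ bbit ^ carry      # full-adder sum output
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # fed back next clock
        result |= sum_bit << t
    return result                          # carry out of the top bit is dropped

if __name__ == "__main__":
    x, y = 0x12345678, 0x9ABCDEF0
    print(hex(serial_add(x, y)), hex((x + y) & MASK32))
```

The word takes 32 clocks to pass through, which is why the serial engine
trades logic for time.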
I went back and did a pencil-and-paper serial update of my original
folded 32-bit-wide FPGA RC5 engine design. It comes out pretty close to
yours, maybe a bit lighter. Going serial didn't really make much of a
difference in total performance per device, maybe another 10%.
The serial solution would then be (52 + 38 + 15)/2 = 52.5, call it 52,
LUTs per RC5 engine. In addition, the design would have to be wrapped
with a controlling state machine and initialization storage, once per
FPGA.
Each engine would check a solution every 1248 clocks. An XC2VP70 contains
just over 65K LUTs, so it should hold about 1,270 RC5 engines and check a
solution roughly every clock at a speed of over 150 MHz, or roughly
153 MKeys/sec.
Currently d.net is solving 114,719 MKeys/sec (about 115 GKeys/sec), so it
would take 750 or so XC2VP70 FPGAs to match d.net's current performance.
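The per-device arithmetic can be checked with a short script. It is a
rough sketch using only the figures in this post; the variable names
are mine, and the d.net rate is taken as about 114.7e9 keys/sec, which
is the reading that agrees with the 750-device estimate.

```python
import math

def device_rate(engines: int, clocks_per_key: int, clock_hz: float) -> float:
    """Keys per second for one device: each engine finishes one key
    every clocks_per_key clocks."""
    return engines * clock_hz / clocks_per_key

# LUT budget per serial engine: S vector + keygen + encryption, halved.
LUTS_PER_ENGINE = (52 + 38 + 15) // 2

XC2VP70_ENGINES = 1270      # engine count quoted in the post
CLOCKS_PER_KEY = 1248
CLOCK_HZ = 150e6

rate = device_rate(XC2VP70_ENGINES, CLOCKS_PER_KEY, CLOCK_HZ)

DNET_RATE = 114.719e9       # assumed reading of the d.net figure, keys/sec
devices = math.ceil(DNET_RATE / rate)

if __name__ == "__main__":
    print(f"LUTs per engine : {LUTS_PER_ENGINE}")
    print(f"keys/sec/device : {rate / 1e6:.1f} M")
    print(f"devices needed  : {devices}")
```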
I have 33M LUTs going into my design, which should net about 635K engines
at an average speed of about 120 MHz, for about 61,057,155,000 keys/sec
while drawing about 17.5 kW. That's about $750/mo if I let it run very
long, which is highly unlikely, as the ROI is nearly zero.
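The same kind of back-of-envelope check works for the large design.
This sketch makes two assumptions that are not stated above: that the
$750/mo is the electricity bill for the 17.5 kW draw, and that a month
is 30 days of continuous running. The names are made up for
illustration.

```python
TOTAL_LUTS = 33_000_000
LUTS_PER_ENGINE = 52
CLOCKS_PER_KEY = 1248
CLOCK_HZ = 120e6            # average clock from the post

engines = TOTAL_LUTS // LUTS_PER_ENGINE
rate = engines * CLOCK_HZ / CLOCKS_PER_KEY       # keys per second

POWER_KW = 17.5
HOURS_PER_MONTH = 24 * 30   # assumption: 30-day month, always on
MONTHLY_COST_USD = 750.0    # assumption: this is the power bill

kwh_per_month = POWER_KW * HOURS_PER_MONTH
implied_usd_per_kwh = MONTHLY_COST_USD / kwh_per_month
keys_per_joule = rate / (POWER_KW * 1000.0)

if __name__ == "__main__":
    print(f"engines          : {engines:,}")
    print(f"keys/sec         : {rate / 1e9:.1f} G")
    print(f"kWh per month    : {kwh_per_month:,.0f}")
    print(f"implied $/kWh    : {implied_usd_per_kwh:.4f}")
    print(f"keys per joule   : {keys_per_joule:,.0f}")
```

Under those assumptions the implied power price is a little under six
cents per kWh.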
> The delay between rounds for the S values would need to be 1664 bits
> (64*26). So the two delays totaling 3328 bits can be replaced by
> regenerating 3 rounds as shown above using 320 bits totaling only 960
> bits.
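The storage comparison in that paragraph is easy to tabulate. A small
sketch, with names invented for illustration; reading the 64 as bit
times per S word is my assumption, while 26 is the number of S words
in RC5-32/12.

```python
BIT_TIMES_PER_WORD = 64     # assumption: 64 bit times per S word slot
S_WORDS = 26                # RC5-32/12 expanded key table: 2 * (12 + 1)
FIFO_BITS_PER_STAGE = 320   # per-stage FIFO figure from the quoted text

delay_between_rounds = BIT_TIMES_PER_WORD * S_WORDS   # bits per delay line
two_delays = 2 * delay_between_rounds                 # store S between rounds
regen_three_rounds = 3 * FIFO_BITS_PER_STAGE          # regenerate instead
saving = two_delays - regen_three_rounds

if __name__ == "__main__":
    print(f"delay between rounds : {delay_between_rounds} bits")
    print(f"two delay lines      : {two_delays} bits")
    print(f"regenerating 3 rounds: {regen_three_rounds} bits")
    print(f"saving               : {saving} bits")
```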
Every way I looked at regen with the folded design, it came out more
expensive in several ways.
> The actual encryption add about 160 fifo bits per stage. The S0
> constants can be generated inline with a 1 bit adder per stage which is
> probably cheaper than trying to clock the constants in from a 32 bit
> register. The whole shebang will be around 54K fifo bits plus a few
> logic gates. Clock rates of over 4 gig could probably be expected
> (assuming the heat can be dissipated).
I ended up with 52 LUTs for the S vector, 38 LUTs for the keygen, and
15 LUTs for the encryption after factoring duplicate logic into the
controller, which is only needed once per chip.
John