[Hardware] The market of ASICs (One GigaKey / Second?)

Dan Oetting dan_oetting at uswest.net
Sun Aug 8 02:41:19 EDT 2004


On Aug 7, 2004, at 6:26 PM, jbass at dmsd.com wrote:

>> But have you got the optimum design considering time, space, power and
>> maximum clock rate? The FIFOs for the S terms could be eliminated by
>> recomputing the previous round in parallel with the current round.
>> Instead of two 26 stage FIFOs you would add 12 adders and 3 barrel
>> rolls.
>
> Fifo's seem much cheaper than duplicate adders and barrel shifters.
>
> Ok, clue me in here. I come up with a LOT more. How do you plan to
> recomputer the S and L terms for 52 stages with only 12 adders
> and 3 barrel rolls? I figured at min 3 adders (I=A+B, S+I, L+I) and
> two barrel rolls (S and L) per stage to recompute stage 1 S and L
> terms. Recomputing Stage 2 S and L terms is much more interesting.
> Plus regen of the key values.
>
> Maybe a quick ascii block chart for a round 2 and round 3 stage?

Here's the overview. S[] is passed to the right, A, B and L[] are 
passed down. Each Round is a vertical stack of 26 stages. Each stage is 
1 itteration consisting of 4 adds, 1 fixed rotate and 1 variable 
rotate.

	                       Key -> Round1
	                                |
	                                V
	          Key+26 -> Round1 -> Round2     P0/P1
	                      |         |          |
	                      V         V          V
	Key+52 -> Round1 -> Round2 -> Round3 -> Encrypt
                                                |
	                                           V
	                                         C0/C1



>
Each stage in round 2 needs 1 repeated Round 1 stage to provided the 
delayed S values. The Round3 stages need the delayed Round 2 stage 
which in turn needs a twice delayed Round1 stage. The cost for each 
stage of the 3 extra Rounds is 12 adds and 3 variable rotates (the 
fixed rotates are free in hardware). This is compared to the cost of 2 
FIFOs 26 stages long. If a barrel roll is equivalent to 5 FIFO stages 
and an Add is equivalent to 2 FIFO stages that would be a total 
equivalence of 39 verses 52. The actual costs may be higher in the 
parallel implementation.

> I was playing with it sometime back in Xilinx FPGA's. The fifo's
> were relatively cheap as two 16 bit LUT shift registers cascaded
> per bit with a short latency ... so 64 LUT's per S term, a fraction
> of the LUT's for a 32 bit barrel shifter, and about the same as
> a three term 32 bit adder. So at least with FPGA's shifters are cheap.
> With VLSI a 26 bit serial shifter is dead cheap.
>
>> Another alternative, instead of 32 bits wide go 1 bit serial. The
>> complexity of the fast adder with carry lookahead almost vanishes. The
>> barrel rolls are not much more than a 32 bit fifo. And while the 
>> barrel
>> rolls delay the processing of successive stages, another key can be
>> processed on the same circuitry in the gaps, so the entire pipeline
>> will process 1 key every 32 clocks. With the vastly simpler logic much
>> higher clock speeds can be obtained. And more parallel pipes can be
>> built in the same space.
>
> Hmm ... I'd like to see a rough prototype algorithm for that design.

The overall block diagram looks much like the parallel version. The 
details for each iteration stage will look considerably different. The 
adders are just a handful of gates and implement carry as a 1 bit time 
delay feeding back into the adder. Rotates require a 32 bit time delay 
and up to 64 bits of storage and 10 selector gates for the variable 
rotate. To compensate for the rotate delays all the other terms carried 
through the stage also need to be delayed. Each stage will contain 
about 320 bits of fifo.

The delay between rounds for the S values would need to be 1664 bits 
(64*26). So the two delays totaling 3228 bits can be replaced by 
regenerating 3 rounds as shown above using 320 bits totaling only 960 
bits.

The actual encryption add about 160 fifo bits per stage. The S0 
constants can be generated inline with a 1 bit adder per stage which is 
probably cheaper than trying to clock the constants in from a 32 bit 
register. The whole shebang will be around 54K fifo bits plus a few 
logic gates. Clock rates of over 4 gig could probably be expected 
(assuming the heat can be dissipated).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 4427 bytes
Desc: not available
Url : http://lists.distributed.net/pipermail/hardware/attachments/20040808/bc5b9317/attachment.bin


More information about the Hardware mailing list