[Hardware] The market of ASICs (One GigaKey / Second?)

jbass at dmsd.com
Sat Aug 7 20:26:33 EDT 2004

"Dan Oetting" <dan_oetting at uswest.net> writes:
> All the unrolling and optimizations are simply mechanical operations 
> that tend to obfuscate. It sounded like the original poster wanted to 
> know what the algorithm was so he could wrap his head around it. The 
> optimized source code apparently only confused him.

I assumed he had also looked at the RC5 spec and was hoping the
optimized code would be a better starting point.
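For anyone who wants the unoptimized algorithm next to the spec, here is a
minimal Python sketch of the RC5-32 key schedule (32-bit words, 12 rounds,
so 26 S terms). It is a plain reference model, not anyone's pipeline, and
the function names rotl32 and rc5_key_schedule are my own.

```python
MASK32 = 0xFFFFFFFF
P32 = 0xB7E15163  # RC5 magic constants for w = 32
Q32 = 0x9E3779B9


def rotl32(x, n):
    """Rotate a 32-bit word left by n; only the low 5 bits of n matter."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def rc5_key_schedule(key, rounds=12):
    """Expand a key (bytes) into the table S of 2 * (rounds + 1) words."""
    t = 2 * (rounds + 1)  # number of S terms: 26 for 12 rounds
    # Load key bytes little-endian into 32-bit L words (zero padded).
    padded = key + b"\x00" * (-len(key) % 4)
    L = [int.from_bytes(padded[i:i + 4], "little")
         for i in range(0, len(padded), 4)] or [0]
    c = len(L)
    # Initialize S from the magic constants: S[i] = P + i * Q mod 2**32.
    S = [(P32 + i * Q32) & MASK32 for i in range(t)]
    # Mix the key into S: 3 * max(t, c) steps, each one "stage" in hardware.
    A = B = i = j = 0
    for _ in range(3 * max(t, c)):
        A = S[i] = rotl32(S[i] + A + B, 3)       # fixed rotate by 3
        B = L[j] = rotl32(L[j] + A + B, A + B)   # data-dependent rotate
        i = (i + 1) % t
        j = (j + 1) % c
    return S


S = rc5_key_schedule(bytes(8))  # 8-byte (64-bit) all-zero key
print(len(S), hex(S[0]))
```

For an 8-byte key c = 2, so the mixing loop runs 3 x 26 = 78 times; each
pass through the loop body is what the hardware designs discussed here lay
out as one pipeline stage.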

> But have you got the optimum design considering time, space, power and 
> maximum clock rate? The FIFOs for the S terms could be eliminated by 
> recomputing the previous round in parallel with the current round. 
> Instead of two 26 stage FIFOs you would add 12 adders and 3 barrel 
> rolls.

FIFOs seem much cheaper than duplicate adders and barrel shifters.

OK, clue me in here. I come up with a LOT more. How do you plan to
recompute the S and L terms for 52 stages with only 12 adders
and 3 barrel rolls? I figured at minimum 3 adders (I = A+B, S+I, L+I) and
two barrel rolls (S and L) per stage to recompute the stage 1 S and L
terms. Recomputing the stage 2 S and L terms is much more interesting.
Plus regeneration of the key values.
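To make the per-stage hardware concrete, here is a rough Python model of one
key-mixing step and of the FIFO approach, assuming RC5-32/12 (26 S terms,
78 mixing steps) and a key of at most 26 words. The names mix_stage and
schedule_with_fifo are hypothetical. It shows that the S term a step writes
is read back exactly 26 steps later, which is what a 26-deep delay line per
S term provides instead of recomputing the previous round.

```python
from collections import deque

MASK32 = 0xFFFFFFFF
P32 = 0xB7E15163
Q32 = 0x9E3779B9
T = 26  # S terms for RC5-32/12


def rotl32(x, n):
    """Rotate a 32-bit word left by n; only the low 5 bits of n matter."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def mix_stage(s, l, a, b):
    """One mixing step as a datapath, returning (new_a, new_b).

    new_a = ROTL(s + a + b, 3): add, then a fixed rotate (just wiring).
    new_b = ROTL(l + new_a + b, new_a + b): add, then a barrel rotate.
    The L update uses the *new* A, so A + B is formed again here.
    """
    new_a = rotl32(s + a + b, 3)
    i2 = (new_a + b) & MASK32
    new_b = rotl32(l + i2, i2)
    return new_a, new_b


def schedule_with_fifo(l_words):
    """Run 3 * T mixing steps for one key, passing S through a FIFO.

    Step k pops the S term it needs and pushes the updated term; that term
    comes back out of the FIFO T steps later, when the next round needs it.
    """
    assert 1 <= len(l_words) <= T, "sketch assumes at most T key words"
    fifo = deque((P32 + i * Q32) & MASK32 for i in range(T))
    l = list(l_words)
    a = b = 0
    for k in range(3 * T):
        s = fifo.popleft()
        j = k % len(l)
        a, b = mix_stage(s, l[j], a, b)
        l[j] = b
        fifo.append(a)
    return list(fifo)  # 3 * T is a multiple of T, so S[0] is at the front


print([hex(w) for w in schedule_with_fifo([0x01234567, 0x89ABCDEF])[:3]])
```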

Maybe a quick ASCII block chart for a round 2 and a round 3 stage?

I was playing with this some time back on Xilinx FPGAs. The FIFOs
were relatively cheap: two 16-bit LUT shift registers cascaded
per bit give a short-latency delay line ... so 64 LUTs per S term, a fraction
of the LUTs for a 32-bit barrel shifter, and about the same as
a three-term 32-bit adder. So, at least on FPGAs, shift-register FIFOs are cheap.
In VLSI a 26-bit serial shift register is dead cheap.
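The LUT arithmetic above can be checked with a few lines of Python. This is
a back-of-the-envelope sketch under my own assumptions, not a synthesis
result: 4-input LUTs, SRL16-style 16-deep shift-register LUTs, one LUT per
2:1 mux in a log-shifter barrel rotate (no wide-mux tricks), and one LUT per
bit per two-input add on the carry chain. The function names are hypothetical.

```python
import math


def srl_fifo_luts(depth, width=32, srl_depth=16):
    """LUTs for a width-bit, depth-deep delay line of cascaded shift-register LUTs."""
    return width * math.ceil(depth / srl_depth)


def barrel_rotate_muxes(width=32):
    """2:1 muxes in a log-shifter rotate: one per bit per stage, log2(width) stages."""
    stages = width.bit_length() - 1  # log2 for a power-of-two width
    return width * stages


def adder_luts(width=32, operands=3):
    """LUTs for a multi-operand adder built as a chain of two-input adders."""
    return width * (operands - 1)


print(srl_fifo_luts(26))      # 26-deep S-term FIFO
print(barrel_rotate_muxes())  # 32-bit barrel rotate
print(adder_luts())           # three-term 32-bit adder
```

Under these assumptions the counts come out to 64, 160 and 64, which lines up
with the comparison above; real mappings (wide-mux primitives, carry logic)
will shift the absolute numbers.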

> Another alternative, instead of 32 bits wide go 1 bit serial. The 
> complexity of the fast adder with carry lookahead almost vanishes. The 
> barrel rolls are not much more than a 32 bit fifo. And while the barrel 
> rolls delay the processing of successive stages, another key can be 
> processed on the same circuitry in the gaps, so the entire pipeline 
> will process 1 key every 32 clocks. With the vastly simpler logic much 
> higher clock speeds can be obtained. And more parallel pipes can be 
> built in the same space.

Hmm ... I'd like to see a rough prototype algorithm for that design.
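As a starting point, here is a rough Python model of the two bit-serial
pieces the quoted design relies on, assuming 32-bit words streamed LSB first
at one bit per clock. It is only a sketch of the building blocks, not Dan's
design, and all names are hypothetical. The adder needs just one full adder
and a carry flip-flop; the variable rotate is a 32-bit buffer whose output
bit i is input bit (i - r) mod 32, so output bit 0 may not arrive until up to
31 clocks later. That latency is the gap where another key could be interleaved.

```python
MASK32 = 0xFFFFFFFF
WIDTH = 32


def to_bits(x, width=WIDTH):
    """Serialize a word LSB first, one bit per clock."""
    return [(x >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Reassemble an LSB-first bit stream into a word."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits, b_bits):
    """Bit-serial adder: one full adder plus a single carry flip-flop.

    Consumes one bit of each operand per clock; the final carry is dropped,
    so the result is the sum mod 2**WIDTH.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out


def serial_rotl(bits, r):
    """Rotate left by r using a WIDTH-bit buffer (a 32-bit FIFO in hardware).

    Output bit i is input bit (i - r) mod WIDTH, so the first output bit can
    lag the first input bit by up to WIDTH - 1 clocks.
    """
    buf = list(bits)  # filled over WIDTH clocks in hardware
    r %= WIDTH
    return [buf[(i - r) % WIDTH] for i in range(WIDTH)]


total = from_bits(serial_add(to_bits(0x9E3779B9), to_bits(0xB7E15163)))
rotated = from_bits(serial_rotl(to_bits(total), total))  # rotate by low 5 bits
print(hex(total), hex(rotated))
```

Because the stream is LSB first, the low 5 bits of A+B (the rotate amount)
are the first bits the serial adder produces.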

