[Hardware] The market of ASICs (One GigaKey / Second?)

jbass at dmsd.com jbass at dmsd.com
Sat Aug 7 17:24:30 EDT 2004


"Dan Oetting" <dan_oetting at uswest.net> writes:
> What type of analysis are you looking for? One of the first thing I did 
> when starting to work on optimizing the cores was build a simple 
> reference core to understand what was going on. Here is my RC5-64 
> reference core. The only major change for RC5-72 is the key and L[ ] 
> need to be 3 words (big enough to hold 72 bits) instead of 2 and the 
> index j in the inner loop needs to cycle 0,1,2,0,1,2... This should 
> give you an idea of what the hardware needs to do.

[ ... sample code deleted ...]

That form while it's a direct expression of the algorithm, is
horrible from both a software and hardware performance point of
view.

The inner L[j] assignment can be replaced with B, and corrected
for switches at the outer loop with an assignment of B=L[j];
With loop unrolling, most of the B + B terms, can be removed
since that is equiv to 2B or B<<1, which can be eliminated with
signal term selection.

Since round 0 only calculates constants, it should be factored
with partial evaluation, replacing the P and Q terms with a vector
of constants. Unrolling just round 1 then allows constant terms
to replace the initial S terms. This removes 25% of the memory
writes in the loop. Combining the encryption loop with round 3
removes one more set of S writes, removing another 25% of the
original memory writes. This also provides locality for the
intermediate terms, minimizing routing.

The double reverse bytes should be factored out, leaving a single
case where L is initialized in the interation loop.

It's hard to visualize the dependencies in this form, so you pretty
much have to unroll all the loops and create new intermidate terms
for A and B everywhere to create a pipeline.

Even at that the S dependencies between rounds seem critically limiting,
with this very simple algorithm, but replacing the remaining S terms with
two sets of 26 32bit fifos (total of 832 bits wide) that are 26 deep.
Presto the algorithm becomes a single long pipeline, one clock per
term.

The wide barrel shifters are the crunch, along with poor locality for
term routing.

It's been a long time since I tried this, I might even still have the
VHDL kicking around.

John


More information about the Hardware mailing list