[Hardware] The market of ASICs (One GigaKey / Second?)
jbass at dmsd.com
jbass at dmsd.com
Sat Aug 7 17:24:30 EDT 2004
"Dan Oetting" <dan_oetting at uswest.net> writes:
> What type of analysis are you looking for? One of the first thing I did
> when starting to work on optimizing the cores was build a simple
> reference core to understand what was going on. Here is my RC5-64
> reference core. The only major change for RC5-72 is the key and L[ ]
> need to be 3 words (big enough to hold 72 bits) instead of 2 and the
> index j in the inner loop needs to cycle 0,1,2,0,1,2... This should
> give you an idea of what the hardware needs to do.
[ ... sample code deleted ...]
That form while it's a direct expression of the algorithm, is
horrible from both a software and hardware performance point of
view.
The inner L[j] assignment can be replaced with B, and corrected
for switches at the outer loop with an assignment of B=L[j];
With loop unrolling, most of the B + B terms, can be removed
since that is equiv to 2B or B<<1, which can be eliminated with
signal term selection.
Since round 0 only calculates constants, it should be factored
with partial evaluation, replacing the P and Q terms with a vector
of constants. Unrolling just round 1 then allows constant terms
to replace the initial S terms. This removes 25% of the memory
writes in the loop. Combining the encryption loop with round 3
removes one more set of S writes, removing another 25% of the
original memory writes. This also provides locality for the
intermediate terms, minimizing routing.
The double reverse bytes should be factored out, leaving a single
case where L is initialized in the interation loop.
It's hard to visualize the dependencies in this form, so you pretty
much have to unroll all the loops and create new intermidate terms
for A and B everywhere to create a pipeline.
Even at that the S dependencies between rounds seem critically limiting,
with this very simple algorithm, but replacing the remaining S terms with
two sets of 26 32bit fifos (total of 832 bits wide) that are 26 deep.
Presto the algorithm becomes a single long pipeline, one clock per
term.
The wide barrel shifters are the crunch, along with poor locality for
term routing.
It's been a long time since I tried this, I might even still have the
VHDL kicking around.
John
More information about the Hardware
mailing list