[Hardware] The market of ASICs (One GigaKey / Second?)

Dan Oetting dan_oetting at uswest.net
Sat Aug 7 20:24:07 EDT 2004

On Aug 7, 2004, at 3:24 PM, jbass at dmsd.com wrote:

> "Dan Oetting" <dan_oetting at uswest.net> writes:
>> What type of analysis are you looking for? One of the first things 
>> I did
>> when starting to work on optimizing the cores was build a simple
>> reference core to understand what was going on. Here is my RC5-64
>> reference core. The only major change for RC5-72 is the key and L[ ]
>> need to be 3 words (big enough to hold 72 bits) instead of 2 and the
>> index j in the inner loop needs to cycle 0,1,2,0,1,2... This should
>> give you an idea of what the hardware needs to do.
> [ ... sample code deleted ...]
> That form, while it's a direct expression of the algorithm, is
> horrible from both a software and hardware performance point of
> view.

The reference is exactly that, a direct expression of the algorithm. I 
used this reference extensively when designing my own optimized cores. 
Prints can be inserted to generate a table of all the intermediate 
results for a given input which can be invaluable while single stepping 
through the optimized assembly core looking for that elusive typo. The 
reference can be used to generate the ciphertext for any key and 
plaintext, and built into an automated test generator for validating the 
optimized cores. And a modified version of the reference can be used to 
search for adjacent partial results to test a very critical but often 
overlooked piece of the logic in parallel cores.
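To make this concrete, here is a minimal sketch of such a reference core in C, with the RC5-72 changes described above applied (3-word key/L[], index j cycling 0,1,2). This is my own illustrative code, not the core from the original post; the function name and interface are assumptions.

```c
#include <stdint.h>

/* Rotate left on 32 bits; shift count taken mod 32, as RC5 requires. */
#define ROTL(x, s) (((x) << ((s) & 31)) | ((x) >> ((32 - ((s) & 31)) & 31)))

enum { ROUNDS = 12, T = 2 * (ROUNDS + 1) };  /* T = 26 table entries */

/* One trial encryption of a 64-bit block under a 72-bit key.
   L_in[0..2] holds the key (the top word carries only the low 8 bits),
   matching the 3-word L[] described above.  A sketch of a reference
   core, not distributed.net's actual code. */
void rc5_72_encrypt(const uint32_t L_in[3],
                    uint32_t pt_lo, uint32_t pt_hi,
                    uint32_t *ct_lo, uint32_t *ct_hi)
{
    uint32_t S[T], L[3] = { L_in[0], L_in[1], L_in[2] };
    uint32_t A = 0, B = 0;
    int i, j, k;

    /* Fill S with the magic constants P and Q. */
    S[0] = 0xB7E15163u;
    for (i = 1; i < T; i++)
        S[i] = S[i - 1] + 0x9E3779B9u;

    /* Key mixing: 3*T steps; i cycles over S, j cycles 0,1,2,0,1,2... */
    for (k = i = j = 0; k < 3 * T; k++) {
        A = S[i] = ROTL(S[i] + A + B, 3);
        B = L[j] = ROTL(L[j] + A + B, A + B);
        i = (i + 1) % T;
        j = (j + 1) % 3;
    }

    /* Encryption proper: 12 rounds over the two block halves. */
    A = pt_lo + S[0];
    B = pt_hi + S[1];
    for (i = 1; i <= ROUNDS; i++) {
        A = ROTL(A ^ B, B) + S[2 * i];
        B = ROTL(B ^ A, A) + S[2 * i + 1];
    }
    *ct_lo = A;
    *ct_hi = B;
}
```

A print of A, B, S[i] and L[j] inside the mixing loop gives exactly the table of intermediate results mentioned above for single-stepping an optimized core.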

All the unrolling and optimizations are simply mechanical operations 
that tend to obfuscate. It sounded like the original poster wanted to 
know what the algorithm was so he could wrap his head around it. The 
optimized source code apparently only confused him.

> The double reverse bytes should be factored out, leaving a single
> case where L is initialized in the iteration loop.

The double byte reversal is a historical blunder that never should have 
been there in the first place.

> It's hard to visualize the dependencies in this form, so you pretty
> much have to unroll all the loops and create new intermediate terms
> for A and B everywhere to create a pipeline.
> Even at that, the S dependencies between rounds seem critically
> limiting with this very simple algorithm, but replace the remaining
> S terms with two sets of 26 32-bit FIFOs (832 bits wide in total)
> that are 26 deep. Presto, the algorithm becomes a single long
> pipeline, one clock per term.

But have you got the optimum design considering time, space, power, and 
maximum clock rate? The FIFOs for the S terms could be eliminated by 
recomputing the previous round in parallel with the current round: 
instead of two 26-stage FIFOs you would add 12 adders and 3 barrel 
rolls.

Another alternative: instead of a 32-bit-wide datapath, go 1-bit 
serial. The complexity of the fast adder with carry lookahead almost 
vanishes. The barrel rolls are not much more than a 32-bit FIFO. And 
while the barrel rolls delay the processing of successive stages, 
another key can be processed on the same circuitry in the gaps, so the 
entire pipeline will process 1 key every 32 clocks. With the vastly 
simpler logic, much higher clock speeds can be obtained. And more 
parallel pipes can be built in the same space.
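To make the bit-serial point concrete, here is a small software model (my own sketch, not the poster's design) of a 1-bit serial adder: a single full adder plus a carry flip-flop consumes a 32-bit word LSB-first over 32 clocks, with no carry-lookahead logic at all.

```c
#include <stdint.h>

/* Software model of a bit-serial adder: one full adder processes a
   32-bit word in 32 clocks, one bit per clock, LSB first.  In hardware
   this is one 1-bit adder cell and a 1-bit carry register; the loop
   body below models one clock tick.  Illustrative sketch only. */
uint32_t serial_add(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    unsigned carry = 0;
    for (int clk = 0; clk < 32; clk++) {        /* one iteration = one clock */
        unsigned abit = (a >> clk) & 1;
        unsigned bbit = (b >> clk) & 1;
        sum |= (uint32_t)(abit ^ bbit ^ carry) << clk;   /* full-adder sum  */
        carry = (abit & bbit) | (carry & (abit ^ bbit)); /* full-adder carry */
    }
    return sum;                                 /* == a + b (mod 2^32) */
}
```

The carry ripples through time instead of through space, which is why the wide carry-lookahead structure disappears entirely.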
