[Hardware] The market of ASICs (One GigaKey / Second?)
Dan Oetting
dan_oetting at uswest.net
Sat Aug 7 20:24:07 EDT 2004
On Aug 7, 2004, at 3:24 PM, jbass at dmsd.com wrote:
> "Dan Oetting" <dan_oetting at uswest.net> writes:
>> What type of analysis are you looking for? One of the first thing I
>> did
>> when starting to work on optimizing the cores was build a simple
>> reference core to understand what was going on. Here is my RC5-64
>> reference core. The only major change for RC5-72 is the key and L[ ]
>> need to be 3 words (big enough to hold 72 bits) instead of 2 and the
>> index j in the inner loop needs to cycle 0,1,2,0,1,2... This should
>> give you an idea of what the hardware needs to do.
>
> [ ... sample code deleted ...]
>
> That form while it's a direct expression of the algorithm, is
> horrible from both a software and hardware performance point of
> view.
The reference is exactly that, a direct expression of the algorithm. I
used this reference extensively when designing my own optimized cores.
Prints can be inserted to generate a table of all the intermediate
results for a given input which can be invaluable while single stepping
through the optimized assembly core looking for that elusive typo. The
reference can be used to generate the cypher text for any key and plain
text and built into an automated test generator for validating the
optimized cores. And a modified version of the reference can be used to
search for adjacent partial results to test a very critical but often
overlooked piece of the logic in parallel cores.
All the unrolling and optimizations are simply mechanical operations
that tend to obfuscate. It sounded like the original poster wanted to
know what the algorithm was so he could wrap his head around it. The
optimized source code apparently only confused him.
> The double reverse bytes should be factored out, leaving a single
> case where L is initialized in the interation loop.
The double reverse bytes is a historical blunder that never should have
been there in the first place.
> It's hard to visualize the dependencies in this form, so you pretty
> much have to unroll all the loops and create new intermidate terms
> for A and B everywhere to create a pipeline.
>
> Even at that the S dependencies between rounds seem critically
> limiting,
> with this very simple algorithm, but replacing the remaining S terms
> with
> two sets of 26 32bit fifos (total of 832 bits wide) that are 26 deep.
> Presto the algorithm becomes a single long pipeline, one clock per
> term.
But have you got the optimum design considering time, space, power and
maximum clock rate? The FIFOs for the S terms could be eliminated by
recomputing the previous round in parallel with the current round.
Instead of two 26 stage FIFOs you would add 12 adders and 3 barrel
rolls.
Another alternative, instead of 32 bits wide go 1 bit serial. The
complexity of the fast adder with carry lookahead almost vanishes. The
barrel rolls are not much more than a 32 bit fifo. And while the barrel
rolls delay the processing of successive stages, another key can be
processed on the same circuitry in the gaps, so the entire pipeline
will process 1 key every 32 clocks. With the vastly simpler logic much
higher clock speeds can be obtained. And more parallel pipes can be
built in the same space.
More information about the Hardware
mailing list