[Hardware] partially unrolled?
John L. Bass
jbass at dmsd.com
Tue Nov 28 12:13:53 EST 2006
The key is incremented starting with the most significant byte. This
allows some optimization in the first few stages for the software
cores. The hardware cores will save power and may save a few gates if
you can load the pre-computed constants.
While it is not in the core I posted, that is something I did in testing
a couple of years ago. It actually allows complete removal of the first
few stages, saving a few hundred LUTs. Since the next size up of FPGA
needed to get the core to fit has a few thousand extra LUTs, the savings
aren't particularly important.
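A rough software sketch of why this ordering helps, under assumptions of my own: the key is held as a byte array, key[0] is treated as the most significant byte, and the function name is made up for illustration. It is not taken from any posted core.

```python
def increment_key_msb_first(key: bytearray) -> bool:
    """Add 1 to the key in place, starting at key[0] (assumed here to be
    the most significant byte) and carrying toward the last byte.

    Returns True if the carry ran off the end, i.e. the whole key wrapped.
    """
    for i in range(len(key)):
        key[i] = (key[i] + 1) & 0xFF
        if key[i] != 0:
            return False  # no carry out of this byte, so we are done
    return True


# Hypothetical 9-byte key, all zeros to start.
key = bytearray(9)
for _ in range(256):
    increment_key_msb_first(key)

# After 256 steps key[0] is back at 0 and the carry has reached key[1].
# The bytes further along have not changed at all.
print(key.hex())
```

With this ordering the bytes at the far end of the array only change after 256^k increments, so anything that depends only on those bytes can be computed once per block and loaded as a constant.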
Since round zero produces a constant SBox, there are some serious
optimizations available for all of round 1, taking shortcuts from all
the zeros in the SBox constants. Unfortunately, it takes some serious
computation to take advantage of that, in the form of stage-specific
transfer functions and a layout that removes the unnecessary logic. The
total gain from this is under 10-12%, and probably not worth much.
Work assignments were originally handed out in blocks of 2^28 keys.
This may have been increased to 2^32 keys. Hardware cores will want
much larger blocks or you will spend all of your time restarting the
core.
At Guerric's 220 MKey/sec it takes about 20 seconds to burn through
2^32 keys. I was already assuming we would need blocks of around
2^48 keys.
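As a rough check on the block-size arithmetic, here is a small calculation of how long one block lasts at a given key rate. The function name and the constant are mine; the only inputs taken from this thread are the 220 MKey/sec figure and the three block sizes.

```python
def seconds_per_block(block_bits: int, keys_per_second: float) -> float:
    """Time to exhaust one work block of 2**block_bits keys."""
    return (2 ** block_bits) / keys_per_second


RATE = 220e6  # keys per second, the figure quoted for Guerric's core

for bits in (28, 32, 48):
    secs = seconds_per_block(bits, RATE)
    print(f"2^{bits} keys: {secs:,.1f} s ({secs / 86400:.2f} days)")
```

At that rate a 2^28 block lasts a little over a second, a 2^32 block roughly 19.5 seconds, and a 2^48 block just under 15 days.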
But that isn't strictly needed, as the low bits could be pipelined
and automatically advance when the counter rolls over. It does mean
that you will want to start the FPGA only after getting a few hundred
work blocks ... or a few thousand if you are going to be disconnected
from the net for a while.
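A minimal software model of that idea, as I read it: the host queues up block prefixes (the high bits), the core counts through the low bits, and on rollover it simply picks up the next queued prefix. All names here are hypothetical, and the queue-depth helper just divides the keys consumed while offline by the block size.

```python
import math
from collections import deque


def blocks_to_queue(offline_seconds: float, keys_per_second: float,
                    block_bits: int) -> int:
    """Number of 2**block_bits blocks needed to stay busy while offline."""
    keys_needed = offline_seconds * keys_per_second
    return math.ceil(keys_needed / 2 ** block_bits)


def run_queued_blocks(prefixes, low_bits: int):
    """Yield every key from the queued blocks, in order.

    Each prefix supplies the high bits. The low counter runs from 0 to
    2**low_bits - 1, and when it rolls over the next prefix is taken
    from the queue without restarting anything.
    """
    queue = deque(prefixes)
    while queue:
        prefix = queue.popleft()
        for low in range(1 << low_bits):
            yield (prefix << low_bits) | low


# Toy example: two queued blocks with a 2-bit low counter.
print(list(run_queued_blocks([5, 9], low_bits=2)))

# Blocks of 2^32 keys needed for 8 hours offline at 220 MKey/sec.
print(blocks_to_queue(8 * 3600, 220e6, 32))
```

For 8 hours offline at 220 MKey/sec this comes out to 1476 blocks of 2^32 keys, which is in line with the "few thousand" figure for a longer disconnect.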
John