[Hardware] partially unrolled?

John L. Bass jbass at dmsd.com
Tue Nov 28 12:13:53 EST 2006


	The key is incremented starting with the most significant byte. This  
	allows some optimization in the first few stages for the software  
	cores. The hardware cores will save power and may save a few gates if  
	you can load the pre-computed constants.

While not in the core I posted, that is something I did in testing
a couple of years ago. It actually allows complete removal of the first
few stages, saving a few hundred LUTs. Since the next size up of FPGA
needed to fit the core has a few thousand extra LUTs, the savings
aren't particularly important.
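
A minimal sketch of the incrementing order described in the quoted text,
assuming the key is held as a list of bytes with index 0 as the most
significant byte. The function name and byte layout are hypothetical and
are not taken from any posted core.

```python
def increment_msb_first(key):
    """Return the next key, incrementing the most significant byte first.

    key is a list of ints in 0..255; index 0 is assumed to be the most
    significant byte. The carry ripples toward the less significant bytes.
    """
    nxt = list(key)
    for i in range(len(nxt)):
        nxt[i] = (nxt[i] + 1) & 0xFF
        if nxt[i] != 0:  # no wrap, so no carry into the next byte
            break
    return nxt
```

For example, increment_msb_first([0xFF, 0x00, 0x07]) gives [0x00, 0x01, 0x07].
The bytes at the far end change only rarely, so anything that depends on
them alone can be computed once and reused across many keys.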

Since round zero produces a constant SBox, there are some serious
optimizations available for all of round 1, with shortcuts from all the
zeros in the SBox constants. Unfortunately, it takes some serious
computation to take advantage of that, in the form of stage-specific
transfer functions and a layout that removes the unnecessary logic. The
total gain from this is at most 10-12%, and probably not worth much.

	Work assignments were originally handed out in blocks of 2^28 keys.  
	This may have been increased to 2^32 keys. Hardware cores will want  
	much larger blocks or you will spend all of your time restarting the  
	core.

At Guerric's 220MKey/sec it takes about 20 seconds to burn through
2^32 keys. I was already assuming we would need blocks of around
2^48 keys.
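
The block-size arithmetic can be checked with a few lines of Python. This
is only a sketch: the 220 MKey/sec rate is the figure quoted above, and the
function name is made up for illustration.

```python
RATE_KEYS_PER_SEC = 220e6  # the 220 MKey/sec figure quoted above


def seconds_per_block(log2_keys, rate=RATE_KEYS_PER_SEC):
    """Seconds needed to search a block of 2**log2_keys keys at the given rate."""
    return 2 ** log2_keys / rate


if __name__ == "__main__":
    for bits in (28, 32, 48):
        secs = seconds_per_block(bits)
        print(f"2^{bits}: {secs:,.1f} s ({secs / 86400:.2f} days)")
```

At that rate a 2^48 block keeps one core busy for roughly two weeks, while
a 2^28 block is gone in a little over a second.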

But that isn't strictly needed, as the low bits could be pipelined
and automatically advance when the counter rolls over. It does mean
that you will want to start the FPGA after getting a few hundred
work blocks ... or a few thousand if you are going to be disconnected
from the net for a while.
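
A toy software model of that idea, assuming the work blocks are queued up
before the core starts and the counter rollover pulls the next one. The
class name, method names and bit widths are hypothetical. Real hardware
would do this with a FIFO and a carry-out signal, not Python.

```python
from collections import deque


class BlockFeeder:
    """Toy model: a counter runs within a block; on rollover, take the next block."""

    def __init__(self, blocks, low_bits):
        self.pending = deque(blocks)  # block numbers queued before starting
        self.low_bits = low_bits      # width of the per-block counter
        self.low = 0
        self.current = self.pending.popleft() if self.pending else None

    def next_key(self):
        """Return the next key as an int, or None once every block is used up."""
        if self.current is None:
            return None
        key = (self.current << self.low_bits) | self.low
        self.low = (self.low + 1) & ((1 << self.low_bits) - 1)
        if self.low == 0:  # counter rolled over: advance to the next queued block
            self.current = self.pending.popleft() if self.pending else None
        return key
```

With BlockFeeder([5, 9], low_bits=2), successive next_key() calls return
20, 21, 22, 23, then 36, 37, 38, 39, then None. At the quoted rate and
2^32-key blocks, the queue drains on the order of a couple hundred blocks
per hour, which is where the "few hundred" comes from.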

John

