[Hardware] The market of ASICs (One GigaKey / Second?)
Michael Meeuwisse
mickeymeeuw at hotmail.com
Sun Aug 8 19:52:20 EDT 2004
[cut]
>The inner L[j] assignment can be replaced with B, and corrected
>for switches at the outer loop with an assignment of B=L[j];
>With loop unrolling, most of the B + B terms, can be removed
>since that is equiv to 2B or B<<1, which can be eliminated with
>signal term selection.
Good point, but it never happens. B = B + L(j) + A, and L(j) can't be
replaced with B. Or I'm overlooking something (prove me wrong please :)
>Since round 0 only calculates constants, it should be factored
>with partial evaluation, replacing the P and Q terms with a vector
>of constants. Unrolling just round 1 then allows constant terms
>to replace the initial S terms.
Agreed, it's easily calculated in the following way, and can be hardwired
afterwards:
#include <stdio.h>
#include <stdlib.h>
unsigned long int S[26];
unsigned long int P=0xb7e15163, Q=0x9e3779b9;
int main () {
int i;
for (S[0]=P, i=1; i<26; i++) S[i] = S[i-1]+Q;
for (i=0;i<26;i++) printf("0x%08X,",S[i]);
return 0;
}
>This removes 25% of the memory
>writes in the loop. Combining the encryption loop with round 3
>removes one more set of S writes, removing another 25% of the
>original memory writes. This also provides locality for the
>intermediate terms, minimizing routing.
That's interesting, have to look into that. Although I have the feeling it
won't work since you can't calculate S(25) without first calculating S(24)
etc, and for the decrypting part you first need S(25), and all the way at
the end S(1) and S(0).
>The double reverse bytes should be factored out, leaving a single
>case where L is initialized in the interation loop.
L is initialized to null, reverse bytes can be hardwired. Not really
interesting anyways.
>It's hard to visualize the dependencies in this form, so you pretty
>much have to unroll all the loops and create new intermidate terms
>for A and B everywhere to create a pipeline.
Yup.
>Even at that the S dependencies between rounds seem critically limiting,
>with this very simple algorithm, but replacing the remaining S terms with
>two sets of 26 32bit fifos (total of 832 bits wide) that are 26 deep.
>Presto the algorithm becomes a single long pipeline, one clock per
>term.
I'm currently looking into that too, xilinx fpgas (I only look at spartan3
at the moment, forgive me if it's not in all fpgas) have several units of
block ram which can be dual ported 32 bits wide fifos with 512 addresses
(giving to fifos of 256 deep). 26 deep is not true for the point I mentioned
earlier, you have to buffer more than one round (actually 6*3*25 and 4*11
for S(0)) but since you've got space for 256 values, it shouldn't be an
issue. Note: I calculate the hash in 6 steps, the decrypt in 4. Hence the
rather large numbers.)
>The wide barrel shifters are the crunch, along with poor locality for
>term routing.
You mentioned later:
>a fraction
>of the LUT's for a 32 bit barrel shifter, and about the same as
>a three term 32 bit adder. So at least with FPGA's shifters are cheap.
>With VLSI a 26 bit serial shifter is dead cheap.
Can you give more details? This solution sounds good, but still a bit
blurry. I haven't worked out this part yet for my design.
>It's been a long time since I tried this, I might even still have the
>VHDL kicking around.
Find it! :)
>John
Thanks for some interesting suggestions, and excuse me if I'm somewhat rude
on certain points.
Michael
_________________________________________________________________
MSN Zoeken, voor duidelijke zoekresultaten! http://search.msn.nl
More information about the Hardware
mailing list