[Hardware] The market of ASICs (One GigaKey / Second?)

Elektron elektron_rc5 at yahoo.ca
Sun Aug 8 23:33:49 EDT 2004

>> This removes 25% of the memory
>> writes in the loop. Combining the encryption loop with round 3
>> removes one more set of S writes, removing another 25% of the
>> original memory writes. This also provides locality for the
>> intermediate terms, minimizing routing.
> That's interesting, have to look into that. Although I have the 
> feeling it won't work since you can't calculate S(25) without first 
> calculating S(24) etc, and for the decrypting part you first need 
> S(25), and all the way at the end S(1) and S(0).

Tweaking r72-ref.cpp, (imagine we have extra registers C,D)

Imagine loops are unrolled, since the % is incredibly wasteful. The 
B=L[0] line may be able to have a hard-coded value of A in some systems 
for a marginal speed increase (not PPC though). We don't decrypt. We 
encrypt the plaintext to check that it matches the ciphertext (most 
cores do this, and either way it's just as fast, except as mentioned, 
the last round is faster).

Shame that recompiling everything will take forever (can't I tell it 
that I want both the ansi cores and the ppc cores?). You can also 
eliminate the last three writes to L (when hard-coding the loop, 
anyway). All the counters are unnecessary, so this only needs 4 
registers per pipe (A,B,C,D), though you probably want more to store 
intermediate results. For speed, you can do D=Q and just use C+=D at 
the beginning (I think loading an entire word from the instruction 
should be slower, but that's just me).

This appears to eliminate half of the s-box writes, which is 1/4 of the 
memory access. It also increases parallelism, which is a good thing (C 
= ROTL(C^D,D)+A; takes at least 3 clock cycles due to data dependencies 
anyway, so it'd be faster even if the s-boxes were all in registers).

That said, memory access is not a major bottleneck (on a PowerPC), 
since you can load into a temporary register several instructions ahead 
with minimal penalty, so it may be faster to keep the S-boxes in RAM, 
preload them, and use the extra registers for another pipe (the 
bottleneck here is that the PPC can only load once per clock cycle).

The byte-reversed-add can be avoided by rewriting the client (which I 
strongly recommend), since the smallest blocksize is 2^32 anyway.

//round 0,1
A = S[0] = ROTL3(C);
B = L[0] = ROTL(rc5_72unitwork->L0.lo+A,A);
for (i = 1; i < 26; i++)
       C += Q;
       A = S[i] = ROTL3(C+(A+B));
       B = L[i%3]
// round 2
     for (i=26; i < 2*26; i++)
       A = S[i-26] = ROTL3(S[i-26]+(A+B));
       B = L[i%3] = ROTL(L[i%3]+(A+B),(A+B));
// round 3, encrypt
     A = /* S[0] = */ ROTL3(S[0]+(A+B));
     B = L[26%3] = ROTL(L[26%3]+(A+B),(A+B));
     C = rc5_72unitwork->plain.lo + A;
     A = /* S[1] = */ ROTL3(S[1]+(A+B));
     B = L[27%3] = ROTL(L[27%3]+(A+B),(A+B));
     D = rc5_72unitwork->plain.hi + A;
     for (i=1; i<=12; i++)
       A = /* S[(2*i)] */ = ROTL3(S[2*i]+(A+B));
       B = L[(2*i+26)%3] = ROTL(L[(2*i+26)%3]+(A+B),(A+B));
       C = ROTL(C^D,D)+A;
       A = /* S[(2*i+1)] */ = ROTL3(S[2*i+1]+(A+B));
       B = L[(2*i+27)%3] = ROTL(L[(2*i+27)%3]+(A+B),(A+B));
       D = ROTL(D^C,C)+A;

More information about the Hardware mailing list