[RC5] AMD X86-64 port

Décio Luiz Gazzoni Filho decio at revistapcs.com.br
Sun Apr 27 13:32:45 EDT 2003



> I'm most interested to see what those extra registers in 64-bit mode can do
> for RC5-72 performance.

A lot, surely. For each key, the most heavily used variables are a scalar
called A and an array called L (3 entries long in RC5-72; it was 2 entries
long in RC5-64). One of the x86 registers, ECX, is reserved for rotate
counts, since the x86 instruction set only accepts a variable rotate count
in that register. So a core has 4 frequently accessed variables per key
processed (per pipe), ECX cannot be used for anything else, and ESP holds
the stack pointer and cannot be fiddled with. Thus, out of 8 registers, only
6 are usable. Two pipes need 8 registers for RC5-72, so the current 2-pipe
x86 cores have to keep the L arrays in memory; in RC5-64 two pipes needed
only 6, and the L arrays were kept in registers.
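
For anyone who hasn't looked at the cipher itself, the per-key working set
described above comes from RC5's key setup loop. Here is a minimal sketch in
Python, assuming the standard RC5 parameters for 32-bit words, 12 rounds and
a 72-bit key (so S has 26 entries and L has 3). The names key_setup and rotl
are illustrative only, and the real cores are hand-written assembly, not
this. It is meant to show A, the three L words, and the data-dependent
rotate count that ties up ECX.

```python
MASK = 0xFFFFFFFF                    # 32-bit words
P32, Q32 = 0xB7E15163, 0x9E3779B9    # RC5 magic constants for 32-bit words
T = 26                               # entries in S: 2 * (12 rounds + 1)
C = 3                                # entries in L: a 72-bit key in 32-bit words


def rotl(x, n):
    """Rotate a 32-bit word left by n; only the low 5 bits of n are used."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def key_setup(key_words):
    """Expand a 72-bit key, given as three words [lo, mid, hi], into S."""
    L = [w & MASK for w in key_words]
    L[2] &= 0xFF                     # the top word carries only 8 key bits
    S = [(P32 + i * Q32) & MASK for i in range(T)]
    A = B = 0
    i = j = 0
    for _ in range(3 * T):           # 3 * max(T, C) mixing steps
        # fixed rotate by 3: no count register needed
        A = S[i] = rotl((S[i] + A + B) & MASK, 3)
        # variable rotate by A + B: on x86 this count has to sit in ECX
        B = L[j] = rotl((L[j] + A + B) & MASK, A + B)
        i = (i + 1) % T
        j = (j + 1) % C
    return S
```

Note that B is always the L word that was just updated, so per key the loop
really does touch only A and the three L words, plus one S entry per step.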

With the Hammer's additional registers (16 general-purpose registers in
64-bit mode, one of them still the stack pointer), it would be possible to
keep all 9 values (4 per key for 2 pipes, plus ECX) in registers, and use the
remaining 6 registers to hold variables needed for the initialization/test
tasks, which are less frequently used. A possible arrangement is to hold the
RSA-supplied plaintext and ciphertext blocks (that would take 4 registers),
the number of loop iterations already performed (another register), and the
pointer to the data structure that holds the remaining values (the last
register).

Another reasonable arrangement would be to hold the key value for the current 
iteration (that would take 3 registers), the number of iterations performed 
(another register), the cheating countermeasure match count (another 
register), and the pointer to the data structure that holds the remaining 
values (the last register).
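
As a quick sanity check on the two arrangements so far, the sketch below
tallies the registers each one needs against the 6 spare registers. The
dictionary labels are just shorthand for the items listed above and are not
names taken from any existing core.

```python
SPARE = 6  # 16 registers - 8 per-key values - ECX - stack pointer

# Register cost of each item, as counted in the text.
arrangements = {
    "blocks in registers": {
        "plaintext and ciphertext blocks": 4,
        "iteration count": 1,
        "data structure pointer": 1,
    },
    "key in registers": {
        "current key": 3,
        "iteration count": 1,
        "countermeasure match count": 1,
        "data structure pointer": 1,
    },
}


def registers_used(arrangement):
    """Total number of registers an arrangement asks for."""
    return sum(arrangement.values())


for name, arrangement in arrangements.items():
    print(f"{name}: {registers_used(arrangement)} of {SPARE} spare registers")
```

Both come out at exactly 6, so neither leaves anything to spare.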

Also, observing that the second block of ciphertext is rarely used (once per
block, in expectation), a good arrangement is to keep the two RSA-supplied
plaintext blocks and the first ciphertext block (3 registers) and the key
value for the current iteration (the remaining 3 registers). Since ECX is
never needed for a rotate count while the remaining values are being read
from memory, it could be overwritten at those points by the pointer to the
data structure already mentioned. Another idea that occurred to me is to
prefetch values of the S array, which is 26 entries long and wouldn't fit in
registers anyway. There are endless possibilities.
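
To illustrate why the second ciphertext block can live in memory, here is a
rough sketch of the test step as a two-stage compare. Everything in it is
hypothetical: search is a made-up name, encrypt is a caller-supplied
stand-in for the cipher, and I'm assuming the match count tallies keys whose
first ciphertext block matches. The toy function at the bottom is not RC5;
it only exercises the control flow.

```python
def search(keys, encrypt, target):
    """Scan keys and return (found_key, partial_matches).

    encrypt(key) must return the two ciphertext words (c0, c1).
    found_key is None when no key reproduces the full target.
    """
    t0, t1 = target
    partial = 0
    for key in keys:
        c0, c1 = encrypt(key)
        if c0 != t0:
            continue          # common case: the second word is never examined
        partial += 1          # first word matched; this is the rare path
        if c1 == t1:
            return key, partial
    return None, partial


def toy(key):
    """Toy stand-in for the cipher, small enough to check by hand."""
    return (key * 7) % 5, key % 11


print(search(range(20), toy, (1, 2)))   # prints (13, 3)
```

With a real 32-bit first word the rare path would be taken about once in
2^32 keys, so the second compare costs next to nothing even if its operand
has to be fetched from memory.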

But probably the most profitable choice would be to move to a 3-pipe design,
since the Hammer's core has 3 ALUs (so did the Athlons, but the scarcity of
registers didn't help). That would require 12 registers for the per-key
values; with ECX tied up and the stack pointer reserved, two registers
remain. Candidates for those two are the low value of the current
iteration's key, the first RSA-supplied ciphertext block, the iteration
count, and the pointer. I'd personally go for the first two, with the pointer
kept in ECX when needed.
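
The register arithmetic used throughout this message is easy to check with a
few lines of Python. The helper name spare_registers is made up for this
sketch; it simply subtracts the per-key values and the two reserved registers
(ECX for rotate counts and the stack pointer) from the total.

```python
def spare_registers(total, pipes, per_pipe, reserved=2):
    """Registers left over after the per-key working set is placed.

    reserved covers the rotate-count register (ECX) and the stack pointer.
    A negative result means the working set spills to memory.
    """
    return total - reserved - pipes * per_pipe


print(spare_registers(8, 2, 3))    # 32-bit x86, RC5-64, 2 pipes: prints 0
print(spare_registers(8, 2, 4))    # 32-bit x86, RC5-72, 2 pipes: prints -2
print(spare_registers(16, 2, 4))   # 64-bit mode, RC5-72, 2 pipes: prints 6
print(spare_registers(16, 3, 4))   # 64-bit mode, RC5-72, 3 pipes: prints 2
```

The -2 is the reason the current 2-pipe cores spill the L arrays, and the 2
in the last line is the pair of registers being fought over above.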

As you can see, it's going to be a lot of work to tune it perfectly for the 
Hammer, but there's going to be a performance boost right away, and further 
boosts as we iron out the kinks.

Décio

