[RC5] AMD X86-64 port
Décio Luiz Gazzoni Filho
decio at revistapcs.com.br
Sun Apr 27 13:32:45 EDT 2003
> I'm most interested to see what those extra registers in 64-bit mode can do
> for RC5-72 performance.
A lot, surely. For each key, the most heavily used variables are a scalar
called A and an array called L (3 entries long; it was 2 entries long in
RC5-64). One of the x86 registers, ECX, is reserved for rotate counts, since
the x86 instruction set only accepts a variable rotate count in this
register. So a core has 4 frequently accessed variables per key processed
(pipe), the ECX register cannot be used, and the ESP register holds the stack
pointer and cannot be fiddled with. Thus, out of 8 registers, only 6 are
usable. The current 2-pipe x86 cores would need 8, so they have to keep the L
arrays in memory; in RC5-64, with 3 variables per pipe, the L arrays were
stored in registers.
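To make the roles of A, L and the variable rotate concrete, here is a minimal C sketch of the standard RC5-32/12/9 key setup (72-bit key, so L has 3 words and S has 26). It is only an illustration of the algorithm, not the code of any actual core; the names rotl32 and rc5_72_expand_key are made up for this example.

```c
#include <stdint.h>

#define RC5_ROUNDS 12
#define RC5_T (2 * (RC5_ROUNDS + 1)) /* 26 entries in S */
#define RC5_C 3                      /* 72-bit key = 3 words in L */
#define RC5_P32 0xB7E15163u          /* RC5 magic constants for 32-bit words */
#define RC5_Q32 0x9E3779B9u

/* Rotate left by a variable count; on x86 the count would have to sit in CL. */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Expand a 9-byte key (little-endian byte order) into the table S. */
static void rc5_72_expand_key(const uint8_t key[9], uint32_t S[RC5_T])
{
    uint32_t L[RC5_C] = {0, 0, 0};
    uint32_t A = 0, B = 0;
    int i, j, k;

    /* Pack the key bytes into the 3-word array L. */
    for (i = 0; i < 9; i++)
        L[i / 4] |= (uint32_t)key[i] << (8 * (i % 4));

    /* Fill S with the fixed arithmetic progression. */
    S[0] = RC5_P32;
    for (i = 1; i < RC5_T; i++)
        S[i] = S[i - 1] + RC5_Q32;

    /* Mix: 3 passes over S. A and L[j] are touched on every step,
       which is why they are the values worth keeping in registers. */
    for (i = 0, j = 0, k = 0; k < 3 * RC5_T; k++) {
        A = S[i] = rotl32(S[i] + A + B, 3);     /* fixed rotate count */
        B = L[j] = rotl32(L[j] + A + B, A + B); /* variable rotate count */
        i = (i + 1) % RC5_T;
        j = (j + 1) % RC5_C;
    }
}
```

Every mixing step reads and writes A and one entry of L, and the second rotate takes a data-dependent count. That is the access pattern behind the "4 variables per pipe plus ECX" count above.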
With the Hammer's additional registers, it would be possible to keep all 9
values (4 per key for 2 keys, plus ECX) in registers, and use the remaining 6
registers (16 minus those 9 and the stack pointer) for variables needed by
the initialization/test tasks, which are less frequently used. A possible
arrangement is to store the RSA-supplied plaintext and ciphertext blocks
(that would take 4 registers), the number of loop iterations already
performed (another register), and the pointer to the data structure that
holds the remaining values (the last register).
Another reasonable arrangement would be to hold the key value for the current
iteration (that would take 3 registers), the number of iterations performed
(another register), the cheating countermeasure match count (another
register), and the pointer to the data structure that holds the remaining
values (the last register).
Also, observing that the second block of ciphertext is rarely used (roughly
once per key block, in expectation), a good arrangement is to store the two
RSA-supplied plaintext blocks and the first ciphertext block (3 registers)
and the key value for the current iteration (the remaining 3 registers).
Since ECX is never used while the remaining values are being read from
memory, it could be overwritten when needed by the pointer to the data
structure already mentioned. Another idea that occurred to me is to prefetch
values of the S array, which is 26 entries long and wouldn't fit in registers
anyway. There are endless possibilities.
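The point about the second ciphertext block can be sketched in C as follows. This assumes the usual RC5 round structure, where the last round produces the first output word before the second; the function name, the parameter names and the 0/1/2 return convention are all made up for this example.

```c
#include <stdint.h>

/* Rotate left by a variable count (masked to 0..31). */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/*
 * Final RC5 round with an early exit.
 * A, B     : state entering the last round
 * s24, s25 : last two entries of the expanded key table S
 * Returns 0 for no match, 1 if only the first word matches,
 * 2 if both words match.
 */
static int rc5_last_round_check(uint32_t A, uint32_t B,
                                uint32_t s24, uint32_t s25,
                                uint32_t cipher0, uint32_t cipher1)
{
    A = rotl32(A ^ B, B) + s24;
    if (A != cipher0)
        return 0;               /* common case: cipher1 is never touched */

    B = rotl32(B ^ A, A) + s25; /* rare case: fetch cipher1 from memory */
    return (B == cipher1) ? 2 : 1;
}
```

Only the first comparison runs for almost every key, so cipher0 is the value worth a register; cipher1 can stay in memory. A return value of 1 is presumably the kind of event a match counter like the one mentioned earlier would record.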
But probably the most profitable choice would be to move to a 3-pipe design,
since the Hammer's core has 3 ALUs (so did the Athlons, but the scarcity of
registers didn't help). That would require 12 registers, and ECX and the
stack pointer are tied up, so two registers remain. Good candidates for
these two registers would be the low value of the current iteration's key,
the first RSA-supplied ciphertext block, the iteration count, and the
pointer. I'd personally go for the first two, with the pointer stored in ECX
when needed.
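The register arithmetic in this message can be checked with a few lines of C. This is just bookkeeping, assuming 2 reserved registers (the rotate-count register and the stack pointer) as described above; spare_registers is a made-up helper name.

```c
#include <assert.h>

/*
 * Number of registers left over after giving every pipe its working set,
 * or -1 if the working set does not fit.
 */
static int spare_registers(int total_regs, int reserved,
                           int pipes, int vars_per_pipe)
{
    int usable, needed;

    assert(total_regs >= reserved && pipes > 0 && vars_per_pipe > 0);
    usable = total_regs - reserved;
    needed = pipes * vars_per_pipe;
    return (usable >= needed) ? usable - needed : -1;
}
```

With these assumptions, spare_registers(8, 2, 2, 3) is 0 (RC5-64 on x86 just fits), spare_registers(8, 2, 2, 4) is -1 (RC5-72 on x86 spills), spare_registers(16, 2, 2, 4) is 6 and spare_registers(16, 2, 3, 4) is 2, which matches the counts given for the 2-pipe and 3-pipe designs.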
As you can see, it's going to be a lot of work to tune it perfectly for the
Hammer, but there's going to be a performance boost right away, and further
boosts as we iron out the kinks.