[RC5] Athlon core

Bruce Ford b.ford at qut.edu.au
Thu Sep 30 18:02:21 EDT 1999

WARNING: Highly technical detail follows

As Dan Oetting has thrown down the challenge of the fastest RC5 
core, I would like to investigate a faster core for the AMD Athlon/K7.

The PPC core currently takes about 304 clocks per key.  I believe 
that the Athlon could take ~250 clocks per key by using a mix of 
the x86 integer and MMX code.

My reasoning is this.

The PPro/PII/PIII core takes ~340 clocks per key.  The MMX core 
takes ~460 clocks/key.

The Athlon can decode 3 DirectPath instructions per clock and can 
issue 3 ALU, 3 Address Generation and 3 MMX instructions per 
clock.

Due to the limitations of the x86 register set, it is difficult to use the 
third ALU unit effectively.  However, the MMX units can be used 
because, unlike on the PII/PIII/K6-2/K6-3, they do not use the same 
issue slots as the ALUs.

Without going into detail, there are 4 stages to the RC5 algorithm: 
3 rounds of key expansion and a round where the plaintext is 
mixed with the expanded key.  My proposal is to have the MMX 
units process part of the first round of the key expansion of the 
next key pair while the ALUs do the rest on the current key pair.
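
For anyone who has not looked at the algorithm, the four stages can be 
sketched in Python as below.  This is only an illustration, assuming the 
RC5-32/12 parameters (32-bit words, 12 rounds) and the usual magic 
constants from the published RC5 description; the names expand_key and 
encrypt_block are made up for this sketch and are not taken from any 
existing core.  With an 8-byte key the mixing loop runs 78 times, which 
is three passes over the 26-word table S, i.e. the 3 rounds of key 
expansion.

```python
MASK = 0xFFFFFFFF          # work in 32-bit words
P32 = 0xB7E15163           # standard RC5 constants for w = 32
Q32 = 0x9E3779B9


def rotl(x, n):
    """Rotate a 32-bit word left by n (only the low 5 bits of n are used)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def expand_key(key, rounds=12):
    """Key expansion: three passes mixing the key words L into the table S."""
    c = max(1, (len(key) + 3) // 4)          # number of key words
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):    # load key bytes little-endian
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    t = 2 * (rounds + 1)                     # 26 table entries for 12 rounds
    S = [(P32 + i * Q32) & MASK for i in range(t)]
    A = B = i = j = 0
    for _ in range(3 * max(t, c)):           # 78 steps = 3 passes over S
        A = S[i] = rotl((S[i] + A + B) & MASK, 3)
        B = L[j] = rotl((L[j] + A + B) & MASK, A + B)
        i = (i + 1) % t
        j = (j + 1) % c
    return S


def encrypt_block(S, A, B, rounds=12):
    """Final stage: mix the two plaintext words with the expanded key."""
    A = (A + S[0]) & MASK
    B = (B + S[1]) & MASK
    for r in range(1, rounds + 1):
        A = (rotl(A ^ B, B) + S[2 * r]) & MASK
        B = (rotl(B ^ A, A) + S[2 * r + 1]) & MASK
    return A, B


if __name__ == "__main__":
    S = expand_key(bytes(8))                 # an all-zero 8-byte key
    print([hex(w) for w in encrypt_block(S, 0, 0)])
```

Note that every step of the expansion depends on the previous A and B, 
which is why a core works on several keys at once to fill the issue slots.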

The current MMX code processes 4 keys at a time giving 3680 
paired instructions per loop. The integer code does 2 keys in 
parallel giving 1360 paired instructions per loop. (I know this is not 
exactly right and I know I could go off and count them but for now 
just let it ride please.)

This has to be combined into a number of instruction triples, 2 
integer and 1 MMX.  Note that our limitation here is that the 
decoders can only process 3 instructions per clock.

In a loop of n triples processing 2 keys, the fraction of the work 
completed by the integer code is n/680 (2n of the 1360 integer 
instructions) and the fraction completed by the MMX code is n/1840 
(n of the 1840 MMX instructions that 2 keys require).

So n/680+n/1840=1   =>   n=496

Since this is for 2 keys, that's 248 clocks per key.
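
The arithmetic above can be checked with a few lines of Python.  This is 
a rough sketch using only the instruction counts quoted in this post 
(which are themselves estimates); the function name estimate_triples is 
just for illustration, and it assumes one triple is decoded per clock.

```python
from fractions import Fraction


def estimate_triples(int_instr_per_2_keys=1360, mmx_instr_per_4_keys=3680):
    """Return (triples per 2-key loop, clocks per key) for the mixed core."""
    # Each triple carries 2 integer instructions and 1 MMX instruction.
    int_share = Fraction(2, int_instr_per_2_keys)            # 1/680 per triple
    # The MMX code handles 4 keys per loop, so 2 keys need half of it.
    mmx_instr_per_2_keys = Fraction(mmx_instr_per_4_keys, 2)  # 1840
    mmx_share = 1 / mmx_instr_per_2_keys                     # 1/1840 per triple
    # Solve n * int_share + n * mmx_share = 1 for n.
    n = 1 / (int_share + mmx_share)
    # One triple per clock (assumed), and the loop covers 2 keys.
    return n, n / 2


if __name__ == "__main__":
    n, clocks_per_key = estimate_triples()
    print(float(n), float(clocks_per_key))
```

Changing the two defaults shows how sensitive the estimate is to the 
instruction counts, which would be worth doing once they are counted 
properly.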

Now, why am I telling you all this rather than doing it myself?
1. I don't have access to an Athlon processor.
2. My time for assembler programming is very limited right now.
3. Someone else might use these ideas or work with me to bring 
them to fruition.
4. Someone may already be working on an Athlon core and have 
better ideas or prefer not to reinvent wheels.

Comments and corrections welcome.

Bruce Ford                                      b.ford at qut.edu.au
Systems Programmer
Teaching and Learning Support Services          Ph: +61 7 3864 1178
Queensland University of Technology
