[RC5] K6-2 MMX RC5 core

Bruce Ford b.ford at qut.edu.au
Mon Nov 15 13:52:53 EST 1999


>Date: Sun, 7 Nov 1999 20:16:09 +0300
>From: "Dmitri Besedin" <nickms at com2com.ru>
>Subject: [RC5] K6-2 MMX RC5 core
>
>Greetings,
>
>Anybody knows how it's going on with the RC5 MMX core for the AMD K6-2 CPU.
>I've just upgraded my 2nd PC, K6 233 to K6-2 266 which I've overclocked to
>333 MHz. Now it works a little bit faster, making 578 Kkeys/sec (the keyrate
>was 405 Kkeys/sec before). I've read previously that implementing the MMX
>core for K6-2 could make it faster at RC5. Is it true, and how much gain
>could we expect? Were there any attempts to implement this core in the RC5
>clients?

There have been a number of attempts to get the K6-2/K6-III to run the
RC5-MMX core at the same clocks/key as the P5MMX does (~463 clocks/key).
The most notable was a "contest" between Steve Porter (now of dcypher.org)
and "AMDBob".  They managed to get close to parity with the ALU K6 core
(~610 clocks/key) but could not improve on it further.
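
For scale, clocks/key converts to keyrate as clock frequency divided by
clocks/key.  A minimal C sketch of that arithmetic follows.  The function
name is made up for illustration, and 333 MHz is simply the clock speed
from the quoted message, not a benchmark.

```c
#include <assert.h>
#include <stdio.h>

/* Estimated keys per second for a core that needs clocks_per_key
   clocks per key on a CPU running at clock_hz.
   (Illustrative helper, not part of any client source.) */
double keys_per_second(double clock_hz, double clocks_per_key)
{
    assert(clocks_per_key > 0.0);
    return clock_hz / clocks_per_key;
}

/* Print one estimate in kkeys/sec. */
void print_keyrate(const char *label, double clock_hz, double clocks_per_key)
{
    printf("%-16s %7.1f kkeys/sec\n", label,
           keys_per_second(clock_hz, clocks_per_key) / 1000.0);
}
```

At 333 MHz this works out to roughly 546 kkeys/sec at 610 clocks/key and
roughly 719 kkeys/sec at 463 clocks/key.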

Personally I have made about 3 attempts and managed to get the first
"cycle" of the RC5 algorithm to execute with a fully paired instruction
count.  Please note that counting clocks and "pairings" is difficult on
out-of-order processors, so this result may be dubious.  The interesting
thing was that when repeating the instruction sequence for the next RC5
"cycle", a two-clock delay was introduced which I could not remove.

My current theory is that the extra clocks are due to the lack of
pre-decode information for instructions which span the 32-byte cache
line boundaries.  On the K6-2/K6-III the MMX instructions are short decodes
but must have the pre-decode information available in order to pair
correctly at the decode stage.  If the pre-decode information is not
present, the decode reverts to an unpaired long decode.

So all we have to do is make all the instruction sequences line up on
32-byte boundaries without breaking the logic, causing resource contention
for the shifter, or upsetting the latencies of the memory loads.
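
As a rough way to reason about the alignment constraint, the C sketch
below checks whether an instruction of a given length at a given code
offset would span a 32-byte line, and how much padding would push it onto
the next line.  The helper names are hypothetical, and the sketch says
nothing about the shifter or load-latency constraints, which are the
hard part.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 32u  /* cache line size mentioned above */

/* Nonzero if an instruction of len bytes starting at byte offset
   `offset` crosses a LINE_BYTES boundary. */
int spans_line(size_t offset, size_t len)
{
    assert(len > 0 && len <= LINE_BYTES);
    return (offset / LINE_BYTES) != ((offset + len - 1) / LINE_BYTES);
}

/* Padding bytes to insert before the instruction so that it starts
   on the next line instead of spanning; 0 if it already fits. */
size_t pad_to_avoid_span(size_t offset, size_t len)
{
    if (!spans_line(offset, len))
        return 0;
    return LINE_BYTES - (offset % LINE_BYTES);
}
```

For example, a 3-byte instruction at offset 30 spans a boundary and
would need 2 bytes of padding, while the same instruction at offset 29
fits.  Padding is not free, of course, since the filler still has to be
fetched and decoded.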

From other posts which have followed there appears to be some confusion
between the RC5 MMX code and the DES MMX code.  The RC5 MMX code is not
bitslice code.  It processes 4 keys in parallel using duplication and shift
to emulate the rotate left instruction.
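
To illustrate the "duplication and shift" idea in plain C rather than
MMX: a 32-bit rotate left can be had by copying the value into both
halves of a 64-bit quantity, shifting once, and keeping the high half.
This is only a scalar model of one lane under my own naming, not the
actual core; the 4-lane loop simply stands in for the keys processed
side by side.

```c
#include <stdint.h>

#define LANES 4  /* keys handled in parallel, as described above */

/* Reference rotate-left built from two shifts and an OR. */
uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31u;
    return r ? (x << r) | (x >> (32u - r)) : x;
}

/* Rotate-left by duplication: place x in both halves of a 64-bit
   value, shift left by r, and take the upper 32 bits. */
uint32_t rotl32_dup(uint32_t x, unsigned r)
{
    uint64_t dup = ((uint64_t)x << 32) | x;
    r &= 31u;
    return (uint32_t)((dup << r) >> 32);
}

/* Apply a per-lane rotate count to each of the LANES values. */
void rotl_lanes(uint32_t v[LANES], const uint32_t r[LANES])
{
    for (int i = 0; i < LANES; i++)
        v[i] = rotl32_dup(v[i], r[i]);
}
```

Both functions return the same result for every rotate count from 0
to 31; the duplicated form just avoids needing a second shift in the
opposite direction.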

DES MMX, however, is bitslice code, which means it processes 64 keys in
parallel by emulating a series of electronic gates (to put it simply).  To
bitslice RC5 so that it would be faster than the standard algorithm would
require registers 192 bits wide, the last time anyone looked at it.
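
For anyone unfamiliar with bitslicing, here is a small C sketch of the
idea, assuming 64-bit words and a toy 8-bit adder (neither DES nor RC5).
Each word holds one bit position for 64 independent instances, and every
operation is expressed as gates.  All names are my own.  It also hints at
why RC5 is awkward to bitslice: its additions and data-dependent rotates
have to be built out of gates like these.

```c
#include <stdint.h>

#define NBITS 8            /* width of the toy adder */
typedef uint64_t slice_t;  /* bit k belongs to instance k */

/* One-bit full adder evaluated for 64 instances at once. */
void full_adder(slice_t a, slice_t b, slice_t cin,
                slice_t *sum, slice_t *cout)
{
    slice_t axb = a ^ b;
    *sum  = axb ^ cin;
    *cout = (a & b) | (axb & cin);
}

/* Ripple-carry add: x[i] holds bit i of every instance. */
void add_sliced(const slice_t x[NBITS], const slice_t y[NBITS],
                slice_t out[NBITS])
{
    slice_t carry = 0;
    for (int i = 0; i < NBITS; i++)
        full_adder(x[i], y[i], carry, &out[i], &carry);
}

/* Store the NBITS-bit value v as instance k (0..63). */
void put(slice_t s[NBITS], int k, unsigned v)
{
    for (int i = 0; i < NBITS; i++)
        s[i] = (s[i] & ~((slice_t)1 << k))
             | ((slice_t)((v >> i) & 1u) << k);
}

/* Read back the NBITS-bit value of instance k. */
unsigned get(const slice_t s[NBITS], int k)
{
    unsigned v = 0;
    for (int i = 0; i < NBITS; i++)
        v |= (unsigned)((s[i] >> k) & 1u) << i;
    return v;
}
```

One call to add_sliced() performs 64 independent 8-bit additions, but
it takes a handful of gate operations per bit to do it, which is the
trade-off bitslicing always makes.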

FWIW CSC is bitslice code.


Bruce Ford                                      b.ford at qut.edu.au
Systems Programmer
Teaching and Learning Support Services          Ph: +61 7 3864 1178
Queensland University of Technology



