[RC5] Hardware rotate and (OT) OGR Performance on P4 & Athlon

Julian Ruhe ruheejih at calvados.zrz.tu-berlin.de
Thu Dec 12 13:38:55 EST 2002


On my way to optimizing RC5-72 for the Athlon, I took a first look at
the RC5-72 and OGR code yesterday.
First of all: the P4 DOES have a hardware rotate ("rotl and rotr",
Port 1, "Integer Operation"), but it can process only one per cycle and
the latency is a very high 4 cycles. Intel does not recommend it when
latency is important, but it does exist, contrary to the claim in the
RC5-72 FAQ. What no x86 CPU has is an AltiVec-like vector rotate. That
is the secret. As long as I do not have enough information about how
RC5-72 works, I cannot say whether there is a fast workaround using MMX
or SSE2 shifts, which brings me to OGR:
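
As a rough illustration of what a shift-based workaround would involve:
MMX and SSE2 offer packed shifts but no packed rotate, so each rotate
would have to be composed from a left shift, a right shift and an OR.
The scalar C sketch below shows only that composition. The name rotl32
is chosen here for illustration and is not taken from the RC5-72 source,
and whether the three-operation form pays off on any given CPU is
untested.

```c
#include <stdint.h>

/*
 * Rotate a 32-bit word left by n bits using only shifts and an OR.
 * This is the per-lane operation a SIMD version would have to perform,
 * since the packed instruction sets provide shifts but no rotate.
 *
 * rotl32 is a hypothetical helper name, not part of any client source.
 */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                                   /* keep the count in 0..31 */
    /* (32 - n) & 31 maps n == 0 to a shift by 0 instead of by 32,
       which would be undefined behaviour in C. */
    return (x << n) | (x >> ((32 - n) & 31));
}
```

For example, rotl32(0x12345678, 8) yields 0x34567812: the top byte
wraps around to the bottom.
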
If my assumption is right that the assembly parts in the undocumented
macro chaos of OGR are the hotspots of the code, then I know why the
Athlon and P4 are slower than they could be. The SHLD (double-precision
shift) instruction used there is a VectorPath instruction on the Athlon.
These are microcoded instructions that stall almost everything in the
CPU. The Athlon needs 6 cycles to complete the operation. The P4
optimization guidelines do not even list the instruction, so I suspect
it performs disastrously there, too. Again, as long as I do not
understand the algorithm, I do not know whether there is a fast
workaround, but the existence of fast MMX and SSE shifts makes me very
optimistic.
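
For reference, the result of a 32-bit SHLD can also be produced with two
ordinary shifts and an OR. The C sketch below shows that equivalence and
how it carries bits from one word into the next. The names shld32 and
shl64_words are made up for this illustration, the two-word layout is an
assumption rather than a description of the actual OGR data structures,
and no timing has been measured for either form.

```c
#include <stdint.h>

/*
 * Same result as x86 "SHLD dest, src, n" on 32-bit operands:
 * dest is shifted left by n, and the vacated low bits are filled
 * with the top n bits of src.
 *
 * shld32 is a hypothetical helper name used only in this sketch.
 */
static inline uint32_t shld32(uint32_t dest, uint32_t src, unsigned n)
{
    n &= 31;              /* the instruction also masks the count to 5 bits */
    if (n == 0)
        return dest;      /* count 0 leaves dest unchanged; also avoids src >> 32 */
    return (dest << n) | (src >> (32 - n));
}

/*
 * Hypothetical use: shift a 64-bit quantity, stored as two 32-bit
 * words, left by n bits (n is reduced to 0..31).
 */
static inline void shl64_words(uint32_t *hi, uint32_t *lo, unsigned n)
{
    n &= 31;
    *hi = shld32(*hi, *lo, n);   /* high word picks up bits leaving the low word */
    *lo <<= n;                   /* low word is a plain shift */
}
```

For example, shld32(0x12345678, 0x9ABCDEF0, 8) yields 0x3456789A: the
top byte of the second operand lands in the low byte of the result.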


