[RC5] Hardware rotate and (OT) OGR Performance on P4 & Athlon
ruheejih at calvados.zrz.tu-berlin.de
Thu Dec 12 13:38:55 EST 2002
While getting ready to optimize RC5-72 for the Athlon, I took a first
look at the RC5-72 and OGR code yesterday.
First of all: the P4 DOES have a hardware rotate ("rotl and rotr",
Port1, "Integer Operation"), but it can process only one per cycle, and
the latency is a very high 4 cycles. Intel does not recommend it where
latency matters, but it does exist, contrary to the claim in the RC5-72
FAQ.
What NO x86 CPU has is an AltiVec-like vector rotate. That is the
secret. As long as I do not have enough information about how RC5-72
works, I cannot say whether there is a fast workaround using MMX or SSE2
shifts, which brings me to OGR:
If my assumption is right that the assembly parts in the undocumented
macro chaos of OGR are the hotspots of the code, then I know why the
Athlon and P4 are slower than they could be. The SHLD (Bitwise
Double-Precision Shift) instruction used there is a VectorPath
instruction on the Athlon. These are microcoded instructions that stall
almost everything else in the CPU. The Athlon needs 6 cycles to complete
the operation. In the P4 optimization guidelines the instruction is not
even listed, so I suspect it behaves disastrously there, too. Again, as
long as I have not understood the algorithm, I don't know whether there
is a fast workaround, but the existence of fast MMX and SSE shifts makes
me very optimistic.
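
To make the kind of workaround I have in mind concrete, here is a
minimal sketch in plain C, not code from the client. Both a rotate and a
SHLD can be written as two ordinary shifts plus an OR, and those are the
operations MMX and SSE2 offer in packed form. The function names are
made up for illustration, and the four-lane version assumes one shift
count shared by all lanes. Whether this fits what RC5-72 and OGR
actually need is exactly what I cannot judge yet.

```c
#include <stdint.h>

/* Rotate left built from two shifts and an OR.
   Hypothetical helper name, for illustration only. */
uint32_t rotl32_by_shifts(uint32_t x, unsigned n)
{
    n &= 31u;                   /* x86 rotates use the count modulo 32 */
    if (n == 0)
        return x;               /* avoid the undefined C shift by 32 */
    return (x << n) | (x >> (32u - n));
}

/* The same idea over four 32-bit lanes with ONE shared count (an
   assumption). A packed version would do each of the three steps
   (shift left, shift right, OR) on all lanes at once. */
void rotl32x4_by_shifts(uint32_t lanes[4], unsigned n)
{
    for (int i = 0; i < 4; i++)
        lanes[i] = rotl32_by_shifts(lanes[i], n);
}

/* SHLD dest, src, n emulated with plain shifts: dest moves left by n
   and the vacated low bits are filled from the top bits of src. */
uint32_t shld32_by_shifts(uint32_t dest, uint32_t src, unsigned n)
{
    n &= 31u;                   /* SHLD masks the count to 5 bits */
    if (n == 0)
        return dest;            /* count 0 leaves dest unchanged */
    return (dest << n) | (src >> (32u - n));
}
```

For example, shld32_by_shifts(0x12345678, 0x9ABCDEF0, 4) gives
0x23456789: the top nibble of the second operand slides into the bottom
of the first. This says nothing yet about which variant is faster on a
given CPU; that has to be measured.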