[rc5] V2 Speed Probs on x86?

root marcus at dfwmm.net
Mon Jun 30 00:57:50 EDT 1997


Ronald Van Iwaarden wrote:
> 
> It is interesting to note that the AMD K5 got the biggest increase followed by
> the Cyrix 6x86.  The K6 did not get as great an increase which is a bit of a
> mystery to me.  My first instinct was that the branch prediction routines of the
> K5 and 6x86 were their biggest benefactor.  These routines probably make them act
> more like P6's under the present code rather than P5's.  This would lead me to
> believe that there is a great deal of opportunity for hand optimization of the
> code...  Of course, I could be just blowing all of this out my (insert (least)
> favorite body part).

I think that you're blowing it all out of your most obnoxious body part.
;-P

There were not very many branches in the v1 code (a minimum of two
conditionals per key as I recall, a max of 8-10), and I doubt that
optimizing even 20 branches and their pipeline flushes (all of the ifs
are bunched together), relative to the ~1000 or so instructions that are
required to "check a key", would produce a 50% performance boost.
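
To put a rough number on that, here is a back-of-the-envelope sketch. The
4-cycle mispredict penalty and the 1 cycle per instruction are assumptions
picked for illustration, not measurements of any client, and the function
name is made up for this example.

```python
# Rough upper bound on what perfect branch prediction could buy per key.
# All cycle counts here are illustrative assumptions, not measurements.

def max_branch_speedup(instructions, branches, mispredict_penalty,
                       cycles_per_instruction=1.0):
    """Return the best-case speedup if every branch penalty vanished."""
    base_cycles = instructions * cycles_per_instruction
    # Worst case for the old code: every branch mispredicts and flushes.
    penalty_cycles = branches * mispredict_penalty
    return (base_cycles + penalty_cycles) / base_cycles

speedup = max_branch_speedup(instructions=1000, branches=20,
                             mispredict_penalty=4)
print(f"best-case gain from branches alone: {(speedup - 1) * 100:.1f}%")
```

Even with every one of 20 branches mispredicting in the old code and none
in the new, the gain under these assumptions stays in the single digits,
nowhere near 50%.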

My best guess is that they figured out how to make better use of the two
parallel integer pipelines, and that helped every(x86)body. The big jump
for AMD and Cyrix must come from the larger caches on the non-Intel
machines, which allow all of the inner loop code to reside in the L1
cache. Compare how rapidly _any_ code runs when it executes entirely at
processor clock speed, say 133-200MHz, with how much more slowly it
would run if every 4th or perhaps 8th instruction fetch produced a
pipeline stall while waiting for a cache line fill from the L2 cache at
a mere 66MHz.
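
A crude model of that comparison might look like the following. It assumes
one instruction per core clock when nothing stalls, and a line fill that
costs a fixed number of bus clocks (6 is a guess, not a datasheet figure).
It ignores dual issue, prefetch, and everything else that makes real
pipelines complicated.

```python
# Crude fetch-stall model: how fast does code run, relative to running
# entirely from L1, if every Nth instruction fetch waits on an L2 fill?
# bus_clocks_per_fill is an assumed value, not taken from any datasheet.

def relative_speed(core_mhz, bus_mhz, bus_clocks_per_fill,
                   instructions_per_stall):
    """Fraction of stall-free speed achieved with periodic L2 line fills."""
    # A fill that takes N bus clocks costs N * (core/bus) core clocks.
    core_clocks_per_fill = bus_clocks_per_fill * core_mhz / bus_mhz
    # Spread the stall cost over the instructions between stalls.
    cycles_per_instruction = 1.0 + core_clocks_per_fill / instructions_per_stall
    return 1.0 / cycles_per_instruction

for core in (133, 200):
    for every in (4, 8):
        frac = relative_speed(core, 66, bus_clocks_per_fill=6,
                              instructions_per_stall=every)
        print(f"{core}MHz core, stall every {every} fetches: "
              f"{frac * 100:.0f}% of L1-resident speed")
```

Note that in this model the faster core loses a larger fraction of its
speed, since the same bus-side delay costs it more core clocks.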

See how the Pentium Pros excel with their big fast caches as well.

Lastly, if the code is still "loop-unrolled", as I imagine it is, the L1
cache just won't help that much unless the whole thing fits. The fetches
for code at the end of the loop will start refilling the cache lines
that held the code from the start of the loop, so the only benefit left
is the L2->L1 burst transfers. So start boosting your bus speed and
turning down the CPU clock multiplier!
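
The eviction effect can be sketched with a toy simulation. This assumes a
direct-mapped instruction cache and a straight-line loop body measured in
cache lines. Both are simplifications (real L1 caches of this era are
typically set-associative), and the sizes used are hypothetical.

```python
# Toy direct-mapped instruction cache: count the misses in one pass over
# an unrolled loop body once the cache has warmed up.
# Cache and loop sizes are hypothetical, measured in cache lines.

def steady_state_misses(loop_lines, cache_lines, iterations=10):
    """Misses during the final pass over a straight-line loop body."""
    cache = [None] * cache_lines      # slot -> line address currently held
    misses = 0
    for _ in range(iterations):
        misses = 0                    # keep only the last pass's count
        for addr in range(loop_lines):
            slot = addr % cache_lines
            if cache[slot] != addr:   # line was never loaded or was evicted
                misses += 1
                cache[slot] = addr
    return misses

for loop in (200, 300, 512):
    print(f"loop of {loop} lines, cache of 256 lines: "
          f"{steady_state_misses(loop, 256)} misses per pass")
```

A loop that fits never misses after the first pass. One that overflows by
a little misses twice for every line of overflow, because the tail of the
loop and the head keep evicting each other. One twice the cache size
misses on every single line.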