[rc5] Identical results, two CPUs

root marcus at dfwmm.net
Sat Jul 26 21:49:18 EDT 1997


Tom Wheeler wrote:
> 
> >Ah, think again, "mostly" means there is little benefit at all. The rc5
> >is almost linear code that runs straight through from start to end. The
> 
> I think I'm missing something here.  When you say, "\"mostly\" means
> there is little benefit at all", do you mean there is little benefit if
> the client doesn't fit entirely in L1 (because of its being nearly
> linear)?

Yes, the cache uses a "Least Recently Used" algorithm to decide
which currently used cache lines (storage units) to overwrite with
new (more recently used, like now) data. If, for instance, the cache
is 8k and the inner code loop is 8.1k, the cache will start filling
with the code fetches as it progresses through the code. As it nears
the end of the loop, the cache is full of inner loop code and so
something must be replaced. The LRU algorithm picks the oldest
stuff, just as it has always been doing, and so junks the code from
the beginning of the inner loop and replaces it with the code from
the end. You see what is going to happen when the loop returns to
the top? The code from the beginning of the inner loop will no
longer be in the cache, so it must be fetched from slower memory
again, junking the next-oldest lines in turn, and so it proceeds
all the way around the loop. Some folks call this worthless cache
behavior "thrashing", as it is constantly moving data in and then
(almost) immediately junking and reloading it. This is one example
of the optimising technique for fixed iteration loops called "loop
unrolling" being detrimental to performance.
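
For anyone who wants to see the effect in numbers, here is a minimal
sketch of the LRU behavior described above. It is a toy model, not a
description of any real chip: it assumes a fully associative cache and
32-byte lines (so 8k is 256 lines), and the function name count_misses
is made up for illustration. Real L1 caches are usually set associative,
so the actual miss pattern will differ.

```python
from collections import OrderedDict

def count_misses(cache_lines, loop_lines, iterations):
    """Count cache misses for a loop that fetches loop_lines consecutive
    cache lines in order, repeated for the given number of iterations.

    Assumption: fully associative cache with strict LRU replacement.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for _ in range(iterations):
        for line in range(loop_lines):
            if line in cache:
                cache.move_to_end(line)        # hit: now most recently used
            else:
                misses += 1                    # miss: must fetch from memory
                if len(cache) >= cache_lines:
                    cache.popitem(last=False)  # evict the least recently used
                cache[line] = True
    return misses

CACHE_LINES = 256  # 8k cache / 32-byte lines (line size is an assumption)

fits = count_misses(CACHE_LINES, 256, 100)     # loop exactly fills the cache
too_big = count_misses(CACHE_LINES, 260, 100)  # loop is 4 lines too large
print(fits, too_big)
```

In this model the loop that fits misses only on the first pass, while
the loop that is four lines too large misses on every single fetch,
because each line is junked just before the loop comes back around to
it.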

I wonder if Remi's P optimized v1 client has folded the unrolled
code back into a loop? Has anyone compared the performance of Intel
machines with looped and unrolled code in the RC5_KEY_CHECK function?

> Do you happen to know if the client fits in a P200's L1 cache?

Don't know, and can't find out for sure without the source or greater
effort. The difference between the 6x86 and Intel architectures is
twofold: the total cache space is larger in the Cyrix chips, and the
cache is a unified code/data design rather than the separate code
and data spaces that Intel chose. I do remember that the MMX procs have
a larger cache (16k each for code and data, I think) than the classic P,
so it might fit, but apparently does not.