[RC5] K6-2/350 vs PII 266 mobile performance

Eric Gindrup gindrup at okway.okstate.edu
Mon Jan 4 22:03:54 EST 1999

     Although what you have said is usually in a FAQ somewhere
     (http://www.distributed.net/FAQ/dcti-faq.html#TD0080 ought to 
     work, but is currently broken), it is not exactly relevant to the 
     question asked.
     The AMD K6-2 Processor Data Sheet (as of November 1998), also 
     available at http://www.amd.com/K6/k6docs/pdf/21850e.pdf, indicates 
     in section 2.6 (p. 16, "physical page" 34) that the Integer (and 
     MMX) Shifts have a latency of 1 cycle (and can start on every 
     cycle).  (The K6 also has this property.)
     The K6-2 can decode one move or shift instruction per clock (using 
     the short decoder unless, for some odd reason, the instruction is 
     longer than 7 bytes) and feed it to the Integer X execution unit.  
     This unit produces exactly one (RISC86) microcode instruction (to 
     perform the shift) and that operation requires only one cycle to 
     execute and then to retire.
     Although it sounds as if this should take ~4 cycles, the magic of 
     pipelining makes it appear (when all is well-tuned) that this 
     instruction takes one cycle to perform.  (It starts one cycle after 
     the preceding instruction and finishes one cycle after it, at 
     least as far as the software can tell.)  The P5, PPro, Celeron, 
     PII, and Xeon all have the same property.
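
     The pipelining arithmetic above can be sketched in a few lines.  
     This is a toy model, not a description of the real K6-2 pipeline: 
     the depth of 4 stages is only an assumption taken from the "~4 
     cycles" figure, and the function names are made up for 
     illustration.  It assumes independent one-cycle operations with 
     no stalls.

```python
def pipelined_cycles(n_instructions, pipeline_depth=4):
    """Cycles to finish n independent ops in an ideal pipeline.

    The first op needs pipeline_depth cycles to get through; every later
    op starts one cycle after its predecessor and finishes one cycle
    after it. (pipeline_depth=4 is an assumed, illustrative value.)
    """
    if n_instructions == 0:
        return 0
    return pipeline_depth + (n_instructions - 1)


def unpipelined_cycles(n_instructions, pipeline_depth=4):
    """Cycles if each op had to finish completely before the next began."""
    return pipeline_depth * n_instructions


if __name__ == "__main__":
    for n in (1, 10, 1000):
        total = pipelined_cycles(n)
        # The average cost per op approaches 1 cycle as n grows.
        print(n, total, total / n)
```

     For a long run of shifts the average cost per instruction 
     approaches one cycle, which is the sense in which the instruction 
     "takes one cycle".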
     The largest performance gap between the listed machines is bus 
     speed.  The K6-2 is necessarily running on a 100MHz bus and the PII 
     on a 66MHz bus.  This would seem to give the K6-2 a 50% advantage, 
     *but* remember that the L2 on the PII is running at 133MHz while 
     the L2 for the K6-2 is at 100MHz.  L2 cache sizes can also have an 
     impact (if they are significantly different).
     An interesting architectural point that will probably result in a 
     great deal of disagreement is that the 350MHz processor and the bus 
     frequently (about 50% of the time) are a half (bus) cycle out of 
     sync and a half (bus) cycle latency is inserted at "odd times".  (On 
     long string operations, this can equalize a K6-2/300 and a /350 
     since the cache operations will occur at exactly the same times.  
     Only (over-) tuned insertion of additional opcodes to take up the 
     single (processor) cycle delays can show the difference and *these* 
     instructions have to fetch, read, and write entirely from the L1 
     cache.)
        The RC5 client is generally believed to run its "tight loop" in 
     the L1 cache and there is reason to believe that the cores are 
     unrolled just enough to almost fill the L1 instruction caches on 
     pretty much all the clients.  So this shouldn't matter.
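
     One possible reading of the "about 50% of the time" remark on 
     clock sync can be checked with exact fractions.  The sketch 
     assumes the /350 runs at 3.5 x 100MHz and the /300 at 3 x 100MHz 
     (multipliers inferred from the clock speeds, not stated in this 
     post), and the function name is hypothetical.  It only counts bus 
     clock edges that do not coincide with a core clock edge; it says 
     nothing about how long any resulting delay really is.

```python
from fractions import Fraction


def misaligned_bus_edges(core_mhz, bus_mhz, n_bus_cycles=1000):
    """Fraction of bus clock edges that miss a core clock edge.

    Bus edge j happens after (core_mhz / bus_mhz) * j core cycles. If that
    number is not a whole number, the edge lands in the middle of a core
    cycle. Both clocks are assumed to start aligned at j = 0.
    """
    multiplier = Fraction(core_mhz, bus_mhz)
    off = sum(
        1 for j in range(n_bus_cycles)
        if (multiplier * j).denominator != 1
    )
    return Fraction(off, n_bus_cycles)


if __name__ == "__main__":
    print("K6-2/350:", misaligned_bus_edges(350, 100))  # assumed 3.5x multiplier
    print("K6-2/300:", misaligned_bus_edges(300, 100))  # assumed 3.0x multiplier
```

     Under these assumptions every other bus edge misses a core edge 
     on the /350, while the /300 always lines up.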
     The PII rate is in line with what I'm used to seeing in my clients.  
     (A little high, but not "surprising".)  The K6-2 stat is a little 
     lower than what I'm used to seeing.  (Again, not "surprisingly" so.)
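
     To compare the two rates on a per-clock basis, the figures from 
     the original question quoted below (590 kkeys/s on the K6-2/350, 
     700 kkeys/s on the PII 266) can be normalized by clock speed.  
     This is plain arithmetic on the reported numbers; the helper name 
     is made up for illustration.

```python
# Reported keyrate (kkeys/s) and core clock (MHz) from the quoted question.
reported = {
    "K6-2/350": (590, 350),
    "PII 266": (700, 266),
}


def kkeys_per_mhz(kkeys_per_s, clock_mhz):
    """Keyrate normalized by core clock (kkeys/s per MHz)."""
    return kkeys_per_s / clock_mhz


per_mhz = {name: kkeys_per_mhz(*vals) for name, vals in reported.items()}
ratio = per_mhz["K6-2/350"] / per_mhz["PII 266"]

if __name__ == "__main__":
    for name, value in per_mhz.items():
        print(f"{name}: {value:.2f} kkeys/s per MHz")
    print(f"K6-2 per-clock rate relative to PII: {ratio:.2f}")
```

     That works out to roughly 1.69 versus 2.63 kkeys/s per MHz, so 
     per clock the K6-2 is doing about 64% of what the PII does on 
     this workload.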
     I have not heard an actual, convincing argument as to why the K6s 
     scale with clock the way a P5 does.  The internal architecture is 
     somewhere 
     between that of the PPro and PII.  The K6-2 is in many respects a 
     more performance-oriented processor than the PII.  (Of course, there 
     are those who disagree...)  I also haven't heard a valid argument 
     for its performance (somewhere between a P5 and PPro) on RC5.
        This is not to say that there isn't some transparently obvious 
     reason.  I just haven't heard it, nor have I had an indicative 
     thought that went "all the way".  I'd really like to know why the 
     K6-2s scale in an architecturally expected manner in (almost) all 
     things but RC5 (with respect to Intel procs).
            -- Eric Gindrup ! gindrup at okway.okstate.edu

______________________________ Reply Separator _________________________________
Subject: Re: [RC5] K6-2/350 vs PII 266 mobile performance 
Author:  <rc5 at lists.distributed.net> at SMTP
Date:    12/28/98 10:52 AM

John "Chris" Wren wrote:
>         I have a question about performance issues.  I have a K6-2/350 
> that benchmarks about 25% higher than my PII 266 in every arena,
> except for RC5 keyrates.  I'm seeing about 590kkps on the K6, and 
> 700kkps on the PII.
>         Both systems are selecting the correct cores, etc.  Why is the 
> K6 core performing so poorly?
Sigh, this has got to be in a FAQ somewhere.
The Intel x86 chips have a hardware single-cycle rotate 
instruction. That is, bits can be rotated around in the 
registers by an arbitrary number of positions. The RC5 
algorithm depends very heavily on this kind of 
bit-manipulating operation. Most processors do not 
support this. Either the multi-bit ROTate is microcoded 
(native instruction is supported but takes more than a 
single cycle) or it must be emulated in the software by 
multiple SHIFTs to the left and right and then the 
results of those shifts are merged with an OR. 
Multiple-cycle operations, whether done in the 
'hardware' or in the software, take longer than single 
cycles.
Intel chips (and Cyrix and PowerPC, I think) support the 
single-cycle rotate. Most others do not.
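
The software emulation described above can be sketched as 
follows. This is a minimal illustration in Python, not the 
code of any actual RC5 client core; the function names are 
made up, and the two shifted halves are combined with a 
bitwise OR. The half-round shows the usual RC5 form, where 
the rotate count comes from the data itself.

```python
MASK32 = 0xFFFFFFFF  # RC5-32 works on 32-bit words


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits using two shifts and an OR."""
    x &= MASK32
    n &= 31  # only the low 5 bits of the count matter for 32-bit words
    # Bits shifted out on the left come back in on the right.
    return ((x << n) | (x >> (32 - n))) & MASK32


def rc5_half_round(a, b, s):
    """One RC5 half-round: A = ((A xor B) <<< B) + S, modulo 2**32.

    The rotate amount is the data word b, which is why a fast
    variable-count rotate matters so much for this cipher.
    """
    return (rotl32(a ^ b, b) + s) & MASK32


if __name__ == "__main__":
    print(hex(rotl32(0x12345678, 8)))
    print(hex(rc5_half_round(0, 1, 5)))
```

On a processor without a fast rotate, each call to rotl32 
stands for two shifts plus an OR instead of one instruction.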
To unsubscribe, send 'unsubscribe rc5' to majordomo at lists.distributed.net 
rc5-digest subscribers replace rc5 with rc5-digest
