[RC5] Some real info on the P4

Peter Cordes peter at llama.nslug.ns.ca
Mon May 14 15:02:14 EDT 2001

 There's been a lot of half-informed discussion on this list recently.  I'm
going to claim to know what I'm talking about, but I suggest that people
read some web pages, especially Silicon Insider (see URL below), rather than
believe everything I say.  (some of it's a bit fuzzy, esp. the bit about

On Thu, May 10, 2001 at 03:51:45PM +0100, Bjoern Martin wrote:
> >Hey I've got some sad sad sad statistics.
> >[...]
> >Hope somebody can explain this....
> Doesn't make much sense to run statistics now, as the P-IV has
> to be specially supported for - kind of at least. I only got 1
> cent for this one as I'm no hardware guru:
> The pipeline of the P-IV is pretty long (i think 12 clocks). This
> means, every instruction takes 12 clocks. That's not a problem,
> as when the first instruction moved on 1 clock, the second one
> can start.

 Right, this is what pipelining is all about.  You have multiple instructions
"in the pipe", you put one in every clock, and you take one out every clock.
If an instruction in the pipe depends on the results of an instruction that
hasn't got out of the pipe yet, you have to forward results between pipeline
stages.  Nothing new about that.  Note that, like most modern CPUs,
including the P6, k6-2, k7, UltraSPARC, Alpha >= 21264, etc., the Pentium 4
is a superscalar, out-of-order execution design.  It looks at the incoming
instructions, and figures out which ones can be run at the same time because
they don't depend on each other.
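
 To make "don't depend on each other" concrete, here is a minimal sketch in
C.  The function names are made up for illustration.  Both functions compute
the same sum, but the first is one long dependency chain, while the second
gives the CPU four independent chains to overlap.  Whether the second is
actually faster depends on the compiler and the CPU, so treat it as an
illustration of the idea, not a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the previous one, so the adds
   form a single serial chain no matter how wide the CPU is. */
uint32_t sum_chained(const uint32_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Four accumulators: the four adds in each iteration are independent of
   each other, so an out-of-order superscalar core is free to overlap them. */
uint32_t sum_split(const uint32_t *v, size_t n)
{
    uint32_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        a += v[i];
    return a + b + c + d;
}
```

 Unsigned arithmetic wraps around, so reordering the adds cannot change the
result.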

 BTW, the P4's pipeline is 20 stages long.  The P6's pipeline takes at least
12 clocks for instructions to traverse.  See
(and PageNum=4).

> So you might have a max of 12 instructions running the
> same time. BUT: if the software isn't aware of this (i.e.,
> compiled for older processors, as nearly 100% of today's software),
> this isn't an advantage, but quite the opposite.

 No, that's wrong.  The P6 core has a minimum issue->retire latency of 12
clocks, but it can have many more than 12 instructions in flight at once,
because it is superscalar (i.e. it handles more than one instruction per
clock).  Optimizing for the Pentium 4 is not a matter of arranging the
instructions so that 12 or 20 can be in flight at once.  The CPU takes care
of figuring out how to execute multiple instructions at once.
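
 A rough way to see why pipeline depth by itself costs so little: in an
idealized model (my simplification, with no stalls, cache misses, or branch
mispredicts), the depth is paid once to fill the pipe, and after that
throughput depends only on how many instructions start per clock.  The
function name and the width parameter below are hypothetical, just for the
arithmetic.

```c
/* Idealized cycle count for n independent instructions on a pipeline that is
   `depth` stages long and starts up to `width` instructions per clock.
   Assumes no stalls, cache misses, or mispredicted branches. */
unsigned long ideal_cycles(unsigned long depth, unsigned long n,
                           unsigned long width)
{
    if (n == 0 || width == 0)
        return 0;
    unsigned long issue_clocks = (n + width - 1) / width;  /* ceil(n/width) */
    return depth + issue_clocks - 1;
}
```

 In this model, ideal_cycles(12, 1000, 1) is 1011 and ideal_cycles(20, 1000, 1)
is 1019, a difference of under 1%.  The real cost of a long pipeline shows up
when it has to be refilled, e.g. after a mispredicted branch, which this
sketch deliberately ignores.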

 Optimization stuff is more like using instructions that the P4 is good at
handling, i.e. not bit-shifts or rotates.  For floating point, IIRC the P4
does a better job when you use SSE and SSE2, instead of the incredibly lame
x87 stack-based floating point, which is hard to implement and hard to use as
a compiler target.  It's one of _the_ biggest reasons why IA32 CPUs lose on
floating point relative to Alpha.  Another big reason is lack of registers,
preventing stuff like software pipelining.  AMD's k8 has the right idea:
introduce a whole new floating-point unit with lots of registers.

> Nowadays it's
> pretty normat that an Athlon running 1,2 GHz outruns a P-IV 1,4
> easily. This will improve in favor of the P-IV if you recompile
> with optimized compilers (Intel's words), but Dnet's cores are
> normally optimized in assembler. So just be a little patient.
> So what do the gurus say, is this guess true? :)

 Mostly.  The P4 is not good at shifts and rotates (the instructions for
those ops are a lot slower than on the P6 core), but RC5 heavily depends
on them.  Taking this into account, it's probably possible to write some
assembler that will run fast on the P4, and at least do better than the
current cores.  For general purpose stuff, Intel's compiler should help a lot.
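
 To show how heavily RC5 leans on rotates, here is a plain C sketch of the
RC5 encryption rounds with 32-bit words.  This is the textbook form of the
algorithm, not d.net's actual core, and the function names are my own.  Each
round does two rotates, and the rotate counts come from the data, so they
can't be turned into fixed shifts ahead of time.

```c
#include <stdint.h>

/* Rotate left by a variable count; only the low 5 bits of n are used. */
uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* RC5 encryption with 32-bit words.  S is the expanded key table with
   2 * (rounds + 1) entries; producing S from the key is not shown. */
void rc5_encrypt(const uint32_t *S, unsigned rounds,
                 const uint32_t pt[2], uint32_t ct[2])
{
    uint32_t A = pt[0] + S[0];
    uint32_t B = pt[1] + S[1];
    for (unsigned i = 1; i <= rounds; i++) {
        A = rotl32(A ^ B, B) + S[2 * i];      /* rotate count is data: B */
        B = rotl32(B ^ A, A) + S[2 * i + 1];  /* rotate count is data: A */
    }
    ct[0] = A;
    ct[1] = B;
}
```

 Compilers typically recognize the rotl32 idiom and emit a single rotate
instruction for it, so a slow rotate instruction hits every round twice.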

 If you want to know what you're talking about, read 
http://sandpile.org/, and read some of Paul DeMone's Silicon Insider columns:

 For a sensible evaluation of the P4, look here:

 As for the Pentium 4's thermal throttling, that doesn't make it run slower.
Under normal conditions (i.e. playing quake) the CPU spends a fair amount of
time waiting for memory access and stuff.  Someone on /. (sorry, can't find
the reference) pointed out that CPU designers always have to limit their
CPUs so running the absolute hottest sequence of instructions (which most
likely doesn't calculate anything useful!) doesn't burn out the CPU.  By
throttling it, they can make a faster CPU without making it possible to burn
it out.  Someone else pointed out that floating point instructions tend to
be hottest, because they make more transistors switch.  This is good news
for integer projects like all of d.net's stuff so far.  Tight loops that
don't touch memory, like d.net's rc5 core, can make CPUs run pretty hot
(esp. on the G4+AltiVec, I've heard).  I guess it's possible that d.net might
make your CPU throttle down unless you have some serious cooling, but for any
other use, the throttling is only a good thing.  Read

 As for whether it's worth buying or not, I personally wouldn't buy one
unless I had a specific application in mind, and the P4 ran it quickly.
I've got nothing against Intel (except that they let their marketing
department order around their engineers so they sometimes end up releasing
crap), they just aren't making the best stuff for the best prices right now.

#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at llama.nslug. , ns.ca)

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BCE