[HARDWARE] Overclocking CPUs (more long rant)

taniwha taniwha at taniwha.com
Thu Sep 23 02:02:22 EDT 1999


>Robert Norton said:
>> I said:
>> What I'm trying to get at here is that what will fail first when
>> you overclock is unknown (from your point of view). It may be a path
>> that you don't use much - maybe you can run quake, rc5 and compile
>> the Linux kernel over and over until you are blue in the face
>> without any problems .... but a speed path in the multiplier will
>> mean that it silently screws up your taxes ....

>Just for general info, if you crank a chip up until it just runs on the edge, how
>far back do you have to take it (as a guess) on the average to get stable
>operation?  Like 10% slower?

Well, I think that the point of my rant was that you 
just can't tell - NO one really knows - we're dealing with
something so complex that it's not possible to tell. 
Us designers are designing to a statistical model of the 
process. The back-end guys who do burn-in and characterisation
are sampling stuff from the real process (they get wafers
which have been pushed into the various process corners 
so they can skew their stats to get information in the 
interesting places on the bell curves - for them what 
happens in the middle is boring - they want to find out 
how stuff fails at the extremes).

The problem is that there are so many ways a chip can 
fail that you don't know whether you're hitting all of them 
(and you can't try all of a modern CPU's state transitions
in any reasonable time - like the lifetime of the universe).
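
[To put a rough number on the "lifetime of the universe" remark, here is a minimal back-of-envelope sketch. The state-bit counts, the rate of one state per nanosecond, and the age-of-universe figure are all illustrative assumptions, not data for any real CPU, and it only counts states, not the far larger number of transitions between them.]

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # rough, commonly quoted estimate (assumption)


def years_to_visit_all_states(state_bits, states_per_second):
    """Years needed to visit each of 2**state_bits states exactly once."""
    return (2 ** state_bits) / states_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical state sizes; a real CPU has many thousands of state bits.
    for bits in (32, 64, 128, 256):
        years = years_to_visit_all_states(bits, 1e9)  # assume 1e9 states/s
        print(f"{bits:4d} state bits: {years:.3g} years "
              f"({years / AGE_OF_UNIVERSE_YEARS:.3g} x age of universe)")
```

[Even 128 bits of state is hopeless at that rate, which is why exhaustive testing is not an option.]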

So whatever speed you test it at, there may be some path 
or combination of paths that will fail - you just don't 
know, because you can't know that you've really 'tested the
system'. 10% less will always be better (assuming you have no
dynamic nodes, which can fail when clocked too slowly).

[another aside while we're talking about how you know
whether a CPU or system will work]

Finally - there's metastability - which basically says that
when you move a signal from one clock domain to another, sometimes
the flops miss it or end up in a halfway, oscillating state.
You can't avoid metastability - you just have to mitigate
it. You do the math about how good the flops are and
what the clock ratios are, and you get an MTBF (mean time
between failures) per path. Failure rates add, so you divide by
the number of metastable paths in your design to get the MTBF
for the whole chip. Then you start adding more flops (and latency,
which often makes your performance crap out) until the whole-chip
MTBF is high enough (years) that no one's ever going to care.
I build display controllers. Their dot-clock speeds are
programmable and they are never a multiple of the memory clock
or a PCI clock. No one cares if a line of pixels on the screen
goes weird on 1 frame a year ..... or even if you need to
reboot the display controller once every century ..... but
if you're building life support systems ......
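
[A minimal sketch of the kind of MTBF math described above. It assumes the common textbook exponential model for a flop synchronizer, MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data), which may not be exactly the model any particular design team uses. It also assumes every crossing path is identical and independent, so failure rates add and the whole-chip MTBF is the per-path MTBF divided by the path count. The function names and every number in the example are hypothetical.]

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def chip_mtbf_years(stages, n_paths, tau, t_window, f_clk, f_data):
    """Whole-chip MTBF in years for n_paths identical synchronizers.

    stages   : flops in each synchronizer chain (assumed model: each flop
               after the first buys one more clock period to resolve)
    tau      : flop resolution time constant, seconds (assumed value)
    t_window : metastability capture window, seconds (assumed value)
    f_clk    : receiving clock frequency, Hz
    f_data   : rate of asynchronous data transitions, Hz
    """
    t_resolve = (stages - 1) / f_clk
    # Work in logs so the exponential cannot overflow a float.
    log_path = t_resolve / tau - math.log(t_window * f_clk * f_data)
    log_chip = log_path - math.log(n_paths)  # rates add, so MTBF divides
    if log_chip > 700:
        return math.inf
    return math.exp(log_chip) / SECONDS_PER_YEAR


def stages_needed(target_years, n_paths, tau, t_window, f_clk, f_data,
                  max_stages=8):
    """Smallest flop count per chain that meets the target, else None."""
    for stages in range(1, max_stages + 1):
        if chip_mtbf_years(stages, n_paths, tau, t_window,
                           f_clk, f_data) >= target_years:
            return stages
    return None


if __name__ == "__main__":
    # Purely illustrative numbers, not taken from any real part.
    params = dict(n_paths=50, tau=0.2e-9, t_window=0.1e-9,
                  f_clk=100e6, f_data=10e6)
    print("flops per synchronizer:", stages_needed(100.0, **params))
```

[Each extra flop costs a clock of latency, which is the performance hit mentioned above. The exponential term is also why shortening the clock period, i.e. overclocking the receiving side, eats into the margin so quickly.]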

Anyway, metastability is one of those dirty secrets no one wants
to talk about or deal with .... I've had to help fix 
the bugs when someone didn't do the math and things started 
failing every hour or so :-( BTW I'm way off topic here:
most CPUs aren't going to suffer from metastability 
because their internal clocks are all running synchronously 
wrt each other. On the other hand, if you speed up the PCI
or the memory interface then chances are there will be some 
cross-clock synchronisation, and all that nice MTBF math goes
out the window.

Having said all this - I design chips - they work in the 
field - it IS actually possible to build chips reliable (enough) 
that customers are happy.

	Paul Campbell aka Taniwha