[RC5] Use of FPU on Intel 486 and P5 processors

Bruce Ford b.ford at qut.edu.au
Wed Jan 7 11:41:12 EST 1998

WARNING: Highly technical detail follows

In answer to Seth and Eric,

Seth first:

> Was this tested under a Multi-Tasking/Multi-Threading operating
> system?

Good old single tasking DOS 6.22 using WASM 10.5 assembler and
DOS4GW v1.97 extender.  

On the P5 the clock count was calculated using the RDTSC (0Fh 31h) 
instruction.  On the 486 it was calculated by doing 16 million loops 
and inferring clocks from the time difference on a 66MHz system 
(done multiple times and the minimum clock count taken).
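
The 486 figure comes down to simple arithmetic: clocks per iteration =
elapsed time * clock rate / loop count, taking the minimum over several
runs to discard outside interference.  A minimal sketch of that
calculation follows, in Python rather than the original assembler.  The
function names, the exact 66,000,000 Hz and 16,000,000 loop figures,
and the sample timings are all illustrative assumptions, not values
from the actual test harness.

```python
def clocks_per_iteration(elapsed_seconds, loops, cpu_hz):
    """Infer average clocks per loop iteration from wall-clock time."""
    return elapsed_seconds * cpu_hz / loops


def best_estimate(run_times, loops=16_000_000, cpu_hz=66_000_000):
    """Take the minimum over several runs; slower runs are assumed to
    include overhead that is not part of the loop itself."""
    return min(clocks_per_iteration(t, loops, cpu_hz) for t in run_times)


# Made-up timings in seconds, NOT measurements from the post.
runs = [488.1, 487.6, 487.9]
print(round(best_estimate(runs)))
```

The minimum is used rather than the mean because timing noise on a
single tasking system can only add clocks, never remove them.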

Now Eric:

>     5.6.1 Using Integer Instructions to Hide Latencies of
>     Floating-Point Instructions

Sometimes you just can't believe the manual, Eric; you have to get 
your hands dirty and test it!  Makes me wonder if Intel ever tested 

After doing some optimization on the P5 and 486 cores (available at 
http://wwwperso.hol.fr/~guyom001/ ) I tested the possibility of 
parallel execution as follows.

On subroutine entry:


    fsave   fpu_state
    fstcw   word ptr cw                     ; Set rounding control
    or      word ptr cw, 0000111100100000B  ; to truncate
    fldcw   word ptr cw




    ; 1399 clocks on the P5, 2011 clocks on 486 in the inner loop

    ; After the last rotate of the second key an fscale was added

    rol esi, cl   ; Takes 4 clocks on a P5, 3 clocks on a 486

    fscale        ; Minimum of 20 clocks on a P5, 30 clocks on a 486

    add esi, ebx


    dec _LOOPS
    jz   exit

    ...           ; Increment the key
    jnc  next_key

On subroutine exit:

    frstor  fpu_state

As you can see, there is no memory access in the inner loop.  The 
fscale instruction has the entire key loop to execute and is the only 
FPU instruction in the inner loop.  No exceptions occur, as the fscale 
performs the equivalent of 1.0*2^0, places the result (1.0) back on 
the top of the stack, and leaves the 0 in ST(1).

fscale is important as it is used in the FPU code to perform the 
equivalent of the "rol reg, cl" after multiplying by 1+2^-32.
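
One way to read the arithmetic behind that trick is sketched below:
multiplying a 32-bit value x by 1+2^-32 places a second copy of x in
the fraction bits, scaling by 2^n (the fscale step) shifts both copies
left, and truncating to an integer and keeping the low 32 bits leaves
x rotated left by n.  This is a reconstruction using exact rationals
in Python, not the FPU code itself; the function names and the final
reduction modulo 2^32 are assumptions.  On real hardware the
intermediate value has up to 64 significant bits, so it would need the
full extended-precision mantissa.

```python
from fractions import Fraction


def rol32(x, n):
    """Reference 32-bit rotate left, as done by 'rol reg, cl'."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def rol32_via_scale(x, n):
    """Rotate modelled as multiply, scale, truncate (exact arithmetic)."""
    widened = x * (1 + Fraction(1, 2**32))  # x + x*2^-32
    scaled = widened * 2**n                 # the role played by fscale
    return int(scaled) % 2**32              # truncate, keep low 32 bits


for x in (0x00000001, 0x80000001, 0xDEADBEEF, 0xFFFFFFFF):
    for n in range(32):
        assert rol32_via_scale(x, n) == rol32(x, n)
print("multiply-and-scale matches rol for all tested inputs")
```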

The fscale was added after the rol instruction as the integer
pipeline is already stalled at that point and I was hopeful that 
the decoder would submit the instruction to the FPU and no 
clocks would be wasted. I have also tried placing the fscale in 
other locations.  I checked the executable code (in hex) to confirm 
that there was no wait/fwait (9Bh) before the fscale and that the 
fscale was inline and not emulated.

After this change the inner loop took 1419 clocks (+20) on a P5 and 
2041 clocks (+30) on a 486.

I also tested an "fstp st(1)" instruction and got +1 and +3 clocks 
for the P5 and 486 respectively.  These happen to be the clock counts 
of these instructions.  From this I concluded that on a P5 and 486 the 
FPU stalls the integer pipeline.
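
The reasoning can be laid out as a small calculation: if any part of
the fscale ran in parallel with the integer code, the loop would grow
by fewer clocks than the instruction's own latency.  The sketch below
uses only the clock counts quoted in this post; the dictionary layout
and the function name are illustrative.

```python
# Inner loop clock counts quoted in the post.
baseline = {"P5": 1399, "486": 2011}
with_fscale = {"P5": 1419, "486": 2041}
# Minimum fscale latency quoted in the post.
fscale_latency = {"P5": 20, "486": 30}


def hidden_clocks(cpu):
    """Clocks of fscale latency overlapped with integer work."""
    added = with_fscale[cpu] - baseline[cpu]
    return fscale_latency[cpu] - added


for cpu in ("P5", "486"):
    print(cpu, "hidden clocks:", hidden_clocks(cpu))
```

A result of zero on both chips means the whole latency showed up in
the loop time, which is what a full stall looks like.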

To do 1 cycle of round 1 of the key expansion takes 14 clocks on a P5 
and 22 clocks on a 486.  A stall of 20 and 30 clocks therefore makes 
the FPU useless for this exercise.  As the FPU takes 34 steps to do 
one cycle, a stall of even 1 clock per step makes parallel execution 
no longer viable.
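
A rough version of that budget, using the figures above: 34 FPU steps 
at 1 clock of stall each already cost more than doing the whole cycle 
in integer code.  This is one reading of the argument, not a 
measurement; the names are illustrative.

```python
# Clocks for 1 cycle of round 1 of the key expansion in integer code.
integer_cycle_clocks = {"P5": 14, "486": 22}
# Number of FPU steps needed for the same cycle.
FPU_STEPS = 34


def stall_cost(stall_per_step):
    """Integer pipeline clocks lost if every FPU step stalls it."""
    return FPU_STEPS * stall_per_step


for cpu, clocks in integer_cycle_clocks.items():
    lost = stall_cost(1)
    print(cpu, "integer:", clocks, "stall at 1 clock/step:", lost,
          "viable:", lost < clocks)
```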

As I stated earlier, this information is specific to the 486 and P5; 
later Intel chips and chips from other manufacturers may not have this 
restriction.  I don't have access to those chips, but if you do, I 
suggest you test it rather than believe the manual.

I will see if Remi will put the FPU instructions that do the key 
expansion up on his web site.  The code is incomplete (and at present 
useless) but does do 1 cycle of round 1 and may provide inspiration 
for others in some way.

If there is something I have overlooked in the above I am sure you 
will feel free to comment :)

P.S. The rounding control on subroutine entry is for completeness.  
It was in the code, though it performs no useful function in this 
instance.  It is required for the full key expansion as floating 
point numbers must be truncated to integers at various points.
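
For reference, the mask OR-ed into the control word can be decoded 
with the standard x87 control word layout (rounding control in bits 
10-11, precision control in bits 8-9, precision exception mask in bit 
5).  The sketch below is only a decoder for that constant; the field 
names are illustrative.  It suggests the mask also selects extended 
precision and masks the precision exception, in addition to the 
truncation named in the comment.

```python
MASK = 0b0000111100100000  # constant from the subroutine entry code


def decode_control_bits(mask):
    """Split an x87 control word value into the fields touched here."""
    return {
        "rounding_control": (mask >> 10) & 0b11,   # 0b11 = truncate
        "precision_control": (mask >> 8) & 0b11,   # 0b11 = extended
        "precision_exception_masked": bool(mask & (1 << 5)),
    }


print(decode_control_bits(MASK))
```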

Bruce Ford