[RC5] Use of FPU on Intel 486 and P5 processors
b.ford at qut.edu.au
Wed Jan 7 11:41:12 EST 1998
WARNING: Highly technical detail follows
In answer to Seth and Eric,
> Was this tested under a Multi-Tasking/Multi-Threading operating
Good old single tasking DOS 6.22 using WASM 10.5 assembler and
DOS4GW v1.97 extender.
On the P5 the clock count was calculated using the RDTSC (0Fh 31h)
instruction. On the 486 it was calculated by doing 16 million loops
and inferring clocks from the time difference on a 66MHz system
(done multiple times and the minimum clock count taken).
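The 486 figure is just arithmetic on wall-clock time, and a minimal sketch of it might look like the following. Python is used here purely for illustration; the function name and the sample timings are hypothetical and are not the measurements from this post. Only the 66 MHz clock and the 16 million loops come from the text above.

```python
# Hypothetical helper: infer clocks per loop iteration from wall-clock time.
CPU_HZ = 66_000_000   # 66 MHz 486, as stated above
LOOPS = 16_000_000    # loop count, as stated above


def clocks_per_loop(elapsed_seconds, cpu_hz=CPU_HZ, loops=LOOPS):
    """Return the minimum clocks-per-iteration estimate over several runs.

    Taking the minimum discards runs that were slowed by outside
    interference, on the assumption that nothing can make a run faster
    than the true cost of the loop.
    """
    return min(seconds * cpu_hz / loops for seconds in elapsed_seconds)


# Made-up timings in seconds, for illustration only.
example_runs = [488.0, 487.6, 487.52]
print("estimated clocks per loop:", clocks_per_loop(example_runs))
```

The resolution of this method depends on the timer used, which is why the loop count has to be large.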
> 5.6.1 Using Integer Instructions to Hide Latencies of
> Floating-Point Instructions
Sometimes you just can't believe the manual Eric, you have to get
your hands dirty and test it! Makes me wonder if Intel ever tested
After doing some optimization on the P5 and 486 cores (available at
http://wwwperso.hol.fr/~guyom001/ ) I tested the possibility of
parallel execution as follows.
On subroutine entry:
fstcw word ptr cw ; Set rounding control
or word ptr cw, 0000111100100000B ; to truncate
fldcw word ptr cw
; 1399 clocks on the P5, 2011 clocks on 486 in the inner loop
; After the last rotate of the second key an fscale was added
rol esi, cl ; Takes 4 clocks on a P5, 3 clocks on a 486
fscale ; Minimum of 20 clocks on a P5, 30 clocks on a 486
add esi, ebx
... ; Increment the key
On subroutine exit:
As you can see, there is no memory access in the inner loop; the
fscale command has the entire key loop to execute and is the only FPU
command in the inner loop. No exceptions occur as the fscale command
performs the equivalent of 1.0*2^0 and places the result (1.0) back
on the top of the stack and leaves the 0 in ST(1).
fscale is important as it is used in the FPU code to perform the
equivalent of the "rol reg, cl" after multiplying by 1+2^-32.
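For readers wondering why a multiply followed by a scale can stand in for a rotate, here is one possible reading of the trick, modelled with integers. This is a sketch under assumptions, not the actual FPU code (which is not shown in this post): it assumes the value is a 32-bit unsigned integer, that the 64-bit significand of the x87 extended format holds the product exactly, and that the low 32 bits of the truncated result are then extracted somehow. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, n):
    """Reference 32-bit rotate left, with n reduced to 0..31."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def rol32_via_scale(x, n):
    """Rotate modelled as: multiply by (1 + 2^-32), scale by 2^n, truncate."""
    n &= 31
    # x * (1 + 2^-32) held as 32.32 fixed point: x now appears in both
    # the integer half and the fraction half of a 64-bit value.
    doubled = x * ((1 << 32) + 1)
    # Scaling by 2^n and truncating to an integer is the same as
    # shifting the fixed-point value right by 32 - n places.
    scaled = doubled >> (32 - n)
    # The low 32 bits of the integer part are the rotated value.
    return scaled & MASK32


# Spot check against the reference rotate.
for value in (0x00000001, 0x80000001, 0xDEADBEEF, 0xFFFFFFFF):
    for count in range(32):
        assert rol32_via_scale(value, count) == rol32(value, count)
```

The bits shifted out of the top of the integer half reappear from the fraction half, which is what makes the scale behave like a rotate rather than a plain shift.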
The fscale was added after the rol instruction as the integer
pipeline is already stalled at that point and I was hopeful that
the decoder would submit the instruction to the FPU and no
clocks would be wasted. I have also tried placing the fscale in
other locations. I checked the executable code (in hex) to confirm
that there was no wait/fwait (9Bh) before the fscale and that the
fscale was inline and not emulated.
After this change the inner loop took 1419 clocks (+20) on a P5 and
2041 clocks (+30) on a 486.
I also tested an "fstp st(1)" instruction and got +1 and +3 clocks
for the P5 and 486 respectively. These happen to be the clock counts
of these instructions. From this I concluded that on a P5 and 486 the
FPU stalls the integer pipeline.
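The reasoning above can be written out as a small calculation: if any part of an FPU instruction ran in parallel with the integer code, the loop would grow by less than the instruction's own clock count. The sketch below uses only the deltas and clock counts quoted in this post (the fscale counts are the quoted minimums); the function name and the table layout are my own.

```python
def hidden_clocks(added_clocks, instruction_clocks):
    """Clocks of the FPU instruction that overlapped with integer work.

    Zero means the whole instruction was paid for in the integer
    pipeline, i.e. nothing was hidden.
    """
    return instruction_clocks - added_clocks


# (clocks added to the inner loop, clock count of the instruction)
measurements = {
    "P5 fscale": (1419 - 1399, 20),
    "486 fscale": (2041 - 2011, 30),
    "P5 fstp st(1)": (1, 1),
    "486 fstp st(1)": (3, 3),
}

for name, (added, latency) in measurements.items():
    print(name, "hidden clocks:", hidden_clocks(added, latency))
```

Every case gives zero hidden clocks, which is what a full stall of the integer pipeline would look like.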
To do 1 cycle of round 1 of the key expansion takes 14 clocks on a P5
and 22 clocks on a 486. A stall of 20 and 30 clocks therefore makes
the FPU useless for this exercise. As the FPU takes 34 steps to do
one cycle, a stall of even 1 clock per step makes parallel execution
no longer viable.
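To put numbers on that last point, as I read it: a 1 clock stall on each of the 34 FPU steps already costs more than the whole integer version of the cycle. A minimal sketch using the figures above (the names are hypothetical):

```python
FPU_STEPS_PER_CYCLE = 34               # FPU steps for one cycle, from above
INTEGER_CLOCKS = {"P5": 14, "486": 22}  # integer clocks for one cycle, from above


def stall_cost(stall_per_step, steps=FPU_STEPS_PER_CYCLE):
    """Total clocks lost to integer pipeline stalls over one FPU cycle."""
    return stall_per_step * steps


for cpu, integer_clocks in INTEGER_CLOCKS.items():
    cost = stall_cost(1)
    print(cpu, "stall cost:", cost, "integer cycle:", integer_clocks,
          "stall alone exceeds integer cycle:", cost > integer_clocks)
```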
As I stated earlier, this information is 486 and P5 specific. Later
Intel chips and chips from other manufacturers may not have this
restriction. I don't have access to these chips, but if you do, I
suggest that you test it rather than believe the manual.
I will see if Remi will put the FPU instructions that do the key
expansion up on his web site. It is incomplete (and at present
useless) but does do 1 cycle of round 1 and may provide inspiration
for others in some way.
If there is something I have overlooked in the above I am sure you
will feel free to comment :)
P.S. The rounding control on subroutine entry is for completeness.
It was in the code, though it performs no useful function in this
instance. It is required for the full key expansion as floating
point numbers must be truncated to integers at various points.