[RC5] Use of FPU on Intel 486 and P5 processors

gindrup at okway.okstate.edu gindrup at okway.okstate.edu
Wed Jan 7 13:52:01 EST 1998

     page 47, Table 3.1, "shift/rot by 1" and "shift imm" (immediate 
     argument) are pairable if they are in the U-pipe.
        Unpairable Instructions (NP)
            1. Shift or rotate instructions with the shift count in the CL 
     register.
            2. Long arithmetic instructions, for example: MUL, DIV.
            3. Extended instructions, for example: RET, ENTER, PUSHA, MOVS, 
            4. Some floating-point instructions, for example: FSCALE, 
     FLDCW, FST.
            5. Inter-segment instructions, for example: PUSH sreg, CALL 
        Pairable Instructions Issued to U or V Pipes (UV)
            1. Most 8/32 bit ALU operations, for example: ADD, INC, XOR.
            2. All 8/32 bit compare instructions, for example: CMP, TEST.
            3. All 8/32 bit stack operations using registers, for example: 
     PUSH reg, POP reg.
        Pairable Instructions Issued to U Pipe (PU)
     These instructions must be issued to the U-pipe and can pair with a 
     suitable instruction in the V-Pipe. These instructions never execute 
     in the V-pipe.
            1. Carry and borrow instructions, for example: ADC, SBB.
            2. Prefixed instructions (see next section).
            3. Shift with immediate.
            4. Some floating-point operations, for example: FADD, FMUL, 
        Pairable Instructions Issued to V Pipe (PV)
     These instructions can execute in either the U-pipe or the V-pipe but 
     they are only paired when they are in the V-pipe. Since these 
     instructions change the instruction pointer (eip), they cannot pair in 
     the U-pipe since the next instruction may not be adjacent. Even when a
     branch in the U-pipe is predicted to be not taken, it will not pair 
     with the following instruction.
            1. Simple control transfer instructions, for example: call 
     near, jmp near, jcc. This includes both the jcc short and the jcc near 
     (which has a 0f prefix) versions of the conditional jump instructions.
            2. The fxch instruction.
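
A rough Python sketch of the class-level check implied by the four
categories quoted above. The table `PAIRING_CLASS`, its mnemonic keys
and the function `can_pair` are invented for this illustration; the
class assignments are taken from the integer examples in the quoted
text. It only looks at the pairing class, and deliberately ignores
register dependencies, instruction length, and the floating-point
cases.

```python
# Pairing classes as named in the quoted manual section:
#   NP = not pairable
#   UV = pairable in either pipe
#   PU = pairable only when issued to the U-pipe
#   PV = pairable only when issued to the V-pipe
# Keys are hypothetical labels for this sketch, not assembler syntax.
PAIRING_CLASS = {
    "mul": "NP", "div": "NP", "shl cl": "NP",
    "add": "UV", "inc": "UV", "xor": "UV", "cmp": "UV", "test": "UV",
    "adc": "PU", "sbb": "PU", "shl imm": "PU",
    "jmp near": "PV", "call near": "PV", "jcc": "PV",
}

def can_pair(u_insn, v_insn):
    """Class-level check only: True if u_insn may occupy the U-pipe
    while v_insn occupies the V-pipe in the same issue slot."""
    u_class = PAIRING_CLASS[u_insn]
    v_class = PAIRING_CLASS[v_insn]
    return u_class in ("UV", "PU") and v_class in ("UV", "PV")
```

For example, `can_pair("adc", "add")` is true while
`can_pair("add", "adc")` is false, since ADC can only pair from the
U-pipe.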
     My guess for the latencies that are observed is still the locality of 
     FPU and integer references; the relevant section is:
        There are some pairs that may be issued simultaneously but will not 
     execute in parallel:
        1. If both instructions access the same data-cache memory bank then 
     the second request (V-pipe) must wait for the first request to 
     complete. A bank conflict occurs when bits 2 through 4 are the same in 
     the two physical addresses. A bank conflict incurs a one clock
     penalty on the V-pipe instruction.
        2. Inter-pipe concurrency in execution preserves memory-access 
     ordering. A multi-cycle instruction in the U-pipe will execute alone 
     until its last memory access.
            add eax, mem1
            add ebx, mem2 ; 1
            (add) (add) ; 2 2-cycle
     The instructions above add the contents of the register and the value 
     at the memory location, then put the result in the register. An add 
     with a memory operand takes two clocks to execute. The first clock 
     loads the value from cache, and the second clock performs the 
     addition. Since there is only one memory access in the U-pipe 
     instruction, the add in the V-pipe can start in the same cycle.
            add mem1, eax ; 1
            (add) ; 2
            (add) add mem2, ebx ; 3
            (add) ; 4
            (add) ; 5
     The instructions above add the contents of the register to the memory 
     location and store the result at the memory location. An add with a 
     memory result takes three clocks to execute.  The first clock loads 
     the value, the second performs the addition, and the third stores the
     result. When paired, the last cycle of the U-pipe instruction overlaps 
     with the first cycle of the V-pipe instruction execution.
        No other instructions can begin execution until the instructions 
     already executing have completed.
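
The two rules quoted above can be put into a small Python model. This
is only a sketch under simplifying assumptions: the function names and
parameters are made up for this example, `bank_conflict` just compares
bits 2 through 4 of two addresses, and `pair_total_clocks` assumes the
V-pipe instruction starts in the clock of the U-pipe instruction's
last memory access and does not add the bank-conflict penalty.

```python
def bank_conflict(addr_u, addr_v):
    """True if bits 2 through 4 are the same in both physical addresses."""
    return (((addr_u ^ addr_v) >> 2) & 0b111) == 0

def pair_total_clocks(u_clocks, u_last_mem_clock, v_clocks):
    """Clocks needed for a U/V pair under the simplified model.

    u_clocks         -- clocks taken by the U-pipe instruction
    u_last_mem_clock -- clock (1-based) of its last memory access
    v_clocks         -- clocks taken by the V-pipe instruction
    """
    v_start = u_last_mem_clock          # V-pipe waits for this clock
    v_end = v_start + v_clocks - 1      # last clock used by the V-pipe
    return max(u_clocks, v_end)
```

With the figures from the quoted examples, `pair_total_clocks(2, 1, 2)`
gives 2 for the two `add reg, mem` instructions and
`pair_total_clocks(3, 3, 3)` gives 5 for the two `add mem, reg`
instructions.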
        To expose the opportunities for scheduling and pairing, it is 
     better to issue a sequence of simple instructions rather than a 
     complex instruction that takes the same number of cycles.  The simple 
     instruction sequence can take advantage of more issue slots. The 
     load/store style of code generation requires more registers and 
     increases code size. This impacts Intel486 processor performance, 
     although only as a second-order effect. To compensate for the extra
     registers needed, extra effort should be put into register allocation 
     and instruction scheduling so that extra registers are used only when 
     parallelism increases.
     3.6.3 General Pairing Rules
        Pentium processors with MMX technology have relaxed some of the 
     general pairing rules:
     *  Pentium processors do not pair two instructions if either of them 
     is longer than seven bytes. Pentium processors with MMX technology do 
     not pair two instructions if the first instruction is longer than 
     eleven bytes or the second instruction is longer than seven bytes. 
     Prefixes are not counted.
     *  On Pentium processors, prefixed instructions are pairable only in 
     the U-pipe. On Pentium processors with MMX technology, instructions 
     with 0Fh, 66H or 67H prefixes are also pairable in the V-pipe.
        In both of the above cases, stalls at the entrance to the FIFO, on 
     Pentium processors with MMX technology, will prevent pairing.
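
The length rule in the first bullet above can be written out as a
short Python check. The function name and the `mmx` flag are
hypothetical and exist only for this sketch; lengths are in bytes with
prefixes not counted, as the quoted text says, and nothing else about
pairing is modelled.

```python
def length_allows_pairing(first_len, second_len, mmx=False):
    """Apply only the instruction-length part of the pairing rules.

    first_len, second_len -- instruction lengths in bytes, no prefixes
    mmx -- True for a Pentium processor with MMX technology
    """
    if mmx:
        # Relaxed rule: the first instruction may be up to 11 bytes.
        return first_len <= 11 and second_len <= 7
    # Plain Pentium: neither instruction may exceed 7 bytes.
    return first_len <= 7 and second_len <= 7
```

So an 8-byte instruction followed by a 3-byte one fails the check on a
plain Pentium but passes it with `mmx=True`.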
     The subsequent section describes pairing in the PPro and PII.  I don't 
     find *anything* in the 486 or Pentium documentation that indicates 
     that stalling on FPU instructions is correct behaviour.
     The most likely way to fix the problem I'm expecting is to make sure 
     that the counter for the current key being handled by the integer unit 
     is in a different cache line from the line containing the counter for 
     the FPU.  If these two counters are in the same cache line, there will 
     be considerable slowdown because of the alternating data size memory 
     accesses.
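
To check whether two counters would land in the same cache line, a
calculation along these lines could be used. This is a sketch in
Python rather than in the client's own code; the names are made up,
and the 32-byte line size is an assumption for the Pentium (the 486
uses 16-byte lines, so pass `line_bytes=16` there).

```python
CACHE_LINE_BYTES = 32  # assumption: Pentium data-cache line size

def same_cache_line(addr_a, addr_b, line_bytes=CACHE_LINE_BYTES):
    """True if both addresses fall inside the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes

def next_line_start(addr, line_bytes=CACHE_LINE_BYTES):
    """Lowest address above addr that begins a new cache line."""
    return (addr // line_bytes + 1) * line_bytes
```

Placing the second counter at `next_line_start(first_counter_addr)` or
later guarantees that `same_cache_line` is false for the pair.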
            -- Eric Gindrup ! gindrup at Okway.okstate.edu
     ______________________________ Reply Separator 
     Subject: Re: [RC5] Use of FPU on Intel 486 and P5 processors  
     Author:  <rc5 at llamas.net > at SMTP
     Date:    1/6/98 3:30 PM
     I had a look at the feasibility of using the FPU to do keys in 
     parallel with the regular integer pipelines on x86 processors, 
     looking specifically at the Cyrix 6x86, since that is the processor I 
     have.  The biggest problem that I could see is that the FPU 
     instructions still have to go through the integer pipelines for 
     decoding, and the instruction prefetch unit is capable of delivering 
     two instructions per cycle, one to each pipeline, and FPU instructions 
     can only go to the X pipeline.  Almost all of the instructions inside 
     the key cycle loop are rotls, adds, and xors, and with the exception 
     of rotl by register, all are only 1 cycle in length.  rotl by register 
     is a 2 cycle instruction.  Those instructions that make memory 
     accesses *might* add additional cycles, but most involve registers.  
     Using the FPU is no win unless the FPU instructions can be scheduled 
     behind a multi-cycle instruction that is known to be going to the X 
     pipeline.  I don't have the pentium data book, but I have been told 
     that the pentium rotl requires both pipelines, so sticking FPU 
     instructions behind the pentium rotl's might solve the integer stall 
     problem.  But if you're sticking an FPU instruction into 
     a pipeline behind a one cycle instruction, the pipeline will stall 
     because the next instruction it was expecting to execute has been sent 
     to the FPU, and the execution unit will have to wait an additional 
     clock cycle for another instruction to arrive.
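
For readers who have not looked at the loop being discussed, here is a
Python sketch of one mixing step of the published RC5 key schedule
(32-bit words). It is not the client's actual core; the function and
variable names are chosen for this example. It shows where the
fixed-count rotate (by 3) and the rotate by a data-dependent count
(the "rotl by register" case) come from.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate the 32-bit value x left by n bits (n taken modulo 32)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def mix_step(S, L, i, j, A, B):
    """One key-schedule mixing step; updates S[i] and L[j] in place
    and returns the new (A, B)."""
    # Fixed rotate count: maps to a rotate-by-immediate instruction.
    A = S[i] = rotl32((S[i] + A + B) & MASK32, 3)
    # Data-dependent count: maps to a rotate by the CL register.
    B = L[j] = rotl32((L[j] + A + B) & MASK32, A + B)
    return A, B
```

Each step is two adds feeding a rotate, twice over, which is why the
loop is dominated by single-cycle integer instructions plus the
variable rotate.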
On Tue, 6 Jan 1998 gindrup at okway.okstate.edu wrote:
>      Although it does not entirely resolve this issue, 
>      ftp://download.intel.com/design/mmx/manuals/24281603.PDF 
>      has:
>      5.6.1 Using Integer Instructions to Hide Latencies of Floating-Point 
>      Instructions
>         When a floating-point instruction depends on the result of the 
>      immediately preceding instruction, and it is also a floating-point 
>      instruction, it is advantageous to move integer instructions between 
>      the two FP instructions, even if the integer instructions perform loop 
>      control. The following example restructures a loop in this manner:
>      From reading around in that document, my best guess for the cause of 
>      the stalls you're seeing is that the integer instructions are 
>      referencing the same "piece" of memory as the FP instructions, and 
>      there are significant penalties for quickly changing the access width 
>      of memory operations.
>      I'd be curious to know how much of the FP stack you're using since it 
>      might be possible to schedule more than one key in the FPU with 
>      staggered execution (to hide the latency of the other FPU key). 
>      I'll be looking in my '486 manual tonight to see if the FPU is 
>      supposed to stall the integer unit on that processor.  (Although it 
>      might be throwing "Not an Instruction" exceptions on the '486).
>             -- Eric Gindrup ! gindrup at Okway.okstate.edu 
> ______________________________ Reply Separator 
> Subject: [RC5] Use of FPU on Intel 486 and P5 processors 
> Author:  <rc5 at llamas.net > at SMTP
> Date:    1/6/98 10:48 AM
> There has been some comment on the possibility of using the floating point 
> (FPU) on Intel processors in parallel with the integer unit to help process 
> keys.
> I have tested this idea on a 486 and a P5 and found that instructions to the 
> FPU stall the integer pipeline for the duration of the FPU instruction.  
> It may be different on other chips, or even later Intel chips, but I don't 
> have access to those.  If after testing you discover that the FPU does execute in 
> parallel without stalling the integer pipeline, I have available a 34 step FPU
> sequence that will do one cycle of 
> round 1 of the key expansion.  Foolishly, though partly as an intellectual 
> exercise, I developed this before testing the pipeline stalling.
> As MMX instructions use the FPU registers I suspect that they too will stall 
> the integer pipeline, but this should really be tested. 
> Just to add some facts to this discussion. 
> Bruce Ford                                      b.ford at qut.edu.au
> Systems Programmer
> Teaching and Learning Support Services          Ph: +61 7 3864 3383
> Queensland University of Technology
> --
> To unsubcribe, send 'unsubscribe rc5' to majordomo at llamas.net rc5-digest 
> subscribers replace rc5 with rc5-digest