# [RC5] newbie question

Slawek sgp at telsatgp.com.pl
Fri Aug 29 21:12:04 EDT 2003

```
Elektron wrote:

>> Writing to L1 cache and reading from it at least on P4M
>> is 1 clock cycle each. Doing XORs is 1 cycle as well.
>> (or was it even 0.5 cycle? I can't remember).
>
> Things don't take half a cycle - rather, you can sometimes
> begin more than one instruction on the same clock cycle.

Not exactly. On P4M you have two arithmetic units which can
execute two basic arithmetic microops per cycle _each_.

The result of an operation from the first half-cycle can become
an operand for an operation issued in the second half-cycle.
(In fact that is not always true, but being true sometimes is
enough to say that we have operations taking 0.5 cycle.)

> > Let's say we've got a 3-pipe loop - we need 2 memory
> > accesses and 3 xors which is 5 cycles per 3 keys.
> >
> > This gives only 1 and 2/3 clock cycle per key.
>
> The fastest I can come up with (results a,b,c, residual in *p, extra
> register r) is { r = *p; b ^= c; r ^= a; r ^= b; *p = r; } which takes
> four clock cycles, since you can do the first two at the same time.
>
> It seems to take four clock cycles on a P133 anyway.

It all depends on the processor. On the P4M you have two separate
units which between them can do one read from L1 cache and one
write to L1 cache.

These are independent of the other processor operations.

So, for example, a read from L1 cache can be done at the same time
as the last instructions of decryption, and a write at the same time
as decrementing the counter of remaining blocks.

It's somewhat limited by the decoding unit, which can generate up
to 3 microops per cycle, but with hyperthreading there are in fact
two such units.

As far as I know, no other x86 processor has separate units
doing cache reads / writes, so this logic may be P4 specific.
Well, in fact looking at rc5 cores I think they can be somewhat
optimised for P4M (and earlier P4's as well).

> > Now... how many cycles does it take to decrypt one key?
> >
> > On P4 it's _very_ slow because of lack of barrel shifter
> > so I should probably check on Celeron or P III.
> >
> > As far as I remember it was somewhere around 200
> > clock cycles per key (somebody correct me if I'm wrong,
> > I don't have any P III handy here).
>
> 1400000000/2941147 (the PIII 1400 in the cpu database)
> gives 476 cycles.
>
> I'm not sure that the PIII has enough registers for a 3-pipe
> core though.

It doesn't. You must back the registers up with memory.

But I'm not sure whether a 3-pipe core backed by L1 cache wouldn't
be faster than a 2-pipe core. As far as I remember, at least on
Athlons the 3-pipe core was faster.

> > Does Hyper Threading support priorities?
> >
> > You know: in case of two processes of different priorities
> > running simultaneously. They are executed simultaneously,
> > but aren't they equal when fighting for processor internals?
>
>  From what I've heard on this list, the 'second' processor
> comes second.
> I don't think a processor which 'fought for internals' would
> run properly.

There is a scheduler which can take microops from both
logical processors and send them to any of the processing
units.

Does it favour one of the processors?

From my basic experiments it looks like it doesn't,
but I can't find it in Intel's documentation.

--
Slawek Piotrowski

```
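
The XOR sequence and the cycles-per-key arithmetic quoted in the message can be checked with a small sketch in C. The function names `fold_results` and `cycles_per_key` are hypothetical, chosen only for illustration; they are not taken from any RC5 client core. The sketch shows what the five-step sequence computes, not how long it takes: actual timings depend on the processor, as the thread discusses.

```c
#include <stdint.h>

/* Hypothetical helper: fold the three results a, b, c of a 3-pipe pass
 * into the residual at *p, in the order quoted in the thread:
 *   { r = *p; b ^= c; r ^= a; r ^= b; *p = r; }
 * The net effect is *p ^= a ^ b ^ c. */
void fold_results(uint32_t *p, uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r = *p;  /* load the residual; does not depend on b ^= c */
    b ^= c;           /* independent of the load, so the two may overlap */
    r ^= a;           /* needs the loaded value */
    r ^= b;           /* needs both previous results */
    *p = r;           /* store the updated residual */
}

/* Hypothetical helper: average cycles per key, given a number of cycles
 * and the number of keys processed in those cycles. Passing a clock rate
 * in Hz and a key rate in keys per second gives the same ratio. */
double cycles_per_key(double cycles, double keys)
{
    return cycles / keys;
}
```

With these definitions, `cycles_per_key(5.0, 3.0)` corresponds to the "5 cycles per 3 keys" estimate for the memory accesses and XORs, and `cycles_per_key(1400000000.0, 2941147.0)` corresponds to the PIII 1400 figure of about 476 cycles per key.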