[RC5] More about 64bit clients

Décio Luiz Gazzoni Filho decio at decpp.net
Wed Aug 24 19:15:58 EDT 2005


First, here's an answer to your other post:

 > First of all, thank you for providing a useful reply, rather unlike
 > the angry one from Decio. I'm not sure how I managed to anger him.

Here's a clue: you previously said `I've never seen a 64-bit
application work faster than a 32-bit application on the same
hardware. Never.' I explained why one would see an improvement in
x86-64. Now you come back and answer `Actually, I'm not surprised.
Often, the 64-bit version of applications performs slower than the
32-bit version.' as though I had never explained why x86-64 would see
an improvement. I occasionally get mad at people when that sort of
thing happens, and I make sure to be very incisive in my explanation
next time, so that they get the point.

Now onto the actual technical issues.

On Aug 24, 2005, at 7:17 PM, Fuzzy Logic wrote:

> Look, there is no need to be rude. I do development on both 32- and
> 64-bit platforms and had observed the general rule of better
> performance on 32-bit platforms.

All else being equal, I would expect the same as well. However, all is not
equal. x86-64 has twice as many registers as x86, which was very  
register-starved to begin with. Didier's RC5 code can afford to load  
everything in registers, including the key schedule arrays. x86 RC5  
code has to content itself with loading the two working registers A  
and B. With x86-64 one could, for a 2-pipe core, load the small key  
schedule array L[3]. Or maybe perform the last 26 stages of key  
scheduling simultaneously with encryption, as Didier's core does.
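
For reference, here is a minimal sketch of the register pressure being
discussed, assuming the textbook RC5-32/12/9 definition (72-bit key, so
3 key words and 26 subkeys). It is not Didier's core or any actual
client core, and the names expand_key, encrypt and decrypt are made up
for illustration. The point is only the working set: A, B, L[3] and
S[26] would all like to live in registers.

```python
MASK = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9   # RC5 magic constants for 32-bit words
ROUNDS = 12
T = 2 * (ROUNDS + 1)                # 26 subkeys in S

def rotl(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK

def rotr(x, n):
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK

def expand_key(key):
    """Return (S, steps): the subkey array and the number of mixing steps."""
    c = max(1, (len(key) + 3) // 4)          # 3 words for a 9-byte key
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):    # little-endian key bytes into L
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    S = [(P32 + i * Q32) & MASK for i in range(T)]
    A = B = i = j = 0
    steps = 3 * max(T, c)                    # 78 steps for RC5-32/12/9
    for _ in range(steps):
        A = S[i] = rotl((S[i] + A + B) & MASK, 3)
        B = L[j] = rotl((L[j] + A + B) & MASK, A + B)
        i = (i + 1) % T
        j = (j + 1) % c
    return S, steps

def encrypt(S, A, B):
    A = (A + S[0]) & MASK
    B = (B + S[1]) & MASK
    for r in range(1, ROUNDS + 1):           # 12 rounds, two half-steps each
        A = (rotl(A ^ B, B) + S[2 * r]) & MASK
        B = (rotl(B ^ A, A) + S[2 * r + 1]) & MASK
    return A, B

def decrypt(S, A, B):
    for r in range(ROUNDS, 0, -1):
        B = rotr((B - S[2 * r + 1]) & MASK, A) ^ A
        A = rotr((A - S[2 * r]) & MASK, B) ^ B
    return (A - S[0]) & MASK, (B - S[1]) & MASK
```

Every S[i] and L[j] access above that cannot be kept in a register
turns into a load or a store in the inner loop, which is where the
extra x86-64 registers help.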

> Yes, I read your posts. You surmised that it was due to larger
> registers. You NEVER mentioned that the cores were optimized for
> particular processors and I hadn't considered that in my thinking
> process, since when I looked at the cores, I didn't notice a
> differentiation for different x86 compatible processor types.

Perhaps you should wonder why there are so many different cores in the
client. Each performs best on a different processor. And considering
how different the architectures of processors sharing the same
instruction set (x86 in this case) are, that's almost expected.

> You said:
>
> "Do your 64-bit platforms have twice as many registers as the 32-bit
> platforms?"
>
> and then when I replied that they had the same number, you said:
>
> "Well that's the point. x86-64 has 16 registers while ordinary x86 has
> 8. (actually one is the stack pointer, so it's more like 15 vs. 7).
>
> There's the source of extra performance in x86-64."
>
> Hardly what I would call any sort of authoritative answer.

Not looking to toot my own horn, but having written a couple of RC5  
cores myself, I'm well aware of their weaknesses.

> More
> registers doesn't necessarily result in faster code. It's how you USE
> them that counts.

Tell me about it.

>
> My thinking was based on the following:
>
> 64-bit instructions are longer (generally) so fewer will fit into a
> 1st level instruction cache, which would make it harder to optimize a
> narrow loop. Having to hit the second level cache for instructions is
> much slower.

Here's my recollection of the situation in x86-64 (from reading docs
a few years ago, so I might very well be wrong): in 64-bit operating
mode, the default operand size is still 32 bits. You can include an
instruction prefix to change it to 64 bits, but if you're just doing
32-bit operations (which is what RC5 should be doing), you won't see
code expansion from that. I do have to wonder what part the extra bits
required to encode 16 instead of 8 registers play.
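
A rough sketch of what those extra register bits cost, assuming the
standard x86-64 REX prefix layout (0100WRXB) and the register-to-register
form of ADD (opcode 01 /r). The helper encode_add is made up for
illustration and handles nothing beyond that one instruction form.

```python
def encode_add(dst, src, wide=False):
    """Encode `ADD dst, src` for register numbers 0..15.

    wide=False means 32-bit operands (the default in 64-bit mode),
    wide=True means 64-bit operands (REX.W set).
    """
    rex = 0x40
    if wide:
        rex |= 0x08                  # REX.W: 64-bit operand size
    rex |= (src >> 3) << 2           # REX.R: high bit of the ModRM reg field
    rex |= dst >> 3                  # REX.B: high bit of the ModRM rm field
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)   # mod=11: register direct
    prefix = [rex] if rex != 0x40 else []         # bare 0x40 is not needed
    return bytes(prefix + [0x01, modrm])

# Register numbers: eax/rax = 0, ebx/rbx = 3, r8 = 8, r9 = 9.
add_eax_ebx = encode_add(0, 3)               # add eax, ebx
add_r8d_r9d = encode_add(8, 9)               # add r8d, r9d
add_rax_rbx = encode_add(0, 3, wide=True)    # add rax, rbx
```

So a 32-bit operation on the eight legacy registers encodes exactly as
it does on x86, while touching r8d-r15d or asking for 64-bit operands
costs one prefix byte on that instruction.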

Even if that were true, it's not like cores are particularly strapped  
for cache. They can even afford the luxury of performing a complete  
loop unroll on the core (78 key setup rounds plus 12 encryption  
rounds). All that while doing the same operations three times in  
parallel, and doing all the loads/stores involved to compensate for  
the lack of registers. x86-64 might even improve code size of cores
given that fewer loads/stores (or instructions with complex addressing
modes) will be required.

Lastly, even in the case of a cache overflow, I wouldn't be worried.  
Most modern processors have large instruction windows (I believe 72  
instructions in the Athlon and 128 in the P4). They'll be looking a  
few dozen instructions ahead and making sure to prefetch anything
that's needed. You might say `what if it predicts a branch  
incorrectly [of which there are only 3 or 4, I believe]?' The  
branches here are usually taken/not-taken with probability 1-2^(-32)  
or some equally huge number, so don't worry.
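
To put a number on that, here is a back-of-the-envelope sketch. It
assumes the rare branch direction is taken only when a wrong key
happens to pass a 32-bit comparison (modelled as a uniform random
32-bit value), and it uses a made-up 20-cycle misprediction penalty;
neither figure is measured from a real core.

```python
# Probability that the branch goes the rare way for a given key
# (assumed model: a uniform random 32-bit word matching a fixed target).
P_RARE = 2.0 ** -32

# Hypothetical misprediction penalty in cycles; not a measured figure.
MISPREDICT_PENALTY = 20

def keys_between_rare_branches(p=P_RARE):
    """Average number of keys tested between two rare-direction branches."""
    return 1.0 / p

def amortized_penalty_per_key(p=P_RARE, penalty=MISPREDICT_PENALTY):
    """Misprediction cost in cycles, averaged over every key tested."""
    return p * penalty
```

Under those assumptions the branch goes the unexpected way about once
every 2^32 keys, so even a long pipeline flush amortizes to a
negligible fraction of a cycle per key.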

>
> 64-bit instructions read and write larger chunks of data, which
> effectively shrinks the size of the first and second level cache (in
> much the same way as cache misses for instructions).

But RC5 cores don't use 64-bit instructions. They use 32-bit  
instructions and take advantage of extra registers.

>
> 64-bit instructions often take more cycles to complete than similar
> 32-bit instructions.

Agreed on the P4, probably not on the Athlon 64. However, that's a
non-sequitur, because RC5 cores don't use 64-bit instructions.

>
> Now, please remember in the future that just because someone makes a
> general comment, they are not necessarily ignorant about the subject
> at hand.

Perhaps ignorant was the wrong choice of words. However, I couldn't  
think of a word for someone who repeats the same untrue comment after  
being corrected about it. (There is a rather offensive word for it in  
Portuguese though.)

> I have many years of experience in hardware, and only a few
> years experience with d.net, and only a few minutes worth of
> experience looking at the source code for the cores.



> I stand by my statement. In general, 64-bit is slower than 32-bit.
> However, that doesn't mean that it is ALWAYS slower.

And I stand by my implied statement that, everything else being  
equal, an architecture with more registers will perform faster than  
an architecture with fewer registers (and mostly everything else is  
equal between x86 and x86-64 when implemented in the same processor,  
at least as far as RC5 is concerned).

The difference between our statements is that mine is actually  
relevant to the issue at hand.

>
> Also, something I find interesting... you cut the portion of my
> message where I stated that I was unsure because I hadn't tried it for
> myself. Perhaps that would have made your reply look bad? Something to
> think about.

In my personal system of values, if there's one thing I despise more
than someone who states something false, it's someone who admits to
being ignorant about the subject and, despite that, still goes on to
state something false.

I was just trying to spare you some embarrassment, but apparently
you're actually proud of it.

Décio