[RC5] More about 64bit clients
Décio Luiz Gazzoni Filho
decio at decpp.net
Wed Aug 24 19:15:58 EDT 2005
First, here's an answer to your other post:
> First of all, thank you for providing a useful reply, rather unlike
> the angry one from Decio. I'm not sure how I managed to anger him.
Here's a clue: you previously said `I've never seen a 64-bit
application work faster than a 31-bit application on the same
hardware. Never.' I explained why one would see an improvement in
x86-64. Now you come back and answer `Actually, I'm not surprised.
Often, the 64-bit version of applications performs slower than the 32-
bit version.' as though I had never explained why x86-64 would see an
improvement. Now, I occasionally get mad at people when that sort of
thing happens, and I make sure to be very incisive in my
explanation next time, to make sure they get the point.
Now onto the actual technical issues.
On Aug 24, 2005, at 7:17 PM, Fuzzy Logic wrote:
> Look, there is no need to be rude. I do development on both 32- and
> 64-bit platforms and had observed the general rule of better
> performance on 32-bit platforms.
All else being equal, I would expect the same as well. However, all is not
equal. x86-64 has twice as many registers as x86, which was very
register-starved to begin with. Didier's RC5 code can afford to load
everything in registers, including the key schedule arrays. x86 RC5
code has to content itself with loading the two working registers A
and B. With x86-64 one could, for a 2-pipe core, load the small key
schedule array L. Or maybe perform the last 26 stages of key
scheduling simultaneously with encryption, as Didier's core does.
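To make the register argument concrete, here is a rough C sketch of a 2-pipe key setup next to the plain one-key version. It is only an illustration, not code from any actual client core. The names (`setup_1pipe`, `setup_2pipe`, `S_WORDS`, `L_WORDS`) are made up for this example, and the 3-word L array assumes a 72-bit key, which the text above does not state. With two pipes the hot values are A and B for each key plus two 3-word L arrays, 10 words in all: that fits in 15 usable registers but not in 7.

```c
#include <stdint.h>
#include <string.h>

#define S_WORDS 26           /* 2 * (12 rounds + 1) expanded key words */
#define L_WORDS 3            /* ASSUMPTION: 72-bit key, so L has 3 words */
#define P32 0xB7E15163u      /* standard RC5 magic constants for 32-bit words */
#define Q32 0x9E3779B9u

static uint32_t rol(uint32_t x, uint32_t n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Plain key schedule for one key (key already packed into 32-bit words). */
static void setup_1pipe(const uint32_t key[L_WORDS], uint32_t S[S_WORDS]) {
    uint32_t L[L_WORDS], A = 0, B = 0;
    memcpy(L, key, sizeof L);
    S[0] = P32;
    for (int i = 1; i < S_WORDS; i++) S[i] = S[i - 1] + Q32;
    for (int k = 0, i = 0, j = 0; k < 3 * S_WORDS; k++) {
        A = S[i] = rol(S[i] + A + B, 3);
        B = L[j] = rol(L[j] + A + B, A + B);
        i = (i + 1) % S_WORDS;
        j = (j + 1) % L_WORDS;
    }
}

/* Same schedule for two independent keys, interleaved step by step.
 * Hot values: A1, B1, A2, B2 and the two L arrays (10 words total).
 * On a register-rich target a compiler or hand-written core can keep
 * all of them in registers; S1/S2 still live in memory here. */
static void setup_2pipe(const uint32_t k1[L_WORDS], const uint32_t k2[L_WORDS],
                        uint32_t S1[S_WORDS], uint32_t S2[S_WORDS]) {
    uint32_t L1[L_WORDS], L2[L_WORDS];
    uint32_t A1 = 0, B1 = 0, A2 = 0, B2 = 0;
    memcpy(L1, k1, sizeof L1);
    memcpy(L2, k2, sizeof L2);
    S1[0] = S2[0] = P32;
    for (int i = 1; i < S_WORDS; i++) S1[i] = S2[i] = S1[i - 1] + Q32;
    for (int k = 0, i = 0, j = 0; k < 3 * S_WORDS; k++) {
        /* The two pipes never depend on each other, so their
         * instructions can overlap in the processor. */
        A1 = S1[i] = rol(S1[i] + A1 + B1, 3);
        A2 = S2[i] = rol(S2[i] + A2 + B2, 3);
        B1 = L1[j] = rol(L1[j] + A1 + B1, A1 + B1);
        B2 = L2[j] = rol(L2[j] + A2 + B2, A2 + B2);
        i = (i + 1) % S_WORDS;
        j = (j + 1) % L_WORDS;
    }
}
```

Both functions produce the same expanded keys; the only difference is how many independent values are live at once, which is exactly where the extra registers would help.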
> Yes, I read your posts. You surmised that it was due to larger
> registers. You NEVER mentioned that the cores were optimized for
> particular processors and I hadn't considered that in my thinking
> process, since when I looked at the cores, I didn't notice a
> differentiation for different x86 compatible processor types.
Perhaps you should wonder what's the reason behind so many different
cores in the client. Each performs better on a different processor.
And considering the highly different architectures of available
processors using the same instruction set (x86 in this case), that's
> You said:
> "Do your 64-bit platforms have twice as many registers as the 31-bit
> and then when I replied that they had the same number, you said:
> "Well that's the point. x86-64 has 16 registers while ordinary x86 has
> 8. (actually one is the stack pointer, so it's more like 15 vs. 7).
> There's the source of extra performance in x86-64."
> Hardly what I would call any sort of authoritative answer.
Not looking to toot my own horn, but having written a couple of RC5
cores myself, I'm well aware of their weaknesses.
> registers doesn't necessarily result in faster code. It's how you USE
> them that counts.
Tell me about it.
> My thinking was based on the following:
> 64-bit instructions are longer (generally) so fewer will fit into a
> 1st level instruction cache, which would make it harder to optimize a
> narrow loop. Having to hit the second level cache for instructions is
> much slower.
Here's my recollection of the situation in x86-64 (from reading docs
a few years ago, so I might very well be wrong): in 64-bit
operating mode, the default operand size is still 32 bits. You can
include an instruction prefix to change it to 64 bits, but if you're
just doing 32-bit operations (which is what RC5 should be doing), you
won't see code expansion from that. I do have to wonder what part the
extra bit per register operand, required to encode 16 instead of 8
registers, plays.
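For what it's worth, the encoding question can be sketched in a few lines of C, assuming the standard REX prefix layout (a byte in the range 0x40-0x4F whose low bits are W, R, X and B). The function `encode_add_rr` is a hypothetical helper written for this illustration, and it handles only the register-to-register form of `add` (opcode 01 /r). It suggests that 32-bit operations on the original eight registers encode exactly as on x86, while touching one of the eight new registers costs one prefix byte per instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `add dst, src` in register-to-register form (opcode 01 /r).
 * dst and src are register numbers 0..15; wide != 0 selects 64-bit
 * operands. Writes the bytes to out and returns how many were written. */
static size_t encode_add_rr(unsigned dst, unsigned src, int wide,
                            uint8_t out[3]) {
    size_t n = 0;
    uint8_t rex = 0x40;             /* REX prefix base: 0100WRXB */
    if (wide)    rex |= 0x08;       /* REX.W: 64-bit operand size */
    if (src & 8) rex |= 0x04;       /* REX.R: 4th bit of ModRM.reg */
    if (dst & 8) rex |= 0x01;       /* REX.B: 4th bit of ModRM.rm */
    if (rex != 0x40)
        out[n++] = rex;             /* prefix only emitted when needed */
    out[n++] = 0x01;                /* ADD r/m32, r32 */
    out[n++] = (uint8_t)(0xC0 | ((src & 7) << 3) | (dst & 7));
    return n;
}
```

With register numbers 0 = eax and 3 = ebx, `encode_add_rr(0, 3, 0, buf)` gives the two bytes `01 D8`, while `encode_add_rr(8, 9, 0, buf)` (add r8d, r9d) gives `45 01 C8`: same operation, one extra byte.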
Even if that were true, it's not like cores are particularly strapped
for cache. They can even afford the luxury of performing a complete
loop unroll on the core (78 key setup rounds plus 12 encryption
rounds). All that while doing the same operations three times in
parallel, and doing all the loads/stores involved to compensate for
the lack of registers. x86-64 might even improve the code size of cores,
given that fewer loads/stores (or instructions with complex addressing
modes) will be required.
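As a reference for where the 78 and 12 come from, here is a minimal, non-unrolled RC5 sketch in C following the published algorithm. It is not a client core. The 9-byte (72-bit) key length is an assumption, and all identifiers (`rc5_ctx`, `rc5_setup`, `RC5_SETUP_STEPS`, and so on) are invented for this example. A fully unrolled core would spell out each iteration of the two loops below as straight-line code.

```c
#include <stdint.h>

enum {
    RC5_ROUNDS      = 12,
    RC5_T           = 2 * (RC5_ROUNDS + 1),            /* 26 words in S */
    RC5_KEY_BYTES   = 9,                               /* ASSUMPTION: 72-bit key */
    RC5_C           = (RC5_KEY_BYTES + 3) / 4,         /* 3 words in L */
    RC5_SETUP_STEPS = 3 * (RC5_T > RC5_C ? RC5_T : RC5_C)  /* 78 */
};

typedef struct { uint32_t S[RC5_T]; } rc5_ctx;

static uint32_t rotl32(uint32_t x, uint32_t n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

static uint32_t rotr32(uint32_t x, uint32_t n) {
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Key setup: 78 mixing steps over S (26 words) and L (3 words). */
static void rc5_setup(rc5_ctx *ctx, const uint8_t key[RC5_KEY_BYTES]) {
    uint32_t L[RC5_C] = {0}, A = 0, B = 0;
    for (int i = RC5_KEY_BYTES - 1; i >= 0; i--)   /* little-endian packing */
        L[i / 4] = (L[i / 4] << 8) + key[i];
    ctx->S[0] = 0xB7E15163u;
    for (int i = 1; i < RC5_T; i++)
        ctx->S[i] = ctx->S[i - 1] + 0x9E3779B9u;
    for (int k = 0, i = 0, j = 0; k < RC5_SETUP_STEPS; k++) {
        A = ctx->S[i] = rotl32(ctx->S[i] + A + B, 3);
        B = L[j] = rotl32(L[j] + A + B, A + B);
        i = (i + 1) % RC5_T;
        j = (j + 1) % RC5_C;
    }
}

/* Encryption: 12 rounds on the two working words A and B. */
static void rc5_encrypt(const rc5_ctx *ctx, uint32_t blk[2]) {
    uint32_t A = blk[0] + ctx->S[0], B = blk[1] + ctx->S[1];
    for (int r = 1; r <= RC5_ROUNDS; r++) {
        A = rotl32(A ^ B, B) + ctx->S[2 * r];
        B = rotl32(B ^ A, A) + ctx->S[2 * r + 1];
    }
    blk[0] = A;
    blk[1] = B;
}

/* Decryption, included only so the sketch can be checked by round trip. */
static void rc5_decrypt(const rc5_ctx *ctx, uint32_t blk[2]) {
    uint32_t A = blk[0], B = blk[1];
    for (int r = RC5_ROUNDS; r >= 1; r--) {
        B = rotr32(B - ctx->S[2 * r + 1], A) ^ A;
        A = rotr32(A - ctx->S[2 * r], B) ^ B;
    }
    blk[0] = A - ctx->S[0];
    blk[1] = B - ctx->S[1];
}
```

Each setup step and each encryption round is only a handful of adds, XORs and rotates, which is why unrolling all of them, even for three keys at once, still leaves a modest amount of code.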
Lastly, even in the case of a cache overflow, I wouldn't be worried.
Most modern processors have large instruction windows (I believe 72
instructions in the Athlon and 128 in the P4). They'll be looking a
few dozen instructions ahead and making sure to prefetch anything
that's needed. You might say `what if it predicts a branch
incorrectly [of which there are only 3 or 4, I believe]?' The
branches here are usually taken/not-taken with probability 1-2^(-32)
or something equally close to 1, so don't worry.
> 64-bit instructions read and write larger chunks of data, which
> effectively shrinks the size of the first and second level cache (in
> much the same way as cache misses for instructions).
But RC5 cores don't use 64-bit instructions. They use 32-bit
instructions and take advantage of extra registers.
> 64-bit instructions often take more cycles to complete than similar
> 32-bit instructions.
Agreed on the P4, probably not on the Athlon 64. However, that's a
non-sequitur, because RC5 cores don't use 64-bit instructions.
> Now, please remember in the future that just because someone makes a
> general comment, they are not necessarily ignorant about the subject
> at hand.
Perhaps ignorant was the wrong choice of words. However, I couldn't
think of a word for someone who repeats the same untrue comment after
being corrected about it. (There is a rather offensive word for it in
> I have many years of experience in hardware, and only a few
> years experience with d.net, and only a few minutes worth of
> experience looking at the source code for the cores.
> I stand by my statement. In general, 64-bit is slower than 32-bit.
> However, that doesn't mean that it is ALWAYS slower.
And I stand by my implied statement that, everything else being
equal, an architecture with more registers will perform faster than
an architecture with fewer registers (and mostly everything else is
equal between x86 and x86-64 when implemented in the same processor,
at least as far as RC5 is concerned).
The difference between our statements is that mine is actually
relevant to the issue at hand.
> Also, something I find interesting... you cut the portion of my
> message where I stated that I was unsure because I hadn't tried it for
> myself. Perhaps that would have made your reply look bad? Something to
> think about.
In my personal system of values, if there's one thing I despise more
than someone who states something false, it's someone who admits to
being ignorant about the subject, and despite that, still goes on to
state something false.
I was just trying to spare you some embarrassment, but apparently
you're actually proud of it.