[Hardware] The market of ASICs (One GigaKey / Second?)

Elektron elektron_rc5 at yahoo.ca
Mon Aug 9 15:03:38 EDT 2004


Is replying to 4 mails at once bad etiquette?

> I tried that. With a parallel core I was never able to schedule all 
> the loads and stores to keep from blocking and slowing down the pipe. 
> What I did hit on is that the PPC has enough registers so you don't 
> have to write anything to RAM. Since Round 3 and the encryption free 
> up 1 S register each stage and Round 1 uses 1 more S register each 
> stage they can be wrapped together to start processing the next key as 
> the previous key is being finished. It worked out that Rounds 1 and 3 
> could be perfectly meshed to keep the execution units busy and every 
> dispatch cycle full. Round 2 had 1 dispatch hole per stage and all but 
> 2 of these were filled with housekeeping instructions such as 
> incrementing the key for the next iteration.

2 or 3 instructions dispatched per cycle? My computer (I think it's a 
7455) can do three. Making sure you dispatch enough instructions is a 
pain in the neck.

...

> Awesome!! don't meet many cycle counters these days - I had almost
> thought that was becoming a lost art! I dropped out of that biz some
> six years back because nobody was willing to pay for performance
> anymore.

I'm a cycle counter on a good day. 16 cycles per byte of CRC. Actually, 
it was 16 or 17, depending on how the instructions were aligned (adding 
nops to the end could bring it down to 16, but the branch target wasn't 
a multiple of 16). Perhaps the lag caused by the misalignment helped.

> On the PIII's and later how close did you get? What's the speedup over
> gcc's best effort of:
>
>         <snip>
>
> for each stage?

First, I'll say that the register names are all icky (I don't do x86 
asm), and the answer also depends on which stage you mean (round 0+1, 
round 2, or round 3+enc?). I also suspect there's no loop unrolling 
there (I may be wrong), and I spy some potential instruction reordering 
(the 3rd and 4th instructions). I'm not sure what ANDing with 31 is 
supposed to do. You could, of course, look at the RC5 source code (or I 
could try compiling it on a P133, which may not be much help, since it 
sucks).
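
One plausible reading of the AND with 31, offered as a guess rather than 
as what the snipped gcc output actually does: RC5 with 32-bit words 
rotates by a data-dependent amount, and only the low 5 bits of that 
amount matter, so a compiler may emit the mask explicitly. A minimal 
sketch in C of one key-expansion step, assuming 32-bit words; the names 
rotl32 and rc5_mix_step are made up for illustration and are not taken 
from any existing RC5 source:

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits. Only the low 5 bits of n are
 * used, which is where an "AND with 31" would come from. */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;                                   /* rotate count mod 32 */
    return (x << n) | (x >> ((32 - n) & 31));  /* safe when n == 0 */
}

/* One key-expansion mixing step for RC5 with 32-bit words.
 * S_i points at the current S-table word, L_j at the current key word,
 * and A and B are the running values carried between steps. */
static void rc5_mix_step(uint32_t *S_i, uint32_t *L_j,
                         uint32_t *A, uint32_t *B)
{
    assert(S_i && L_j && A && B);
    *A = *S_i = rotl32(*S_i + *A + *B, 3);        /* fixed rotate by 3 */
    *B = *L_j = rotl32(*L_j + *A + *B, *A + *B);  /* data-dependent rotate */
}
```

The second rotate is the one whose count depends on the data, so that is 
where a mask like this would be expected to show up.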

> Did a long google search for serial RC5 cores yesterday, doesn't seem
> that anybody wants to publish any respectable RC5 cores at all, even
> though RC5 appears to now be a standard fare fpga class assignment and
> a number of people are trying to do thesis work around it.

I'm not sure if (and how) a serial core would work.

...

> I don't do pentiums. My core would process 1 key every 296 clock 
> cycles on a PowerPC 603 processor. It may be 30-40% faster than the 
> best C code on the same CPU.

I'm not sure, but with decent register allocation (gcc comes to mind), 
you should be able to make decent C code (albeit looking exactly like 
the assembly code, except in C).
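
To show what I mean by C that looks like the assembly, here is a rough 
sketch of the RC5 encryption rounds for 32-bit words, with the two 
working values kept in locals so the compiler can hold them in 
registers. The names rol32 and rc5_encrypt_block are hypothetical, and 
this is not the distributed.net core, just the textbook round structure:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate left by the low 5 bits of n. */
static uint32_t rol32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Encrypt one block (two 32-bit words) in place.
 * S must hold 2 * (rounds + 1) expanded key words. */
static void rc5_encrypt_block(const uint32_t *S, unsigned rounds,
                              uint32_t block[2])
{
    assert(S != NULL && block != NULL);

    /* Locals instead of memory, so A and B can live in registers. */
    uint32_t A = block[0] + S[0];
    uint32_t B = block[1] + S[1];

    for (unsigned i = 1; i <= rounds; i++) {
        A = rol32(A ^ B, B) + S[2 * i];      /* each S word is read once */
        B = rol32(B ^ A, A) + S[2 * i + 1];
    }

    block[0] = A;
    block[1] = B;
}
```

Whether gcc schedules this as well as hand-written assembly is a 
separate question; the sketch only shows the shape of the code.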

...

> Cool ... I haven't done any PPC projects yet, except for setting up my
> kids on some Macs. Anything fancy like packing things into the FP
> register file, or using the FP processor to offload the integer unit
> pipeline?

You can't XOR or rotate floating point numbers, for one thing. This 
makes the FPU practically useless in an RC5 sense, even if the overhead 
of converting to/from integers were minimised. Using all 32 registers, 
scheduling instructions rigorously and using the CTR are always good 
places to start, though. If it's a reasonably recent Mac, you should be 
able to get gcc on it (using some release of the developer tools). If 
not, you can always try MPW or CodeWarrior.

That said, PowerPC assembly is a dream come true.



