# [Hardware] The market of ASICs (One GigaKey / Second?)

jbass at dmsd.com jbass at dmsd.com
Mon Aug 9 16:08:55 EDT 2004

```Elektron <elektron_rc5 at yahoo.ca>:
> I'm not sure if (and how) a serial core would work.

Dan seemed to have a VERY good idea, but in mulling it over it's not
quite clear that it's a big win for FPGA's. I was hoping to find
a partial solution to see if there were any tricks to reducing the
number of buffer shifters and LUT's needed to reorder the bit stream
for the variable rotates.

> You can't XOR or rotate floating point numbers, for one thing. This
> makes it practically useless in an RC5 sense, even if the overhead of
> converting to/from integers was minimised. Using all 32 registers,

XOR is a bit tricky in FP, but not impossible ... highly impractical
unless you were unfortunate enough to have an FP only machine.

rotates take a few cycles, but are very possible, so having the FP
side process the first few stages of round 1 can clearly be useful
to remove the cycles from the integer pipelines. Consider:

L0 = key0 = count;
L1 = key1 = count>>32;

This can be maintained in the FP to take the outside loop
counters out of integer registers, or avoid the memory load
store cycles.

S26 = ROTL3(0xb7e15163);

is a constant assignment, as the ROTL3 can be pre-evaluated.

L0 = ROTL(L0 + S26, S26);

The add for L0 to S26 is clearly doable in FP

Likewise, since S26 is a constant, this rotate can be constructed
statically with a short sequence of mod, divide, multiply and add.

But a more interesting solution is to avoid the rotate completely,
by maintaining key0 (or a copy) already rotated and incrementing
it by 8 instead of 1.

S27 = ROTL3(0x5618cb1c + S26 + L0);

The 0x5618cb1c + S26 is another constant that can be pre-evaled,
which leaves the L0 term, a clearly easy derivative of key0 which
makes S27 another static rotation, of the static rotation of L0,
which can be simulated by maintaining a copy of key0 and incrementing
it by 8.

L1 = ROTL(L1 + S27 + L0, S27 + L0);

The S27 + L0 term can easily be constructed in the FP, and again
the static rotation can be simulated by maintaining another copy that
is appropriately incremented.

This process can be repeated till you run out of FP register space,
which can be delayed a small amount by stacking several of these
counter copies into the same bit FP word with a few buffer zeros and
chewing up a few more FP cycles to unpack with div/mod sequences.

S28 = ROTL3(0xf45044d5 + S27 + L1);

Ditto here.

Now domes the fun part, the first real ROL that can not be
avoided.
L0 = ROTL(L0 + S28 + L1, S28 + L1);
S29 = ROTL3(0x9287be8e + S28 + L0);

So ... while slightly off topic for this list, the general exercise gives
some interesting enlightment into optimizations that can also be
applied to the hardware RC5 engine.

John
```