[HARDWARE] Just Courious......]]

stoney at sequent.com stoney at sequent.com
Mon Sep 20 20:08:15 EDT 1999

Sorry about the tone.  I became frustated w/ some of the bad information
and number of posts that were off the subject.

I was attempting to design a 'prototype' as a PCI card that used either
exclusively used fpga's or used a mixture of embedded microprocessors
and fpga's.  I would also put SDRAM or SRAM that would not be shared by
the different units.  I would start w/ a development board before 
having boards made for a prototype.

If I unroll the loops in the fpga's it gets ridiculous, if I unroll the
loop w/ an embedded microprocessor in RAM it is feasible.  I looked at
some embedded procs such as the strongARM vr4 w/ the thumb instruction 
set included.  there is a ROR that will rotate a register a single time,
but this is not good enough.  I havn't checked the PPC401 instruction
set.  I believe that fully unrolling the loop in RAM is OK, but it is
definitely a waste in fpgas.  So, I think we had a miscommunication due
to the differences in our implementation.

Paul Cambell aka Taniwha; mentioned that a DSP proc 'sharc' has a 32 bit
rotate instruction.  

It would be nice to have special purpose
embedded proc w/ instructions especially for crypto like the DSP has the
single cycle MAC unit, butterfly index, and no overhead circular buffers
which are tailored to the algorithms for IIR, FIR, and FFT, etc.

After I looked at the RC5 subkey generation and decryption algorithm's
more closely I noticed that I wouldn't gain much from parallel execution
units (superscalar) in 1 engine due to the data dependencies for a contained
unit '1 key to 1 engine' I didn't even think about re partitioning the problem
to 'multiple keys to 1 engine'.  What I don't understand is how do you
benefit by solving multiple keys w/ one engine if the subkeys So->S(2r+1)
are dependent on the key? 

(2nd loop of sub key generation)

       do 3n, n = max((2(r+1)),c)
         A = S subscript(i) = (S subscript(i) + A + B) <<< 3
         B = L subscript(j) = (L subscript(j) + A + B) <<< (A+B)

(pipes for single iteration)

 A pipe,
  1    2    3
  ADD, ADD, fixed ROL, store A'
 B pipe,
       1    2            3             4
       ADD, ADD (w/ A'), ROL upto 2^5, store B'
 B ROL operand unit,
            ADD A'+B, feed to B pipe

My choice to decrypt the known ciphertext and compare it to 'a range of
plaintext' versus encrypting the known plaintext and comparing it to 
the known ciphertext is that I would have much more flexibility in 
applying the implementation to actual problems.  It seems to me that it
would be much easier to adapt to slight changes to plaintext so that
you are not limited to the strict known plaintext and known cipertext
The following is an excerpt quoted from

Schneier, Bruce, "Applied Cryptography:  Protocols, Algorithms,
                  and Source Code in C", Wiley ,1996 2nd edition
pages 344-346

  , where  'w' is word size, 'r' is number of rounds,
    and 'b' is length of key in bytes

>>> operator is ROR
addition and subtraction are mod 2^32

II. Decrypt RC5

  A. Create array of S sub keys

       Copy bytes of key into array, L, of 'c' 32-bit words
       padding the final word w/ 0s if necessary.
       P is binary rep of 'e'   P() = 0xb7e15163
       Q is binary rep of 'phi' Q() = 0x9e3779b9

       S subscript(0) = P
       For i = 1 to 2(r + 1)-1:
         S subscript(i) = (S subscript(i - 1) mod 2^32

       do 3n, n = max((2(r+1)),c)
         A = S subscript(i) = (S subscript(i) + A + B) <<< 3
         B = L subscript(j) = (L subscript(j) + A + B) <<< (A+B)

       i = (i+1) mod 2(r+1)
       j = (j+1) moc c

  B. Decrypt using array S

       For i = r down to 1:
         B = ((B - S subscript(2i + 1)) >>> A) XOR A
         A = ((A - S subscript(2i))     >>> B) XOR B

       B = B - S subscript(1)
       A = A - S subscript(0)

On Mon, 20 Sep 1999, Dan Oetting wrote:

> At 12:42 -0700 9/20/1999, stoney at sequent.com wrote:
> >You will only be able to partially overlap the assignments to A and B
> >in the for loops because of the data dependencies.  You cannot feasibly
> >unroll the decryption loop
> Do you know what you are talking about? I fully unrolled the loops in the
> PowerPC cores and eliminated the A and B assignments since they are
> redundant when you have 32 registers to work with.  You are confusing
> unrolling loops with parallel execution in a super scalar architecture.
> Although there is little overlap with 1 key its possible to fold the loop
> and completely overlap the first and last rounds requiring only 2
> additional registers. And anybody that has worked with this code knows that
> you encrypt instead of decrypt because the encryption can be folded into
> the last round of key generation and you don't need to generate the final
> S[25] inside the loop.
> --
> Dan Oetting <dan_oetting at comug.com>
> PowerPC 603/604/750 -- Still the fastest core on the net.
> --
> To unsubscribe, send 'unsubscribe hardware' to majordomo at lists.distributed.net

To unsubscribe, send 'unsubscribe hardware' to majordomo at lists.distributed.net

More information about the Hardware mailing list