[Hardware] RC5 algorithm

John L. Bass jbass at dmsd.com
Sun Dec 25 05:34:17 EST 2005


Martin Klingensmith <martin at nnytech.net> writes:
> I believe it /may/ be possible to pipeline this into something that cranks
> through one key each clock cycle but I'm not so much a professional VLSI
> designer, I don't know how that works in real designs.

Actually Martin, the pipelining is a piece of cake with the right tools,
and it does work out to one key per clock. It takes a pretty fair sized
FPGA, and there are fit problems because of that. "All you need to do,
is just unroll the loop" (hehehe ... haven't we heard that before ... lol).

Actually, I did just that about a year ago, and it took about a day,
using a hacked version of TMCC.

You will find Fpga C on sourceforge.net:

	http://sourceforge.net/projects/fpgac

A project I moved to sf.net about two months ago. It will compile your
unrolled c as a pure statement level pipeline with very little change
to your code or algorithm. Replace the multiply by 2's with shift left
by 1. Remove the array references (which are serial bottlenecks), by
enumerating the variables by their subscripts (IE s[0] becomes s0, and
s[13] becomes s13, etc ...). That will remove the need for the mod
function in the array subscript arithmetic (FpgaC doesn't support %).

There were a few minor issues with my hacked TMCC that required a
script to fix the net list, that actually burned a half day, but it
was VERY quick compared to doing this in VHDL.

Next is a very cute little trick. Reverse all the unrolled code blocks
front to back to force pipelined registers. C is by definition, sequential.
So all C to netlist compilers have to preserve that sequential operation,
and frequently capitalize on it by building flattened combinatorials for
each block. That however, creates deep combinatorials, which slow the
clock rate. Reversing the blocks however, uses the sequential semantics
to our benifit by creating a pipeline:

	b += a;
	c += b;
	d += c;

creates a combinatorial chain that ripples from a to d. Now if we reverse
the statements:

	d += c;
	c += b;
	b += a;

we break that combinatorial chain, creating a pipeline. It takes a little
more work to initialize it, but after doing so, one result per clock flows
out the pipe (d in this case).

So, my first pass at the RC5 core after unrolling the loops was something like:

                L0 = key0 = count;
                L1 = key1 = count>>32;
                S26 = ROTL3(0xb7e15163);
                L2 = ROTL(L0 + S26, S26);
                S27 = ROTL3(0x5618cb1c + S26 + L2);
                L3 = ROTL(L1 + S27 + L2, S27 + L2);
                S28 = ROTL3(0xf45044d5 + S27 + L3);
                L4 = ROTL(L2 + S28 + L3, S28 + L3);
                S29 = ROTL3(0x9287be8e + S28 + L4);
                L5 = ROTL(L3 + S29 + L4, S29 + L4);
                S30 = ROTL3(0x30bf3847 + S29 + L5);
                L6 = ROTL(L4 + S30 + L5, S30 + L5);
                S31 = ROTL3(0xcef6b200 + S30 + L6);
                L7 = ROTL(L5 + S31 + L6, S31 + L6);
                S32 = ROTL3(0x6d2e2bb9 + S31 + L7);
                L8 = ROTL(L6 + S32 + L7, S32 + L7);
                S33 = ROTL3(0x0b65a572 + S32 + L8);
                L9 = ROTL(L7 + S33 + L8, S33 + L8);
                S34 = ROTL3(0xa99d1f2b + S33 + L9);
                L10 = ROTL(L8 + S34 + L9, S34 + L9);
                S35 = ROTL3(0x47d498e4 + S34 + L10);
                L11 = ROTL(L9 + S35 + L10, S35 + L10);
                S36 = ROTL3(0xe60c129d + S35 + L11);
                L12 = ROTL(L10 + S36 + L11, S36 + L11);

           [...]
                E0 = ROTL(E0 ^ E1, E1) + S96;
                S97 = ROTL3(S71 + S96 + L72);
                L73 = ROTL(L71 + S97 + L72, S97 + L72);
                E1 = ROTL(E0 ^ E1, E0) + S97;
                S98 = ROTL3(S72 + S97 + L73);
                L74 = ROTL(L72 + S98 + L73, S98 + L73);
                E0 = ROTL(E0 ^ E1, E1) + S98;
                S99 = ROTL3(S73 + S98 + L74);
                L75 = ROTL(L73 + S99 + L74, S99 + L74);
                E1 = ROTL(E0 ^ E1, E0) + S99;
                S100 = ROTL3(S74 + S99 + L75);
                L76 = ROTL(L74 + S100 + L75, S100 + L75);
                E0 = ROTL(E0 ^ E1, E1) + S100;
                S101 = ROTL3(S75 + S100 + L76);
                L77 = ROTL(L75 + S101 + L76, S101 + L76);
                E1 = ROTL(E0 ^ E1, E0) + S101;
                S102 = ROTL3(S76 + S101 + L77);
                L78 = ROTL(L76 + S102 + L77, S102 + L77);
                E0 = ROTL(E0 ^ E1, E1) + S102;
                S103 = ROTL3(S77 + S102 + L78);
                E1 = ROTL(E0 ^ E1, E0) + S103;

which can be verified by wrapping a loop around it, and executing it on
a traditional processor. E0 and E1 must also be renumbered so the
var is not reused between pipeline stages for an FpgaC version.

For the current challenge, the slightly larger key width changes
the unrolled var numbering a bit, so what you see above is just
a conceptional model left over from the last challenge.

Then invert the statements from top to bottom, and fix the pipeline
initialization issues (like the first 180 or so output are garbage
to be ignored). Wrap a test harness check around it, and some communication
with the external world, and you are done ... something like this:

    #define ROTL3(x) ((x << 3) | (x >> 29))
    #define ROTL(x,n) (((x) << (n)) | ((x) >> (32-(n))))

        while (++count < max)
        {
                unsigned long E0, E1;
                unsigned long L0, L1, L2, L3, L4, L5, L6, L7, L8, L9;
                unsigned long L10, L11, L12, L13, L14, L15, L16, L17, L18, L19;
                unsigned long L20, L21, L22, L23, L24, L25, L26, L27, L28, L29;
                unsigned long L30, L31, L32, L33, L34, L35, L36, L37, L38, L39;
                unsigned long L40, L41, L42, L43, L44, L45, L46, L47, L48, L49;
                unsigned long L50, L51, L52, L53, L54, L55, L56, L57, L58, L59;
                unsigned long L60, L61, L62, L63, L64, L65, L66, L67, L68, L69;
                unsigned long L70, L71, L72, L73, L74, L75, L76, L77, L78, L79;
                unsigned long S26, S27, S28, S29;
                unsigned long S30, S31, S32, S33, S34, S35, S36, S37, S38, S39;
                unsigned long S40, S41, S42, S43, S44, S45, S46, S47, S48, S49;
                unsigned long S50, S51, S52, S53, S54, S55, S56, S57, S58, S59;
                unsigned long S60, S61, S62, S63, S64, S65, S66, S67, S68, S69;
                unsigned long S70, S71, S72, S73, S74, S75, S76, S77, S78, S79;
                unsigned long S80, S81, S82, S83, S84, S85, S86, S87, S88, S89;
                unsigned long S90, S91, S92, S93, S94, S95, S96, S97, S98, S99;
                unsigned long S100, S101, S102, S103;

#include "RC5Core.c"

                /* test for found key */
                if (cypher0 == E0 && cypher1 == E1)
                        break;
        }

I think this was posted to this list a little over a year ago. I also did
some serial verions which took A LOT more design energy - a couple weeks each.
There are some interesting space time tradeoff's.

Have fun,
John

	Date: Sun, 25 Dec 2005 00:00:46 -0500
	From: Martin Klingensmith <martin at nnytech.net>
	To: Hardware <hardware at lists.distributed.net>
	Subject: Re: [Hardware] RC5 algorithm

	Dan Oetting wrote:

	> You are right that RC5 doesn't care what comes after the current  
	> block. But it does care about what came before. The RSA contests use  
	> RC5-CBC so you are given an IV that represents the result from the  
	> previous block that needs to be XOR'd with the plain text for the  
	> current block before applying the RC5 encryption. The resulting  
	> cypher text is then used as the IV for the next block.
	>
	> Here is a description with sample code: <http://www.ietf.org/rfc/ 
	> rfc2040.txt>
	>
	> -- Dan O. 


	Thanks for the tip, Dan.
	I did a lot of studying and reducing the rfc2040.txt into something that 
	can fit on a page and is specifically for RC5-32/12/9. There are a few 
	examples of this out on the 'net but I thought I'd send it anyway in 
	case someone is interested in looking at it. The next step is to figure 
	out what I need to do to make an FPGA run this as fast as possible. I 
	believe it /may/ be possible to pipeline this into something that cranks 
	through one key each clock cycle but I'm not so much a professional VLSI 
	designer, I don't know how that works in real designs.

	Serially, the following happens:
	---------------------------
	The key expansion loop for: S[i] = S[i-1] + Qw; has to iterate 25 times
	The S table creation loop iterates 78 times
	The actual block encryption routine only runs through 12 rounds
	---------------------------
	If you have any wise advice for how to pipeline this, I would much 
	appreciate it.


	Here is the C code. It is not mine, I just hacked it up from rfc2040.c 
	which started out at around 500 lines:
	////////////////
	#include <stdio.h>

	#define Pw  0xb7e15163
	#define Qw  0x9e3779b9
	#define RC5_WORD     unsigned int
	#define SHL(x,s)    ((RC5_WORD)((x)<<((s)&31)))
	#define SHR(x,s,w)  ((RC5_WORD)((x)>>((w)-((s)&31))))
	#define ROTL(x,s,w) ((RC5_WORD)(SHL((x),(s))|SHR((x),(s),(w))))

	int main(){
	    RC5_WORD    S[42];
	    RC5_WORD    L[3];  /* Based on max key size */
	    RC5_WORD    A, B, ivA, ivB;
	    int i, k;   
	    int        j=0;

	    // iv = 07 ce 59 1f 86 14 9a 41
	    ivA = 0x1f59ce07;
	    ivB = 0x419a1486;
	   
	    // key = c9 0c 03 53 c0 d4 e1 fe 85
	    L[0] = 0x53030cc9;
	    L[1] = 0xfee1d4c0;
	    L[2] = 0x85;


	    S[0] = Pw;
	    for(i=1;i<26;i++)
	        S[i] = S[i-1] + Qw;
	   
	    i = j = 0;
	    A = B = 0;

	    for (k = 0 ; k < 3*26 ; k++)  {
	        A = ROTL(S[i] + A + B, 3, 32);
	        S[i] = A;
	        B = ROTL(L[j] + A + B, A + B, 32);
	        L[j] = B;
	        i = (i + 1) % 26;
	        j = (j + 1) % 3;
	    }

	    // p = 54 68 65 20 75 6e 6b 6e = "The Unkn"
	    A = 0x20656854;
	    B = 0x6e6b6e75;
	   
	    // RC5-CBC requires the IV to be XORed with the input block
	    A = (A ^ ivA);
	    B = (B ^ ivB);
	   
	    printf("\nA: %x\nB: %x\n",A,B);

	    A = A + S[0];
	    B = B + S[1];

	    for(i=1;i<=12;i++){
	        A = A ^ B;
	        A = ROTL(A, B, 32) + S[2*i];
	        B = B ^ A;
	        B = ROTL(B, A, 32) + S[2*i+1];
	    }   
	   
	    printf("\nA: %x\nB: %x\n",A,B);

	    return (0);
	}

	////////////////

	-- 
	---
	Martin Klingensmith
	nnytech.net
	infoarchive.net


More information about the Hardware mailing list