[Hardware] RC5 algorithm .... Re: "success"

John L. Bass jbass at dmsd.com
Thu Oct 19 03:52:36 EDT 2006


david fleischer <cilantro_il at yahoo.com> on Thu, 19 Oct 2006 00:31:36 wrote:
> Subject: Re: [Hardware] "success"
> To: "John L. Bass" <jbass at dmsd.com>
> 
> Martin,
> What is YOUR estimate?
> 
> Regards,
> David

Well David, since you sent this to me, I guess I'll reply and Martin
can follow up ... but he vary clearly said a few hours ago:

Martin Klingensmith <martin at nnytech.net> on Wed, 18 Oct 2006 18:54:02 -0400 wrote:
> Well, how is the data verified and sent back? (conceptually, I'm not
> looking for details)
> There could always be some sort of software layer on a host computer
> with any number of client FPGAs attached.
> 
> By the way, I made a simple top module and tried to synthesize the code
> today. It started to work but I tried to re-run the synthesis and
> Synplify Pro just sat there for hours running 100%. I'm not sure what's
> going on with that - maybe it just needs a reboot. I don't know how much
> space it will take up in an FPGA. It's a 155 stage 32 bit pipeline which
> seems large by my standards. It is for this reason that I don't know
> which device it will fit in (I use Xilinx S3-200 right now) or how fast
> it will run.
> 
> I will give out the code once I figure out if there are any reasons not
> to do so. It's basically the code I posted to this list 10 months ago
> converted to Verilog and printed out with tac. It's also part of my
> graduate thesis.

Which is pretty clear ... he doesn't know yet.

Since I helped him via email a bit last year, I do remember a few more
details, including what he posted 10 months ago as his design, and what
I posted as mine ... both 10 months ago, and the year before that. Not
to make it too easy for this school project, and purposefully didn't
give him exactly what he needed, but an older version.

John


	From jbass at dmsd.com  Sun Dec 25 03:34:19 2005
	Date: Sun, 25 Dec 2005 03:34:17 -0700
	From: "John L. Bass" <jbass at dmsd.com>
	To: hardware at lists.distributed.net
	Subject: Re: [Hardware] RC5 algorithm

	Martin Klingensmith <martin at nnytech.net> writes:
	> I believe it /may/ be possible to pipeline this into something that cranks
	> through one key each clock cycle but I'm not so much a professional VLSI
	> designer, I don't know how that works in real designs.

	Actually Martin, the pipelining is a piece of cake with the right tools,
	and it does work out to one key per clock. It takes a pretty fair sized
	FPGA, and there are fit problems because of that. "All you need to do,
	is just unroll the loop" (hehehe ... haven't we heard that before ... lol).

	Actually, I did just that about a year ago, and it took about a day,
	using a hacked version of TMCC.

	You will find Fpga C on sourceforge.net:

		http://sourceforge.net/projects/fpgac

	A project I moved to sf.net about two months ago. It will compile your
	unrolled c as a pure statement level pipeline with very little change
	to your code or algorithm. Replace the multiply by 2's with shift left
	by 1. Remove the array references (which are serial bottlenecks), by
	enumerating the variables by their subscripts (IE s[0] becomes s0, and
	s[13] becomes s13, etc ...). That will remove the need for the mod
	function in the array subscript arithmetic (FpgaC doesn't support %).

	There were a few minor issues with my hacked TMCC that required a
	script to fix the net list, that actually burned a half day, but it
	was VERY quick compared to doing this in VHDL.

	Next is a very cute little trick. Reverse all the unrolled code blocks
	front to back to force pipelined registers. C is by definition, sequential.
	So all C to netlist compilers have to preserve that sequential operation,
	and frequently capitalize on it by building flattened combinatorials for
	each block. That however, creates deep combinatorials, which slow the
	clock rate. Reversing the blocks however, uses the sequential semantics
	to our benifit by creating a pipeline:

		b += a;
		c += b;
		d += c;

	creates a combinatorial chain that ripples from a to d. Now if we reverse
	the statements:

		d += c;
		c += b;
		b += a;

	we break that combinatorial chain, creating a pipeline. It takes a little
	more work to initialize it, but after doing so, one result per clock flows
	out the pipe (d in this case).

	So, my first pass at the RC5 core after unrolling the loops was something like:

	                L0 = key0 = count;
	                L1 = key1 = count>>32;
	                S26 = ROTL3(0xb7e15163);
	                L2 = ROTL(L0 + S26, S26);
	                S27 = ROTL3(0x5618cb1c + S26 + L2);
	                L3 = ROTL(L1 + S27 + L2, S27 + L2);
	                S28 = ROTL3(0xf45044d5 + S27 + L3);
	                L4 = ROTL(L2 + S28 + L3, S28 + L3);
	                S29 = ROTL3(0x9287be8e + S28 + L4);
	                L5 = ROTL(L3 + S29 + L4, S29 + L4);
	                S30 = ROTL3(0x30bf3847 + S29 + L5);
	                L6 = ROTL(L4 + S30 + L5, S30 + L5);
	                S31 = ROTL3(0xcef6b200 + S30 + L6);
	                L7 = ROTL(L5 + S31 + L6, S31 + L6);
	                S32 = ROTL3(0x6d2e2bb9 + S31 + L7);
	                L8 = ROTL(L6 + S32 + L7, S32 + L7);
	                S33 = ROTL3(0x0b65a572 + S32 + L8);
	                L9 = ROTL(L7 + S33 + L8, S33 + L8);
	                S34 = ROTL3(0xa99d1f2b + S33 + L9);
	                L10 = ROTL(L8 + S34 + L9, S34 + L9);
	                S35 = ROTL3(0x47d498e4 + S34 + L10);
	                L11 = ROTL(L9 + S35 + L10, S35 + L10);
	                S36 = ROTL3(0xe60c129d + S35 + L11);
	                L12 = ROTL(L10 + S36 + L11, S36 + L11);

	           [...]
	                E0 = ROTL(E0 ^ E1, E1) + S96;
	                S97 = ROTL3(S71 + S96 + L72);
	                L73 = ROTL(L71 + S97 + L72, S97 + L72);
	                E1 = ROTL(E0 ^ E1, E0) + S97;
	                S98 = ROTL3(S72 + S97 + L73);
	                L74 = ROTL(L72 + S98 + L73, S98 + L73);
	                E0 = ROTL(E0 ^ E1, E1) + S98;
	                S99 = ROTL3(S73 + S98 + L74);
	                L75 = ROTL(L73 + S99 + L74, S99 + L74);
	                E1 = ROTL(E0 ^ E1, E0) + S99;
	                S100 = ROTL3(S74 + S99 + L75);
	                L76 = ROTL(L74 + S100 + L75, S100 + L75);
	                E0 = ROTL(E0 ^ E1, E1) + S100;
	                S101 = ROTL3(S75 + S100 + L76);
	                L77 = ROTL(L75 + S101 + L76, S101 + L76);
	                E1 = ROTL(E0 ^ E1, E0) + S101;
	                S102 = ROTL3(S76 + S101 + L77);
	                L78 = ROTL(L76 + S102 + L77, S102 + L77);
	                E0 = ROTL(E0 ^ E1, E1) + S102;
	                S103 = ROTL3(S77 + S102 + L78);
	                E1 = ROTL(E0 ^ E1, E0) + S103;

	which can be verified by wrapping a loop around it, and executing it on
	a traditional processor. E0 and E1 must also be renumbered so the
	var is not reused between pipeline stages for an FpgaC version.

	For the current challenge, the slightly larger key width changes
	the unrolled var numbering a bit, so what you see above is just
	a conceptional model left over from the last challenge.

	Then invert the statements from top to bottom, and fix the pipeline
	initialization issues (like the first 180 or so output are garbage
	to be ignored). Wrap a test harness check around it, and some communication
	with the external world, and you are done ... something like this:

	    #define ROTL3(x) ((x << 3) | (x >> 29))
	    #define ROTL(x,n) (((x) << (n)) | ((x) >> (32-(n))))

	        while (++count < max)
	        {
	                unsigned long E0, E1;
	                unsigned long L0, L1, L2, L3, L4, L5, L6, L7, L8, L9;
	                unsigned long L10, L11, L12, L13, L14, L15, L16, L17, L18, L19;
	                unsigned long L20, L21, L22, L23, L24, L25, L26, L27, L28, L29;
	                unsigned long L30, L31, L32, L33, L34, L35, L36, L37, L38, L39;
	                unsigned long L40, L41, L42, L43, L44, L45, L46, L47, L48, L49;
	                unsigned long L50, L51, L52, L53, L54, L55, L56, L57, L58, L59;
	                unsigned long L60, L61, L62, L63, L64, L65, L66, L67, L68, L69;
	                unsigned long L70, L71, L72, L73, L74, L75, L76, L77, L78, L79;
	                unsigned long S26, S27, S28, S29;
	                unsigned long S30, S31, S32, S33, S34, S35, S36, S37, S38, S39;
	                unsigned long S40, S41, S42, S43, S44, S45, S46, S47, S48, S49;
	                unsigned long S50, S51, S52, S53, S54, S55, S56, S57, S58, S59;
	                unsigned long S60, S61, S62, S63, S64, S65, S66, S67, S68, S69;
	                unsigned long S70, S71, S72, S73, S74, S75, S76, S77, S78, S79;
	                unsigned long S80, S81, S82, S83, S84, S85, S86, S87, S88, S89;
	                unsigned long S90, S91, S92, S93, S94, S95, S96, S97, S98, S99;
	                unsigned long S100, S101, S102, S103;

	#include "RC5Core.c"

	                /* test for found key */
	                if (cypher0 == E0 && cypher1 == E1)
	                        break;
	        }

	I think this was posted to this list a little over a year ago. I also did
	some serial verions which took A LOT more design energy - a couple weeks each.
	There are some interesting space time tradeoff's.

	Have fun,
	John

		Date: Sun, 25 Dec 2005 00:00:46 -0500
		From: Martin Klingensmith <martin at nnytech.net>
		To: Hardware <hardware at lists.distributed.net>
		Subject: Re: [Hardware] RC5 algorithm

		Dan Oetting wrote:

		> You are right that RC5 doesn't care what comes after the current  
		> block. But it does care about what came before. The RSA contests use  
		> RC5-CBC so you are given an IV that represents the result from the  
		> previous block that needs to be XOR'd with the plain text for the  
		> current block before applying the RC5 encryption. The resulting  
		> cypher text is then used as the IV for the next block.
		>
		> Here is a description with sample code: <http://www.ietf.org/rfc/ 
		> rfc2040.txt>
		>
		> -- Dan O. 


		Thanks for the tip, Dan.
		I did a lot of studying and reducing the rfc2040.txt into something that 
		can fit on a page and is specifically for RC5-32/12/9. There are a few 
		examples of this out on the 'net but I thought I'd send it anyway in 
		case someone is interested in looking at it. The next step is to figure 
		out what I need to do to make an FPGA run this as fast as possible. I 
		believe it /may/ be possible to pipeline this into something that cranks 
		through one key each clock cycle but I'm not so much a professional VLSI 
		designer, I don't know how that works in real designs.

		Serially, the following happens:
		---------------------------
		The key expansion loop for: S[i] = S[i-1] + Qw; has to iterate 25 times
		The S table creation loop iterates 78 times
		The actual block encryption routine only runs through 12 rounds
		---------------------------
		If you have any wise advice for how to pipeline this, I would much 
		appreciate it.


		Here is the C code. It is not mine, I just hacked it up from rfc2040.c 
		which started out at around 500 lines:
		////////////////
		#include <stdio.h>

		#define Pw  0xb7e15163
		#define Qw  0x9e3779b9
		#define RC5_WORD     unsigned int
		#define SHL(x,s)    ((RC5_WORD)((x)<<((s)&31)))
		#define SHR(x,s,w)  ((RC5_WORD)((x)>>((w)-((s)&31))))
		#define ROTL(x,s,w) ((RC5_WORD)(SHL((x),(s))|SHR((x),(s),(w))))

		int main(){
		    RC5_WORD    S[42];
		    RC5_WORD    L[3];  /* Based on max key size */
		    RC5_WORD    A, B, ivA, ivB;
		    int i, k;   
		    int        j=0;

		    // iv = 07 ce 59 1f 86 14 9a 41
		    ivA = 0x1f59ce07;
		    ivB = 0x419a1486;
		   
		    // key = c9 0c 03 53 c0 d4 e1 fe 85
		    L[0] = 0x53030cc9;
		    L[1] = 0xfee1d4c0;
		    L[2] = 0x85;


		    S[0] = Pw;
		    for(i=1;i<26;i++)
		        S[i] = S[i-1] + Qw;
		   
		    i = j = 0;
		    A = B = 0;

		    for (k = 0 ; k < 3*26 ; k++)  {
		        A = ROTL(S[i] + A + B, 3, 32);
		        S[i] = A;
		        B = ROTL(L[j] + A + B, A + B, 32);
		        L[j] = B;
		        i = (i + 1) % 26;
		        j = (j + 1) % 3;
		    }

		    // p = 54 68 65 20 75 6e 6b 6e = "The Unkn"
		    A = 0x20656854;
		    B = 0x6e6b6e75;
		   
		    // RC5-CBC requires the IV to be XORed with the input block
		    A = (A ^ ivA);
		    B = (B ^ ivB);
		   
		    printf("\nA: %x\nB: %x\n",A,B);

		    A = A + S[0];
		    B = B + S[1];

		    for(i=1;i<=12;i++){
		        A = A ^ B;
		        A = ROTL(A, B, 32) + S[2*i];
		        B = B ^ A;
		        B = ROTL(B, A, 32) + S[2*i+1];
		    }   
		   
		    printf("\nA: %x\nB: %x\n",A,B);

		    return (0);
		}

		////////////////

		-- 
		---
		Martin Klingensmith
		nnytech.net
		infoarchive.net



More information about the Hardware mailing list