[Hardware] Notes on packing serial cores into Xilinx devices
jbass at dmsd.com
Thu Aug 12 21:24:42 EDT 2004
As with any programming project, writing in a high-level language to abstract
the hardware will give you a simple and easily portable solution. Getting all
the performance out of a specific hardware architecture requires getting
"down and dirty" with the target architecture's inner design.
This applies to designing for ASIC and FPGA hardware targets too. FPGA synthesis
tools do a pretty good job targeting hardware, but frequently miss the little
tricks that can create an additional 1-200% in performance due to increased
locality and lower routing/interconnect latency.
Let's start with Xilinx XCV Virtex devices:
Note pages 4 & 5 carefully. Every Xilinx FPGA is similar in design, as each
has a LUT and Register, but has critical differences in the supporting logic
around these in the form of carry chain implementation and interconnect.
The RC5 engine needs to implement the following operations 78 times:
S = ROTL3(S[-25] + S + L);
L = ROTL(L[-2] + S + L, S + L);
and for the last 26 terms, concurrent with L:
E = ROTL(E[-1] ^ E, E) + S;
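For reference, here is a minimal word-parallel software sketch of the same three operations, written as the standard RC5-32/12 key mix and encryption. It is only meant as a model to check a serial design against. The mapping of the shift-register notation above onto A/B/S[i]/L[j] is my reading of it, and the function names (expand_key, encrypt, decrypt) are made up for this example.

```python
MASK = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9   # standard RC5 magic constants for 32-bit words

def rotl(x, n):
    """Rotate a 32-bit word left by the low five bits of n."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK

def rotr(x, n):
    """Rotate a 32-bit word right by the low five bits of n."""
    return rotl(x, (32 - (n & 31)) & 31)

def expand_key(key_words, t=26):
    """Key mix: 3 * max(t, c) steps, i.e. 78 for t = 26 and c = 3 (RC5-72)."""
    c = len(key_words)
    S = [(P32 + i * Q32) & MASK for i in range(t)]
    L = list(key_words)
    A = B = 0
    i = j = 0
    for _ in range(3 * max(t, c)):
        # S = ROTL3(S[-25] + S + L)
        A = S[i] = rotl((S[i] + A + B) & MASK, 3)
        # L = ROTL(L[-2] + S + L, S + L)
        B = L[j] = rotl((L[j] + A + B) & MASK, A + B)
        i = (i + 1) % t
        j = (j + 1) % c
    return S

def encrypt(S, a, b, rounds=12):
    """Uses all 26 S words: E = ROTL(E[-1] ^ E, E) + S for each half round."""
    a = (a + S[0]) & MASK
    b = (b + S[1]) & MASK
    for r in range(1, rounds + 1):
        a = (rotl(a ^ b, b) + S[2 * r]) & MASK
        b = (rotl(b ^ a, a) + S[2 * r + 1]) & MASK
    return a, b

def decrypt(S, a, b, rounds=12):
    """Inverse of encrypt, handy for sanity-checking the model itself."""
    for r in range(rounds, 0, -1):
        b = rotr((b - S[2 * r + 1]) & MASK, a) ^ a
        a = rotr((a - S[2 * r]) & MASK, b) ^ b
    return (a - S[0]) & MASK, (b - S[1]) & MASK
```

In a key-search engine the encryption steps run alongside the third pass of the key mix rather than after it, as described above; the sketch keeps them separate only for readability.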
In a serial RC5-72 implementation, S is a shift register 26*32 bits, L is a
shift register 3*32 bits, and E is a shift register 2*32 bits. Because of
the rotate function, S and L cannot be concurrent, but are serial to
each other with the latency of the rotate function apart in time. There are
several solutions to correct this problem for best case utilization. The
easiest is to fix the rotate functions to 32 clock times, double/triple the
size of the S and L shift registers, and process two/three independent key
trials as interleaved words.
Since the last 26 terms are not used once L and E are calculated, the
initialization of S for the next round can occur concurrently.
You can unroll the loop and replicate 32 times to get one solution per
clock, but it requires the same aggregate amount of shift register storage
and function blocks. The only things saved are the muxes and control
logic needed for the initialization of S, L, and E, offset against the high
probability that the unrolled solution will not fit evenly into the FPGA and
will leave some wasted space.
The serial rotate function for Xilinx Virtex devices can be constructed
from two LUT-based 16-bit shift registers and a mux. The ROTL3 function is:
input---+--->(32bit shift register)---+
        |                             Mux----> Rotated output
        +-----------(bypass)----------+
where the mux alternately selects the lower bypass for three clocks,
and the upper stream delayed by 32 clocks for 29 clocks. The output
data stream has a 29 clock latency delay. This mux control function
can/should be factored out to the system controller logic. The variable
rotate function is a bit more complex, and several interesting solutions
are possible depending on the global architecture of the RC5 engine.
Three basic strategies exist: a counter controlled mux similar to ROTL3,
variably tapped shift registers, and work pool routing into a tree of
fixed rotate functions (a MUCH more complex RISC-like execution oriented
architecture with task specific pipelines).
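The counter controlled mux can be modelled in a few lines of software. This is a rough behavioural sketch, not HDL: it assumes words are streamed LSB first, one bit per clock, and the names (serial_rotl, to_bits, from_bits) are invented for the example. With n = 3 it reproduces the ROTL3 schedule described above (bypass for 3 clocks, delayed stream for 29, output word starting 29 clocks after the input word); letting n vary gives the first of the three variable-rotate strategies.

```python
from collections import deque

def to_bits(word, width=32):
    """Word -> list of bits, LSB first (the assumed serial order)."""
    return [(word >> k) & 1 for k in range(width)]

def from_bits(bits):
    """List of bits, LSB first -> word."""
    return sum(b << k for k, b in enumerate(bits))

def serial_rotl(in_bits, n, width=32):
    """Rotate every width-bit word in a serial stream left by n (0 <= n < width).

    One output bit is produced per input bit. Each output word starts
    (width - n) clocks after its input word starts.
    """
    delay = deque([0] * width, maxlen=width)   # models the 32-bit shift register
    latency = width - n
    out = []
    for t, bit in enumerate(in_bits):
        delayed = delay[0]                     # bit that entered `width` clocks ago
        phase = (t - latency) % width          # position within the output word
        # Mux control: bypass for the first n clocks, delayed stream afterwards.
        out.append(bit if phase < n else delayed)
        delay.append(bit)                      # shift the register by one
    return out

# Usage: two back-to-back words plus one word of padding to flush the register.
x, y = 0x80000001, 0x12345678
stream = to_bits(x) + to_bits(y) + [0] * 32
rotated = serial_rotl(stream, 3)
first = from_bits(rotated[29:61])    # ROTL3 of x
second = from_bits(rotated[61:93])   # ROTL3 of y
```

The phase counter here is the piece that can be factored out into the system controller, since it depends only on the clock count and the rotate amount, not on the data.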
With the BX select, the F or G LUT output is routed to the F5 output.
The X or Y output can then be the XOR of the LUT and carry term. This can
either form the E0 ^ E1 terms for free, or be the first part of an
adder. If the F1 function input to a shifter can be manipulated low,
then the E = ROTL + S sum term can be generated for free by registering
XB back into CIN. The Virtex Pro devices don't allow this, but have
other nice muxes at the slice and CLB levels.
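The adder half of this is the classic bit-serial adder: one XOR for the half sum, one XOR with the carry term, and the carry registered and fed back for the next bit. A small behavioural sketch follows, assuming LSB-first streams and a carry register that the controller clears at each word boundary; the names are hypothetical and it says nothing about the actual slice wiring.

```python
def to_bits(word, width=32):
    """Word -> list of bits, LSB first."""
    return [(word >> k) & 1 for k in range(width)]

def from_bits(bits):
    """List of bits, LSB first -> word."""
    return sum(b << k for k, b in enumerate(bits))

def serial_add(a_bits, b_bits, width=32):
    """Add two LSB-first serial streams word by word, modulo 2**width."""
    carry = 0                       # the registered carry fed back into CIN
    out = []
    for t, (a, b) in enumerate(zip(a_bits, b_bits)):
        if t % width == 0:
            carry = 0               # controller clears the carry at word boundaries
        half = a ^ b                # LUT output
        out.append(half ^ carry)    # XOR of the LUT output and the carry term
        # Carry mux: propagate the old carry when half == 1, else generate from a.
        carry = carry if half else a
    return out

# Usage: two words per stream, the carry does not leak from one word to the next.
a = to_bits(0xFFFFFFFF) + to_bits(0x00000001)
b = to_bits(0x00000001) + to_bits(0x00000002)
s = serial_add(a, b)
low, high = from_bits(s[:32]), from_bits(s[32:])   # 0x00000000 and 0x00000003
```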
Shifter LUTs can be cascaded or summed by directing the outputs to
the X or Y outputs, then the register can be used independently with
the BY or BX inputs. This frees up DFFs for building the discrete
serial-to-parallel converters necessary to sample and hold the intermediate
low five bits for the rotate amount input to the ROTL functions.