[Hardware] Notes on packing serial cores into Xilinx devices

david fleischer cilantro_il at yahoo.com
Fri Aug 13 03:50:51 EDT 2004

A few general notes:
Running a project is much more than sitting down and
compiling code. There needs to be an infrastructure to
organize the development and distribute the data.
In the end it would be very sad if everyone ended up
with an FPGA board next to the solar panel on their
roof and no way to coordinate the data.

I read the pointer to the Virtex device; in order for
it to be worthwhile, I think it ideally needs to cost
about $20. Is this about the cost of the Spartan chips?

From what I understand, there is a dependency of the
key test on the key table generation... this pretty
much rules out a pipeline (IMHO), since it would have
to have huge storage requirements. Better to exploit
the fact that RC5 is meant to have little storage and
implement it as a cascade of combinational blocks.
Incidentally, there need to be at least two sections;
the key table generation (perhaps implemented in 3
passes), and the key test.
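
A minimal Python sketch of those two sections, assuming the RC5-72
parameters are RC5-32/12/9 (32-bit words, 12 rounds, 9-byte key) and
using the standard RC5 constants. The function names (key_table,
encrypt, decrypt, key_test) are made up for illustration and are not
taken from any existing client:

```python
MASK = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9   # standard RC5 magic constants
ROUNDS = 12
T = 2 * (ROUNDS + 1)                # 26 words in the key table S
C = 3                               # 72-bit key -> 3 words in L

def rotl(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK

def rotr(x, n):
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK

def key_table(key):
    """Section 1: expand a 9-byte key into S (3 passes over 26 words)."""
    assert len(key) == 9
    L = [int.from_bytes(key[4 * i:4 * i + 4], "little") for i in range(C)]
    S = [(P32 + i * Q32) & MASK for i in range(T)]
    A = B = 0
    for k in range(3 * T):          # 78 mixing steps
        A = S[k % T] = rotl((S[k % T] + A + B) & MASK, 3)
        B = L[k % C] = rotl((L[k % C] + A + B) & MASK, A + B)
    return S

def encrypt(S, pt):
    """Section 2: encrypt one 64-bit block given as two 32-bit words."""
    A = (pt[0] + S[0]) & MASK
    B = (pt[1] + S[1]) & MASK
    for i in range(1, ROUNDS + 1):
        A = (rotl(A ^ B, B) + S[2 * i]) & MASK
        B = (rotl(B ^ A, A) + S[2 * i + 1]) & MASK
    return A, B

def decrypt(S, ct):
    """Inverse of encrypt, included only to sanity-check the sketch."""
    A, B = ct
    for i in range(ROUNDS, 0, -1):
        B = rotr((B - S[2 * i + 1]) & MASK, A) ^ A
        A = rotr((A - S[2 * i]) & MASK, B) ^ B
    return (A - S[0]) & MASK, (B - S[1]) & MASK

def key_test(key, pt, ct):
    """True if this key maps the known plaintext to the ciphertext."""
    return encrypt(key_table(key), pt) == ct
```

Every step of the key table loop consumes the A and B produced by the
step before it, and the key test cannot start until the table is done,
which is the dependency referred to above.
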
In order for this to be fast enough, the serial
architecture is pretty much ruled out. The shifters in
particular need to be implemented as a 32-to-1 mux for
each bit.
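
As a rough illustration of the mux-per-bit shifter, here is a small
Python model: each output bit of the rotator picks one of the 32 input
bits, selected by the low five bits of the rotate amount. It models
the logic only, not timing or LUT mapping, and the names mux32 and
rotl_mux are hypothetical:

```python
def mux32(inputs, sel):
    """32-to-1 mux: pick one of 32 one-bit inputs with a 5-bit select."""
    return inputs[sel & 31]

def rotl_mux(x, amount):
    """Rotate a 32-bit word left using one 32-to-1 mux per output bit."""
    bits = [(x >> i) & 1 for i in range(32)]
    out = 0
    for i in range(32):
        # For rotate amount r, output bit i comes from input bit (i - r) mod 32.
        candidates = [bits[(i - r) % 32] for r in range(32)]
        out |= mux32(candidates, amount) << i
    return out
```

All 32 muxes share the same select lines, so the whole rotate settles
in one combinational delay instead of taking 32 clocks.
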

Regarding what was said above, I think D.net is doing a
pretty good job. We should be so lucky to have a
hardware board integrated into their project.



Also, the serial architecture saves on the
combinational logic only; the storage requirement is

--- jbass at dmsd.com wrote:

> Like any programming project, writing in a high level language to
> abstract hardware will give you a simple and easily portable
> solution. Getting all the performance out of a specific hardware
> architecture requires getting "down and dirty" with the target
> architecture's inner design.
> This applies to designing for ASIC and FPGA hardware targets too.
> FPGA synthesis tools do a pretty good job targeting hardware, but
> frequently miss the little tricks that can create an additional
> 1-200% in performance due to increased locality and lower
> routing/interconnect latency.
> Let's start with Xilinx XCV Virtex devices:
> Note pages 4 & 5 carefully. Every Xilinx FPGA is similar in design,
> as each has a LUT and Register, but has critical differences in the
> supporting logic around these in the forms of carry chain
> implementation and interconnect support.
> The RC5 engine needs to implement the following operations 78 times:
>                 S = ROTL3(S[-25] + S + L);
>                 L = ROTL(L[-2] + S + L, S + L);
> and for the last 26 terms, concurrent with L:
>                 E = ROTL(E[-1] ^ E, E) + S;
> In a serial RC5-72 implementation, S is a shift register 26*32 bits,
> L is a shift register 3*32 bits, and E is a shift register 2*32
> bits. Because of the rotate function, S and L can not be concurrent,
> but are serial to each other with the latency of the rotate function
> apart in time. There are several solutions to correct this problem
> for best case utilization. The easiest is to fix the rotate
> functions to 32 clock times, double/triple the size of the S and L
> shift registers, and process two/three independent key trials as
> interleaved words.
> Since the last 26 terms are not used once L and E are calculated,
> the initialization of S for the next round can occur concurrently.
> You can unroll the loop and replicate 32 times to get one solution
> per clock, but it requires the same aggregate amount of shift
> register storage and function blocks. The only things saved are the
> muxes and control logic needed for the initialization of S, L, and
> E, offset against the high probability that the unrolled solution
> will not fit evenly into the FPGA and will leave some wasted space.
> The serial rotate function for Xilinx Virtex devices can be
> constructed from two LUT-based 16-bit shift registers and a mux.
> The ROTL3 function is the easiest:
>       input--->(32bit shift register)-----
>               |                          |
>               |                         Mux----> Rotated output
>               |                          |
>               ----------------------------
> where the mux alternately selects the lower bypass for three clocks,
> and the upper stream delayed by 32 clocks for 29 clocks. The output
> data stream has a 29 clock latency. This mux control function
> can/should be factored out to the system controller logic. The
> variable rotate function is a bit more complex, and several
> interesting solutions are possible depending on the global
> architecture of the RC5 engine.
> Three basic strategies exist: counter controlled mux similar to
> ROTL3, variably tapped shift registers, and work pool routing into a
> tree of fixed rotate functions (a MUCH more complex RISC like
> execution oriented architecture with task specific pipelines).
> With the BX selects, the F or G LUT output is routed to the F5
> output. The X or Y output can then be the XOR of the LUT and carry
> term. This can either form the E0 ^ E1 terms for free, or be the
> first part of an adder. If the F1 function input to a shifter can be
> manipulated low, then the E = ROTL + S term sum can be generated for
> free by registering XB back into CIN. The Virtex Pro devices don't
> allow this, but have other nice muxes at the slice and CLB levels.
> Shifter LUTs can be cascaded or summed by directing the outputs to
> the X or Y outputs, then the register can be used independently with
> BY or BX inputs. This frees up DFFs for building discrete serial to
> parallel converters necessary to sample and hold the intermediate
> low five bits for the rotate amount input to the ROTL functions.
> John
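
The bit-serial ROTL3 described in the quoted message (a 32-bit delay
line plus a bypass mux) can be checked with a quick Python simulation.
This is only a sketch under assumptions that are not stated in the
original: words are streamed LSB first, back to back, and the mux
takes the bypass on clocks 29 to 31 of every 32-clock word time. The
function names are hypothetical:

```python
from collections import deque

def word_to_bits(w):
    """LSB-first bit list of a 32-bit word (assumed stream order)."""
    return [(w >> i) & 1 for i in range(32)]

def bits_to_word(bits):
    return sum(b << i for i, b in enumerate(bits))

def serial_rotl3(words):
    """Rotate each word left by 3 using a 32-clock delay line and a mux."""
    stream = [b for w in words for b in word_to_bits(w)]
    stream += [0] * 32                     # extra clocks to flush the delay line
    delay = deque([0] * 32, maxlen=32)     # models the 32-bit shift register
    out = []
    for t, bit in enumerate(stream):
        delayed = delay[0]                 # bit that entered 32 clocks ago
        # Mux: bypass for 3 clocks (29..31), delayed stream for the other 29.
        out.append(bit if t % 32 >= 29 else delayed)
        delay.append(bit)
    # Rotated word k starts 29 clocks after its input word started.
    return [bits_to_word(out[32 * k + 29:32 * k + 61])
            for k in range(len(words))]
```

The top three bits of each word reach the output directly through the
bypass, and the remaining 29 bits follow from the delay line, so each
rotated word comes out 29 clocks after its input word began, matching
the latency stated in the quote.
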
