[Hardware] Notes on packing serial cores into Xilinx devices

jbass at dmsd.com jbass at dmsd.com
Fri Aug 13 05:47:20 EDT 2004

david fleischer <cilantro_il at yahoo.com> writes:
> Running a project is much more than sitting down and
> compiling code. There needs to be an infrastructure to
> organize the development and distribute the data.
> In the end it would be very sad if every one ended up
> with an FPGA board next to the solar panel on their
> roof and no way to coordinate the data.

I have a fair clue what is needed to coordinate a project this size,
and a good understanding of organization dynamics and game theory
as it pertains to establishing strategies and rules to invoke the
desired results.

I've started and run non-profit clubs, cooperatives, and for-profit
businesses. I've been in the computer industry since 1968 in a lot
of roles, including standards committee projects.

Secrecy-based organizations have a huge problem with acceptance and
coordination, as more people are quietly rejected as outsiders than
are embraced and included. The current standard of participation
with d.net-blessed clients does not fit the evolving role of
high-performance hardware cracking engines, where there is no
standard, high-volume platform to design around.

Earlier, I gave some rough order-of-magnitude performance numbers
based on paper designs for current high-end FPGA chips ... they
were conservative. With a good micropipelined design these toys
can run much closer to 500 MHz or more, and the next generation
will be much faster. I'm pretty sure the number of top-of-the-line
Xilinx parts needed today to match d.net RC5-72 performance is
under 500.

> I read the pointer to the Virtex device; in order for
> it to be worth-while, I think it ideally needs to cost
> about20$. This is about the cost of the Spartan chips?

Spartan chips are small and fast, and are what's on most of
the FPGA student boards produced, but they are not where the
performance/price point will be. That will be something around
a million-gate device (next generation of the 25K LUT XC2V1000),
at about $25 for a single device in volume, that will deliver about
50 MKey/sec with a well-tuned core design and reasonable thermal
and clocking tradeoffs. That should be later this year.
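
To put those two figures in perspective, here is a small back-of-envelope
calculation. It uses only the numbers quoted above (about 50 MKey/sec and
about $25 per device) plus the size of the 72-bit keyspace; the function
names are made up for illustration, and board, power and cooling costs are
ignored.

```python
# Figures taken from the paragraph above; everything else is plain arithmetic.
KEYS_PER_SEC_PER_DEVICE = 50e6   # about 50 MKey/sec per device
DOLLARS_PER_DEVICE = 25.0        # about $25 per device in volume
KEYSPACE = 2 ** 72               # RC5-72 uses a 72-bit key
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def keys_per_sec_per_dollar():
    """Raw search rate bought by one dollar of FPGA silicon."""
    return KEYS_PER_SEC_PER_DEVICE / DOLLARS_PER_DEVICE


def years_to_sweep(devices):
    """Years to try every key with `devices` parts running in parallel."""
    rate = devices * KEYS_PER_SEC_PER_DEVICE
    return KEYSPACE / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{keys_per_sec_per_dollar():,.0f} keys/sec per dollar")
    for n in (1, 500, 10_000):
        print(f"{n:>6} devices: {years_to_sweep(n):,.0f} years for the full keyspace")
```

On average a key turns up after half the keyspace has been searched, so
the expected time is half of what `years_to_sweep` reports.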

> From what I understand, there is a dependency of the
> key test on the key table generation... this pretty
> much rules out a pipeline (IMHO), since it would have
> to have huge storage requirements. Better to exploit
> the fact that RC5 is meant to have little storage and
> implement it as a cascade of combinational blocks.

The key table generation is 80% of the workload, and has as part
of the design a data dependency that requires holding at minimum 26
S table entries with a tightly folded loop (like the algorithm
specification and sample code). The smallest storage block is 33
32-bit words, 26 S terms, 3 L terms, and two E terms, to support a
single instance of the three S/L/T processing elements, or 11 32-bit
storage words per processing element.
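
For readers who do not have the algorithm in front of them, the tightly
folded key-table loop looks roughly like this in software. This is a plain
Python model of RC5-32/12/9 (32-bit words, 12 rounds, 9-byte key, the
RC5-72 parameters) following the published algorithm. It is a reference
sketch only, not the hardware core discussed here, and the function names
are made up for illustration.

```python
MASK = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9   # RC5 constants for 32-bit words
ROUNDS = 12
T = 2 * (ROUNDS + 1)                # 26 S-table words


def rotl(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def rotr(x, n):
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK


def expand_key(key: bytes):
    """Build the 26-word S table from the key (3 L words for a 9-byte key)."""
    c = max(1, (len(key) + 3) // 4)
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):          # little-endian key load
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    S = [(P32 + i * Q32) & MASK for i in range(T)]
    a = b = i = j = 0
    for _ in range(3 * max(T, c)):                 # 78 mixing steps, 3 passes over S
        a = S[i] = rotl((S[i] + a + b) & MASK, 3)
        b = L[j] = rotl((L[j] + a + b) & MASK, a + b)   # data-dependent rotate
        i = (i + 1) % T
        j = (j + 1) % c
    return S


def encrypt_block(S, a, b):
    """The key test proper: 12 rounds over one 64-bit block (two words)."""
    a = (a + S[0]) & MASK
    b = (b + S[1]) & MASK
    for r in range(1, ROUNDS + 1):
        a = (rotl(a ^ b, b) + S[2 * r]) & MASK
        b = (rotl(b ^ a, a) + S[2 * r + 1]) & MASK
    return a, b


def decrypt_block(S, a, b):
    for r in range(ROUNDS, 0, -1):
        b = rotr((b - S[2 * r + 1]) & MASK, a) ^ a
        a = rotr((a - S[2 * r]) & MASK, b) ^ b
    return (a - S[0]) & MASK, (b - S[1]) & MASK
```

Each mixing step reads the S word written one step earlier and the L word
written by the step before it, which is the data dependency described
above: all 26 S words and 3 L words have to stay live across the three
passes. The key setup runs 78 mixing steps against 12 encryption rounds
for the test itself.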

The interesting point is that unrolling the loop requires nearly the
same ratio of storage to processing elements as the tightly rolled
version. By comparison, the rolled version has higher administrative
overhead per key, so there are small but significant efficiency gains
to be had by unrolling the design with heavy micro-pipelining. In fact,
unrolling allows for some optimizations at the first several stages
that actually shortcut the key building a bit and pick up additional
performance by needing fewer LUTs/CLBs and fewer cycles from start to
finish.
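
One example of the kind of early-stage shortcut that unrolling exposes
(an illustration only; the text above does not say which optimizations
are meant): the very first mixing step starts with A = B = 0, so S[0] is
the same constant for every key, and S[1] depends only on the low 32 bits
of the key. The Python sketch below checks both properties against the
published key schedule. The helper names are hypothetical.

```python
MASK = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9
T, C = 26, 3                      # S words and L words for RC5-32/12/9


def rotl(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def key_words(key: bytes):
    """Load a 9-byte key into three little-endian 32-bit L words."""
    L = [0] * C
    for i in range(len(key) - 1, -1, -1):
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    return L


def mix_prefix(key: bytes, steps: int):
    """Run only the first `steps` (<= 26) mixing steps; return the S words written."""
    L = key_words(key)
    S = [(P32 + i * Q32) & MASK for i in range(T)]
    a = b = 0
    for k in range(steps):
        i, j = k % T, k % C
        a = S[i] = rotl((S[i] + a + b) & MASK, 3)
        b = L[j] = rotl((L[j] + a + b) & MASK, a + b)
    return S[:steps]


# Step 0 sees a = b = 0, so no key bits are involved at all.
S0_CONST = rotl(P32, 3)

if __name__ == "__main__":
    k1 = bytes([1, 2, 3, 4, 0, 0, 0, 0, 0])
    k2 = bytes([1, 2, 3, 4, 9, 9, 9, 9, 9])   # same low word, different high bytes
    print(hex(S0_CONST), mix_prefix(k1, 2) == mix_prefix(k2, 2))
```

In an unrolled pipeline the first stage can therefore be replaced by a
hard-wired constant, and if the search walks the high key bytes fastest,
the second stage only changes when the low key word does.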

> Incidentally, there need to be at least two sections;
> the key table generation (perhaps implemented in 3
> passes), and the key test.
> In order for this to be fast enough, the serial
> architecture is pretty much ruled out. The shifters in
> particular need to be implemented as a 32-bit mux per
> each bit.

Not at all. I did a tight XC4085XL/XCV1000 parallel design several
years ago. At Dan's suggestion, I revisited this with serial designs.
There is not that much difference in system-level performance, but the
serial design is much easier to port and optimize, and MUCH easier to
fit tightly to a specific device. I knocked out a schematic version of
the RC5 serial processor core in a single night, but have not had time
to build the controller for it and actually run it on a board. I hope
to get into the lab and do so in the next week or so. Once tested, it
can be turned into core- or device-targeted VHDL to make device packing
easier.
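
To make the serial idea concrete, here is a small Python model of what
"serial" means for the two operations RC5 leans on. It is a behavioral
sketch under simple assumptions (one bit per clock, LSB first, 32 clocks
per word), not the schematic core mentioned above, and the function names
are made up for illustration.

```python
MASK = 0xFFFFFFFF
WIDTH = 32


def serial_add(x, y):
    """Add two 32-bit words with a single full adder, one bit per clock."""
    carry = 0
    result = 0
    for clk in range(WIDTH):                 # LSB first
        xb = (x >> clk) & 1
        yb = (y >> clk) & 1
        s = xb ^ yb ^ carry                  # full-adder sum bit
        carry = (xb & yb) | (carry & (xb ^ yb))
        result |= s << clk
    return result                            # final carry dropped: mod 2**32


def serial_rotl(x, n):
    """Rotate left by choosing where to start reading a circulating register."""
    n &= WIDTH - 1
    result = 0
    for clk in range(WIDTH):
        src = (clk - n) % WIDTH              # output bit clk comes from input bit clk-n
        result |= ((x >> src) & 1) << clk
    return result


if __name__ == "__main__":
    print(hex(serial_add(0xFFFFFFFF, 1)), hex(serial_rotl(0x80000001, 1)))
```

The point of the model is that a data-dependent rotate costs a serial
datapath only an offset into the bit stream, where a parallel datapath
needs a full barrel shifter. Each serial element is slow but tiny, so
many more of them fit in a given device.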

> Regarding what was said above, I think D.net is doing a
> pretty good job. We should be so lucky to have a
> hardware board integrated into their project.

I think that d.net needs to re-invent itself, or we will. Closed core
architectures and FPGAs just will not mix. Even fast NT/Linux boxes
are not in the same processing league as small to modest sized
FPGA-based engines.

Maybe, as Dan suggested, our cooperation is just sharing key space.

