[RC5] Big Blue vs. Protein Folding

mls mls at essex1.com
Sat Jan 27 15:00:05 EST 2001


Big Blue's big bet
Dec 7th 2000
From The Economist Technology Quarterly

Researchers at IBM are building the world's most powerful computer in an
attempt to solve one of the biggest problems in biology and, in the process,
some of the thorniest problems in computing.

SUPERCOMPUTING, like pop music, has its own charts, updated twice a year and
posted on the Internet at www.top500.org. The construction of ever-faster
computers provides a constant stream of new entries, and each machine's
chart position can go down (as faster rivals push it aside) or up (if it is
upgraded). As in pop music, every now and then a new entry crashes in at the
number one spot. If all goes according to plan, that is what a new
supercomputer being built by IBM will do when it is completed sometime in

Yet this machine, called Blue Gene, which is being put together at IBM's
Thomas J. Watson research centre in Yorktown Heights, New York, will be no
ordinary chart-topper. Since the dawn of computing, the fastest machines
have always been in the hands of nuclear physicists, meteorologists,
mathematicians, cosmologists, cryptologists or engineers. Blue Gene,
however, is being built for biologists.

The fact that the fastest computer on earth is intended for use in a field
that is not traditionally associated with high-powered computing underlines
two things. The first is the sudden ascendancy of computing in biology; the
second is the enormous amounts of computing horsepower needed to do anything
useful in the field. Blue Gene is being built to tackle a problem so complex
that it makes simulating a nuclear explosion, or the collision of two
galaxies, look like a picnic in comparison. It is intended to help
biologists explore how proteins fold themselves up into their distinctive
shapes.

Despite the fact that it will be a hundred times faster than any computer
now in existence, even Blue Gene will have trouble finding the answer.
Indeed, it is not certain that protein-folding can be meaningfully simulated
on a computer at all. But understanding it would have profound implications
for drug design. A drug can be more easily directed at a particular protein
once that protein's characteristics are known.

Besides these biological payoffs, spending five years and $100m building a
protein-folding computer also makes good business sense for IBM. Biologists
have become eager customers for big computers: the amount of genomic data
available doubles every six months, yet computers only double their speed
every 18 months. Simply keeping up with the flow of data from the
biologists, not to mention doing such things as comparing one genome with
another, requires ever more powerful computers.
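The mismatch between those two doubling rates compounds quickly. A rough sketch of the arithmetic, assuming smooth exponential growth at exactly the rates quoted above (an idealisation; the function names are illustrative only):

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Factor by which a quantity grows in `months` if it doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def data_to_compute_gap(months: float) -> float:
    """How much faster genomic data grows than computer speed over `months`."""
    data = growth_factor(months, 6.0)     # data doubles every six months
    speed = growth_factor(months, 18.0)   # speed doubles every 18 months
    return data / speed


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(years, "years: gap factor", round(data_to_compute_gap(12 * years), 1))
```

On these assumptions, after three years the data has grown 64-fold while a single computer has become only four times faster.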

IBM wants to position itself as the leading supplier of hardware and
software to biologists. It reckons that the market may be worth $9.5 billion
by 2003. Blue Gene is intended to show that it means business. And there
could be benefits for its existing customers, because getting Blue Gene to
work depends on tackling a number of thorny technical problems whose
solutions would have widespread applications in business computing.

Fold here, please
Simulating the folding of a protein sounds straightforward, at least in
theory. A protein is made up of a chain of amino acids, of which there are
20 types in all. (A gene is merely a recipe for a protein, in the form of a
code for the sequence of amino acids.) Given the amino-acid sequence of a
particular protein, the first step is to get a computer to create an
internal representation of it by stringing the amino acids together, rather
like threading beads on to a bendy piece of wire. The atomic structure of
each amino acid is known, so this amino-acid model can then be transformed
into an atomic model.

The next step is to evaluate all the forces between all the atoms, and to
determine how these forces cause the protein's structure to deform within a
short time-step (say, a couple of millionths of a billionth of a second).
Repeat this process 500 trillion times to cover the second that it takes a
real protein to fold, and by the end you should know the protein's final
shape, and also have an example of the route (the "folding pathway") by which
it got there.
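The recipe described above can be sketched in a few lines. This is a toy illustration only: it works in one dimension, uses a made-up spring force in place of a real force field, and advances the particles with a simple semi-implicit Euler update, none of which is claimed to be IBM's method. It also checks the step count, taking "a couple of millionths of a billionth of a second" to mean two femtoseconds.

```python
import itertools

TIME_STEP_S = 2e-15     # assumed reading of "a couple of millionths of a billionth of a second"
FOLDING_TIME_S = 1.0    # the second a real protein is said to take to fold


def steps_needed(total_s=FOLDING_TIME_S, dt=TIME_STEP_S):
    """Number of time-steps needed to cover total_s seconds."""
    return round(total_s / dt)


def pair_force(xi, xj, rest=1.0, k=1.0):
    """Toy 1-D spring force on particle i due to particle j (not a real force field)."""
    d = xj - xi
    dist = abs(d)
    if dist == 0.0:
        return 0.0
    # Pulls i towards j when stretched beyond `rest`, pushes it away when compressed.
    return k * (dist - rest) * (d / dist)


def step(positions, velocities, dt, mass=1.0):
    """Advance every particle by one time-step (semi-implicit Euler)."""
    forces = [0.0] * len(positions)
    # Evaluate the force between every pair of particles.
    for i, j in itertools.combinations(range(len(positions)), 2):
        f = pair_force(positions[i], positions[j])
        forces[i] += f
        forces[j] -= f   # equal and opposite force on the partner
    new_v = [v + (f / mass) * dt for v, f in zip(velocities, forces)]
    new_x = [x + v * dt for x, v in zip(positions, new_v)]
    return new_x, new_v


if __name__ == "__main__":
    print("time-steps for one second:", steps_needed())
    x, v = [0.0, 2.0, 3.5], [0.0, 0.0, 0.0]
    for _ in range(5):
        x, v = step(x, v, dt=0.1)
    print(x, v)
```

The loop over all pairs inside `step` is the expensive part, and it has to be repeated at every one of the 500 trillion time-steps.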

In practice, however, this ab initio approach is extremely problematic.
For a start, it is technically an "n-body" problem, where everything affects
everything else simultaneously, and such problems cannot be solved exactly.
A typical protein contains thousands of atoms, and the forces on every atom
must be evaluated for each time-step. That requires the calculation of
millions of separate forces. The effect of these forces on the shape of the
protein must then be calculated, which is again a huge undertaking. And to complicate
matters further, real proteins fold in the presence of a liquid solvent,
making it necessary to simulate a whole load of water molecules, too. The
result is an extremely demanding computational problem that can bring even
the most powerful of supercomputers to its knees. One such calculation,
running on a Cray T3 supercomputer, took three months to simulate the
behaviour of a small protein for a millionth of a second, at the end of which
it had barely begun to fold up into its 3-D structure.

A number of tricks have been developed to simplify the process. One way to
make life easier, for example, is to model each amino acid and each water
molecule as a single large blob. This dramatically reduces the number of
particles, and hence the number of pairs of particles (which is proportional
to the square of the number of particles). But the dodge yields only an
approximation of what is going on, and nobody is really sure just how
accurate that approximation is.
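To see how much the blob trick saves, count the pairs. Among n particles there are n(n-1)/2 distinct pairs. The sizes below (150 amino acids, 20 atoms per amino acid, 5,000 water molecules) are hypothetical round numbers chosen for illustration, not figures from the article.

```python
def pair_count(n: int) -> int:
    """Number of distinct pairs among n particles: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical sizes, for illustration only.
RESIDUES = 150            # amino acids in the protein
ATOMS_PER_RESIDUE = 20    # rough round number
WATERS = 5000             # surrounding water molecules
ATOMS_PER_WATER = 3


def all_atom_particles() -> int:
    """Particle count when every atom is modelled separately."""
    return RESIDUES * ATOMS_PER_RESIDUE + WATERS * ATOMS_PER_WATER


def blob_particles() -> int:
    """Particle count when each amino acid and each water molecule is one blob."""
    return RESIDUES + WATERS


if __name__ == "__main__":
    fine, coarse = all_atom_particles(), blob_particles()
    print("all-atom:", fine, "particles,", pair_count(fine), "pairs")
    print("blobs:   ", coarse, "particles,", pair_count(coarse), "pairs")
    print("pair reduction: %.1fx" % (pair_count(fine) / pair_count(coarse)))
```

A protein of 3,000 atoms on its own already gives about 4.5m pairs, which is where the "millions of separate forces" come from. With these made-up sizes, the blobs cut the number of pairs more than tenfold.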

An alternative approach, called "comparative modelling", is to try to jump
straight to the final structure of the protein by comparing its amino-acid
sequence with the sequences of proteins whose structures are known
experimentally from crystallography. Proteins come in families whose members
have similar functions, structures and sequences. So, if an unknown protein
has a similar sequence to a known one, the chances are that it has a similar
structure and function. The problem with this approach is that, although it
can be used to discover previously unknown members of a protein family, it
cannot uncover entirely new families. So until all such families have been
discovered, there will be proteins whose structure cannot be determined by
comparison with known structures.

Besides, there is more to understanding a protein than knowing its
structure. Understanding the mechanism of the folding process itself
("protein dynamics") is also of fundamental scientific interest. One great
mystery in protein dynamics, known as Levinthal's paradox, was first pointed
out in 1969 by Cyrus Levinthal, a pioneer of computational biology.
Levinthal noted that the number of possible configurations that a protein
could assume was enormous, yet proteins are able to fold into their
characteristic shapes quickly and consistently. Somehow, they seem to know
what the right shape is: they do not go up blind alleys or fold into
incorrect shapes. To this day, nobody really knows how or why.

In exploring these kinds of questions, however, comparative modelling is no
help. Only processes that probe the folding itself, such as the
starting-from-scratch approach of rigorously modelling every last atom and
every last force between every last pair of atoms, will do. That explains
the basis of Blue Gene's design: to allow more brute-force computer power to
be applied to the problem than ever before.

Soul of a new machine
In theory, this ought to be easy: just take a supercomputer, add lots of
extra processor chips, and divide up the work between them. Double the
number of chips, and the computer will go twice as fast. At least, that is
what one would assume. Unfortunately, that is not the case.

Making such "massively parallel" computers operate efficiently is difficult
for two reasons. On the hardware front, the problem is that, as additional
processors are added, more and more spaghetti-like wiring is needed to let
them talk to each other, and to stop them treading on each other's toes (by
trying to gain access to the same portion of memory simultaneously, for
example). The second difficulty is physical: the more processors there are,
the bigger the computer becomes, and the longer a signal takes to
travel, even when it moves at the speed of light, from one end of it to the
other.

These are classic problems in parallel computing, and there are ways to
avoid them. One approach is to give each processor its own memory, rather
than having a shared memory. But while this works well for problems in which
each processor keeps track of only a small portion of the data (such as
weather forecasting, where each chip handles a different grid-square on the
map), it cannot cope with problems for which all processors need access to
all the data at once. Another much-debated question is how best to
interconnect all the processors. Should each be able to communicate with all
the others, or only with its immediate neighbours? Flexibility will have to
be traded off against performance.

Then there is the question of how to write software that best exploits the
available hardware. To run most efficiently, software for a massively
parallel computer should take the computer's specific hardware design into
account. But that means that the software will have to be rewritten for
every new design that comes along, or whenever the configuration of the
machine is changed. Once again, a trade-off will be necessary.

At present, the most powerful computer in the world is an IBM machine called
ASCI White, which is in the Lawrence Livermore National Laboratory in
California, where it is used to simulate nuclear explosions. ASCI White
consists of an interconnected cluster of 512 machines and covers an area
equivalent to two basketball courts. Within each machine are 16 processors
that share access to a single memory. Together, ASCI White's 8,192
processors can perform 12.3 trillion floating-point operations (flops) per
second, or 12.3 teraflops in the industry jargon. This is roughly equivalent to
the combined computing power of around 30,000 desktop PCs. It sounds
impressive, but Blue Gene will be far more ambitious.

A hundred times faster
Blue Gene will have over 1m processors and will run almost 100 times faster
than ASCI White, while taking up only a quarter of the space. It is expected
to be the first machine to exceed 1,000 teraflops (one petaflop), which is
more than 2m times the power of a single PC. Rather than simply scaling up
the design used by ASCI White and other supercomputers, however, Blue Gene
will have an entirely new architecture.
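The performance figures quoted for the two machines hang together, as a quick check shows. The only inputs are the numbers given in the text; the per-PC speed is inferred from them rather than stated anywhere.

```python
ASCI_WHITE_FLOPS = 12.3e12     # 12.3 teraflops
ASCI_WHITE_IN_PCS = 30_000     # "around 30,000 desktop PCs"
BLUE_GENE_FLOPS = 1e15         # one petaflop


def implied_pc_flops() -> float:
    """Speed of one desktop PC implied by the ASCI White comparison."""
    return ASCI_WHITE_FLOPS / ASCI_WHITE_IN_PCS


def blue_gene_speedup() -> float:
    """How many times faster than ASCI White a one-petaflop machine would be."""
    return BLUE_GENE_FLOPS / ASCI_WHITE_FLOPS


def blue_gene_in_pcs() -> float:
    """Blue Gene's target speed expressed in desktop PCs."""
    return BLUE_GENE_FLOPS / implied_pc_flops()


if __name__ == "__main__":
    print("implied PC speed: %.0f megaflops" % (implied_pc_flops() / 1e6))
    print("speed-up over ASCI White: %.0fx" % blue_gene_speedup())
    print("equivalent PCs: %.1fm" % (blue_gene_in_pcs() / 1e6))
```

A PC of roughly 410 megaflops, a speed-up of about 81 and an equivalent of some 2.4m PCs match "almost 100 times faster" and "more than 2m times the power of a single PC".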

The problem, says Marc Snir, who is in charge of Blue Gene's hardware
design, is that today's computers are held back not by their processor
performance, but by the difficulty of getting the data in and out of the
processors fast enough. So Blue Gene's processors and memory will sit
side-by-side, on the same chips. Indeed, one way to look at Blue Gene is as
an enormous "smart memory": a collection of memory chips, each with several
embedded processors. If it can be arranged so that the data needed by a
particular processor happen to be nearby, there is less need to move data
around, and everything goes faster. And protein folding, it turns out, is
one problem that can indeed be tackled in this way.

According to the latest plans, Blue Gene will consist of 36,864 chips, each
of which will contain 16 megabytes of memory and 32 processor "cores". These
cores (each of which is, in essence, an independent processor) will share
access to the on-chip memory, and will also be able to communicate with
cores on other chips.

To ensure that cores spend as little time as possible waiting for data to
arrive from elsewhere, each core will run eight separate calculations
(called "threads") at once, rather like a cook preparing eight dishes
simultaneously. They will do this by cycling between threads constantly and
stepping over any threads that are held up waiting for data. Blue Gene will
thus be able to handle over 8m threads; from a programmer's point of view,
it will operate like an 8m-processor machine. This is enough to allocate one
thread to every pair of atoms in a protein-folding calculation.
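The benefit of cycling between threads can be shown with a toy simulation of one core. This is a sketch under invented assumptions, not a description of Blue Gene's actual design: the core issues at most one instruction per cycle, and every instruction is followed by a wait of seven cycles for data. The class and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    work_left: int             # instructions still to execute
    stall_every: int = 1       # instructions between waits for data (assumed)
    stall_cycles: int = 7      # length of each wait, in cycles (assumed)
    ready_at: int = 0          # first cycle at which the thread may issue again
    done_since_stall: int = 0


def run_core(threads, max_cycles=10**6):
    """Simulate one core; return (cycles_used, busy_cycles)."""
    cycle = busy = 0
    nxt = 0  # where the round-robin search starts
    while any(t.work_left for t in threads) and cycle < max_cycles:
        # Step over threads that are finished or still waiting for data.
        for k in range(len(threads)):
            t = threads[(nxt + k) % len(threads)]
            if t.work_left and t.ready_at <= cycle:
                t.work_left -= 1
                t.done_since_stall += 1
                busy += 1
                if t.done_since_stall == t.stall_every:
                    t.done_since_stall = 0
                    t.ready_at = cycle + 1 + t.stall_cycles
                nxt = (nxt + k + 1) % len(threads)
                break
        cycle += 1
    return cycle, busy


def make_threads(n, work=10):
    """Create n identical threads, each with `work` instructions to run."""
    return [Thread(work_left=work) for _ in range(n)]


if __name__ == "__main__":
    for n in (1, 8):
        cycles, busy = run_core(make_threads(n))
        print(n, "thread(s): busy", busy, "of", cycles, "cycles")
```

With these made-up numbers a single thread keeps the core busy for only about one cycle in seven, whereas eight threads keep it busy on every cycle.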

Each core will be a simple reduced instruction-set computing (RISC)
processor. Although this is not a new idea, says Dr Snir, most of today's
supposed RISC chips have strayed from the original RISC philosophy of
keeping the design as simple as possible. In order to maximise performance,
they use millions of transistors to analyse the programs they are running in
order to find opportunities to do several things at once.

Blue Gene's software, on the other hand, will be written with parallelism in
mind from the start, so there is no need to waste transistors, chip area or
energy looking for short cuts. As a result, Blue Gene will be smaller and
more energy-efficient than supercomputers built using existing
microprocessors. There will be no need for the water- or Freon-based cooling
systems that are required by current supercomputers. In terms of energy per
flop, Blue Gene will be 100 times as efficient. And with no cooling fans and
pumps whirling away, it will also be much quieter.

Chips will be grouped together, 36 at a time, on circuit-boards that
measure two feet by two feet. Four of these boards will be fitted into each
of 256 cube-shaped racks, arranged on a 16 by 16 grid (see figure). Each
rack will thus have almost a third of the power of ASCI White.
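The counts quoted in this section can be cross-checked with a little arithmetic. The sketch below reads the 36 units on each board as chips, since that is the reading under which the total comes to the 36,864 chips mentioned earlier. It also works out how many atoms could be handled if every pair of atoms were given its own thread.

```python
import math

CHIPS_PER_BOARD = 36
BOARDS_PER_RACK = 4
RACKS = 16 * 16
CORES_PER_CHIP = 32
THREADS_PER_CORE = 8
MEMORY_PER_CHIP_MB = 16
TARGET_TERAFLOPS = 1000.0      # one petaflop
ASCI_WHITE_TERAFLOPS = 12.3


def machine_totals():
    """Totals implied by the figures quoted in the text."""
    chips = CHIPS_PER_BOARD * BOARDS_PER_RACK * RACKS
    cores = chips * CORES_PER_CHIP
    return {
        "chips": chips,
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "memory_gb": chips * MEMORY_PER_CHIP_MB // 1024,
        "rack_share_of_asci_white": TARGET_TERAFLOPS / RACKS / ASCI_WHITE_TERAFLOPS,
    }


def max_atoms_with_one_thread_per_pair(threads: int) -> int:
    """Largest n such that n * (n - 1) / 2 <= threads."""
    return (1 + math.isqrt(1 + 8 * threads)) // 2


if __name__ == "__main__":
    totals = machine_totals()
    for name, value in totals.items():
        print(name, value)
    print("atoms, one thread per pair:",
          max_atoms_with_one_thread_per_pair(totals["threads"]))
```

The totals come to 36,864 chips, about 1.18m cores and about 9.4m threads, with each rack at a little under a third of ASCI White's 12.3 teraflops. One thread per pair would cover a system of a little over 4,300 atoms.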

The large number of processors means, however, that one is likely to fail,
on average, every four days. In a conventional supercomputer, with thousands
of times fewer processors, a failure every few years is acceptable, and can
be ignored. But Blue Gene must be able to cope with hardware failures
without skipping a beat. The plan is to make each chip fault-tolerant, so
that individual cores can fail but the chip will continue to work. This will
also have the advantage that imperfect chips, not all of whose cores work
properly to begin with, need not be thrown away.
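A rough calculation shows why failures that can be ignored in a smaller machine become routine in a very large one. It assumes that "processors" here means the roughly 1.18m cores, that each fails independently at a constant rate, and that processors in other machines are equally reliable. These are simplifications made for the sake of the sketch, not claims from the article.

```python
TOTAL_PROCESSORS = 36_864 * 32     # chips times cores per chip
FAILURE_INTERVAL_DAYS = 4.0        # one failure every four days, machine-wide


def per_processor_mtbf_days(processors=TOTAL_PROCESSORS,
                            interval_days=FAILURE_INTERVAL_DAYS):
    """Mean time between failures of a single processor implied by the machine-wide rate."""
    return interval_days * processors


def system_failure_interval_days(processors, mtbf_days=None):
    """Expected days between failures in a machine with `processors` processors."""
    if mtbf_days is None:
        mtbf_days = per_processor_mtbf_days()
    return mtbf_days / processors


if __name__ == "__main__":
    print("implied MTBF per processor: %.0f years" % (per_processor_mtbf_days() / 365.25))
    print("machine with 8,192 processors: one failure every %.0f days"
          % system_failure_interval_days(8192))
```

On these assumptions a machine the size of ASCI White, with 8,192 processors, would see a failure only about once every year and a half.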

The fault-tolerance will be extended to whole boards of chips, so that
individual boards can be removed and replaced, and the other boards in the
rack can take up the slack. This will be an impressive feat of engineering,
if IBM can actually pull it off; even detecting failures is a challenge in
itself. But the self-imposed five-year deadline for the project is intended
to force the hardware and software engineers to make tough choices and try
things out. The first chips are due late in 2001, and the machine is
expected to be up and running in 2003, by which time the programmers will
have written its software.

A fishing expedition
When IBM started work on Deep Blue, the chess-playing supercomputer that
defeated Garry Kasparov in 1997, the ultimate goal was clear: to win a match
against the world chess champion. For Blue Gene, however, victory will be
harder to define. Ajay Royyuru, a structural biologist at IBM's
Computational Biology Centre, suggests that a satisfactory outcome would be
to have "a significant impact in the field". There are, in fact, several
avenues of research that Blue Gene will be able to explore.

First and foremost, says Bob Germain, a physicist in the same Computational
Biology Centre, Blue Gene will be used to evaluate the validity of the
"force-field" models used in computational biology. These are simplified
mathematical models of molecules consisting of individual particles
connected by springs. According to Tim Hubbard, a computational biologist at
the Sanger Centre in Cambridge, England, the big question in protein
modelling is why ab initio modelling has hitherto been so unsuccessful. Is
it because computers are simply not fast enough, or because the mathematical
models being used are too simplistic to correspond to real-world behaviour?
By throwing more computing horsepower at the problem than ever before and
comparing the simulation results with experimental observations of real
protein behaviour, it should be possible to find out why it has failed in
the past.

If the force-field models do turn out to be accurate reflections of reality,
it will be possible to try some new things. Something that is beyond the
reach of current computers would be to look at folding trajectories for a
single protein. One idea would be to start with a protein in a random
configuration, and then to simulate its folding for a short period. This
process would be repeated many times for different starting configurations.
A single configuration from the resulting cluster would then be chosen and
simulated for a further brief period, and so on. The result would be several
trajectories showing how the initial configuration of a protein determines
the way it folds up. The question is: is there just one folding pathway for
a given protein, or are there several? If there are several, how quickly do they
converge on a final protein structure?

There are other questions that Blue Gene could tackle. Given that many
protein structures have the same function, for example, why do proteins use
only some and not others? Are some structures faster-folding or more stable?
Blue Gene could simulate heating a protein, to see how stable it is, or
could apply random mutations to a protein, to see how its ability to fold is
affected. Blue Gene will be deemed a success, says Dr Royyuru, if it can
make progress on any of these fronts at the same time as providing general
insights into the dynamics of the folding process.

An improved understanding of protein dynamics would have many benefits. It
would bring biologists a step closer to being able to predict the final
folded structure of proteins, which would help in working out their
functions, and thus make it easier to design drugs that are aimed at a
particular protein. It might also provide insights into the behaviour of
"prions", the misfolded proteins that cause bovine spongiform encephalopathy
(mad-cow disease) and Creutzfeldt-Jakob disease (its human equivalent). It
could also help with the design and assembly of exotic new materials and of
molecular-scale machines.

Blue Gene is undoubtedly an ambitious and even risky project. But whether or
not this vast new computer results in a breakthrough in computational
biology, IBM hopes to benefit from the lessons learned in its construction.
A parallel project, called Blue Lite, has been set up in order to
commercialise the new ideas that emerge from Blue Gene.

At first, there might not seem to be much of a cross-over between protein
folding and, say, running a web server or a database. But according to Mark
Dean, IBM's vice-president of systems research, there are several areas in
which Blue Gene technology could be applied. Its self-healing architecture,
for example, and the ability to plug in and remove processors while the
machine is running, would have obvious uses in e-commerce, where servers
must be kept online around the clock and the amount of processing power
needed may vary seasonally. The "smart memory" approach, with processors
sprinkled into the data, might be more efficient than current architectures
for data mining or video searching.

The contrast with Deep Blue, the chess-playing machine, is telling. In order
to make it strong enough to defeat the world champion, IBM took a
32-processor RS/6000SP machine and added 512 dedicated chips that had been
specifically designed to evaluate chess positions quickly. These custom
chips did two-thirds of the work of deciding the machine's next move. Even
so, IBM has been selling RS/6000SP machines to its customers as "Deep Blue"
technology ever since, despite the fact that the custom chips are not
included, since they cannot be used for anything except chess.

For protein folding, IBM could have taken a similar approach, and resorted
to custom chips specifically designed to evaluate interatomic forces. (Such
chips exist; a group of Japanese computational biologists is following this
approach.) But by choosing instead to attack the problem in a more general
way, using massive parallelism, IBM is far more likely to be able to exploit
what it learns commercially.

Ultimately, says Dr Dean, the big question is this: now that processing
power has become so cheap, how can lots of processors be made to work
together efficiently, either within a single machine, within a cluster, or
across the Internet? He envisages a new "cellular" computing model that will
require new architectures, new operating systems, and new software. Blue
Gene will allow some of these new ideas to be tried out. And that, perhaps,
lends some justification to the project's name, beyond being an awful pun.
For if it works, Blue Gene could provide the blueprint (the computational
DNA, in effect) for a new approach to computing.
