[plans] distributed.net .plan update

Plan Man plans at nodezero.distributed.net
Sat Nov 19 19:00:03 EST 2005


distributed .plan updates in the last 24 hours:
---

nugget :: 19-Nov-2005 12:32 CST (Saturday) ::

We made good progress this morning in diagnosing the problems with the
stats server.  As Decibel mentioned last night, we started seeing random
read errors when pulling data off the drives.  Running a SHA1 or MD5 hash
of the PostgreSQL backup file (10GB) twice in a row would never yield
the same hash both times.  Quite creepy to see.
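
For anyone wanting to run the same kind of check on their own hardware,
here is a minimal sketch in Python.  It is not the command we actually
ran; the function names, the chunk size and the default of two passes
are assumptions made for illustration.  It reads the file in chunks,
hashes it on each pass, and reports whether every pass produced the same
digest.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in fixed-size chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reads_are_stable(path, passes=2, algorithm="sha1"):
    """Hash the same file several times; True if all digests agree.

    Hypothetical helper: on healthy hardware an unchanged file must
    always hash to the same value, so any disagreement points at the
    read path (drive, cable, controller, memory) rather than the data.
    """
    digests = {file_digest(path, algorithm) for _ in range(passes)}
    return len(digests) == 1
```

One caveat: if the file is small enough to fit in RAM, the second pass
may be served from the OS cache instead of the drive, which would hide
this kind of fault.  A file much larger than memory does not have that
problem.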

At first we thought we might be dealing with an OS issue, since we'd
taken this downtime as a good opportunity to upgrade the server from
FreeBSD 5.x to 6.0-STABLE, so we got a little sidetracked debugging
UFS2 and newfs options (which we'd also experimented with during the
restore).  In that experimenting, Leto managed to ferret out a weird
bug in FreeBSD 6 where the system will panic if you copy a large 
directory structure to a drive which has been tuned with a large
average filesize parameter.  (Sent PR amd64/89202 to the FreeBSD team)

http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/89202

Once we moved past that, though, we were still facing the weird read
errors.  This morning I nicked two drives out of the raid10 volume (which
was empty anyway) and plugged them into a spare 9500S card that we've
got on hand.  We're unable to repro the read errors off that card, which
would seem to indicate that the problem is indeed the old 3Ware 8506.

Sadly, the 9500S card is only the four port model, so we can't just
swap it in and start using it.  We'll have to order a new card for
the stats server.

I'm quite encouraged that we seem to have isolated the problem to the
controller card.  It's under warranty, but it's a depot repair and 
the vendor won't just cross-ship us a replacement.  We'll have to 
order a new card if we want to get the server back up and running in
a reasonable amount of time.


