[plans] distributed.net .plan update

Plan Man plans at nodezero.distributed.net
Mon Dec 4 19:00:04 EST 2006


distributed .plan updates in the last 24 hours:
---

bovine :: 04-Dec-2006 23:49 GMT (Monday) ::

Our stats server, Fritz, is currently unavailable due to its ongoing
RAID issues.  The machine itself is back online right now, but we
have the webpages turned off until we finish making some more tweaks.

For the technically interested, the cause appears to be one of the
following:

1) Four of our WDC hard drives (SATA model WD2000JB) are suspected
   of being affected by a timeout issue related to thermal
   calibration, or a lack of TLER (Time Limited Error Recovery).

   Western Digital claims the problem only affects certain older ATA
   drives (but ours are SATA).  http://lnk.nu/wdc.custhelp.com/c6c.php
   3Ware confirms the problem for the ATA version of our model number
   (but not necessarily the SATA version).
   http://lnk.nu/3ware.com/c6d.aspx

   There is a drive firmware update, but it is only available for ATA
   drives.  We opened support tickets with 3Ware and WDC more than a
   week ago and are still waiting for responses.

2) Physical drive failure.  We've already had all of the drives RMA'ed
   at least once when we first started having these problems, so we
   don't believe there is a physical failure in the normal sense.  The
   drives report no errors after a reboot.

3) Motherboard compatibility with our RAID controller.  We have a Tyan
   S2882 motherboard, but 3Ware's compatibility page for the
   9550SX-8LP says only Tyan S2880 and S2885 are "officially"
   supported.  http://lnk.nu/3ware.com/c6e.pdf  We don't think this is
   a very likely cause, though.

4) FreeBSD updates.  We're currently on FreeBSD 6.0 stable, but 6.1
   stable has some additional 3Ware driver updates, so tonight we will
   be upgrading to that.  http://lnk.nu/freebsd.org/c6f.html

5) 3Ware RAID firmware updates.  We updated to the latest firmware a
   couple of weeks ago, before this most recent outage, so the
   firmware alone is not a fix.

6) 3Ware RAID controller.  Several months ago we tried replacing the
   RAID controller with a slightly different 3Ware model to see if
   that would affect things, but the problem persisted.

We also recently purchased a KVM-over-IP solution to allow us to
remotely manage the machine if it becomes inaccessible over the
network.  Unfortunately, this most recent failure wedged the OS,
preventing even a keyboard-initiated reboot from working.

If we don't get any further responses from WDC or 3Ware, our next
possible option is to go out and buy 4 new 200GB+ SATA drives from
another manufacturer and see if that improves things.

We might also try moving some of the drives (containing the OS and
swap) to the onboard RAID controller and see if that can keep the OS
from going down when the data volume goes down.

Thanks for your patience!

