[nycbug-talk] drive failure?

Mon May 7 22:10:05 EDT 2007

On 5/7/07, Jonathan Vanasco <nycbug-list at 2xlp.com> wrote:
> Greetings all--
>
> Last week a server that I've been preparing for colo went down.
> Everything just stopped working, i was getting a ton of kernel
> messages ( like below ) on the screen.

How about all of the kernel messages, including the dmesg? It's hard
to get an idea of the timing from an excerpt. Did a write fail or a
read fail, and at what location? Do all of the errors concern a handful
of block numbers, or does everything look like poison before it crashes?
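
If the full console output is saved to a file, a rough tally can answer
those questions. The sketch below is only an illustration: it assumes
each error line names the operation as READ_something or
WRITE_something and carries an LBA=<number> field. Adjust the pattern to
whatever your kernel actually prints. The function names are made up
for this example.

```python
import re
from collections import Counter

# ASSUMPTION: an error line looks roughly like
#   "... READ_DMA ... LBA=12345"  or  "... WRITE_DMA ... LBA=12345".
# Change this pattern to match the messages your kernel really emits.
ERROR_RE = re.compile(r"\b(READ|WRITE)_\w+.*?\bLBA=(\d+)")


def tally_errors(lines):
    """Count failures per (operation, LBA) pair."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def summarize(counts):
    """Return (failures per operation, number of distinct LBAs)."""
    per_op = Counter()
    for (op, _lba), n in counts.items():
        per_op[op] += n
    distinct_lbas = len({lba for _op, lba in counts})
    return dict(per_op), distinct_lbas
```

Calling summarize(tally_errors(open(path))) on the saved log (path being
wherever you put it) gives the read/write split and how many distinct
blocks are involved. A few LBAs repeated many times points at a bad
spot on the platter; errors scattered everywhere suggest something else
is wrong.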

> I tried rebooting, as I thought it could have been from too much
> Postgres activity, and everything spiraled out of control.

A lot of Postgres activity... disk thrashing... on SATA? This had
better be a SATA drive that is rated for 24/7 disk thrashing.

> I ran fsck on startup and got 3 stalls from bad block errors -- but
> it seemed to clear stuff up. I did a real reboot, and the server is
> up and running.

Read below

> My question is this:  I just bought this 1 month ago.

This happens.

> does this look to be software based ?

badblocks will tell you whether it is a hardware failure, down to
exactly which blocks are bad, especially in write mode (which is
destructive, so only run it on a disk whose data you can afford to lose).
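
To show the idea behind a read-only surface scan, here is a minimal
sketch. It is not badblocks and does no write testing; it just reads a
file or device block by block and records the offsets where the read
fails. The function name and the 4096-byte block size are arbitrary
choices for this example. It also assumes that seeking to the end
reports the size, which holds for regular files and Linux block devices
but may not on every platform.

```python
import os


def find_unreadable(path, block_size=4096):
    """Return byte offsets of blocks that raise an I/O error on read."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # ASSUMPTION: SEEK_END yields the total size of the target.
        size = os.lseek(fd, 0, os.SEEK_END)
        for offset in range(0, size, block_size):
            try:
                os.pread(fd, block_size, offset)
            except OSError:
                # The kernel gave up on this block; remember where.
                bad_offsets.append(offset)
    finally:
        os.close(fd)
    return bad_offsets
```

Reads may be served from cache rather than the platter, so an empty
result from a sketch like this proves little. Use the real tool for
anything that matters.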

Regardless, I once had filesystems that looked dirty after a machine
locked up. Two fscks later I was still fixing "errors". I kept telling
it to fix the new errors until the programs themselves began
segfaulting. The culprit was bad RAM, giving everybody bad information.

The lesson is that running fsck on important data is a bad idea until
you discover the root of the issue. fsck actually ruined more data with
each pass in the above scenario. Thank god for a recent tape backup;
I'd have lost a lot without it.