[nycbug-talk] drive failure?

Miles Nordin carton at Ivy.NET
Tue May 8 00:42:07 EDT 2007


>>>>> "cs" == Charles Sprickman <spork at bway.net> writes:

    cs> smartmontools

seconded.

I actually don't pay any attention to whether the ``overall
assessment'' says PASSED or not.  It seems to always say PASSED.  The
goal is to distinguish between two different problems:

 1. bad driver.  bad card.  bad cable.

 2. bad disk.

You can look at 'smartctl -a'.  Run it twice, some time apart, and
compare the raw values.  If the UDMA_CRC_Error_Count raw count is
increasing, it's a bad cable.  If the Hardware_ECC_Recovered or
Seek_Error_Rate counts are increasing, it's a bad drive.
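
The comparison above can be sketched in Python.  This is only a rough
illustration: it assumes you have saved two 'smartctl -a' outputs as
text, and that the attribute table uses the usual ten-column layout
with the raw value in the last column.  The function names
(raw_counts, diagnose) are made up for this example, and the grouping
of attributes is just the rule of thumb stated above.

```python
# Attribute names come from the rule of thumb above.
CABLE_ATTRS = {"UDMA_CRC_Error_Count"}
DRIVE_ATTRS = {"Hardware_ECC_Recovered", "Seek_Error_Rate"}


def raw_counts(smartctl_output):
    """Map attribute name -> raw value for each attribute-table row.

    Assumes rows look like:
    ID  NAME  FLAG  VALUE  WORST  THRESH  TYPE  UPDATED  WHEN_FAILED  RAW
    Rows whose raw value is not a plain integer are skipped.
    """
    counts = {}
    for line in smartctl_output.splitlines():
        f = line.split()
        if (len(f) >= 10 and f[0].isdigit()
                and f[2].startswith("0x") and f[9].isdigit()):
            counts[f[1]] = int(f[9])
    return counts


def diagnose(before, after):
    """Return the suspects whose counters grew between two snapshots."""
    old, new = raw_counts(before), raw_counts(after)
    grew = {name for name in new if name in old and new[name] > old[name]}
    suspects = []
    if grew & CABLE_ATTRS:
        suspects.append("cable")
    if grew & DRIVE_ATTRS:
        suspects.append("drive")
    return suspects
```

An empty list means none of the watched counters moved between the two
snapshots, which says nothing either way.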

Another, maybe more decisive, method: you can start 'smartctl -t long'
to tell the drive to test itself.  The output will tell you the
``recommended polling interval,'' which is about how long the test
will take, usually 1 to 4 hours.  smartctl returns immediately, and
the drive tests itself in the background.  Do this only on a drive
that's not mounted.  Then run 'smartctl -a' twice in a row, and make
sure the test is still running after the second one.  Sometimes,
sending the drive a command will abort the test, and 'smartctl -a' is
itself a command, which is why you run it twice.  If your tests are
getting aborted you'll see something like this:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Aborted by host               70%         0         -

In that case, (1) your drive's firmware is too old to work well, (2)
your power supply is bad, or (3) you are trying to test a mounted
drive.  You could try starting the test with smartctl, then unplugging
the IDE cable, and leaving the drive connected to power only for about
four hours.

If you can get the test to keep running, then after four hours or so,
do 'smartctl -a' again, and the result of the test will show up at the
bottom like this:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2373         -

This is how the drive reports the result: by writing it to this
nonvolatile log which you can check later.  There is no other
reporting method.  This positive result shown above mostly proves the
drive is good.  

If 'smartctl -t long' gives a good result, try another test: 'dd
if=/dev/ad0 of=/dev/null bs=512'.  If 'dd' reports an I/O error before
the address of the end of the drive but 'smartctl -t long' reports
good, that means your problem is with driver/card/cable.  (It is
normal on some but not all Unixes for 'dd' to provoke an IDE driver
error in 'dmesg' by trying to read past the end of the disk.  You need
to look at the sector number of the error, and see if it's in the
middle of the drive or if it's past the end.)
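
The middle-or-past-the-end check is simple arithmetic.  A minimal
sketch in Python, assuming the driver error gives the sector number in
512-byte units (matching bs=512) and that you know the drive's
capacity in bytes from somewhere, for example the capacity line in
'smartctl -i'.  The function name and return strings are invented for
illustration.

```python
SECTOR_BYTES = 512  # same unit as bs=512 in the dd command


def classify_dd_error(error_lba, capacity_bytes, sector_bytes=SECTOR_BYTES):
    """Say whether a failing sector lies inside the drive or past its end."""
    last_lba = capacity_bytes // sector_bytes - 1
    if error_lba > last_lba:
        # dd ran off the end of the disk; harmless on Unixes that log this.
        return "past end"
    # A real read error: with a clean long self-test, suspect
    # driver/card/cable rather than the drive.
    return "inside drive"
```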

A bad-drive result from 'smartctl -t long' should have a non-empty
LBA_of_first_error like this:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%     19687         204786353

I think good drives connected to bad power supplies can sometimes
fail this 'smartctl -t long' test, but I just RMA them unconditionally
when they fail and include a copy of the smartctl output.  I have had
bad power supply problems twice.  The first time I spotted it using a
scope (~600mV ripple during disk read/write activity rather than
100-200mV), and the second time by trial-and-error part-swapping.