[nycbug-talk] drive failure?

Jonathan Vanasco nycbug-list at 2xlp.com
Tue May 8 11:54:30 EDT 2007


On May 7, 2007, at 10:00 PM, George Rosamond wrote:

> From Google and all... it could possibly be related to the
> hardware IRQs.
>
> Notice anything funky in the dmesg?
>
> It would be good to provide some more info... is this FBSD 6.x?
> From the ad12, I assume you're using SATA... what hardware?  No
> RAID?

FreeBSD 6.1

I was getting the same messages (posted below) in dmesg, but only
before I ran fsck.  Then they just stopped.

That's what has me confused: it *looks* like a hardware issue, but it
stopped when I fsck'd.

I'm using SATA via the onboard controller (Intel DQ965GFEKR).  There
are 4 SATA drives in the machine: 2 configured as a RAID mirror via
the onboard RAID controller, 2 not.  This error was on a non-RAID
drive.



On May 7, 2007, at 10:10 PM, Jeff Quast wrote:
> How about all of the kernel messages, including the dmesg? It's hard
> to get an idea of the timing: did a write fail, or a read fail, and at
> what location? Are all of the errors regarding a handful of block
> numbers, or does everything look like poison before it crashes?

Reads and writes were failing, first while the system was running,
then again after I rebooted.
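
One way to answer the "handful of block numbers" question is to tally the LBAs the kernel complains about. What follows is a minimal Python sketch, assuming the message format seen in the dmesg at the end of this post; the name `summarize` and the regex are illustrative only and not part of any FreeBSD tool. The sample lines are copied from that dmesg.

```python
import re
from collections import Counter

# Matches the ATA driver messages quoted in this post, for example:
#   ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20243439
#   ad12: FAILURE - READ_DMA timed out LBA=20243439
EVENT = re.compile(
    r"^(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<op>\w+) .*LBA=(?P<lba>\d+)"
)

def summarize(lines):
    """Count retried timeouts and hard failures per (device, operation, LBA)."""
    timeouts, failures = Counter(), Counter()
    for line in lines:
        m = EVENT.match(line.strip())
        if not m:
            continue  # WARNING, g_vfs_done and unrelated lines are skipped
        key = (m["dev"], m["op"], int(m["lba"]))
        if m["kind"] == "FAILURE":
            failures[key] += 1
        else:
            timeouts[key] += 1
    return timeouts, failures

sample = [
    "ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20166335",
    "ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=45085775",
    "ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=19899015",
    "ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=19899019",
    "ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20243439",
    "ad12: TIMEOUT - READ_DMA retrying (0 retries left) LBA=20243439",
    "ad12: FAILURE - READ_DMA timed out LBA=20243439",
    "g_vfs_done():ad12s1a[READ(offset=1774673920, length=2048)]error = 5",
]

timeouts, failures = summarize(sample)
```

In this excerpt the timeouts are spread over five different LBAs, but only LBA 20243439 exhausts its retries and becomes a hard failure. Running the same function over the whole log file (not the mail-wrapped copy) would show whether the pattern holds.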

>> My question is this:  I just bought this 1 month ago.
> This happens.
Tell me about it.

>> does this look to be software based ?
>
> badblocks will prove to you that it is a hardware failure, down to
> exactly which blocks are bad. Especially in write mode.
>
> Regardless, my filesystems looked dirty after the machine locked up.
> Two fsck's later I was still fixing "errors". I kept telling it to fix
> the new errors until the programs themselves began segfaulting. The
> culprit was bad RAM, giving everybody bad information.
>
> The lesson is that fsck on important data is a bad idea until you
> discover the root of the issue; fsck actually ruined more data each
> pass in the above scenario. Thank god for a recent tape backup. I'd
> have lost a lot.

Yeah, that's happened to me before too, on an OS X server.  Not fun.
Thankfully this is only my root drive.  That might sound crazy, but
all PG storage is on the RAID mirror, and my main filestore for files
the system processes is on a third drive.  (It's a system that spiders
social networks, parsing profiles + relationships into standardized
formats, then does analytics to match them against other network
profiles.)

> I already deleted the original message so I'm adding my $0.02 here,
> but one really quick thing to do to narrow this down is to install
> smartmontools and get a reading on the SMART status of the drives in
> question.  If they are reporting bad, case closed.  If not, then go on
> your way - passing SMART does not always mean the drive is actually
> good.
>
> You can also just boot the Seagate tools, they have a bootable ISO
> with their SMART checking tools on them.  They also generally work on
> other drives that support SMART.

That's a really good idea.
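
For scripting that check, here is a rough Python sketch around `smartctl -H`. It assumes smartmontools is installed, that the script runs as root, and that the drive is reachable as `/dev/ad12` (taken from the dmesg, not verified against smartctl). The bit meanings are paraphrased from the smartctl(8) man page; the function names are made up for illustration.

```python
import subprocess

# Meaning of each bit in smartctl's exit status, paraphrased from smartctl(8).
EXIT_BITS = [
    "command line did not parse",
    "device open failed",
    "a SMART or ATA command to the disk failed",
    "SMART status check returned DISK FAILING",
    "prefail attributes at or below threshold",
    "attributes were at or below threshold at some time in the past",
    "device error log contains records of errors",
    "device self-test log contains records of errors",
]

def decode_exit_status(status):
    """Turn smartctl's exit status bitmask into a list of readable findings."""
    if status < 0:
        raise ValueError("smartctl was killed by a signal")
    return [text for bit, text in enumerate(EXIT_BITS) if status & (1 << bit)]

def smart_health(device):
    """Run `smartctl -H` on device and return (raw output, decoded findings)."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return result.stdout, decode_exit_status(result.returncode)
```

Calling `smart_health("/dev/ad12")` would return the raw report plus a list of findings. As noted in the quoted advice, an empty list does not prove the drive is healthy; it only means SMART has nothing to report.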

The last 5 errors were all variations of:
	40 51 00 d8 53 53 44  Error: UNC at LBA = XXXX = XXXX


On May 8, 2007, at 12:42 AM, Miles Nordin wrote:
> Another, maybe more decisive, method: you can start 'smartctl -t long'
> to tell the drive to test itself.  The output will tell you the
> ``recommended polling interval,'' which is about how long the test
> will take.  This will be about 1 - 4 hours.  smartctl returns
> immediately, and the drive tests itself in the background.

OK.  I'll install FreeBSD on another drive so I can run -t long on
this one, or try the Seagate CD.


This is quite possibly the best series of response emails I have ever
read on a listserv.  I'm just amazed at the knowledge here.
Thank you all GREATLY.

And I'd like to suggest that this be tossed on a wiki somewhere, for
supreme google-ability, because nothing this good is on Google
right now.



=====
/var/log/dmesg.yesterday
=====

Uptime: 6d2h22m59s
Rebooting...
cpu_reset: Stopping other CPUs
Copyright (c) 1992-2007 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
         The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-RELEASE #0: Fri Jan 12 11:05:30 UTC 2007
     root at dessler.cse.buffalo.edu:/usr/obj/usr/src/sys/SMP
acpi_alloc_wakeup_handler: can't alloc wake memory
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(R) CPU            3040  @ 1.86GHz (1876.00-MHz 686-class CPU)
   Origin = "GenuineIntel"  Id = 0x6f6  Stepping = 6
   Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
   Features2=0xe3bd<SSE3,RSVD2,MON,DS_CPL,VMX,EST,TM2,<b9>,CX16,<b14>,<b15>>
   AMD Features=0x20100000<NX,LM>
   AMD Features2=0x1<LAHF>
   Cores per package: 2
real memory  = 3479298048 (3318 MB)
avail memory = 3404574720 (3246 MB)
ACPI APIC Table: <INTEL  DQ965GF >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID:  0
cpu1 (AP): APIC ID:  1
ioapic0: Changing APIC ID to 2
ioapic0 <Version 2.0> irqs 0-23 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0: <INTEL DQ965GF> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: <ACPI CPU> on acpi0
acpi_perf0: <ACPI CPU Frequency Control> on cpu0
acpi_perf0: failed in PERF_STATUS attach
device_attach: acpi_perf0 attach returned 6
acpi_perf0: <ACPI CPU Frequency Control> on cpu0
acpi_perf0: failed in PERF_STATUS attach
device_attach: acpi_perf0 attach returned 6
acpi_throttle0: <ACPI CPU Throttling> on cpu0
cpu1: <ACPI CPU> on acpi0
acpi_perf1: <ACPI CPU Frequency Control> on cpu1
acpi_perf1: failed in PERF_STATUS attach
device_attach: acpi_perf1 attach returned 6
acpi_perf1: <ACPI CPU Frequency Control> on cpu1
acpi_perf1: failed in PERF_STATUS attach
device_attach: acpi_perf1 attach returned 6
acpi_throttle1: <ACPI CPU Throttling> on cpu1
acpi_throttle1: failed to attach P_CNT
device_attach: acpi_throttle1 attach returned 6
acpi_button0: <Sleep Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <display, VGA> at device 2.0 (no driver attached)
pci0: <simple comms> at device 3.0 (no driver attached)
atapci0: <GENERIC ATA controller> port 0x3130-0x3137,0x314c-0x314f,0x3128-0x312f,0x3148-0x314b,0x3100-0x310f irq 18 at device 3.2 on pci0
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
pci0: <simple comms, UART> at device 3.3 (no driver attached)
em0: <Intel(R) PRO/1000 Network Connection Version - 6.2.9> port 0x30e0-0x30ff mem 0xe0300000-0xe031ffff,0xe0320000-0xe0320fff irq 20 at device 25.0 on pci0
em0: Ethernet address: 00:19:d1:25:0e:66
uhci0: <UHCI (generic) USB controller> port 0x30c0-0x30df irq 16 at device 26.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: <UHCI (generic) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <UHCI (generic) USB controller> port 0x30a0-0x30bf irq 21 at device 26.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: <UHCI (generic) USB controller> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xe0322c00-0xe0322fff irq 18 at device 26.7 on pci0
ehci0: [GIANT-LOCKED]
usb2: EHCI version 1.0
usb2: companion controllers, 2 ports each: usb0 usb1
usb2: <EHCI (generic) USB 2.0 controller> on ehci0
usb2: USB revision 2.0
uhub2: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub2: 4 ports with 4 removable, self powered
pcib1: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> at device 28.1 on pci0
pci2: <ACPI PCI bus> on pcib2
atapci1: <GENERIC ATA controller> port 0x2018-0x201f,0x2024-0x2027,0x2010-0x2017,0x2020-0x2023,0x2000-0x200f mem 0xe0100000-0xe01001ff irq 17 at device 0.0 on pci2
ata4: <ATA channel 0> on atapci1
ata5: <ATA channel 1> on atapci1
pcib3: <ACPI PCI-PCI bridge> at device 28.2 on pci0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> at device 28.3 on pci0
pci4: <ACPI PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> at device 28.4 on pci0
pci5: <ACPI PCI bus> on pcib5
uhci2: <UHCI (generic) USB controller> port 0x3080-0x309f irq 23 at device 29.0 on pci0
uhci2: [GIANT-LOCKED]
usb3: <UHCI (generic) USB controller> on uhci2
usb3: USB revision 1.0
uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
uhci3: <UHCI (generic) USB controller> port 0x3060-0x307f irq 19 at device 29.1 on pci0
uhci3: [GIANT-LOCKED]
usb4: <UHCI (generic) USB controller> on uhci3
usb4: USB revision 1.0
uhub4: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub4: 2 ports with 2 removable, self powered
uhci4: <UHCI (generic) USB controller> port 0x3040-0x305f irq 18 at device 29.2 on pci0
uhci4: [GIANT-LOCKED]
usb5: <UHCI (generic) USB controller> on uhci4
usb5: USB revision 1.0
uhub5: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub5: 2 ports with 2 removable, self powered
ehci1: <EHCI (generic) USB 2.0 controller> mem 0xe0322800-0xe0322bff irq 23 at device 29.7 on pci0
ehci1: [GIANT-LOCKED]
usb6: EHCI version 1.0
usb6: companion controllers, 2 ports each: usb3 usb4 usb5
usb6: <EHCI (generic) USB 2.0 controller> on ehci1
usb6: USB revision 2.0
uhub6: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub6: 6 ports with 6 removable, self powered
pcib6: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci6: <ACPI PCI bus> on pcib6
em1: <Intel(R) PRO/1000 Network Connection Version - 6.2.9> port 0x1000-0x103f mem 0xe0020000-0xe003ffff,0xe0000000-0xe001ffff irq 22 at device 1.0 on pci6
em1: Ethernet address: 00:0e:0c:d0:15:c0
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci2: <Intel ICH8 SATA300 controller> port 0x3118-0x311f,0x3144-0x3147,0x3110-0x3117,0x3140-0x3143,0x3020-0x303f mem 0xe0322000-0xe03227ff irq 19 at device 31.2 on pci0
atapci2: AHCI Version 01.10 controller with 6 ports detected
ata6: <ATA channel 0> on atapci2
ata7: <ATA channel 1> on atapci2
ata8: <ATA channel 2> on atapci2
ata9: <ATA channel 3> on atapci2
ata10: <ATA channel 4> on atapci2
ata11: <ATA channel 5> on atapci2
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
ppc0: <ECP parallel printer port> port 0x378-0x37f,0x778-0x77f irq 7 on acpi0
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
pmtimer0 on isa0
orm0: <ISA Option ROM> at iomem 0xd0800-0xd17ff on isa0
ata0 at port 0x1f0-0x1f7,0x3f6 irq 14 on isa0
ata1 at port 0x170-0x177,0x376 irq 15 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
ad12: 76319MB <WDC WD800JD-55MSA1 10.01E01> at ata6-master SATA300
ad14: 476940MB <WDC WD5000YS-01MPB1 09.02E09> at ata7-master SATA300
ad20: 476940MB <WDC WD5000YS-01MPB1 09.02E09> at ata10-master SATA300
ar0: 476937MB <Intel MatrixRAID RAID1> status: READY
ar0: disk0 READY (master) using ad14 at ata7-master
ar0: disk1 READY (mirror) using ad20 at ata10-master
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad12s1a
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20542527
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=53742935
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=148505343
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=73251691
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20166335
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=45085775
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=19899015
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=19899019
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20243439
ad12: TIMEOUT - READ_DMA retrying (0 retries left) LBA=20243439
ad12: FAILURE - READ_DMA timed out LBA=20243439
g_vfs_done():ad12s1a[READ(offset=1774673920, length=2048)]error = 5
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=65391579
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=64198335
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=42043543
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=23248891
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20243439
ad12: TIMEOUT - READ_DMA retrying (0 retries left) LBA=20243439
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=81608107
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=19900027
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=19866907
ad12: TIMEOUT - READ_DMA retrying (0 retries left) LBA=19866907
ad12: FAILURE - READ_DMA timed out LBA=19866907
g_vfs_done():ad12s1a[READ(offset=1581889536, length=2048)]error = 5
ad12: TIMEOUT - READ_DMA retrying (1 retry left) LBA=24761387
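
The two g_vfs_done errors in the log can be tied back to the READ_DMA failures with a little arithmetic. The sketch below is Python and assumes 512-byte sectors, which the log does not state. It converts each byte offset within ad12s1a to sectors and subtracts that from the LBA on the preceding FAILURE line; if both pairs imply the same partition start, the filesystem errors and the drive timeouts are the same reads. The helper name `partition_start` is made up for illustration.

```python
SECTOR = 512  # bytes per sector; an assumption, not stated in the log

# (byte offset within ad12s1a from g_vfs_done, LBA from the FAILURE line before it)
failures = [
    (1774673920, 20243439),
    (1581889536, 19866907),
]

def partition_start(offset_bytes, lba, sector=SECTOR):
    """Sector where the partition would have to begin for offset and LBA to agree."""
    sectors, remainder = divmod(offset_bytes, sector)
    if remainder:
        raise ValueError("offset is not sector-aligned")
    return lba - sectors

starts = [partition_start(offset, lba) for offset, lba in failures]
print(starts)  # [16777279, 16777279]
```

Both pairs give sector 16777279, so the I/O errors (error = 5) that the filesystem saw correspond to exactly the reads that ran out of retries at the ATA layer. The implied start could be compared against the output of fdisk and bsdlabel for ad12, which was not posted here.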

// Jonathan Vanasco

| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -
| SyndiClick.com
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -
|      FindMeOn.com - The cure for Multiple Web Personality Disorder
|      Web Identity Management and 3D Social Networking
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -
|      RoadSound.com - Tools For Bands, Stuff For Fans
|      Collaborative Online Management And Syndication Tools
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -




