[nycbug-talk] regular hardware troubleshooting/monitoring
George R.
george
Thu Jun 23 01:33:15 EDT 2005
I have noticed that the further I get from my youthful hardware-building
apprentice days of yore, the less likely I am to realize, at a reasonable
pace, that a brick wall I am blaming on software is actually a hardware
problem.
This has happened a few times recently, most notably on a FreeBSD 5.3/5.4
box that would bomb out while updating src. It would fail at a different
point in the process each time, and it was killing me. Hardware
seemed like the least likely problem. I was ready to open up a pizza
place, and have my BSD career tombstone end with "victim of 5.3".
I ran memtest and found that one of the brand-new RAM modules was dead,
but I am convinced, maybe wrongly, that in my earlier days, I would have
looked at hardware immediately.
Anyway, Ike and I were discussing this tonight, and maybe it would make
sense to work out some daily tests for production hardware that send a
notification when an error turns up, say, from memtest or fsck, beyond
what already comes out in the daily/weekly/monthly emails.
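
To make the idea concrete, here is a rough sketch of what I have in mind,
not a tested tool. It runs a list of check commands and builds a report
only when something fails. The names run_check, build_report and
EXAMPLE_CHECKS are made up for illustration, and the smartctl entry (with
its /dev/ad0 device) is only a placeholder for whatever test you trust on
your own box; it assumes smartmontools is installed.

```python
import subprocess
import sys

# Hypothetical example list: (label, command) pairs. Swap in whatever
# checks make sense locally; the device name here is an assumption.
EXAMPLE_CHECKS = [
    ("disk health", ["smartctl", "-H", "/dev/ad0"]),
]


def run_check(name, argv, timeout=3600):
    """Run one check command and return (name, ok, output).

    A check counts as failed if it exits non-zero, cannot be started,
    or runs longer than `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return (name, False, str(exc))
    return (name, proc.returncode == 0, proc.stdout)


def build_report(results):
    """Return a plain-text report of failed checks, or None if all passed."""
    failed = [r for r in results if not r[1]]
    if not failed:
        return None
    lines = ["%d of %d checks failed" % (len(failed), len(results)), ""]
    for name, _ok, output in failed:
        lines.append("== %s ==" % name)
        lines.append(output.strip())
        lines.append("")
    return "\n".join(lines)


def main(checks):
    """Run all checks; print a report and return 1 only on failure."""
    report = build_report([run_check(n, argv) for n, argv in checks])
    if report is None:
        return 0  # stay silent when everything is fine
    print(report)
    return 1
```

Called from cron as sys.exit(main(EXAMPLE_CHECKS)), this stays quiet on a
clean run, and since cron mails any output a job produces, a failure
shows up in the inbox without the script having to talk to a mail server.
Note that memtest itself boots standalone, so it would not fit into a
nightly job like this.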
Of course I enjoyed the old uptime project (and isn't there a variant
around today, still free?), but a lack of ping responses doesn't do it,
or I'd sort that out myself with my own infrastructure.
Any thoughts on this? Does it make sense to have regular tests running
on production hardware? Does anyone do this in their own environment,
outside of whatever the various full-scale open and closed source
products already do on hardware monitoring?
George