[nycbug-talk] regular hardware troubleshooting/monitoring

Thu Jun 23 01:33:15 EDT 2005

I have noticed that the further I get from my youthful hardware-building 
apprentice days of yore, the less likely I am to realize, at a reasonable 
pace, that a brick wall I am attributing to software is actually 
hardware.

This has happened a few times recently, most notably on a FBSD 5.3, 5.4 
box that would bomb out while updating src.  It would fail at a 
different point in the process each time, and it was killing me.  Hardware 
seemed like the least likely problem.  I was ready to open up a pizza 
place, and have my BSD career tombstone end with "victim of 5.3".

I ran memtest and found that one of the brand-new RAM modules was dead, 
but I am convinced, maybe wrongly, that in my earlier days, I would have 
looked at hardware immediately.

Anyway, Ike and I were discussing this tonight, and maybe it would make 
sense to set up some daily tests on production hardware, say with memtest 
or fsck, that would send a notification when an error occurs, beyond 
what already comes out in the daily/weekly/monthly emails.
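
To make the idea concrete, here is a rough sketch of such a daily job in Python. It is only an illustration under assumptions: the check names, the smartctl health check, the device /dev/ad0, and the read-only fsck run are placeholders that do not come from any existing setup, and whether a given test is safe or meaningful on a live, mounted system needs to be verified per machine. The sketch runs each command, treats a non-zero exit status (or a missing or hung command) as a failure, and produces output only when something failed.

```python
import subprocess
import sys

# Hypothetical check list: these commands and the device name are
# placeholders chosen for illustration. Adjust them for the machine.
CHECKS = {
    "smart-health": ["smartctl", "-H", "/dev/ad0"],
    "fsck-readonly": ["fsck", "-n", "/"],
}


def run_check(cmd, timeout=3600):
    """Run one command; return (ok, combined stdout/stderr text)."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary or a hung test counts as a failure too.
        return False, str(exc)
    return proc.returncode == 0, proc.stdout


def failure_report(checks):
    """Return a text report of the failed checks, or None if all passed."""
    failed = []
    for name, cmd in checks.items():
        ok, output = run_check(cmd)
        if not ok:
            failed.append("== %s FAILED ==\n%s" % (name, output.strip()))
    return "\n\n".join(failed) if failed else None


def main():
    """Print a report and exit non-zero only when a check failed."""
    report = failure_report(CHECKS)
    if report:
        print(report)
        sys.exit(1)
```

With a call to main() added at the bottom and the script run from cron, a clean run prints nothing, so mail would only show up when a check fails, since cron normally mails whatever a job prints to the job's owner.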

Of course I enjoyed the old uptime project (and isn't there a variant 
around today, still free?), but a lack of ping responses doesn't do it, 
or I'd sort that out myself with my own infrastructure.

Any thoughts on this?  Does it make sense to have regular tests running 
on production hardware?  Does anyone do this in their own environment, 
outside of whatever the various full-scale open and closed source 
products already do on hardware monitoring?

George