[nycbug-talk] regular hardware troubleshooting/monitoring
ike at lesmuug.org
Thu Jun 23 02:45:01 EDT 2005
Hi David, All,
On Jun 23, 2005, at 1:50 AM, David Rio Deiros wrote:
> On Thu, Jun 23, 2005 at 01:33:15AM -0400, George R. wrote:
>> Anyway, Ike and I were discussing this evening, and maybe it would
>> make sense to figure out some daily tests on production hardware that
>> would send a notification when an error occurs, say, with memtest or fsck.
>> Outside of what already comes out in the daily/weekly/monthly emails.
>> Any thoughts on this? Does it make sense to have regular tests
>> on production hardware? Does anyone do this in their own environment,
>> outside of whatever the various full-scale open and closed source
>> products already do on hardware monitoring?
> This is what I think about it:
> I cannot see how to test the memory without rebooting the machine. Even
> if you could do it, you would have to modify the memtest code to send
> you the results (via smtp, http, etc.), which would be tough considering
> that (I am not sure) memtest is written in assembly.
> Regarding the CPU, pretty much the same.... Well... you can actually
> run programs like cpuburn, but those are going to put your CPU at 0%
> idle, something you don't want on a production server. Besides, cpuburn
> has to run for a couple of days at least in order to verify that the CPU
> is going to work fine under extreme conditions.
> Regarding the hard drives, you have the SMART feature that
> comes with all hard drives nowadays. Again, if you make
> your hd run tests, that is going to reduce performance. If
> you know when you can tolerate that, then you can safely run them.
All valid points- basically that proper hardware tests involve
reducing/impairing system performance... *but* what if these tests were
run on clusters of boxes? i.e. even at my small scale with my
operations, I could totally afford to drop a server for testing, while
running concurrent services on another box- (in my case, jumping jails
Just food for thought on this issue- it seems to me this gets even more
realistic as things like BGP/multihoming and CARP based systems make it
'easy' to run tests which affect hardware performance...
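To make the "notify only when an error occurs" idea a bit more concrete, here is a minimal sketch in Python of a daily SMART health check. It assumes smartmontools' smartctl is installed; the device list and the names health_ok and check_devices are made up for illustration. It only runs the quick `-H` health query rather than a full self-test, so it should not hurt performance much.

```python
import subprocess

# Hypothetical device list; adjust for the box being checked.
DEVICES = ["/dev/ad0"]


def health_ok(smartctl_output):
    """Return True if `smartctl -H` output reports a passing health status."""
    for line in smartctl_output.splitlines():
        # ATA drives print "... test result: PASSED";
        # SCSI drives print "SMART Health Status: OK".
        if "test result:" in line:
            return line.split(":", 1)[1].strip() == "PASSED"
        if "SMART Health Status:" in line:
            return line.split(":", 1)[1].strip() == "OK"
    # No recognizable status line: treat it as a failure so it gets reported.
    return False


def check_devices(devices, run=None):
    """Return the devices whose health check did not pass.

    `run` maps a device name to smartctl's text output; it is injectable
    so the logic can be exercised without real hardware.
    """
    if run is None:
        def run(dev):
            result = subprocess.run(
                ["smartctl", "-H", dev], capture_output=True, text=True
            )
            return result.stdout
    return [dev for dev in devices if not health_ok(run(dev))]


if __name__ == "__main__":
    failed = check_devices(DEVICES)
    if failed:
        # Print only on failure, so cron sends mail only when something is wrong.
        print("SMART health check failed for: " + ", ".join(failed))
```

Run from cron once a day, this stays silent while the drives report healthy, and cron mails the output when one does not. The heavier tests (memtest, cpuburn, long SMART self-tests) would still need the box pulled out of service first.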
/me thinking out loud...