[nycbug-talk] regular hardware troubleshooting/monitoring

Isaac Levy ike
Thu Jun 23 02:45:01 EDT 2005

Hi David, All,

On Jun 23, 2005, at 1:50 AM, David Rio Deiros wrote:

> On Thu, Jun 23, 2005 at 01:33:15AM -0400, George R. wrote:
>> Anyway, Ike and I were discussing this evening, and maybe it would make
>> sense to figure out some daily tests on production hardware that would
>> send a notification when an error occurs, say, with memtest or fsck,
>> outside of what already comes out in the daily/weekly/monthly emails.
>> Any thoughts on this?  Does it make sense to have regular tests running
>> on production hardware?  Does anyone do this in their own environment,
>> outside of whatever the various full-scale open and closed source
>> products already do on hardware monitoring?
> This is what I think about it:
> I cannot see how to test the memory without rebooting the machine. Even
> if you could, you would have to modify the memtest code to send you the
> results (via SMTP, HTTP, etc.), which would be tough considering that
> memtest is (I am not sure) written in assembly.
> Regarding the CPU, pretty much the same. Well, you can actually run
> programs like cpuburn, but those are going to put your CPU at 0% idle,
> something you don't want on a production server. Besides, cpuburn has
> to run for a couple of days at least in order to verify that the CPU
> is going to work fine under extreme load.
> Regarding the hard drives, you have the SMART feature that comes with
> all hard drives nowadays. Again, if you make your drive run tests, that
> is going to reduce performance. If you know when you can tolerate that,
> then you can safely run them.
> David

All valid points: basically, proper hardware tests involve
reducing/impairing system performance... *but* what if these tests were
run on clusters of boxes?  i.e. even at my small scale of operations, I
could totally afford to drop a server for testing while running
concurrent services on another box (in my case, jumping jails across
boxen).

Just food for thought on this issue: it seems to me this gets even more
realistic as things like BGP/multihoming and CARP-based systems make it
'easy' to run tests which affect hardware performance...
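
To make the "only notify when an error occurs" idea concrete for the SMART
case, here is a rough sketch, not a tested production script. It assumes
smartmontools is installed and that `smartctl -H` prints its usual result
line; the device name /dev/ad0 and the function names are hypothetical, and
it relies on cron mailing any output to the job owner.

```python
import subprocess
from typing import Optional


def smart_health_failed(smartctl_output: str) -> bool:
    """Return True unless the `smartctl -H` output reports a passing result."""
    for line in smartctl_output.splitlines():
        # Assumed ATA wording: "SMART overall-health self-assessment test result: PASSED"
        if "self-assessment test result" in line:
            return "PASSED" not in line
        # Assumed SCSI wording: "SMART Health Status: OK"
        if "SMART Health Status" in line:
            return "OK" not in line
    # No recognizable result line: treat it as a failure so someone looks at it.
    return True


def check_disk(device: str) -> Optional[str]:
    """Run `smartctl -H` on a device; return a report only if something looks wrong."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    if smart_health_failed(result.stdout):
        return f"SMART health check failed on {device}:\n{result.stdout}"
    return None


if __name__ == "__main__":
    report = check_disk("/dev/ad0")  # hypothetical device name
    if report:
        print(report)  # silent on success, so cron only sends mail on failure
```

The health query itself is cheap; the longer SMART self-tests are the ones
that cost performance, so those would be the ones to schedule on a box that
has been pulled out of rotation.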

/me thinking out loud...

