[nycbug-talk] regular hardware troubleshooting/monitoring

Ray nycbug
Thu Jun 23 10:56:20 EDT 2005


On Thu, Jun 23, 2005 at 10:42:45AM -0400, George Georgalis wrote:
> On Thu, Jun 23, 2005 at 10:13:41AM -0400, Ray wrote:
> >On Wed, Jun 22, 2005 at 10:50:45PM -0700, David Rio Deiros wrote:
> >> I cannot see how to test the memory without rebooting the machine.
> 
> recompile a kernel 20 times and pipe stderr/stdout to a file,
> compare file sizes... pretty darn effective, for a running machine

You'd probably want to do some minimal checksumming, too.
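
A rough sketch of that idea in Python, in case it helps: run the same
command repeatedly, hash the combined stdout/stderr, and report any run
that differs from the first. The function names are made up for
illustration, and it assumes the command's output is deterministic on
healthy hardware. Build logs with embedded timestamps or interleaved
parallel-make output would show differences that have nothing to do with
bad memory, so treat a mismatch as a hint, not proof.

```python
import hashlib
import subprocess


def run_and_hash(cmd):
    """Run cmd once; return (exit status, SHA-256 of combined stdout/stderr)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return proc.returncode, hashlib.sha256(proc.stdout).hexdigest()


def repeat_and_compare(cmd, runs=20):
    """Run cmd `runs` times; return indices of runs that differ from run 0.

    A run differs if either its exit status or its output hash does not
    match the first run. An empty list means every run agreed.
    """
    baseline = run_and_hash(cmd)
    mismatches = []
    for i in range(1, runs):
        if run_and_hash(cmd) != baseline:
            mismatches.append(i)
    return mismatches
```

Called as, say, repeat_and_compare(["make", "-s"], runs=20) from a build
directory that is cleaned between runs (the command here is only a
placeholder), an empty list is what you would hope to see.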

> How about a real world problem? I've got a box that I cannot identify
> the problem with. The cause is probably from pushing the bus and clock
> rates to a point where it remained stable, a few years ago. The cpu is an
> AMD 750 clocked to 950 (they are good at that), it's a VIA chipset with
> a little heatsink required on the 'south bridge' chip (disk controller,
> sound, etc). Well that fell off a while ago and didn't cause any
> immediate problems (and I mean in a tight, warm, installation, for many
> months after I discovered it)...
> 
> Now this tower, running open, locks up after anywhere from 3 to 48
> hours, with no unusual activity (disk, cpu, etc). The south bridge chip
> isn't hot to the touch, and I often run with a SATA disk and no ATA
> disk. Memtest86 doesn't fail even after long runs.
> 
> So how can I test/salvage? My guess is it's the south bridge, but short
> of investing $30 in Arctic Silver glue to see if the problem goes away
> (which I doubt because that chip doesn't really get hot), I'm not sure
> how to tell, ditto for the cpu, don't want to replace if it's not broke
> and that's got a nice new fan on it... so what is broke? and how can I
> tell?

Swap parts with known-good spare hardware, one at a time, and see
whether the lockups stop.
