Marco Scoffier <marco at metm.org> wrote:
> I have a server, which has been solid for years (yes years)
> I put it into a colo and it has started randomly powering off, 
> yes completely off.

I had a similar case.  The machine would power off a while after
booting.  At first I thought it was a memory problem, so I ran
memtest and found that it always shut down at a certain point in
the test, which tempted me to believe the memory really was bad.
But after I replaced the memory modules, the problem persisted.

Actually, I even asked about this on the nylug list:

By then I had almost given up on the machine.  But a week or so
later, I found that the buckle on the heatsink had come loose, so
the heatsink was not pressed tightly against the processor.  This
made the processor overheat whenever its load exceeded a certain
level, which led to the sudden shutdown.

Since I pressed the buckle lever firmly back down, the machine
has run perfectly fine.
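
If you want to check for overheating before opening the case, a
small logger like the one below may help.  It is only a sketch,
and it assumes a Linux kernel that exposes sensors under
/sys/class/thermal/thermal_zone*/temp in millidegrees Celsius;
not every board does.  The function names, the log file name and
the 5-second interval are my own placeholders.  Each line is
synced to disk, so the last reading should survive a sudden
power-off.

```python
import glob
import os
import time


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for each readable thermal zone.

    Assumes the sysfs layout <base>/thermal_zone*/temp, where each
    file holds an integer in millidegrees Celsius.
    """
    temps = {}
    pattern = os.path.join(base, "thermal_zone*", "temp")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
        zone = os.path.basename(os.path.dirname(path))
        temps[zone] = millideg / 1000.0
    return temps


def log_temps(logfile, interval=5.0, base="/sys/class/thermal", rounds=None):
    """Append timestamped readings to logfile; rounds=None loops forever."""
    done = 0
    with open(logfile, "a") as out:
        while rounds is None or done < rounds:
            temps = read_zone_temps(base)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            fields = " ".join(f"{z}={t:.1f}C" for z, t in temps.items())
            out.write(f"{stamp} {fields}\n")
            out.flush()
            os.fsync(out.fileno())  # force the line onto disk right away
            done += 1
            if rounds is None or done < rounds:
                time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical log file name; run this while the machine is under load.
    log_temps("cpu_temp.log")
```

Run it while memtest-like load is not possible (memtest runs
outside the OS), for example during a kernel compile, and look at
the last lines of the log after the machine dies.  Temperatures
climbing steadily right up to the end would point at cooling
rather than memory.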


