[talk] zfs disk outage and aftermath

Jonathan jonathan at kc8onw.net
Wed Feb 6 18:06:04 EST 2019


I've had a home setup with consumer drives in a horrible high-vibration
environment for almost 10 years now, with multiple reconfigurations and
migrations of the array. I've probably lost 6 or 7 drives over that
time, and I have yet to lose data because of it. I did have one file
that had bitrot, and then I lost a drive and had to restore it from the
original media, but that's why scrubs are so important.
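
For anyone who hasn't automated that yet: a scrub is a one-liner, and
FreeBSD's periodic(8) can run it on a schedule. A minimal sketch, with
"tank" standing in for whatever your pool is called (the periodic.conf
knobs below are from memory, so double-check them on your system):

    # kick off a scrub by hand and watch its progress
    zpool scrub tank
    zpool status tank

    # or enable the stock periodic job in /etc/periodic.conf
    # (the threshold is the number of days between scrubs)
    daily_scrub_zfs_enable="YES"
    daily_scrub_zfs_default_threshold="35"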

Jonathan

On 2019-02-04 16:56, John C. Vernaleo wrote:
> I read the subject of this and had that sinking feeling in my stomach,
> as I'm becoming more and more reliant on ZFS and no email with 'disk'
> in the subject line is ever good news.  I was afraid there would be a
> horror story to make me rethink my reliance on ZFS.
> 
> I was pleasantly surprised to see I was wrong.  And that health check
> script looks like a really good idea.
> 
> John
> 
> -------------------------------------------------------
> John C. Vernaleo, Ph.D.
> www.netpurgatory.com
> john at netpurgatory.com
> -------------------------------------------------------
> 
> On Mon, 4 Feb 2019, N.J. Thomas wrote:
> 
>> Well, yesterday morning I suffered my first drive failure on one of my
>> ZFS boxes (running FreeBSD 12.0); it actually happened on my primary
>> backup server.
>> 
>> "zpool status" showed that my regularly scheduled scrub had found (and
>> fixed) some errors on one of the disks in a mirrored pair.
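>> 
>> Side note: "zpool status -v" is worth knowing here; on top of the
>> per-device read/write/checksum counters it also lists any files hit
>> by unrecoverable errors. The pool name below is only a placeholder:
>> 
>>     zpool status -v tank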
>> 
>> I made sure that my replica box had up-to-date snapshots transferred
>> over, shut down the machine, and asked the datacenter team to check
>> the drive. They indeed found that it was faulty and replaced it.
>> 
>> It took about 4 hours for the drive to be resilvered, and that was it.
>> Everything was back to normal with almost no issues -- apart from the
>> few minutes that the machine was down while its drive was being
>> replaced.
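>> 
>> For anyone who hasn't been through a replacement, the pool-side part
>> is just two commands once the new disk is in; the device names here
>> are made up, and "zpool status" reports the resilver's progress and a
>> rough time estimate:
>> 
>>     # tell ZFS the failed disk (da2) was swapped out for da3
>>     zpool replace tank da2 da3
>>     zpool status tank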
>> 
>> My takeaways:
>> 
>>    - use ZFS
>> 
>>    - take regular snapshots
>> 
>>    - replicate your snapshots to another machine (a minimal
>>      send/receive example follows this list)
>> 
>>    - scrub your disks regularly (unlike fsck, a scrub can run while
>>      the pool is mounted and active)
>> 
>>    - monitor zfs health (I use this script from Calomel.org:
>>      https://calomel.org/zfs_health_check_script.html)
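>> 
>> For the replication and monitoring points, a bare-bones sketch; the
>> snapshot, host, and pool names are placeholders, and the Calomel
>> script checks quite a bit more than this:
>> 
>>     # snapshot everything, then ship the increment to the replica box
>>     zfs snapshot -r tank@2019-02-04
>>     zfs send -R -i tank@2019-02-03 tank@2019-02-04 | \
>>         ssh replica zfs receive -d -F backup
>> 
>>     # "zpool status -x" prints "all pools are healthy" when there is
>>     # nothing wrong, which makes it easy to alert on
>>     zpool status -x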
>> 
>> The first three points are kinda obvious; the last two I picked up
>> from other, more experienced ZFS users.
>> 
>> I had been waiting for this day since I first started using ZFS years
>> ago, and I am very happy with my decision to use this filesystem.
>> 
>> Thomas
>> 