[talk] Climate Mirror

Wed Dec 14 17:39:16 EST 2016

On 12/14/2016 12:25 PM, Isaac (.ike) Levy wrote:
> Word,
> 
>> On Dec 14, 2016, at 2:15 PM, Thomas Levine <_ at thomaslevine.com>
>> wrote:
>> 
>> The data don't need to be online;
> 
> Very creative- I like that premise.
> 
>> save them to a redundant bunch of cheap hard drives (or maybe
>> tapes), and distribute them among lots of bookshelves. They can
>> even be slow and small hard drives pulled from old computers; we
>> need to write to each one only once, we might need to read from
>> each one once, and we otherwise only need to turn them on once 
>> every couple years to make sure that they're still intact. Maintain
>> a website with a list of the datasets, the datasets' checksums, and
>> the contact information for the people with the hard drives on
>> their bookshelves.
> 
> Actually, just to get in the weeds and get all BSD on this:
> 
> ZFS mirrors of cheap/crappy old drives would likely go a long way
> toward “on-shelf” preservation. Basically, pulling down the data to a
> ZFS mirror could help mitigate bit-rot, and “checking” the data after
> a year could be literally plugging in the drives and performing a
> ZFS scrub to look for dead/bad blocks and repair them from the
> mirrored copy.
> 
> I haven’t thought of ZFS in this offlined context, but this approach
> almost seems too easy.
> 
> Without going nuts on the relative merits of various ZFS block
> replication schemes, can anyone poke holes in why this may
> overcomplicate the idea?  I’m all ears…

Well, it is my understanding that ZFS mitigates bit-rot. But I actually
know nothing about it.
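
Whatever the filesystem does, the checksum list from the original
proposal gives a check that works on any drive. Here is a rough Python
sketch of it, assuming a manifest with one `<sha256>  <relative path>`
line per file (the layout `sha256sum` prints). The function names and
the manifest layout are my own illustration, not anything the project
has defined.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB pieces so large datasets never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def write_manifest(root, manifest_path):
    """Record one '<sha256>  <relative path>' line per file under root."""
    root = Path(root)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        lines.append(f"{sha256_of(path)}  {path.relative_to(root).as_posix()}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")


def verify_manifest(root, manifest_path):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    bad = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

Run `write_manifest` once when the drive is filled, keeping the
manifest outside the dataset directory, and `verify_manifest` whenever
the drive comes off the shelf. An empty list means every file still
matches.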

However, offline storage will be essential, I think. There's a lot of
data to replicate. As I argued in another subthread, BT seedboxes could
seed data in 1TB chunks. And once there are other seeders, they could
take their copy offline, and seed the next chunk.
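
To make the chunking concrete, here is a small Python sketch that walks
a list of files in order and groups them into chunks of at most 1TB.
The `plan_chunks` name, the (name, size) input format, and the decimal
1TB default are assumptions for illustration only.

```python
def plan_chunks(files, chunk_bytes=10**12):
    """Group (name, size_in_bytes) pairs, in order, into chunks of at most chunk_bytes.

    A single file larger than chunk_bytes gets a chunk to itself.
    """
    chunks, current, used = [], [], 0
    for name, size in files:
        # Close the current chunk if this file would push it over the limit.
        if current and used + size > chunk_bytes:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk could then be seeded as its own torrent and taken
offline once enough other seeders have it.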

Also, with huge datasets, you can get an effective ~20Gbps by shipping
a bunch of disks or tapes.
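
The arithmetic behind a figure like that is easy to check. A quick
Python sketch, using decimal units and a hypothetical shipment size and
transit time of my own choosing (not measured numbers):

```python
def effective_gbps(total_terabytes, transit_hours):
    """Average throughput of physically shipping data, in gigabits per second.

    Uses decimal units: 1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s.
    """
    bits = total_terabytes * 10**12 * 8
    seconds = transit_hours * 3600
    return bits / seconds / 10**9


def terabytes_needed(target_gbps, transit_hours):
    """How much data a shipment must carry to average target_gbps."""
    return target_gbps * 10**9 * transit_hours * 3600 / 8 / 10**12


if __name__ == "__main__":
    # Hypothetical shipment: 216 TB of drives delivered in 24 hours.
    print(effective_gbps(216, 24))
    # Data needed to average 20 Gbps over a 24-hour delivery.
    print(terabytes_needed(20, 24))
```

This counts only transit time. Writing the drives at one end and
reading them at the other adds to the clock, so the real average would
be lower.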

> (This incidentally is something I could start tonight from home,
> using a drawer full of flaky old 2TB drives…)
> 
>> 
>> Note that this is my opinion only on how this project could be 
>> implemented. I don't know enough about the datasets or the likely 
>> effects of geopolitics on their implementation in order to comment
>> as to whether I think the project should be implemented.
> 
> I’m with you on that; I’m happy to trust the original project
> directors’ lists of what’s most important and most relevant to climate
> scientists.  Lots of what’s listed is NOAA and NASA; the value of the
> data seems self-evident to me just from the names of the data sets.
> 
> Best, .ike
> 
> _______________________________________________
> talk mailing list
> talk at lists.nycbug.org
> http://lists.nycbug.org/mailman/listinfo/talk