[talk] Climate Mirror

Wed Dec 14 14:25:16 EST 2016

Word,

> On Dec 14, 2016, at 2:15 PM, Thomas Levine <_ at thomaslevine.com> wrote:
> 
> The data don't need to be online;

Very creative; I like that premise.

> save them to a redundant bunch of
> cheap hard drives (or maybe tapes), and distribute them among lots of
> bookshelves. They can even be slow and small hard drives pulled from old
> computers; we need to write to each one only once, we might need to read
> from each one once, and we otherwise only need to turn them on once
> every couple years to make sure that they're still intact. Maintain a
> website with a list of the datasets, the datasets' checksums, and the
> contact information for the people with the hard drives on their
> bookshelves.
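
For what it's worth, the "list of checksums" part could be very small. Here is a rough sketch in Python of what I imagine, not a finished tool: it assumes SHA-256 is an acceptable digest and that each drive holds a plain directory tree. The function names (sha256_of, build_manifest, verify) are just ones I made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

The idea would be to run build_manifest once when the drive is written, publish the result on the website, and run verify against the same manifest whenever the drive comes off the shelf. An empty list back means everything still matches.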

Actually, just to get in the weeds and get all BSD on this:

ZFS mirrors of cheap/crappy old drives would likely go a long way toward “on-shelf” preservation.
Basically, pulling the data down onto a ZFS mirror could help mitigate bit-rot, and “checking” the data after a year could be as simple as plugging in the drives and running a ZFS scrub, which looks for dead/bad blocks and repairs them from the mirrored copy.
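
To make that concrete, here is a small Python sketch of the cycle I have in mind. It only builds the zpool command lines, and by default just prints them. The pool name and the device names are placeholders I made up, so substitute whatever your system actually calls the drives. The zpool subcommands themselves (create, export, import, scrub, status) are the standard ones.

```python
import subprocess


def zfs_shelf_commands(pool, disks):
    """Return the zpool commands for one write-once / check-later cycle.

    pool and disks are placeholders chosen by the caller; nothing here
    inspects the system or touches any device.
    """
    return {
        # Build a mirror across the given drives, then copy the data in.
        "create": ["zpool", "create", pool, "mirror", *disks],
        # Cleanly detach the pool before the drives go on the shelf.
        "shelve": ["zpool", "export", pool],
        # A year or two later: reattach, scrub, and read the report.
        "check": [
            ["zpool", "import", pool],
            ["zpool", "scrub", pool],
            ["zpool", "status", "-v", pool],
        ],
    }


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "shelf0", "ada1" and "ada2" are hypothetical names for illustration.
    plan = zfs_shelf_commands("shelf0", ["ada1", "ada2"])
    run(plan["create"])
    run(plan["shelve"])
    for step in plan["check"]:
        run(step)
```

One caveat: zpool scrub returns right away and the scrub itself runs in the background, so the status command would need to be repeated until it reports that the scrub has finished.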

I haven’t thought about ZFS in this offline context before, but this approach almost seems too easy.

Without going nuts on the relative merits of various ZFS block replication schemes, can anyone poke holes in this, or explain why it may overcomplicate the idea? I’m all ears…

(Incidentally, this is something I could start tonight from home, using a drawer full of flaky old 2TB drives…)

> 
> Note that this is my opinion only on how this project could be
> implemented. I don't know enough about the datasets or the likely
> effects of geopolitics on their implementation in order to comment as to
> whether I think the project should be implemented.

I’m with you on that. I’m happy to trust the original project directors’ lists of what’s most important and most relevant to climate scientists. Much of what’s listed is from NOAA and NASA, and the value of the data seems self-evident to me just from the names of the data sets.

Best,
.ike