[nycbug-talk] Remote backup services

Miles Nordin carton at Ivy.NET
Tue Mar 10 14:28:25 EDT 2009

>>>>> "il" == Isaac Levy <ike at lesmuug.org> writes:

    il> ZFS Snapshots to Amazon S3:
    il> http://blogs.sun.com/ec2/entry/zfs_snapshots_to_and_from

Aside from the problem that S3 is ludicrously expensive...

you should not be storing 'zfs send' streams, ever.  You may transport
them but not archive them.  The problems:

 * they're atomic.  If one bit in a stream is flipped, the whole
   stream is bad.  

 * they're atomic.  You must restore an entire stream or no
   stream---if you don't have enough space for a stream you can't even
   see what's inside it.

 * the incremental feature is rigid.  If one bit is flipped in the
   backing-store to which the incremental is meant to apply, then it
   won't restore.

 * the upward/downward compatibility across ZFS versions is bad and
   haphazard.  You can only count on it working for the exact same
   release that wrote the stream, though sometimes it's better.  They
   are trying to make a stronger compatibility commitment, but they do
   not seem in control of the current situation so I am not prepared
   to buy whatever commitment they make, especially for something

 * bugs.  kernel panics on recv, endianness bugs.

 * there is no way to test the stream's validity without enough space
   to restore it.  This is important period, but it's more subtle.

   There never will be such a way, because the kernel is involved in
   restoring streams, and one of the ways a stream can be invalid is
   that it panics the kernel upon recv.  That is not a legitimate way
   to be invalid; it's an unfixed bug, so it could come and go
   arbitrarily, and nothing but an exact replication of the recv
   process is a good way to test for it, since we do not know where
   the bug is yet.  This bug is bad, but FAR more workable if you're
   using send|recv to move data rather than archive it.
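
To make the first two bullets concrete, here is a toy sketch.  It is
NOT the zfs send format: the "atomic" blob and the "per-file" archive
are hypothetical stand-ins I made up for illustration, using SHA-256
as the integrity check.  It only shows the difference between one
checksum over everything and one checksum per member when a single
bit is flipped.

```python
import hashlib

def make_atomic(files):
    """One blob, one checksum over everything (all-or-nothing stand-in)."""
    payload = b"".join(files.values())
    return payload, hashlib.sha256(payload).digest()

def make_per_file(files):
    """Each member carries its own checksum (cpio/tar-style stand-in)."""
    return {name: (data, hashlib.sha256(data).digest())
            for name, data in files.items()}

def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

def recoverable_atomic(payload, digest, files):
    """A checksum mismatch rejects the whole blob: every file or none."""
    return list(files) if hashlib.sha256(payload).digest() == digest else []

def recoverable_per_file(archive):
    """Only members whose own checksum still matches are returned."""
    return [name for name, (data, digest) in archive.items()
            if hashlib.sha256(data).digest() == digest]

# Four hypothetical 64-byte files.
files = {f"file{i}": bytes([i]) * 64 for i in range(4)}

payload, digest = make_atomic(files)
damaged = flip_bit(payload, 8 * 70)        # byte 70 falls inside file1
atomic_ok = recoverable_atomic(damaged, digest, files)

archive = make_per_file(files)
data, file1_digest = archive["file1"]
archive["file1"] = (flip_bit(data, 5), file1_digest)
per_file_ok = recoverable_per_file(archive)

print(atomic_ok)      # files recoverable from the atomic blob
print(per_file_ok)    # files recoverable from the per-file archive
```

The same single flipped bit costs you one file in the second layout
and everything in the first.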

I continually get into arguments with people on the mailing list about
this, and to my view I win every single one of them decisively and the
original suggesters leave ``educated'' not indignant, but someone pops
up later with the same bad idea.  And the Sun people will not update
the si wiki with anything other than an ambiguous CYA warning, ``zfs
send is not an enterprise backup solution,'' and I wrote the wiki guy
myself asking for an account and was ignored, so I guess the si wiki
is a Sun mouthpiece like their employee blogs, just hiding under
another domain.

Partly I think it keeps coming up because it's an obvious idea that's
very subtle in its badness.  And partly Sun is promoting this bad
practice through their viral marketing behemoth to the effect of
making ZFS seem more versatile.

Storing them in S3 probably has fewer of the problems of storing them
on disk or tape, but it's still not great.

The blogger is suggesting something subtly different, using S3 to
``move'' filesystems between instances (safe) or across short
stretches of time (less safe, depends).

Storing them in S3 for the purposes of booting a system might be a
relatively good idea, because in that case you will have a
filesystem-level backup somewhere else (filesystem-level means your
backup must NOT be a duplicate copy of the same S3 stream.  it must be
a copy in cpio/tar/zpool/ext3 form.).  He's just talking about
non-boot data filesystems but...  

I don't think it's easy to arrange to boot a solaris VM from a
ZFS-root stored in 'send' format.  It'd be interesting to explore.
You could recv one bulk stream on all nodes, then recv a different
incremental stream on each subnode to customize it, thus saving S3
dollars.  I think Solaris does have an early userspace like Linux, so
maybe something could be adapted to boot from a zfs send stream.

This application is not the same as backup, though.  It's a form of
replication, which is what zfs send | zfs recv's fragileness is good
for.  It's replication because if the stream goes sour somehow, you
can presumably regenerate it from the real authoritative master node
somewhere outside Amazon.  What you care about most is, when you bring up
your tens of EC2 instances and 'zfs recv' their roots, you want them
to really definitely have exactly the root disk you think they should.
If they don't, you can chuck the broken S3 stream and make a new one.

It would be tempting to save S3 dollars by tossing in 'zfs send'
incremental snapshots of the VM as the VM is running, instead of
stopping the VM and copying the whole virtual disk.  You could afford
in S3 $ and performance to do it frequently, even like, every five
minutes.  And zfs will let you coalesce these snapshots, so every hour
you could send a rollup and drop the tiny ones.  You could do
test-sends to 'wc' and only pay for an S3 dump after the 'zfs send'
stream gets kind-of big.  This would be kind of borderline safety,
depending on whether the VMs are storing something of high value
(unsafe) or just logs and mailqueues full of spam backscatter
(slap!)---like, you are using zfs send/S3 as you'd use a fragile
unredundant working disk, and your real backup is elsewhere, like a
nightly zfs send -> zpool outside Amazon.  You can work through my
objections below and see if you think it's safe or not.
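
A rough sketch of the test-send idea, assuming you drive it from
Python.  The dataset and snapshot names and the threshold are
hypothetical; the only real command referenced is 'zfs send -i' for an
incremental between two snapshots.  count_bytes() plays the role of
'wc': it drains the stream and tells you how big it was, and you only
pay for the S3 dump once it crosses a size you pick.

```python
import io

def incremental_send_cmd(dataset, prev_snap, cur_snap):
    """Argument list for an incremental stream from prev_snap to cur_snap."""
    return ["zfs", "send", "-i",
            f"{dataset}@{prev_snap}", f"{dataset}@{cur_snap}"]

def count_bytes(stream, chunk_size=1 << 20):
    """Drain a binary stream and return its length, like piping to wc -c."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)

def worth_uploading(stream_bytes, threshold_bytes):
    """Only pay for an upload once the incremental is big enough."""
    return stream_bytes >= threshold_bytes

# Stand-in for a real stream so this runs without ZFS.  On a real
# system you would pass the stdout pipe of
# subprocess.Popen(incremental_send_cmd(...), stdout=subprocess.PIPE).
demo_size = count_bytes(io.BytesIO(b"\0" * 5000))
demo_cmd = incremental_send_cmd("tank/vm1", "five-min-0001", "five-min-0002")

print(demo_cmd)
print(demo_size, worth_uploading(demo_size, threshold_bytes=4096))
```

Note the test-send costs you a full read of the delta each time, so
the threshold is a trade between local I/O and S3 dollars.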

anyway 'zfs send' is for replicating one ZFS pool into another, not
for backing it up.  If you want to back up, you should use
gtar/pax/cpio/rsync/..., or else zfs recv into a backup pool.  If you
have enough scratch space, you can write file-backed vdev's to tape,
or make several dvd/bd-size file vdevs.  archiving the ZFS pools
themselves is obviously okay since otherwise ZFS would suck, but you
can see my specific cases below.

but I think S3 is too expensive to use for anything.  We need
self-organizing S3 on livecd, then we just rent hardware on the open

From: Miles Nordin <carton at castrovalva.Ivy.NET>
Subject: Re: [zfs-discuss] GSoC 09 zfs ideas?
To: zfs-discuss at opensolaris.org
Date: Mon, 02 Mar 2009 18:37:43 -0500
References: <49A511CF.1050107 at netsyncro.com> <38171ac60902260843l2e7a595cr53187ccbc924aa00 at mail.gmail.com> <49A8760C.5040500 at netsyncro.com> <47B26F54-E30A-454E-B6B5-FB18268221CA at ee.ryerson.ca> <49A88D37.10408 at gmail.com> <4786252C-55A5-4EFD-8976-955B93039A66 at ee.ryerson.ca>
In-Reply-To: <4786252C-55A5-4EFD-8976-955B93039A66 at ee.ryerson.ca> (David Magda's message of "Fri, 27 Feb 2009 22:24:31 -0500")
Message-ID: <oq63irbins.fsf at castrovalva.Ivy.NET>

>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:

    dm> Yes, in its current state; hopefully that will change some
    dm> point in the future

I don't think it will or should.  A replication tool and a backup tool
seem similar, but they're not similar enough.

With replication, you want an exact copy, and if for some reason the
copy is not exact then you need something more: you want atomicity of
operations so the overwatcher scheduler:

 * can safely retry the send|recv until it works,

 * can always tell its minder with certainty how much has safely been
   replicated so far,

 * can attempt further replication without risking existing data.  

These things are a bit hard to achieve.  And zfs send|recv does them:

 * If a single bit is flipped the whole stream should be discarded

 * If, on a 'send' timeline of <full>, <incremental>, <incremental>,
   <incremental>, one of the preceding blobs did not make it, or
   became bad on the replication target (because somebody wrote to it,
   for example---a FAQ), it should become impossible to restore
   further incremental backups.  The error should not be best-effort
   worked around, or simply silently concealed, the way it is and
   should be with restoring incremental backups.

 * reslicing the unit of replication after writing the stream is
   irrelevant, because you can just reslice on the replication-source
   if you need to do this.  The possibility of reslicing interferes
   with the atomicity I spoke of which makes the replication scheduler
   so much easier to get correct.

 * there's no need to test a stream's validity without restoring it.
   The replication target will always be available and have enough
   free space to test-by-receiving.

 * there's no need to restore the stream on an unimagined future
   filesystem.  It's more important that all fancyfeatures, ACL's,
   gizmos, properties, compression modes, record sizes, whatever, make
   it to the replication target exactly to avoid surprises.  No one is
   worried about data being locked in to a certain filesystem because
   it's all there for you to get with rsync on both replication source
   and target.
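
To show why atomicity makes the scheduler easy, here is a minimal
sketch of one.  Everything in it is hypothetical: replicate(),
FlakyLink, and the snapshot names are invented for illustration, and
send_recv stands in for a real 'zfs send | zfs recv' that either
lands the whole snapshot or leaves the target untouched.

```python
def replicate(snapshots, send_recv, max_attempts=3):
    """Replicate snapshots in order; return those confirmed on the target.

    send_recv(prev, cur) is assumed to be all-or-nothing: it either
    returns with cur fully on the target, or raises OSError and
    leaves the target exactly as it was.
    """
    confirmed = []
    prev = None
    for snap in snapshots:
        for _attempt in range(max_attempts):
            try:
                send_recv(prev, snap)
            except OSError:
                continue            # safe to retry: nothing was half-applied
            confirmed.append(snap)
            prev = snap
            break
        else:
            # Never skip ahead: later incrementals depend on this one.
            return confirmed
    return confirmed

class FlakyLink:
    """Toy transport that fails a set number of times per snapshot."""
    def __init__(self, failures):
        self.failures = dict(failures)
        self.received = []

    def __call__(self, prev, cur):
        if self.failures.get(cur, 0) > 0:
            self.failures[cur] -= 1
            raise OSError(f"transfer of {cur} failed")
        self.received.append(cur)

link = FlakyLink({"hourly-2": 1, "hourly-3": 5})
confirmed = replicate(["hourly-1", "hourly-2", "hourly-3", "hourly-4"], link)
print(confirmed)
```

The scheduler needs no partial-state bookkeeping at all: its list of
confirmed snapshots is always exactly what the target holds, which is
the "tell its minder with certainty" property.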

Try to use such a tool for backup, and you court disaster.  Your pile
of backups becomes an increasingly large time-lapse gamma ray
detector, which signals a ray's presence by destroying ALL the data
not just the bit, not even just the file, that the ray hit.  The
impossibility of reslicing (extracting a single file from the backup)
means that you can find yourself needing 48TB of empty disk on a
development system somewhere to get out a 100kB file locked inside the
atomic blob, which is an unacceptable burden in time and expense.  The
other points have obvious problems for backups, too---I'll leave
inventing imaginary restore scenarios as an exercise for the reader.
All these harmful points are things that replication wants/needs.  The
goals are incompatible.
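
A back-of-envelope version of the gamma ray detector, under loud
assumptions: each archived stream goes bad independently with some
probability p over the retention period (p is a made-up parameter,
not a measurement), and a <full> plus its <incremental>s restore only
if every stream in the chain is intact.

```python
def chain_restorable_probability(p_stream_bad, n_streams):
    """Chance that every stream in a full+incremental chain is intact.

    Assumes independent failures with probability p_stream_bad each;
    one bad stream makes everything after it unrestorable.
    """
    return (1.0 - p_stream_bad) ** n_streams

# Hypothetical per-stream failure probability, for illustration only.
for n in (1, 10, 100):
    print(n, chain_restorable_probability(0.01, n))
```

Whatever p really is, the chance of a clean restore only goes down as
the pile grows.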

If there's going to be a snapshot-aware incremental backup tool for
ZFS, I do not think zfs send|recv is a good starting point.  And I'm
getting frustrated pointing out these issues for the 10th time---it
seems like, I mention five relatively unsolvable problems, and people
seize onto the easiest one, misinterpret it, and then forget the other

the versioning issue (NOT mentioned above) is a problem for
replication among different Solaris releases, not just backup.  It
means you could potentially have to upgrade a whole mess of machines
at once.  At the very least you ought to be able to upgrade the target
before you upgrade the source, so you don't have to break replication
while doing the upgrade---coincidentally that's the right direction
for upgrade-test-downgrade, too, since it's on the target that you'd
possibly have to destroy the zpools if you decide you need to
downgrade.  We should want this and don't have it yet.

It makes having a single backup pool for a lab full of different-aged
systems impossible (even with backward-only compatibility, how do you
restore?), so it is worth solving for that use case too.  I think the
'zfs send' format of a given filesystem should be bit-for-bit
identical given a certain ZFS version, irrespective of zpool version
or kernel release on the sending system.  That's enough to solve the
single-lab-backup-pool problem, and it's also
regression-testable---keep some old streams around, recv them into the
pool under test, send them back out, and make sure they come out
identical.  And the 'zfs recv' panics need fixing.  those would both
be great things, but they would *NOT* make zfs send|recv into a backup
system!  They would make it into a better replication system.
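
The regression test could be as small as comparing digests of the
saved stream and the stream that comes back out.  This sketch assumes
the bit-for-bit property I'm asking for actually held, which it does
not today; the file names are hypothetical and the demo uses
throwaway files in place of real streams.

```python
import hashlib
import os
import tempfile

def stream_digest(path, chunk_size=1 << 20):
    """SHA-256 of a saved stream file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def round_trip_matches(original_path, resent_path):
    """True if the re-sent stream is bit-for-bit the archived one."""
    return stream_digest(original_path) == stream_digest(resent_path)

# Demo with stand-in files.  In a real test, "resent" would be the
# output of sending the filesystem back out after recv'ing the old
# stream into the pool under test.
with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "old.zsend")
    same = os.path.join(d, "resent-same.zsend")
    diff = os.path.join(d, "resent-diff.zsend")
    for path, content in [(old, b"stream-v1"),
                          (same, b"stream-v1"),
                          (diff, b"stream-v2")]:
        with open(path, "wb") as f:
            f.write(content)
    identical = round_trip_matches(old, same)
    changed = round_trip_matches(old, diff)

print(identical, changed)
```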

zfs send|recv will not become backup tools if you manage to check off
a few other list-of-things-to-fix, either.  They can't be both a good
replication system and a good backup system at the same time.

no, I don't think the si wiki explains the full size of the issue
adequately.  It makes it sound like the tools are just a little too
small, and maybe someday they will get bigger, maybe some people
should just ignore the advice and use them anyway.  It's CYA bullshit.

i think in the meantime you should make backups with cpio (or some
tool that uses something similar underneath like Amanda) or by
archiving file-vdev zpools.  not perfect but better.  And you should
store it on the medium in some way that the whole thing won't be wiped
by one flipped bit (not gzip).