Imaging mounted disk volumes under duress (benjojo.co.uk)
139 points by todsacerdoti on Sept 21, 2021 | 42 comments


It's a shame that Linux doesn't have APIs as comprehensive as Windows in this area. One of them is the Volume Shadow Copy Service (VSS), which lets you take backups of block devices that are in use. It's kind of similar to this, but better supported, and I think it also interacts with user processes like databases, quiescing them so you can take a consistent snapshot.

[Also, if you are playing with custom block devices on Linux -- mounted or otherwise, but not /dev/sda -- have a look at nbdkit: https://gitlab.com/nbdkit/nbdkit https://libguestfs.org/nbdkit.1.html One filter we have that is kind of similar to blktrace is nbdkit-stats-filter: https://libguestfs.org/nbdkit-stats-filter.1.html]
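If I remember the syntax right, pointing the stats filter at a disk image is a one-liner (file names made up, check the man page for the exact parameters):

    nbdkit --filter=stats file disk.img statsfile=stats.txt
    # any NBD client can then read the image, e.g.:
    qemu-img info nbd://localhost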


The dattobd [1] block device driver fills this gap. It basically implements point-in-time snapshot behavior on Linux, with hooks similar to the ones that VSS provides. It is used by the Datto Linux backup agent in combination with the backup appliance [2], but it can be used standalone as well (see the readme).

Disclaimer: I'm a software engineer at Datto.

[1] https://github.com/datto/dattobd

[2] https://www.datto.com/products/siris


This is just the coolest thing in the world, thank you so much for open sourcing it.

Have you asked upstream what they think of it, at all?


Hey, I'm one of the folks that work on maintaining dattobd. We haven't had a chance to talk to the upstream Linux kernel storage folks about a way to upstream it, but we're interested in exploring it, of course!


Once the world gets more back to normal, you could submit a talk topic to LSF/MM. Or just email the appropriate mailing list.


I've been doing my full disk backups with zfs snapshots for a while now, and zfs send-ing them offsite. Granted, it's a FreeBSD box, but the same tools should be available on Linux.


Definitely possible on Linux. In the not-too-distant past I had a large on-prem PostgreSQL DB with a hot standby that we used for backups. Since ZFS snapshots are atomic, you can just snapshot the dataset under the DB without worrying about quiescing the database, and then zfs send the snapshots to files that are then backed up offsite.

If you ever cared it's easy enough to clone one of those snapshots and bring it up as another DB instance to poke around at how things used to be...


I've been doing this with several production PostgreSQL instances.

PostgreSQL on ZFS is great.

I have zstd compression enabled and I average a 3.50x compression ratio.

(Probably some pretty awful CPU tradeoffs in there, but system load and query times seem fine. My database is 50 GB before zstd compression, so enabling it helps a ton.)

I also have ZFS snapshots taken hourly and shipped offsite. It's awesome that I don't need to pause PostgreSQL or anything to take the snapshot. It's atomic and instant.
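The whole pipeline is short enough to live in a cron job; roughly (dataset, snapshot and host names made up):

    zfs snapshot tank/pgdata@hourly-2021092112
    zfs send -i tank/pgdata@hourly-2021092111 tank/pgdata@hourly-2021092112 \
        | ssh backup@offsite zfs receive backup/pgdata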


With some quick-and-dirty `time`-style tests, zstd has pretty low overhead. IIRC, writing came out to roughly off = 100, lz4 = 110, zstd = 115 in relative CPU utilization on my personal data set, which compressed 1x, 1.7x, and 2.1x respectively. Reading overhead was negligible (single-digit percentages) for both lz4 and zstd. For anything on a spindle that's a pretty good trade-off of CPU time.
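For reference, turning it on and checking what you actually got back is just a property change (pool/dataset name made up):

    zfs set compression=zstd tank/data
    zfs get compression,compressratio tank/data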


VSS is pretty great. I use Macrium Reflect on Windows, and it just works. I've never once had a problem restoring a Macrium backup to bare metal, even if it has a bunch of incremental backups.


VSS is great indeed. Very useful to, e.g., P2V a machine live, even saving the P2V image on the same drive.

But VSS, like LVM or other snapshotting filesystems, requires dedicated FS support. You don't have VSS on FAT32. Here, blktrace works at the block device level, so it does not rely on any FS support.


I'm not sure I follow here. Linux does have LVM, what's the problem with it?

That's also what the article says: If you have LVM, you don't need any special tool.


LVM doesn't signal to databases and other user applications, so it's definitely not a replacement for VSS. As for whether LVM is a replacement for nbdkit, I guess it depends on whether you like writing kernel code, or prefer instead to write nbdkit plugins in userspace in a variety of programming languages (even shell script: https://www.youtube.com/watch?v=9E5A608xJG0)


For user applications it doesn't signal anything, I guess, but I'm not sure that's important for databases. Databases are usually crash-consistent, and from their point of view a snapshot is equivalent to a crash.

However, it does signal to the filesystem, quoting https://manned.org/fsfreeze.8 :

       fsfreeze is unnecessary for device-mapper devices. The device-mapper (and
       LVM) automatically freezes a filesystem on the device when a snapshot
       creation is requested. For more details see the dmsetup(8) man page.
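So on LVM the whole dance is a snapshot create/remove, with the freeze happening implicitly; a sketch with made-up VG/LV names:

    lvcreate --size 5G --snapshot --name data-snap /dev/vg0/data
    mount -o ro /dev/vg0/data-snap /mnt/snap    # back this up at leisure
    umount /mnt/snap && lvremove -y /dev/vg0/data-snap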


No one is doubting that databases are crash consistent. But if you read the other replies in this thread you can see there are still advantages to doing the backup in cooperation with the userspace programs, eg in time taken to restore instead of having to do full WAL replay. That's ignoring the non-database cases too.


I think I saw Oracle mentioned earlier; this has some info about the VSS integration, which is called the 'Oracle VSS Writer': https://docs.oracle.com/database/121/NTQRF/vss.htm#BEIJJCAE

I guess MS-SQL does something similar, and that does seem useful beyond what LVM does.

Also VSS apparently can offload the snapshotting to a SAN ('VSS Hardware Provider'): https://docs.microsoft.com/en-us/windows-server/storage/file...


That can't be supported by many databases; it must be insanely difficult to get right. Is it really used?


Oracle, MSSQL and some rarer ones (Actian Zen?) are "VSS-Aware" in that they register a writer driver. This gives them a callback before the snapshot starts - a chance to flush transactions and hold off writing any more, for a second or two until the disk snapshot is created and block tracking has begun. Then they get a second callback saying they can resume.

More interesting is e.g. Hyper-V using this to snapshot the guest VMs.

It just takes you one step beyond being only crash-consistent, so that everything is actually application-consistent.

Implementing a VSS writer means a big '90s-style COM interface, but there is also a lightweight alternative model on top of it whose name eludes my search right now.

Red Hat has some similar convention for freeze/thaw scripts that can make API calls to the DB to ask it to do the same thing.
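I believe those hooks are usually just a script that gets called with "freeze" before the snapshot and "thaw" after it; a rough sketch (not any particular vendor's exact convention), with PostgreSQL standing in for the DB:

    #!/bin/sh
    case "$1" in
      freeze) psql -U postgres -c "CHECKPOINT" ;;  # flush as much as possible before the snapshot
      thaw)   : ;;                                 # nothing to do; writes simply resume
    esac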


> "More interesting is e.g. Hyper-V using this to snapshot the guest VMs."

It can run through VMware tools into virtual machines, too, and down a layer to SAN storage and trigger a SAN snapshot so that a backup system like Veeam can take data from the SAN while knowing the guests two levels higher flushed data before the snapshot was taken.

Or, e.g., here a Nimble SAN can talk up to servers through VSS to trigger a SQL Server quiesce before it takes a SAN snapshot: https://infosight.hpe.com/InfoSight/media/cms/active/public/...


Does it just buffer to memory during this, or do the databases block writes?


It just needs to listen for explicit fsync commands and choose a matching point in time. Beyond that, no, it's not hard for a database to tell the OS about moments where a block-level disk image would not need repairing, once the OS asks.

I assume this is to prevent situations that would need time-intensive WAL-replay or such, and there it'd only be opportunistic with the "sudden power loss" recovery as a fall-back path.


Postgres supports backing up by just copying its files if you let it know using pg_start_backup/pg_stop_backup.

https://www.postgresql.org/docs/8.1/backup-online.html
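A minimal sketch of that flow (exclusive-backup style; paths and hosts made up, and newer Postgres versions have since renamed these functions):

    psql -c "SELECT pg_start_backup('nightly')"
    rsync -a /var/lib/postgresql/data/ backup-host:/backups/pgdata/
    psql -c "SELECT pg_stop_backup()"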

These databases already have to make sure that they can recover from the system crashing, so it is probably only incrementally more work to add a feature that writes out some extra metadata to enable these sorts of online backups.


AIUI, Postgres rotates its WAL, so once something is acknowledged as having hit the storage, the WAL entry referring to that write can be thrown away and that WAL space can be re-used. When you do pg_start_backup, it just temporarily stops throwing away old WAL entries, so that the copy of all the files you make elsewhere is guaranteed to have all the WAL entries it needs to make the database fully consistent.


Is VSS deprecated? They took out the "Previous Versions" feature for files in Windows 8.


Apparently the GUI part was removed in Windows 8 and then re-added in Windows 10: https://en.wikipedia.org/wiki/Shadow_Copy#Windows_8_and_Serv...


This is really quite clever, but, as the author hints, you should probably just use a file system layer with snapshotting support. So many things are so much easier.


This is fantastic. I am definitely going to look into the blktrace calls. They look useful.

---

I've been looking for a way to copy virtual machine disks over the network with minimal downtime.

I.e. Copy the entire 50 GB disk over. Suspend the VM. Do a quick pass that picks up the changed blocks. Resume the VM.

Does anyone have a tool for that? It would be similar to how QEMU/KVM does it for VM migrations. (They only support doing it for RAM though.)

---

Also related, you can use the sysrq feature on Linux to remount the root file system as read-only. Then run `dd if=/dev/sda | ssh user@destination.com 'cat > sda.img'` to transfer the root partition consistently.
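For reference, the sysrq part is just two writes to /proc/sysrq-trigger, assuming sysrq is enabled (the kernel.sysrq sysctl):

    echo s > /proc/sysrq-trigger    # sync dirty data first
    echo u > /proc/sysrq-trigger    # emergency remount of all filesystems read-only
    dd if=/dev/sda bs=4M status=progress | ssh user@destination.com 'cat > sda.img'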

I've used this twice to migrate VPSs off of Cloud providers and onto my physical servers to convert to .qcow2 images. Worked perfectly :)

I'm not sure if you are required to reboot afterwards, but I always do. It would be nice to be able to undo the sysrq remount.


QEMU has a native changed-blocks API like VMware, but I'm not sure of anything that uses it extensively (besides Proxmox Backup Server pointing at a Proxmox server).

https://lwn.net/Articles/837053/

Also, you can always do the fsfreeze yourself (which is what the QEMU guest agent can do, and is what Proxmox Backup Server uses).
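With libvirt and the guest agent installed, I think that's just (domain name made up):

    virsh domfsfreeze myvm
    # ...take the storage snapshot...
    virsh domfsthaw myvm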


> Does anyone have a tool for that? It would be similar to how QEMU/KVM does it for VM migrations. (They only support doing it for RAM though.)

I don't know of anything wrapping it, but if I was doing that I'd just put it on ZFS (handing zvols to qemu as disks) and then do incremental snapshots and send-recv them. (Bonus: trivial inline compression, and if you `zfs send --raw` even the transfer will be compressed)
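Roughly (pool/zvol and host names made up):

    zfs snapshot tank/vm/disk0@copy1
    zfs send --raw tank/vm/disk0@copy1 | ssh dst zfs receive tank/vm/disk0   # bulk copy, VM still running
    # suspend the VM, then send only what changed since copy1:
    zfs snapshot tank/vm/disk0@copy2
    zfs send --raw -i @copy1 tank/vm/disk0@copy2 | ssh dst zfs receive tank/vm/disk0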


Ah. Yeah, that's what I have been doing. It took about an hour to send the initial 100G ZFS seed over the network between different datacenters. Then I suspended all of the VMs and sent the remaining incremental.

The --raw/-w flag does send the compressed data.

It always feels brittle to me because of how manual the process is. Maybe it's time to create some decent scripts with safe logic and checks...


Scripting would be nice, but unless you're deleting things it feels extremely safe to me, since you're working on (and retain) snapshots the whole way; if anything ever does go wrong just roll back:)


If you can pause writes to the disk while reading from it (and you are using a supported filesystem, and you aren't doing, or can ignore, any raw device accesses), it would be much easier to use fsfreeze(8) instead. Caution is of course required when using this on the root filesystem, but I would trust it far more than anything using blktrace. My guess, having used blktrace before but not closely investigated it, is that while it may appear reliable at first, it is most likely not well tested to never drop events under load, and the same applies to this tool. On the other hand, as long as I've worked out that fsfreeze won't deadlock the whole machine, I'm reasonably confident that it will result in a correct disk image.
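i.e. something along these lines (mount point and device names made up; don't try it on the filesystem your shell lives on without thinking it through):

    fsfreeze --freeze /mnt/data
    dd if=/dev/sdb of=/backup/sdb.img bs=4M status=progress
    fsfreeze --unfreeze /mnt/data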


I don't understand how this works. If the trace API gave you the data in the writes, then I could see how it works: you run your copy, then replay the captured writes over your copy, and you have a snapshot that is consistent at some point in time.

However, if you just have a page modification flag, then if you try to re-copy the data that has been modified, it seems like you could end up in a loop where you make no progress because the disk is continually being modified. If none of the modified pages are modified again during your second pass then everything is OK, but if some are, that could invalidate other pages that weren't modified during the first pass but were modified during the second pass.


The first pass, reading the whole disk, takes a long time (often hours for HDDs), so you'll have to deal with a lot of modifications.

As long as the system was only under normal near-idle load, there will be some modifications, but not too many, so you'll be able to catch up much more quickly, leaving the window for new modifications even smaller.

Of course, if the disk is continuously receiving a high write volume, the race might never end. But if it's idle long enough for you to do a sync and a final collection pass before the next write, you'll have a full image.


Plus, if the same few pages are getting written over and over (as often happens in high-write-load scenarios), you don’t need to queue up N reads, but instead can just do one, to capture the latest value. Similar to POSIX signal coalescing—once you know “X happened”, any other “X happened” notifications are redundant and can be dropped, until you actually handle the first one.


As long as you can catch up to a point where you read all busy pages and none were written in the meantime, yes.

Generally, the way VM migrations deal with this is by suspending the VM once a small enough set of dirty pages remain. I think the same can be done with block devices on Linux; if I remember correctly, there's some file in sysfs that you can poke to quiesce a block device. Just make sure your app is fully in RAM/cache, ideally locked, if you do this to your root filesystem :)


Except that's not how this tool works. It only does a single pass over the modified blocks as far as I can tell.[1][2]

[1] https://github.com/benjojo/hot-clone/blob/6a019efe28bdbbeeb1...

[2] https://github.com/benjojo/hot-clone/blob/6a019efe28bdbbeeb1...


You could potentially also speed it up a bit by, after the first pass, first reading the blocks that are unlikely to be written again soon (normally, the longer ago a block was last written, the less likely it is to be written again soon). This would implicitly leave the blocks that are being written over and over for last.


You missed the step where the author unmounts the filesystem. That idles all writes.

But yeah, the consistency test doesn't match the real scenario...

I guess it may work on a live filesystem if you iterate until you reach a point where, after reading the changed data, you don't get any new changes in the trace, and you stop right there.


The unmount was just for verification. The tool is clearly intended to be used without unmounting the disk at any point.

Repeatedly copying until there are no new changes would produce a consistent image (if it terminates -- it requires the writing speed to the output device to be faster than the rate at which new data is being written to the input disk).

It may be possible to stop earlier. The precise condition for producing a consistent image is as follows:

Let A[i] be the most recent modification time of block i, let B[i] be the modification time of block i at the time it was copied, and let C[i] be the time of the first modification to block i after it was copied (or infinity if the block hasn't been modified since it was copied).

We can stop if max(B) < min(C).

In detail, here is how to compute A, B, and C:

1. Initially, A[i] = B[i] = C[i] = 0.

2. When we are notified that block i has been altered, set A[i] to the modification timestamp. If C[i] == infinity, set C[i] to A[i].

3. When we copy block i, set B[i] to A[i] and set C[i] to infinity.
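To make that concrete with made-up timestamps: say block 1 was copied when its last modification was at t=5 (so B[1]=5) and it got written again at t=9 (C[1]=9), while block 2 was copied with B[2]=7 and hasn't been touched since (C[2]=infinity). Then max(B)=7 < min(C)=9, and the image we hold matches exactly what was on disk at any instant in [7, 9), so it's consistent.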


In truly bad situations where failure appears to be imminent, wouldn't it be safer to start rsync'ing whatever files you can to a reliable disk or another server?


Wow. Just wow.

I would take this guy as a coworker any day. Excellent real world awareness, practicality, only reinventing the bolts of one wheel, and excellent explanation of what and why.



