klarasystems.com | OpenZFS Developer | REMOTE | Full-time Contract

We successfully hired from HN in the previous round and are looking for another OpenZFS Developer (3+ years of experience) to join our team!

Klara Inc. provides development & solutions focused on open source software and the community-driven development of OpenZFS and FreeBSD. We develop new features, investigate/fix bugs, and support the community of these important open source infrastructure projects. Some of our recent work includes major ZFS features such as Fast Deduplication (OpenZFS 2.3: https://github.com/openzfs/zfs/discussions/15896) and AnyRAID: https://github.com/openzfs/zfs/pull/17567.

We’re looking for an OpenZFS Developer with:

- Strong C programming skills and solid understanding of data structures

- Experience with file systems, VFS, and OS internals (threading, locking, IPC, memory management)

- Familiarity with ZFS internals (DMU, MOS, vdevs, ZPL, datasets, boot environments)

- Ability to work across Linux, FreeBSD, or illumos environments

Previous upstream contributions to OpenZFS or other open source projects are a big plus.

Submit an application through our site: https://klarasystems.com/careers/openzfs-developer/


Klara Inc. | OpenZFS Developer | Full-time (Contractor) | Remote | https://klarasystems.com/careers/openzfs-developer/

Klara provides open source development services with a focus on ZFS, FreeBSD, and Arm. Our mission is to advance technology through community-driven development while maintaining the ethics and creativity of open source. We help customers standardize and accelerate platforms built on ZFS by combining internal expertise with active participation in the community.

We are excited to share that we are looking to expand our OpenZFS team with an additional full-time Developer.

Our ZFS developer team works directly on OpenZFS for customers and with upstream to add features, investigate performance issues, and resolve complex bugs. Recently our team has upstreamed Fast Dedup, critical fixes for ZFS native encryption, and improvements to gang block allocation, and has even more out for review (the new AnyRAID feature).

The ideal candidate will have experience working with ZFS or other Open Source projects in the kernel.

If you are interested in joining our team, please contact us at zfs-hire@klarasystems.com or apply through the form here: https://klarasystems.com/careers/openzfs-developer/


My process reading this:

OpenZFS contractor? I wonder if they've worked with Allan Jude before. Oh hey, that's Allan Jude's company. Oh hey, that's Allan Jude!


The "M" at the end of that revision number suggests a "modified" tree, meaning they had their own patches that were not part of the upstream repository as well.


ZFS has supported adding vdevs online since the start too; this is specifically modifying an existing vdev and widening it, which is much less common, and much more complex.


There are advantages to doing the cloning at the block level, rather than at the VFS layer. The feature was originally written for FreeBSD using the copy_file_range() syscall, then extended to work with the existing Linux interfaces that came from btrfs.
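
Roughly what that looks like from userspace, as a minimal sketch (error handling trimmed): on a filesystem that supports block cloning, copy_file_range() can be satisfied with block references instead of a data copy, though the kernel is always free to fall back to an ordinary copy.

    /* clone.c - minimal sketch: ask the kernel to clone src into dst
     * via copy_file_range(2). On a filesystem with block cloning
     * this can avoid copying the data at all. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
            return 1;
        }

        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(src, &st) < 0) { perror("fstat"); return 1; }

        off_t remaining = st.st_size;
        while (remaining > 0) {
            /* NULL offsets: use and advance each fd's file position. */
            ssize_t n = copy_file_range(src, NULL, dst, NULL, remaining, 0);
            if (n < 0) { perror("copy_file_range"); return 1; }
            if (n == 0) break;  /* source shorter than expected */
            remaining -= n;
        }

        close(src);
        close(dst);
        return 0;
    }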


There are a few different use cases, but cloning a VM image file is definitely a popular one.

Also, `mv` between different filesystems in the same ZFS pool. Traditionally, since crossing filesystems rules out a simple `rename()`, `mv` falls back to effectively `cp` then `rm`, so it at least temporarily requires 2x the space, and that space might not be freed for a long time if you have snapshots.

With BRT, the copy to the 2nd filesystem doesn't need to write anything more than a bit of metadata, and then when you remove the source copy, it actually removes the BRT entry, so there is no long-term overhead.

One of the original developer's use cases was restoring a file from a snapshot, without having to copy it and have it take up additional space.

Say you made a file (foo) 2 days ago and have changed it each day. Today's change was bad, and you want to restore yesterday's version.

Before BRT: you copy the file from the snapshot back to the live filesystem, and it takes up all new space.

After BRT: we reference the existing blocks in the snapshot, so the copy to the live filesystem takes no additional space on disk. A small BRT entry is maintained in memory (and on disk).

If you remove the snapshot, the BRT entry is removed, and the file remains intact. No long term overhead.


It is, except it lets you do it at a sub-file level. You can clone a (block-aligned) byte range of a file using the copy_file_range() syscall.
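
A hedged sketch of the sub-file case; the file names, offsets, and length below are made up for illustration, and they need to line up with the dataset's record size for the blocks to actually be shared rather than copied.

    /* Sketch: clone a block-aligned byte range from one file into
     * another using copy_file_range(2) with explicit offsets. The
     * names, the 1 MiB offsets, and the 8 MiB length are hypothetical;
     * align them to the dataset recordsize (e.g. 128 KiB) or the
     * kernel may fall back to copying. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int src = open("vm-template.img", O_RDONLY);            /* hypothetical */
        int dst = open("vm-clone.img", O_WRONLY | O_CREAT, 0644);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        off_t off_in  = 1024 * 1024;      /* start of range in source */
        off_t off_out = 1024 * 1024;      /* where to place it in dest */
        size_t len    = 8 * 1024 * 1024;  /* 8 MiB, record-aligned */

        while (len > 0) {
            /* With explicit offset pointers, the kernel advances them
             * by the number of bytes handled on each call. */
            ssize_t n = copy_file_range(src, &off_in, dst, &off_out, len, 0);
            if (n <= 0) { perror("copy_file_range"); return 1; }
            len -= (size_t)n;
        }

        close(src);
        close(dst);
        return 0;
    }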


That is not how this will work.

The reason the parity ratio stays the same is that all of the references to the data are by DVA (Data Virtual Address, effectively the LBA within the RAID-Z vdev).

So the data will occupy the same amount of space and parity as it did before.

All stripes in RAID-Z are dynamic, so if your stripe is 5 wide and your array is 6 wide, the 2nd stripe will start on the last disk and wrap around.

So if your 5x10 TB disks are 90% full, after the expansion they will contain the same 5.4 TB of data and 3.6 TB of parity, and the pool will now be 10 TB bigger.

New writes will be 4+2 instead, but the old data won't change (that is how this feature is able to work without needing block-pointer rewrite).

See this presentation: https://www.youtube.com/watch?v=yF2KgQGmUic
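
To make the arithmetic concrete, here is a back-of-the-envelope calculation; it reads the 5.4 TB / 3.6 TB figures as per-disk amounts, which is my assumption about the example above.

    /* Back-of-the-envelope numbers for the 5x10 TB example, taking the
     * 5.4 TB / 3.6 TB figures as per-disk amounts (my assumption). */
    #include <stdio.h>

    int main(void)
    {
        double disk_tb = 10.0;   /* size of each disk */
        double fill    = 0.90;   /* pool is 90% full */
        int    width   = 5;      /* 5-wide RAIDZ2 before expansion */
        int    parity  = 2;

        double used_per_disk   = disk_tb * fill;                           /* 9.0 TB */
        double data_per_disk   = used_per_disk * (width - parity) / width; /* 5.4 TB */
        double parity_per_disk = used_per_disk * parity / width;           /* 3.6 TB */

        printf("per disk: %.1f TB data, %.1f TB parity\n",
               data_per_disk, parity_per_disk);
        printf("raw capacity added by one new disk: %.0f TB\n", disk_tb);
        return 0;
    }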


The linked pull request says "After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks". That'd mean that the disks do not contain the same data, but it is getting moved around?

Regardless, my entire point is that you still lose a significant amount of capacity because the old data remains 3+2 rather than being rewritten to 4+2, which heavily disincentivizes expanding arrays that are reaching capacity, but that is the only time people would want to expand their array.

It just seems to me like they are spending a lot of effort on a feature which you frankly should not ever want to use.


I don't think that's true.

I don't use raidz for my personal pools because it has the wrong set of tradeoffs for my usage, but if I did, I'd absolutely use this.

Yes, older data keeps the old data:parity ratio, but you now have more total storage available, which is the entire goal. Sure, it'd be more space-efficient to go rewrite your data, piecemeal or entirely, afterward, but you now have more storage to work with, rather than having to remake the pool or replace every disk in the vdev with a larger one.


> So the data will occupy the same amount of space and parity as it did before.

So you lose data capacity compared to "dumb" RAID6 on mdadm.

If you expand RAID6 from 4+2 to 5+2, you go from spending 33.3% of capacity on parity to 28.5%.

If you expand RAIDZ from 4+2 to 5+2, your new data will spend 28.5% on parity, but your old data (which is the majority, because if it wasn't you wouldn't be expanding) will still spend 33.3% on parity.
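
A quick sanity check of those percentages, plus the blended overhead if, say, 80% of the data predates the expansion (the 80/20 split is purely an illustrative assumption):

    /* Parity overhead before/after RAIDZ expansion, and the blended
     * overhead when most data keeps the old layout. The 80/20 split
     * below is an illustrative assumption, not a measured figure. */
    #include <stdio.h>

    static double overhead(int data, int parity)
    {
        return (double)parity / (data + parity);
    }

    int main(void)
    {
        double old_oh    = overhead(4, 2);  /* 4+2 -> 2/6 = 33.3% */
        double new_oh    = overhead(5, 2);  /* 5+2 -> 2/7 = 28.6% */
        double frac_old  = 0.80;            /* assume 80% of data predates expansion */

        printf("old layout: %.1f%% parity\n", old_oh * 100);
        printf("new layout: %.1f%% parity\n", new_oh * 100);
        printf("blended:    %.1f%% parity\n",
               (frac_old * old_oh + (1 - frac_old) * new_oh) * 100);
        return 0;
    }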


Could you force a complete rewrite if you wanted to? That would be handy. Without copying all the data elsewhere of course. I don't have another 90TB of spare disks :P

Edit: I suppose I could cover this with a shell script needing only the spare space of the largest file. Nice!
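
In the same spirit, a rough C sketch of the per-file rewrite idea (a plain copy rather than a clone, since the point is to reallocate the blocks under the new layout; the temp-file name and error handling are illustrative only, and ownership, mode, and timestamps are not preserved):

    /* Sketch: rewrite one file in place so its blocks are reallocated
     * under the pool's current layout. Uses plain read/write on
     * purpose: cloning would defeat the point. Needs free space equal
     * to the size of the largest single file being rewritten. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int rewrite_file(const char *path)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.rewrite", path);  /* example temp name */

        int in = open(path, O_RDONLY);
        int out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (in < 0 || out < 0) { perror("open"); return -1; }

        static char buf[1 << 17];  /* 128 KiB copy buffer */
        ssize_t n;
        while ((n = read(in, buf, sizeof(buf))) > 0) {
            if (write(out, buf, (size_t)n) != n) { perror("write"); return -1; }
        }
        if (n < 0) { perror("read"); return -1; }

        if (fsync(out) < 0) { perror("fsync"); return -1; }
        close(in);
        close(out);

        /* Atomically replace the original. Ownership, mode, xattrs,
         * and timestamps are NOT preserved in this sketch. */
        if (rename(tmp, path) < 0) { perror("rename"); return -1; }
        return 0;
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            if (rewrite_file(argv[i]) != 0)
                return 1;
        return 0;
    }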


> Could you force a complete rewrite if you wanted to?

On btrfs that's a rebalance, and part of how one expands an array (btrfs add + btrfs balance)

(Not sure if ZFS has a similar operation, but from my understanding resilvering would not be it)

Not that it matters much, though, as RAID5 and RAID6 aren't to be depended upon, and the array failure modes are weird in practice, so in the context of expanding storage it really only matters for RAID0 and RAID10.

https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...

https://www.unixsheikh.com/articles/battle-testing-zfs-btrfs...


ZFS does not, and fundamentally is never going to get one without rewriting so much you'd cry.


The easiest approach is to make a new subvolume and move one file at a time. (A normal mv is copy + remove, which doesn't quite work here, so you'd probably want something using find -type f and xargs with mv.)


A post where we discuss some strategies and tools to make managing disk arrays on FreeBSD (and related platforms like TrueNAS Core) much easier. These concepts also apply to other operating systems, but the tools might differ slightly.


This was a bug in ZFS itself, and was resolved years ago: https://smartos.org/bugview/OS-6404

In FreeBSD 12.x and older, you could opt out of using UMA with vfs.zfs.zio.use_uma=0, but that is no longer necessary.


Thanks. I still see multi-second pauses moving data over Samba to an encrypted ZFS pool. I’m using a 10Gbps network card, so the drives are the bottleneck (or so I thought, until seeing these delays).

I’m on 13-RELEASE.

