Hacker News | shayonj's comments

That's very interesting! I was noticing page fault storms on live migrations as well, and I wonder if that's what you were running into / mentioning here regarding the lock contention.

> you can do things that aren't practical with 100-200ms startup: speculative parallel execution (fork 10 VMs, try 10 approaches, keep the best), treating code execution like a function call instead of an infrastructure decision, etc.

I'm not following; why isn't it practical?


Off the top of my head, trading or realtime voice come to mind. Plenty of other domains could probably benefit.
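The idea quoted upthread (fork N VMs, try N approaches, keep the best) can be sketched with plain processes standing in for microVM forks. This is just an illustration under assumed names: `run_approach` and `speculative_best` are made up here, and the "quality metric" is a placeholder for whatever you'd actually score.

```python
import multiprocessing as mp

def run_approach(seed):
    """Stand-in for one forked sandbox trying one approach.

    In the real setting each worker would be a forked microVM executing
    untrusted code; here each 'approach' just scores differently."""
    score = (seed * 37) % 11  # pretend quality metric
    return seed, score

def speculative_best(n_forks=10):
    # Fork n workers, try n approaches in parallel, keep the best result.
    with mp.Pool(n_forks) as pool:
        results = pool.map(run_approach, range(n_forks))
    return max(results, key=lambda r: r[1])
```

The point of sub-millisecond fork times is that the line `mp.Pool(n_forks)` above is morally equivalent to "spin up 10 isolated sandboxes"; with 100-200ms startup, spawning ten of them just to discard nine stops being a reasonable default.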

Cool project. +1 on userfaultfd for the multi-node path. I wrote about how uffd-based on-demand restore works with respect to my Cloud Hypervisor change [1], if you are curious.

I think the main things to watch are fault storms at resume (all vCPUs hitting missing pages at once) and handler throughput if you're serving pages over the network instead of a local mmap. I think it's less likely to happen when you fork a brand-new VM vs, say, a VM that has been doing things for 5 minutes.

Also interestingly, Cloud Hypervisor couldn't use MAP_PRIVATE for this because it breaks VFIO/vhost-user bindings. Firecracker's simpler device model is nice for cases like this.

[1] https://www.shayon.dev/post/2026/65/linux-page-faults-mmap-a...
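The fault-storm point can be made concrete with a toy model. This is plain Python and has nothing to do with real userfaultfd mechanics; it only illustrates the accounting: with lazy restore, every first touch of a page costs one round-trip to the handler, so a resume where vCPUs touch a large working set at once hammers the handler, while a fresh fork with a small working set (optionally prefaulted) does not. `LazyPageStore` and its methods are invented names for the sketch.

```python
# Toy model of on-demand ("lazy") restore -- NOT real userfaultfd, just an
# illustration of handler load: every first touch of a page costs one
# round-trip to the handler, which is where the resume fault storm comes from.
class LazyPageStore:
    def __init__(self, snapshot):
        self.snapshot = snapshot  # page number -> page bytes (the saved image)
        self.resident = {}        # pages already faulted in
        self.faults = 0           # handler round-trips so far

    def read(self, page):
        if page not in self.resident:
            self.faults += 1      # would be a uffd wakeup + page copy
            self.resident[page] = self.snapshot[page]
        return self.resident[page]

    def prefault(self, pages):
        # Eagerly populate a known-small working set before resuming vCPUs.
        for p in pages:
            self.read(p)

snapshot = {p: bytes([p % 256]) * 4096 for p in range(64)}
store = LazyPageStore(snapshot)
store.prefault(range(8))  # small working set, e.g. a freshly forked VM
store.read(3)             # already resident: no extra handler round-trip
```

After this runs, `store.faults` is 8, not 64: only the prefaulted working set hit the handler. The "VM that has been doing things for 5 mins" case is the opposite extreme, where the resident set is large and scattered, so resume turns into many concurrent handler round-trips, made worse if each one is a network fetch rather than a local mmap read.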


Great writeup, bookmarked. The fault storm point is interesting -- our forks are short-lived (execute and discard) so the working set is small, but for longer-running sandboxes that would absolutely be a problem.

That's a good shout. I will update/port it back from here, and quite fast: https://github.com/cloud-hypervisor/cloud-hypervisor/pull/78...


This is going to be an interesting space to watch, I think, and a big part of offering sandbox-as-a-service for enterprise and SaaS needs.


Yeah, it's hard to hit the right balance with nuance around these, and you're spot on. What I meant to get at was the specific difference in default modes: gVisor's systrap intercepts syscalls via seccomp traps and handles them entirely in a user-space Go kernel, so there's no hardware isolation boundary in the memory/execution sense. A microVM puts the guest in a VT-x/EPT-isolated address space, which is a qualitative difference in what enforces the boundary (perhaps?).

Whereas, yeah, you can run gVisor in KVM mode, where it does use hardware virtualization, and at that point the isolation boundary is much closer to a microVM's. I believe the real difference then becomes what's on either side of that boundary: gVisor gives you a memory-safe Go kernel making ~70 host syscalls, while a microVM gives you a full guest Linux kernel behind a minimal VMM. So, at least in my mind, it comes down to different trust chains, not necessarily one being strictly stronger than the other.


I see this "hardware isolation" benefit of virtual machines brought up a lot, but if you look a little deeper into it, putting that label exclusively on VMs is very much unfair.

Just like containers, VMs are very loosely defined and, under the hood, composed of mechanisms that can be used in isolation (paging, trapping, IOMMU vs individual cgroups and namespaces). It's those mechanisms that give you the actual security benefits.

And most of them are used outside of VMs, to isolate processes on a bare kernel. The system call/software interrupt trapping and "regular" virtual memory of gVisor (or even a bare Linux kernel) are just as much of a "hardware boundary" as hypercalls and SLAT virtual memory are in the case of VMs, just without the hacks needed to make the isolated side believe it's in control of real hardware. One traps into Sentry, the other traps into QEMU, but ultimately both are user-space processes running on the host kernel. And they themselves are isolated, using the very same primitives, by the host kernel.

As you clarified here, the real difference lies in what's on the other side of these boundaries. gVisor will probably have some more overhead, at least in systrap mode, as every trapped call has to go through the host kernel's dispatcher before landing in Sentry. QEMU/KVM has the benefit of letting the guest's user space call the guest kernel directly; typically only the guest kernel then calls out to QEMU. The attack surface, too, differs a lot between the two: gVisor is a niche Google project, while KVM is a business-critical component of many public cloud providers.

It may sound like I'm nitpicking, but I believe it's important to understand this to make an informed decision and avoid the mistake of stacking up useless layers, a mistake that plagues today's software engineering.

Thanks for your reply and post by the way! I was looking for something like gVisor.


Heya! Nice to see you here. In retrospect, it feels like CI companies and environments are very well suited for sandboxes, since a lot of the problems overlap: ephemeral workloads, running untrusted code, fast cold starts, multi-tenancy isolation. Also, I loved Buildkite at a past job! Looking forward to following cleanroom.


It touches, in the gVisor section, on the trade-off that gVisor's surface area is smaller. There are trade-offs. It's not dishonest.


That's a good shout! I have been curious as well and did some experiments. I also left GPU sandboxing out of the post. Maybe I will cover both in a part II.


Wasmer looks very cool. I must check it out.

