Hacker News | slpnix's comments

Podman has supported this for quite a while via the krun variant of the crun runtime (https://github.com/containers/crun/blob/main/krun.1), provided in Fedora by the "crun-krun" package.

Just add "--runtime=krun" to your podman command line alongside the other arguments and you'll get the container running inside a VM powered by libkrun.
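A minimal sketch of what that looks like, assuming the "crun-krun" package is installed; the image and command are just examples:

```shell
# Run a container inside a libkrun microVM instead of a plain
# namespaced process; everything else works like a normal podman run.
podman run --rm --runtime=krun fedora uname -a
```

Inside the VM, uname reports the guest kernel rather than the host's, which is an easy way to confirm you're actually running under libkrun.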


Behind the scenes, it's using QEMU with Alex Graf's patches for hvf (Hypervisor.framework) support, so it's virtualization, not emulation. In other words, the performance is really good ;-)

BTW, in case you don't want to depend on a fork, upstream podman is going to gain M1 support (in the sense of 'podman-machine' knowing how to start aarch64 VMs with hvf) very soon.


I can't comment on the comparison with Docker for Mac because, honestly, I have never used it.

With krunvm, each session initiated with "krunvm start" is an independent lightweight VM. The maximum amount of RAM the VM can use is configured with the "mem" flag (or it uses what's configured by default), but the VMM will always try to use the minimum possible amount of RAM by returning the pages the guest is no longer using to the host (virtio-balloon's free page reporting feature).
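A rough sketch of that workflow; the exact flags may differ between krunvm versions, so check "krunvm --help" (the image, VM name, and sizes here are illustrative):

```shell
# Create a lightweight VM from an OCI image, capping it at 1 GiB of RAM.
# Below that cap, the balloon device's free page reporting returns
# unused pages to the host.
krunvm create fedora --name testvm --cpus 2 --mem 1024

# Each "krunvm start" session is an independent lightweight VM.
krunvm start testvm
```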

That said, you can also start a single VM (using "krunvm start"), run podman's service ("podman system service...") inside it, switch to another terminal in macOS and execute multiple containers inside the VM using "podman remote" [1]. Now I'm thinking I should probably write a tutorial about this option ;-)
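Until that tutorial exists, a hedged sketch of the idea (the service address syntax has changed across podman versions, and the VM address is a placeholder you'd fill in yourself):

```shell
# Inside the single VM: expose the Podman API over TCP, with no
# inactivity timeout.
podman system service --time=0 tcp:0.0.0.0:8080

# From another macOS terminal: drive that VM remotely, running as many
# containers inside it as you like. <vm-address> is hypothetical.
podman --remote --url tcp://<vm-address>:8080 run --rm fedora echo hello
```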

[1] https://www.redhat.com/sysadmin/podman-clients-macos-windows


The key is that libkrun (https://github.com/containers/libkrun), the library that krunvm uses for running the VMs, has recently integrated support for Hypervisor.framework on ARM64, in addition to KVM.

As for buildah, the Homebrew repo contains a build that includes this PR (https://github.com/containers/storage/pull/811).


krunvm uses libkrun (https://github.com/containers/libkrun) for executing the VM, and while the latter is also based on rust-vmm and shares some code with Firecracker and Cloud-Hypervisor, it's specialized for the process isolation use case. This means it implements a different set of devices (most notably, virtio-fs instead of virtio-blk, and virtio-vsock+TSI (Transparent Socket Impersonation) instead of virtio-net), and it takes the form of a dynamic library instead of a final binary.

In fact, the networking limitations are caused by this use of virtio-vsock+TSI. TSI (WIP implementation here: https://github.com/containers/libkrunfw/blob/main/patches/00...) is an experimental mechanism that provides inbound and outbound networking capabilities to the guest, with zero configuration and minimal footprint, by transparently replacing user-space AF_INET sockets with AF_TSI ones, which have both an AF_INET and an AF_VSOCK personality.

TSI has the additional advantage that, on the host side, all connections appear to originate from and terminate at the process acting as the VMM (in this case, krunvm, as it links directly with libkrun), which makes it very container-friendly, to the point that even sidecars (such as Istio) work out of the box.


That's correct. The initial versions of the microvm patch series did require KVM, but the one that got upstreamed does work with TCG [1], thanks to feedback from the QEMU maintainers.

That said, I'm not sure which use cases would benefit from running it this way, as the performance won't be amazing. I find TCG acceleration mainly useful for debugging and for emulating foreign architectures.

[1] https://wiki.qemu.org/Documentation/TCG
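For the curious, a sketch of booting a microvm guest under TCG; the kernel image, rootfs, and append line are illustrative and would need to be supplied:

```shell
# Boot a microvm machine with TCG (no KVM required), e.g. for debugging.
# microvm uses virtio-mmio, hence the -device suffix instead of -pci.
qemu-system-x86_64 \
    -M microvm -accel tcg -m 512m -no-reboot \
    -kernel vmlinux -append "console=ttyS0 root=/dev/vda" \
    -drive id=root,file=rootfs.img,format=raw,if=none \
    -device virtio-blk-device,drive=root \
    -nodefaults -no-user-config -nographic -serial stdio
```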


From the guest perspective, the differences are minimal. Even boot time of the guest (thinking about a custom-built minimalist Linux kernel here) is roughly the same.

On the host side, things are more interesting. Firecracker has a smaller TCB (Trusted Computing Base), is written in Rust, and is statically linked. On the other hand, QEMU provides more features (especially in the block layer, with more formats, network-based block devices, asynchronous I/O...), can be configured at build time to adapt it to a particular use case, and has a pretty good security record.

In the end, KVM userspace VMMs (Virtual Machine Monitors) are learning from each other, giving users more options to choose from. Everybody wins.


> In the end, KVM userspace VMMs (Virtual Machine Monitors) are learning from each other, giving users more options to choose from. Everybody wins.

Indeed. Nice to see that the cross-pollination is happening.

For folks interested in what can be accomplished with userspace VMMs, a very minimalist example is the Solo5 project (https://github.com/Solo5/solo5), specifically the 'hvt' tender.


> QEMU... has a pretty good security record

That's an interesting and, I would argue, contrarian take?

https://www.theregister.co.uk/2017/01/30/google_cloud_kicked...

"QEMU has a long track record of security bugs, such as VENOM, and it's unclear what vulnerabilities may still be lurking in the code."


I think slide 14 of the talk "Reports of my Bloat Have Been Greatly Exaggerated" [1], presented by Paolo Bonzini at KVM Forum 2019, gives some good perspective on QEMU's security track record:

"Of the top 100 vulnerabilities reported for QEMU:

- 65 were not guest exploitable

- 3 were not in QEMU :)

- 5 did not affect x86 KVM guests

- 3 were not related to the C language

- Only 6 affected devices normally used for IaaS

The most recent of these 6 was reported in 2016"

The rest of this talk was also very interesting. I encourage everyone with 10 minutes to spare and an interest in VMMs to take a look at the slides.

[1] https://static.sched.com/hosted_files/kvmforum2019/c6/kvmfor...


> "Of the top 100 vulnerabilities reported for QEMU:

> - 65 were not guest exploitable

> [...]

Which leaves about 30 that presumably were guest exploitable.

Don't get me wrong -- QEMU is useful. As a "kitchen sink" solution that runs anything, anywhere, with any useful combination of emulated {devices,processors,systems}.

However, this is also its biggest weakness, which is why Google and Amazon both run their own custom VMMs for their IaaS services.

The microvm machine type as described here is a great step to improve this situation. The next step in my book would be to reconfigure QEMU's build system to allow building a binary that only supports the devices provided by microvm, and nothing else.
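As a rough sketch of what that build could look like today (configure options vary across QEMU versions, so treat these flags as an assumption and check "./configure --help"):

```shell
# Build a single-target QEMU with no default devices or optional
# features compiled in; microvm support is then pulled in through
# QEMU's device Kconfig rather than by compiling everything.
./configure \
    --target-list=x86_64-softmmu \
    --without-default-devices \
    --without-default-features
make -j"$(nproc)"
```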


> "Of the top 100 vulnerabilities reported for QEMU:

> - 3 were not related to the C language

wow

