The implementation makes some weird choices, like rebuilding a bunch of services (DNS, certs) and taking an odd dependency on SQLite. Wish people would stop reimplementing Kubernetes and just build on top of it.
I think "per-user" is probably the wrong killer feature for something like this. Much more potential in shared distributed processes that support multiple users (chat, CRDT/coauthoring). Appears that the underlying layer can probably do that.
In any case, super cool idea, and I hope something like this lands in the serverless platforms from all the major cloud providers. It's always been mind blowing to me that Google Cloud Functions supports websockets without allowing you to route multiple incoming connections from different users to the same process. That simple change would unlock so many useful scenarios.
Thanks for taking the time to look through the architecture. There are definitely some choices that would have seemed weird to me when we set out to build this, but that we did not make lightly.
We actually initially built this on Kubernetes, twice. The MVP was Kubernetes + nginx where we created pods through the API and used the built-in DNS resolver. The post-MVP attempt fully embraced k8s, with our own CRD and operator pattern. It still exists in another branch of the repo[1].
Our decision to move off came because we realized we cared about a different set of things than Kubernetes did. For example, cold start time generally doesn’t matter that much to a stateless server architecture (k8s’ typical use), but is vital for us because a user is actively waiting on each cold start. Moving away from k8s let us own the scheduling process, which helped us reduce cold start times significantly. There are other things we gain from it, some of which I’ve talked about in this comment tree[2]. I will say, it seemed like a crazy decision when I proposed it, but I have no regrets about it.
The point of sqlite was to allow the “drone” version to be updated in place without killing running backends. It also allows (but does not require) the components of the drone to run as separate containers. I originally wanted to use LMDB, but landed on sqlite. It’s a pretty lightweight dependency, it provides another point of introspection for a running system (the sqlite cli), and it’s not something people otherwise have to interact with. I wrote up my thought process for it at the time in this design doc[3].
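The update-in-place idea can be sketched roughly like this: backend state lives in a SQLite file rather than in the drone's memory, so a replacement drone process can reopen the file and adopt the backends that are still running. This is a minimal illustration only; the table layout, file name, and fields here are hypothetical, not Plane's actual schema.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema: one row per running backend, persisted outside
# the drone process so an upgraded drone binary can take over without
# killing the containers.
db_path = os.path.join(tempfile.mkdtemp(), "drone.db")

conn = sqlite3.connect(db_path)
conn.execute(
    """CREATE TABLE IF NOT EXISTS backends (
           backend_id   TEXT PRIMARY KEY,
           container_id TEXT NOT NULL,
           address      TEXT NOT NULL,
           state        TEXT NOT NULL
       )"""
)
conn.execute(
    "INSERT OR REPLACE INTO backends VALUES (?, ?, ?, ?)",
    ("backend-abc", "ctr-123", "10.0.0.5:8080", "running"),
)
conn.commit()
conn.close()  # the old drone process exits here

# A freshly started drone (or the sqlite3 CLI, for introspection)
# reopens the same file and sees every backend still running.
new_conn = sqlite3.connect(db_path)
rows = new_conn.execute("SELECT backend_id, state FROM backends").fetchall()
print(rows)  # [('backend-abc', 'running')]
```

The same file doubles as the introspection point mentioned above: `sqlite3 drone.db 'SELECT * FROM backends'` works on a live system with no extra tooling.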
You’re right about shared backends among multiple users being supported by Plane. I use per-user to convey that we treat container creation as so cheap and ephemeral you could give one to every user, but users can certainly share one and we’ve done that for exactly the data sync use case you describe.
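The per-user vs. shared distinction comes down to what key you route on, which can be sketched in a few lines. Everything here is illustrative pseudocode of the routing idea, not Plane's actual API: a per-user key gives every user their own backend, while a shared key (say, a document id) sends multiple users' connections to one process.

```python
# Map from routing key to a live backend. In a real system the value
# would be a container address; here it's a stand-in string.
backends: dict[str, str] = {}

def connect(key: str) -> str:
    """Return the backend for `key`, 'cold starting' one on first use."""
    if key not in backends:
        backends[key] = f"backend-for-{key}"  # stand-in for a container spawn
    return backends[key]

# Per-user keys: each user gets an isolated backend.
assert connect("user:alice") != connect("user:bob")

# Shared key: everyone opening the same document lands on one backend,
# which is what chat / CRDT coauthoring needs.
assert connect("doc:42") == connect("doc:42")
```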
Hi Paul, thanks for your explanation - you should add that to the documentation, e.g. in a chapter "Why not K8S?".
Also, you should give some advice on how to deploy when the default for deploying apps in an organization is K8S, which is probably not too exotic a situation nowadays.
Will Plane need its own cluster? Does it run on top of K8S? What is the relationship to K8S in general in a deployment scenario?
Good idea on both counts. Documentation will be one of my priorities over the coming months and it’s great to have feedback on what’s missing.
Re. the “Why not k8s” question, you might enjoy this post from a couple months back; although it only touches on Plane briefly, it shows the framework we used to make the decision. https://driftingin.space/posts/complexity-kubernetes
Knative has solved most of those pod start time problems since it’s dealing with a similar scenario, unless 0.008s startup time isn’t good enough for you.
I don't think I have ever read something negative about SQLite.
I also don't read the GP comment as being negative toward SQLite. It sounds more like the author was surprised about the architecture, since a naive view would think Kubernetes would be good enough.
I think you would get a much higher long term payoff with a custom scheduler. Dask does something like this. Both on scheduling and when it has to "Drain".
We considered that approach, but even plugging in a scheduler would not allow us to own scheduling end-to-end; there’s still latency introduced by having events go through etcd. In the end the complexity of kubernetes got in our way, and we realized we were using it as a glorified OCI runtime API, so we decided to cut it out.