The implementation makes some weird choices, like rebuilding a bunch of services (DNS, certs) and taking an odd dependency on SQLite. Wish people would stop reimplementing Kubernetes and just build on top of it.
I think "per-user" is probably the wrong killer feature for something like this. Much more potential in shared distributed processes that support multiple users (chat, CRDT/coauthoring). Appears that the underlying layer can probably do that.
In any case, super cool idea, and I hope something like this lands in the serverless platforms from all the major cloud providers. It's always been mind blowing to me that Google Cloud Functions supports websockets without allowing you to route multiple incoming connections from different users to the same process. That simple change would unlock so many useful scenarios.
Thanks for taking the time to look through the architecture. There are definitely some choices that would have seemed weird to me when we set out to build this, but that we did not make lightly.
We actually initially built this on Kubernetes, twice. The MVP was Kubernetes + nginx where we created pods through the API and used the built-in DNS resolver. The post-MVP attempt fully embraced k8s, with our own CRD and operator pattern. It still exists in another branch of the repo[1].
Our decision to move off came because we realized we cared about a different set of things than Kubernetes did. For example, cold start time generally doesn’t matter that much to a stateless server architecture (k8s’ typical use), but is vital for us because a user is actively waiting on each cold start. Moving away from k8s let us own the scheduling process, which helped us reduce cold start times significantly. There are other things we gain from it, some of which I’ve talked about in this comment tree[2]. I will say, it seemed like a crazy decision when I proposed it, but I have no regrets about it.
The point of sqlite was to allow the “drone” version to be updated in place without killing running backends. It also allows (but does not require) the components of the drone to run as separate containers. I originally wanted to use LMDB, but landed on sqlite. It’s a pretty lightweight dependency, it provides another point of introspection for a running system (the sqlite cli), and it’s not something people otherwise have to interact with. I wrote up my thought process for it at the time in this design doc[3].
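The update-in-place idea can be sketched roughly like this: backend state lives in a SQLite file rather than in the drone's memory, so a replacement drone process can reopen the file and adopt the backends that are still running. This is a minimal illustration only; the table layout, file name, and fields here are hypothetical, not Plane's actual schema.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema: one row per running backend, persisted outside
# the drone process so an upgraded drone binary can take over without
# killing the containers.
db_path = os.path.join(tempfile.mkdtemp(), "drone.db")

conn = sqlite3.connect(db_path)
conn.execute(
    """CREATE TABLE IF NOT EXISTS backends (
           backend_id   TEXT PRIMARY KEY,
           container_id TEXT NOT NULL,
           address      TEXT NOT NULL,
           state        TEXT NOT NULL
       )"""
)
conn.execute(
    "INSERT OR REPLACE INTO backends VALUES (?, ?, ?, ?)",
    ("backend-abc", "ctr-123", "10.0.0.5:8080", "running"),
)
conn.commit()
conn.close()  # the old drone process exits here

# A freshly started drone (or the sqlite3 CLI, for introspection)
# reopens the same file and sees every backend still running.
new_conn = sqlite3.connect(db_path)
rows = new_conn.execute("SELECT backend_id, state FROM backends").fetchall()
print(rows)  # [('backend-abc', 'running')]
```

The same file doubles as the introspection point mentioned above: `sqlite3 drone.db 'SELECT * FROM backends'` works on a live system with no extra tooling.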
You’re right about shared backends among multiple users being supported by Plane. I use per-user to convey that we treat container creation as so cheap and ephemeral you could give one to every user, but users can certainly share one and we’ve done that for exactly the data sync use case you describe.
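The per-user vs. shared distinction comes down to what key you route on, which can be sketched in a few lines. Everything here is illustrative pseudocode of the routing idea, not Plane's actual API: a per-user key gives every user their own backend, while a shared key (say, a document id) sends multiple users' connections to one process.

```python
# Map from routing key to a live backend. In a real system the value
# would be a container address; here it's a stand-in string.
backends: dict[str, str] = {}

def connect(key: str) -> str:
    """Return the backend for `key`, 'cold starting' one on first use."""
    if key not in backends:
        backends[key] = f"backend-for-{key}"  # stand-in for a container spawn
    return backends[key]

# Per-user keys: each user gets an isolated backend.
assert connect("user:alice") != connect("user:bob")

# Shared key: everyone opening the same document lands on one backend,
# which is what chat / CRDT coauthoring needs.
assert connect("doc:42") == connect("doc:42")
```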
Hi Paul, thanks for your explanation - you should add that to the documentation, e.g. in a chapter "Why not K8S?".
Also, you should give some advice on how to deploy when the default for deploying apps in an organization is K8S, which is probably not too exotic a situation nowadays.
Will Plane need its own cluster? Does it run on top of K8S? What is the relationship to K8S in general in a deployment scenario?
Good idea on both counts. Documentation will be one of my priorities over the coming months and it’s great to have feedback on what’s missing.
Re. the “Why not k8s” question, you might enjoy this post from a couple months back; although it only touches on Plane briefly, it shows the framework we used to make the decision. https://driftingin.space/posts/complexity-kubernetes
Knative has solved most of those pod start time problems since it’s dealing with a similar scenario, unless 0.008s startup time isn’t good enough for you.
I don't think I have ever read something negative about SQLite.
I also don't read the GP comment as being negative toward SQLite. It sounds more like the author was surprised about the architecture, since a naive view would think Kubernetes would be good enough.
I think you would get a much higher long term payoff with a custom scheduler. Dask does something like this. Both on scheduling and when it has to "Drain".
We considered that approach, but even plugging in a scheduler would not allow us to own scheduling end-to-end; there’s still latency introduced by having events go through etcd. In the end the complexity of kubernetes got in our way, and we realized we were using it as a glorified OCI runtime API, so we decided to cut it out.