Stateful Apps on Kubernetes: A quick primer

bajsejohannes · on May 1, 2018

I would recommend against running stateful apps in kubernetes. It's not really ready for it. Big problems include routing (it works fine for http requests, but not for DBs, message brokers, etc) and just the pain of setting up stateful sets.

If you don't believe me, take it from someone who should know what they're talking about: https://twitter.com/kelseyhightower/status/96341350830081229...

atombender · on May 2, 2018

We run stateful apps on Kubernetes. There are obvious rough areas (the lack of persistent volume resizing, for example, which is scheduled for 1.11), but overall, it's great.

What a lot of naysayers leave out, or choose to ignore, is that the challenges running stateful apps on Kubernetes mirror those of running stateful apps anywhere. If you run Postgres on a VM, for example, you're completely reliant on that VM staying up -- this is no different from Kubernetes. Some will also point out the dangers of co-locating lots of software (such as Postgres) on the same machine as many other containers, as they will compete for CPU and I/O; but this is also no different than on Kubernetes, which provides plenty of tools (affinities/anti-affinities, node selectors) to isolate containers to machines. And so on. Containers bring some new challenges, but Kubernetes meets them quite well.

What specific issues do you have? I'm not sure I understand the point about routing. I also don't understand what the "pain" of stateful sets refers to.

FBISurveillance · on May 2, 2018

The the original commenter but I'll jump in:

1. While "we already rely on VM staying up", with k8s we reply on both VM staying up and kubernetes infra on top of that VM staying UP 2. Maintaining a complex stateful system on k8s _requires_ having and maintaining an operator for that system. 3. You reduce your options when it comes to tweaking systems, e.g. local SSDs on GCP are available in SCSI and NVMe flavors, while GKE supports only SCSI; harder to perform fine-tuning and other tasks on the underlying VMs that would have been trivial with Chef or similar. 4. Enterprise systems like Splunk explicitly mention that their support does not cover Splunk clusters running on kubernetes. 5. As mentioned, you can't even resize a disk without going through dance of operations that would take days or weeks when you're working something like Kafka at scale. 6. Some stateful services like Zookeeper require stable identities and this is far from perfect on kubernetes. 7. More complex traffic routing that involves additional fees because to achieve (6) you sometimes need to expose things publicly.

That's just from the top of my head.

Disclaimer: We run 10+ stateful services on Kubernetes.

williamstein · on May 2, 2018

I've used Kubernetes a massive amount during the last two years for running stateful apps. In contrast, I do recommend it. Yes, it is challenging, since stateful app are. However, the challenges are all well worth solving in the context of Kubernetes (great benefits from health checks, automated reproducible deployments, etc.). The situation is pretty good these days in my experience; at least, a lot better than 2 years ago!

bajsejohannes · on May 2, 2018

Good point about it getting better. A lot of the pain was from trying to do it before 1.8.

loiselleatwork · on May 2, 2018

Author of post here and CRL employee: just for some additional detail, we reached out to Kelsey about the problems he's seen running databases in Kubernetes.

https://twitter.com/kelseyhightower/status/96347131657256140...

He said "You still need to worry about database backups and restores. You need to consider downtime during cluster upgrades."

These things are totally true. K8s doesn't automate backups (edit: by default; though, it can) and if you need to take K8s down for upgrades, then everything is down. For its part, though, CockroachDB supports rolling upgrades with no downtime on Kubernetes.

As for routing, that is tough problem if you want to run K8s across multiple regions, though we have some folks who've done it.

And if one finds setting up StatefulSets challenging, we have a tutorial on how to do it written by a former Kubernetes engineer: https://www.cockroachlabs.com/docs/stable/orchestrate-cockro...

hardwaresofton · on May 2, 2018

Note that none of those things are impossible on kubernetes, k8s just doesn't offer them by default (which is good IMO).

There are projects that help you run databases in kubernetes and also make backups of many things hosted:

- Automatic CephFS for your cluster -> https://rook.io/docs/rook/master/

- Backups for cluster resources and volumes -> https://github.com/heptio/ark

- Spin up dynamic postgres clusters -> https://github.com/zalando-incubator/postgres-operator

Databases are just applications with different resource needs. Please stop pushing forward the notion that they can't be run in containers or container orchestration systems. Databases are just programs. If the substrate for running your containers doensn't reliably support flock or fsync or something your database needs, then maybe pick a better substrate that does -- container runtimes these days and kubernetes don't stand in your way these days.

merb · on May 2, 2018

Well with k8s 1.10+ it's also possible to use statefulsets and local volume, so with affinity it's possible to just use k8s as an orchestration system where you "install" your database and keep it up to date with k8s. of course if a node goes down you need to failover, etc. but patroni/zalando postgres works really well with statefulsets and local volume. (as long as a single node is still running, which should always be the case...) (https://kubernetes.io/blog/2018/04/13/local-persistent-volum...)

hardwaresofton · on May 2, 2018

I want to note that this has actually been possible since like k8s 1.7, you can just start a DB with node affinity and use hostPath volumes.

That's what I was doing until rook came around. If you're running in something like AWS (or even if you're not), you can also do something like attach an EBS volume (to a local host) and do that. Or, you can set up a plain ISCSI drive (or get one from your provider, even basic providers these days might offer storage that way) and use that.

There have been a LOT of choices for a long time.

merb · on May 2, 2018

> even basic providers these days might offer storage that way

my "bare metal" provider doesn't ;)

AndyNemmity · on May 2, 2018

We run many, many stateful apps on kubernetes. Not without challenges certainly, but I am not sure any of them are really kubernetes specific.

They just don't act like other services, and require more care. That's about it. I think that's what Kelsey is referring to, you can't just treat them the same as other pods.

rrdharan · on May 1, 2018

> Because Kubernetes itself runs on the machines that are running your databases, it will consume some resources and will slightly impact performance. In our testing, we found an approximately 5% dip in throughput on a simple key-value workload.

5% seems like a surprisingly large overhead. What is k8s doing in this situation that would have that kind of impact?

smarterclayton · on May 1, 2018

CPU Cache contention, network overhead introduced by Kubernetes service proxy model, even the liveness checks.

We haven’t yet evolved Kubernetes services to prefer specific cores and avoid app workloads quite yet (although cpu management is getting closer).

Docker is also somewhat hefty memory wise and you may contend on disk if not careful.

5% seems pretty reasonable to me in general, just as a consequence of having something heavier weight on the same node managing workloads.

a-robinson · on May 2, 2018

Yeah, it appeared to just be general resource contention from having to share the machine -- CPU interrupts, less memory available, etc.

I'll note, though, that the 5% number is when using host networking for both Cockroach and the client load generator. Using GKE's default cluster networking through the Docker bridge is closer to 15% worse than running directly on equivalent non-Kubernetes VMs.

nimos · on May 1, 2018

Hard to say without knowing what they ran on what. There is a non trivial amount of memory that gets eaten up by the various k8s processes, docker and networking if you are using small nodes. I have a completely empty k8s cluster up right now with 1 worker and 1 master and the worker has about ~230 mb of ram used up.

stefanatfrg · on May 2, 2018

I'd like to know how to solve the storage dilution problem with stateful apps in k8s where you have to buy 3-18x more raw capacity than desired to meet availability & durability guarantees.

For example if you ran CDB on a baremetal cluster of 3 nodes with 30TB of raw capacity, 15TB is lost to RAID10, 10TB is lost to running a replicated database such as cockroach DB, leaving you with 5TB effective capacity which is a 1/6 dilution of your initial capacity.

If you ran cockroach DB on a replicated network volume, with a replication factor of three, it gets worse. If you bought 30 TB of disks, you'd lose 20 TB to volume replication, ~6.67TB to CDB replication leaving you with 3.3TB of effective capacity or a 1/9 dilution. If those disks were configured with RAID your effective capacity would drop to a 1/18 dilution.

You could achieve a 1/3 dilution which is the effective minimum for a replicated database if you didn't configure RAID, but you increase the impact of disk failure, in that it would take much much longer to recover a cluster.

lowbloodsugar · on May 1, 2018

>Given its pedigree of literally working at Google-scale

I understood that a team at google developed k8s but google doesn't actually run it for their "google-scale" workloads. Am I misinformed?

bajsejohannes · on May 1, 2018

You are correct.

> [kubernetes is] a simplified clone of Google’s internal borg system

https://medium.com/@steve.yegge/honestly-i-cant-stand-k8s-48...

outside1234 · on May 1, 2018

That said, Google, to my understanding, does run a completely containerized infrastructure internally, including databases and other stateful things, so it is not wildly off to suggest running a database on Kubernetes.

puzzle · on May 2, 2018

Before it ran on F1/Spanner, Adwords ran on sharded MySQL on Borg. Not for its entire early life, but for quite a few years. Later in life, Checkout and maybe Wallet ran on MySQL on Borg, too. So did YouTube, which used Vitess, a sharding layer (now ported to Kubernetes and open sourced).

Bigtable, Spanner and even Colossus/D run in containers on Borg.

tayo42 · on May 2, 2018

How is colossus running in a container in borg? That's their network file system that makes running databases in containers possible?

puzzle · on May 2, 2018

There's a bootstrapping issue, of course, but that's how it works.

And it's even crazier than what you're picturing. What if I told you that Bigtable runs on top of Colossus, but Colossus itself stores metadata about files, including Bigtable's files, in... Bigtable? It's really turtles all the way down, the last of which is, luckily, Chubby.

Some of the details here: http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...

kyrra · on May 2, 2018

Packaging was (is?) a bit different, because it existed before Docker existed.

Pdf of the presentation: https://www.usenix.org/sites/default/files/conference/protec...

And a recording of the talk: https://www.usenix.org/conference/lisa14/conference-program/...

puzzle · on May 2, 2018

Googlers, including the ones working on Kubernetes, like to pick on MPM, but that's probably because they haven't run major services deployed from Docker images.

puzzle · on May 2, 2018

Kubernetes might be simpler than Borg in so many ways (let me count the ones I care about...), but it also has better features that Borg did not implement (labels and selectors) or that are offered only by some other internal services, which obviously are configured through very different mechanisms (ingress).

daxfohl · on May 2, 2018

Wow, I'd never heard of Fargate until this. It seems so somewhat obvious now for things that Lambda doesn't fit. Heroku for containers, not VMs.

manigandham · on May 2, 2018

There's also Azure Container Instances: https://azure.microsoft.com/en-us/services/container-instanc...

And Google has AppEngine, now with Flexible Environment that can run containers: https://cloud.google.com/appengine/docs/flexible/dotnet/quic...

atombender · on May 2, 2018

It's my understanding, based on comments by googlers here on HN, that Google does run a bunch of apps on GKE. We don't know about which apps, but it's not surprising that they want to dogfood their own cloud platform.

daxfohl · on May 1, 2018

Has anyone looked at Service Fabric (Microsoft tech) for things like this? That has offered stateful services for years now. I'm pretty sure it runs on Linux, and I've seen that it's Docker compatible. I know it's kinda in the same space as K8s but I don't really know the details. Would SF be able to do something like this in a similar (or better?) way?

zapita · on May 2, 2018

It's complicated, because the definition of Service Fabric seems to be in flux.

The "original" Service Fabric is a high-level framework which requires invasive source code changes (you can't just drop an existing app on top of it), but gives you lots of benefits (scale, reliability etc) if you make the effort.

Recently container-based platforms - Docker, Kubernetes, etc - have come along with a different tradeoff: better compatibility with existing applications in exchange for less magical benefits. That approach is getting much more traction, and I think internally at Microsoft there is some infighting between the "Service Fabric camp" and the "Containers camp". One consequence of the infighting is that Service Fabric is extending its scope to include features like "container support". It's not clear to what extent that is done in collaboration with the "container people", or as a way to bypass them. I think they are still trying to decide whether to embrace Kubernetes, or replicate the functionality in-house. My prediction is that the container-based approach will win, but if will take time for the politics to fully play out. In the meantime things will continue to be confusing.

Bottom line: when evaluating Service Fabric, watch out for confusing and inconsistent use of the brand. It's a common pattern with large vendors - for example IBM with "Bluemix", SAP with "Hana", etc.

daxfohl · on May 2, 2018

Okay that's about what it looked like to me too. There's only so many magic words you can throw at a tech and expect it to work together happily. Looking into it, the stateful service side of SF doesn't seem particularly compatible with the container side of it. A stateful service is a stateful SF service, and a container service is its own thing. Maybe there's a way to plug them together but unfortunately I didn't see it.

jrbancel · on May 2, 2018

Disclaimer: I work at Microsoft, not on Service Fabric but I have built complex stateful services on top of Service Fabric.

As zapita said, Service Fabric now handles containers but I think it is just because containers became trendy and FOMO kicked in.

Where Service Fabric is decades ahead of the container orchestration solutions is as a framework to build truly stateful services, meaning the state is entirely managed by your code through SF, not externalized in a remote disk, Redis, some DB, etc...

It offers high level primitives like reliable collections [0], as well as very low level primitives like a replicated log to implement custom replication between replicas [1]. I feel that publicly this is not advertised enough and it is unfortunate because it is a key differentiator for Service Fabric that the competitors won't have for a while, if ever because it is a completely opposite approach: containers are all about isolation, being self-contained and plateform independent while SF stateful services are deeply integrated with Service Fabric.

[0] https://docs.microsoft.com/en-us/azure/service-fabric/servic...

[1] https://docs.microsoft.com/en-us/dotnet/api/system.fabric.fa...

tapirl · on May 2, 2018

Are there any cloud providers providing remote disks without replications? It looks such needs are popular for deploying databases in which replications are maintained by the databases themselves.

jen20 · on May 2, 2018

That's effectively what EBS is, no?

tapirl · on May 2, 2018

I have the impression that EBS does replication automatically, is it wrong?