>> How would you make it so that printf doesn't crash your entire program if the console hangs?
Why would the console hang? If that happened, it would signal a major issue at the OS level, probably unrelated to your application (unless you're trying to log an extremely massive string, which is a bad idea, and you'd probably have run out of memory before that could happen). I have never seen the console/stdout hang in production, and I've built some pretty high-traffic distributed systems with Node.js.
>> How would you isolate a single codebase's IO operations?
What sort of IO operations are we talking about? Network? Disk? There are many ways to inspect different kinds of IO, and the Node.js ecosystem offers a large number of modules that would let you do it.
>> How would writing to console work in a multi-threaded environment? Multi-server? 100 servers?
Node.js is perfect for running on Kubernetes. There are many K8s dashboards and tools that let you browse and aggregate logs from thousands of machines with very little effort. I don't see how this point has anything to do with Erlang specifically; a language-agnostic container orchestrator like Kubernetes is a better bet than a tool that only works with one specific language.
There are Node.js frameworks that ship Kubernetes .yaml files and CLI tools which let you deploy a highly scalable cluster to K8s in a few minutes.
Can we all take a step back here for a second and realize that Node.js is built on a language/event model that was purpose-built to handle client-side browser operations and has since been expanded into the server space, while Erlang/OTP/BEAM was built and designed for running reliable, soft real-time, distributed telecom systems?
By default (and unless you go out of your way with web workers), the JavaScript event loop (and thus Node's event loop) is single-threaded. To work around this, you can bind many Node processes to a given port to load balance requests (SO_REUSEPORT, anyone?), or simply run many smaller instances of Node (perhaps in a bunch of containers) where traffic ingresses via some form of load balancer; a minimal sketch of the shared-port pattern follows below. The load balancing problem is unavoidable, and you will definitely need the same thing if you want to send requests to multiple BEAM nodes. However, BEAM can schedule your work across all the cores you give it.
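For concreteness, here's roughly what the "many Node processes behind one port" pattern looks like with the built-in cluster module (the port number and response body are placeholders of mine, not anything from the discussion):

```js
// A minimal sketch: fork one worker per core; all workers share port 3000.
const cluster = require('cluster');
const http = require('http');
const numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork(); // spawn one worker process per core
  }
} else {
  // Each worker calls listen() on the same port; the master accepts
  // connections and distributes them among the workers.
  http.createServer((req, res) => {
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(3000);
}
```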
But let's talk about work for a second, and about a very special thing that the BEAM VM gives you, one that other runtimes (whether Node's, the JVM's, Go's, etc.), aside from the actual operating system of your computer, do not: preemptive scheduling.
Suppose you have a single-core computer running Linux, and you have a process that's sitting there busy-looping. Let's say we just make a simple script that does nothing, infinitely, in a loop. Does your computer grind to a halt? Most likely, no. You can still probably use your terminal, move your mouse, operate your web browser, etc. You can thank preemptive scheduling for that: the OS suspends the process to allow other processes to do work, hopefully in a fair manner (on Linux, CFS, the aptly named Completely Fair Scheduler, does this).
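That script could be as trivial as this hypothetical one-liner (mine, for the sake of argument):

```js
// busy.js — pegs one core forever. Run it with `node busy.js` and your
// machine stays responsive: the OS preempts this process so others can run.
while (true) {}
```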
Now let's say you have a single Node process serving requests, and a specific kind of request requires 250ms of CPU time to compute and does not explicitly yield back to the event loop. (You can imagine doing some processing of input data, deserialization, serialization, aggregation, etc.) During this computation, nothing else within the Node process can progress. This means that requests which may not take much time to compute now have to wait 250ms to be processed. Generally, Node deployments don't handle a single request/response at a time, but rather many concurrent requests/responses at any given moment, using promises/callbacks to let the event loop progress while waiting on IO from something else. During the periods of expensive computation in a given request handler, the entire event loop is stalled, and the response time percentiles of your requests spike. A pathological case would be something like `setTimeout(() => { while (1) {} }, 1000)` locking up your entire Node process after a whole second, as the loop never yields back to the event loop.
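You can watch this happen in a few lines. In this sketch (the timings are illustrative), a timer due after 100ms can't fire until a 250ms burst of synchronous work finishes:

```js
// Demonstrating an event-loop stall: the timer is due at ~100ms, but the
// busy loop below holds the event loop for ~250ms, so it fires late.
const start = Date.now();

setTimeout(() => {
  console.log(`timer fired after ${Date.now() - start}ms (wanted ~100ms)`);
}, 100);

// Simulate ~250ms of CPU-bound work that never yields.
const blockUntil = Date.now() + 250;
while (Date.now() < blockUntil) { /* spin */ }
```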
In BEAM, this problem does not exist. Processes are scheduled and preempted to allow fair utilization of the underlying computational resources (very much like how your OS does it). This means that a computationally intensive process does not stall all the other processes, so your response times and percentiles remain low for all other work within the system.
Now of course, you could hand-craft your JavaScript code to explicitly yield to the scheduler every so often (as in the sketch below), but that's a lot of work that you as a programmer are now doing that your runtime could be doing for you, and if you forget to do it, it could be catastrophic to the performance of your soft real-time system.
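For illustration, the usual hand-rolled workaround looks something like this (the function and parameter names are mine, not from the thread): chop the work into chunks and schedule the next chunk with setImmediate so pending IO callbacks get a turn in between.

```js
// Process a large array without monopolizing the event loop: do a chunk,
// then yield via setImmediate() so queued IO callbacks and timers can run.
function processInChunks(items, transform, chunkSize = 1000) {
  return new Promise((resolve) => {
    const results = [];
    let i = 0;
    (function step() {
      const end = Math.min(i + chunkSize, items.length);
      for (; i < end; i++) {
        results.push(transform(items[i]));
      }
      if (i < items.length) {
        setImmediate(step); // yield back to the event loop between chunks
      } else {
        resolve(results);
      }
    })();
  });
}
```

This is exactly the bookkeeping BEAM's scheduler does for you automatically, on every process, by counting reductions.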
This is only one of the many benefits that OTP/BEAM provides over other runtimes, but it was compelling enough for Discord as a company to bet on it. For a given service, we run entirely homogeneous infrastructure. We do not need to allocate or dedicate special resources to our largest servers (100k CCU / 350k members); instead we can run and schedule them alongside the millions of other small servers that exist on Discord, all without negatively impacting the performance, percentiles, or soft real-time guarantees of your chat with a few of your friends.
The Node.js cluster module can be used to do load balancing between multiple processes on the same machine. It supports load balancing either at the OS level or at the application level; the application-level 'round robin' approach is the default and leads to a more even distribution between processes in terms of CPU usage (see the sketch below). The application-level approach may be limited in terms of scalability at some point, but I've done tests with 32 processes on a 32-core machine and couldn't see the slightest sign of struggle from the process which hands off connections to the workers. Hopefully the OS-level scheduling in Linux will eventually improve enough to outperform the application-level load balancing, but for now it hasn't.
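For reference, switching between the two modes is a one-liner; this is the knob the cluster module exposes:

```js
// Must be set before the first cluster.fork() call.
// SCHED_RR:   the master accepts connections and deals them out round-robin
//             to workers (the default everywhere except Windows).
// SCHED_NONE: workers accept connections themselves and the OS decides
//             which one wins.
const cluster = require('cluster');
cluster.schedulingPolicy = cluster.SCHED_RR;
// Equivalently, set the NODE_CLUSTER_SCHED_POLICY env var to 'rr' or 'none'.
```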
In any case, you don't necessarily need load balancing at the host level; you can load balance at the cluster level only. Your load balancers (e.g. nginx or haproxy) have their own hosts/machines in your cluster and load balance between processes directly. Some of those processes may be running on the same machine, but the load balancer does not distinguish between them. A random load balancing approach yields the most even distribution in my experience. You do need each process to be able to support maybe 1k concurrent users to get sample sizes large enough on each process for random distribution to even out, but this is easily achieved with most Node.js WebSocket libraries. They can easily support 10k concurrent connections per process with very high message throughput; if the connections are mostly idle, each process can handle 100k connections or more.
Load balancing a bunch of connections between a bunch of processes across a cluster of nodes is pretty well understood. That's not what I'm trying to say here. My post is more about the power of preemptive scheduling built into the runtime, and what it means for your application.
We use BEAM/OTP for way more than just holding open websocket connections. Our entire websocket layer is a few hundred lines of Elixir code; it honestly hasn't been touched in over a year, and has remained pretty much the same as we scaled from 200k CCU to well over 5M CCU. Holding open websockets and load balancing them is pretty much a solved problem for us.