Saving months of compute time with a single Grafana query

Ekrekr · on July 12, 2024

I really enjoyed this read!

One thing that wasn't clear to me, is that if running NPM to install dependencies on pod startup is slow, why not pre build an image with dependencies already installed, and deploy that instead?

lmz · on July 12, 2024

Surely they weren't running npm at start. It's just that nodejs allows multiple versions of the same module to coexist and all the different version clients have different version dependencies which could be collapsed to one common version.

mavidser · on July 12, 2024

> if running NPM to install dependencies on pod startup is slow

Loading the AWS SDK via `require` was slow, not installing. As sibling comment says - collapsing different SDKs into one helped reduce loading times of the many SDKs.

mrits · on July 12, 2024

Without proper telemetry and performance metrics you will get to do this in a few more months again

throwthrow5643 · on July 12, 2024

The 'one weird trick' could've been spotted in a graphical bundle analyser. But are they not caching npm packages somewhere, seems like an awful waste downloading from the npm registry over and over? I would think it would be parsing four different versions of the AWS sdk that was so slow.

candiddevmike · on July 12, 2024

> seems like an awful waste downloading from the npm registry over and over

Pondering this question across every organization in the world and the countless opportunities for caching leads to dark places. Would be interesting to see CDN usage for Linux distributions pre and post docker builds becoming popular.

roboben · on July 12, 2024

Sadly Grafana (cloud) comes at a cost too. Anyone struggles with this horrible active metrics based pricing too? Not only Grafana Cloud but others do it like that too.

We moved shitloads to self hosted Thanos. While this comes with its own drawbacks obv, I think it was worth it.

skrtskrt · on July 12, 2024

you can self host all the Grafana solutions too if you liked them but didn't like the pricing

zug_zug · on July 12, 2024

I'm really surprised that 300ms at startup would result in 25% fewer pods.... What % reduction in the total startup time is that?

Is it possible the prior measurement happened during a high traffic period and the post measurement happened in a low traffic period?

serverlessmom · on July 12, 2024

It’s a 50% reduction in startup time, and each “run” for a pod is fairly quick.

sebstefan · on July 12, 2024

I really don't understand spinning up a whole pod just for a request

Wouldn't it be cheaper to just keep a pod up with a service running?

If scaleability is an issue just plop a load balancer in front of it and scale them up with load but surely you can't need a whole pod for every single one of those millions of requests right?

> Checkly is a synthetic monitoring tool that lets teams monitor their API’s and sites continually, and find problems faster.

>With some users sending *millions of request a day*, that 300ms added up to massive overall compute savings

No shit, right?

crummy · on July 12, 2024

The article said they had to do a bunch of cleanup between requests when it was handled by one service. Which surprised me but these requests must be doing more than just HTTP requests I guess.

c0brac0bra · on July 12, 2024

Yea they do E2E checks with playwright as well, among other things. A bunch of stuff could get cached from those checks I suppose, especially if it's user-written code.

BobbyTables2 · on July 12, 2024

I do not understand how cloud proponents talk about the he costs of self hosting but then get into situations like this.

Spending serious engineering time to wrangle with the complexities of cloud orchestration is not something that should be taken lightly.

Cloud services should be required to have a black-box Surgeon’s General warning.

hibikir · on July 12, 2024

The best advantage of cloud was never price: It was not having to argue with your data center organization, which often lead to taking months to provision anything, even a very boring VM. If those companies were good at managing data centers, and could hire people actually interested in helping the company run, they'd have had little need for the cloud in predictable compute loads.

Until you get quite big, all necessary interactions with the cloud provider are just bills. It's just much easier, even though it is often expensive

silverquiet · on July 12, 2024

Cloud was absolutely sold as saving money once upon a time, but like any other industry, they wanted to push the margins higher by getting away from providing commodity services (cloud primitives in their case), and thus the marketing started saying it was never about saving money or how you're not supposed to use the primitives; you need to use the higher-level, managed services (that also happen to have tons of vendor lock-in).

I've spent the last decade or so wondering if the emperor was wearing clothes and not really getting what everyone else has been talking about. Which isn't to say that cloud is useless, but it's not the universal panacea that it was often sold as, and it seems that others are waking up to that.

jermaustin1 · on July 12, 2024

The other advantage was for moving off-prem. Cap-Ex vs Op-Ex. That was the reason one of my past employers switched.

Another past employer switched because we hit our scale up limit, and needed to start scaling out. A small refactor allowed us to scale out, and we moved to azure's managed database, queue, and blob storage. The web frontend could scale based on connections, and the queue and blob storage was slower than our current approach, but it was better once we added the autoscaling. Since the slower speed was PER connection. Minimum scale was 5 instances, so that there was no bottleneck when scaling.

There are many reasons to go "cloud" but for most small businesses (or at least small departments of large organizations), cloud-first doesn't seem like a great option unless you have 10s of thousands in credits each month. Just build your software, scale up first on-prem or at a datacenter - it is LOADs cheaper and predictable.

BobbyTables2 · on July 13, 2024

The last part really resonates with me. Ironically, the small businesses always seem to start first in the cloud.

Always felt I was too stupid to see what everyone is so excited about.

jermaustin1 · on July 14, 2024

Based on the lack of upvotes on the comment, I guess I'm also too stupid to understand why "cloud" is better than "not cloud".

vb-8448 · on July 12, 2024

> which often lead to taking months to provision anything, even a very boring VM

It's still true? In my experience it used to be, nowadays most of the organizations have an internal serf-provisioning portal.

dartos · on July 12, 2024

Who do you think builds and maintains those portals?

Internal data center organizations that need to be argued with to acquire quota. They usually don’t care for internal or dev stuff, but as soon as something needs prod levels of quota or new licenses, prepare to wait.

michpoch · on July 12, 2024

> organizations have an internal serf-provisioning portal

Which will create things in cloud…

candiddevmike · on July 12, 2024

> Spending serious engineering time to wrangle with the complexities of cloud orchestration is not something that should be taken lightly.

Bare metal and datacenter orchestration is leaps and bounds more complex. You're paying for the abstraction.

latch · on July 12, 2024

If your scale is crazy, or your product doesn't allow you to use battle-tested pieces, then orchestration is complex in both cases.

In most cases, managing software on bare metal is more complex in exactly one case: when engineers only know cloud abstractions.

skrtskrt · on July 12, 2024

after working at a cloud provider, I would say:

1. managing compute clusters has gotten a lot easier but managing storage clusters and running good storage products for block & object storage on them is very very very far from a solved problem and quite frankly it sucks and is not fun.

2. planning out buying, installing, upgrading, patching, and retiring hardware, server/hypervisor OSs, takes waaaaaaaaay more engineering management skill and experience than 99.9% of companies have. Plus you probably have to fight for every dollar of investment against a board / investors. Even at the cloud provider we were constantly getting kneecapped by upper management not wanting to spend money on hardware.

rbranson · on July 12, 2024

Yeah calling the EC2 API is definitely more complex than leasing datacenter space, purchasing racks of hardware, deploying a fault tolerant and secure network, capturing and managing offsite backups, dealing with hardware component failures, etc.

latch · on July 13, 2024

Serious question, are you familiar with dedicated hosting?

trollbridge · on July 12, 2024

If you’re at that scale, there are plenty of other platforms that don’t have vendor lock in and overly complex, proprietary APIs by design.

trollbridge · on July 12, 2024

It really isn’t. I use a combination of bare metal, VMs on those bare metal, and servers hosted at places like Digital Ocean.

Orchestration is dead simple and mostly automated using off the shelf, open source tools. If a server goes down, it’s a few minutes to replace it. The cloud based hosting is a fixed cost each month - no usage based surprises.

Meanwhile, for clients, spent huge amounts of time fixing broken Kubernetes setups and hit serious design constraints because of usage based pricing on their PaaS infrastructure like being unable to do complex queries from a database.

I wouldn’t think twice about the same query on our in house hosted DBs on $400 servers.

michpoch · on July 12, 2024

> If a server goes down, it’s a few minutes to replace it.

Like you drive to the server rooms and have a stack of new servers, physically replace the old one with the new one and re-set everything in a few minutes? Or is your “bare metal” an ec2 instance?

wongarsu · on July 12, 2024

If you run your own datacenter maybe. But if you pay to rent bare metal servers, orchestrating those isn't any more complex. The biggest downside is that depending on the provider you might have to wait hours instead of minutes when provisioning a new server

tamiral · on July 12, 2024

It’s not a set of up and leave it! You have to continuously monitor and improve! Yes using some cloud service will save XYZ time but doesn’t mean it’s a set it and forget it feature.

I’ll add this is a really good write up ! Love this comment :

“There is no harm in using boring/simple methods if the results are right.”

rjmunro · on July 12, 2024

$5k/month was 25% of his pods, so the total was ≈$20k/month. It's entirely possible that self hosting would cost much more than that, particularly as they wouldn't be able to save costs by scaling down.

tnolet · on July 12, 2024

Moreover, we need to have presence in 20+ regions around the world. This multiplies the hassle of self hosting / bare metal / colo.

Disclaimer: Checkly founder.

serverlessmom · on July 12, 2024

Yeah. My dad had told me more than one horror story of early tech startups buying truckloads of hardware to scale way beyond demand growth. And I remember when getting on Slashdot meant your service would inevitably go down.

Of course, given stable demand and known requirements, bare metal can be a great option. But it’s not strictly better than public cloud hosting.

I think it’s just been long enough that people have forgotten the limitations of bare metal engineering.

latchkey · on July 12, 2024

> I think it’s just been long enough that people have forgotten the limitations of bare metal engineering.

Not just engineering, but deploying it too. I'm building a whole business around deploying it based at least partially on the fact that it has been forgotten. I've just been doing it long enough that I remember how to do it and it isn't getting any easier as the need for compute grows into even more complex and powerful hardware.

helsinkiandrew · on July 12, 2024

The problem wasn't between the cloud and self hosting - the problem was they had stateful code that didn't scale to thousands of requests for different clients. So they are bringing up new instances every invocation.

The same 3s runtime startup cost (and need for more hardware) would happen if they were running their own servers.

mgkimsal · on July 12, 2024

Would the actual costs have been less, the same, or more, running on their own hardware? Processes can take longer running on your own hardware, but still have a lower TCO.

helsinkiandrew · on July 12, 2024

Possibly - the advantage of cloud is you don't need to provision for your peak load, it scales with usage from a few requests a day, to millions a second for an hour a day. If your usage is lumpy or growing fast then paying 2-3 times more than the cost of your own hardware can be cheaper.

bravetraveler · on July 12, 2024

'accept vendor lock in, it'll save you the cost of engineers'

Routinely: oops, our API usage slipped and we mistakenly paid more than the staff to avoid this would cost

Keep fucking up, tech industry. My job role depends on it (SRE)

belter · on July 12, 2024

The Tech Industry is well know for shunning the Price of Education, while rejecting the Cost of Ignorance.

JackMorgan · on July 12, 2024

Maybe even shunning the reasonable Price of Education while paying the significant Costs of Ignorance.

belter · on July 12, 2024

AWS has a three day course on the technicals of efficient cost management. Believe it or not...I heard is one of the least requested classes....

wisemang · on July 12, 2024

Maybe by the time you need it you can’t afford to pay for it…

sumtechguy · on July 12, 2024

Managing these things is a skillset too. You now have X VM's and Y' containers and Z storage things. You still get to manage them. It is easier but is not a zero cost which some people seem to think it is. I have one where it is basically internal and I am at my teams all the time 'clean up your mess'. Tons of PoC's spun up and just left laying around. Things that do millions of calls (to be fixed later). That sort of thing. A cloud abstracts one set of skills. But everything above that line is still on the groups to manage.

Cloud stuff is really alluring at first. Works for awhile then the costs become above what it would cost to run it yourself. Cloud is not a 'set it and forget it' sort of thing. You have to manage it too.

bluGill · on July 12, 2024

Even in Germany you won't get an engineer for only $5k/month. You can get bad engineers in places like India for that, but good engineers cost more than that.

ghaff · on July 12, 2024

When you have salaried staff, it's very easy to ignore their cost when they do various background work because it seems to happen for "free." That's not to say that cloud services are always a good value but it's also the case that it's easy to ignore the full cost of doing everything in-house.

bravetraveler · on July 12, 2024

German companies for all their precision probably make more than one spending mistake, like we do over here.

There's asymmetry you aren't considering: one engineer can make/solve many problems. Not even all on either side.

To be clear: I'm not taking a dig at this specific case. Larger patterns. It's more nuanced than X of these or Y of those. I'm not arguing in your vacuum/hypothetical

There need-not be an increase in staff in a lot of cases. Just better, or different, staff. If I wanted to earnestly dissect/solve this I wouldn't have opened with snide jokes.

The engineers hired to stave expense tend to go on to create it

The positive spin is: 'get the engineers anyway, self-hosted or not'. They'll help you optimize $solution. My point is things are rarely this positive. Things can be optimized to the point of being non-optimal.

fs111 · on July 12, 2024

[flagged]

serverlessmom · on July 12, 2024

Slightly longer: having many versions of the same SDK required added to startup time.

dxbydt · on July 12, 2024

many of the tricks we learned in the late 90s - 2000s can no longer be pulled off. We used to download jar files over the net. Running a major prop trading platform meant 1000s of dependencies. You’d have swing and friends for front end tables, sax xml parsers, various numerical libraries, logging modules- all of this shit downloaded in the jar when the customer impatiently waited to trade some 100MM worth of fx. We learned how to cut down on dependencies. Built tools to massively compress class files. Tradeoff 1 jar with lots of little jars that downloaded on demand. Better yet, cache most of these jars so they wouldn’t need to download every single time. It became a fine art at one point - the difference between a rookie and a professional was that the latter could not just write a spiffy java frontend, but actually deploy it in prod so customers wouldn’t even know there was a startup time - it would just start like instantly. then that whole industry just vanished overnight- poof!

now i write ml code and deploy it on a docker in gcp and the same issues all over again. you import pandas gbq and pretty much the entire google bq set of libraries is part of the build. throw in a few stadard ml libs and soon you are looking at upwards of 2 seconds in Cloud Run startup time. You pay premium for autoscaling, for keeping one instance warm at all times, for your monitoring and metrics, on and on. i am yet to see startup times below 500ms. you can slice the cake any which way, you still pay the startup cost penalty. quite sad.