
Why did it take them 9 hours to notice? The problem was immediately obvious to anyone who used the web interface, as evidenced by the many threads on Reddit and HN.

> between 1 a.m. and 10 a.m. Pacific time.

Oh... so it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation? Given the size of their funding, and the number of users they have, there is really no excuse not to at least have some basic monitoring system in place for this (although it's true that, ironically, this particular class of bug is difficult to detect in a monitoring system that doesn't explicitly check for it, despite being immediately obvious to a human observer).

Perhaps they should consider opening an office in Europe, or hiring remotely, at least for security roles. Or maybe they could have GPT-4 keep an eye on the site!



Staffing an actual 24x7 rotation of SREs costs about a million dollars a year in base salary as a floor and there are few SREs for hire. A metrics-based monitor probably would have triggered on the increased error rate but it wouldn’t have been immediately obvious that there was also a leaking cache. The most plausible way to detect the problem from the user perspective would be a synthetic test running some affected workflow, built to check that the data coming back matches specific, expected strings (not just well-formed). All possible but none of this sounds easy to me. Absolutely none of this is plausible when your startup business is at the top of the news cycle every single day for the past several months.
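A minimal sketch of the synthetic test described above, assuming a hypothetical probe that seeds a known conversation title from a test account and then checks the history response for that exact string (all names here are illustrative, not OpenAI's actual API):

```python
# Hypothetical synthetic check: the test account seeds a conversation
# whose title is a known canary string. The probe then fetches the
# account's history and verifies the exact string comes back -- not
# merely that the response is well-formed.
def looks_leaked(body: str, expected: str) -> bool:
    """Return True if the fetched history does NOT contain the canary
    string seeded by the test account -- a signal that the response
    may belong to a different user (or is otherwise corrupted)."""
    return expected not in body
```

A scheduler would run the fetch-and-check loop every few minutes and page when `looks_leaked(...)` is true; the fetching side is deliberately out of scope here.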


"there are few SREs for hire"

How do you figure? If you mean there are few SREs with several years of experience, you might be right. SRE is a fairly new title, so that's not too surprising.

However, my experience with a recent job search is that most companies aren't hiring SREs right now because they consider reliability a luxury. In fact, I was in search of a new SRE position because I was laid off for that very reason.


You don't even need an SRE to have an on-call rotation; you could ping a software engineer who could at least recognize the problem and either push a temporary fix, or try to wake someone else to put a mitigation in place (e.g. disabling the history API, which is what they eventually did).

However, I think the GP's point about this class of bug being difficult to detect in a monitoring system is the more salient issue.


Well hang on! Your question was why was the time to detect so high and you specifically mentioned 24x7 SRE so I thought that’s what we were talking about ;)

And I do think the answer is that monitoring is easy, but good monitoring takes a whole lot of work. Devops teams tend to stop at sufficient observability, whereas an SRE team can dedicate its time to engineering great observability because it is not being pushed by product to deliver features. A functional org will protect SRE teams from that pressure; a great one will let the SRE team apply counter-pressure from the reliability and non-functional perspective to the product perspective. This equilibrium is ideal because it allows speed but keeps a tight leash on tech debt by developing rigor around what counts as too fast, too many errors, or whatever your relevant metrics are.


I’ve anecdotally observed the opposite. I have noticed SRE jobs remain posted, even by companies laying off or announcing some kind of hiring slowdown over the last quarter or so. More generally, businesses that have decided that they need SRE are often building out from some kind of devops baseline that has become unsustainable for the dev team. When you hit that limit and need to split out a dedicated team, there aren’t a ton of alternatives to getting an SRE or two in and shuffling some prod-oriented devs to the new SRE team (or building a full team from scratch, which is what the $$ was estimating above). Among other things, the SRE bailiwick includes capacity planning and resource efficiency; SRE will save you money in the long term.

On a personal note, I am sorry to hear that your job search has not yet been fruitful. Presumably I am interested in different criteria than you are: I have found several postings that are quite appealing, to the point where I am updating my CV and applying despite being weakly motivated at the moment.


My search was fruitful. I'm doing regular SWE work now. Market sucks though.


Every system failure prompts people to exclaim "why aren't there safeguards?" Every time. Well, guess what: if we try to do new stuff, we will run into new problems.


There is nothing new about using Redis as a cache, or returning a list for a user.


Are you trying to say cache invalidation in a distributed system is a trivial problem?


I'm not disagreeing with you, and I'm not the commenter you're replying to, but it's worth noting that cache leakage and cache invalidation are two different problems.


You're right. Thanks for pointing that out. My original point still stands, distributed systems are hard and people demanding zero failures are setting an impossible standard.


It's non-trivial, but it's also not that hard; there are well-known strategies for achieving it. If you relax guarantees and only promise eventual consistency, it becomes fairly trivial: we do this, for example, and have few problems with it.
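As a sketch of what "relaxed guarantees" can look like in practice: instead of precisely invalidating entries, each cache entry carries a short TTL, so stale reads self-correct within the TTL window. This is an illustrative example, not the commenter's actual system; names are made up.

```python
import time

# Minimal TTL cache: stale data is tolerated for at most `ttl_seconds`,
# trading strict consistency for a much simpler invalidation story.
class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: force a fresh read from source
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```

Note this addresses staleness, not the leakage bug in the article: returning the wrong user's entry is a keying/corruption problem that no TTL fixes.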


This wasn’t a cache invalidation problem. It was a cache corruption error.


I'm saying there is nothing new about it.


Probably the cheapest solution would be letting GPT monitor user feedback from various social media channels and alert human engineers to check for the underlying problem. GPT could even engage with users to request more details or reproducible cases ;)


that's abusable, as you can manipulate gpt however you like.


Since it now handles visual inputs, I wonder how hard it'd be to get GPT to monitor itself. Have it constantly observe a set of screenshares of automated processes starting and repeating ChatGPT sessions on prod, alert the on-call when it notices something "weird."


RUM monitoring does 99% of what you want already. Anomaly detection is the hard part. IMO too early to say whether gpt will be good at that specific task but I agree that a LLM will be in the loop on critical production alerts in 2023 in some fashion.


They raised a billion dollars.


How much have they spent?


You don't necessarily need a full team of SREs; you can also have a lightly staffed ops center with escalation paths.


I don’t think that model has the properties you think it does. Someone still has to take call to back the operators. Someone has to build the signals that the ops folks watch. Someone has to write criteria for what should and should not be escalated, and in a larger org they will also need to know which escalation path is correct. And on and on — the work has to get done somewhere!


The way those criteria usually get written in a startup with mission-critical customer-facing stuff (like this privacy issue) is that first the person watching Twitter and email and whatever else pages the engineers, and then there's a retro on whether or not that particular one was necessary, lather, rinse, repeat.

All you need on day 1 is someone to watch the (metaphorical) phones + a way to page an engineer. Don't start by spending a million bucks a year, start by having a first aid kit at the ready.

Perhaps they could also help this person out by looking into some sort of fancy software to automatically summarize messages that were being sent to them, or their mentions on Reddit, or something, even?


Yup, twitter monitoring is a thing that I have seen implemented. We did not allow it to page us, however. As you say, some of the barriers around that are low or gone as of late. I wonder if someone has already secured seed funding for social media monitoring as a service. The feature set you can build on a LLM is orders of magnitude better than what was practical before.

Looking at my post up-thread, I wish I had emphasized the time aspect more - of course all of these problems are solvable, but it takes both time and money. They have the money now, but two months ago the ingredients of this incident were already in place; the scale was just so small that it never actually leaked data. Or maybe a handful of early adopters saw some weird shit, but we’re all well-trained to just hit refresh these days. Hiring even one operator and getting them spun up takes calendar time that simply has not existed yet. I assume someone over there is panicking about this and trying to get someone hired to make sure they look better prepared next time, because there will be a next time, and if they’re even half as successful as the early hype leads me to believe, I expect they are going to have a lot more incidents as they scale. One in a million is eight and a half times per day at 100 rps.
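The closing arithmetic checks out as a quick back-of-the-envelope:

```python
# A one-in-a-million failure rate at a sustained 100 requests/second.
requests_per_day = 100 * 60 * 60 * 24   # 8,640,000 requests/day
failures_per_day = requests_per_day / 1_000_000
print(failures_per_day)  # 8.64 -- "eight and a half times per day"
```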


> early adopters saw some weird shit

Since I wrote this, I have seen several anecdotes that support this guess. This is a classic scaling problem. One or two users saw it, and one even says they reported it, but at small scale with immature tools and processes getting to the actual software bug is a major effort that has to be balanced around other priorities like making excessive amounts of money.


> […] it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation?

OpenAI is hiring Site Reliability Engineers (SRE) in case you, or anyone you know, is interested in working for them: https://openai.com/careers/it-engineer-sre . Unfortunately, the job is an onsite role that requires 5 days a week in their San Francisco office, so they do not appear to be planning to have a 24/7 on-call rotation any time soon.

Too bad because I could support them in APAC (from Japan).

Over 10 years of industry experience, if anyone is interested.


I had forgotten that I looked at this and came to the same conclusion as you. I’d happily discuss a remote SRE position but on-site is a non-starter for me, and most of SRE, if I am reading the room correctly.

Edit to add: they’re also paying in line with or below industry rates, and the role description reads like a technical project manager, not an SRE. I imagine people are banging down the door because of the brand, but personally that’s a lot of red flags before I even submit an application.


that is quite low for FAANG-level SRE/SWE.


Also, I heard their interviews (for any technical position) are very tough.


nobody qualified wants the 24/7 SRE job unless it pays an enormous amount of money. i wouldn't do it for less than 500 grand cash. getting woken up at 3am constantly or working 3rd shift is the kind of thing you do with a specific monetary goal in mind (i.e., early retirement) or else it's absolute hell.

combine that with ludicrous requirements (the same as a senior software engineer) and you get gaps in coverage. ask yourself what senior software engineer on earth would tolerate getting called CONSTANTLY at 3am, or working 3rd shift.

the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.


Timezones are a thing - your 3am is someone's 9am and may be a significant part of your customer base.

Being paged constantly is a sign of bad alerts or bad systems IMO - either adjust the alert to accept the current reality or improve the system


spinning up a subsidiary in another country (especially one with very strict labor laws, like in european countries) is not as easy as "find some guy on the internet and pay him to watch your dashboard" and then giving him root so he can actually fix stuff without calling your domestic team, which would otherwise defeat the whole purpose.

also, even getting paged ONCE a month at 3am will fuck up an entire week at a time if you have a family. if it happens twice a month, that person is going to quit unless they're young and need the experience.


It's really not that difficult, and there are providers like Deel who can manage it all for you, to the point you just ACH them every month.

Source: co-founder of a remote startup with employees in five countries


like you said, timezones are a thing. now you're managing a global team.


That sounds harder than it is, especially if you already allow remote work. It mostly just forces you to have better docs.


Sorry to be clear I was replying to this part of your comment

> the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.

I agree that the stakes are lower in terms of harm, but was trying to express that while it might not be life and death, it may be hindering someone's ability to do their job or use your product - e.g. it still impacts customer experience and your (business) reputation.

False pages for transient errors are bad - ideally you only get paged if human intervention is required, and this should form a feedback cycle to determine how to avoid it in future. If all the pages are genuine problems requiring human action then this should feed into tickets to improve things


Not only that, but you probably need follow-the-sun coverage if you want a <30 minute response time.

Given a system that collects minute-based metrics, it generally takes around 5-10 minutes to generate an alert. Another 5-10 minutes for the person to get to their computer unless it's already in their hand (what if you get unlucky and on-call was taking a shower or using the toilet?). After that, another 5-10 minutes to see what's going on with the system.

After all that, it usually takes some more minutes to actually fix the problem.

Dropbox has a nice article on all the changes they made to streamline incident response https://dropbox.tech/infrastructure/lessons-learned-in-incid...


I've worked two SRE (or SRE-adjacent) jobs with oncall duty (some unicorn and a FAANG). Neither has been remotely as bad as what you're describing. (Only one was actually 24/7 for a week-long shift.)

The whole point is that before you join, the team has done sufficient work to not make it hell, and your work during business hours makes sure it stays that way. Are there a couple bad weeks throughout the year? Sure, but it's far, far from the norm.


Constantly? It's one wakeup in 4 months.


I did that for a few years, and wasn't on 500k a year, but I'm also the company co-founder, so you could argue that a "specific monetary goal" was applicable.


You don't need 24/7 SREs, you could do it with 24/7 first-line customer support staff monitoring Twitter, Reddit, and official lines of comms that have the ability to page the regular engineering team.

That's a lot easier to hire, and lower cost. More training required of what is worth waking people up over; way less in terms of how to fix database/cache bugs.


Support engineers and an official bug reporting channel would help. I noticed and reported the issue on their official forums on 16 March, but got no response.

https://community.openai.com/t/bug-incorrect-chatgpt-chat-se...

I only reported it on the forums because there didn't seem to be an official bug reporting channel, just a heavyweight security reporting process.

As well as the actions they took to fix this specific bug, another useful action would be to have a documented and monitored bug reporting channel.


Probably because they launched ChatGPT as an experiment and didn't think it would blow up, needing full time SRE etc. I don't think it was designed for scale and reliability when they launched.


Do events like this cause them to lose enough revenue that it would make sense to hire a bunch of SREs?


Probably the real reason. I assume they intend to make money off enterprise contracts, which would include SLAs. Then they'd set their support level based on that.


Given the Microsoft partnership, they might not even need to manage any real infrastructure. Just hand it off to Azure and let them handle the details.


Just add metrics for the number of times "ChatGPT" and "OpenAI" appeared in tweets, reddit posts, and HN comments in the last (rolling) five minutes, put them on a dashboard alongside all your other monitoring, and have a threshold where they page the oncall to review what's being said. It doesn't even have to be an SRE in this case; it could be just about anyone.
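A rough sketch of that rolling five-minute mention counter (the class name, window, and threshold are all made up for illustration; the side that ingests tweets/posts and the paging integration are out of scope):

```python
from collections import deque

# Illustrative rolling-window counter: feed it a timestamp per social
# media mention of "ChatGPT"/"OpenAI", and check whether the count in
# the last `window_seconds` has crossed the paging threshold.
class MentionMonitor:
    def __init__(self, window_seconds=300, threshold=500):
        self.window = window_seconds
        self.threshold = threshold
        self._events = deque()  # monotonically increasing timestamps

    def record(self, ts):
        self._events.append(ts)

    def should_page(self, now):
        # Drop mentions that have aged out of the rolling window.
        while self._events and self._events[0] < now - self.window:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

A fixed threshold is the simplest possible version; in practice you would want it relative to a baseline, since mention volume spikes for good news too.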



