
Why did it take them 9 hours to notice? The problem was immediately obvious to anyone who used the web interface, as evidenced by the many threads on Reddit and HN.

> between 1 a.m. and 10 a.m. Pacific time.

Oh... so it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation? Given the size of their funding, and the number of users they have, there is really no excuse not to at least have some basic monitoring system in place for this (although it's true that, ironically, this particular class of bug is difficult to detect in a monitoring system that doesn't explicitly check for it, despite being immediately obvious to a human observer).

Perhaps they should consider opening an office in Europe, or hiring remotely, at least for security roles. Or maybe they could have GPT-4 keep an eye on the site!



Staffing an actual 24x7 rotation of SREs costs about a million dollars a year in base salary as a floor and there are few SREs for hire. A metrics-based monitor probably would have triggered on the increased error rate but it wouldn’t have been immediately obvious that there was also a leaking cache. The most plausible way to detect the problem from the user perspective would be a synthetic test running some affected workflow, built to check that the data coming back matches specific, expected strings (not just well-formed). All possible but none of this sounds easy to me. Absolutely none of this is plausible when your startup business is at the top of the news cycle every single day for the past several months.
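A minimal sketch of the synthetic test described above, assuming a hypothetical probe that seeds a known conversation title from a test account and then checks the history response for that exact string (all names here are illustrative, not OpenAI's actual API):

```python
# Hypothetical synthetic check: the test account seeds a conversation
# whose title is a known canary string. The probe then fetches the
# account's history and verifies the exact string comes back -- not
# merely that the response is well-formed.
def looks_leaked(body: str, expected: str) -> bool:
    """Return True if the fetched history does NOT contain the canary
    string seeded by the test account -- a signal that the response
    may belong to a different user (or is otherwise corrupted)."""
    return expected not in body
```

A scheduler would run the fetch-and-check loop every few minutes and page when `looks_leaked(...)` is true; the fetching side is deliberately out of scope here.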


"there are few SREs for hire"

How do you figure? If you mean there are few SREs with several years of experience, you might be right. SRE is a fairly new title, so that's not too surprising.

However, my experience with a recent job search is that most companies aren't hiring SREs right now because they consider reliability a luxury. In fact, I was in search of a new SRE position because I was laid off for that very reason.


You don't even need an SRE to have an on-call rotation; you could ping a software engineer who could at least recognize the problem and either push a temporary fix, or try to wake someone else to put a mitigation in place (e.g. disabling the history API, which is what they eventually did).

However, I think the GP's point about this class of bug being difficult to detect in a monitoring system is the more salient issue.


Well hang on! Your question was why was the time to detect so high and you specifically mentioned 24x7 SRE so I thought that’s what we were talking about ;)

And I do think the answer is that monitoring is easy, but good monitoring takes a whole lot of work. Devops teams tend to stop at sufficient observability, whereas an SRE team can dedicate its time to engineering great observability because it is not being pushed by product to deliver features. A functional org will protect SRE teams from that pressure; a great one will let the SRE team apply counter-pressure from the reliability and non-functional perspective to the product perspective. This equilibrium is ideal because it allows speed but keeps a tight leash on tech debt by developing rigor around what counts as too fast, too many errors, or whatever your relevant metrics are.


I’ve anecdotally observed the opposite. I have noticed SRE jobs remain posted, even by companies laying off or announcing some kind of hiring slowdown over the last quarter or so. More generally, businesses that have decided that they need SRE are often building out from some kind of devops baseline that has become unsustainable for the dev team. When you hit that limit and need to split out a dedicated team, there aren’t a ton of alternatives to getting an SRE or two in and shuffling some prod-oriented devs to the new SRE team (or building a full team from scratch, which is what the $$ was estimating above). Among other things, the SRE bailiwick includes capacity planning and resource efficiency; SRE will save you money in the long term.

On a personal note, I am sorry to hear that your job search has not yet been fruitful. Presumably I am interested in different criteria than you are: I have found several postings that are quite appealing, to the point where I am updating my CV and applying despite being weakly motivated at the moment.


My search was fruitful. I'm doing regular SWE work now. Market sucks though.


Every system failure prompts people to exclaim "why aren't there safeguards?" Every time. Well, guess what: if we try to do new stuff, we will run into new problems.


There is nothing new about using Redis as a cache, or returning a list for a user.


Are you trying to say cache invalidation in a distributed system is a trivial problem?


I'm not disagreeing with you, and I'm not the commenter you're replying to, but it's worth noting that cache leakage and cache invalidation are two different problems.


You're right. Thanks for pointing that out. My original point still stands, distributed systems are hard and people demanding zero failures are setting an impossible standard.


It's non-trivial, but it's also not that hard; there are well-known strategies for achieving it. If you relax guarantees and only promise eventual consistency, it becomes fairly trivial: we do this, for example, and have few problems with it.
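As a sketch of what "relaxed guarantees" can look like in practice: instead of precisely invalidating entries, each cache entry carries a short TTL, so stale reads self-correct within the TTL window. This is an illustrative example, not the commenter's actual system; names are made up.

```python
import time

# Minimal TTL cache: stale data is tolerated for at most `ttl_seconds`,
# trading strict consistency for a much simpler invalidation story.
class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: force a fresh read from source
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```

Note this addresses staleness, not the leakage bug in the article: returning the wrong user's entry is a keying/corruption problem that no TTL fixes.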


This wasn’t a cache invalidation problem. It was a cache corruption error.


I'm saying there is nothing new about it.


Probably the cheapest solution would be letting GPT monitor user feedback from various social media channels and alert human engineers to check for the underlying problem. GPT could even engage with users to request more details or reproducible cases ;)


that's abusable, as you can manipulate gpt however you like.


Since it now handles visual inputs, I wonder how hard it'd be to get GPT to monitor itself. Have it constantly observe a set of screenshares of automated processes starting and repeating ChatGPT sessions on prod, alert the on-call when it notices something "weird."


RUM monitoring does 99% of what you want already. Anomaly detection is the hard part. IMO too early to say whether gpt will be good at that specific task but I agree that a LLM will be in the loop on critical production alerts in 2023 in some fashion.


They raised a billion dollars.


How much have they spent?


You don't necessarily need a full team of SREs; you can also have a lightly staffed ops center with escalation paths.


I don’t think that model has the properties you think it does. Someone still has to take call to back the operators. Someone has to build the signals that the ops folks watch. Someone has to write criteria for what should and should not be escalated, and in a larger org they will also need to know which escalation path is correct. And on and on — the work has to get done somewhere!


The way those criteria usually get written in a startup with mission-critical customer-facing stuff (like this privacy issue) is that first the person watching Twitter and email and whatever else pages the engineers, and then there's a retro on whether or not that particular one was necessary, lather, rinse, repeat.

All you need on day 1 is someone to watch the (metaphorical) phones + a way to page an engineer. Don't start by spending a million bucks a year, start by having a first aid kit at the ready.

Perhaps they could also help this person out by looking into some sort of fancy software to automatically summarize messages that were being sent to them, or their mentions on Reddit, or something, even?


Yup, twitter monitoring is a thing that I have seen implemented. We did not allow it to page us, however. As you say, some of the barriers around that are low or gone as of late. I wonder if someone has already secured seed funding for social media monitoring as a service. The feature set you can build on a LLM is orders of magnitude better than what was practical before.

Looking at my post up-thread, I wish I had emphasized the time aspect more - of course all of these problems are solvable, but it takes both time and money. They have the money now, but two months ago the ingredients of this incident were already in place; the scale was just so small that it never actually leaked data. Or maybe a handful of early adopters saw some weird shit, but we’re all well-trained to just hit refresh these days. Hiring even one operator and getting them spun up takes calendar time that simply has not existed yet. I assume someone over there is panicking about this and trying to get someone hired to make sure they look better prepared next time, because there will be a next time, and if they’re even half as successful as the early hype leads me to believe, I expect they are going to have a lot more incidents as they scale. One in a million is eight and a half times per day at 100 rps.
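The closing arithmetic checks out as a quick back-of-the-envelope:

```python
# A one-in-a-million failure rate at a sustained 100 requests/second.
requests_per_day = 100 * 60 * 60 * 24   # 8,640,000 requests/day
failures_per_day = requests_per_day / 1_000_000
print(failures_per_day)  # 8.64 -- "eight and a half times per day"
```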


> early adopters saw some weird shit

Since I wrote this, I have seen several anecdotes that support this guess. This is a classic scaling problem. One or two users saw it, and one even says they reported it, but at small scale with immature tools and processes getting to the actual software bug is a major effort that has to be balanced around other priorities like making excessive amounts of money.


> […] it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation?

OpenAI is hiring Site Reliability Engineers (SRE) in case you, or anyone you know, is interested in working for them: https://openai.com/careers/it-engineer-sre . Unfortunately, the job is an onsite role that requires 5 days a week in their San Francisco office, so they do not appear to be planning to have a 24/7 on-call rotation any time soon.

Too bad because I could support them in APAC (from Japan).

Over 10 years of industry experience, if anyone is interested.


I had forgotten that I looked at this and came to the same conclusion as you. I’d happily discuss a remote SRE position but on-site is a non-starter for me, and most of SRE, if I am reading the room correctly.

Edit to add: they’re also paying in line with or below industry rates, and the role description reads like a technical project manager, not an SRE. I imagine people are banging down the door because of the brand, but personally that’s a lot of red flags before I even submit an application.


that is quite low for FAANG-level SRE/SWE.


Also, I heard their interviews (for any technical position) are very tough.


nobody qualified wants the 24/7 SRE job unless it pays an enormous amount of money. i wouldn't do it for less than 500 grand cash. getting woken up at 3am constantly or working 3rd shift is the kind of thing you do with a specific monetary goal in mind (i.e., early retirement) or else it's absolute hell.

combine that with ludicrous requirements (the same as a senior software engineer) and you get gaps in coverage. ask yourself what senior software engineer on earth would tolerate getting called CONSTANTLY at 3am, or working 3rd shift.

the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.


Timezones are a thing - your 3am is someone's 9am and may be a significant part of your customer base.

Being paged constantly is a sign of bad alerts or bad systems IMO - either adjust the alert to accept the current reality or improve the system


spinning up a subsidiary in another country (especially one with very strict labor laws, like in european countries) is not as easy as "find some guy on the internet and pay him to watch your dashboard" and then giving him root so he can actually fix stuff without calling your domestic team, which would otherwise defeat the whole purpose.

also, even getting paged ONCE a month at 3am will fuck up an entire week at a time if you have a family. if it happens twice a month, that person is going to quit unless they're young and need the experience.


It's really not that difficult, and there are providers like Deel who can manage it all for you, to the point you just ACH them every month.

Source: co-founder of a remote startup with employees in five countries


like you said, timezones are a thing. now you're managing a global team.


That sounds harder than it is, especially if you already allow remote work. It mostly just forces you to have better docs.


Sorry to be clear I was replying to this part of your comment

> the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.

I agree that the stakes are lower in terms of harm, but was trying to express that while it might not be life and death, it may be hindering someone's ability to do their job or use your product - e.g. it still impacts customer experience and your (business) reputation.

False pages for transient errors are bad - ideally you only get paged if human intervention is required, and this should form a feedback cycle to determine how to avoid it in future. If all the pages are genuine problems requiring human action then this should feed into tickets to improve things


Not only that, but you probably need follow-the-sun coverage if you want a <30 minute response time.

Given a system that collects minute-based metrics, it generally takes around 5-10 minutes to generate an alert. Another 5-10 minutes for the person to get to their computer unless it's already in their hand (what if you get unlucky and on-call was taking a shower or using the toilet?). After that, another 5-10 minutes to see what's going on with the system.

After all that, it usually takes some more minutes to actually fix the problem.

Dropbox has a nice article on all the changes they made to streamline incident response https://dropbox.tech/infrastructure/lessons-learned-in-incid...


I've worked two SRE (or SRE-adjacent) jobs with oncall duty (some unicorn and a FAANG). Neither has been remotely as bad as what you're describing. (Only one was actually 24/7 for a week-long shift.)

The whole point is that before you join, the team has done sufficient work to not make it hell, and your work during business hours makes sure it stays that way. Are there a couple bad weeks throughout the year? Sure, but it's far, far from the norm.


Constantly? It's one wakeup in 4 months.


I did that for a few years, and wasn't on 500k a year, but I'm also the company co-founder, so you could argue that a "specific monetary goal" was applicable.


You don't need 24/7 SREs, you could do it with 24/7 first-line customer support staff monitoring Twitter, Reddit, and official lines of comms that have the ability to page the regular engineering team.

That's a lot easier to hire, and lower cost. More training required of what is worth waking people up over; way less in terms of how to fix database/cache bugs.


Support engineers and an official bug reporting channel would help. I noticed and reported the issue on their official forums on 16 March, but got no response.

https://community.openai.com/t/bug-incorrect-chatgpt-chat-se...

I only reported it on the forums because there didn't seem to be an official bug reporting channel, just a heavyweight security reporting process.

As well as the actions they took to fix this specific bug, another useful action would be to have a documented and monitored bug reporting channel.


Probably because they launched ChatGPT as an experiment and didn't think it would blow up, needing full time SRE etc. I don't think it was designed for scale and reliability when they launched.


Do events like this cause them to lose enough revenue that it would make sense to hire a bunch of SREs?


Probably the real reason. I assume they intend to make money off enterprise contracts, which would include SLAs. Then they'd set their support level based on that.


Given the Microsoft partnership, they might not even need to manage any real infrastructure. Just hand it off to Azure and let them handle the details.


Just add metrics for the number of times "ChatGPT" and "OpenAI" appeared in tweets, reddit posts, and HN comments in the last (rolling) five minutes, put them on a dashboard alongside all your other monitoring, and have a threshold where they page the oncall to review what's being said. It doesn't even have to be an SRE in this case; it could be just about anyone.
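A rough sketch of that rolling five-minute mention counter (the class name, window, and threshold are all made up for illustration; the side that ingests tweets/posts and the paging integration are out of scope):

```python
from collections import deque

# Illustrative rolling-window counter: feed it a timestamp per social
# media mention of "ChatGPT"/"OpenAI", and check whether the count in
# the last `window_seconds` has crossed the paging threshold.
class MentionMonitor:
    def __init__(self, window_seconds=300, threshold=500):
        self.window = window_seconds
        self.threshold = threshold
        self._events = deque()  # monotonically increasing timestamps

    def record(self, ts):
        self._events.append(ts)

    def should_page(self, now):
        # Drop mentions that have aged out of the rolling window.
        while self._events and self._events[0] < now - self.window:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

A fixed threshold is the simplest possible version; in practice you would want it relative to a baseline, since mention volume spikes for good news too.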



