A lot of assumptions here that probably aren't worth making without more info. For example, it could well be that there was a "real" file that worked and the bug was in the "upload verified artifact to the CDN" code or something, in which case the artifact passes a lot of checks before the failure.
We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.
I haven't seen the file, but surely each build artifact should be signed and verified when it's loaded by the client. The failure mode of bit rot / malice in the CDN should be handled.
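Roughly the kind of check I mean, as a user-mode sketch (the detached Ed25519 signature, the file layout, and the Python 'cryptography' package are assumptions for illustration; the real check would have to live in the driver/loader itself):

    # Sketch: refuse to load an artifact unless its detached signature
    # verifies against a pinned vendor public key. Illustrative only.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    def load_artifact(path: str, sig_path: str, pubkey_bytes: bytes) -> bytes:
        data = open(path, "rb").read()
        sig = open(sig_path, "rb").read()
        try:
            Ed25519PublicKey.from_public_bytes(pubkey_bytes).verify(sig, data)
        except InvalidSignature:
            raise RuntimeError(f"refusing to load {path}: signature check failed")
        return data  # only reached if the artifact verified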
The actual bug is not that they pushed out a data file with all nulls. It’s that their kernel module crashes when it reads this file.
I’m not surprised that there is no test pipeline for new data files. Those aren’t even really “build artifacts.” The software assumes they’re just data.
But I am surprised that the kernel module was deployed with a bug that crashed on a data file with all nulls.
(In fact, it’s so surprising that I wonder if there is a known failing test in the codebase that somebody marked “skip” and then someone else decided to prove a point…)
Btw: is that bug in the kernel module even fixed? Or did they just delete the data file filled with nulls?
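The test I'd expect to exist is something like the sketch below; parse_channel_file() is a stand-in stub, since the actual channel-file format and parser aren't public:

    # Sketch of a regression test: an all-null channel file must be rejected
    # cleanly, never crash the parser. Everything here is hypothetical.
    import pytest

    def parse_channel_file(data: bytes) -> bytes:
        # Stand-in for the real parser: validate before touching anything.
        if len(data) < 8 or data[:4] == b"\x00\x00\x00\x00":
            raise ValueError("invalid channel file")
        return data[8:]

    def test_all_null_channel_file_is_rejected(tmp_path):
        bad = tmp_path / "C-00000291-nulls.sys"
        bad.write_bytes(b"\x00" * 4096)  # nothing but null bytes
        with pytest.raises(ValueError):
            parse_channel_file(bad.read_bytes())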
1. Start Windows in Safe Mode or the Windows Recovery Environment (Windows 11 option).
2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
3. Locate the file matching C-00000291*.sys and delete it (a scripted sketch follows this list).
4. Restart your device normally.
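If you're scripting the workaround across many machines, a rough sketch of steps 2-3 (assuming you can get a Python-capable shell on the box; in practice del in cmd.exe or Remove-Item in PowerShell does the same thing):

    # Sketch: remove any channel file matching the problematic pattern.
    from pathlib import Path

    drivers = Path(r"C:\Windows\System32\drivers\CrowdStrike")
    for f in drivers.glob("C-00000291*.sys"):
        print(f"deleting {f}")
        f.unlink()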
This is a public company after all. In this market, you don’t become a “Top-Tier Cybersecurity Company At A Premium Valuation” with amazing engineering practices.
Priority is sales, increasing ARR, and shareholders.
Not caring about the actual product will eventually kill a company. All companies have to constantly work to maintain and grow their customer base. Customers will eventually figure out if a company is selling snake oil, or a shoddy product.
Also, the tech industry is extremely competitive. Leaders frequently become laggards or go out of business. Here are some companies that failed or shrank because their products could not compete: IBM, Digital Equipment, Sun, Borland, Yahoo, Control Data, Lotus (later IBM), Evernote, etc. Note that all of these companies were at some point at the top of their industry. They aren't anymore.
Keyword is eventually. By then the C-level will have retired, and others in top management will have changed jobs multiple times.
IMO the point is not where these past top companies are now but where the top people from those companies are now. I believe they end up in a very comfortable situation no matter what.
Exceptions, of course, would be criminal prosecution, financial fraud, etc.
Bingo! It's the Principal-Agent Problem. People focus too much on "why do companies do X and Y when it's bad in the long term?" The long term doesn't exist. No decision maker at these public companies gives a rat's ass about "the long term", because their goal is to parasitize the company and fly off to another host before the damage they did becomes apparent. And they are very good at it: it's literally all they do. It's their entire profession.
People stop caring when they see their friends getting laid off while the CEO and head of HR get big bonuses. That's what happens at most big companies with subpar executives these days.
> Not caring about the actual product will eventually kill a company.
Eventually is a long time.
Unfortunately for all of us ("us" being not just software engineers, but everyone impacted by this and similar lack of proper engineering outcomes) it is a proven path to wealth and success to ignore engineering a good product. Build something adequate on the surface and sell it like crazy.
Yeah, eventually enough disasters might kill the company. Countless billions of dollars will have been made and everyone responsible just moves on to the next one. Rinse & repeat.
Boeing has been losing market share to Airbus for decades. That is what happens when you cannot fix your problems, sell a safe product, keep costs in line, etc.
I wonder how far from the edge a company driven by business people can go before they put the focus back on good engineering. Probably much too late, in general. Business bonuses are yearly, and good/bad engineering practices take years to really make a difference.
This isn’t hindsight. It’s “don’t blow up 101” level stuff they messed up.
It’s not that this got past their basic checks; they don’t appear to have had any.
So let’s ask a different question:
The file parser in their kernel extension clearly never expected to run into an invalid file, and had no protections to prevent it from doing the wrong thing in the kernel.
How much do you want to bet that module could be trivially used for a kernel exploit early in boot if you managed to feed it your own “update” file?
I bet there’s a good pile of 0-days waiting to be found.
And this is security software.
This is “we didn’t know we were buying rat poison to put in the bagels” level dumb.
The ClownStrike Falcon software that runs on both Linux and macOS was incredibly flaky and a constant source of kernel problems at my previous work place. We had to push back on it regardless of the security team's (strongly stated) wishes, just to keep some of the more critical servers functional.
Pretty sure "competence" wasn't part of the job description of the ClownStrike developers, at least for those pieces. :( :( :(
ClownStrike left kernel panics unfixed for a year until macOS deprecated kernel extensions altogether. It was scary because crash logs indicated that memory was corrupted while processing network packets. It might've been exploitable.
Haven't used Windows for close to 15 years, but I read the file is (or rather is supposed to be) an NT kernel driver.
Are those drivers signed? Who can sign them? Only Microsoft?
If it's true the file contained nothing but zeros, that also seems to be a kernel vulnerability. Even if signing were not mandatory, shouldn't the kernel check for some structure, symbol tables, or the like before proceeding?
The kernel driver was signed. The garbage-filled data file it loaded as input seemingly had no verification on it at all, and it crashed the driver and therefore the kernel.
Hmm, the driver must be signed (by Microsoft I assume). So they sign a driver which in turn loads unsigned files. That does not seem to be good security.
NT kernel drivers are Portable Executables, and the kernel does such checks, displaying a BSOD with stop code 0xC0000221 (STATUS_IMAGE_CHECKSUM_MISMATCH) if something goes wrong.
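For the curious, the structural check is on the order of this sketch (offsets follow the documented PE layout; this is an illustration, not the actual NT loader, and an all-zero file fails at the first test):

    # Sketch: minimal structural sanity checks for a PE image.
    import struct

    def looks_like_pe(data: bytes) -> bool:
        if len(data) < 0x40 or data[:2] != b"MZ":            # DOS header magic
            return False
        (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)   # offset of the PE header
        if e_lfanew + 4 > len(data) or data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
            return False
        return True  # real loaders go on to check headers, sections, checksum, signature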
Choice #3: structure the update code so that verifying the integrity of the update (in kernel mode!) is upstream of installing the update / removing the previous definitions package, such that a failed update (for whatever reason) leaves the definitions in their existing, pre-update state.
(This is exactly how CPU microcode updates work — the CPU “takes receipt” of the new microcode package, and integrity-verifies it internally, before starting to do anything involving updating.)
When you can't verify an update, rolling back atomically to the previous state is generally considered the safest option. Best to run only what you can verify arrived as a complete package from whoever you trust.
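A minimal user-mode sketch of that "verify upstream of install" shape, with a SHA-256 pin standing in for a real signature and hypothetical paths:

    # Sketch: stage the new definitions, verify them, then swap atomically,
    # so a bad download can never replace a known-good file.
    import hashlib, os, tempfile

    def install_definitions(new_bytes: bytes, expected_sha256: str, live_path: str) -> None:
        if hashlib.sha256(new_bytes).hexdigest() != expected_sha256:
            raise ValueError("update failed verification; keeping previous definitions")
        fd, staged = tempfile.mkstemp(dir=os.path.dirname(live_path))
        with os.fdopen(fd, "wb") as f:
            f.write(new_bytes)
            f.flush()
            os.fsync(f.fileno())
        os.replace(staged, live_path)  # atomic on the same volume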
Perhaps - but if I made a list of all of the things your company should be doing and didn't, or even things that your side project should be doing and didn't, or even things in your personal life that you should be doing and haven't, I'm sure it would be very long.
> all of the things your company should be doing and didn't
Processes need to match the potential risk.
If your company is doing some inconsequential social app or whatever, then sure, go ahead and move fast and break things if that's how you roll.
If you are a company, let's call them CrowdStrike, that has access to push root-privileged code to a significant percentage of all machines on the internet, the minimum quality bar is vastly higher.
For this type of code, I would expect a comprehensive test suite that covers everything and a fleet of QA machines representing every possible combination of supported hardware and software (yes, possibly thousands of machines). A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.
Anything short of that is highly irresponsible given the access and risk the Crowdstrike code represents.
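The staged-rollout part doesn't even need to be fancy; a toy gate like this sketch (the rings, threshold, and crash_rate() telemetry hook are all made up) already beats pushing to everyone at once:

    # Sketch: widen the rollout ring by ring, halt if crash telemetry regresses.
    import time

    RINGS = [0.001, 0.01, 0.05, 0.25, 1.0]  # fraction of the fleet per stage (made up)
    MAX_CRASH_RATE = 0.0001                 # regression threshold (made up)

    def roll_out(push_to_fraction, crash_rate, soak_seconds=3600):
        for fraction in RINGS:
            push_to_fraction(fraction)
            time.sleep(soak_seconds)        # let telemetry accumulate before widening
            if crash_rate() > MAX_CRASH_RATE:
                raise RuntimeError(f"halting rollout at {fraction:.1%}: crash rate regressed")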
> A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.
That doesn't work in the business they're in. They need to roll out definition updates quickly. Their clients won't be happy if they get compromised while CrowdStrike was still doing the dogfooding or phased rollout of the update that would've prevented it.
> That doesn't work in the business they're in. They need to roll out definition updates quickly.
Well clearly we have incontrovertible evidence now (if it was needed) that YOLO-pushing insufficiently tested updates to everyone at once does not work either.
This is being called in many places (rightfully) the largest IT outage in history. How many billions will it cost? How many people died?
A company deploying kernel-mode code that can render huge numbers of machines unusable should have done better. It's one of those "you had one job" kind of situations.
They would be a gigantic target for malware. Imagine pwning a CDN to pwn millions of client computers. The CDN being malicious would be a major threat.
Oh, they have one job for sure. Selling compliance. All else isn't their job, including actual security.
Antiviruses are security cosplay that works by using a combination of bug-riddled custom kernel drivers and unsandboxed C++ parsers running with the highest level of privileges to tamper with every bit of data they can get their hands on. They violate all common security sense. They also won't even hesitate to disable or delay rollouts of actual security mechanisms built into browsers and OSes if those get in the way.
The software industry needs to call out this scam and put them out of business sooner than later. This has been the case for at least a decade or two and it's sad that nothing has changed.
> Nope, I have seen software like Crowdstrike, S1, Huntress and Defender E5 stop active ransomware attacks.
Yes, occasionally they do. This is not an either-or situation.
While they do catch and stop attacks, it is also true that crowdstrike and its ilk are root-level backdoors into the system that bypass all protections and thus will cause problems sometimes.
That anecdote doesn't justify installing gaping security holes into the kernel with those tools. Actual security requires knowledge, good practice, and good engineering. Antiviruses can never be a substitute.
You seem security-wise, so surely you can understand that in some (many?) cases, antivirus is totally acceptable given the threat model. If you are wanting to keep the script kiddies from metasploiting your ordinary finance employees, it's certainly worth the tradeoff for some organizations, no? It's but one tool with its tradeoffs like any tool.
That's like pointing at the occasional petty theft and mugging, and using it to justify establishing an extraordinary secret police to run the entire country. It's stupid, and if you do it anyway, it's obvious you had other reasons.
Antivirus software is almost universally malware. Enterprise endpoint "protection" software like CrowdStrike is worse, it's an aggressive malware and a backdoor controlled by a third party, whose main selling points are compliance and surveillance. Installing it is a lot like outsourcing your secret police to a consulting company. No surprise, everything looks pretty early on, but two weeks in, smart consultants rotate out to bring in new customers, and bad ones rotate in to run the show.
Yeah, that's definitely a good tradeoff against script kiddies metasploiting your ordinary finance employees. Wonder if it'll look as good when loss of life caused by CrowdStrike this weekend gets tallied up.
The failure mode here was a page fault due to an invalid definition file. That (likely) means the definition file was being used as-is without any validation, and pointers were being dereferenced based on that non-validated definition file. That means this software is likely vulnerable to some kind of kernel-level RCE through its definition files, and is (clearly) 100% vulnerable to DoS attacks through invalid definition files. Who knows how long this has been the case.
This isn’t a matter of “either your system is protected all the time, even if that means it’s down, or your system will remain up but might be unprotected.” It’s “your system is vulnerable to kernel-level exploits because of your AV software’s inability to validate definition files.”
The failure mode here should absolutely not be to soft-brick the machine. You can have either of your choices configurable by the sysadmin; definition file fails to validate? No problem, the endpoint has its network access blocked until the problem can be resolved. Or, it can revert to a known-good definition, if that’s within the organization’s risk tolerance.
But that would require competent engineering, which clearly was not going on here.
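To make the fail-safe concrete, a sketch of the configurable behavior described above (all names hypothetical):

    # Sketch: a definition file that fails validation leads to isolation or
    # rollback, never to loading garbage in the kernel.
    from enum import Enum

    class OnBadDefinitions(Enum):
        ISOLATE = "block network access until an admin intervenes"
        REVERT = "keep the last known-good definitions"

    def handle_update(new_defs: bytes, is_valid, policy: OnBadDefinitions, state: dict) -> None:
        if is_valid(new_defs):
            state["previous"], state["active"] = state["active"], new_defs
        elif policy is OnBadDefinitions.REVERT:
            pass  # state["active"] stays at the last known-good definitions
        else:
            state["isolated"] = True  # fail closed: cut network access, keep the host running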
> it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support
I know nothing about CrowdStrike, but I can guarantee that "they need to test target images that they claim to support" isn't what went wrong here. The implication that they don't test against Windows at all is so implausible that it's hard to take the poster of that comment seriously.
Thank you for pointing this out. Whenever I read articles about security, or reliability failures, it seems like the majority of the commenters assume that the person or organization which made the mistake is a bunch of bozos.
The fact is mistakes happen (even huge ones), and the best thing to do is learn from the mistakes. The other thing people seem to forget is they are probably doing a lot of the same things which got CrowdStrike into trouble.
If I had to guess, one problem may be that CrowdStrike's Windows code did not validate the data it received from the update process. Unfortunately, this is very common. The lesson is to validate any data received from the network, from an update process, received as user input, etc. If the data is not valid, reject it.
Note I bet at least 50% of the software engineers commenting in this thread do not regularly validate untrusted data.
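A tiny example of the habit, for a made-up length-prefixed record format: validate explicitly, reject early, and never trust an embedded length.

    # Sketch: bounds-checked parsing of untrusted binary input.
    import struct

    def parse_record(buf: bytes) -> bytes:
        if len(buf) < 8:
            raise ValueError("truncated header")
        magic, length = struct.unpack_from("<4sI", buf, 0)
        if magic != b"REC1":
            raise ValueError("bad magic")
        if length > len(buf) - 8:
            raise ValueError("declared length exceeds buffer")
        return buf[8:8 + length]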
Not validating an update signature is a huge security compliance issue. When you get certified, and I assume CrowdStrike had many certifications, you provide proof of your compliance across many scenarios. Proving your updates are signed and verified is absolutely one of those.