CroudStrike/Microsoft outage

It's unbelievable that software meant to protect us would cause all this damage.

Share your stories.

Quis custodiet ipsos custodes? ("Who will guard the guards themselves?") (Juvenal)

Keep your friends close and your enemies closer (Sun Tzu)

Take your pick.

I had to Google it…
https://www.google.com/search?q=CrowdStrike%2FMicrosoft+outage
It hasn’t affected me (yet), but it looks like a first class screw up. :man_facepalming:

I was at work this morning as usual when, without warning, the computers in the office crashed one after another over the course of about 10 minutes, sometimes while documents were still open. Sometimes there were blue screens (csagent.sys - Page Fault in Nonpaged Area), sometimes the screen just went black. All of the PCs then got stuck in an endless loop of failed boots, until only the recovery screen was left.

At this point I felt that something really bad was going on, so I hurried to print out at least one important protocol document (which was needed for the day's work) before my PC also gave up, less than ten seconds later. :smiley:

PS: The company is called CrowdStrike

What a time-consuming effort for companies with hundreds or thousands of computers. If they're running VMs, they can simply roll back to a prior image. If not, a lot of time is going to be spent fixing the PCs.

“A CrowdStrike engineer posted in the official CrowdStrike subreddit that the workaround steps involve booting affected Windows systems into Safe Mode or the Recovery Environment, navigating to a CrowdStrike directory, and deleting a .sys file and rebooting. If this works, it’s not something that can be done through a network push, so a lot of manual work remains to be done.”
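For anyone scripting the cleanup once a machine is back at a command prompt, here is a minimal sketch of what that deletion step amounts to. It assumes the default CrowdStrike driver directory and the widely reported C-00000291*.sys channel-file naming; in practice the deletion is done by hand in Safe Mode or the Recovery Environment, one machine at a time, which is exactly why it can't be pushed over the network.

```python
# Minimal sketch of the manual cleanup step, assuming the default install path
# and the widely reported C-00000291*.sys channel-file naming. Run from an
# elevated prompt after booting into Safe Mode / the Recovery Environment.
from pathlib import Path

CS_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")  # assumed default path

def delete_bad_channel_files(directory: Path = CS_DIR) -> list[Path]:
    """Delete the offending channel file(s) and return what was removed."""
    removed = []
    for sys_file in directory.glob("C-00000291*.sys"):
        sys_file.unlink()              # remove the file causing the boot loop
        removed.append(sys_file)
    return removed

if __name__ == "__main__":
    for path in delete_bad_channel_files():
        print(f"Deleted {path}")
    # Reboot normally afterwards; the sensor pulls down corrected content.
```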

Yep, “Croudstrike” is what someone making fun of Asian people would call Cloudstrike. :man_facepalming:

1 Thank

I got called at midnight that one of my servers was being reported down by the monitors. I logged in, couldn’t connect to the server. Told the monitor guys that this was a bigger problem. Then my laptop blue screened! About half of my company’s servers (and it’s not a small company) had BSOD. Been working all night to get them back up.

It’s now 11am, I took a short nap, then a shower, got some fresh coffee and am back at it again.

1 Thank

Ouch, sounds like you’ve had a long night. I know that feeling the next morning. I hope this post finds you finally getting some sleep and that you won’t read it until another 8 hours or so from now. :slight_smile:

Interesting, I assume it’s a company-issued laptop also running Crowdstrike?

It won’t, assuming you’re not running Crowdstrike.

3 Thanks

I wish. I’m wide awake though. I feel refreshed at the moment. (might be the caffeine talking)

Yep, company-issued laptop. I work in the finance world, so stuff has to be locked down all over the place. Also had to figure out how to get around BitLocker. That was the worst part. Every one of our machines has it enabled. Gotta figure out that key code to unlock it before you can even apply CrowdStrike's "fix". LOL!
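If anyone else is stuck matching recovery keys to machines, here is a hypothetical little helper, assuming you can export your escrowed BitLocker recovery keys (e.g. from Active Directory) to a CSV ahead of time. The file name and column names are made up for this sketch.

```python
# Hypothetical helper for the BitLocker step: look up a machine's 48-digit
# recovery password in a CSV export of escrowed keys so it can be typed into
# the recovery screen. The CSV layout (ComputerName, RecoveryPassword) is an
# assumption for illustration only.
import csv

def find_recovery_password(csv_path: str, hostname: str) -> str | None:
    """Return the recovery password for `hostname`, or None if not found."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["ComputerName"].lower() == hostname.lower():
                return row["RecoveryPassword"]
    return None

if __name__ == "__main__":
    password = find_recovery_password("bitlocker_keys.csv", "FIN-LAPTOP-042")
    print(password or "No key escrowed for that machine")
```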

1 Thank

Unless I need to fly somewhere via an airline, or I need to use an ATM, or go shopping, etc. :+1:

1 Thank

Wow that’s nuts. I’m extremely unimpressed with Crowdstrike’s communication around this crisis, they’re like “we’ve already issued a fix” as if their fix would automatically propagate to a system that got taken offline.

2 Thanks

Right? Their “fix” is only helpful if the computer was not online during the affected timeframe.

2 Thanks

Well, this is a fascinating inside perspective from you. Let us know how things develop (within the limits of what your NDAs and all that jazz allow for).

1 Thank

I keep my PCs powered down when not in use, and I was watching a Blu-ray last night, so I didn't hear about this till I got to work. Thankfully we don't use Crowdstrike, so it is mostly business as usual.

2 Thanks

Oh, it's not really that much any more. Just working to unlock BitLocker and then apply the “fix” of deleting that one file on both servers and any computer that was on the network at the time. I'm not even the one doing it anymore, but I'm still listening in on the Zoom and chats. (Hence why I now have time to reply here… LOL!)

1 Thank

Who would have guessed that implementing something at the kernel level would ever prove to be a bad idea :smirk:

And yes, I’m being sarcastic

3 Thanks

Right. I mean, third-party kernel drivers are a pretty common and unavoidable thing, but automatic online updates for those drivers don't seem like a good idea. And in the case of Crowdstrike it feels like, if that level of security is required (I understand it can also be used to monitor user activity, which is another issue), then you should probably be using an OS that offers that security natively.

I'm trying to think how this would have turned out if it were a buggy third-party module on Linux. As far as I can work out, it would require a more manual update-and-reboot process to apply the buggy driver update, so the onus would be more on the administrator(s) to do proper testing. But if that ball got dropped, it would likewise require hands-on intervention to get the system booted again, just like in this present case on Windows.

I guess that shows the merits of the new paradigms that enterprise companies like SUSE are moving toward: immutable systems with atomic updates that automatically revert to the last known good system image if the system fails to boot. But I also hate the idea of constant scheduled reboots to apply the latest system image, as it would require orchestrating a complex failover setup with individual nodes rebooting at different times to maintain service availability.
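To make that "revert to the last known good image" idea concrete, here is a toy sketch of the boot-counting logic such systems rely on. The names and the simulated loop are invented for illustration; real implementations (systemd boot counting, btrfs snapshot rollback, A/B partition schemes) differ in the details.

```python
# Toy illustration of boot counting with automatic rollback: each boot attempt
# of a new image decrements a counter, and once the budget is exhausted without
# a successful boot, the loader falls back to the last known good image.
from dataclasses import dataclass

@dataclass
class BootState:
    current_image: str      # image the loader will try next
    good_image: str         # last image that completed a boot successfully
    tries_left: int         # remaining attempts for current_image

def select_image(state: BootState) -> str:
    """Pick which image to boot and update the counters accordingly."""
    if state.current_image != state.good_image and state.tries_left <= 0:
        # New image never booted successfully: roll back automatically.
        state.current_image = state.good_image
    elif state.current_image != state.good_image:
        state.tries_left -= 1
    return state.current_image

def mark_boot_successful(state: BootState) -> None:
    """Called by the booted OS once it is healthy; promotes the new image."""
    state.good_image = state.current_image

if __name__ == "__main__":
    state = BootState(current_image="os-image-42", good_image="os-image-41", tries_left=3)
    for attempt in range(5):                 # simulate repeated failed boots
        print(f"attempt {attempt}: booting {select_image(state)}")
    # After the tries are used up, the loader is back on os-image-41.
```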

What strikes me is how this file got pushed out without some level of testing. From all accounts it immediately bricks the computer, so it's not as if “sometimes it does and sometimes it doesn't.” That raises the question of who tested the push, and whether CrowdStrike's product testing team was involved in it. I don't know if they have such a team, but as big as they are, I assume that they do.

As a programmer I used to test my programs/fixes (and of course they always worked flawlessly :rofl:) and then hand them over to the testing team, which ran them in a test environment before they were rolled out.

Hopefully heads will roll at CrowdStrike.
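For what it's worth, the gate you're describing doesn't have to be elaborate. Here is a hedged sketch of a canary-ring rollout: push the content update to a handful of test machines first and halt if any of them fail to come back healthy. The deploy and health-check functions are placeholders, not anything CrowdStrike actually exposes.

```python
# Sketch of a canary-ring gate: deploy to a small test ring first, verify each
# machine survives, and only then continue fleet-wide. Deploy/health-check
# functions are hypothetical stand-ins for a real update pipeline.
import random
import time

def deploy_update(host: str, update_id: str) -> None:
    """Placeholder for pushing the update to one host."""
    print(f"pushing {update_id} to {host}")

def host_is_healthy(host: str) -> bool:
    """Placeholder health check (e.g. host reboots and reports back in)."""
    return random.random() > 0.01    # simulated: 1% of pushes brick the host

def staged_rollout(update_id: str, canaries: list[str], fleet: list[str]) -> bool:
    for host in canaries:
        deploy_update(host, update_id)
        time.sleep(0.1)                      # wait for the host to report back
        if not host_is_healthy(host):
            print(f"canary {host} failed -- halting rollout of {update_id}")
            return False
    for host in fleet:                       # canaries passed: continue
        deploy_update(host, update_id)
    return True

if __name__ == "__main__":
    staged_rollout("content-update-2024-07-19",
                   canaries=["test-vm-01", "test-vm-02"],
                   fleet=["prod-001", "prod-002", "prod-003"])
```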

3 Thanks

Think about how poorly written this update was. It blue-screened the machine upon installation, and then, if that wasn't enough, the machine would not boot afterwards and needed a tech to go into Safe Mode and manually delete a file.

I’m guessing that the CrowdStrike program crashed when it was trying to read the file. If so then the program was poorly written since it shouldn’t have done that.

That kind of makes sense, then: the file push wasn't rigorously tested because for them it was “only an input file” rather than a software change.
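That "only an input file" framing is exactly where defensive parsing matters. Here is a generic sketch of the fail-safe idea: treat the downloaded content as untrusted, validate it, and fall back to the previous known-good data instead of crashing. The file format and validation rules are invented for illustration; the real channel-file format is proprietary.

```python
# Sketch of "fail safe on bad input": a loader that validates a downloaded
# content file and keeps running on the previous known-good data if anything
# looks wrong, instead of crashing. Format and checks are assumptions.
import json
import logging

log = logging.getLogger("content-loader")

def load_content_file(path: str, fallback: dict) -> dict:
    """Parse `path`; on any error, log it and keep using `fallback`."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
        if not raw:
            raise ValueError("content file is empty")
        data = json.loads(raw)               # assumed format for the sketch
        if "rules" not in data:
            raise ValueError("missing 'rules' section")
        return data
    except Exception as exc:                 # never let bad input take us down
        log.error("rejecting content update %s: %s", path, exc)
        return fallback

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    current = load_content_file("new_update.bin", fallback={"rules": []})
    print(f"running with {len(current['rules'])} rules")
```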

1 Thank