
CrowdStrike: Key Takeaways and Lessons

25th July 2024

CrowdStrike just released their “Preliminary Post Incident Review (PIR)”, and I found it a really interesting read:

Falcon Content Update Preliminary Post Incident Report | CrowdStrike

TL;DR! 😴

You might not want to read the whole thing because it is quite long, so here is my short version of what CrowdStrike say happened:

  • CrowdStrike released a “configuration” update to the Windows sensor covering possible novel threat techniques. If you want the technical details, a channel file called “C-00000291*.sys” was updated and caused the blue screen of death (BSoD); there is a short clean-up sketch after this list
  • It was out in the wild for 78 minutes – yep, that’s all the time it took to BSoD 8.5 million Windows machines worldwide (a number CrowdStrike don’t mention in their PIR 😜)
  • It impacted every single machine that received the update, which points to a lack of both a testing process and the ability to perform a staged rollout of updates. (CrowdStrike mention the word test/testing 24 times in the PIR!)
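
As an aside, the widely shared manual fix was to boot affected machines into Safe Mode and delete that channel file from the CrowdStrike drivers folder. Purely to illustrate that clean-up step, here is a minimal Python sketch: the folder path and filename pattern follow the publicly shared remediation guidance, it is not an official CrowdStrike tool, and it defaults to a dry run so nothing is deleted until you say so.

```python
# Illustrative sketch only: mirrors the manual remediation that was widely shared
# (boot into Safe Mode, delete the offending channel file). Run with admin rights,
# at your own risk, and only if the published guidance still applies to you.
import glob
import os

# Path and pattern taken from the public remediation steps; adjust if your
# install location differs.
DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(dry_run: bool = True) -> None:
    matches = glob.glob(os.path.join(DRIVER_DIR, PATTERN))
    if not matches:
        print("No matching channel files found - nothing to do.")
        return
    for path in matches:
        if dry_run:
            print(f"[dry run] would delete: {path}")
        else:
            os.remove(path)
            print(f"deleted: {path}")


if __name__ == "__main__":
    # Default to a dry run so you can see what would be removed first.
    remove_faulty_channel_files(dry_run=True)
```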

OK, but who are CrowdStrike? 🤔

CrowdStrike is a cybersecurity technology company specializing in endpoint security, threat intelligence, and cyberattack response services. Their flagship product, the Falcon platform, leverages cloud technology to provide advanced detection and prevention capabilities, defending against malware, ransomware, and other cyber threats.

So, CrowdStrike is a premium-brand AV/EDR. Not every business can afford the high licensing costs, which is probably part of why this made such big headlines: the affected organisations tend to be the ones with larger IT budgets. Now, I do believe you get what you pay for when it comes to security, and based on what people have said about the platform, CrowdStrike is 100% worth the money.

Not a question of if, but when…🚨

Yep, this was bound to happen at some point. You may disagree, but given how AV/EDR vendors have to hook into the Windows kernel (which is some deep, low-level shiz that every AV vendor, including Microsoft, has to do!), something like this was always going to happen eventually. Yes, CrowdStrike should have a more robust testing and staged rollout process and procedure, but with this type of product it was only a matter of time. Perhaps not on this scale, but wow, this was something to behold!

Test, Test, Test! 🧪

Pushing out a “content update” that runs in the kernel context is always going to be dangerous, which is why I cannot believe that CrowdStrike let this happen. The update caused a blue screen of death on 100% of the endpoints that received it, and the only logical conclusion you can draw is that they did literally zero, nil, zilch testing of the update prior to its release. That should never have happened.

People may say that testing is hard and whatnot, but not in this case. It is the responsibility of CrowdStrike, and of ALL vendors, to ensure the stability of their software “updates” before they release them to their customers’ systems.

I mean, thankfully, CrowdStrike seem to be slowly learning that this is a really bad idea. Like 8.5 million times a bad idea!! Plus, I cannot fathom how a product like this didn’t (but hopefully soon will) allow its customers to control updates using something like update rings. That’s why we have test servers and even sets of test machines: so that we can understand the impact of these updates before they go everywhere. I mean, come on, I can’t believe customers haven’t fed this back to CrowdStrike!

If you look at the Microsoft Defender settings, for example, they already include update rings.

I 100% expected something like this to be included in the CrowdStrike config for endpoint settings.
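
To make the update-ring idea concrete, here is a minimal sketch of what a staged rollout looks like conceptually. The ring names, deferral periods and endpoint groups are made up for illustration and are not CrowdStrike’s or Microsoft’s actual settings; the point is simply that a new content update should hit a small canary group first and only reach the broad ring days later.

```python
# A minimal sketch of the update-ring idea discussed above: hypothetical ring
# names and deferral periods, not any vendor's actual configuration.
from dataclasses import dataclass


@dataclass
class UpdateRing:
    name: str
    deferral_days: int       # how long after release this ring receives the update
    endpoints: list[str]     # which machines belong to the ring


RINGS = [
    UpdateRing("canary", 0, ["test-vm-01", "test-vm-02"]),
    UpdateRing("pilot", 2, ["it-laptop-01", "it-laptop-02"]),
    UpdateRing("broad", 7, ["everything-else"]),
]


def rollout_plan(days_since_release: int) -> list[str]:
    """Return which rings are eligible for the update at this point in time."""
    return [r.name for r in RINGS if days_since_release >= r.deferral_days]


if __name__ == "__main__":
    for day in (0, 2, 7):
        print(f"day {day}: rings eligible -> {rollout_plan(day)}")
```

In a real deployment, the gate between rings would also check canary health (no BSoDs, sensors still reporting in) before promoting the update any further.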

Million Dollar Question – Should I Jump Ship?! 🚢

TBH, that is your call if you are already using CrowdStrike, but if you want my opinion, the answer is absolutely not.

Why? CrowdStrike is still probably one of the best solutions out there, and a knee-jerk reaction to something like this is not the right approach. And because this has happened to them, you can be pretty much 99% certain it will not happen again: they understand the impact, and what a repeat would do to their business. Processes and procedures will be reviewed and updated, and customers will get the option to configure update rings to control the rollout at some point in the very near future.

Hey, security is hard, and yes, CrowdStrike should have done better, but we are where we are. Moving AV/EDR security providers is a lot of hard work – I know this because I helped a large enterprise move from McAfee to Microsoft Defender across 6,000+ endpoints, and it takes a lot of planning, testing, time and effort from multiple stakeholders within the business. What I would say is stick with CrowdStrike, give them another chance to right the wrongs, and when renewal time does come around, make sure you haggle them down on price 😉!

We should all give thanks to the IT Pros 🙏

The events that happened, I would say, shocked the world. Flights were grounded, people were unable to pay for food or shopping, and hospital and doctor appointments were cancelled. However, we should thank the IT Pros who worked tirelessly (and in some cases 24×7) to get systems back online. When the news broke and they saw what was happening, I can just imagine the feeling (I would have been exactly the same if we were managing customers with CrowdStrike).

Everything was basically on fire, and I salute all IT Pros out there – you don’t get enough recognition for what you do 🫡.

OK, but what lessons can I learn from this? 🧑‍🏫

Here are my takeaways from all of this, and I encourage you to go away and check that you have these covered:

  • Test ALL updates for any systems that you have
    • Every update, whether it is an AV content update, an OS patch or a software patch, should be tested. Period. That is what update rings are for!
  • Controlled/staged rollout
    • If your AV/EDR has the option to stage or control the rollout of updates, please make sure it has been configured. If it doesn’t have this, then perhaps you should be asking the vendor why not
  • Vendor testing process
    • Does your vendor test updates thoroughly? Do they have a documented testing and validation process and procedure? Now might be a good time to ask them and get that documentation from them
  • Backups
    • OK, this should be a given, but make sure that all systems are backed up. I would even back up DEV/TEST environments. Yes, they may not be production, but they are still critical: developers cannot do their job without them!
  • Encryption keys
    • Sysadmins had to use BitLocker recovery keys to boot into safe mode and remove the culprit file. Do you have an internal process for handling these keys in this type of emergency, and do you have a process for rotating them once they have been used? (There is a rough key-lookup sketch after this list)
  • DR & BC Plans
    • All of the above should be thought about from a DR (Disaster Recovery) and BC (Business Continuity) perspective. Do you have these plans already in place? If so, now is a good time to re-read them, reflect on what just happened, and update them so that you are better prepared if something like this happens to you (or has happened to you)
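
To illustrate the recovery-key point, here is a rough sketch of looking up a BitLocker recovery key from a CSV export that you have escrowed somewhere safe ahead of time (for example, exported from AD or Entra ID). The file name and column names here are assumptions made for the example; match them to whatever your own escrow export actually contains.

```python
# Hypothetical example only: looks up a BitLocker recovery key from a CSV export
# escrowed in advance. The column names "DeviceName" and "RecoveryKey" are
# assumptions for this sketch, not a standard export format.
import csv


def find_recovery_key(csv_path: str, device_name: str) -> str | None:
    """Return the recovery key for a device, or None if it is not escrowed."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("DeviceName", "").lower() == device_name.lower():
                return row.get("RecoveryKey")
    return None


if __name__ == "__main__":
    key = find_recovery_key("bitlocker_escrow_export.csv", "FINANCE-LT-042")
    print(key or "No key escrowed for that device - time to check your process!")
```

The real lesson is less about the script and more about having the keys accessible (and a rotation process ready) before you need them at 3am.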

Want to discuss this further?

Hey, I would love to hear your thoughts on this or if you want to discuss any of the points in this post.

Or, if you need any help configuring update rings in Defender or Windows Update, feel free to send us an email. We would be happy to help 🤓
