Episode 4 · 26 mins · 7/24/2024

How CrowdStrike Caused the World’s Largest IT Outage: A Postmortem

Host: Mirko Novakovic
Guest: Fabian Lange
BONUS #4 - How CrowdStrike Caused the World’s Largest IT Outage: A Postmortem with Fabian Lange

About this Episode

Instana co-founder and agent expert Fabian Lange joins Dash0 CEO Mirko Novakovic to break down what caused CrowdStrike's massive worldwide outage, how they prevented rollout glitches at Instana, and why giving your customers control prevents IT headaches.

Transcription

[00:00:00] Chapter 1: Introduction to Code Red and CrowdStrike Outage

Mirko Novakovic: Hello everybody. I'm Mirko, and welcome to Code RED. Code, because we are talking about code, and RED stands for Requests, Errors, and Duration, the core metrics of observability. On this podcast, we hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Fabian Lange. Fabian was my co-founder and VP of Engineering at Instana, and he is also the mastermind behind the Instana auto-discovery agent, which was the centerpiece of instrumentation and discovery of components in customer environments. Fabian, welcome to Code RED.

Fabian Lange: Hi. Thanks for having me.

Mirko Novakovic: Fabian, this is a special edition of Code RED, and it will discuss what happened with the CrowdStrike outage. So let me summarize what I understood. On July 19th, around 4 a.m., there was a rollout of what CrowdStrike calls a sensor for the Falcon agent, and within some environments, mainly certain versions of Windows, this led to a crash and the famous blue screens on a lot of PCs around the world. Some airports went down, some hospitals went down. It was a pretty significant outage; I read it was one of the biggest outages in software history. So, caused by a sensor update. And that led me to this podcast and to discussing it with you, because at Instana we also had an agent with sensors and a very similar scenario for how we rolled them out, and also some guardrails for how we protected ourselves against such outages. So, Fabian, let's discuss how these agents work.

[00:01:59] Chapter 2: Instana's Approach to Agent Updates

Fabian Lange: Right. When I read the news, I immediately thought, oh, that's quite familiar. Even the time, as you mentioned, 4 a.m., the same time the Instana agents were configured to update by default. So we have a pretty good idea of what could have happened. I've seen analyses of the technicalities of what happened in the Windows kernel; Instana never worked on that level. But beyond that, the whole process behind it, how we rolled out our updates, how CrowdStrike could or should have done it, and how customers can decide what to do, is a great topic to talk about.

Mirko Novakovic: Yeah, exactly. So at Instana we had something like a microkernel agent. We rolled out an agent to the customer which was basically only a skeleton of the agent with the base functionality. And we did this because we knew that rolling out an agent is normally a very big process on the customer side, because it must go through the full software rollout and has to be certified in some scenarios, so we wanted that agent to be as stable as possible. I don't even remember us updating the core agent in the first five years, right? Because that was only the base foundation for what was then done by the so-called sensors, correct?

Fabian Lange: Yeah, that's absolutely right. The first years, actually all the time: you could take an Instana agent that we developed six years ago and just use it today, because it would still work. This core agent would still work, download the newest versions of everything else, and then work with those newest versions. The main problem customers were facing was getting the agent installed onto their systems and, once it is there, maintaining it. If you have a lot of systems, that is a challenge. And I think it's the same challenge CrowdStrike wants to solve for customers: you don't need to maintain it anymore once you have it on the system. That was the same philosophy we had at Instana as well. Once you have the installation on the system, you don't need to update it anymore, because the updates are automatic and managed by Instana.

[00:04:18] Chapter 3: The Nature and Risks of Automatic Updates

Mirko Novakovic: To be fair, this was one of our core pieces of IP. It was one of the core concepts we built the product around, and I think it is still one of the most powerful technologies and agents in the observability space. If you install this agent on your server, it automatically discovers all the components that are running on that system. Then, once it has discovered a component, it downloads a so-called sensor, the same term CrowdStrike uses. For each component, if it discovered a Java process, it downloaded a Java sensor, or a MySQL sensor, an NGINX sensor, and that sensor could be updated all the time. If we patched it, if we added some new functionality, it was very easy for the customer, who basically had to do nothing, because, as you said, at 4 a.m. the agent connected to our servers, checked if there was a new sensor version, downloaded it, installed it, and that's it. Automatically updated, automatically maintained, no need to do any manual updates.
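
To make the update loop described here concrete, below is a minimal, self-contained Python sketch of the discover-and-download pattern: the repository is a plain dictionary and all names are invented for illustration, so this is not Instana's actual implementation.

```python
# A minimal, self-contained sketch of the discover-and-update pattern described
# above. None of these names come from Instana; the "repository" is a plain dict
# standing in for a real sensor repository.

SENSOR_REPOSITORY = {          # hypothetical central repository: component -> latest version
    "java": "1.42.0",
    "mysql": "2.7.1",
    "nginx": "1.13.3",
}

installed_sensors = {}         # local agent state: component -> installed version

def discover_components():
    """Stand-in for auto-discovery; in reality this would inspect running processes."""
    return ["java", "mysql", "nginx"]

def download_and_install(component, version):
    """Stand-in for downloading, verifying, and activating a sensor."""
    print(f"installing {component} sensor {version}")
    installed_sensors[component] = version

def update_cycle():
    """Check every discovered component against the repository and update if needed."""
    for component in discover_components():
        latest = SENSOR_REPOSITORY[component]
        if installed_sensors.get(component) != latest:
            download_and_install(component, latest)

if __name__ == "__main__":
    update_cycle()             # e.g. run once per night, around 4 a.m.
```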

Fabian Lange: Observability is quite similar to cybersecurity in that regard: you want the system to be covered. You want all the newest versions of software monitored. And if you think about a cluster system on Kubernetes, where you can schedule any kind of workload, you want the agent to adapt to whatever is running on it. That's what the Instana agent does: it sees what's running, gets the latest monitoring software, and monitors that system. This is quite a unique capability, because it also allowed us to iterate. When a customer told us, hey, this is great software, great monitoring, but on this part of the system something is not really working, we could look at it and fix it, and the next day it was working. So it was quite a big competitive advantage over software that needs to be manually rolled out and manually installed on systems, where the customer, as you said before, does not really have a say.

Mirko Novakovic: Yeah, but now let's talk about the downside, which we have seen at CrowdStrike. The downside is: if you do that and automatically roll out a new sensor to all your agents, to all your servers and systems, and there is a problem with that sensor, it can break all those systems, right? That's what happened at CrowdStrike, at least on the affected Windows nodes and PCs. That could have happened with Instana too, though, as you said, we were not that deeply integrated into the kernel of the system, and we also added some protection mechanisms, which we can talk about later, and how we implemented them. But first, my question would be: what do you think, how is it even possible that a problem affecting so many PCs was not found before the rollout?

[00:07:24] Chapter 4: Challenges in Testing and Complexity of Customer Environments

Fabian Lange: I think it's quite possible; we have seen it happen. I was wondering why it didn't happen before. The way I understand it is that in cybersecurity you have those things called zero-day exploits: something that is threatening your software today was created today. And if I, as the customer, want to be protected today, there is a trade-off of risk versus reward: I want to be protected, so I install the protection that was developed today, tested today, staged today. But maybe there was a corner case, whatever caused this problem, that was overlooked. In the cases where it works, probably on all the other days in the past that CrowdStrike has existed, it has worked and given customers this advantage. So I think, yeah, it had to happen one day.

Mirko Novakovic: Yeah, absolutely. I think, first of all, it's Murphy's Law, right? If it can happen, it will happen. That's number one. But number two is, I think, what we also learned at Instana: the complexity of the environments, the number of different components you will discover on the customer side. Essentially, we never saw two environments that were similar, right? Every customer was different, the versions were different, the patches were different, the components that were running were different. So testing such sensors is really tough, because you have to reproduce all those different environments and try to pre-test a rollout. In our own environment we did this, right? We had a highly automated environment where we had hundreds, if not thousands, of different components running in different versions and being automatically tested with the sensors.

[00:09:12] Chapter 5: Mechanisms to Mitigate Update Risks

Fabian Lange: Exactly. We had that at Instana, and I'm pretty sure CrowdStrike also has that. It's just that, with the sheer number of permutations available out there in the world, you will never, as a vendor, be able to test all of it, unless you have really locked-down hardware, like maybe Tesla has in their cars or Apple in their iPhones. If you really know what's running, then you can guarantee it will not happen. Otherwise, I think this is a shared responsibility with customers as well: if they want the benefits of a system that updates automatically, they should have some say in when they get the updates, and maybe in how much of an update has already been rolled out. Do they want to be in the first 10% or the last 10% of customers that get an update? I think some choice, some testing capability needs to be there, in kind of a staggered or staged approach to rollouts.
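
As an illustration of the staggered rollout Fabian describes, here is a small Python sketch that deterministically assigns agents to rollout rings (for example the first 1% versus the last 10%); the ring boundaries and hashing scheme are assumptions made for this example, not CrowdStrike's or Instana's actual mechanism.

```python
# Hedged sketch: split an agent fleet into rollout rings so a customer can
# choose to be among "the first 10% or the last 10%" to receive an update.
import hashlib

RINGS = [              # (ring name, cumulative share of the fleet) - illustrative values
    ("canary", 0.01),
    ("early", 0.10),
    ("broad", 0.90),
    ("late", 1.00),
]

def ring_for_agent(agent_id: str) -> str:
    """Deterministically map an agent to a ring via a hash of its ID."""
    digest = hashlib.sha256(agent_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000      # uniform value in [0, 1)
    for name, upper_bound in RINGS:
        if bucket < upper_bound:
            return name
    return RINGS[-1][0]

if __name__ == "__main__":
    for agent in ["airport-checkin-17", "hospital-lab-03", "dev-laptop-42"]:
        print(agent, "->", ring_for_agent(agent))
```

Because the assignment is a pure function of the agent ID, the same machine lands in the same ring on every rollout, which keeps the staggering predictable for the customer.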

Mirko Novakovic: Yeah, and that's what we developed over time, right? We had multiple options for customers to protect themselves against these automatic updates and potential problems. Number one: you could basically turn your agent offline, which meant it was no longer connected to our service and did not update automatically. The customer then had to install the sensors manually; they could be downloaded to a repository and updated manually. That meant they were protected from any updates we did, and they could roll them out at their own pace, through their own environments and staging setups, as they wanted.

Fabian Lange: Yes, this prepackaged agent that contained everything, stable and fixed, customers could download, test, stage, validate, do whatever they wanted with it, and roll it out. The request originally came from customers who had computers that were not connected to the internet, so that was one way of doing it. But of course, they also then said: oh, we forgot to update, you had a new version, there was an incident with a library version update that we wanted, and we forgot about that system. So that's a downside. To improve on that situation, we had various degrees of version pinning and repository mirroring. Customers, for example, could decide: we want all the updates from Instana, but before they reach our agents, they are downloaded to a system on our side, a kind of gateway or proxy. Then we can validate that everything is signed and really coming from Instana before we distribute it internally. Or they could introduce some kind of delay and say: okay, we look at this for one week, and if Instana has not recalled any of those updates, we give them to our internal systems. Or they could cherry-pick certain sensors, because the agent is very modular; maybe I want the newest stuff for PHP, but not for Java, and things like that. You could even mix and match, with the added complexity that you then need to manage a mixed setup, which most customers didn't want to do. So I think most Instana customers preferred the very convenient, always up-to-date way of having the Instana agents update automatically.
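
The policies Fabian lists (version pinning, a soak delay on a mirror, cherry-picking individual sensors) could look roughly like the following Python sketch; the configuration shape and function names are hypothetical and are not Instana's actual agent configuration.

```python
# Illustrative policy check for an internal mirror: respect pinned versions,
# only auto-update cherry-picked sensors, and enforce a soak delay.
from datetime import datetime, timedelta, timezone

policy = {
    "pinned_versions": {"java": "1.40.2"},    # never move past this version
    "mirror_delay_days": 7,                   # soak time before internal distribution
    "auto_update_sensors": {"php", "nginx"},  # cherry-picked: only these follow "latest"
}

def eligible_for_distribution(sensor, version, published_at):
    """Decide whether the internal mirror may hand this artifact to agents."""
    pinned = policy["pinned_versions"].get(sensor)
    if pinned is not None and version != pinned:
        return False                                          # respect the pin
    if pinned is None and sensor not in policy["auto_update_sensors"]:
        return False                                          # not cherry-picked
    soak = timedelta(days=policy["mirror_delay_days"])
    return datetime.now(timezone.utc) - published_at >= soak  # wait out the delay

if __name__ == "__main__":
    published = datetime.now(timezone.utc) - timedelta(days=10)
    print(eligible_for_distribution("php", "3.1.0", published))    # True: cherry-picked and soaked
    print(eligible_for_distribution("java", "1.41.0", published))  # False: pinned to 1.40.2
```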

[00:12:39] Chapter 6: The Role of Monitoring and Feedback Mechanisms

Mirko Novakovic: Yeah, as you said, it's a trade-off, right? Do I want the automatic mechanism, or do I want to be safer? And as you said, what we built is basically a proxy of the repository. We had a central repository on our servers where all the sensors were, and if there was a new sensor, it would automatically be pushed out to all the agents. But as a customer, you could put a repository on your own side in between, where these sensors were downloaded first. You could then say, for example, that an agent in a development or testing environment downloads those sensors first, and once you have validated that they work, they are rolled out to production. That's a staged environment, as sketched below. In the CrowdStrike case, if you had done that with these Windows PCs, you would have immediately seen your dev or test systems crashing, and you could have prevented the update from going into production. That's a classic staged setup, and it was something we implemented and some of the bigger enterprise customers used as a protection mechanism against such outages.
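
A staged dev-to-production gate of the kind Mirko describes might look like this sketch, where a sensor version is only promoted to the production channel after a validation window on the test fleet; the channel names, the window, and the health check are illustrative assumptions.

```python
# Sketch of a promotion gate between a "test" and a "production" channel.
from datetime import datetime, timedelta, timezone

VALIDATION_WINDOW = timedelta(hours=24)    # assumed soak time on the test fleet

def test_agents_healthy(sensor: str, version: str) -> bool:
    """Stand-in: query self-monitoring for crashes or errors from test agents."""
    return True

def promote_if_validated(sensor, version, deployed_to_test_at, channels):
    """Copy a version from the test channel to production once it has soaked cleanly."""
    soaked = datetime.now(timezone.utc) - deployed_to_test_at >= VALIDATION_WINDOW
    if soaked and test_agents_healthy(sensor, version):
        channels["production"][sensor] = version
        return True
    return False

if __name__ == "__main__":
    channels = {"test": {"java": "1.42.0"}, "production": {"java": "1.41.0"}}
    deployed = datetime.now(timezone.utc) - timedelta(hours=30)
    promote_if_validated("java", "1.42.0", deployed, channels)
    print(channels["production"])   # {'java': '1.42.0'} once validation passes
```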

Fabian Lange: And we actually used this internally for our own testing. We needed that behavior because we had developers making a new version of something and then wanting to test it. So what the customers were using is one side of this; the other side was used internally, where alpha and pre-alpha versions of our own sensors sat in our repository and our developers used them on our fleet of test machines.

Mirko Novakovic: And then the other thing you mentioned, which I just want to reiterate, is canary deployments, where you said: hey, do I want to be in the first 10% or the last 10%? This basically means you can roll out deployments in a way that not all systems are updated at once; you roll them out iteratively. For example, you say: first, only 1% of my customer agents get the update, and then I wait for 15 to 20 minutes and see if it's working. So I need that feedback mechanism, which we call self-monitoring. All the agents were reporting back; when they were updated, we could see whether the update happened and whether it was running well. With that feedback, we could stop the rollout and prevent the update from reaching all the other agents. That's another approach.
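
The canary-plus-feedback loop described here can be sketched as a small controller that widens the rollout step by step and halts or rolls back when self-monitoring reports problems; the ramp steps, thresholds, and reporting functions below are placeholders, not an actual vendor implementation.

```python
# Sketch: push an update to a small share of agents, wait for self-monitoring
# reports, and only widen the rollout if the error rate stays low.
import time

RAMP_STEPS = [0.01, 0.05, 0.25, 1.00]    # share of agents targeted at each step
ERROR_THRESHOLD = 0.02                   # abort if more than 2% of canaries report problems
WAIT_SECONDS = 15 * 60                   # e.g. 15-20 minutes; shorten when trying this out

def push_update(version: str, fraction: float) -> None:
    print(f"pushing {version} to {fraction:.0%} of agents")

def error_rate(version: str) -> float:
    """Stand-in for self-monitoring: share of updated agents reporting failures."""
    return 0.0

def rollback(version: str) -> None:
    print(f"rolling back {version}")

def staged_rollout(version: str) -> bool:
    for fraction in RAMP_STEPS:
        push_update(version, fraction)
        time.sleep(WAIT_SECONDS)                 # let feedback arrive
        if error_rate(version) > ERROR_THRESHOLD:
            rollback(version)
            return False                         # stop before the rest of the fleet is hit
    return True

if __name__ == "__main__":
    staged_rollout("sensor-2024.07.19")
```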

Fabian Lange: Yes. Self-monitoring is an essential part when you have rolling updates, canary or A/B testing, whatever you want to call it: some portion of the customers or systems gets the version first, you observe what happens with that portion, and if everything is working all right, you push it further. Or, if there are problems, you can stop and revert. And I think this is what CrowdStrike also did: they reverted within about half an hour. They said they saw it, but it was too late, because everybody already had the broken update.

Mirko Novakovic: Yeah, especially because it was the classic blue screen; nothing happens anymore after that. You have to reboot Windows in safe mode, and then, I think, you had to delete the affected files to get it back into an operational state. But again, monitoring or observability is not only for applications. It can also be used to self-monitor your agents and give that feedback to your monitoring system. We actually used Instana to monitor Instana: we used our own tool to monitor the agents that were monitoring the applications of our customers. And I think that's something you should have. If you are operating such a massive number of agents, you have to have feedback, and you have to have a mechanism to roll out step by step, see the feedback, and be able to roll back if there is a mistake.

Fabian Lange: Yes. And that's not only for very critical mistakes that can bring down your environment. You could also have, say, a configuration change that alters your logging, so you suddenly get flooded with logs you didn't intend to collect, and now the vendor of your observability solution is charging you a lot of money, because suddenly you're sending 100x the volume of logs without knowing it. So that's something you should always monitor yourself, or vendors should monitor for their customers: is the behavior of all the systems after a rollout as expected, similar to before, or even better, but at least similar to before?
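
One concrete form of that post-rollout check is comparing a signal such as log throughput before and after an update and flagging a blow-up; the threshold and metric source in this sketch are illustrative.

```python
# Sketch: flag a suspicious jump in log volume after a rollout.
def log_volume_anomaly(before_lines_per_min: float,
                       after_lines_per_min: float,
                       max_ratio: float = 10.0) -> bool:
    """Return True if log volume after the rollout grew more than max_ratio times."""
    if before_lines_per_min <= 0:
        return after_lines_per_min > 0
    return after_lines_per_min / before_lines_per_min > max_ratio

if __name__ == "__main__":
    # A config change that floods logging: 2,000 -> 200,000 lines per minute.
    if log_volume_anomaly(2_000, 200_000):
        print("log volume exploded after rollout - alert and consider rolling back")
```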

[00:17:19] Chapter 7: Cybersecurity Constraints and Potential Solutions

Mirko Novakovic: The question is: why did CrowdStrike not have this mechanism? And I think you already gave a hint, Fabian. If you are in the cybersecurity space and it's about protecting against security threats, it's really a matter of time, right? In our space, observability, you can maybe wait three or four days until you have some instrumentation in place. But when there is a hacker group, or a security threat coming from an unpatched component or whatever, you really want to be protected as fast as possible. So that's just a guess, but I could see CrowdStrike doing it that way, because they really want to make sure their customers are protected as fast as possible. Once they have the protection for a new threat, they want to get it out to all instances so that customers are protected. I think that's one explanation for why this happened.

Fabian Lange: Yes. And that's unfortunately also why I think it crashed Windows. Of course, Windows does not want any driver to crash the system, so Windows has a big certification process: when you want to release software that runs in the kernel, you need to get it checked. But that's a long process, and if you want to fight a threat that came up tonight, you cannot go through it. So while this core agent has certainly been verified by Microsoft to do only the things it's supposed to do, and to do them correctly, these updates can cause different behavior that has not been tested on that specific system or that specific type of system. Maybe headless computers in airports were affected because none of the test systems were headless, or things like that. This is really just speculation now, and I'm pretty sure people at CrowdStrike know exactly what happened, and they will make sure it does not happen again.

Mirko Novakovic: Yeah, but as a solution, I could see that in the future you could maybe classify systems, because you don't want airports to fail, you don't want hospitals to fail, right? If we talk about millions of computers, you can maybe configure something like, as you said: I don't want to be in the first 10% getting a patch; I can maybe live with getting it an hour later and being on the safer side. So I can see that there will be changes in the future where you can categorize a system as being very critical, or what have you.

Fabian Lange: But for a very critical system, you want threat protection immediately. I think this is really an interesting space and an interesting dilemma that CrowdStrike is facing: the really critical systems need the update fast, while the not-so-critical systems want it later.

[00:20:07] Chapter 8: Lessons and Future Directions in Agent Rollouts

Mirko Novakovic: It's a trade-off, right? A trade-off between being secure against cyber threats and running the risk of crashing a system. So yeah, you are right, there's no easy answer, but I think customers will ask for different rollout setups and different workflows, and for the ability to configure them, similar to what we had at Instana, so that you can at least decide which systems you want updated automatically, always, or whether you want to be last in line to see if it's working. They need the feedback mechanism. We don't know whether they have one or not, but as you said, they saw the problem pretty quickly and rolled it back, so they have at least some feedback mechanism. Maybe it was also customers calling about blue screens; we don't know. But there was a feedback mechanism. So I think there will be significant changes in the way they do rollouts. And yes, it's a trade-off between being secure fast and mitigating the risk of crashing systems.

Fabian Lange: Yeah, it's the same trade-off our customers also need to make when they roll out new software versions. Some deploy every day, some say maybe once per week is enough, because you're weighing the risk of change against the risk of no change. It's a classic IT dilemma.

[00:21:38] Chapter 9: Transition to OpenTelemetry

Mirko Novakovic: Absolutely. And it also shows how today's complexity has introduced challenges that are not easy to fix. There are just so many different operating systems, versions, components, and then threats, security threats, cyber threats, but also observability problems like managing performance and load, that it's a risk that can always materialize. We can try to make that risk really small, but at the end of the day it's not 100% preventable, in my point of view. I just want to wrap it up, Fabian, by saying that at Dash0 we took a different approach than at Instana. We are not developing our own agent anymore; we don't have auto-discovery, we don't have sensors. We are basically relying on the new standard, OpenTelemetry. The agent in OpenTelemetry is essentially the Collector, and there are also some auto-instrumentations. There are many advantages to that approach. One is that it's a battle-tested open source agent used by many different customers and developers, and the patches are transparent, so problems can be found quickly and tested by the open source community. That's number one. Number two is that the whole process of rolling out and configuring the Collector is in the hands of the customer. They don't get the fully automated mechanism here, but they have full control, and they benefit from a rock-solid open source component developed by hundreds of developers in a large community. That way, we think this is a very robust and, I would say, production-ready component.
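
For contrast with the agent model discussed earlier, here is a minimal example of how a customer wires the OpenTelemetry Python SDK to export traces to a Collector endpoint they operate themselves; it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed, and the endpoint and service name are placeholders.

```python
# Minimal OpenTelemetry SDK setup: the customer controls the SDK version,
# the exporter configuration, and the Collector it points at.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service; the name is a placeholder.
resource = Resource.create({"service.name": "checkout-service"})

# Export spans via OTLP/gRPC to a Collector run by the customer (placeholder endpoint).
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("place-order"):
    pass  # application work happens here
```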

[00:23:39] Chapter 10: Concluding Thoughts on Observability and Preparedness

Fabian Lange: I believe that what you described is a really good compromise, because there is no one solution that works for everything. And it turns out, from my experience in this space, that if you end up having a little bit more control, the solution is always a bit better. If you, as a customer, have control over when you deploy a new version and what kind of instrumentation you use, the solution is always a little bit better. But of course, you also love to have some assistance. As you said, customers don't want to develop their own agent; there is something in the open source space that they can take and use, and it does work, and the updates are transparent for everybody to see. This slightly more controlled approach to things is probably the right way to go.

Mirko Novakovic: Fabian, thank you a lot for joining this special edition of Code RED and discussing the CrowdStrike outage with me. Because when I read sensor, when I read agent and updates, I was immediately thinking of you. And I know that...

Fabian Lange: But it wasn't me.

Mirko Novakovic: Yeah, it wasn't you. But I know that we had these discussions at Instana too, and we had the fear that this could happen. It never did, fingers crossed that it won't in the future, but we put some architectural decisions in place that help customers prevent it. At the end of the day, though, it's a complex environment, complex systems. We hope it will not happen again, but we need to be careful.

Fabian Lange: Absolutely. Mean time to repair is always king. So things will go wrong, and you need to know how to fix it. And for that, I believe you need a good observability solution.

Mirko Novakovic: Good ending statement. Thanks, Fabian.

Fabian Lange: Thank you, Mirko.
