Episode 21 · 38 mins · 3/20/2025

From Outages to Optimization: How ilert Solves Incident Response

Host: Mirko Novakovic
Guest: Birol Yildiz
#21 - From Outages to Optimization: How ilert Solves Incident Response with Birol Yildiz

About this Episode

ilert CEO Birol Yildiz joins Dash0’s Mirko Novakovic to break down how ilert helps enterprises manage incidents more effectively. They explore why AI is a ‘must’ for some aspects of incident response, the role of technical redundancies in ensuring reliability, and what defines a strong MTTA and MTTR in the enterprise world.

Transcription

[00:00:00] Chapter 1: Introduction to Code RED

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0. And welcome to Code RED: code, because we are talking about code, and RED stands for Requests, Errors and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Hello everyone! Today I have Birol Yildiz on my show. Birol is co-founder and CEO of ilert, an incident and on-call management company founded in Cologne. And I'm really happy to have you on my show, Birol.

Birol Yildiz: Thanks for having me, Mirko. It's a pleasure.

[00:00:49] Chapter 2: Experiencing a Code RED Moment

Mirko Novakovic: Before we talk about Cologne and how close our companies are, I always start my podcast with the same question: what was your biggest Code RED moment?

Birol Yildiz: I had a few, and we're an incident management company, so we experience a few of these moments. But I generally tend to suppress negative thoughts. There's one moment I remember vividly, which is an outage of our web platform and mobile apps that lasted about 15 to 20 minutes. And the reason I remember it vividly is because I was on vacation back then. I was having an early breakfast with my kids, and then I returned to the room and I see these messages on my phone, from our status page, from the engineers: we're having an outage, right? The reason this was very intense is that we pride ourselves on being a very reliable platform, because we need to be up when all of our customers are down, and 15 minutes, that's pretty long, right? That's really long. When we look at outages and incidents, or broadly speaking at IT incidents, there are three root causes. Either someone makes a change and breaks production. Second, you have unexpected system load, a system overload basically. And third, something that you depend on doesn't work and you're not able to deal with it. It could be your cloud provider, it could be an API. And in that case, it was a very unspectacular root cause: it was our core engineer making an upgrade of Kubernetes. Fortunately, it was very early in the morning, so all of our European customers weren't on the platform yet, and all of our US customers were sleeping. And this configuration change, this upgrade, led to the Kubernetes master losing the connection to all of its node groups. So all these nodes were running but not receiving any traffic. Everything was working, but these node groups essentially didn't exist for the Kubernetes master, which took us some time to recover from. But again, fortunately, no event ingestion was affected, no notifications were affected. But that was pretty intense, I would say. Yeah.

Mirko Novakovic: Yeah, I can feel that. I can feel that. The biggest question is: why does this always happen when you're on vacation? Right? I had Ben Sigelman here, also from Lightstep, and he had the same story with his Code RED moment. He was, I think, partying at New Year's, on vacation, and the outage happened. So it's kind of weird that these things always happen when you least need them, right?

Birol Yildiz: 100%. 100%. We actually have a saying in the company: whenever I'm going on vacation, everyone takes twice as much care not to break things, because we actually had more, I would say, issues, not outages, issues while I was away. But for some reason those things tend to happen when I'm not here. So yeah, I guess one solution would be not going on vacation anymore, but I haven't tried that. I guess then I would have other problems.

[00:04:02] Chapter 3: Career Background and ilert's Creation

Mirko Novakovic: Yeah, I agree, I agree. No, that's good. Yeah. I mean, your company is headquartered in Cologne, I know the area where you are sitting, and I'm in Solingen with my company. That's, I don't know, maybe 15, 20 miles away. So it's pretty close, and I'm glad to have you here, and also glad to have somebody nearby who is in the same space. I know you have been working for Rewe before, Rewe Digital. For those who don't know Rewe outside of Germany: it is probably one of the biggest supermarket chains, something like Walmart, mainly in Germany. I think there are some international outlets too, but it's pretty big here. And they have a digital unit, also in Cologne. And you were responsible there for a product, I think, right?

Birol Yildiz: I was the chief product owner for all the big data products. And this was at a time when the mission for Rewe Digital was to come up with new digital business models. One of those was online grocery delivery. So they did that. I think they started doing it ten years ago, and they realized that they need to own the customer experience and they need to heavily invest in software. And I took over a department that was called big data, and we were building big data products for the entire customer journey, if you will. These were things like a recommendation service, or a service that makes sure you have the grocery items in stock at the right time. Which is a different problem than if you sell, for example, electronic items or t-shirts: when you have perishable goods, the entire supply chain becomes a lot more complex. So that's what we were doing.

Mirko Novakovic: Yeah. And I know, because you have also been an Instana customer very early on, one of our earliest Instana customers, that you were also very early on a Kubernetes stack, using containers and microservices. So you were basically also leading on the architecture and new technology side of things. So how did you then figure out that you want to build your own business, and that it should be incident management? Was that based on something you experienced at Rewe, or...?

Birol Yildiz: Partly. So that's a long story, actually, and it has some personal elements as well, but I'll start with the backstory. This was before Rewe. I was studying and working for different companies as a software engineer. In the final stages of my master's studies, I was working for a software company and doing my master's thesis, and I was doing research on so-called context-aware notification systems. These were notification systems that were supposed to be intelligent when it comes to answering the question: given a problem, who is the best person to handle this problem? And one recurring theme that I discovered during my first job, my second job, and then the final job while I was doing my master's thesis was that, back then, the predominant solution used for monitoring was Nagios. One insight I had was: if you have a problem, you cannot rely on email to raise awareness of the problem, right? You need a more effective way to notify someone about a problem. Back then that was an SMS or maybe a phone call. And one realization I had was that it was pretty hard to make Nagios send an SMS. So there were these monitoring tools where you could model everything, you could model dependencies, you had host-service dependencies, and they were able to detect any issues. But when it came to connecting those tools with the actual people that need the information in case of an incident, that was pretty complicated.

[00:08:11] Chapter 4: Defining Incident Management

Birol Yildiz: So you either had to set up your own hardware, your own modem capable of sending SMS, somehow connect it with Nagios, and then have maybe the phone number of one person in that modem. You know, it was a mess. And that was like, okay, forget about this context-aware notification system that's supposed to be intelligent; my first insight was, let's solve that problem first. Let's make it very easy to send an SMS without reliance on any piece of hardware, without requiring any complicated setup. And this was at a time when there were no APIs for everything. Twilio was, I think, just getting started, and they weren't available in Europe yet. This is how the actual idea came to be. And then during my time at Rewe, I was working with different teams, and one of those teams was a platform team. They were responsible for operating services 24/7; they were on call. And this is something I realized again: okay, this still seems to be a problem. Although we were much further along than the situation was when I was doing my master's thesis, it was still not really solved in our case. So that's why I teamed up with my co-founders, Chris and Roman, and we decided to go all in on incident management.

Mirko Novakovic: So what is incident management? I mean, that's essentially the category, right? There's also on-call. Is that part of incident management? Is incident management something different? Let's specify that category. I had JJ from Rootly on the show before, and we also discussed it. But it's interesting to see that there are many vendors in that space, but they have different approaches to the category.

Birol Yildiz: There are a few ways to look at incident management. One thing we tried to do, but didn't succeed at, is to not label ourselves as an incident management solution but instead as an incident response solution. Because when you talk about incident management, people would think, depending on who you talk to: okay, this is probably ServiceNow, this is probably something I would solve with Jira. In our domain, we're talking about real-time incident management. So really the first minutes to hours, all the actions you take when you have an incident that are really time sensitive, where you're looking to stop the bleeding immediately, because you cannot wait. When it's super urgent, you have to really move fast and stop the bleeding. And what we cover is everything during that cycle. The one term that's out there is the incident response life cycle. It starts with preparing for incidents, then responding to incidents, and then communicating incidents, because incidents rarely affect only IT. They affect the entire value chain. So you need to communicate with your customers and with other departments within the company and tell them: hey, we're having an incident and this is impacting you as well, so they can have their own business response. And the last phase would be learning from incidents. So if you look at the broader category of incident management, you have these workflow-oriented incident management solutions like ServiceNow, which are capable of modeling your entire business and setting up custom workflows, and then you have real-time incident management. This is the part that we cover, and we call it incident response platforms, which is why we position ourselves as an all-in-one incident response platform where you can manage incidents end to end.

[00:11:53] Chapter 5: How ilert Functions During an Incident

Mirko Novakovic: Let's go through one of those incidents. So as far as I understand, you are basically doing a lot of integrations, right? We are just working on an integration with Dash0 too. So if, for example, we see a problem in Dash0, we will report it to you, right? So our alert will go to you. And then what happens?

Birol Yildiz: It actually starts a little bit before that, before an incident happens. When you want to prepare for incidents, what you essentially want to do is answer beforehand all the questions you will have when there is an incident. These are questions like: when there's an incident, which team is capable of fixing it? Out of that team, who is the person on call that I can reach out to right now? How can I reach that person? What do I do if that person doesn't respond? These are things you set up before incidents actually happen. And for that you need to connect with the potential sources that tell you there's an incident. One of those sources could be observability and monitoring solutions like Dash0, for example. And it turns out even small companies don't have a single source; they have three, four, five sources. In a big company it's tens, hundreds of potential sources. Our promise to our customers is: whatever our customers are using, we integrate with that solution. If it's a source for critical information, we have an integration for it. These are usually observability and monitoring tools, but also ticketing tools. If you are a managed service provider, maybe you have something like an SLA with your customers.

Birol Yildiz: Whenever they raise a ticket, you want to respond within an hour or so, and that would be a critical source of information for us, so we integrate with that as well. Or you just have a VIP hotline for reporting incidents. Again, when you are a managed service provider, you want to provide your customers with a phone number. And despite all the sophistication in observability and monitoring, there are quite a few significant incidents that are still reported manually by a customer who becomes aware of the incident and then calls whoever needs to be called. We provide that as well; that, for us, is also a source. We provide a product called call routing, where we are capable of routing incoming calls to the right person. So that's the preparation phase of incident response. And then your actual question: there is an alert, what happens then? The Dash0 platform submits a critical event to ilert. Let's say it's 3 a.m. and I'm the on-call engineer. I get the page, I get a notification. Maybe I miss that notification because I'm sleeping very deeply. And then I get a phone call from ilert. ilert calls me and tells me: hey, there is a critical alert, reads out the text, and asks me to acknowledge the alert. The first thing I do is acknowledge it, which prevents my backup from being notified.
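Editor's note: here is a minimal sketch of the page-and-escalate behavior Birol describes, where the on-call engineer is paged first and an unacknowledged alert escalates to the backup after a timeout. All names, timeouts, and the notification function are illustrative assumptions, not the actual ilert API.

```python
import time
from dataclasses import dataclass

@dataclass
class Alert:
    source: str               # e.g. "dash0"
    summary: str
    acknowledged: bool = False

@dataclass
class EscalationPolicy:
    primary: str
    backup: str
    ack_timeout_s: int = 300  # escalate if not acknowledged within 5 minutes

def page(person: str, alert: Alert) -> None:
    # Stand-in for the real notification channels (push, SMS, voice call).
    print(f"Paging {person}: {alert.summary}")

def handle_alert(alert: Alert, policy: EscalationPolicy, wait=time.sleep) -> None:
    page(policy.primary, alert)
    wait(policy.ack_timeout_s)      # a real system would be event-driven, not a blocking sleep
    if not alert.acknowledged:
        page(policy.backup, alert)  # acknowledging in time prevents this escalation

# Demo: the primary never acknowledges, so the backup gets paged as well.
handle_alert(
    Alert(source="dash0", summary="Payment API latency above SLO"),
    EscalationPolicy(primary="alice", backup="bob", ack_timeout_s=1),
)
```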

Birol Yildiz: And then what do you do? The first thing is to triage the alert. Is this an actual problem? Does it have business impact? Is this something I can solve on my own, or do I need additional support? Let's assume this is a major incident. If it's a major incident, what do you do? You maybe add additional responders because you need help from your colleagues. So you log into ilert, you create situational awareness. You look at the alert, you see the information provided by Dash0; there are links to Dash0 where I can just check what's happening. I add my colleagues as responders, and because this has business impact, at the same time I update our status page within ilert, from the alert itself. I say: okay, this has business impact. And this has to happen very quickly, because this is a stressful situation; we don't want to spend too much time on communicating incidents and finding the perfect message. It can be something short like: this API is slow and it affects all our payment services. And then ilert crafts a polished message using an LLM: okay, hey, we're experiencing this outage, you can expect an update in 30 minutes. And then I go on and try to fix the actual problem with my colleague. Usually companies use collaboration tools like Slack or Microsoft Teams for that.

Birol Yildiz: Again, this is orchestrated by ilert. You spin up a dedicated war room channel and you collaborate on the incident. If you look at the alert, one important question people also ask is: what has changed? Because, like I said, one major root cause for incidents is changes introduced by deployments or by configuration changes, like in our case. And we make that visible to you. So we not only integrate with observability and monitoring solutions, but also with build pipelines, CI/CD pipelines. Let's say you've connected your GitHub pipelines with ilert. Then you're able to see: okay, you have this alert from Dash0, and in the meantime, in the last five hours, there were these deployments from your team. Maybe one of those deployments caused this incident, right? So we show you that information, and at some point you maybe decide to roll out a patch or roll back a deployment. Hopefully you fix it, you update the status page, and you're happy again. So that's how it works. After that, you would go back to sleep, for example, or if it's Friday morning, you fix the issue and then the post-incident phase begins, which you use for learning from the incident, maybe creating a postmortem, and agreeing on actions to prevent such an incident from happening again.
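Editor's note: a rough sketch of the "what has changed?" correlation described above, matching an alert against deployments from a connected CI/CD pipeline within a lookback window. The data shapes and field names are assumptions for illustration, not ilert's or GitHub's actual schemas.

```python
from datetime import datetime, timedelta

# Hypothetical deployment events collected from a CI/CD integration.
deployments = [
    {"service": "payment-api", "version": "1.42.0", "at": datetime(2025, 3, 20, 1, 15)},
    {"service": "checkout",    "version": "7.3.1",  "at": datetime(2025, 3, 20, 2, 40)},
    {"service": "payment-api", "version": "1.42.1", "at": datetime(2025, 3, 20, 2, 55)},
]

def recent_changes(alert_time, deployments, lookback_hours=5):
    """Deployments inside the lookback window before the alert, newest first."""
    cutoff = alert_time - timedelta(hours=lookback_hours)
    return sorted(
        (d for d in deployments if cutoff <= d["at"] <= alert_time),
        key=lambda d: d["at"],
        reverse=True,
    )

# Alert fired at 03:00; the 02:55 payment-api deployment is the prime suspect.
for d in recent_changes(datetime(2025, 3, 20, 3, 0), deployments):
    print(d["at"], d["service"], d["version"])
```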

Mirko Novakovic: That makes sense. I have a few questions that came up. First: how do you map the problem? You said you need to understand who's responsible for it, right? How do you map that? If we as Dash0 send you an incident for a service, like, I don't know, the ad service, how do you know who's responsible for that? Is that mapped out in the tool? Is it based on tags or information I have to send to you? How do you do that?

Birol Yildiz: The very basic way is to map it out manually: okay, whenever this alert source creates an incident, it's that specific team that should receive these alerts. But it could be more dynamic as well, where, like you said, in the payload of the Dash0 event there is a field that says which service is affected. That's more dynamic in nature, because it's part of the payload, and we can use that value to dynamically route the incident to the right team based on the service that's affected. That's a more sophisticated approach, and that's one way we do it as well. And usually, especially in larger organizations, every team connects their, I don't want to say their own Dash0 instance, but their own connection to Dash0, where, in Dash0, they would make sure to receive just a specific stream of events, of alerts, that they know are relevant to their team, and they would distribute them accordingly.
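Editor's note: a small sketch of the payload-based routing Birol describes, mapping an affected-service field in the event to an on-call team with a default fallback. Field and team names are hypothetical, not the real Dash0 or ilert payload.

```python
# Static mapping from affected service to the team that should receive the alert.
SERVICE_TO_TEAM = {
    "ad-service": "ads-oncall",
    "checkout": "payments-oncall",
}

def route_event(event: dict, default_team: str = "platform-oncall") -> str:
    """Pick a team based on a dynamic field in the event payload."""
    service = event.get("service")
    return SERVICE_TO_TEAM.get(service, default_team)

assert route_event({"service": "ad-service", "severity": "critical"}) == "ads-oncall"
assert route_event({"service": "unknown-svc"}) == "platform-oncall"
```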

Mirko Novakovic: And then one of the things I was thinking about is: okay, you call me at 3 a.m., I log in to ilert, I acknowledge the incident, I look at it, and I figure out I need six of my colleagues to help me, because it's a major outage. But it's 3 a.m., so I create something like a war room and then I can add the six people. And will you also call them for me, or...?

Birol Yildiz: When you have an incident, the first thing you do is triage. And when you find out you need six colleagues, you add those six colleagues as responders, and in the background ilert will do its magic and call those six people. And say you have created a war room in Slack: it will not only call those six people, it will also invite them to the war room so they know which chat channel to join. And then you collaborate in that war room. In the background we record all the chat history, so later on we can use that information to create a post-mortem document, for example, because the chat history is essentially a goldmine for creating a post-mortem. It contains all the decisions, it contains all the speculation about root causes, and so on and so forth. And we combine that with the machine-generated data, with the alerts, with the timeline, to create an 80% version of a post-mortem document, for example.

[00:20:26] Chapter 6: Integration and User Dynamics

Mirko Novakovic: I was thinking I need that for my family too, right? Sometimes when you don't know where everybody is, right, I use ilert and you call all my kids and yeah, yeah, bring them together.

Birol Yildiz: And I think we would be very good at that, because in ilert it's just a click, but in the background it could be ilert sending your kids an SMS, and if they don't respond within minutes, sending them a push, then giving them a call, and then another call, right? So yeah, you could do that.
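Editor's note: a toy sketch of a per-person notification chain like the one joked about here, trying SMS, then push, then repeated voice calls until the alert is acknowledged. Channels, delays, and function names are assumptions, not ilert's defaults.

```python
# Ordered channel chain: (channel, delay in seconds before this attempt).
NOTIFICATION_CHAIN = [
    ("sms", 0),
    ("push", 120),
    ("voice_call", 300),
    ("voice_call", 600),
]

def notify_until_ack(person, is_acked, send):
    """Walk the chain until the alert is acknowledged or channels are exhausted."""
    for channel, delay_s in NOTIFICATION_CHAIN:
        if is_acked():
            return True
        send(person, channel, delay_s)  # a real system would schedule this after delay_s
    return is_acked()

# Demo: the recipient acknowledges after the second attempt.
attempts = []
notify_until_ack(
    "kid-1",
    is_acked=lambda: len(attempts) >= 2,
    send=lambda person, channel, delay: attempts.append((person, channel, delay)),
)
print(attempts)  # [('kid-1', 'sms', 0), ('kid-1', 'push', 120)]
```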

Mirko Novakovic: So that makes sense. And are your customers mainly larger organizations, or also smaller ones? What's your typical customer for that kind of solution? It sounds like you have to have some sort of maturity in your processes.

Birol Yildiz: We work with large enterprises like Ikea, for example, as well as large MSPs like Bechtle, Datagroup, and NTT Data. But I wouldn't say you have to have a large team in order to benefit from ilert. Even if you're a team of two or three or four people, whenever there is on-call involved, whenever there's more than one person, it helps. Because if it's just one person, if it's always the same person, then having a solution like ilert would probably be overkill. We make it very easy to get started, even if you're a team of five people. For example, we have a free plan as well, where teams of up to five people can use us for free. I would say once you have more than one person and more than one potential source for critical events, you probably need a solution.

Mirko Novakovic: I always have to get to that topic, and at the moment it's the AI topic. You said an LLM could craft a better message for your customers, but how do you see LLMs working in that space? Are you using them today in ilert? And how do you see that going forward?

Birol Yildiz: We make heavy use of LLM-powered AI functionality. One of those use cases I just talked about: creating incident messages. If we didn't provide that in the platform, that person would probably go to ChatGPT anyway and just type in: hey, I want to create a message, can you make it look good for me? We make it very easy: you just click a button, it selects the services that are affected, and it understands the intent. So that's one use case that's very straightforward. But the overlap is also very high in the sense that it's useful, right? Because you want to draft a message, we make that very easy, and you see the result right away, so you discover if there is some sort of hallucination. So that's a clear fit for us. And then there's the post-mortem generation I just talked about. That's an area where we think it's basically not responsible not to use LLMs for drafting the post-mortem document from this huge body of text. So these are two areas, but there are also other areas where we work with LLMs. One is intelligent alert grouping.

Birol Yildiz: So when you have alerts from multiple sources, we use, for example, text embedding models to understand the intent, the semantic meaning of an alert. And for that, we host them on our own; we're not making a call to OpenAI or Anthropic or Bedrock or whatever. We vectorize all the alerts that come into our platform, and then we compare the vectors of alerts that happen within, let's say, a time frame of five minutes, and we're able to correlate them. Because alert fatigue is also a problem: when you're being flooded with alerts, you end up with too many notifications, too many distractions, and this impacts your ability to respond. Any minute that is taken away from you that you could spend on resolving the incident is a minute more in MTTR, right? And that's what you're trying to reduce: you're trying to reduce MTTR. This is an area where AI is also very helpful. So we have not only message generation and post-mortem creation, we also have intelligent alert grouping. There are a few areas where we are already using AI to improve the user experience.
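Editor's note: a simplified sketch of embedding-based alert grouping as described: embed each alert's text, then merge alerts that arrive within a time window and whose vectors exceed a cosine-similarity threshold. The embed() helper is a toy stand-in for a real self-hosted embedding model, and the threshold and window values are illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> dict:
    # Toy bag-of-words vector; a real system would use a proper text embedding model.
    return Counter(text.lower().split())

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_alerts(alerts, window_s=300, threshold=0.7):
    """alerts: list of (timestamp_s, text). Returns lists of grouped indices."""
    groups = []  # each group: representative vector, last timestamp, member indices
    for i, (ts, text) in enumerate(sorted(alerts)):
        vec = embed(text)
        for g in groups:
            if ts - g["last_ts"] <= window_s and cosine(vec, g["repr"]) >= threshold:
                g["members"].append(i)
                g["last_ts"] = ts
                break
        else:
            groups.append({"repr": vec, "last_ts": ts, "members": [i]})
    return [g["members"] for g in groups]

print(group_alerts([
    (0,  "payment api latency high"),
    (60, "payment api latency critical"),
    (90, "disk usage high on db-1"),
]))  # [[0, 1], [2]] -> the two payment alerts are grouped, the disk alert stays separate
```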

[00:24:50] Chapter 7: Leveraging AI in Incident Management

Mirko Novakovic: When you talk about MTTR, I like that metric, but it's actually not that easy to measure. But you probably can, right? Because you know when an incident appears, and you have to close it in your tool when you are done. So you basically know how long it takes until somebody really acknowledges and responds, and you also know how long it takes to repair. Is there any statistic you can share from what you have learned over the years with so many customers? How long does it take for somebody to respond to a problem, and how long does it take on average to resolve it? I can see that it varies, but is there some kind of rule of thumb? I love the DORA report, right? DORA always says, hey, the best teams respond within this time and resolution takes that long, and then there are some that don't have the process in place. So do you have any kind of guidance? What is a good timing here?

Birol Yildiz: We actually have something in the works where we want to use our own data to come up with a report that shows: okay, this is how these core KPIs like MTTA and MTTR look across different industries, what factors contribute to a good MTTA and MTTR, and what factors affect those KPIs negatively. Generally, my response would be: when it comes to MTTA, we had companies, even really large companies, who manage critical infrastructure. One specific example is a large managed service provider who manages the critical infrastructure of an automotive customer. They had an MTTA, so the time between something being reported and a human working on it, of up to two hours before they introduced a pilot, and then they were able to bring it down to less than ten minutes. Now you might think: how can this reach two hours? But depending on the communication channels through which incidents and alerts are reported, and depending on how many people are involved, you can reach two hours really quickly. If the first person misses an alert, you have the next escalation maybe after 30 minutes, and that one is not effective either, and so on and so forth. So to answer your question, I think a very good MTTA is probably in the minutes, like less than five to ten minutes. And MTTR hopefully always less than an hour if it's something that is business critical and has business impact. But what we're working on is a more structured report based on our own first-party data.
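Editor's note: a quick illustration of how MTTA and MTTR fall out of incident timestamps, using hypothetical field names: MTTA averages the created-to-acknowledged interval, MTTR the created-to-resolved interval.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records with the three timestamps that matter here.
incidents = [
    {"created": datetime(2025, 3, 1, 3, 0),
     "acknowledged": datetime(2025, 3, 1, 3, 4),
     "resolved": datetime(2025, 3, 1, 3, 50)},
    {"created": datetime(2025, 3, 2, 14, 0),
     "acknowledged": datetime(2025, 3, 2, 14, 9),
     "resolved": datetime(2025, 3, 2, 15, 0)},
]

def mean_delta(incidents, start_key, end_key) -> timedelta:
    """Average duration between two timestamp fields across all incidents."""
    return timedelta(seconds=mean(
        (i[end_key] - i[start_key]).total_seconds() for i in incidents))

print("MTTA:", mean_delta(incidents, "created", "acknowledged"))  # 0:06:30
print("MTTR:", mean_delta(incidents, "created", "resolved"))      # 0:55:00
# Against the rule of thumb quoted above: MTTA in single-digit minutes,
# MTTR under an hour for business-critical incidents.
```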

Mirko Novakovic: Well, I can already see that, right? I mean, just waking somebody up at 3 a.m. could take some time if you don't have an automated solution, and going down that escalation path and getting from two hours to ten minutes or five minutes, that's a big difference for your business, especially if it's business critical. Yeah, so that makes sense. And that's another question: how do you make sure you can make that call? Because you said you had an outage, and I know that when it comes to communication, it's not always that reliable, right? I think that's very important for you. Are you using multiple providers, or how do you do it? I mean, probably you're using something like Twilio or other APIs.

Birol Yildiz: When you asked me about our Code RED moment, one thing I didn't mention is why it was somewhat shocking for us: because we put in so many measures for reliability. Reliability was basically a feature from day one for us. We had multiple availability zones back then, and now we're deployed across multiple regions. So we had redundancy at the infrastructure level, the database level, the application level. We have redundancy at the provider level: we don't work with one provider for making phone calls or sending SMS, we work with three providers for every notification channel. So that was one reason this particular incident hit really hard: you had so many measures in place, but then it boiled down to a very simple mistake during an upgrade process, despite having all these levels of redundancy. But yeah, to answer your question: we have provider redundancy, we have data center level redundancy, and the application is architected in a way where if one part doesn't work, it doesn't affect the others; there are many parts of the application. So yeah, availability is a core feature of ilert.

[00:29:39] Chapter 8: Building System Resilience and Reliability

Mirko Novakovic: I can see that this is really important in that case, because you want that response quickly. And by the way, I can tell you, as an observability platform we have the same thing: I'm always very worried, and it always happens, right? You can't really avoid it. And the funny part is, if you look at things like the outage with CrowdStrike, or when Datadog had that large, I think almost 24-hour outage, I think there was also a Docker image update involved. As you mentioned, it's always some sort of change, something simple: oh, I just have to update this version of my Kubernetes or this thing, and then something very weird happens, right? That's something you can basically not really prevent. Or, well, yes, you can to a degree: if you have multiple regions, you can make sure that these changes are rolled out only in one region, and once it works, it goes to the second. So probably you also now have very sophisticated rollout strategies, I can imagine.

Birol Yildiz: Yeah, after this incident, for sure. But there is a reason why there's a saying: never touch a running system. You asked how you can prevent outages in general, and one way could be that you don't make any changes. But the problem is, if you don't make any changes, if you don't touch your own system anymore, your customers will stop touching your system too at some point, because you become irrelevant. So what we recommend is: embrace the fact that incidents will happen, prepare for them, accept that they will happen, but try to limit their business impact. And if it hurts, do it more often, right?

Mirko Novakovic: I always remember the famous Site Reliability Engineering book from Google. There is this passage where they describe, I think the service was called Borg, right? A centralized lookup server for services. And that had 99.999-whatever percent availability, but they had downtime of around 15 minutes per year. Because it was a centralized service, that affected all the Google services. And what I found interesting is the way they fixed it: by making sure they shut down the Borg service every day. At first you think, why? But the problem is, and I know this because when I started there were mainframes, if you as a developer think something is always there, you start programming like that, right? You don't handle the failure. But if you know, oh, this Borg service goes away every day, you program differently. Then you start saying, oh, I can't rely anymore on this thing being there. And it's the same with what you said: why are you using three APIs for your messages? Because it could be that one of them doesn't work, which is probably also very rare. But if you program in a way that relies on it being there, it's very different from saying, oh, I can't rely on it, I have to have a fallback. I have to say, try this; if it doesn't work, try again; if it still doesn't work, take the second service. And that's how they fixed the problem with Borg: by shutting it down more frequently, the developers had to totally change the way they were using that service.

Birol Yildiz: You're basically telling other people that depend on the service: don't depend on it being there all the time. And that was the third category of root causes I was trying to explain in the beginning: something that you rely on breaks. You have to architect your systems in a way where you expect failures to happen for your dependencies, and I wouldn't even say that you have to design the system so that you have a fallback. The best approach is probably having an active-active setup. For example, we have three providers. They are not primary, secondary, and tertiary; we use all three, all the time. Because a fallback strategy always assumes that you have this one standby service sitting there. First of all, you're probably not a good customer if that service is only a standby, because they would only make money from you when your primary service is experiencing an issue, which hopefully is very rare. And secondly, it's also better from an engineering perspective, because a fallback scenario is very intrusive: from one moment to the next, you're redirecting all the traffic to a service that was sitting there not receiving any traffic, and then all of a sudden it takes all the traffic, right? That rarely goes smoothly. It's similar when you have database replication across multiple regions and you have a failover; most of the time that failover is not as smooth as you would expect. So we think it's probably better to always distribute your traffic across multiple APIs, multiple regions, and have an active-active setup instead of an active-passive setup.
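Editor's note: a minimal sketch of the active-active idea, spreading notification traffic round-robin across all providers all the time rather than keeping a cold standby. Provider names and the send interface are placeholders, not the vendors or API ilert actually uses.

```python
import itertools

class ActiveActiveSender:
    def __init__(self, providers):
        self._cycle = itertools.cycle(providers)  # every provider stays warm

    def send_sms(self, phone: str, text: str) -> str:
        provider = next(self._cycle)
        try:
            return provider.send(phone, text)
        except Exception:
            # A failed provider just means the next one in the cycle takes over;
            # there is no "failover event", since all providers already carry traffic.
            return next(self._cycle).send(phone, text)

class FakeProvider:
    """Stand-in for a real SMS/voice provider client."""
    def __init__(self, name):
        self.name = name
    def send(self, phone, text):
        return f"{self.name} delivered to {phone}"

sender = ActiveActiveSender(
    [FakeProvider("sms-a"), FakeProvider("sms-b"), FakeProvider("sms-c")]
)
for _ in range(3):
    print(sender.send_sms("+4917000000", "Critical alert: payment API down"))
```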

[00:34:57] Chapter 9: The Evolution of DevOps

Mirko Novakovic: I have one last question, because I saw that you say you are also a DevOps tool, right? And I would like to understand how you think about that term today: DevOps versus developers versus SRE versus the new hype around platform engineering. How do you see those terms, and how would you position ilert in the context of developers, platform engineering, DevOps, SRE?

Birol Yildiz: I'm probably not good at defining those terms, DevOps, SRE, and platform engineering, either, and by the way, they somehow overlap. But the way I look at it, and probably this applies to you as well: I have been in IT long enough that there was a time when you would build a piece of software for six months or a year, then hand it over to someone else, to another department called the operations department, and they would have steps to execute to deploy that software. Literally: open a terminal, type in this command, do this, do this. That's the exact opposite of DevOps, right? That's where you have operational responsibility completely separated from the build responsibility. When you have a DevOps mode of operation, first of all, you see the task of operations as a software task. So you try to code as much as possible and automate as much as possible. And you get to experience firsthand when your software is not a good citizen in the world of operations, because it's you who gets the pages.

Birol Yildiz: And that's probably one of the most effective ways of making your software more reliable: experiencing it firsthand when it's not working. It's a little bit similar, like you said, to the example with Borg, with the Borg team designing that service in a way where other teams have to expect it to be unavailable once a day, so they can feel it and prepare for it. That's similar with DevOps. And now, to answer the question of how we fit into that: why are we a DevOps tool? Because, in a way, we adopt many of these practices. For example, we have a Terraform provider where you can codify all your incident-response-related setup in Terraform and apply the same automation and version control techniques that you would use in a DevOps environment. And of course, when you have operational responsibility in a DevOps team, you need a solution that brings immediate human attention to problems that need immediate human attention. And that's where we come in.

Mirko Novakovic: It's a good answer.

[00:38:00] Chapter 10: Conclusion

Mirko Novakovic: Thanks. It was nice having you here. It was fun talking to you.

Birol Yildiz: Absolutely. Thank you. Mirko. It was a pleasure talking to you.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
