
Episode 39 · 42 mins · 3/5/2026

#39 - Beyond On-Call: How incident.io Built Multiplayer Incident Response with Stephen Whitworth

Host: Mirko Novakovic
Guest: Stephen Whitworth

About this Episode

incident.io co-founder and CEO Stephen Whitworth joins Dash0’s Mirko Novakovic to explain why paging someone is only the start of an incident, not a holistic solution. They break down how incident.io supports the full incident lifecycle (coordination, comms, timelines, and follow-ups), why incident response is a “multiplayer game” across engineers, support, and leadership, and how AI is starting to reshape triage by pulling context from telemetry, past incidents, and customer signals.

The episode closes with a practical look at what it will take to safely move from AI-assisted response to AI-driven auto-fixes that minimize the rollout ‘blast radius.’

Transcription

[00:00:00] Chapter 1: Introduction and Context for Code RED

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0, and welcome to Code RED: Code because we are talking about code, and RED stands for Requests, Errors and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Stephen Whitworth. Stephen is the co-founder and CEO of incident.io, an incident management platform used by engineering teams at companies like Netflix, Airbnb and OpenAI, and, by the way, also at Dash0. Before incident.io, he led engineering teams at Monzo and co-founded the fraud detection startup Ravelin. Excited to have you here. Welcome to Code RED.

Stephen Whitworth: Yeah. Pleasure to be here, Mirko. Thanks for having me.

[00:00:53] Chapter 2: Stephen’s Biggest Code RED Moment

Mirko Novakovic: Yeah. This will be an interesting conversation in the context of AI and everything that's happening. But as always, my first question is: what was your biggest Code RED moment in your career?

Stephen Whitworth: Yes, I remember it like it was yesterday. So I guess my background was that I was self-taught at coding. I kind of started my career off as a data analyst and then worked my way into data science. What that meant was I was building these models of, like, how we would predict a taxi fare or something, and I was doing it in R and Stata and Matlab and all this stuff. And I wanted to get that into the product somehow. So I taught myself enough to be able to load some of these weights into a model in a service in the system, and I had someone helping me do all of that. So I was super excited, you know, about to make my first production deploy ever. And it turns out I ended up causing this massive outage off the back of it. It was this weird interaction with the queuing system we used, which meant that every time the service booted up, it would try and create a new subscriber to this topic, then immediately crash and create another, different one. So I basically overloaded our messaging cluster in like ten seconds, which caused a global outage. I was obviously mortified, because this was literally the first production deployment I'd ever made in my life, and I genuinely thought I was going to get fired because I had no idea about blameless postmortems and that side of the world. So I was like, right, I've got a good two weeks to pack up my stuff. But ultimately, it was my first introduction to incidents, and it shaped the idea that, oh, you can actually use these as a chance to protect against stuff going wrong and improve. Which was, I guess, a stressful moment, but a good learning moment nonetheless.
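The crash loop Stephen describes is a classic pattern: a service that registers a fresh subscriber on every boot and then crashes will multiply subscriptions until the broker gives out. Here is a minimal Python sketch of that failure mode; all names, the topic, and the broker limit are hypothetical, not details from the actual incident.

```python
import itertools

class MessagingCluster:
    """Toy stand-in for a messaging cluster with a finite subscriber budget."""
    MAX_SUBSCRIBERS = 1000  # assumed broker capacity

    def __init__(self):
        self.subscribers = []

    def subscribe(self, topic, subscriber_id):
        if len(self.subscribers) >= self.MAX_SUBSCRIBERS:
            raise RuntimeError("broker overloaded: too many subscribers")
        self.subscribers.append((topic, subscriber_id))

def boot_service(cluster, boot_id):
    # Bug: every boot registers a brand-new subscriber instead of
    # reusing one stable subscription, then the service crashes.
    cluster.subscribe("model-weights", f"svc-{boot_id}")
    raise ValueError("bad model weights")  # the crash that triggers a restart

cluster = MessagingCluster()
boots = 0
for boot_id in itertools.count():
    try:
        boot_service(cluster, boot_id)
    except ValueError:
        boots += 1   # supervisor restarts the service, loop repeats
    except RuntimeError:
        break        # cluster falls over: the global outage
print(boots)  # 1000 crashed boots before the broker gives out
```

The fix is equally classic: use a stable, idempotent subscriber identity so restarts reattach to the same subscription instead of minting a new one each time.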

[00:03:01] Chapter 3: Origins of incident.io at Monzo

Mirko Novakovic: Yeah, I can see that. But everyone who has been in that situation knows how it feels, right?

Stephen Whitworth: Yeah. And they start sweating across here.

Mirko Novakovic: Yeah, exactly. And then, I guess, you came up with the idea for incident.io while being at Monzo.

Stephen Whitworth: Yes, exactly. So we had internal tooling at Monzo, which was built by my co-founder, Chris, and later open sourced. It was kind of the conflation of: hey, I've been woken up to deal with these issues throughout my career, and the best we have in the market is something that wakes you up, but then there's no software after that point. Chris had solved that problem internally for Monzo, they'd open sourced it, and there were lots of other companies using that version of things. So it was like, oh, right, this is such a big problem, the status quo kind of sucks, people seem to love whatever Chris has built, so let's go and, I guess, build the company that stands behind that. That was us about four and a half years ago.

[00:04:01] Chapter 4: On-Call vs. Incident Response—Closing the Workflow Gap

Mirko Novakovic: Yeah. Nice. And pretty successful so far, so congrats. Let's go a little bit deeper into that. What's the difference between an on-call system and an incident response system? Because I think most people know the systems from the past, like PagerDuty or Opsgenie, which were, as you said, more or less systems where you can configure who to wake up when. And then you get a call, right? But then you are kind of stuck. So what's the difference between those systems and incident.io, and which part of the process are you taking care of?

Stephen Whitworth: Most companies in the world that are running software in production have some kind of on-call tooling which integrates backwards into telemetry systems. Essentially there'll be alerts running that check, say, is the CPU level of this pod okay, and if it goes above a certain level, fire off an alert, hit the on-call provider that contains some scheduling of who's on call for what, and then you get woken up. The problem is, that's only one part of the entire workflow that kicks off from that point. Yes, you need to be woken up to be told there's a problem, but then there's all the stuff you have to do after that. For example: what is broken? How do you know what's broken? That usually looks like engineers diving into Datadog, deployment systems, many of those things. What else does this affect? That's you tracing upstream and downstream dependents, trying to group different alerts together. Which customers are affected? You may have one customer affected, or all of your customers, and everywhere in between. How do you communicate with them? There are many, many other questions that follow on after you know something's wrong. And I like to think of us as helping take you through the entire lifecycle of that.
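The "wake-up" slice Stephen describes, where a threshold alert fires, a schedule resolves who is on call, and a page goes out, can be sketched in a few lines; everything after the page is the lifecycle gap he goes on to describe. All names, services, and thresholds below are hypothetical, not any vendor's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    # e.g. "fire if pod CPU goes above 90%"; the threshold is an assumption
    metric: str
    threshold: float

@dataclass
class OnCallSchedule:
    # who is on call for which service; names are made up
    rota: dict

    def current_responder(self, service):
        return self.rota.get(service, "fallback-engineer")

@dataclass
class Pager:
    pages: list = field(default_factory=list)

    def page(self, who, message):
        self.pages.append((who, message))

def evaluate(rule, service, value, schedule, pager):
    """The single step classic on-call tools cover: threshold -> page.
    Triage, comms, and timeline come after the page and are not modeled here."""
    if value > rule.threshold:
        responder = schedule.current_responder(service)
        pager.page(responder, f"{rule.metric} on {service} at {value:.0%}")

rule = AlertRule(metric="cpu", threshold=0.90)
schedule = OnCallSchedule(rota={"payments": "mirko"})
pager = Pager()
evaluate(rule, "payments", 0.97, schedule, pager)
print(pager.pages)  # [('mirko', 'cpu on payments at 97%')]
```

The point of the sketch is how little of the incident it covers: the page is one function call, while the questions that follow (what broke, who is affected, who needs to know) are all outside this loop.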

[00:06:02] Chapter 5: Lifecycle Orchestration, Communications, and Postmortems

Stephen Whitworth: So to me, that is: what's broken? Who needs to know about it? How might you fix it? And we can talk about some of the AI stuff a little later. And then ultimately, how do you stop it happening again? What that looks like in practice is we build software that integrates into the places where people are already solving these sorts of issues, so Slack and Teams, and we're also in places like Zoom and Google Meet. We help run the lifecycle of your incident. You often have a process that you want to run, which might be: hey, if one of our top ten customers is affected, please page Mirko, something like that. The state of the art for that right now is usually a Notion document and some hope. We build the software that makes sure that always happens. We then have products that do customer communications, so folks like OpenAI, Square, etc. use our status page to communicate with their customers when things go wrong. We then help at the end of the issue. Once you've put the fire out, you'd quite like to stop it happening again. So we collate all of the different information and the timeline from what we've been helping you solve: transcripts from Zoom calls, deploys from GitHub, emails that you sent out and were received by customers.

Stephen Whitworth: And we put that together to avoid a staff engineer spending two days writing the thing up. Then we connect that back into where work actually gets done in your organization, so Linear, GitHub, Jira, wherever you want to do it, with the idea being that we can help you close the loop between a bad thing happening, responding faster, and stopping it happening again. This is, I guess, kind of a newish thing, in that there have really only been folks like us and our competitors doing this for about the last four or so years. To me, it felt like the obvious continuation of: well, yes, you need to be woken up, but what about actually fixing the problem? So to me, they're not really going to be viewed as very different products or categories; it's all part of the same thing. Part of our business is doing on-call, which is what folks usually already have, and a lot of it is also going in and replacing manual process, or some home-built internal tool that does a little bit of what we do but is being maintained by someone that's stressed out, and all the rest.

[00:08:35] Chapter 6: Multiplayer Incidents and Executive Visibility

Mirko Novakovic: Yeah. It's interesting that that process was never really covered, right? Because it's such an important process. I think when we spoke last time, you told me that you kind of solved the multiplayer game, compared to a lot of the other tools which are more single player, right? Because when an incident happens, and I know this from my time as an engineer, normally a lot of people get involved, because it's not a single person who can fix it. You have to inform people. So it's really multiplayer, and you do that too, and not only in Slack; you have an overview of what's happening at which point in time. I saw you have a board where you can see which stage of the process the incident is in. So it's really about bringing a team together and letting them work on the incident together.

Stephen Whitworth: Yeah, 100%. I think a really nice characterization of that is: the CTO or the CEO turns up on the Zoom call, and immediately the atmosphere changes. The reason that person is turning up on that call is usually not because they're fixing the issue; they just want to know that the thing is being handled reasonably well. In the prior world, you'd jump on the call, distract a lot of people, maybe slow down the response. We've built Scribe, which sits in your Google Meet and Zoom calls, does transcription of the call, but then also uses AI to pull out key moments, what's being worked on, and who's working on what. In reality, that's actually what the CTO wants. They just want to know: oh, Mirko is on it, okay, cool, we're doing this thing, someone's taking care of customer comms. To me, it's a really nice example of how there are many different groups of people, from engineering leaders to ICs to customer support folks to legal, that need to be involved, especially in more complicated incidents.

Stephen Whitworth: And it's a mixture of like, they're sort of I'm just observing because, for example, if I'm a customer support agent, I don't need to drive that incident, but I kind of want to know how to respond to people when they're asking if I'm a CTO. I want to know that we're taking care of it, but I'm not going to fix it. So there's all these sorts of different groups of people, and they have slightly different intentions. And we're trying to build a software that speaks to all of them, not just this like single player, super nerdy SRE experience. Like that's cool. There's lots of stuff out there. But ultimately, you know, our observation from, from all meeting at a bank in the UK is like, yeah, when, when software and banks go wrong, you know, it tends to affect quite a lot of other places apart from the engineering organization. So you need software that sort of handles that and can speak to it as well. So yeah, it's a bit of a bit of a diverse group.

[00:11:32] Chapter 7: How Observability and Incident Response Interlock

Mirko Novakovic: Yeah, absolutely. And I mean, I'm an observability vendor, so let's talk a little bit about how those tools play together, right? Because when you are in that incident, you need a lot of information. You just mentioned it: it could be that the CPU load of a pod is too high, it could be that some code was deployed and it's not working anymore. So do you have integrations into those tools, and do you pull data for the incident out of them? How does it work?

Stephen Whitworth: I think both kinds of tools have the ultimate shared goal of software reliability, and you can use them together to get more out of each. On the incident spectrum, I would say observability tools give you signals on what might be broken, and then the incident side of the world tells you who, what, and when, and gives you broader organizational context. Right now they feel a little disconnected, in that engineers are often manually stitching these things together, and outside of tools like us, the incident response side is just manual process and people gluing things together in Slack. In terms of how we integrate: we have products like AI SRE that use telemetry as a source of information. What AI SRE does is, in the first 15 or 20 minutes of an incident, aim to pull together and triage across many different data sources: metrics in your telemetry provider, past incidents that we have access to, customer support tickets and inbound. So we view telemetry as one of many things you might need to look at during an incident, but certainly an important one. And we don't want to be an observability company. We have no ambition to go do that; it's not a strength of the company, and ultimately we steer clear of it. What that means is that we integrate with these observability and telemetry companies a bit more like how a human would interact with them, which is exploration versus raw ingestion of data.

[00:13:56] Chapter 8: Integration Approaches and Rapidly Evolving AI Interfaces

Stephen Whitworth: So ultimately, that's how we choose to integrate: understand how humans use these systems by analyzing query logs and dashboards, and essentially try to understand which things we should trust. Then, in incidents, we aim to browse these systems so that, if we can't find the exact needle in the haystack, we can at least find which part of the haystack people should be looking in. Over time, I would love to make that a bit more bidirectional. For example, we have code analysis abilities, and we have post-mortems, so we can tell what improvements people want to make. A large portion of the time it's: oh, we actually didn't have the observability, or the tracking, to catch this particular thing. And we can talk about how agents work together; I'm not sure anyone really knows at this point. But I would love to be in a world where we can help make observability platforms more useful based on the gaps people have, or the risks they saw. I see us starting at the point of "there might be something wrong" and taking you through to resolving that issue, but there are many, many uses for telemetry and observability outside of that. And I want to be good friends with the people that do that, versus treading on each other's toes too much, I guess.

Mirko Novakovic: Yeah. And how do you integrate with those tools? Using an MCP server, APIs? Or when you say looking at the dashboard, are you actually logging into the tool with an agent and looking at a concrete dashboard? How does it work?

Stephen Whitworth: Yes, MCP right now. And, for example, it's very helpful to also pull visual aspects of things, because people are very used to looking at graphs, so we'll also do the other approach you mentioned. A lot of this stuff is moving quite fast. For example, there's a world where I imagine some of the incumbent telemetry providers would start to charge for this, so there are also separate ways you could choose to integrate, and agreements we could have with people. It's all moving very quickly, so we're building with what we have at the moment. If I just pick an internal example: as we deploy AI internally inside our company, we're building on top of Claude's Cowork plugins, and I think they released that six days ago. So we're not drawing on years' worth of experience of how to do it; I think there's a lot of just figuring this out as we go along.

[00:16:42] Chapter 9: Will AI Platforms Become the Workspace?

Mirko Novakovic: We did exactly the same. For our internal AI agent platform, we also went with the Cowork platform, and it's amazing, right? It's also pretty fast the way they do it. It is really cool. But yeah, I agree, it's moving very fast, and nobody really knows where it's going; it's just improving very quickly. That's something that's really interesting. If you look back, I don't know when the last Claude model was released, in November or December, that changed a lot. And since then it feels like a year has passed, but it's only six weeks, right? So many things have changed. That's challenging, but it's also a really interesting and fun time to be working in.

Stephen Whitworth: 100%. I think the world has never been more uncertain. I was talking about it with my co-founder. You're seeing companies like Anthropic start to, I guess, come up the stack and go and do direct applications. It was quite obvious that they would do things like coding: that feels like quite a closed-loop system, there's a lot of money being spent on it, and you can train and validate whether the thing you generated is actually correct in some mechanic. But they're now moving up to do things like legal, where there are competitors like Harvey and Agora. And if you go look at the release notes for Opus 4.6, there's a benchmark on OpenRCA, which is a benchmark used for root cause analysis. So there's a world where maybe Anthropic and the LLMs come into the observability space in some mechanic, and that would, I think, change quite a lot. And there's a worldview where, okay, maybe we just all go live in a hut in the forest somewhere and there is no software anymore, or anything like that. I don't think that's going to happen. It's actually good for the industry, I think, for these models to ingest more around things like telemetry versus just understanding, say, code, because code is ultimately just run in production, and telemetry is the stuff you use to maintain it. But I basically don't know what it's going to look like a year from now: whether they'll have moved into it, whether the models will be much smarter at it, and what that means for us. But that means it's fun, and there are many different outcomes we could go shape and achieve, versus "oh well, we've been building on cloud for the last eight years, and sure." It makes it a good time to be alive.

Mirko Novakovic: Absolutely. But I'm really thinking about the same thing, right? What will be taken over by these, I call them platforms, AI platforms like Claude? And will the user still be there, also for you or for me? Will you troubleshoot your incident in Claude, or will you still do it in incident.io? Will you look at dashboards inside of Dash0? I mean, there are now these application integrations; MCP provides UI elements, right? So you can even do things and send charts etc. back to Claude. It's not our environment, but, for example, we use HubSpot internally for CRM, and I connected HubSpot to Claude. It's actually so good. You can ask questions like: okay, give me a deal analysis of the past two weeks. Then it gets all the data, and the interesting part, like how agents work together, is that it then calls a coding agent to build a dashboard for the result, and you get a really nice dashboard where you can drill into all the deals by AE and everything. It's kind of amazing if you see the result and compare it to what you can actually do inside the tool, which is not such a nice dashboarding experience. So I'm using it in Claude now. That's a change I'm already seeing for myself.

Stephen Whitworth: I think we have the external application of this, which is how we want to deliver things to our customers, but then we have the internal, which is how we want to run the company. Let me pick two products that are quite different right now: Salesforce and Linear. Linear has quickly adapted to this shift, has built agents in as a first-class citizen of the platform, and it's essentially becoming the place where we build and ship product. It's a multiplayer experience that allows people to collaborate on what is inherently a multiplayer coordination problem. I just don't see them going anywhere. I don't want a single player experience inside Claude for how we develop software and ship it across teams; I want to see the delegation, like Cursor is working on this thing and Lawrence is working on this other thing. So I think if companies move quickly and solve this coordination problem, then Claude is not going to become a wrapper, I guess, for your interaction model. If I think about Salesforce: even before Claude, people were using tools like Scratchpad to avoid having to input deals into Salesforce directly. To me, that feels like there's some fragility there, which is that you're going to increasingly be used as a database, and I think it may also mean that more and more interaction happens outside of you. I basically can't tell right now, but I think more single player experiences may just move to happening in Claude and tools like that, and more multiplayer stuff will be harder and requires applications to go solve it.

[00:22:27] Chapter 10: Designing for Flow: Be Where Users Are

Stephen Whitworth: But it also might be a function of how quickly companies are adapting. So yeah, I think it's a tricky one from that perspective. And just from our product view of the world, we were honestly never too precious about you using our dashboard or anything like that. Our first product was this cute Slack bot that you deployed into your Slack workspace; you typed /incident, it created a Slack channel, and there was a nice web dashboard that had some stuff, but it wasn't that useful. Over time we've made it more and more useful, but the fundamental product principle we've operated with is just: be where the people are organically solving the problem. Which has meant we build bots that sit inside your Zoom calls, and now, with AI SRE, we sit inside Claude, Claude Code specifically. We don't actually really care, as long as what we're helping drive is that workflow from symptom to resolution to improvement. We know there are like 15 different tools used throughout that process, and we kind of don't care about being the one dashboard you have to log into. We just want to be able to see it, understand it, and then ultimately drive through that workflow. So yeah, I think it probably makes it a little bit easier for us to adapt to the changing times, I guess.

Mirko Novakovic: Yeah. But that also means you are very focused on user experience, right? I think that's the big difference. I don't want to blame Salesforce or HubSpot, but whenever I use those tools, it doesn't feel really nice. As you said, putting in the deal, etc., feels a little bit clunky; the user experience of creating dashboards doesn't feel so nice. Whereas, for example, Linear is a really nice user experience. If you are inside the product, you like to use it. It's easy to use, it's modern, it's fun. And the same applies, as you described, when you are working on an incident: user experience also means being where the user is, and they are in Slack, they use Slack for it, so why not naturally integrate into where users already are? I think that's a big change right now: you really have to care about user experience, understand where users are and how they use products, and make it a fun experience. I don't think that in the future, especially with AI, you will be able to compete if your user experience sucks. I think that's just not working anymore.

[00:25:07] Chapter 11: UX as a Reliability Imperative

Stephen Whitworth: Yeah, 100% agree. The way we've always thought about it is that we solve a problem where seconds and minutes matter. So if we make you click through 15 different forms, or the latency is two seconds to save anything, then for one of our customers that is processing massive amounts of payments, if that isn't working, we're literally talking hundreds of dollars they might have lost in a single backend call. So make it fast, make it enjoyable. And I'm not going to pretend that people love incidents; it's stressful, all of that side of the world, but we would at least like to try and make it fun and delightful, otherwise it would just be a miserable time for everyone. So we do as much as we can to add these little pieces of delight into the product. As software gets cheaper to build, I think there will still be massive returns to taste and intuition and design, which weren't really there before.

[00:26:09] Chapter 12: Why AI SRE Belongs in Incident Response

Mirko Novakovic: Yeah, I agree. And let's talk a little bit about the AI SRE agents. You mentioned it's part of your product, and I looked at it. You have some cool features, as you said, like Scribe doing the transcripts and feeding them in. But my feeling is that at the moment there are, I don't know, at least a hundred AI agents that spun up in the last six months. And I always say we also have one. The moat is also not so big, right? Most people have a wrapper around Claude and some agent platform, and then you feed it with some data. We can talk about that; that's probably the hardest part, giving the agent the right data, the right context, everything. But overall, it looks like there are hundreds of AI agents, and I'm asking myself how that market will evolve. Will it be part of the incident response category? I mean, there was this category of AIOps, which was a little bit similar, right? It was about looking at a lot of different incidents, grouping them together, and trying to figure out what's happening. So I think it makes sense that these AIOps, incident management, and SRE tools could be in your category. It could be a separate category. It could be part of the observability tool. Or, as you said, it could be that Claude and Anthropic are moving into that space and it becomes part of those platforms. It could be that they will be everywhere, right? Who knows? So what is your perspective? I mean, you moved into that space probably because you saw AI and it makes sense: I'm looking at an incident, somebody wakes me up at 4 a.m., and I already get a summary from my AI tool on whether it's important or not and where to look. And then it helps me navigate and fix things. That makes sense, so I can see why it's in the incident response category.

Stephen Whitworth: Yeah. I think the first few products we released were very focused on helping you run the process of an incident really well: communication, coordination, all of that side of things. And that was useful, and it was a net positive on the status quo, because the existing providers had only really solved on-call, so we went and did the rest. However, that leaves a big elephant in the room, which is: you're giving me a container to solve problems efficiently in, but you're not actually helping me solve this specific problem and why it happened. And that was really where we saw the rest of the opportunity: given that we have built this place where people congregate and solve problems, it feels like we have a strong right to put something there that does a good job of actually helping you solve it. Specifically, for us, that looks like helping people through the alert and incident workflow, from understanding what caused it, to what they should do next, and in some circumstances drafting fixes, pull requests, things like that. But also much broader agentic capabilities: you can search throughout your entire incident history to ask, hey, have we ever seen something like this before? Or, for issues like this, who do we usually page? Things that are a bit more widely connected.

Stephen Whitworth: I think there are two views of the world. One is that observability companies end up doing this really, really well. If I steelman the Datadog argument, or something like that, it would be: actually, incidents are primarily a technical thing, and as a result the people with the richest telemetry data, and the ability to build custom models over that telemetry data above and beyond what is publicly available through LLMs, are going to have some big benefit in accuracy and the ability to do this. And people go: hey, I already pay these providers lots of money, what's another $100,000, or whatever you choose to spend on it? And then they return to them, and you can see this with those companies also building on-call tools and incident response tools and all of that side of the world. So that's one view of the world. The second is that we just remain an on-call and process-related thing, and it turns out that AI ends up wrapping us in some regard. You're seeing some of these companies move towards AI for production, where you can just interact with prod through these tools, which to me feels a little bit more like a specialized Claude. And that, I think, presupposes that you don't want Claude to do everything and you'd be happy to have different versions of Claude.

[00:31:01] Chapter 13: Switzerland Strategy and Platform Composability

Stephen Whitworth: Sorry, different, more specialized versions of these models. I'm not seeing that that's how people work right now; in reality, for single-player use, I think people are gravitating towards models that have a lot more context. And ultimately the more potent application of AI for production, I still think, is in driving reliability and incident response and alerting. So maybe a third view of the world would be that the people who shift towards, and acknowledge, software reliability being the end game for this are the ones that will reap the rewards of it. And I think that's the shift that we will make. We don't just want to do on-call. We don't want to be a workflow automation tool. We view it as necessary and important, but ultimately, if I think about what drives reliability, it is detecting issues when you have them, responding as fast as possible when you do, and preventing them so they stop happening in future. And I think there's just a much wider suite of products that you can go build that threads all of those problems together. Especially as we go into enterprise customers, this is such a fragmented group of tools: people may have Datadog and Splunk and Grafana and all of these different things. And in reality, a lot of these products don't have that much of a strong incentive to build great integrations with each of them.

Stephen Whitworth: Whereas we are sort of a bit more like Switzerland in this regard, like we want to be friends with them. We want to make everything work together. So I guess I would view us in that world as like an investment that helps you get a lot more out of the many other tools that you are also investing in. And that might mean that we end up taking some dollars from the amount of metrics that you might plug into telemetry systems, or it means you can get rid of a bunch of users in ServiceNow. Who knows? I think, yeah, maybe a last comment is then around like the AI ops and on call and all of this stuff. Like our goal is basically to collapse all of this down into just like, you know, a reliability platform. Like there should not be a separate tool that you need for aggregating alerts and de-noising them. And then a separate one to page people and then a separate one to go wake people up. Like that is all one long workflow, which I think operates and gets better off the same data. So you should go and just sort of build towards that vision, at least in my opinion. That said, I'm super open to the fact that like, hey, it might be that like your AI SRE is way better than ours for some reason. And like that sucks, but you know, that's life.
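The "aggregate alerts, de-noise them, then page" workflow he wants to collapse into one platform can be sketched minimally like this. It is a hypothetical illustration, not incident.io's actual pipeline; the fingerprinting scheme and the five-minute window are assumptions standing in for whatever your alerting system provides.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    fingerprint: str  # e.g. a hash of (service, alert name)
    message: str
    ts: float         # seconds since some epoch

class Denoiser:
    """Folds duplicate alerts within a window so only one page goes out."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._last_paged: dict[str, float] = {}
        self.pages: list[Alert] = []       # what actually reaches on-call
        self.suppressed: list[Alert] = []  # duplicates folded into a group

    def ingest(self, alert: Alert) -> bool:
        last = self._last_paged.get(alert.fingerprint)
        if last is not None and alert.ts - last < self.window_s:
            self.suppressed.append(alert)
            return False  # de-noised: no new page
        self._last_paged[alert.fingerprint] = alert.ts
        self.pages.append(alert)
        return True       # escalate: page on-call

d = Denoiser(window_s=300)
d.ingest(Alert("checkout-5xx", "error rate high", ts=0))
d.ingest(Alert("checkout-5xx", "error rate high", ts=60))   # duplicate
d.ingest(Alert("db-latency", "p99 over SLO", ts=90))        # distinct issue
d.ingest(Alert("checkout-5xx", "error rate high", ts=400))  # window expired
```

The point of keeping aggregation, de-noising, and paging in one loop is exactly his argument: every stage reads and improves off the same data, instead of three tools each seeing a fragment of it.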

Stephen Whitworth: Like, I want to make sure that, you know, people can integrate those systems and operate them effectively on our platform. Think about Linear. Linear could say, hey, we're building a coding agent and we don't want Cursor and we don't want any of these other people interacting, because this is how product gets developed on Linear, and this is our space. But no: actually, you can be the container and the platform on which this work happens. There's going to be 10 to 100x more software, and Linear is making a really smart bet by making it easy for you to bring in that 10 to 100x multiplier and do it on their platform. And I think that's how we approach things as well: we want this to be a bit more composable. We want people to be able to bring agents to us, and if they're helping people detect problems, fix them, or stop them happening again, then there should be natural integration points for them. And then obviously there are agents that we need to speak to on the Dash0 side and all of that side of the world. And we're still honestly just figuring out how all of that works right now. But yeah, a few different ways the world can end up; obviously I wouldn't be doing this company if I didn't think mine was the best. So we're trying our best.

[00:35:14] Chapter 14: Autoremediation—When Can AI Close the Loop?

Mirko Novakovic: Absolutely. So my last question would be a prediction. When do you think will be the time that an incident opens and you will automatically fix it? So do the full process automatically with AI.

Stephen Whitworth: Good question. I'd say we can do that right now, but dangerously, in that, you know, we can draft the fix and we are one sort of auto-merge away from you doing that. Or we could just push to master. I think that in order to do it safely, a bunch of stuff is going to have to change. For example, actually typing the characters on the keyboard used to be the bottleneck in the software development lifecycle. It was: what should we build? And then once we know what we're building, it just takes time to build it. That isn't true anymore, and it became untrue in a very short period of time, which means that now you're seeing, oh, well, what's next after that? Oh, it's code review. And now you're seeing companies just abandon code review. They're like, oh, well, you can just push straight to master. You can request a review if you'd like, but we don't need it to happen anymore. You're then also seeing tools that assist in code review. So we use a tool called Depth-first, which we really like a lot; this is for, sort of, security analysis. And there'll be many others like that that say, hey, there's a lot more software, but we're going to do a pass on which changes you might need to care about.

Stephen Whitworth: So I think potentially the low-risk stuff can just go in straight away, and stuff that gets flagged has to get human review. And then I think the next bottleneck after that will be that it really doesn't feel like the way we've actually deployed software has changed that much. Like, yeah, sure, now we use Kubernetes and things like that. But if you're talking about how the Snowflakes and the Metas of the world run, they have cell-based architectures with very progressive deployments, where there are feedback systems on rollouts. And that's just really hard for companies like mine: to go and build the mechanics required to do that. If more companies had more things that looked like that, then I would definitely feel a lot safer rolling out changes, because you know that the blast radius is limited. So yes, I think ultimately we're going to do our best job trying to provide as accurate predictions as possible, but there will be an upper limit to how accurate those can be.

Stephen Whitworth: And I think the next most useful thing is for us to be able to make the blast radius smaller for when we get that wrong, because for most people it's either switch on a feature flag or turn it on for everyone, and it feels like you want some kind of staged testing: I rolled it out in prod, I didn't see any errors, roll out further. So that was a long way of not giving you an answer. The short answer would be: we can do it now, but it would stress me out. The longer answer is that in order to not be stressed out, I think a bunch of new things need to be built. And, you know, it'd be nice if those hundred AI SRE startups could go focus on that instead. But hey, this is where we are; there'll be enough startups, I think, focused on all parts of this to go around. But I'm interested: what do you think? You must see, I guess, the other side of things.
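The "roll out, watch errors, roll out further" loop he describes can be sketched like this. It is a hypothetical illustration under stated assumptions: `error_rate_at`, `apply_stage`, and `rollback` are stand-ins for whatever your deploy system actually exposes, and the stage fractions and error budget are made up.

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of traffic per stage
ERROR_BUDGET = 0.02                       # abort if error rate exceeds 2%

def progressive_rollout(error_rate_at, apply_stage, rollback):
    """Advance through traffic stages, rolling back on elevated errors.

    error_rate_at(fraction) -> observed error rate at that exposure,
    apply_stage(fraction)   -> shift that fraction of traffic to the new version,
    rollback()              -> revert to the previous version.
    Returns the final fraction reached (0.0 means rolled back).
    """
    for fraction in ROLLOUT_STAGES:
        apply_stage(fraction)
        if error_rate_at(fraction) > ERROR_BUDGET:
            rollback()  # blast radius capped at `fraction` of traffic
            return 0.0
    return 1.0

# Simulated deploy: errors spike once a quarter of traffic is exposed.
applied = []
final = progressive_rollout(
    error_rate_at=lambda f: 0.05 if f >= 0.25 else 0.001,
    apply_stage=applied.append,
    rollback=lambda: applied.append("ROLLBACK"),
)
# final == 0.0: the bad change never went past 25% of traffic
```

This is the property that would make auto-remediation less stressful: a wrong fix is caught by the feedback loop at a small traffic fraction instead of reaching everyone at once.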

[00:38:44] Chapter 15: Trust, Accuracy, and the Uncanny Valley

Mirko Novakovic: No, I would totally agree with you. I don't think that we are ready yet. It's like driving with an autopilot on the street: you can crash everywhere, right? So it doesn't feel safe, even if it's 98% or 99% correct; that last 1% is horrible. And I like the way you think about it: we need a lot of safeguards, maybe different architectures, to reduce the blast radius, and also roll back automatically and things like that. So I totally agree. I think we are not there yet, but I can see that it's coming quickly. I don't think we are years away.

Stephen Whitworth: I agree with you. I think the unfortunate thing is that robots are always held to much higher standards than humans are. The classic self-driving example: tens of thousands of people die every year in the US from car accidents, but if a Waymo shunts the back of another car, it's front-page news. You basically can't fight human nature; that's just how it's going to be. And I think that's exactly how people interpret things like AI SRE: it's fine until it gives you a single wrong answer. Then the trust is gone, and you scrutinize every single thing this thing says. The unfortunate reality is that unless you can get to a high enough level of accuracy, the set of things you can be trusted to do will be very, very small. Basically, that drives the right incentives, which is: make it as accurate as humanly possible. But honestly, I think that's the key product challenge for a lot of the people in the space right now. You can probably get to a point where, you know, I've seen case studies of competitors that are hitting, say, 90% root-cause accuracy. That's really good; humans would find this quite challenging to do as well. However, I think it's very much a product and UX problem: in that final 10%, how do I not be confidently wrong, or get into the uncanny valley of, cool, you just missed some incredibly obvious thing, and now I can't trust anything that you say, and we go all the way back down to the bottom? So yeah, it's tough, and I think that's the thing holding this part of the field back right now.

[00:41:10] Chapter 16: Closing and Further Resources

Mirko Novakovic: Stephen. Thank you. This was a great conversation. Thanks for joining me on this call. And I wish you all the success for the future.

Stephen Whitworth: Thank you. Mirko. It's been a pleasure. I appreciate it.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
