[00:00:00] Chapter 1: Introductions and “Code RED” Moments
Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0, and welcome to Code RED: code because we are talking about code, and RED stands for requests, errors, and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Anish Agarwal. Anish is the co-founder and CEO of Traversal, an AI SRE agent for enterprise businesses. He's also an assistant professor at Columbia University researching causal machine learning and reinforcement learning. Anish and his team raised $48 million in their seed and Series A funding rounds this year, which is part of the growing traction in the AI SRE space. Excited to have you here, Anish, and welcome to Code RED.
Anish Agarwal: Thank you so much for having me. It's a pleasure. We met probably 15, 18 months ago, and I feel like I've lived a few lifetimes since then. I'm sure you feel the same.
Mirko Novakovic: I can tell you that in the startup life, a year always feels like a decade at some point. So before we dive into this, I always start with one question. Which was your biggest Code RED moment in your career?
Anish Agarwal: It's a good time to be asking this question, because tomorrow there's a big deadline for a conference called ICLR, the International Conference on Learning Representations. It's the big conference that the deep learning people have. So it made me think, as I was thinking about this question, that a few years ago I was trying to submit a paper to NeurIPS, which is another big machine learning conference. And the results looked really good, really promising. And then, maybe a day before we were about to submit, we were looking through the results and our code, and there was this really stupid mistake we had made, and none of the results made any sense. So that was a big Code RED, because we tried to salvage the paper over the next 24 hours trying to fix that bug. And yeah, it was a random seed that we didn't set correctly, so we kept changing the seed and the results kept moving in a weird way. So that's probably the biggest Code RED.
[00:02:12] Chapter 2: From Causal ML to AI SRE
Mirko Novakovic: And before we start into this whole area of AI SREs, I was looking at your resume, and, I mean, it's pretty amazing, right? PhD at MIT, you are a professor at Columbia University in this field of machine learning. So why did you pick this kind of topic? I'm super excited, because you could have picked essentially any topic, but you picked observability, or this SRE space. So why?
Anish Agarwal: Yeah, it's a good question. About exactly 24 months ago, September 2023, is when we started this journey. And at that point I had never heard the word observability, I had never heard the word SRE. So it feels almost crazy. I think for me, my motivation to create the company came from research. You know, I love machine learning research, and I always wanted to be at the edge of it. And just in the last couple of years, with everything happening with OpenAI and Anthropic, it felt that so much of the frontier is now happening in companies. And so I wanted to be part of that, and I felt being at a startup, taking on a very technical problem with research being part of the culture, was my expression of it. So that was my motivation to create the company, along with my co-founders, who I met at MIT, and we all had the same motivation for creating it. When we started the company, we had no idea of the exact problem we were going to take on, but we had a pretty clear taste in the problem. Our research was in a field called causal machine learning. The idea being that, as you know very well, Mirko, correlation is not causation.
Anish Agarwal: And so how do you get these AI systems to pick up cause-and-effect relationships from data? I love the problem. It's almost a philosophical problem: what does cause even mean? And, you know, viewing that through the lens of machine learning and statistics, I just really enjoy it, and I like thinking about it. So we were looking for problems at the intersection of causal machine learning and reinforcement learning, which is about how to search effectively and how it is connected with agentic systems. And this is, you know, 18, 24 months ago, before AI agents were cool, but it was very obvious to us that this was something that was going to become very important. And so we were looking at where the intersection of these two things was. It was actually our fourth co-founder: he was from Citadel Securities, and as you can imagine, that's a company that cares a lot about uptime, being highly reliable and having very disciplined incident response processes. And so he's the one who told us about this problem of incidents and observability. And when we spent some time on it, we were like, okay, this is a great problem.
Anish Agarwal: It's almost like we were built for this problem. We just didn't know it existed. And if you think about it, you know what happens, right? You have some spike that occurs that you care about, like latency has shot up or whatever. And the problem is there are a thousand other spikes happening at the same time, right? Because when something goes wrong, it spreads like an epidemic through your entire software system. And so the problem is: how do you figure out whether a given spike is a symptom of the issue, a spurious correlation, or the root cause? So it's about finding a needle in a haystack, right? With many fake needles everywhere. And that's exactly what our research was about. And that's exactly what these LLMs are good at, because logs and metrics and traces, that's what the haystack is composed of. And so it felt like where the generational wave of technology was going and what we would be very good at all fit really nicely. And observability is a big market. It's a growing market. It's an evergreen market: everything has to be monitored, always. So that was our motivation to get in. One of our co-founders told us the problem, and it felt like founder-market fit was strong.
[00:05:50] Chapter 3: The Scope of “AI SRE”
Mirko Novakovic: Well, that makes total sense. By the way, I learned about causal AI when I talked to two PhDs from ETH Zurich.
Anish Agarwal: Yeah, they have a very good strong program there.
Mirko Novakovic: Yeah. And they were solving a problem in the production of chocolate. There are many variables in chocolate production: if you change something, then the output is either broken or not, right? And they managed that whole process using causal AI. At that point I kind of understood: oh, this is something you could apply to this topic in observability, right? Because there are also many different things that, if you change them, can have an impact on any other thing in the system. And so having some understanding of causality is super important.
Anish Agarwal: Yeah. My co-founder actually spent a lot of time thinking about gene networks at the Broad Institute at MIT, where you have all of these genes that are connected in a network. And, you know, you're trying to figure out what collection of genes makes your eyes blue or brown or whatever. And the thing is, when one gene changes, it affects the entire network. So it's the same problem, where you have this network of things and one change affects everything. It's a pretty interesting, universal way of looking at a lot of problems, which is why I really enjoy studying it.
Mirko Novakovic: Absolutely. And now there is this new category, I think, called AI SREs, right? How would you explain it to a user? Why should you buy it, what is it, and how do you see that space evolving?
Anish Agarwal: Yeah. In some ways, AI SRE is just what the category has been called, so that's what we're going with. But it's doing one part, I'd say at least right now, of what an SRE does. If I think about the SRE function, it's two things. One is the design of systems that are resilient, that will be resilient over time: how do you actually make the right decisions for your infrastructure so that it scales over time? The second part is the maintenance of these systems: once you've built something, you have to make sure that if something breaks, you figure out how to troubleshoot it and bring it back to the intended service. So the category is called AI SRE, but really what everyone's focused on is the problem of maintaining the system, not designing the system, at least so far. And that's where we are as well. Within that space, if you think about what happens when people think about on-call, it's really two things. One: you have a lot of alerts, this constant stream of alerts, because people configure alerts in some way, and you'll have thousands of these alerts happening. And it's someone's job to look through all of them and see: are any of these alerts important? Is there a cluster of alerts that means something? And triage those alerts. Then there's the second part, which is you've decided something has become an incident.
Anish Agarwal: It's a big issue. You're bleeding and you now need to, you know, bring 30, 40 engineers into a war room. And everyone's, you know, every second is ticking. You're trying to figure out what happened. And so I'd say if I think about the AI SRE category, it's really about two things right now. One is about alert triaging and the other one is about complex instance response and also remediation of both of these things. So I'd say that's the place it is. There's obviously many, many other functions that SREs do like instrumentation design of systems, which I think is not what quotes and quotes AI SRE as a category right now is. And that part of it, though, is really painful, right? Because you have engineers, both SREs and just on-call engineers who have to be staring at dashboards, you know, trying to keep all of this context in their head of thousands of different signals. And when something happens, dive, dumpster, dive through all these systems to get to the answer. And it's very hard for anyone, engineer, because it's very hard to look at all of these signals at any one time because you just don't have enough context. And so the AI category is about or solutions are about. Can you essentially read all these different disparate signals and start diagnosing incidents far, far quicker with far less on-call engineers than what we have to do now. So I'd say that's the category.
[00:09:57] Chapter 4: Cross-Tool Advantage and the “Switzerland” Strategy
Mirko Novakovic: It makes total sense. I mean, I remember the category AIOps from a few years ago, right? Companies like Moogsoft or BigPanda were big there, and they basically attacked the first problem that you described: having a lot of incidents or alerts coming in, then aggregating them and figuring out what the initial alert was. From my point of view, that category never really took off. I don't even know if that term exists anymore; I haven't heard it, but Gartner was really big on it at the time. And I totally see that the second step you described is really important, right? Supporting the engineer in finding the needle in the haystack, doing the root cause analysis, especially if you have multiple tools. So that would be one of the things I would love to understand. Do you see your biggest advantage, also compared to the agents that the vendors have, like Datadog's AI or something, in being able to do that across multiple tools? If I have a Splunk here, a Datadog there, maybe Dash0 or other tools, Prometheus for metrics, you can connect to all these data sets. And as far as I read your blogs, you are read-only, right? You only read the data, you don't store it. So you connect to all these tools and then you help find the root cause across all these data sets, combining them.
Anish Agarwal: That's totally right. And that's why we've also found ourselves, maybe earlier in our journey than we expected, working with larger enterprises. Because when you start going mid-market and above, or significantly above, which is where some of our customers are, they're using Splunk and Dynatrace and Elastic and Grafana and Datadog. I mean, they have every solution under the sun, right? And we have found that as the number of solutions you use grows, the pain of trying to root-cause grows non-linearly with it, because hopping between these different solutions is really, really difficult. And none of these vendors of the previous generation are incentivized to give you insights into data stored in someone else's platform, because all of the pricing comes from the amount of data you've stored and processed with them. We're marketing ourselves as Switzerland: we're not trying to store your data, we're simply trying to be the intelligence layer on top of it. And I think the observability industry in general is quite fragmented, right? There's always this war of attrition: I'll take 10% of your data, and then some other salesperson takes 10% of yours, and so there's this constant back and forth. So the fragmentation is high; even the biggest companies in the space have maybe 10% market share, something like that. That's, I think, why there's an opportunity for a company like ours to have a right to win, because you just never have all of your data stored in one place once you get to a certain size. And from our perspective, we're just building connectors to these things, and we pre-process the data. By the time the system looks at it, it becomes, quote unquote, a Traversal log or a Traversal metric. It's pre-processed in a way where the generic system doesn't care where it came from: Splunk or Datadog or Dash0 or whatever it might be.
Mirko Novakovic: Yeah, it makes sense. I don't know the exact number, but I once read that the average number of monitoring and observability tools in an enterprise is 27. Yeah, that doesn't surprise me. So there's definitely a market to consolidate over that, because you also have the legacy tools for older tech and all these things, right? And then how do you do that? You connect and read. Do you do that through an API or an MCP server? What's your favorite way of getting to the data?
[00:13:31] Chapter 5: Connectors, MCP, and Tool Design for Agents
Anish Agarwal: So at least so far, the MCP servers of these various companies are quite early right now. Because if you think about what an AI agent is, it needs an orchestrator of tool calls, right? And it needs access to a rich set of tools to be able to complete its task, and the types of tools that these different vendors, Datadog or Splunk or whatever, expose are still quite basic, so you're not really able to use them in any sort of really interesting way. So for us, we basically connect to the APIs; that's what we need, the core APIs themselves. In the background, we are converting all of that data into certain formats and then exposing tools on top of that that our agent can use. So we are building our own MCP server that our own agent calls, versus the MCP servers of these different vendors. We talk to these vendors through the APIs.
Mirko Novakovic: Okay. So you basically harmonize the data through that MCP server, so that your agent sees logs from different vendors in the same way.
Anish Agarwal: Yeah. And the tools themselves on top of them. So one part is harmonizing, so they look the same, and then the tools you give the agent access to are proprietary to Traversal. There's an art to it: if you make too many tools, each of them too basic, the agent can get confused because there are too many options. If you make the tools too big, then they're not flexible enough and the agent can't do what it needs to do. So that's where the art of modern software design comes in: how do you make the tools just the right size so that they're flexible, yet the agent is stable?
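To illustrate the "harmonize first, then expose right-sized tools" idea described above, here is a minimal Python sketch. The `NormalizedLog` shape, the connector registry, and the `search_logs` tool are hypothetical names for illustration, not Traversal's actual schema or API; the point is that the agent queries one data shape through one bounded tool, regardless of which vendor the records came from.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable

# Hypothetical vendor-neutral record: every connector (Splunk, Datadog,
# Dash0, ...) converts its raw logs into this shape before the agent sees them.
@dataclass
class NormalizedLog:
    timestamp: datetime
    service: str
    severity: str       # "info" | "warn" | "error"
    message: str
    source: str         # which vendor/connector the record came from

# Each connector is just a function returning normalized records for a
# (service, start, end) request.
CONNECTORS: dict[str, Callable[[str, datetime, datetime], list[NormalizedLog]]] = {}

def register_connector(name: str, fetch: Callable[..., list[NormalizedLog]]) -> None:
    CONNECTORS[name] = fetch

# One "right-sized" tool exposed to the agent: scoped by service, time window,
# and severity, so a single call is useful but cannot explode into "all data".
def search_logs(service: str, start: datetime, end: datetime,
                min_severity: str = "warn", limit: int = 200) -> list[NormalizedLog]:
    order = {"info": 0, "warn": 1, "error": 2}
    results: list[NormalizedLog] = []
    for fetch in CONNECTORS.values():
        for log in fetch(service, start, end):
            if order.get(log.severity, 0) >= order.get(min_severity, 0):
                results.append(log)
    results.sort(key=lambda l: l.timestamp)
    return results[:limit]
```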
Mirko Novakovic: Yeah, it's very interesting. When we built our MCP server, we were building an agent on top of it, and I'm not very deep in the machine learning and AI world. But the way I saw it, a standard agent works very similarly to a human; that was how I saw it. The tools that you provide to the agents, and the way you provide the data, are basically the way you would provide them to a normal user in an observability tool, right? So if you have a troubleshooting tool that helps you digest, I don't know, millions of logs, then the agent can also use that. Very powerful, right?
[00:15:54] Chapter 6: Human vs. Agent Workflows and Interpretability
Anish Agarwal: So yes, you have to get intuition from the way humans troubleshoot. But I think getting too married to the way humans do things can actually limit what these agents can do, because they're very good at doing things that we can't do. They have so much more context; they can look at so many more things in parallel. So sometimes having the flexibility to not constrain yourself to "this is the way a human troubleshoots," and instead thinking from first principles about what an agentic system can really do, is also important at certain times. So there's a trade-off around interpretability: if you know exactly what a human does and you get an agent to replicate that, then you have a very interpretable mechanism. But if you try to exploit what these AI systems can do, then you might have a much better-performing system, but you lose interpretability, which makes it harder to improve the system. So there's, again, another trade-off that happens.
[00:16:45] Chapter 7: Query Planning, Rate Limits, and Real-Time Constraints
Mirko Novakovic: And context is pretty important for troubleshooting, right? I read this interesting blog that you wrote recently about the three things you have to be careful about: I think it was indices, time, and fields, right? The way I interpret it is that it's very hard to narrow down the context in a way that focuses on the incident, because you can be too wide in a time range, use too many indices, or use too many fields. So can you explain a little bit how you see that, how you are different from other tools, and how I benefit as a user from that?
Anish Agarwal: Yeah, totally. So interestingly enough, if I think about a company, the number one thing they care about is accuracy. Accuracy is king. The second thing, which is almost just as important, is: can you respect the rate limits that a company puts on you? If you have a large company, they will have an observability team in charge of making sure, let's say they have self-hosted observability, that the observability stack is stable and up, right? And so from that perspective, if you come in and say, I'm an AI agent system that is going to bang your observability system all the time, they're going to feel nervous that you're going to bring down their system. So you essentially have to give a very strong contract: these are the types of queries I will write, this is how often I will hit your system, so they can feel comfortable with letting you have API access to these systems. Because typically engineers only get UI access; no one really gets API access to these systems. So respecting the rate limits that these teams put on you is really important. And that means you cannot just write any query, like "pull me all the data for the last two weeks." That's a crazy query.
Anish Agarwal: We'll be pulling in like petabytes of data as a result. You a lot of the work that we do, I'd say over the last eight, nine months has been about how do you if someone gives you a contract or this is the types of queries you can write, what is the right way that the system query plans so that it's able to respect the limits but still get to the data needs to get to. And then there's a second part of this, which is time, right? Because when you have an incident, you don't want an answer to come back in two hours. You want an answer to come back in two minutes. And if you write very, very small queries, you might respect the rate limits. Then the time it takes you to get to an answer and build up the context you need in real time is too long, right? So there's all of these different trade offs you have to manage at the same time. And part of that is building the right offline cache of your data so that when an incident happens, you don't always just ping the system real time. And so it's all of these different trade offs you need to think about to design a system that both respects rate limits, but also gives you an answer in a short period of time when an incident occurs.
[00:19:25] Chapter 8: Measuring Accuracy: Bullseye vs. Directional RCA
Mirko Novakovic: Yeah, that makes sense. And you mentioned it's all about accuracy at the end of the day: how good is the answer you get for your problem, the root cause analysis. I saw you have a categorization, right? Bullseye RCA and directional RCA. Can you explain where you think you are today with Traversal, and how I, as a user, would evaluate the accuracy of a system?
Anish Agarwal: We've learned a few things working with these large customers, or actually all customers. The way you build loyalty with your users is getting a bullseye often, right? Early in a pilot, in a partnership, if you get a few bullseyes early, you will have a lot of loyalty from a company and a set of users. If you don't get there, they'll forget about you very quickly. And to get a bullseye quickly, and to get very high accuracy, you basically need to constrain the search space. So essentially you need to work with the customer to figure out: with these search parameters, if you tell me x, y, z before the search begins, you'll get a bullseye very quickly. What we've found is that if we spend a bit of time with the customer, say a few weeks, then we can get to the bullseye really quickly, because we can constrain the search space in just the right way. There I'd say accuracy is greater than 90% if the search space is constrained enough, and that is basically work on both sides to get there. It's still very valuable; it's actually the most valuable thing you do. And then people ask very broad questions, like, what is wrong with my system? That is a very, very broad question that is almost ill-posed. Almost every engineer who first comes onto a system like Traversal asks: can you tell me something that's wrong with my system right now?
Anish Agarwal: And you want to service those kinds of questions too, because that's how you don't want people to first use your system. Only when an incident happens. You want them to be able to gAIn familiarity with your system. During a time of low stress. Right. And that's and that's why they want to explore and understand how you, you know about their system and learn, get insights about their system when nothing is wrong or something is wrong but not burning. And so in those kinds of systems, I'd say agAIn we have high accuracy, but then it's not the bullseye. It's one of the things where you get to an answer that's directionally right, but not something that's insanely specific. And what we've spent a lot of time over the last few months now doing is building a much more multi-turn, interactive user experience. And so then one thing we've been measuring now for accuracy is you get a bullseye within one answer. Or do you get a bullseye within two interactions or three interactions or four interactions with the user? And so the point I'm making is that getting the bullseye is key. And the question then is like, well, how many turns does it take to get to an answer? And so the two things that measures how much how specific is the first question they ask? And then how many turns does it take between them and you to get to the answer? And we measure both of those things pretty closely and carefully.
[00:22:25] Chapter 9: Confidence Levels and Instrumentation Feedback
Mirko Novakovic: Do you give the user an indicator of how good the answer is that you are providing?
Anish Agarwal: Yeah, yeah. For every answer we give, we give a confidence level, and we explain the confidence level. Typically the way it works is: for an answer to be highly confident, you need to have many smoking guns that point to the same root cause. So your metrics, your traces, many different logs, your PRs: if they're all pointing to the same thing, there's very high confidence. If it can only find one thing that points to that root cause, it'll say it's medium or low confidence. And actually, an interesting thing that's come out of that is that these LLMs then give you recommendations saying, hey, if I actually had these two more smoking guns, then my confidence level would be higher. And sometimes users use that to figure out a better way to instrument their data. They use the confidence as a way of saying, okay, let me instrument that, so in the future the AI will be more confident. So the confidence level is in some ways becoming a recommendation system for how to instrument your system better, which has been an interesting unintended consequence of what we've been doing.
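A hedged sketch of the "many smoking guns" idea: confidence rises when independent signal types all implicate the same root-cause candidate, and the missing signal types double as instrumentation hints. The signal list, thresholds, and function are invented for illustration, not Traversal's scoring.

```python
# Illustrative signal types the agent might collect evidence from.
SIGNAL_TYPES = ("metrics", "traces", "logs", "prs")

def score_confidence(evidence: dict[str, list[str]], candidate: str) -> tuple[str, list[str]]:
    """Return (confidence level, missing signal types) for a root-cause candidate.

    `evidence` maps a signal type to the candidates it implicates,
    e.g. {"metrics": ["checkout-svc"], "logs": ["checkout-svc", "orders-db"]}.
    """
    supporting = [s for s in SIGNAL_TYPES if candidate in evidence.get(s, [])]
    missing = [s for s in SIGNAL_TYPES if s not in supporting]

    if len(supporting) >= 3:
        level = "high"      # several independent smoking guns agree
    elif len(supporting) == 2:
        level = "medium"
    else:
        level = "low"       # only one signal implicates this candidate
    return level, missing

# The "missing" list can be surfaced as an instrumentation recommendation,
# e.g. "add tracing to checkout-svc so future verdicts are higher confidence".
level, gaps = score_confidence(
    {"metrics": ["checkout-svc"], "logs": ["checkout-svc"], "traces": []},
    candidate="checkout-svc",
)
```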
Mirko Novakovic: Yeah. I also read an interesting article from Google. They were explaining that one of the issues is that a lot of tools, and I can relate to that from my Instana times, always try to give you the one root cause. And they explained it in terms of search: imagine you searched for something and got only one answer, not 10 or 15 like today, right? Yes, you can put the most relevant one first, but if it's not correct, you can still go to number two or three and find your answer there. When you give only one and it's sometimes the wrong answer, there's a high frustration level for the user. So I could really relate to that: sometimes you have to give the user a confidence level, but maybe also different choices, right? Say, hey, we are most confident that this is the root cause, but it could also be this or this. Are you doing that, giving only one, or multiple findings?
[00:24:28] Chapter 10: Presenting Multiple Hypotheses
Anish Agarwal: Yeah. So at least right now, based on collecting data from user sessions, we actually give between zero and four, not just one. There are a number of times when we'll say no root cause found, because one thing users hate, hate, hate is if you lead them down the wrong direction. So I think they actually appreciate it when you say no root cause was found, because then they know to move on with their lives and do what they were doing originally. And we found that if you go beyond four, it looks very noisy, like you don't really know what you're saying. If you give too many options, people wonder, do you really know what you're talking about? So somewhere between zero and four is the sweet spot, we've found. The exact number is actually dynamic: the LLM decides, based on the confidence levels, how many it should show, but not more than four. And that's part of the multi-turn thing as well, which is that you can then say, hey, actually, this is the one that looks promising, let's spend time on that. And that's one of the key things we measure: how many turns a user has with the Traversal AI system.
[00:25:33] Chapter 11: Building Dynamic Dependency Graphs
Mirko Novakovic: Yeah, that makes total sense. And I also saw in the video on your website that you have a nice dependency service map, explaining to the user how things are connected, and probably also showing where all these smoking guns point to, right? So how important is that map, and how do you generate it? My reaction was: oh, this is not easy, right? If you don't have access to the underlying system and you only have the data, like logs and things, it's not easy to build that map.
Anish Agarwal: A big part of the work we've done is figuring out how you build that dependency graph. That's where a lot of our research comes in, and these LLMs are also very good at it, because they can literally read the content of the log message fields and learn the causal dependency from the words themselves. So if you look at both the time series of the data and the semantic content of the messages, you can start building these dependency graphs, and that's what we do. It doesn't need to be perfect, and it's also dynamic and changing all the time, both spatially and temporally. That's been a huge part of the technology we've built: building that dynamic graph over time. And it's a function of both good statistics and good LLMs to get you there.
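A minimal sketch of the two ingredients mentioned here for inferring dependencies without full tracing: reading service names out of log text, and correlating which services error together in time. A real system would use LLMs and proper statistics; the heuristics, function names, and thresholds below are purely illustrative.

```python
import re
from itertools import combinations

def edges_from_log_text(logs: list[tuple[str, str]],
                        known_services: set[str]) -> set[tuple[str, str]]:
    """If service A's log message mentions service B (e.g. "timeout calling
    payments-svc"), add a directed edge A -> B."""
    edges = set()
    for service, message in logs:
        for other in known_services:
            if other != service and re.search(rf"\b{re.escape(other)}\b", message):
                edges.add((service, other))
    return edges

def edges_from_cooccurrence(error_times: dict[str, list[int]],
                            window: int = 60, min_hits: int = 3) -> set[tuple[str, str]]:
    """If two services repeatedly log errors within the same time window
    (timestamps in seconds), treat them as connected (undirected pairs)."""
    edges = set()
    for a, b in combinations(sorted(error_times), 2):
        hits = sum(1 for ta in error_times[a] for tb in error_times[b]
                   if abs(ta - tb) <= window)
        if hits >= min_hits:
            edges.add((a, b))
    return edges
```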
Mirko Novakovic: And how do you see OpenTelemetry in that regard? Because what we have seen is that if you have OpenTelemetry data, the LLMs are really good at understanding it, because there's so much documentation around it. They understand fields or tags pretty well, and then they can also map it.
[00:27:07] Chapter 12: OpenTelemetry and Non-Standard Data
Anish Agarwal: Yeah I mean I think I think that's part of our as we think about technical qualification as someone is, you know, is using Otel. That's a great thing that makes them a very desirable customer because we feel we can add value quickly. And a big part of it is indeed that LLMs have been trained through world knowledge to be good at understanding data that's formatted in that way. And so I think it's a great thing for the community, obviously Otel, but it's also a great thing for AI. It turns out. So yeah, I think, you know, I'm a believer. So the more people that use Otel, the better.
Mirko Novakovic: And if it's not OTel? Do you have to do the mapping, or what do you do? How do you train the model to understand things that are not publicly available standards?
Anish Agarwal: There's a lot of company documentation, and you have to use it. So a big part of what we do is slurp up the internal documentation, whether it's Confluence or Notion or your previous Slack messages, and all of those are used to learn these interesting dependencies. Obviously tracing helps. A lot of times people have service catalogs. So you basically use as many different inputs as you can find to build a dependency graph that is as consistent across those inputs as possible. So yeah, I'd say there's no one answer. And every company, as you know from observability, is built differently. You need to have an opinion on what you need to get there, and then use whichever way you can to get to that point. I wouldn't say there's one magic bullet; we're still learning all the ways you can get there, but at least a few have turned out to be useful. Obviously tracing and service catalogs help, and then statistics and LLMs can help in a way that you just couldn't a year ago. So that's also made it a lot easier.
[00:28:57] Chapter 13: Signal in Traditional Telemetry vs. Exotic Sources
Mirko Novakovic: Yeah, I was talking to a founder here, and I found it very interesting. They were working on using video conferences as a source. The idea was, and I could relate to it: a lot of times you have things like a marketing team deciding to put an ad on the Super Bowl, and then your application crashes because you have too many requests. His idea was, if I can get those recordings and the transcripts, I can maybe correlate those things. Or two SREs talk about a database migration tonight, and then something breaks. You could relate it to that conversation and say, hey, maybe that's the root cause: those two talked about a database migration, now we have a database problem, and the time frame matches. So I found it pretty interesting that you could open up to very different data sources than traditional observability data.
Anish Agarwal: Yeah. I mean, if I think about LLMs: with databases, you can store and query structured data, and in some ways LLMs, viewed through the lens of databases, let you store, query, and index unstructured data. From that simple leap, there are incredible implications, right? In some ways, an LLM is the world's worst database; I think someone said that once and I found it very funny. To your point, some of these ideas, like looking at every previous Zoom call transcript or Slack conversation and connecting that with a root cause, sound good, but I've actually found that there's a lot more signal in your traditional observability data than you'd think. It's just that no one is monitoring it. If you look at the amount of data that's actually examined, it's probably five, ten, twenty percent; no one ever looks at the rest. And there is actually a lot of signal in that data; you just don't have engineers with the bandwidth to look at it. So we've taken a different opinion, which is that the observability data is a lot richer than you think; you just need an agentic system that's able to comb through it effectively. And to do that, you need to think about rate limits and infrastructure constraints, because it's petabytes of data. At least that's the opinion we've taken. Obviously other sources of data are important, but first you should exhaust the traditional stores, because they're very powerful.
[00:31:21] Chapter 14: Rethinking Data Pipelines and Log Design
Mirko Novakovic: That's an interesting point. And would you say that this could also change the way, I mean, as observability vendors, we think more and more about pipelines, right? Tools like Cribl have done a great job of building those, where you try to reduce the amount of data you are storing. But to your point, it could be interesting at some point to not do that, not thinking about price or anything, but to store a lot of data, because now you have this agent that can look at all the data and maybe find signal inside it that a normal engineer never would have found.
Anish Agarwal: That's right. And it's interesting; I think both can be true in some ways. One is that the logs we create now are meant to be human-readable a lot of the time. But you could consider a world in which a log is no longer meant for a human squinting at a terminal and seeing what it says, but is meant to be parsed or grokked by an agentic system. So you might make the log a lot more voluminous than it currently is, with a lot more data in it. A single log might be much more descriptive about the system. It might lose human readability, but it becomes more agent-parsable and rich. And at the same time you might say, hey, actually there's a lot of redundant data between different kinds of logs, and that can be reduced. So I think both things might happen. As these agentic systems start becoming the systems querying your data, you might figure out which data streams are actually effective and which ones are not. You might find that you actually need everything, or that only 20% or 30% is effective, I don't know. And at the same time, the way we structure logs and the way we structure metrics will change and probably become much longer. It's interesting to see how it plays out. I can't wait to see what telemetry looks like in the next few years; it's going to change in really interesting ways that I'm excited to see and be part of.
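To make the "logs written for agents rather than for humans squinting at terminals" point concrete, here is a sketch of an intentionally verbose, structured log event. All field names and values are hypothetical; the idea is simply that context a human would normally infer is attached explicitly, so an agentic system can parse it without hunting elsewhere.

```python
import json
import logging
import time

logger = logging.getLogger("checkout")
logging.basicConfig(level=logging.INFO)

def emit_agent_log(event: str, **fields) -> None:
    """Emit one JSON object per line: redundant and verbose for a human,
    but trivially parsable and context-rich for an agentic system."""
    record = {
        "ts": time.time(),
        "event": event,
        # Context a human would normally infer, stated explicitly (hypothetical values):
        "service": "checkout-svc",
        "deploy_sha": "a1b2c3d",
        "upstream_dependencies": ["payments-svc", "orders-db"],
        **fields,
    }
    logger.info(json.dumps(record))

emit_agent_log(
    "db_query_slow",
    query_name="fetch_cart",
    duration_ms=1840,
    threshold_ms=250,
    probable_cause_hints=["orders-db connection pool saturation"],
)
```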
[00:33:20] Chapter 15: From RCA to Guardrailed Remediation
Mirko Novakovic: Yeah, it makes total sense. And how do you see this space evolving, especially in terms of automation? I think one of the things, if you look at those systems, is that you would like not only to get the root cause; you also want the agent to do the right things to resolve the problem, right? Restart the container or roll back the code change that caused it. Do you see that happening, or is the confidence of the users still not there? How do you see that evolving over time?
Anish Agarwal: It's possible, and we are actually doing that with some very complex enterprises. We are doing things like restarting pods, scaling pieces of the infrastructure, rolling back commits, changing code. So it's already happening, and if I had to predict two years from now, it will be happening in a much richer way. I think we'll have end-to-end outcomes from an incident to a full fix. Certainly, when you begin with a company, no one's going to give you write access and let you do whatever you want to do. You have to earn the right to that by showing high accuracy on the root-causing part of the journey. And if you can do that effectively over time, then the next step, at least the journey we've been on, is that companies will give you basically a whitelist of commands that you can run. They won't just say, do whatever you want. They'll say, these are 50 commands you can run. Now, given these 50 commands and given the root causes you have, can you come up with a sequence of commands that will fix the issue, in some sort of constrained, heavily guardrailed way? And we've found teams are getting comfortable with that, especially if the command is something that is not going to bring down your system. It might be expensive from a resource point of view, but it's not bad for your system. Those are the types of things where we're already seeing it, and I think people will get more and more comfortable with it over time.
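A sketch of what a "whitelist of commands" guardrail could look like: every step the agent proposes must match a pre-approved command template before anything runs, and execution still requires explicit human approval (the button press discussed next). The templates, `RemediationPlan` type, and checker are illustrative, not Traversal's actual mechanism.

```python
import re
from dataclasses import dataclass

# Commands the platform team has explicitly approved, as regex templates.
ALLOWED_COMMANDS = [
    re.compile(r"^kubectl rollout restart deployment/[a-z0-9-]+ -n [a-z0-9-]+$"),
    re.compile(r"^kubectl scale deployment/[a-z0-9-]+ --replicas=[0-9]{1,2} -n [a-z0-9-]+$"),
    re.compile(r"^git revert --no-edit [0-9a-f]{7,40}$"),
]

@dataclass
class RemediationPlan:
    incident_id: str
    steps: list[str]
    approved_by_human: bool = False   # a person still has to press the button

def validate_plan(plan: RemediationPlan) -> list[str]:
    """Return the steps that are NOT covered by the whitelist."""
    return [s for s in plan.steps
            if not any(pattern.match(s) for pattern in ALLOWED_COMMANDS)]

def execute_plan(plan: RemediationPlan, run) -> None:
    """Refuse to run anything outside the whitelist or without human sign-off."""
    rejected = validate_plan(plan)
    if rejected:
        raise PermissionError(f"Steps outside the whitelist: {rejected}")
    if not plan.approved_by_human:
        raise PermissionError("Plan requires explicit human approval before execution")
    for step in plan.steps:
        run(step)   # hand off to an executor with its own audit logging
```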
Mirko Novakovic: Yeah. And is it fully automated, or does a user have to acknowledge it? I think that's also an intermediate step: you say, hey, I suggest doing this because of that, and the user presses a button to do it, right?
Anish Agarwal: Yeah. So that's very much the case right now. You have to press a button. We don't just go do it. I think that's probably best for everyone involved. Yeah.
[00:35:42] Chapter 16: Workflow Construction and Routing
Mirko Novakovic: No, absolutely. And what kind of workflow systems do you see doing that? I mean, is it just a simple command, or will there be workflows that the agent decides on, or will you model workflows? That's also something I'm thinking about: will there be new kinds of workflow systems, like the n8ns for observability or SREs?
Anish Agarwal: I think, like you, that that will happen. That's what we're doing in some ways: we're manually constructing workflows for different kinds of remediations. Or actually, a lot of companies already have these scripts; they have a lot of automation scripts, different kinds of runbooks or other systems. The problem is they have thousands of them and don't know which one to run. But again, an AI system can semantically understand what each of these different workflows does and route you to the right one, so it just becomes a router in some sense. At least right now we're in a world where it's all manually constructed at some point, versus the agentic system fully constructing the workflow. There's been one company with which we are doing that, called Cloudways, which is a web hosting company. It's guardrailed, but the AI system is actually constructing the workflow, in a very constrained way right now, mind you. At every other company, it's a pre-existing workflow that we're just routing to. How it evolves over time will be interesting to see.
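A toy version of the "thousands of runbooks, route to the right one" idea, using TF-IDF cosine similarity as a stand-in for whatever semantic matching (embeddings, an LLM) a production system would actually use. The runbook catalog and function names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog: runbook name -> short description of what it does.
RUNBOOKS = {
    "restart-checkout-pods": "Restart checkout service pods after OOM or crash loops",
    "scale-payments": "Scale payments service replicas during traffic spikes",
    "rollback-release": "Roll back the most recent deployment of a service",
    "rotate-db-credentials": "Rotate expired database credentials and restart consumers",
}

def route_to_runbook(incident_summary: str) -> tuple[str, float]:
    """Return (best matching runbook, similarity score) for an incident summary."""
    names = list(RUNBOOKS)
    corpus = [RUNBOOKS[n] for n in names] + [incident_summary]
    matrix = TfidfVectorizer().fit_transform(corpus)
    # Compare the incident summary (last row) against every runbook description.
    scores = cosine_similarity(matrix[len(names)], matrix[:len(names)]).ravel()
    best = scores.argmax()
    return names[best], float(scores[best])

# Example: the router picks a pre-existing workflow; executing it would still go
# through the whitelist guardrails and human approval described earlier.
name, score = route_to_runbook("checkout pods crash looping with OutOfMemory after deploy")
```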
[00:37:03] Chapter 17: Disruption of Observability UX
Mirko Novakovic: I have one personal question that I'm really interested in and thinking about, and it is how this whole AI world and agentic AI will change the observability world, right? As an observability vendor, what I can see is that tools like yours will essentially use observability tools through an API, and then we kind of lose control of the user. Because why would a user log into an observability tool if you or your agent does the work for them? So in the end we would only be a database, which from a value perspective is not that valuable anymore, right? Do you see that happening, or do you see it more as working together: you provide something and the user still logs into the observability tool? How disruptive do you see that becoming for this whole space?
Anish Agarwal: I mean, I think it will disrupt. I think it's an inevitability that there will be a fundamental rethink of how users interact with telemetry data. I don't think that in five years you log on to a system and look at a bunch of dashboards; I think it'll be much more that a system is doing the workflow. Now, who does it? It's unclear whether it's a company like us, which is trying to be a purely agentic company, or an observability company that grows into this. I don't know. But what I do know is that things will change; who does it and how the field plays out remains to be seen.
Mirko Novakovic: It could also be a tool like Cursor winning the game, right? We are integrating into Cursor through MCP, and sometimes it's really powerful to see how, through your IDE, you ask questions and get direct feedback translated into code changes. So I could also see that happening: those IDEs getting much more powerful in terms of the use cases they cover, right?
[00:39:10] Chapter 18: IDEs vs. Production-Grade Troubleshooting
Anish Agarwal: Yeah, there are a lot of engineers, developers, who use tools like Datadog every day for something or other, to get a sense of system health. So for simple queries, like pull me this piece of log or pull me this dashboard, I think the IDEs can do a good job and already are doing a good job. For complex troubleshooting of incidents, you have to live and die by that; it has to be what the company is built around, just from the amount of infrastructure work required to make it happen. I mean, who knows, but I think it would be very hard for an IDE player to do that. They can do simple retrieval tasks, but not complex troubleshooting workflows. And you can actually argue the other way, which is that some of the tools on the production side might also start eating into the IDE side. So I think it's a competitive space with competition from every which way, but it's a big prize.
Mirko Novakovic: Yeah, absolutely. Exciting times, right? It's really good to be in this space while things change, and that always opens up possibilities for vendors like us, who are new in this space and have to fight the big guys who are already there with billions of dollars of revenue.
[00:40:25] Chapter 19: Closing Reflections
Anish Agarwal: Yeah, yeah. It's been a privilege working on this problem. We say we've fallen in love with the problem; it's a really good, fun problem to be working on. If I think about the team here and what unites them, everyone, all the engineers, loves this problem and wants to solve it. So it's been really fun, honestly. You enjoy going to work every day.
Mirko Novakovic: Oh, that's great. Anish, it was a pleasure talking to you. I will follow your journey very closely. It's fun seeing this space evolve and seeing how much money goes into it. And it is absolutely a problem that the whole space has tried to solve for 20 years and never got really close. This time I can see it happening.
Anish Agarwal: Yeah, I hope so, too. Thank you for having me. It was a pleasure, Mirko.
Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.