[00:00:00] Chapter 1: Introduction and Guest Background
Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0, and welcome to Code RED: Code, because we are talking about code, and RED stands for requests, errors and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Andrew Mallaband. Andrew is the founder of Breakthrough Moments, where he advises tech companies on growth, engineering, messaging and go-to-market strategies. Previously, he helped build industry leaders like Turbonomic and Opalis from the ground up, leading to exits to EMC, Microsoft and, something we have in common, IBM, right? And recently I read some of your analyses, and I'm really happy to talk about them. Welcome to Code RED, Andrew.
Andrew Mallaband: Thank you, Mirko. It's a pleasure to be here.
Mirko Novakovic: I always start Code RED with the same first question: I ask about your biggest Code RED moment. Essentially, what happened to you in your major outage or "oh no" moment, right? And maybe with one of the tools like Turbonomic, which was not so much a troubleshooting tool, right? But maybe you had these moments where the tool helped you, maybe even the AI helped you, to solve a problem.
[00:01:32] Chapter 2: Personal “Code RED” Moments
Andrew Mallaband: I personally, in my career, did not spend a lot of time on the front line. I spent a lot of time interviewing customers about, you know, what their challenges were. But my own Code RED moments, I guess, were more related to the businesses that I was working for. We were trying to close the quarter end. One year it was Christmas, and we were trying to close this massive deal at Cable and Wireless. It would have doubled the revenue of the company I was working for. And my wife made me go and live in a hotel because, between Christmas and New Year, she couldn't stand, you know, the pressure. I was supposed to be there with my family, but I was completely vacant. So that was my biggest Code RED moment that I remember in my career.
Mirko Novakovic: It's a good one. And by the way, I work a lot in the go-to-market space, and not a lot of people know that my former CEO gave me the advice to always send my wife on holiday somewhere at the end of the quarter. Because in that last week, whoever has been in these situations, where there is this large deal that is not allowed to slip and you need to get it, the pressure is really tremendous, right? You get stressed out, and if you win it, it's a really nice feeling, right? But it is stressful. So I can relate to that Code RED moment. And probably no AI and no observability tool can help you there, right?
Andrew Mallaband: Yeah. I mean, I guess my other Code RED moment was again related to the intensity of the business. You know, we had a year, I think it was the third year at Turbonomic, where we were pioneering a business model with inside selling as the primary motion, and I was the guinea pig for enterprise selling, which eventually became, you know, how we went to market. In 12 months, with a very small team, we closed over 50 deals in the enterprise space, and we were like half the revenue of the company. And at the end of that year, we'd worked so hard that I burnt out. Right? So that was a Code RED moment for me. And I got into meditation and mindfulness and all these things, looking after myself as well, because I'd been running hard and I wasn't doing any of that.
Mirko Novakovic: Yeah, I get that, I get that. Today is my 49th birthday, by the way. And it changes a lot, right? Over time. When you're younger, you can handle that differently, but after a while you have to be more careful, right? With everything: be mindful, do some workouts and stuff like that. But let's get to observability in 2025, right?
[00:04:25] Chapter 3: Key Observability Challenges for 2025
Andrew Mallaband: Yeah. I mean, to provide some context, I've worked for two companies in particular that were focused on AI over a 16-year period, and I got the opportunity to come back and revisit this area of observability 18 months ago, when a new vendor basically asked me to get involved. It gave me the opportunity to get exposure to many, many customers across the world who were involved in deploying observability solutions. And there were really four themes that came out of those conversations. Cost: you know, my bills are increasing in an unexpected way, how do I control that? Data deluge, which I think is part of the reason for the cost; so many organizations were overwhelmed with data, and trying to make sense of it was a problem. The consequence of that was the impact on productivity, because people were spending a lot of time in incident management processes and doing postmortems on incidents, trying to figure out why. Obviously that's stealing time from other activities, which affects the delivery of, you know, new capabilities. And then the final area was around data observability. More and more companies are running real-time applications, you know, for personalizing user experiences, recommendation engines, fraud detection, cyber security, supply chain management. In all of these situations, data freshness and data integrity are absolutely critical, right? Things fall down like a pack of cards if you don't have alignment there. So scratching below the surface of all of these things, there was really one consistent theme that I observed, which I think is letting people down, and that's that a lot of organizations don't actually have a strong data strategy that underpins observability. I was motivated by these findings, really, to write this Observability 2025 series, which digs into these challenges and talks about how we might go about solving some of these problems.
[00:06:56] Chapter 4: Cost and Data Overload in Observability
Mirko Novakovic: I really liked it. It's a really good read; I can only encourage people to read it. I love it because, as you probably know, we have built Dash0 around a few of these observations. And one of the observations we also hear, and I would love to dig into it a little bit, is the cost problem, right? We hear it all the time: people need observability, but it's really expensive. And one of the reasons is that you pay for the data that you are sending, and there's a lot of data, and there's more data all the time, with microservice applications, clouds, new things coming up, right? And it's not only the cost. As you mentioned, if you have a lot of data and you're trying to find the needle in the haystack, a much bigger haystack doesn't make finding the needle easier, right? It takes more time.
Andrew Mallaband: And I think these things are only going to compound as well, because we've now got these things called agents that we are going to have to manage, and there's a whole new dimension of things that we need to start looking at.
Mirko Novakovic: Yeah, absolutely. Or just the whole vibe coding, right? Which will let developers create at least code much faster. And that means there will be more code deployed, probably into production. Also code that is not so much controlled by the developer, so it's a little bit more unknown code, right? I don't even know what the challenge will be, but I can imagine, based on my personal experience, that it's always easier to analyze and troubleshoot your own code, right? Because you have written it. But now you are essentially troubleshooting code that's written by an AI. Maybe controlled by you, or supervised by you, but it's still not your code anymore. So again, you will get more data, but also more data from code you are not that familiar with.
Andrew Mallaband: And you may get more failures as well.
Mirko Novakovic: Yeah, absolutely. So this will be very interesting. We see the same, right? We see that there's more and more data, people are struggling to make sense of the data, and it's getting more expensive. I've seen the charts, right? Data is going up, value is going down. Because more data doesn't bring more value if you can't make sense of it anymore.
[00:09:11] Chapter 5: Intentional Data Collection and Distributed Tracing
Andrew Mallaband: That's right. And there are some technological things that you can do, but I think it was in the fourth article in my series that I wrote about observability and business value. One of the things that I said in there was: you've got to really start by defining what your intent is. I spoke to so many people who couldn't answer the question of, you know, what are you collecting and why? They were focusing more on: well, I need a tool, and I'm going to go and discover things and collect the data and build some dashboards. But they weren't really thinking about the intent. In my mind, that is: what are the business transactions that actually matter to the business? And, you know, that could be user interactions, it could be data pipelines. If these break, then they're obviously going to have a business impact. And that's really where people should start, rather than trying to, you know, just discover the universe of everything and collect all this data. Now, in doing that, I also suggested that the most important signal we need to focus on is distributed tracing, because it helps us to actually understand the dependencies, and inherently it contains causality. We can see how, when a problem arises, it propagates through the system. We don't get those clues by looking at metrics and logs. They're the things that you need to zoom in on once you understand the causality in the environment, right? Because they'll reveal more interesting details about what we need to do to diagnose the problem.
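As a rough illustration of Andrew's point that traces inherently carry causality, here is a minimal sketch using the OpenTelemetry Python API; the service and operation names are invented for the example, not taken from the conversation:

```python
# Minimal sketch: nested spans record parent/child relationships, so a trace
# inherently encodes how a failure propagates. Names here are illustrative.
# Requires: pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import Status, StatusCode

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")

# "charge-card" runs inside "checkout": if it fails, the trace records the
# error on the child span plus the causal path back up to its parent.
with tracer.start_as_current_span("checkout"):
    with tracer.start_as_current_span("charge-card") as charge:
        charge.record_exception(RuntimeError("card declined"))
        charge.set_status(Status(StatusCode.ERROR))
```

A metric or a log line alone records that something failed; the nesting above is what records what it failed inside of.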
Mirko Novakovic: Yeah, I totally agree. I would love to talk about two of those dimensions, because we are also discussing them internally. Let me start with the data that's sent in, and you said it's not intentional. I had this discussion today in a LinkedIn thread, where I was saying that, in my point of view, one of the problems is that most of the data you send to observability tools is not coming from your developers at all, right? Most of the data is coming from third parties, and you're just integrating it. You have all these plugins: oh, there's a plugin for this technology, let's say the Oracle database, and now I'm getting all the logs and metrics of the Oracle database. And then I'm adding my AWS account, and there are 50 AWS services, and now I'm getting all the metrics and logs from AWS, which is not data I have added to my code. It's just data from third parties, and it's very hard to understand what I really need, right? Where's the data coming from? I can't even control the data. So I think that's part of the problem: most of the data you send into observability tools is not your data, right? You are collecting telemetry from third-party tools and services, and then you try to combine that with the data from your application that relies on those services, right?
Andrew Mallaband: Yeah, I think we've got to ask: why am I doing this? That's the fundamental question, you know: what value is this going to add?
Mirko Novakovic: Absolutely.
Andrew Mallaband: You know, that's something that people need to do. And I think if they ask themselves that question, then they will probably find that they collect a lot less data than they do today, which will have an impact on the cost. And it will also give more clarity to the people who use these systems, right? Because they'll be dealing with less noise.
[00:12:47] Chapter 6: Third-party Data vs. Developer Data
Mirko Novakovic: No, absolutely. And then the other point was tracing. I couldn't agree more with you. But here is also the reality from 25 years. I saw tracing for the first time 25 years ago with Wily Introscope. Wily had tracing for Java applications, and I was totally intrigued. I loved it, right? And then I worked with other tools: Dynatrace PurePath, right? AppDynamics business transactions. So tracing has been there forever. But even today I see that most developers still rely much more on logs and metrics, and I haven't figured out why, to be honest, in 25 years, because I'm such a big fan of tracing. Even today with Dash0, I always look at the ingest and what's ingested: how much tracing, how much logs, how much metrics. Tracing is always the third thing people do, right? It's always kind of logs and metrics. Sometimes it's metrics first: you create some dashboards, and then you add logs for your debugging and stuff like that. But then tracing, which, as you said, adds the context. It gives causality, it lets you understand interactions, and maybe even, if an error occurs, who is affected, right? It gives you a blast radius.
[00:14:09] Chapter 7: The Case for Tracing—Historical and Contemporary Views
Andrew Mallaband: We used to call it service management. Right. You know, when we had tools like Mercury back in the day. Right.
Mirko Novakovic: Yeah.
Andrew Mallaband: Yeah, yeah. And, you know, I guess that was part of the incentive for Dynatrace. I mean, if you look at the way services are being constructed today, not only do you have lots of distributed components, lots of microservices, but you also have this organizational decomposition as well, where you've got lots of different teams who manage a piece of the puzzle. So, I don't know, maybe developers care less about the end to end, which is really what distributed tracing is about. I think this is very relevant from an SRE perspective, though, because they're trying to piece together all of these components to figure out why things break. If I'm responsible for one piece in the machine, then probably I'm less interested in the end to end.
Mirko Novakovic: I really remember the very early Dynatrace days. Bernd Greifeneder, the founder: there were only PurePaths, right? There were no metrics, nothing in Dynatrace. I remember him saying, I will never build dashboards, right? That lasted until version three, when he had dashboards in, right, because people demanded it. It was the same with Instana. We wanted tracing, and I said, oh no, tracing is the most important thing. But then, with customers, you hit reality. And then you saw Datadog: they only had dashboards, and we said, oh, dashboards are not so useful for developers, they need tracing. But then you saw them taking off with dashboards and metrics at the beginning. And then they added logs, by the way, and then tracing. So they did it kind of the right way from a customer perspective, and I did it the wrong way: I started with tracing, right? Because I was so intrigued. And so I learned my lesson about tracing. Yes, I think it's the most powerful thing in observability, and it gives everything you need, especially context; it helps you with troubleshooting so much. But until today, I still think it is not the most loved, right?
Andrew Mallaband: Well, I think it is the thing you need if you want to go anywhere with all this AI SRE tooling, right? Because what these vendors are attempting to do is to build, you know, graphs of the environment in order to represent dependencies. The idea is that when something breaks, they have this knowledge base that shows the dependencies in the environment, and they have an algorithm which traverses the dependencies and carries out various diagnostic tests. There's no way to build those graphs without distributed tracing, from what I can see.
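To make the graph-building point concrete, here is a small plain-Python sketch of how service-to-service dependencies fall out of parent/child span relationships; the span data is hand-written for illustration:

```python
# Sketch: derive a service dependency graph from spans. Each span knows its
# parent, so cross-service edges can be read off directly. Data is invented.
from collections import defaultdict

spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "checkout"},
    {"span_id": "c", "parent_id": "b", "service": "payments"},
    {"span_id": "d", "parent_id": "b", "service": "inventory"},
]

by_id = {s["span_id"]: s for s in spans}
edges = defaultdict(set)
for span in spans:
    parent = by_id.get(span["parent_id"])
    if parent and parent["service"] != span["service"]:
        edges[parent["service"]].add(span["service"])

# {'frontend': {'checkout'}, 'checkout': {'payments', 'inventory'}}
# This is the knowledge base a diagnostic algorithm would traverse.
print(dict(edges))
```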
Mirko Novakovic: I agree, I agree. But the customers do not always agree, right? That's the point.
Andrew Mallaband: I mean, I saw this because, you know, I spent more than a year talking to customers about AI SRE tooling on behalf of a vendor that I was working with. And one of the biggest challenges in actually demonstrating any value to a customer was the fact that people didn't have this, right? You would go along and install the product, having set the expectation that you were going to do all this magic for the customer, but the magic never happened, because the customer didn't have the data.
Mirko Novakovic: Yeah, absolutely. And that's something people need to understand, even with AI: causality is really important, right? And you need that because otherwise, as my former CTO at Instana always said, if you have only metrics, at some point everything correlates with everything, right? Somehow. And if you don't have causality, if you don't have the graph and the dependencies, it's very hard to understand, for this metric here on the left and that one on the right, even if they seem to be correlated, whether there is a dependency, right?
Andrew Mallaband: Yeah. Unfortunately, Harry Potter doesn't work in the observability space. So, you know, there's going to be no magic here. There's no magic answer.
[00:18:12] Chapter 8: Business Transactions, Customer Experience, and Product Analytics
Mirko Novakovic: No, absolutely. And we are also discussing this internally. Today we have tracing and spans, but we are discussing this notion of a business transaction, and I totally agree: this should be one of the main things you care about. Call it a user journey or a business transaction, whatever it is. How you map that to tracing is not that easy, in my point of view. Sometimes a business transaction is just a distributed trace, but sometimes multiple traces make up one business transaction, depending on how you build your application, right? And that's what makes it really interesting. At the moment we are figuring out, for example, whether AI can help us identify those business transactions for you, so that you don't have to configure them, because I think configuring them is complicated.
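One plausible way to stitch multiple traces into a single business transaction, sketched in plain Python, is to join them on a shared attribute; the attribute name order.id and the data are illustrative assumptions, not a description of how Dash0 actually does it:

```python
# Sketch: a business transaction that spans several traces, stitched together
# on a shared attribute. The attribute key and values are invented.
from collections import defaultdict

traces = [
    {"trace_id": "t1", "name": "place-order", "attributes": {"order.id": "42"}},
    {"trace_id": "t2", "name": "charge-payment", "attributes": {"order.id": "42"}},
    {"trace_id": "t3", "name": "ship-order", "attributes": {"order.id": "42"}},
]

business_transactions = defaultdict(list)
for t in traces:
    key = t["attributes"].get("order.id")
    if key is not None:
        business_transactions[key].append(t["trace_id"])

# {'42': ['t1', 't2', 't3']}: three traces, one business transaction.
print(dict(business_transactions))
```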
Andrew Mallaband: Right. One of the other things I observed is that a lot of organizations have a separate team of people, called customer experience or customer service, who face off to customers, and they're using digital experience tools, you know, from Adobe and people like that, which monitor the customer experience. And the customer experience could be bad because of the design of the software, right? It's just a poor user experience. It could be that the customer hasn't been trained properly. Or it could be that the issue is actually a systems issue, right? So I also think bringing together digital experience with observability is an opportunity as well.
Mirko Novakovic: Oh, totally. And there's also a whole category called product analytics. Right.
Andrew Mallaband: Yeah.
Mirko Novakovic: Like product analytics, which mostly comes from product management, but it does the same, right? I was talking to a larger enterprise in the Netherlands, and the CTO there was saying: hey, I pay this amount of money, and it was all seven figures, by the way, all more than $1 million. I pay $1 million plus for this user experience tool. I pay $1 million plus for my Amplitude. I pay $1 million plus for my Datadog, and I pay $1 million plus for my Splunk for SIEM. And essentially, it's all the same data, right? So how can you help me with that? And I was like, it makes sense, right? It makes sense that you have all these different use cases for different personas who have different tools, and essentially it's all more or less the same data.
Andrew Mallaband: You know, then we have digital adoption platforms as well, with what I would consider observability data in them. They give us another lens on what users are doing. And we've got data observability as well, which is, again, you know, the Monte Carlos of this world. I mean, I wrote an article recently where I quoted a guy called Heinrich Hartmann, who is the head of SRE at Zalando; I did actually speak to him on several occasions in the past as well. And, you know, one of his biggest gaps was: how do you know about data integrity challenges in the environment? When we think about the observability space, we're often looking at it from a performance perspective and not a data integrity perspective. But from an SRE perspective, you can't really differentiate: if there is a problem, it's a problem you need to bring into this world for the SRE as well.
[00:21:56] Chapter 9: SRE Agents and Automation in Observability
Mirko Novakovic: Absolutely. Yeah. You have tools like Monte Carlo for those types of challenges, absolutely. But let's talk about what we touched on a little bit: AI and SRE agents. What is your definition of an SRE agent? And what would you say is the state of the market and the products, if you can, because you have written about it. I would be curious about that.
Andrew Mallaband: I guess it's a fairly broad definition, right? There are tools out there that are doing what I would have defined, from my past, as runbook automation: capabilities where they're basically taking housekeeping work, you know, the stuff that the SRE team would like to be able to automate but haven't yet. And it's fairly easy and simple stuff to do. I mean, there's a company called Run who are doing a great job at that; they kind of started off as a runbook automation vendor around Kubernetes and have evolved from there. And then you have other players who are trying to automate the incident management process as well, which is distinctly different from housekeeping. So, you know, when I get an issue: help me to diagnose what it is and provide me with all the information, so I don't have that overhead, I can do it faster, I can get to the root cause quicker and hopefully recover the service faster as well. And there seems to be a new company in that space every day of the week; a lot of people are focusing on that. Microsoft are in the game as of a few weeks ago, when they announced that they have a capability there. And incident.io, who were formerly kind of in the PagerDuty space, their CEO said on LinkedIn the other day that they're now going to move in on that space as well.
Andrew Mallaband: You've also got, you know, companies like Causely and Dynatrace, and there's a new company called NoFire AI who are trying to focus specifically on how to do causal reasoning, which is distinct from many of these other vendors, who are really focusing on how to correlate data through a knowledge graph. It's kind of yet to be seen, I would say. I used the term nascent in my article, because I think there's very little significant evidence about how quickly these vendors are making progress. Part of the reason is that it actually takes a long time to bake AI, right? You need a lot of runtime experience to refine it. I mean, when we did Turbonomic, we purposely went to market with a product which focused on reporting and capacity planning rather than real-time decisioning and resource allocation, which was the fundamental premise of the product. Because we could put stuff in the marketplace, we would gain knowledge: we had the data, and we could run our algorithms against live production data, actually see what results we were getting, and refine the AI, right? And we could do that in a way that wasn't intrusive to the customer. A lot of these companies don't have that luxury, because they're building off synthetic data, which isn't necessarily representative. So, you know, until you've got a lot of customers... I mean, maybe it's something for you to consider.
[00:25:51] Chapter 10: Data Foundations for AI and Observability Tools
Andrew Mallaband: Mirko, you've got 160 customers, and I guess that's a big opportunity to really think about how to refine how you're using AI. I mean, it took several years, right? And we went through the same experience at SMARTS as well, which was another company I worked at, in the network management space. There, again, it took time. And I think that's the reality of where we are with this. The other challenge that I think a lot of these vendors have, again, is the data. If you don't have the right data in place, then you're not going to be able to produce the results. So I think it goes back to: you need a foundation to build on. That includes your strategy about the content you're going to collect, right? You know, what data am I collecting and why? You need telemetry pipelines which are able to shape the data. And I think you need a way to store it cost-effectively as well, right? Which is where data lakes come in, because we are going to have a data explosion. Things like ClickHouse are very important technologies in that regard, because they change the economics of actually storing and accessing the data. So, you know, many companies have got to get those things right, I think, in order to exploit the benefits that AI can give them.
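As a toy illustration of what "telemetry pipelines which are able to shape the data" can mean in practice, here is a minimal Python sketch; the keep-levels and sample rate are invented for the example, not a recommendation:

```python
# Sketch: shape telemetry before it reaches (and is billed by) the backend.
# Keep high-value records unconditionally, sample the rest. Thresholds and
# field names are illustrative assumptions.
import random

def shape(records, keep_levels=frozenset({"ERROR", "WARN"}), sample_rate=0.1):
    for record in records:
        if record["level"] in keep_levels:
            yield record                      # always keep errors and warnings
        elif random.random() < sample_rate:
            yield record                      # sample a fraction of the rest

logs = [
    {"level": "DEBUG", "msg": "cache miss"},
    {"level": "ERROR", "msg": "payment failed"},
    {"level": "INFO", "msg": "request served"},
]
print(list(shape(logs)))  # the ERROR record survives; the rest is sampled
```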
Mirko Novakovic: No, absolutely. I mean, we have an internal project called Agent Zero, which is our agentic approach, and I have to say, I was pretty skeptical. We built an MCP server, as you do, around it, and had to figure out what to expose and how to expose it, which functionality. And today we have an agent that does troubleshooting. If an alert pops up, you can click a button, and then the agent autonomously connects, reads the message, and tries to figure out the root cause. And it's mind-blowing. I was actually shocked how good it is. And, to be very transparent here, there's no real IP, because we are just using Claude as the model, using an agent, connecting it to the MCP server, and then you have to figure out what data to expose, how to describe it, and which functionality to offer. But then it does everything without any guidance from us, right? And the result is really, really good. I would say it's on the level of a very experienced troubleshooter, and I did troubleshooting for ten, fifteen years of my life. We have different scenarios in the demo application, where we tested it with real data, and also inside our own application, and where I would get there in an hour or two, that thing gets there in five minutes. We have to test it more, and pricing is a topic, right? How do you price that? It's also expensive. And it comes without any kind of hardcore machine learning; you essentially need no AI skill set to build that, right? Which is really interesting, because before, for causality and machine learning, you needed really hardcore people to do that. And now you just use a model, and the result is really the best I've ever seen in the observability space, right?
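For readers wondering what "exposing functionality over MCP" looks like, here is a hedged sketch using the official MCP Python SDK's FastMCP interface; the tool name, its parameters, and the query_backend helper are hypothetical stand-ins, not Dash0's actual Agent Zero implementation:

```python
# Hedged sketch: expose an observability query as an MCP tool that an agent
# (e.g. Claude) can call during troubleshooting. Requires: pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("observability")

def query_backend(service: str, minutes: int) -> list[dict]:
    """Hypothetical helper standing in for a real query to the backend."""
    return [{"service": service, "error": "card declined", "count": 17}]

@mcp.tool()
def recent_errors(service: str, minutes: int = 15) -> list[dict]:
    """Return recent failed spans for a service, for root-cause analysis."""
    # The docstring matters: it is how the model learns what the tool does.
    return query_backend(service, minutes)

if __name__ == "__main__":
    mcp.run()  # the model decides when and how to call recent_errors
```

The design point Mirko makes holds in the sketch too: the hard work is choosing which tools to expose and describing them well, not the plumbing.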
[00:29:26] Chapter 11: OpenTelemetry and the Future of Automated Troubleshooting
Andrew Mallaband: Right. But I think what data you actually feed it is very important.
Mirko Novakovic: Absolutely. Yes.
Andrew Mallaband: And I think the luxury that you have is that you're building a platform that provides a foundation for that.
Mirko Novakovic: I also think that OpenTelemetry is a benefit in this regard, because it's well documented, right? These models get trained on publicly available data, essentially. You can feed more data into them, but you have the semantic conventions, so all the fields are properly documented, and the AI actually understands what an HTTP status code means, right? It gets a reference to the actual meaning of the fields: what the 500 is, what the 404 is. All that data is there. And if you use that and provide the LLM with data it understands is OpenTelemetry, I think that has benefits, right? Because now you can benefit from all the publicly available documentation and use that for the troubleshooting process.
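To show what those well-documented fields look like, here is a minimal sketch of a span tagged with OpenTelemetry HTTP semantic-convention attribute keys; it assumes a tracer provider is configured elsewhere, and the route is invented:

```python
# Sketch: semantic-convention attributes make telemetry self-describing. A
# model that knows the OpenTelemetry conventions can read "500" as a server
# error without any extra explanation from the operator.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("semconv-demo")

with tracer.start_as_current_span("GET /checkout") as span:
    # Keys follow the OpenTelemetry HTTP semantic conventions.
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.response.status_code", 500)
    span.set_status(Status(StatusCode.ERROR))
```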
Andrew Mallaband: Well, it would be interesting to see what results it produces.
Mirko Novakovic: I will show you, I can show you. It really comes down to saying: hey, the problem here is a call of this service, this method, with those parameters, and here are suggested fixes, right? And now, by the way, you can also connect that to your Cursor or whatever AI coding tool, because they allow you to connect to MCP servers, and you can combine them. A developer could essentially say in the IDE: optimize the performance of this service. And it will use the MCP server to connect to Dash0, get the real-time data of that service, and then use that information to optimize the code, right?
Andrew Mallaband: Right.
Mirko Novakovic: So there are pretty interesting use cases that will evolve, in my point of view, where observability systems will also feed into these coding platforms.
Andrew Mallaband: I think there is another one as well, which I've been talking to vendors about for a little while now. You know, at the end of the day, if we're missing telemetry, why not have capabilities where we can actually discover what the gaps are and then, based on those gaps, have AI write the instrumentation that's required? You know, create a closed loop around that and shift left here, so that you're actually building the instrumentation in. Because one of the big issues is that it takes time and effort to instrument code, right? And that is often at the expense of features; you might get resistance from development teams and, you know, product managers. So why not find a way to actually do that in software?
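The discovery half of that closed loop could be as simple as diffing what is deployed against what is actually emitting traces; a toy sketch, with both lists invented:

```python
# Sketch: find telemetry gaps by comparing deployed services against the
# services seen in trace data. The flagged gaps are what an AI (or a human)
# would then instrument. Both sets are illustrative.
deployed_services = {"frontend", "checkout", "payments", "inventory"}
services_seen_in_traces = {"frontend", "checkout"}

uninstrumented = deployed_services - services_seen_in_traces
for service in sorted(uninstrumented):
    print(f"telemetry gap: {service} emits no spans; generate instrumentation")
```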
[00:32:25] Chapter 12: Automating Instrumentation and Closing Observability Gaps
Mirko Novakovic: Yeah, absolutely. Because you are talking to so many clients, I would ask you: what are they seeing, and what keeps them up at night when they think about observability, monitoring, AI and the future of the platform? Are they really into agents yet, or are they still struggling with the initial problems we were discussing, cost and data? Is the problem really AI, or do they think that AI will solve the problem? How do you see it?
Andrew Mallaband: I think there's a cycle, right? People like AI and the story around AI because, you know, it sounds great, right? It's compelling; everybody wants to have it. But then you have the reality of the data gap: people do not have all of the data that they need. So when reality hits, I'm finding that people are now looking more at the foundational things that they need to do, because otherwise they'll never be able to exploit the benefits of AI. Cost is clearly a major challenge. But again, if you look at the data problem and the foundational aspects of what you're doing, you can solve that problem as well. So fix the data strategy and you solve two problems in one hit: cost and AI.
[00:33:49] Chapter 13: Industry Sentiment and Closing Thoughts
Mirko Novakovic: That makes sense. Andrew, that was a really nice conversation. It's good chatting with someone who has been in this space for a while, right, and understands the issues. Thank you for joining Code RED.
Andrew Mallaband: Thank you very much for your time.
Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.