[00:00:00] Chapter 1: Introduction and guest background
Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0, and welcome to Code RED: code because we are talking about code, and RED stands for requests, errors, and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Spiros Economakis. Spiros is founder and CEO of NOFire AI, a platform using causal and agentic AI to predict and prevent production failures before they happen. He's a longtime engineering leader, author of Argo CD in Practice, and advisor to global SaaS teams. Exciting to have you here, Spiros.
Spiros Economakis: Happy to be here, Mirko.
[00:00:51] Chapter 2: The “pina colada” incident and lessons in reliability
Mirko Novakovic: I always start with my question: what's your biggest Code RED moment? And by the way, I already saw on the website that you had the pina colada incident or something. So tell me about that.
Spiros Economakis: Yeah, exactly. So one of the moments that shaped how I think about reliability was what we now call the pina colada incident with my ex-teammates. The whole team was at an offsite. I had just come back from parental leave. I had a newborn, and I was the only one awake and on call that day. And imagine, I was already lacking a lot of context, right? I had just come back, and there had been a lot of recent changes; the system landscape had evolved. So 15 minutes before my on-call shift was about to end, the US customers started logging in, and the system started literally falling apart. The users weren't able to log in; we had a spike of concurrent logins. And the funny thing was that it took us about one or two weeks to figure out what the real cause was. The fix for those two weeks was literally to fail over to the next DB instance: calm one of the instances down, recover on the other one, and for 10 or 15 minutes everyone could log in, doing this consistently. This stayed with me, mostly because very few people had a real understanding of the problem. Back then I was alone, I didn't know exactly what had happened, and I had all the telemetry in the world: traces, logs, metrics, dashboards, alerts that were triggering. But still it was hard for me to understand what had changed, to understand the behavior of the system, and to actually find it, even in a day. It took one or two weeks. So this is really one of my Code RED moments, because you can imagine what came after this: the root cause analysis for the customers, the postmortems, everything that happened once everyone could log in again.
Mirko Novakovic: So did you have the pina colada before or after the incident?
Spiros Economakis: No, the guys were drinking pina coladas. I was in the incident. That was the issue.
Mirko Novakovic: I get it, I get it. Well, that's good. And was that one of the reasons why you founded NOFire AI?
[00:03:14] Chapter 3: From firefighting to prevention: NOFire AI’s shift-left vision
Spiros Economakis: Yeah, exactly. This is actually how we think you should start tackling the problem.
Mirko Novakovic: I don't know the tool in more detail, so we should talk about it. But the way I understand it, and you can correct me, is that it is a tool that works as an agent in your environment, as a developer, to check dependencies, reliability issues, etc. before you actually merge and deploy the code, right? So essentially not getting to production and then doing root cause analysis, but doing the investigation before you actually make the mistake, the change, and thereby preventing problems. Is that correct?
Spiros Economakis: Correct. Correct. It's the full lifecycle, right? This is what we're trying to tackle. Because what we believe is that reliability starts when you actually write the code, not when an incident forces you to look for answers. And I think this is where the landscape is changing too. Because with the advancements of AI, we ship faster than we understand. AI creates code quickly, very fast, easily. But the connection between a change and how a production system behaves is not as strong as before, right? We are still nominally the reviewers, let's call it, but the speed of creation has outpaced the speed of reasoning. And when that happens, what I call the ownership triangle starts to drift: responsibility, control, and knowledge pull apart. We feel we deliver fast, but in practice we don't understand. So are we accountable for something we don't fully understand, or cannot even see? That's how a small change can affect five other services without anyone noticing. And that's why at NOFire we shift reliability left, which is what we discussed.
Mirko Novakovic: Yeah. Actually, this morning I read a comment on LinkedIn where somebody said: I don't care if you generate all your code with AI, as long as you understand it, right?
Spiros Economakis: Yes, exactly. I saw the same comment, and it's true. It's about still having the ownership. You should still have the ownership of this. But how can we also influence the coding agents to do a better job, to do a better job with the context of reliability?
[00:05:46] Chapter 4: Architecture and workflow: agent-to-agent integration
Mirko Novakovic: So how does it work? Is it an MCP server or how do you integrate? So walk us through it.
Spiros Economakis: It's an agent-to-agent approach. So it's an MCP server, right, which we integrate with your system. We discover all the dependencies; we understand all the changes, the drifts, the golden signals, how each service connects to the others, and what the cause and effect relationships are through time, so you can reconstruct what happened yesterday or on other days. And it gets all the context from the coding agent without really having access to your code, just the most useful information about what the change is and what exactly it touches, and it reasons based on different aspects: the knowledge we build consistently, in real time, from your infrastructure (Kubernetes, AWS, it doesn't matter) and from your observability. It starts getting information about what is happening at this moment, whether there are too many deployments, and what the state of the services you're going to interact with is. So we predict the cascade effects, and if you have past incidents, we identify that this same change has been applied before in a past incident, so you should avoid doing it. So there is an agent-to-agent approach where whatever coding agent you use gets NOFire AI's reliability knowledge to do the right thing. It doesn't matter if you are an engineer who is writing code or an SRE who is creating alerts, because you can get the information: is this alert overlapping with others? Does it drift slowly? Do I have alerts which are not useful, false positives, or too many that are the same? So you can even influence how you write alerts, how you create infrastructure, how you do coding. There are a couple of ways.
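The cascade prediction Spiros mentions can be sketched roughly like this: given reverse dependency edges (who calls whom), a change to one service can ripple to every transitive dependent. A minimal sketch, assuming a hand-built graph; the service names and the `blast_radius` helper are illustrative, not NOFire AI's actual implementation:

```python
from collections import deque

# dependents[x] = services that call x (reverse dependency edges)
dependents = {
    "payments-db": ["payments"],
    "payments": ["orders"],
    "orders": ["frontend", "reporting"],
}

def blast_radius(changed: str) -> set[str]:
    """Every service a change could cascade to, found via BFS."""
    seen, queue = set(), deque([changed])
    while queue:
        svc = queue.popleft()
        for dep in dependents.get(svc, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("payments-db")))
# -> ['frontend', 'orders', 'payments', 'reporting']
```

A change to `payments-db` here would surface all four downstream services as potentially affected, which is the kind of context a coding agent could be handed before a merge.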
Mirko Novakovic: Okay. And where is it integrated? Is it a GitHub Action that's triggered when you commit code, or is it something you manually trigger in your Cursor or Claude Code and then ask for?
Spiros Economakis: It's both, right? It's both. When you have the Cursor or Claude Code agent, whatever you use, it influences the workflow. So it changes the workflow you had before to be more reliable, because you get the reliability and production context. And you also get, during the pull request, the right information back as a team to understand: this is the change, this is what it touches, this is what it affects, this is what is missing, and this is the current behavior of production. It gives you a score analysis with go or no-go signals, with some kind of scoring, right?
Mirko Novakovic: How do you do the scoring?
[00:08:21] Chapter 5: PR gating and risk scoring
Spiros Economakis: That's the secret sauce, Mirko.
Mirko Novakovic: Hahaha, no, but what is the result? Is it a score between 0 and 100? And if you go below a certain number, then it's risky, or how does it work?
Spiros Economakis: Yeah, it's 0 to 10, and it just tells you, always based on the production behavior of course, that if you are under a certain threshold you are okay to go, because there is not much happening at this moment. Or: this change touches a lot of stuff in parallel, so be careful about it, or include people from the other services, for example.
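The 0-to-10 score with a go/no-go threshold could be sketched like this. The scoring function, its weights, and the signal names are all invented for illustration (the real scoring is, as Spiros says, the secret sauce); the point is only the shape of a gate fed by production context:

```python
from dataclasses import dataclass, field

@dataclass
class RiskAssessment:
    score: float                      # 0 (safe) .. 10 (high risk)
    threshold: float = 7.0            # hypothetical cutoff
    reasons: list[str] = field(default_factory=list)

    @property
    def go(self) -> bool:
        return self.score < self.threshold

def assess_change(touched_services: list[str],
                  concurrent_deploys: int,
                  matches_past_incident: bool) -> RiskAssessment:
    """Toy scoring: each production-context signal adds risk."""
    reasons = []
    score = 1.0 * len(touched_services)
    if concurrent_deploys > 2:
        score += 3.0
        reasons.append(f"{concurrent_deploys} deployments in flight")
    if matches_past_incident:
        score += 4.0
        reasons.append("similar change caused a past incident")
    return RiskAssessment(score=min(score, 10.0), reasons=reasons)

verdict = assess_change(["auth", "billing"], concurrent_deploys=3,
                        matches_past_incident=True)
print("GO" if verdict.go else "NO-GO", verdict.score, verdict.reasons)
```

Here the change scores 9.0 and fails the gate, with the reasons attached so the pull request can explain why, matching the "this is the change, this is what it affects" feedback described above.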
Mirko Novakovic: Okay. Makes sense. Makes sense. I wanted to ask: NOFire combines causal AI and agentic AI. Can you give us a little bit of perspective? What's the difference between causal AI and agentic AI, and where and why are you using both?
[00:09:11] Chapter 6: Causal AI plus agentic AI: why both matter
Spiros Economakis: Yeah. You know, from my experience in all these years, because I was a practitioner and I was leading on-call teams for a really long time, those teams really struggle. Not because they don't have enough data; as we see with today's environments, they struggle because they don't have the full context, a way to understand what's actually happening underneath all these signals, changes, drifts, telemetry, whatever. And for years, reliability automation was built on predefined runbooks: if this happens, something is triggered and it's fixed automatically. But it was very static. It was fast but brittle, and it only worked for scenarios we had already predicted. So we use causal AI because causal reasoning changes the starting point. Instead of counting spikes, you start looking at how events influence one another: what reacted first, what set things in motion, how the behavior eventually reaches the user. Think about it as what a senior engineer would try to reconstruct when there is an incident or a problem. So the causal part gives you the starting point of where you're actually going to start digging, instead of just going over the whole surface. And the agentic part builds on top of that structure. Why? Because it's the dynamic part; it's not static anymore, as we discussed earlier. It can reason through the situation, adapt its thinking, pull the right evidence from the fragmented tooling we have out there, and start converging on a meaningful part of the problem instead of chasing every alert.
Spiros Economakis: But to be honest here, and you have seen this too, you also have a product in Dash0: just pushing data into an AI doesn't mean it can really reason. Just throwing raw logs at it, or looping on an LLM 20 times, all those approaches. First of all, they can be really slow. Super expensive, especially with token pricing and the foundation models. And I would say sometimes also really wrong. Because if the AI doesn't understand the behavior of the system, which is what the causal AI provides, how all these things are connected, then we are actually just guessing. So that's why we did this. The combination matters because we feed the agentic AI with a real representation of how the system behaves, and it can reason more accurately and more efficiently, because it doesn't have to search through the whole surface. It knows that this change has propagated, or there was a temporal change, or there was a deployment which triggered other things, or new services were added. All of these are important, because this is how you combine the dynamic way of understanding what problem we are facing with the causal precision of how things are connected, to reason better, faster, and much more accurately. That's why we believe, and we also see in practice with the customers we have right now, that it's very accurate and very efficient so far.
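The "starting point" idea, finding where to dig instead of scanning the whole surface, can be sketched on a toy dependency graph: among the unhealthy services, a root-cause candidate is one whose own dependencies are all healthy, so its symptoms cannot be explained by anything downstream of it. Graph, health data, and the helper name are all illustrative assumptions:

```python
# depends_on[a] = services that a calls
depends_on = {
    "frontend": ["auth", "orders"],
    "orders": ["payments", "inventory"],
    "auth": ["users-db"],
    "payments": ["payments-db"],
    "inventory": [],
    "users-db": [],
    "payments-db": [],
}

# services currently showing symptoms (e.g. error-rate alerts)
unhealthy = {"frontend", "orders", "payments", "payments-db"}

def root_cause_candidates(depends_on, unhealthy):
    """Unhealthy services whose symptoms no unhealthy dependency
    explains: these are the places to start digging."""
    return {
        svc for svc in unhealthy
        if not any(d in unhealthy for d in depends_on.get(svc, []))
    }

print(root_cause_candidates(depends_on, unhealthy))  # -> {'payments-db'}
```

Four services are alerting, but the traversal narrows the investigation to `payments-db`: the causal structure prunes the search space before any expensive LLM reasoning runs, which is the efficiency argument Spiros makes.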
[00:12:16] Chapter 7: Building the dependency graph without eBPF
Mirko Novakovic: And you're talking about context and dependencies. So you have integrations into different tools, I assume. You integrate with observability tools etc. to get data, and then you kind of have to derive the dependencies from that data. Or how do you do that?
Spiros Economakis: Actually, there are two ways. We have an agent which is running in your cluster, in the Kubernetes cluster, and it doesn't use eBPF, but it deciphers all the dependencies between the different services. And not only that: you don't need OpenTelemetry in this case. You just need the agent to decipher all the information, without eBPF, independent of the rest of the tooling. Then, let's say we integrate AWS: you get more dependencies, and you can connect them, because you know that something is an external dependency. So you decipher that and reconstruct the graph through the agent, and you have the whole system picture. And based on this, we start creating what I told you about, the knowledge of how the system behaves, plus getting information from the telemetry data to understand more behavioral effects, in terms of performance or latency, the golden signal aspects, which we combine with this knowledge. So it's not integrated directly with the observability stack only, because that's fragmented; the data are not consistent. One vendor uses one thing, another vendor uses a different thing. So it's very hard sometimes to reconstruct the graph directly from observability.
Mirko Novakovic: No, no, no, that's totally understandable. But most of the SRE agents, if you look at them, mostly do it just with integrations. So you have your own agent running on the infrastructure, using eBPF to understand the system dependencies?
Spiros Economakis: Not using eBPF, actually. We are using DNS in this case to identify the dependencies. So there is no intrusion into your system; we don't need eBPF capabilities. We just decipher all the information through DNS, and it can support everything: Kubernetes DNS, BIND, whatever your environment supports. So you decipher the information from there, versus doing eBPF with admin capabilities in the kernel and all the rest.
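The DNS trick can be sketched as follows: if you can see which workload asked the resolver for which name (for example from CoreDNS query logs), every lookup is an edge from caller to callee, with no kernel privileges involved. The log entries, names, and suffix handling here are invented for illustration:

```python
import re
from collections import defaultdict

# (client service, DNS name it looked up), e.g. parsed from resolver logs
lookups = [
    ("checkout", "payments.prod.svc.cluster.local."),
    ("checkout", "mydb.abc123.eu-west-1.rds.amazonaws.com."),
    ("payments", "payments-db.prod.svc.cluster.local."),
]

# names under the cluster DNS suffix are in-cluster services
CLUSTER_SUFFIX = re.compile(r"\.svc\.cluster\.local\.?$")

graph = defaultdict(set)
for client, name in lookups:
    if CLUSTER_SUFFIX.search(name):
        target, kind = name.split(".")[0], "internal"
    else:
        target, kind = name.rstrip("."), "external"  # e.g. an RDS endpoint
    graph[client].add((target, kind))

for client, deps in sorted(graph.items()):
    for target, kind in sorted(deps):
        print(f"{client} -> {target} ({kind})")
```

Note how the RDS hostname falls out naturally as an external edge; this is the hook for the AWS integration Spiros describes next, which can then enrich that node with metadata.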
Mirko Novakovic: Well, that's interesting. And does it work only with Kubernetes then?
Spiros Economakis: Yes, but when you add more integrations, or you give it the DNS, you can decipher more. Think about it: I have a service which calls an RDS instance somewhere. This is an external dependency, and you can see that it is an external dependency. So when you do the integration, you can get rich metadata from AWS, or know that this RDS has more nodes underneath, for example.
[00:14:57] Chapter 8: Tracking drifts and temporal changes
Mirko Novakovic: Makes sense. So you create this dependency graph of services based on DNS information that your agent grabs from the system. And now you integrate, for example, with a Datadog service, and you can map that information back to the graph, and you understand dependencies, drifts, etc. You probably also store configuration data, or how do you understand drifts in configuration? Do you have history?
Spiros Economakis: Yeah, that's a good question. It's mostly events, right? We don't store the configuration value per se, but we know this value has drifted: for example, the persistent volume claim has changed, or other things like that.
Mirko Novakovic: Yeah, makes sense. It's also more efficient, right? I remember back when I was still at Instana, when we designed the system, we had this dynamic graph, which was a dependency graph between services, but also covered dependencies on the infrastructure: this thing is running on a server, inside a pod, calling a service. And we also had this idea of storing only changes. We stored whatever changed, so that we could basically go back and forth in time and see what happened.
Spiros Economakis: Yeah, exactly. And that's the causality, which is the game, right? You get the structural changes, the temporal changes, all the changes through time: configuration changes, what exactly has been triggered. Of course, as you know from Instana, it's a big data problem. You need to store thousands of data points, especially in large infrastructures and complex environments. But this is also what increases the accuracy. It's not only that you have some information from the observability stack, which is sometimes a very hard problem, because Grafana is different, Datadog is different, Splunk is different; everyone is different in terms of data modeling. So it's very hard to construct a graph from there, and not only that, it's also very hard to identify the drifts we are discussing.
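The "store only changes" pattern both speakers describe can be sketched as an event stream you replay around an incident, rather than full configuration snapshots. The event fields and example values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeEvent:
    ts: int          # unix seconds
    entity: str      # what changed (service, PVC, deployment, ...)
    field: str       # which attribute drifted
    summary: str     # human-readable description of the drift

events = [
    ChangeEvent(1000, "orders", "image", "deployment rolled to v2.3.1"),
    ChangeEvent(1450, "orders-pvc", "storage", "PVC resized 50Gi -> 20Gi"),
    ChangeEvent(1700, "auth", "replicas", "scaled 3 -> 1"),
]

def changes_between(events, start, end):
    """Replay the window: what drifted leading up to an incident?"""
    return [e for e in events if start <= e.ts <= end]

# Incident at t=1800: look back over the preceding window
for e in changes_between(events, 1200, 1800):
    print(e.ts, e.entity, e.summary)
```

Storing the drift as an event ("PVC resized") rather than the full resource keeps the data volume manageable while still answering the question that matters during an incident: what changed, and when.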
[00:16:54] Chapter 9: Where NOFire fits: the “understanding layer”
Mirko Novakovic: Yeah, absolutely. So where do you see your tool in this whole chain? I mean, you have the coding agents on one side and the observability vendors on the other side; you are kind of in the middle, in between coding and observability. So how do you see that landscape evolving: observability, your category, coding agents?
Spiros Economakis: You know, we hear this a lot lately, right? At conferences, at AWS re:Invent recently, there were a lot of discussions about observability slowly becoming the data layer rather than the reasoning layer, about dashboards going away, about what is happening with all this drift. And honestly, I don't believe observability is going away. It's still going to be fundamental; you still need telemetry. But teams and customers right now are trying to judge and debate: are we collecting too much data and learning too little from it? So besides the AI story, you also see the bring-your-own-cloud story: okay, I can have the telemetry in my own cloud, with cheaper storage, because the volume costs have grown faster than the value teams extract from it. So there are different kinds of advancements here. But even if we solve the data problem, there are other problems: the understanding we discussed a few minutes ago. So where I see things going, I think, is a layered model, which I have in mind: observability becomes the data layer, and there is also a layer which I call the infrastructure layer.
Spiros Economakis: Right. It's been about efficiency, cost control and how you decrease the cost heavily. Or you keep only the right data information. And there is on top of that, right. The understanding layer. So you utilize some of this information which connects to the different kinds of signal. And does this connection relationship that are cause and effect to help teams, you know, reason and understand what is actually happening. Now there is a fourth thing which includes all this stuff, right? Which includes I am the super vendor which does everything. But still, I don't think, you know, observability can keep up with pace today in terms of costs. That's why people are trying to find different ways. Right? So I don't think that it's going away. It won't be replaced. It's going to be the foundation, but it's not going to be the center of gravity. Right? I think we move more to understanding and prevention. And definitely it's not about collecting more telemetry. Right. So my tool, I think it's on the understanding layer, I would say which connects all these scattered information right to a cause and effect relationship to have better understanding. So we are actually on top, I would say on the observability layer. We don't touch the efficiency and cost in this case.
[00:19:52] Chapter 10: The evolving UI: agents over dashboards
Mirko Novakovic: It's a good question, right? We are discussing this internally too, and I'm thinking about what observability will be with AI. Yes, there is the option that observability becomes a data layer. I mean, obviously you need the data, right? You need the telemetry somewhere; maybe not everything, maybe you can do it more intelligently, but you need to store the logs, spans, metrics, and events somewhere. And you need to be very efficient in querying them, because even the LLMs and the agents need to query the data and understand it. So you need that query and database layer to be very efficient and scalable. And then the question is: what becomes the UI? Is the user interface an agent, or multiple agents? It could be an agent for troubleshooting; it could be a tool like yours for understanding the situation before you actually push code into production. So that's a good question. It's really interesting to see how this space will evolve, and what the user interface will be. Is it Cursor? Is it Slack? Is it its own UI? What is it?
Spiros Economakis: Right, exactly. We debate the same questions internally too. As you said, one thing we definitely see going away in the next few years is the UI for doing queries, query languages, and for creating dashboards. Because right now, if you think about it, connecting all these tools, you cannot create ad hoc dashboards that span not only my data, my Prometheus, or the specific telemetry and observability vendor I use, but everything, in an ad hoc way: give me the dashboard for this and this, including all the tools I have and the different services. So I think we see this shift already. People cannot keep up with learning more languages, and they cannot keep up with creating more and more dashboards to understand. They want to understand fast and very easily, in a natural-language way, and get something back. So that's definitely something that is changing. What we haven't answered yet, and I think it's also something interesting from your side, is what you said: is observability the data layer, or something else? Do you keep only specific information, insights only? What is it? That's something which is interesting. Fascinating, actually.
[00:22:25] Chapter 11: Data access dynamics and vendor strategies
Mirko Novakovic: Yeah. It's also interesting to see how the observability vendors will react to it, right? I mean, they could close off a lot of the data to those agents. They could impose rate limits, because they'd say: hey, why would an agent query all my data and create cost for me, while the value is in their tool? So I can see vendors at least rate limiting or putting limits on the APIs and MCP servers, so that essentially the user has to be in the observability tool, because that's the only place where you can really get access to the data. So I think it will be interesting to see how this will evolve, and whether vendors like you then start building their own data layer, because you'd say: hey, I have to own the data, so I'm building it. So I can see both sides, and I think they will kind of merge into the same category. Eventually you will have to get some of the data, because you cannot be dependent on external data only. And on the other hand, vendors like us have to create the agents, because we cannot give that space to external agent vendors and leave it to them. That's why we released our Agent0 platform with SRE agents and dashboard integration agents. But it will be an interesting dynamic in the market, and both categories will have a market, I think.
Spiros Economakis: Yeah, that's right. And there are different approaches by different vendors. There is the approach you mentioned, the big tent vision of Grafana: everyone comes into the tent and everyone benefits from it. Everyone can sell, everyone can be part of the platform, we are not limiting it, and so on. Maybe this is going to change. Prices for LLMs and tokens are going to go higher; that's something we will see for sure in the next couple of months, I believe, or the next few years. So you need to be efficient from that standpoint. You need to start thinking cleverly about what you do there: how you get access faster to only the specific information, instead of pulling everything.
[00:24:32] Chapter 12: The fragmented tool landscape and OTel adoption
Mirko Novakovic: You have multiple integrations, I see, into multiple tools. So in your customer base, what do you see? Do people have multiple tools, or is it only one? How does it look? Is it still pretty scattered?
Spiros Economakis: Everyone has the will. Everyone wants to go to unified observability, but very few actually do it. So what we see is that every company, and especially large enterprises, carries baggage: acquisitions, new teams coming in. So lots of the tools are still very, very scattered, with very little effort to go and unify them. And there are multiple ways to do it. One is: okay, let's put everything into one vendor. But they are thinking, if I do this, it's going to explode. So let's see what the alternative is: okay, I'm going to run an open source project in my data center or my infrastructure and try to manage the data retention, all the clever things you can do to save costs with S3 buckets and so on. But it takes effort and time, and priorities are elsewhere. With AI, everyone is in sleep mode: I want to sleep more, I want to sleep more. So very few are investing right now to unify. We have a customer right now with literally 11 tools for different things: MongoDB Atlas, Slack data, Splunk, Elasticsearch, scattered logs here and there, different vendors, one for metrics, one for logs. There are many, many weird things happening. There is a willingness, but I think we are not there yet; people don't go and do it. And the other thing we see heavily right now is the adoption of OpenTelemetry. You have probably seen this too, I suppose. It seems that OpenTelemetry has finally started being the number one to go to, but it definitely has its own challenges to scale, which we also see with customers: it's not implemented correctly, it's not configured correctly, many things can go wrong.
Mirko Novakovic: No, absolutely. But why do you think people are not consolidating? I always ask myself that question. I mean, if you have 11 different tools, it's not that hard to consolidate, right? And it makes so much sense to have the data in a single source, where you can also create better context. So I'm always trying to understand why. Is it more that developers and SREs like a tool and don't want to switch because they're used to it? Because with OpenTelemetry, it's easy to find a common language, and it's also kind of easy to consolidate. Yes, there are dashboards and alerts that you have to migrate, but it's not a rocket science project, right? So I'm asking myself: why are they spending so much money on different tools, and why do they not want everything in one consolidated view, which, from my point of view, makes sense?
[00:27:36] Chapter 13: Why consolidation lags and the CFO trigger
Spiros Economakis: It takes time, and I'm telling you from personal experience. Your CFO comes and asks: why do you have all these four or five things? It's the same observability tooling, used for different aspects. We need to consolidate; let's find a better deal. If it isn't a problem for the CFO, there won't be pressure on the teams to consolidate faster, and I don't see that coming yet. At least, lots of people are in the mode of: okay, let's live with this, because it takes too much effort to migrate from one vendor to another. I have a very close friend who is doing a migration right now from one vendor to another. It's like a seven-month project to move everything: alerts, metrics, dashboards, everything. So it takes a lot of time, and there are other priorities they need to handle in parallel. So unless the CFO comes and says we're spending too much on all these things and you are now a budget line item, it doesn't happen. I'm telling you, I have seen this in practice. We did this at my previous company: we moved to unified observability for exactly this reason. And we will see it, I suppose, because lots of costs are going up right now. Everyone is using AI; token costs are very high. So people are starting to ask: why are we spending so much here and here? So I think the consolidation will come. Is there a way to make it less painful? I think that's the question: how we can make it less painful to go from one to another.
Mirko Novakovic: And I mean, using your tool is kind of a way of doing it, right? Because you could start putting your understanding layer on top, connecting five different tools. And it doesn't matter to you whether they consolidate; it's still the same user interface for the user, because your user interface doesn't change. Is that correct?
Spiros Economakis: Exactly, exactly, Mirko. So you consolidate everything under one, and the consolidation happens without pain. At this moment, though, the costs for the other vendors are still there, so in some cases you still have the fight: okay, on top of this layer I have ten other observability tools; we should consolidate them down to one or two things maximum and try to find the best. But it needs to become a problem, a budget problem. If it's not over the budget red line, people or teams have it on the bucket list, but, I would say, low in priority.
[00:30:01] Chapter 14: Using NOFire as a consolidation bridge
Mirko Novakovic: Yeah. So how do you, I mean, you're kind of defining a new category, right, in this space. How do you find your users? You're not an AI SRE agent, which is a new category right now, right? You're not an observability vendor. You're kind of somewhere in between. So what's your perfect user, and how do they find you?
Spiros Economakis: You know, right now there's a lot of content we're putting out, a lot of events we're going to, a lot of cold outreach, just to be honest. Because I've been in this space for a really long time, across different kinds of reliability work, I have met a lot of people; there's a good network here. So things have started working now. But I think when people see the prevention, understand how the system works, and see that the predictability is becoming solid, that I don't have surprises, not many surprises anymore, that changes the whole discussion immediately. Because they don't want all this firefighting every day. They don't want to be in that war room almost every day. Even if you get the root cause analysis fast, you still have to dig into the details, you still have to identify more things. So preventing early is something everyone is trying to find at this moment. Until you start that discussion, everyone is focusing very much on the RCA, which is very efficient: with those tools you can get a lot of things very fast, it's very useful, and we also have this foundation underneath. But in practice, the prevention, the predictability, shifts the discussion immediately, because people want fewer incidents, they want to maintain excellence, they want to be safe when they release. And the team is also calmer, without the stress that if I click this button, or with continuous deployment, everything's going to explode. So this gives at least the engineering leaders guarantees to discuss, and more to discuss about how this happens. So the discussions move very fast in the beginning. Also, showing the platform gives us a different kind of boost, because you see in practice how things can actually be better, safe, and predictable.
[00:32:18] Chapter 15: Positioning vs. chaos engineering and “full context embedded”
Mirko Novakovic: Yeah, absolutely. I was just thinking about the category, right? Because at the end you are in the reliability space, which is also kind of the space of chaos engineering, if you think about it. So yours is a little bit like that, just different, as far as I understood. You don't run these chaos experiments; you take real data from production and then you try to derive from that what would happen if you deploy that code, or if something happens. So it's not the same, but it's very similar in the way you want to address the reliability problem, right?
Spiros Economakis: Yeah. Publicly, and lately a bit more, we call it "full context embedded", right from the concept. That's what we do in a company, right? We get an SRE joining or embedded in a team to instill best practices, explain how the system behaves, help people understand, and so on. But now you have all of that behind one integration. With this integration, you get all the information on your code while you are actually making the decision, and it gives you prevention capabilities and understanding without an SRE having to always stay close to you. So SREs can focus more on how to scale, on capacity planning, cost optimization, and so on, while the engineers get what they were missing all these years. So we call it full context embedded, and it works in different ways: pre-deploy, during the deploy, at runtime, learning continuously from all this interaction and from all the different data sources. So you get a continuous understanding. Compare this with the, I would say, more static way: we have seen all these chaos experiments, we tried to do chaos engineering, we did all of this in previous companies. But still, if you think about it, it's a game day, right? You're trying to tackle the problem just once a week or once per month, depending on the team. It's still very retroactive versus proactive. It's still useful, you need chaos engineering to practice with your team, but you don't get the real thing when you code, the real knowledge, the real understanding. But yeah, we're defining a category, and you know how that goes. It's a very hard thing. But we see a lot of interest so far.
[00:34:51] Chapter 16: Predictions for 2026: cost, UI, and prevention
Mirko Novakovic: Absolutely, I can see it. Full context is a phrase a lot of people use at the moment, and I think we all learned that agents only really work well when they have the full context, right? That's something we're all learning along the way while building those agents, and we say the same, right? We also say, oh, you need full context, otherwise the results will not be that good. How do you see this space evolving? What's your prediction for 2026? We are almost at the end of 2025. So how is this space evolving, and what will happen to observability and this whole space in 2026?
Spiros Economakis: Honestly, I think we will see advances in efficiency and cost, right? Because vendors need to tackle this problem. We will see them trying hard to do clever things about what data to keep and what to maintain, and so on, so they can regain the momentum they had, because there's a lot of discussion about this right now. So we'll see a lot of work on efficiency and cost, and I think this is going to happen through AI, honestly. It's not going to be just data retention and clever policies and other stuff; I think it's going to come through AI. And the other thing: I believe we will see a heavy transformation of the observability UI. I think this is going to change completely next year. And we will see lots of companies start thinking about what we do in terms of prevention, right? They will try to figure out: I have all this data, how can I use it properly to prevent problems faster and earlier? So I think this is also something we've seen rising up over the last couple of months. So yeah, 2026 is going to be fascinating. In terms of cost, and I believe the UI and user experience will change completely, and other companies will slowly start touching prevention as well. That's what I believe is going to happen.
[00:36:48] Chapter 17: Closing and acknowledgments
Mirko Novakovic: Yeah, absolutely. I totally agree on the user experience part, by the way. I think user experience with agentic AI will change totally, right? We have to rethink the way we design software with agents first in mind. So absolutely. Yeah. Spiros, awesome. That was fun talking to you, and I'm looking forward to seeing how your tool evolves and your company grows.
Spiros Economakis: Thanks a lot, Mirko. Thanks for the invite. It was a great discussion.
Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.