Episode 31 · 50 mins · 8/15/2025

#31 - Code RED LIVE: Beyond Hype - The Real Impact of AI on Observability

host
Mirko Novakovic
guest
Ben Blackmore

About this Episode

We’re taking the Code RED podcast public! Join Dash0 CEO Mirko Novakovic, CTO Ben Blackmore, and Principal AI Engineer Lariel Fernandes for a no-fluff look at AI in observability. We’ll dig into:

→ What agentic observability might actually look like
→ How OpenTelemetry enables the AI ecosystem
→ How AI shows up in real engineering workflows
→ And what still needs to be built

Transcription

[00:00:00] Chapter 1: Setting the stage - AI’s role in observability

Mirko Novakovic: Hello everyone and welcome to Code RED Live! I have with me Lariel Fernandes, Principal AI Engineer at Dash0, and Ben Blackmore, co-founder and CTO at Dash0. Welcome to Code RED. Today we want to talk about AI in the context of observability. So I want to get started by discussing what AI actually means, right? These days, I think in observability we have always used some sort of AI, be it statistics, anomaly detection, machine learning, or causal AI. But these days we talk more about Gen AI. Lari, let's start with you. Do you want to take us on the journey of what AI means and what the interesting part is for us?

Lariel Fernandes: Yeah, absolutely. So as you said, back in the day we used to have observability tools powered by AI, right? So you expect to have things like pattern detection, clustering, search, indexing, anomaly detection or causal identification, and you expect people to use these tools to do a better job in observability. And nowadays, with Gen AI, we have yet another layer on top of this, because we start moving to a scenario where AI has agency, like the humans working in observability. It should build on top of the algorithms that we have traditionally had to get things done automatically. And it opens up a lot of possibilities.

[00:01:39] Chapter 2: Reality check on coding agents and productivity

Mirko Novakovic: Yeah, that makes sense. And we are discussing coding agents a lot at the moment. So I want to start with you, Ben. I told you, as probably a lot of CEOs do: I want that performance boost. So please test all the possible coding agents and see if we get to that ten x productivity, right? So take us on the journey. I think you tested all the tools. Where are we today? What are we using today, and how are things going?

Ben Blackmore: Yeah. That was definitely exciting to start with, right? Because the promise is huge. So if somebody tells you everyone gets ten x faster, it sounds really awesome and really cool, and you don't want to lose out on this. And it also maybe gives you that hope that you can keep the team really nice and small, right? And have everyone just super productive. Because that works better also for a product like the one we are building, where we pay a lot of attention to the product and product love. And yeah, so we went on a journey testing it, right? Discussing a lot, looking at various tools: obviously there's Cursor, there's Windsurf, there's Claude Code, and there are so many others. Both on the, let's say, pure coding side, but also the code review side, because there are a lot of other tasks that engineers are doing. And yeah, we have spent a lot of time looking into those and testing those, also overcoming the initial skepticism that of course many engineers have, us included, when somebody is going to take away what you really like to do. And yeah, I mean, I've told you, right? And I think you posted that a few times as well.

Ben Blackmore: I've spent a few thousand on a variety of tools. I'll be honest, some trial fatigue is also setting in, and yeah, we have found a few tools that our engineers like, right, that they get value out of. I would say first and foremost that's Claude Code, as a tool that you use, first and foremost, in a terminal. It's not as integrated as Cursor, but it's working remarkably well for larger products such as ours that exceed just a web app or whatever. But unfortunately I have to say that we are not getting a ten x boost out of it. Admittedly, we are still learning. We are learning that this is a tool in a tool belt, and that you have to learn how to use the tool, just as with anything else that you have. And we're seeing, depending on the task, that it's a lot faster, like building; our prime example was building out these alerting integrations, where the scaffolding can be done fully by any coding agent, really. But then there are all the other tasks that engineers have that are not automated, and this is where you still aren't a lot faster, for example.

[00:04:36] Chapter 3: Introducing MCP and Dash0’s agent toolbox

Mirko Novakovic: Yeah. I mean, still, we are excited about it, right? And we are very early on the journey. I think the models will get better, we will learn better how to use those tools, and the tools will learn better how to use the models. And that's also why we went on this journey to integrate into coding agents, right, Lariel? So we built something called an MCP server. What is that? What is MCP? And why have we built it?

Lariel Fernandes: Yeah. Let's talk about MCP. So for anyone who hasn't heard about it, MCP is this protocol that basically allows AI agents and assistants to interact with some environment where they can either perform actions on your behalf or get data to have more context to solve their tasks. At Dash0, we have our own hosted MCP server, which we have extensively tested with different AI systems, including Claude Code and Windsurf. And it basically allows the agent to have access to the same features that a human user would like to see in an observability platform. So as a human user, if you're debugging an issue, you're usually interested in traces, you're interested in looking at logs, metrics, running queries, exploring what resources are available. And there are also some higher-level tools for debugging, like the triage of logs and traces, which is really powerful. All of this we try to mimic in the form of a toolbox that the agent has access to through MCP. So it's a very easy setup, just pasting a piece of configuration into your client, like Claude Code, and you can start asking it questions about what's going on in your cloud infrastructure. Why is this breaking? How can I make this faster? And it becomes even more powerful when you combine it with different MCP servers, because if you have something like Linear or GitHub and so on, it can correlate the context from your documentation to the tasks that you have at hand to what's going on in the infrastructure, make code changes, get feedback from the infrastructure. It really builds this nice development cycle from the agent's perspective.
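
As a rough illustration of the setup Lariel describes, here is what registering a hosted MCP server in an agent client's configuration might look like, sketched as a Python dictionary. The endpoint URL, header, and tool names are assumptions for illustration only, not Dash0's actual values.

```python
# Illustrative sketch only: the endpoint, auth header, and tool names are
# hypothetical placeholders, not Dash0's real configuration.
mcp_client_config = {
    "mcpServers": {
        "observability": {
            # A hosted MCP server the agent client can reach over HTTP.
            "url": "https://mcp.example-observability.dev",
            "headers": {"Authorization": "Bearer <api-token>"},
        }
    }
}

# Conceptually, the client then exposes the server's tools to the model,
# e.g. tools for listing resources, querying telemetry, and running triage.
example_tools = ["list_resources", "query_logs", "query_metrics", "get_traces", "triage_errors"]

if __name__ == "__main__":
    print(mcp_client_config)
    print("Tools the agent might see:", example_tools)
```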

[00:07:07] Chapter 4: Triage as a human-modeled capability for agents

Mirko Novakovic: Yeah. We will look at our SRE agent later, right, the prototype that we've built. But what I found very interesting is that those agents we have are very much like a human, right? I mean, it's very interesting if you watch an agent work autonomously: it uses the MCP, it gets a result, learns from it, interprets the text that it gets, and then tries to do the next action or get the next data. It does it in a way that, if a human struggles doing it, the agent will also struggle with it, right? If you provide functionality that the human can use to troubleshoot, the agent will be better. And you mentioned triage, Lari. Maybe Ben, you can explain a little bit what triage is, why it's so good for troubleshooting, and why it's also so helpful for an agent.

Ben Blackmore: Yeah, that was a really fascinating learning, right? Treating it like a human. And what we have in there, Dash0 triage, you can imagine as a form of pattern analysis over spans and logs, identifying commonalities between two sets of data. Like, let's say you want to identify what is special about my errors.

Mirko Novakovic: Can you do it, by the way?

Ben Blackmore: I can, I can also show it. Give me a moment.

Mirko Novakovic: We haven't discussed it, but it's live, right? I can ask whatever I want. Yeah.

Ben Blackmore: Unscripted.

Mirko Novakovic: And because I think it's nice to see what it really does. Right. And then how an agent uses it.

Ben Blackmore: Yeah. Right. So I don't know whether I'm on screen, but.

Mirko Novakovic: Yes.

Ben Blackmore: What we can do in Dash0 is analyze what is special about the errors of my product catalog service, for example. This is within the OpenTelemetry demo here. And what it can very nicely surface, in this case for tracing data, is: for spans that have an error, what is special about them? Is there any attribute that stands out? What we do in Dash0 here is summarize that for you, and we are basically surfacing, for example, that this specific product ID correlates very highly with errors. So this ID is occurring on error spans but also on successful ones, which is pretty interesting to see, also from the point of forming a hypothesis: if it's occurring on both, maybe that just means it's flaky, or it is always failing and these are two different span types. We could further drill down into that, but we can also just read this first line as a form of sentence: what exactly is it that is failing? What RPC method is it? What operation name? So this is a really powerful capability at the end of the day to surface information within tracing data and logging data. And where this gets really interesting, and where this learning comes in, is that we have this capability, and you can do way more with it, like analyzing specific areas, for example, to troubleshoot latency problems. And in the very first attempt of integrating this in an MCP server, what we did is we basically gave it the API, which means you have all the flexibility in the world.

Ben Blackmore: And what we learned is that the LLM basically had no idea what to do with it. Right. So this very naive approach of just exposing the API and all the knobs that there can be didn't really work. And so as we were experimenting, we were starting to think about, hey, we also have learnings from users, right. We were also shipping it to users. We were learning from them. What are the patterns they're using it for? What do they find most valuable and how would they like to use it? And so we just started to repackage it basically, right. Give it less options and somehow treat it like a user. And at that moment it really understood finally what to use it for. Right? And this was a really interesting learning, both for that capability but also for others. Like how do you summarize information about a service for example? Right. So we tried the similar things there and then the same thing happened. Right. If you model the sidebar content that we have in the product, for example in a textual interface for the LLM, then this also very notably improved. Then the following steps that the LLM was taking.

[00:11:29] Chapter 5: Designing LLM-friendly outputs (markdown, schemas, and tokens)

Mirko Novakovic: And how does it work, Lariel? So we return a text representation, or how does it work? If the LLM, and I think we use Claude at the moment, so if Claude, essentially that LLM, calls our MCP and uses triage, what is the answer and what does the LLM do with it?

Lariel Fernandes: So that's a really good question, because we're actually still iterating on that matter. At the moment we are betting on markdown content. So we basically model what the user would see in the UI as a piece of markdown that the LLM can then interpret. And we also have to remember that there is a human in the loop in most cases: there is a human who has interacted with Claude to ask it to solve a problem, and the results of the tools need to be consumable and digestible for the human as well. Markdown is a nice format for an LLM to understand, and humans also get a nice render in whatever user interface they're using to communicate with Claude, so they see what the result was. Then they can also judge whether the solution proposed by Claude is well grounded in the information that was returned. But we're still experimenting, because the MCP protocol is really nice when it comes to output formats. You can have structured output conforming to a certain schema, you can have embedded content like images, like a snippet of a dashboard, or you can have embedded UI elements, and it really opens up lots of possibilities. So we're really experimenting with what works better, both for humans and for AI.

Mirko Novakovic: Yeah, that makes sense. And inside of the markdown, there's literally text and information about the traces, or what? How can I imagine that?

Lariel Fernandes: Yeah. So we try to make it as compact as possible, because when working with telemetry data within LLMs, the problem is that telemetry data can be very token hungry, we say. It can become very verbose, which means you'll pay a lot for processing those tokens, it becomes slow, and it starts increasing the likelihood of hallucinations. So we try to model the outputs in a way that is digestible for large language models. And sometimes this may involve some kind of post-processing, so we can employ different algorithms: think of traditional machine learning algorithms, things like clustering, summarization, forms of representing information at a higher level. So instead of giving agents raw data that is very verbose, you give them the result after the numbers have already been crunched.
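
As a sketch of what compact, markdown-shaped output can mean in practice, the snippet below aggregates raw spans first and hands back a small table instead of thousands of rows. The structure and field names are illustrative, not the exact format the Dash0 MCP server returns.

```python
from collections import Counter

def summarize_errors_as_markdown(spans, group_by=("service.name", "rpc.method"), top=5):
    """Collapse raw error spans into a small markdown table so the LLM
    (and the human in the loop) reads aggregates, not thousands of rows."""
    groups = Counter(tuple(span.get(k, "-") for k in group_by) for span in spans)
    lines = ["| " + " | ".join(group_by) + " | errors |",
             "|" + "---|" * (len(group_by) + 1)]
    for key, count in groups.most_common(top):
        lines.append("| " + " | ".join(key) + f" | {count} |")
    return "\n".join(lines)

# Hypothetical raw data: 1,000 error spans collapse into a two-row table.
spans = ([{"service.name": "product-catalog", "rpc.method": "GetProduct"}] * 800 +
         [{"service.name": "checkout", "rpc.method": "PlaceOrder"}] * 200)
print(summarize_errors_as_markdown(spans))
```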

[00:14:39] Chapter 6: Leveraging OpenTelemetry semantics to boost agent reasoning

Ben Blackmore: Yeah. And I can also show what this looks like in Claude Code. I'll just quickly screen share. I don't know, can you see it now?

Mirko Novakovic: Not yet.

Ben Blackmore: Somebody can put me on.

Mirko Novakovic: Yeah, yeah.

Ben Blackmore: It's on screen. Thank you, studio.

Ben Blackmore: So, for example, when you are integrating this in, let's say, Claude Desktop, you can see, when you query something, what it's doing through these tool calls.

Mirko Novakovic: So you ask why is my cart service failing. So that's the question. And then it connects to our MCP server and checks for the cart service and what's happening, right?

Ben Blackmore: Exactly, right. So first of all it starts finding out what the name of the service really is, because very often humans use a small variation of it, not the exact technical name. And then it's using, for example, the service catalog in order to identify it, through which it's already getting some highlights. And if you look at it, and this is a bit hard to see here in this format, it's easier to see in the following call where it's then getting the details: we have been experimenting a lot with a tabular structure. So Lariel has been trying a lot of formats just to see in what format we can represent this nicely and concisely, as she said. And then, picking up on the point of summarizing it and putting it in a format similar to how a human would be able to conceive it or understand it, it is essentially following the structure that we have in the sidebar in the product.

Mirko Novakovic: And I have a question when I see those things. Essentially, the column names are attribute names from the semantic conventions, right, the OTel semantic conventions. Does that help? Because it's an open format and probably the models are trained on that documentation. Does it help the model to understand what it's looking at?

Ben Blackmore: Yes, absolutely. This is one of the huge benefits of OpenTelemetry standardization, not just of the data format, which we have had before in various permutations, but of the semantic conventions. As they came out, that was really a game changer, and it will be a game changer going forward as more and more people adopt it and AI can just learn about this. I mean, they are learning about it anyway because it's in their training sets, it's so widely present on the internet, but they can learn what some of those attributes mean, right? Some of them, of course, are more self-explanatory, like a cloud availability zone, but there are much more nuanced ones that they can then work with down the road. And that's really the game changer. That extends both to the attribute keys, but also to things like metric names that OpenTelemetry is also expanding to. So if an LLM wants to inspect, let's say, the HTTP request frequency or duration, they can then know how to, or let's say they have some of that knowledge within their training set. Ideally they still have a way to figure out what is available, and we have some of those things within the product, so they can learn what is available for a specific service, for example. But that is a true superpower in the long term.
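
For readers unfamiliar with OpenTelemetry semantic conventions: they standardize attribute keys and metric names, so a model that has seen them during training can interpret a telemetry summary without extra explanation. A small illustrative snippet using semantic-convention names follows; the concrete values are invented.

```python
# Attribute keys and the metric name follow OpenTelemetry semantic conventions;
# the concrete values are invented for illustration.
service_summary = {
    "service.name": "product-catalog",
    "cloud.availability_zone": "eu-west-1a",
    "k8s.pod.name": "product-catalog-7d9f6c-xkq2p",
}

# Standardized metric names mean an agent can guess what to query, for example
# the HTTP server request duration histogram defined by the conventions.
metric_of_interest = "http.server.request.duration"

print(service_summary)
print("Query candidate:", metric_of_interest)
```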

[00:17:56] Chapter 7: From chatbots to SRE agents - practical autonomy with oversight

Mirko Novakovic: Yeah. And that's kind of where we are coming to, right? How to use that in observability. I think we have seen multiple iterations already in a very short period of time. We did this, our competitors did it. At the beginning, because we all knew ChatGPT, everyone was kind of in chat mode. So at the beginning I think we saw a lot of these chatbots, right, where you could ask questions in a chat: hey, give me all the logs of service x, y, z. So it was kind of replacing the query language or the filtering with a more conversational form of user interface. I'm not the biggest fan of it, to be very honest; it always looks a little bit like Clippy in those days, right, where you have some sort of bot popping up. And I think we are also seeing that a lot of the vendors are not that enthusiastic about it anymore. And then I would say the next step, and that's where we are right now, is more the agentic world, where people talk about agents, and agents are, I would say, little tools that take over simple tasks, tasks that normally a human would do. And we are talking about a category now that I think Gartner is talking about.

Mirko Novakovic: Others are talking about what's called AI SRE agents. There are companies out there who are building AI SRE agents, which is an agent that connects to the MCP servers of one or multiple tools, tries to figure out what the problems are, and then, in the best case, automates a task, right? Like scaling your infrastructure up and down for cost reasons, or, I don't know, restarting something because there is a problem. Which, we all know from our own history, humans tend to be very cautious about: letting an agent autonomously do something in your production is avoided. It's a little bit like autonomous driving at the moment: I would say we still have a driver inside, and we look at what the agent is doing, and then we probably say yes, do it, or don't do it. But we are also working on that type of agent, right? We internally call it Agent0, and maybe you can pop it up and Lariel can talk a little bit about what we are doing. It's kind of an experiment at the moment, I would say, and we want to release it in the next few weeks or months. But it is an SRE agent. And yeah, what are we seeing here?

[00:20:40] Chapter 8: Agent0 - Dash0’s SRE agent prototype

Lariel Fernandes: Ben or me?

Mirko Novakovic: Lariel. Go ahead.

Lariel Fernandes: Yeah. So basically, Agent0 is an assistant for debugging certain alerts. So you might have failed checks, or something in a critical or degraded state. You can go to the agent tab and have it diagnose what is going on with that service, what is causing that failed check, and propose what should change or how you could possibly fix it. And in order to achieve that, it's basically a language model connected to different tools: tools that allow it to look at what metrics are available, build queries, run queries, and look at the results. And in this case, because we own both the back end and the front end for the agent, we have the opportunity of showing the results of tool calls in a really nice way. So you can see the query that was executed and a snippet of the chart, and you have a tooltip on it. So it's fully integrated into the UI, which is really nice in comparison to what you would get in a generic agent or client UI. And in the end, what you have is a diagnosis of the problem, after different hypotheses have been tested, and, based on the data that was retrieved, a human-digestible explanation of what's going on in that system.
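
Conceptually, the loop Lariel describes is a language model that alternates between choosing a tool (list available metrics, run a query, inspect results) and reasoning over the returned data until it can state a diagnosis. Below is a very reduced sketch of such a loop, with a stubbed-out model standing in for the real LLM and invented tool names; none of this is Agent0's actual code.

```python
# Reduced sketch of an agent loop; "fake_model" is a stand-in for a real LLM
# call, and the tool names and outputs are illustrative only.
def list_metrics(service):
    return ["http.server.request.duration", "app.orders.failed"]

def run_query(query):
    return {"query": query, "result": "error rate 12% on product-catalog"}

TOOLS = {"list_metrics": list_metrics, "run_query": run_query}

def fake_model(transcript):
    """Decides the next tool call or the final answer based on what has
    happened so far (a real system would ask the LLM here)."""
    if not any(step[0] == "run_query" for step in transcript):
        return ("run_query", 'sum(rate(errors{service="product-catalog"}[5m]))')
    return ("final_answer", "Errors concentrate on product-catalog; check the recent deploy.")

def agent_loop(alert, max_steps=5):
    transcript = [("alert", alert)]
    for _ in range(max_steps):
        action, arg = fake_model(transcript)
        if action == "final_answer":
            return arg, transcript
        transcript.append((action, TOOLS[action](arg)))  # execute the tool, record the result
    return "No diagnosis within the step budget", transcript

diagnosis, steps = agent_loop("product-catalog check degraded")
print(diagnosis)
```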

Mirko Novakovic: Yeah. And that's exactly the case that we have seen before, right? There's a product catalog service with higher error rates, so this is red, right? And then we do a root cause analysis, and it probably used that triage feature we've seen, and it basically tells us that there are two product IDs that cause that error right here. Yeah.

Ben Blackmore: Exactly. And the interesting thing is, this is of course a very early version, mapping to what we have also seen in Claude Code. There's a lot more that you can do based on this to make it a lot more advanced. But I think what's really interesting here is how to integrate this into a product, because you kind of need to understand what it has been up to, and you also need to cross-validate what it has seen. That is the idea; it's not fully there in this version, just to see what exactly it is doing. So you can basically treat it like an expert, right? I know, Mirko, you have been doing troubleshooting for a living, and not everyone is able to do this. So having a buddy there, even if the buddy is a quote-unquote robot, can be really helpful for many, right? It can be a real enabler.

Mirko Novakovic: Yeah. I have to say, when I saw this the first time, I was pretty shocked, right? Shocked in a positive way: how good the analysis was. I mean, I also know that it doesn't work all the time, and you have to investigate, and we are still figuring out, as Lari mentioned, how to present the data and how to do this integration best. But at the end of the day, if you look at this here, we are talking about thousands of spans. If you have errors, you have to find the root cause, right, the needle in the haystack. And essentially this agent did this job. And when you let it run, it probably takes a minute, right?

Ben Blackmore: And I haven't reloaded it because of rate limiting from Anthropic.

[00:24:19] Chapter 9: Reliability, variance, and evaluation challenges

Mirko Novakovic: But in a minute it figures out what could take your best developers hours or days to figure out, and it gives you a nice explanation. And now, again, coming back: you could integrate it into your coding agent and let Cursor or Windsurf suggest the fix, right? Because now it can connect to the code base, to GitHub or whatever, through another MCP, which is what Lariel meant by using multiple MCP servers. And now it could say, oh, and by the way, do this and that. So it's very powerful, and you can just imagine how multiple agents will work together in the future. You have access to multiple MCP servers, and those servers will get more and more intelligent. So I think we are pretty bullish on this agent topic, and we will release more and more of these root cause agents, agents for automations. I can imagine agents that will update your dashboards for a new release, or update your alerts for a new release. I think there are many tasks that can be done with that technology.

Lariel Fernandes: That's a really good point, because when people talk about agents and observability, they often think of root cause analysis and incident resolution, because it's an obvious use case, right? It saves people a lot of time when time is critical. But we don't need to limit ourselves to that use case, because there is just so much more that you can do, like a platform team with an agent like that. If you think of the work that humans have been doing as SREs and platform engineers, a lot of it is not really incident resolution, but rather preemptive maintenance: doing stuff like migrations, improving infrastructure or performance, looking for bottlenecks so they can address them, making things more secure, identifying security issues. And lots of those things can also be automated with an AI agent, especially when it has context of the telemetry and the code and the documentation and the open issues on Linear, so it can connect all of those contexts. And we already have some very good examples, success cases with our MCP server when combining it with Claude. One from the other day was when I needed to update our instrumentation for language model clients. There is this new generative AI semantic convention, so in theory I would need to read what changed in the semantic convention, then look at the library, then change it and test it, see if the telemetry comes out compliant. And basically I was able to automate that with Claude, because it could look at the docs, make the change, run a script to generate some signals, look at the data in Dash0, match it, iterate. And within a couple of minutes I had a fully up-to-date instrumentation implementation for language model clients, which would have taken me a couple of hours to work on.

Mirko Novakovic: Yeah, and what I like about those use cases is that you really automate the stuff that people don't like to do, right? As you mentioned, Ben, with some of the coding agents, the fear is that, I mean, as an engineer, I was a developer myself, you kind of love developing code, right? So if somebody takes that part away from you, it essentially takes away one of the most fun parts of your job, which is probably why you would resist more. But if it takes away the task where, for example, I release a new service and now I have to update 40 dashboards and 30 alerts, put that new metric in or change the metric name, that's nothing that people really enjoy. And if you have an agent that takes away that work: nice. And we already did something very similar with log AI, right? One of the things that you have to do if you have unstructured logs is define all these patterns. For example, if you have a log and it has a product ID and a product name, you have to define: oh, by the way, this is how this log pattern looks, it's like a regex, right? And sometimes you have to maintain hundreds of these regexes, and they change, and then you have to update them. So one of the first things, Lari, you worked on was saying, hey, let's remove that burden, let's try to analyze the logs automatically, detect the patterns, and then derive the attributes out of it. And it works pretty well, right?

Lariel Fernandes: Yeah, yeah. Maybe Ben can show a screen of the log patterns to give an overview of it. But yeah, as you said, nobody likes writing regular expressions; if someone says they like it, I don't believe them. And you know, those patterns can change at any time, sometimes without your developers knowing that they're going to change. You ship, for example, an update in a dependency that is writing the logs, and you don't know that the new version is going to be writing the logs in a different pattern, so you don't know that you have to update the regular expressions. It would be really a mess having to manage those by hand. And then what we did with log AI was to identify those patterns eagerly, every time that there is a change in the distribution of log texts, so that we can display this nice view in the UI where users can filter logs that match a known pattern and search for structured fields. And one thing that I really like about this is how it goes beyond the identification of the pattern. It also gives semantics to the pattern by giving names to the fields that it extracts, and the names make sense in the domain of the application, because it takes into account the resource attributes and the semantic conventions to really understand what kind of application it is talking about, so what field names make sense in that domain. And for developers, when they deploy something new and then they go into the UI and see that their logs are understood, it's really magical.
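
To illustrate the idea of pattern mining on unstructured logs, here is a deliberately naive sketch: mask the variable parts of each message to derive a template, then group messages by template. The production feature works differently (and, as Lariel notes, also names the extracted fields using resource attributes and semantic conventions); this is only a toy.

```python
import re
from collections import defaultdict

# Naive illustration: mask obvious variable tokens to derive a log template.
MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f-]{27}\b"), "<uuid>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<num>"),
]

def template_of(message):
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message

logs = [
    "fetched product 42 in 13.2 ms",
    "fetched product 7 in 4.0 ms",
    "payment failed for order 9f1c2d3e-aaaa-bbbb-cccc-111122223333",
]

patterns = defaultdict(list)
for line in logs:
    patterns[template_of(line)].append(line)

for template, examples in patterns.items():
    print(f"{len(examples):>3}x  {template}")
```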

Mirko Novakovic: Yeah. And Lari, I just want to mention, because we have some people in the audience here: if there are any questions with regard to AI and observability, how we use it, about those patterns, please feel free to ask, and we'll try to answer them. You're very welcome to get into that conversation. Yeah. And what we have seen here is that we really extracted the attributes from the message automatically, without any patterns, and it would adapt, right? What are the challenges here? I mean, I know that with some customers we process millions of logs per minute, or even per second, right?

Mirko Novakovic: It's at a high scale. And I can imagine that you can't call an LLM for each of those logs, right? So how does that work?

Lariel Fernandes: Yeah, a very common challenge of AI in observability is that things get expensive really fast if you don't take care, right? So we had to be really careful about this and come up with clever caching strategies, and also strategies to identify when the logs coming from a certain application change in distribution. In statistics we usually talk about distributions, right? And when you think about logs, they have certain word frequencies that can be identified with traditional natural language processing algorithms, and if the distribution of those changes, then you usually suspect that something in the patterns has changed and that it's a good occasion to re-evaluate them. So we have many mechanisms in place to apply the full logic of the pattern mining as few times as possible and to reuse as much information as possible, so that people have up-to-date patterns, with names for the fields that make sense in their domain, without it becoming an expensive thing for us. Especially when we remember that this is a feature we offer for free, right? People only pay for the data that they send, and they get this out of the box. So we need to make it really efficient, and that's, to me, the most challenging and also the most exciting part.
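
A toy version of the trigger Lariel describes might compare token-frequency distributions between two time windows and only re-run the expensive pattern mining when they drift apart. The statistic and threshold below are arbitrary choices for illustration, not the actual mechanism.

```python
from collections import Counter

def token_distribution(lines):
    counts = Counter(token for line in lines for token in line.split())
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

def l1_distance(p, q):
    """Total-variation-style distance between two token distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys) / 2

def should_remine(previous_window, current_window, threshold=0.3):
    """Re-run pattern mining only when the log text distribution has drifted."""
    return l1_distance(token_distribution(previous_window),
                       token_distribution(current_window)) > threshold

old = ["user login ok", "user login ok", "cache hit for key"]
new = ["user login ok", "payment declined: card expired", "payment declined: card expired"]
print("re-mine patterns?", should_remine(old, new))
```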

Mirko Novakovic: Yeah, that works. And by the way, I...

Ben Blackmore: ...this is something I struggled with in the past, right? A lot. It's like we had great ideas, we had some great algorithms, and then we couldn't get them to production. So that was a really terrible situation to be in, when you realize later, after having invested so much time, that, God damn it, I can't get it out, I can't deploy it. So this time, from day one, this was a huge consideration, and I would urge everyone that is in a similar situation to do the same. It needs to be a day-one concern if you go down any path like that.

Mirko Novakovic: Absolutely. Before I ask my question, I had one about false positives, I see that Baumbach has asked a question here, and it's a very good one. I'll give you my perspective on it as a user, because I'm demoing the feature almost every day. The question is: if you have that agent and it does a root cause analysis of the same problem, is the answer always the same? And I can tell you no. I did that test a lot of times, and the first thing I see is that the answer is almost always pretty good and accurate. It's not the same, it's different, but the result is pretty much the same. In this case, for example, for the product catalog service, the answer is almost always correct; I haven't had a case where it was not correct. But what was interesting is that the way the agent finds the problem is very different. You saw the different steps that it lists, right? Sometimes, like in your case right now, it was 4 or 5 steps; I had situations where it was ten, eleven steps, very different ones. So at least what I've seen is that the way the agent approaches the problem is different a lot of the time, but the answer is pretty much correct. So give me your answer.

Ben Blackmore: No, I can only fully agree there, right? There is always some sense of randomness in there, and getting it more reliable and more consistent is one of the core challenges. Just building a prototype is the easy thing; getting this reliable is the real challenge at the end of the day. And in my opinion, the path will never be 100% consistent, there will always be some sense of randomness. I would say the end result is the crucial thing. And this is not just related to what we are doing, this is basically every kind of use case around it: at the end of the day, as long as the end result is the same, and as long as the steps in between make sense, that is really good. And as long as you're watching out for those steps in between and what it is doing, right? For example, in our case, there are a lot of queries that it's trying to generate, and there are a lot of situations in our experiments where it's just generating the wrong query. It somehow gets something wrong and you need to figure out why. And I would argue that it's probably the same for a lot of use cases where the tool calls are non-trivial.

Lariel Fernandes: Yeah. But this question is really important, because it brings up the topic of evaluation. One thing we still want to do is to check, for example, the robustness of the agent's solutions. So if you twist a bit the way a certain question is asked, do you get the same answer or not? Do you get the same path? On average, is the number of steps to get to the answer roughly the same, and the number of invalid tool calls along the way? These are things that we can easily test when we have a set of questions for which the answer should be roughly the same. But then again, evaluating something like that requires you to have an observability environment that the agents can interact with that is already known, so we know what to expect from each of the tasks, right? If you have a chaotic environment that is changing all the time, then you don't know what to expect, so it's difficult to evaluate. And we usually go for the OpenTelemetry demo for our evaluations, because it's an environment that we are already familiar with. The problem is that the most modern large language models are also already familiar with the OpenTelemetry demo, so sometimes you can observe them cheating, employing prior knowledge about the data set instead of actually investigating the data. And that's really annoying, because it's a data contamination scenario.
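
A bare-bones sketch of the kind of consistency check Lariel describes: run the same question (and paraphrases of it) several times against a known environment, then compare the answers and step counts. The run_agent function here is only a placeholder for invoking the real system.

```python
import statistics

def run_agent(question):
    """Placeholder for invoking the real agent against a known demo
    environment; returns (answer, number_of_steps)."""
    return "product-catalog fails for one product id", 6

paraphrases = [
    "Why is the product catalog service failing?",
    "What is causing errors in product-catalog?",
    "Root-cause the product catalog alert.",
]

runs = [run_agent(q) for q in paraphrases for _ in range(3)]
answers = {answer for answer, _ in runs}
step_counts = [steps for _, steps in runs]

print("distinct answers:", len(answers))  # ideally 1, or at least semantically equivalent
print("steps: mean %.1f, stdev %.1f" % (statistics.mean(step_counts), statistics.pstdev(step_counts)))
```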

[00:38:13] Chapter 10: Developer productivity - promise, limits, and learning curves

Mirko Novakovic: Interesting. Yeah. We have the next question, from Florian, and he asks: he actually saw posts about AI decreasing the efficiency of developers, so what's our take on it? I want to say a few things. I was recently at a keynote and there was the CTO of Windsurf, and he said 95% of the code at Windsurf is generated by AI, right? And I want to say that in my past, I learned that number of lines of code is not a good measurement of efficiency. So I would say the biggest problem I see at the moment is that all those numbers are around efficiency, and almost everyone in this space knows that it's very hard to measure the performance and efficiency of developers overall, and number of lines of code is probably not the best metric. So always be careful about those numbers. I think from our experience, and Ben, you can tell it, we see an increase in productivity with those agents. But we also have to say that while you are experimenting, if you are experimenting with five, six tools, the experimentation can take away some of the productivity, because you're just doing stuff to experiment. But if you would just say, okay, I use one tool, I've evaluated it, and then just look at how much better you are, I think we would see an increase. But it's not ten x or five x or anything; it's a tool that makes you more efficient, and that's good, right?

Ben Blackmore: Yeah. And that's exactly it, right? You have learning time for the tool. And I myself, I think I said that when I started this whole evaluation: I realized I can have Claude Code and I can basically, quote unquote, program during a meeting. I love programming, but I'm now a CTO, I'm spending much more time, I said this earlier, in Google Docs and in meetings. And I figured, oh, maybe I can just program in the meeting, right? Just give it a prompt here and there; I can find time for it. And then eventually I realized that I now have seven different pull requests open and no freaking time to move any of them forward, because all of them were in a state that I still had to touch up. And we have definitely seen cases where you start off with the AI and it leads you in a direction that you otherwise wouldn't have taken, and that you only afterwards realize, oh, this is actually not what I would have done if I had spent five minutes thinking about it. But because you didn't start thinking about it, you didn't realize it. And this, for me, really boils down to: this is a tool, and you have to figure out how to use the tool and when to use the tool.

Ben Blackmore: And this is for us the biggest learning. So we have some internal guideline documents on it, we have a channel where we are sharing learnings about how to use it and what to use it for, and there are some common learnings, right? People love using it for test generation. They love using the Claude Code GitHub Action for code reviews, actually much more than any of the other code review tools, interestingly. There are other things where it's working really well, like the integration writing I mentioned. But there are a lot of cases where, if you haven't learned about that yet and you just start using it... And I think the study that Florian is referencing was a study where they had open source contributors work on issues both without and with coding agents. I don't know the specifics of how much they knew about those agents before, and exactly the use cases, but it might very well be that they were also still learning about them. And then maybe it's generating 500 lines of code really quickly, but if it's the wrong 500 lines, then it doesn't help, right?

Mirko Novakovic: Yeah. And overall, I think what we are seeing in the market is that this topic is pretty much oversold in some cases, at least for the moment. When people say 95% of the code is written by AI, it can suggest that 95% of the work is now automated, and we all know that coding is only part of the job of a developer, right? So that's one part of the story. But I think it's also...

Ben Blackmore: But I would still say, not to sound too negative, I'm still really excited about it. Because the reality is, if we look back at how long we have been using this: Claude Sonnet 3.7 came out, what, beginning of the year? Now we're at four. So a lot of the coding agents that people are using aren't really old in any sense of the word. So it will be really exciting to see how they develop, let's say, in a year. And I don't think anybody can really know.

[00:42:56] Chapter 11: Generating dashboards and IaC with agents

Mirko Novakovic: And maybe another... By the way, we have another question here: what about creating panels in dashboards? I think that's a good question. Can you create dashboards and have AI write them for you?

Lariel Fernandes: That's a really, really good point, and it's actually what I was going to comment on next. I suppose no one likes to write infrastructure as code, but a coding assistant can also write that. At Dash0 we use the Perses format for defining dashboards. Obviously you can design dashboards on your own in the UI and export the definitions, but we have tried applying coding assistants to that task as well, giving them context about what kind of dashboard we're looking for. In that case, it was something for AWS Lambda functions. We can even give it a sketch that you draw yourself of how you expect it to look. And then, because of the semantic conventions, it can really quickly figure out what metrics to look for, what attribute names to look for, and come up with the queries. Then, through the Dash0 MCP, it can try out the queries against the live data to make sure that they work, and in the end make a pull request with the Perses definitions for the new dashboards. And it has worked like a charm; the dashboards look beautiful.
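
A schematic sketch of that workflow, reduced to its three steps: draft a dashboard definition, test its queries against live data, and only then write it out for a pull request. The field names below are illustrative rather than the exact Perses schema, the metric names are invented, and the query check is a placeholder for going through an observability MCP tool.

```python
import json

# Field names are illustrative only, not the exact Perses dashboard schema.
dashboard = {
    "name": "aws-lambda-overview",
    "panels": [
        {"title": "Invocations", "query": "sum by (faas_name) (rate(faas_invocations_total[5m]))"},
        {"title": "Errors", "query": "sum by (faas_name) (rate(faas_errors_total[5m]))"},
    ],
}

def query_returns_data(promql):
    """Placeholder for validating the query against live data, e.g. through an
    observability MCP tool; here it simply pretends every query succeeds."""
    return True

if all(query_returns_data(panel["query"]) for panel in dashboard["panels"]):
    with open("aws-lambda-overview.json", "w") as f:
        json.dump(dashboard, f, indent=2)  # definition a coding agent could open a PR with
    print("dashboard definition written")
```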

Mirko Novakovic: Oh that's cool. So essentially using a coding agent with our MCP to create the dashboards already, right?

Lariel Fernandes: Yes. Yeah.

Ben Blackmore: And so we have some plans for that, like we mentioned. I mean, we looked at this Agent0 version, which is of course, at the end of the day, the holy grail: you automate everything. But along the path there are so many little things you can do to make people's lives better. There are, for example, the PromQL explanations that we have, which I honestly always found a bit gimmicky. But when you demo it to people that don't know PromQL really well, they are honestly happy about it, because it's actually helping them close a small knowledge gap they have. So I think there are a lot of small things you can use it for that aren't super shiny but actually help in the day to day, and this is one of them.

[00:45:10] Chapter 12: Beyond coding - reviews, alerting, and documentation RAG

Mirko Novakovic: We have another question here, from Goran Patel: other than coding, where else are you using agents in the development workflow? Good question, especially since, I mean, I heard you talking about review fatigue, because now agents produce more code and you have to review more. So how can agents help with reviews, or where are we using them in addition to coding?

Ben Blackmore: So code reviews are a really interesting one, because I guess everyone here is probably doing them to some degree. We have a lot of tools to statically analyze code, everyone probably has tests and everything else, and yes, life is too short, right? And so I was really excited about code reviews, because honestly, if, let's say, you make code reviews and pull requests mandatory, this is a blocker in the development workflow. It slows down engineers, and of course reducing that time will improve cycle time. So I was really excited about it, and that's why I invested a lot of time trying different code review tools. And I'm not that happy about them, not that excited about them, because a lot of them were very, very verbose at the end of the day. They create a lot of noise, a lot of context. That's one of the problems with generative AI generally, right? It feels like it's being rewarded, quote unquote, for generating a lot of text. But you don't really want this; you don't want somebody picking on something that has absolutely zero impact. And this is why we have been looking for alternatives, which we found, for example, in the Claude Code review, which is a lot less verbose and a lot more helpful.

Ben Blackmore: But yes, we also have other workflows, right? Lariel mentioned developers using it to generate dashboard panels; they're experimenting with it. Some of our platform engineers have been using it just to learn about alerting patterns for new technologies they haven't used before. We have all of our alerts maintained as code, and so they were asking it: can you give me a heads up, give me a starting point, so that I know where to start with that technology? That is really helpful. And one of my favorite things, which is not quite an agent, is that of course I have to do a bunch of paperwork now. So I have a Claude Project with all of our security and compliance documentation, and when I get a question from a customer about a certain topic, I ask it, hey, which document is this defined in again? Because I forgot which of those 50 documents describes exactly this one point. And that is something that helps me a lot in my day to day, but it's not glamorous in any way.

Mirko Novakovic: It makes sense.

Lariel Fernandes: With the MCP server now, we can also do RAG over our documentation, over our domain. So one of the use cases, I think, when doing support, is finding something in the docs, finding the answer to a nasty customer question about a part of the domain we are not familiar with. You can easily get a solution and a suggestion for what query to build by asking the agent and having it look through the docs.

[00:48:22] Chapter 13: AI-assisted dashboards in the UI and reusable patterns

Mirko Novakovic: Yeah, and I will wrap this up here with the last question, which is again about dashboards. When we answered before, it was more about how to generate the code for a dashboard, which is Perses in our case. And by the way, a lot of our customers, and we ourselves, use dashboards as code: we deploy them with code, which we totally support. But of course you can also add dashboards and panels through the UI in Dash0, and the question is whether we can also support this with AI and gen AI. And I would say yes, I think it's a very good use case; as Jurassic points out here, life is too short to learn PromQL. So probably you could just say, hey, build me a dashboard for my problem here in Kubernetes, and the AI would pick the right metrics and build the dashboard for you in the UI right away. I think it makes total sense.

Ben Blackmore: Yeah. If you think about it, most people have somewhat the same dashboards for common things. This is why there are dashboard libraries out there on the internet at the end of the day. Because how you look at a Kubernetes node, for example, is pretty identical, right? How you monitor a Temporal server, one of the things that we're using, is pretty much identical. How you look at a specific database. You don't want to start from scratch there. Of course you have some differences for your own things, but then it can be really helpful to get an idea of how to start.

[00:49:51] Chapter 14: Closing remarks

Mirko Novakovic: Yeah. So this was fun. Thanks, everyone for joining this episode of Code RED Live for the first time. And thanks, Lariel. Thanks, Ben for joining.

Ben Blackmore: Thanks for having us.

Mirko Novakovic: Thanks, everyone.

Ben Blackmore: Thank you. Have a good one.
