Episode 13 | 36 mins | 11/21/2024

OpenTelemetry Origins: The Evolution and Horizon of Observability

host
Mirko Novakovic
guest
Ben Sigelman
#13 - OpenTelemetry Origins: The Evolution and Horizon of Observability with Ben Sigelman

About this Episode

Ben Sigelman, Co-Founder and CEO of Lightstep, joins Dash0’s Mirko Novakovic to share his journey helping launch Dapper and Monarch at Google, a nightmare New Year’s Day troubleshooting Google Weather, and how he views OpenTelemetry and observability evolving alongside AI and LLMs.

Transcription

[00:00:00] Chapter 1: Introduction and Code RED Moments

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I'm co-founder and CEO of Dash0, and welcome to Code RED. Code, because we are talking about code, and RED stands for Requests, Errors and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. So I'm really happy that today I have Ben Sigelman, co-founder and, until recently, CEO of Lightstep, one of the inventors of OpenTracing, which turned into OpenTelemetry, and one of the people who invented Dapper at Google. So I'm really excited to have you here, Ben.

Ben Sigelman: Pleasure to be here. Thank you, Mirko.

Mirko Novakovic: And we always start with our Code RED question. So my question to you is what was your biggest Code RED moment?

Ben Sigelman: Well, just in terms of feelings, there's an easy worst-ever Code RED moment for me. I was at Google, working initially on ads and then eventually on Dapper, as you mentioned, and I loved it. It was fun. So I took on a 20% project, which was a thing at Google, obviously, and developed Google Weather. So if you type in "weather San Francisco," it brings up those little results. Now that's a fully staffed project and everything, but initially it was just a side project I did, you know, end to end, and launched. And I didn't really think much of it. It was kind of on autopilot, frankly. But because it was a 20% project, it didn't have real monitoring or anything like that. And it was all good, though, because it was so simple. But then it was New Year's Eve, and as happens on New Year's Eve, you know, I was out having a good time, and I'd had maybe a couple too many drinks. I was living in New York at the time. This is January 1st, 2005, and I was 25 years old. And then at 3 a.m. in New York, it became midnight Pacific time. And the weird thing about Google, I don't know if this is still true, but back then, for dumb reasons, local time in California was actually a big deal at Google. A lot of log files and things like that would wrap over at midnight Pacific time.

Ben Sigelman: And so at 3 a.m. Eastern time, it became midnight Pacific. And long story short, Google Weather completely broke. Like, totally broke. Everything stopped working. I'd set up a very crude alerting system that was just black-box monitoring, basically a synthetic test to make sure this thing was working. It wasn't working. And at 3:00 in the morning, frankly pretty inebriated, I was awoken to the fact that my service, which I was the only person who understood, was gone globally. And there were a lot of people noticing this and getting upset about it. And then I had to figure out what was going on. But because I hadn't set up any real observability, that involved many deploys to production just to test theories, and it was disastrous. I did eventually fix it before 5 a.m. my time. But it was sobering in many ways, and definitely by far the worst debugging experience I've ever had, in terms of just feeling like I really messed up. I don't know what the moral of the story is, aside from: you should have a strategy for observability before you launch a service, and maybe a backup for on-call. But other than that, it was just kind of a fiasco. That's my Code RED moment, I think, in terms of just being in over my head.

Mirko Novakovic: It's a really good one.

Ben Sigelman: Or bad one.

[00:03:32] Chapter 2: The Importance of Observability

Mirko Novakovic: And by the way, I think we both were CEOs of companies on very similar timelines. We both founded in 2015, Instana and Lightstep. We sold at the end of 2020; you sold in 2021 to ServiceNow. So very similar paths, probably also a roller coaster ride for you. But one of the things I learned once you get more into go-to-market and sales is that you can't really sell observability if there is no pain, right? It's kind of sad, but most people don't think about observability when they don't have a Code RED moment. And once they have a Code RED moment, then they really understand that they need something to help them fix it. Was that also a trigger for you to get more involved in observability, when you had that moment?

Ben Sigelman: Yeah. At that time, I was already tinkering with Dapper, although Dapper at that point was really not what I would call a general-purpose observability system. It was used really specifically for performance analysis, which, as you know, is kind of a subset of the sorts of things someone would use something like Dash0 for, and it wasn't being used as much for real-time diagnostics and root cause analysis use cases. But it was interesting, because later, after working on Dapper, I also developed Monarch, which is Google's multi-tenant monitoring system, kind of what we would call a metrics system or infra monitoring today. And with Monarch, I did think a lot about that experience with Google Weather, as painful as it was, because there were a lot of things that would actually have been in the best interests of our users with Monarch, in terms of just eating their vegetables and setting things up correctly and stuff like that. And they never did it. Nobody ever did it.

Ben Sigelman: The honest truth is that most people with busy jobs who are trying to launch services do not want to spend a single extra second thinking about monitoring and observability. So if it doesn't work more or less out of the box, it's not going to happen. And that was something I had to think back on: remember what it was like when I was actually trying to launch a feature and didn't really care about monitoring. I just needed to check the box. That's the level of effort I can expect from people. So it's this weird thing where there is a lot of pain, for sure, and I think it is probably in people's self-interest to do a really good job instrumenting their code and setting everything up and configuring it all properly, etc. But if your solution doesn't work well enough out of the box, it's not going to work. And I think that was the thing I learned more than anything, actually, from having to deploy features on a tight budget and so on and so forth.

[00:06:12] Chapter 3: Instrumentation and OpenTelemetry

Mirko Novakovic: Yeah, and I listened to a few of your podcasts, and you also said that one of the good practices at Google is that there are some central libraries everybody has to use, right? Yes. And by putting instrumentation into those libraries, if you use them, you actually get really good visibility out of the box, if done right. So a really good best practice for any project is to not reinvent the wheel everywhere, which is what engineers tend to do, but to use a set of really common libraries throughout all projects.

Ben Sigelman: Yeah, that's a great point. And you're absolutely right, I have said that. I actually, frankly, haven't read it in like a decade, but the Dapper paper, I think, says somewhere in the middle: oh, by the way, this was something we were able to do at Google because the central libraries were instrumented well enough that most people didn't need to do anything. And when I started Lightstep, the reason we ended up doing OpenTracing was basically this realization that nobody wanted to instrument their own code, especially to integrate with a proprietary vendor. But instrumenting your code to comply with an open source standard was easier. And if we could get that built into central libraries connected to things like Kubernetes and cloud services, maybe, just maybe, we could get to a point that approximates what we had at Google, where the things you depend on are mostly instrumented. And I think with OpenTelemetry, we're getting close to the point where most of your dependencies probably have good enough instrumentation. That's a big, big improvement over where things were when we started our companies in 2015, when instrumentation was just a Wild West and it just didn't work.

Mirko Novakovic: Yeah, absolutely. And what I can tell you from talking to a lot of people here at KubeCon at the moment is that, kudos to you and the team and the community, people are now really aware of tracing and really aware of OpenTelemetry. Almost every attendee I was talking to directly knows what tracing is and what OpenTelemetry is. And I have to say, ten years ago, most developers were not aware of what distributed tracing is and how to use it. But on that topic, I would like to ask you one question. I mean, you are basically the inventor of distributed tracing, somehow, with your team. But if you were to rank realistically what a developer is using today to observe code, in terms of metrics, logs, and traces, how would you see that stack up?

[00:08:43] Chapter 4: Tracing vs. Logging in Development

Ben Sigelman: Well, first I have to issue a slight correction. I really don't like being credited with inventing Dapper. I wish I did. I wish I was that smart. The people who really did the critical exploratory work were all really distinguished CS researchers who came to Google when Digital and Compaq merged and then kind of failed together. A bunch of really impressive people came to Google at that point, and that group of people invented Google File System, Spanner, Bigtable, a whole bunch of blob stores, gRPC; all this stuff is due to that group, including, actually, the instrumentation I mentioned earlier. Google just got really lucky. They landed like 50 world-class PhD CS researchers very early in their lifetime as a company, and those people went and made a bunch of really good technical decisions that had huge internal rate of return over decades. So Mike Burrows, Sharon Perl and Dick Sites, I think, were the people who, in my mind, really had the key insight with Dapper. They weren't foolish enough to try to get it into production. What I did was, I was 25 and I was excited, and I was like, I'm going to take this and make this work. At that point they hadn't come up with the term "span," and there was still some actual work that I and my team did, but the key insight was there. So let's make sure that's on the record.

Mirko Novakovic: Yeah.

Ben Sigelman: Because I don't want to be given credit when I don't deserve it.

Mirko Novakovic: I think it's because your name is on the paper, right?

Ben Sigelman: I know, that's true. I also was foolish enough to write the paper. Which, by the way, we submitted to a conference in 2006, rejected; submitted to a conference in 2008, also rejected. And then in 2010, someone at Google wanted to publish a paper that cited it and asked, where did that get published? And it didn't get published, because no one wanted to publish it. So we put it up as a technical report at Google. It was never actually accepted as a scientific paper, although apparently people read it. But it's ironic to me. I think the scientists said, this is not science, this is a bunch of engineering work, which is true. And they didn't see a hypothesis; it was an experience paper. I was like, okay, fair. That's why I'm not a scientist. Anyway, to get to the heart of your question about what the developer actually relies on: there's what I'd like the answer to be, and what the actual answer is. I think the actual answer is that, especially for developers who are checking in code, logging still rules. Everyone knows how to do it. When things aren't working, you always come back to logs, especially for your own service. I do think that for on-call stuff, metrics are probably the easiest way to get a signal about basic health, resources and so on. And tracing has its place for providing context and threading things together. But logging is still the king, I think. I would also add that logging and tracing really are a continuum. Logging about transactions is a type of tracing, and tracing is definitely a type of logging. So you could argue that maybe they blend together at some point. But yeah, if I had to answer your question concisely: logging.

Mirko Novakovic: Yeah, I would give the same answer, based on my experience, and I'm a little bit sad about it. I was always an advocate for tracing. I love tracing, especially done the way OpenTelemetry does it these days, with a lot of auto-instrumentation agents, so you don't need to spend too much work to get it up and running. It's the best way of investigating how your microservices are interacting, troubleshooting things, and getting a dependency view of your system. So I do think tracing is really what developers should use. But the reality is, it's really logging. Most people use logs; I totally agree. That's why I asked this question. And you already hinted at what has to be done in the future. So how can we get tracing better into the industry and make sure that, I don't know, five years from now, if I ask you the same question, you say, oh, it's definitely tracing?

Ben Sigelman: Yeah. And as I said, I'm not giving you my aspirational answer, I'm giving you the real answer. Especially when you go into kind of brownfield enterprise environments. And I don't mean that in a disparaging way; I actually really admire people who get things done in that environment. But that world has a lot of legacy code, and there's almost no way you can expect it to have clean tracing instrumentation. So I actually think there's a pretty happy story here. Almost all logs are about transactions. And at that point, a log really is semantically part of a trace. Whether or not it's going through your tracing collection pipeline and into your tracing UI is a different question, but semantically it's part of the trace. And what I'd like to see is a world where, through some sort of magic, the instrumentation that people add to their code, or maybe in the future that some LLM adds to their code, frankly, doesn't require you as a developer to specify where it's going. You're saying: this is about a transaction. And hopefully there's something the runtime can pick up from a thread-local or whatever to say, look, what's my logging context, and what's my tracing context? And just send it so that whatever tool you're using, you can see it. It seems weird to me.

Ben Sigelman: Actually, if you're logging about a transaction, it should be part of the trace, and you should also be able to see it in your log viewer. In my mind, a log management system for a developer or SRE is really just a different approach to sampling and a different approach to the UI, one that's much more log-line centric. But it's the same data. It's all event data, structured event data, and you can view it however you want. And I would just like to remove the dichotomy between these two things. From an instrumentation standpoint, logging about transactions is the same as a trace annotation. And we just need to solve that from a product standpoint, so that you can see it in whichever context you happen to want, whether it's a distributed transaction or just the firehose of log lines from a process, I don't really care. So that's what I'd like to see. It's kind of a "yes, and" as far as logging or tracing is concerned.
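The "same data, different view" idea can be sketched in stdlib Python, with no vendor API: a transaction establishes a trace id once in a context variable, and a logging filter stamps it onto every log line emitted inside that transaction, so each line is simultaneously a log record and a trace annotation. All names here (`current_trace_id`, `TraceContextFilter`) are invented for illustration; real OpenTelemetry SDKs carry this context for you.

```python
import contextvars
import logging
import uuid

# Hypothetical, stdlib-only sketch: the trace context lives in a contextvar,
# not in any particular tracing SDK.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceContextFilter(logging.Filter):
    """Attach the active trace id to every log record passing through."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request():
    # Start of a transaction: establish the trace context once...
    current_trace_id.set(uuid.uuid4().hex)
    # ...and every log line inside it now carries that trace id.
    log.info("charging card")
    log.info("reserving inventory")

handle_request()
```

Running `handle_request()` prints two lines sharing one `trace=` value: a log viewer can show them as a firehose, while a trace viewer could group them by that id.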

[00:14:46] Chapter 5: Future of Instrumentation and AI

Mirko Novakovic: Yeah. And you already mentioned LLMs, or AI. I recently talked to a startup, and I think you're an angel investor in it, that is actually trying to solve that problem with logs: essentially helping to do better structured logging. Because that's also sometimes a problem, right? How do you log? Is it just an unstructured log line, or is it more structured, with a lot of key-value pairs attached and real fields? Then it becomes very close to a span at the end of the day. And I like the idea of using LLMs, or something like a copilot in your IDE, to help developers write much better logs in context and help them with that.

Ben Sigelman: Yeah, totally. And you know, one of the things that bothers me about OpenTelemetry, especially the conversation around OpenTelemetry with end users, is this binary idea of whether or not you have added OpenTelemetry to some given service or piece of infrastructure, which is a really strange idea in my mind. Almost all of the things you could instrument are not instrumented, which is fine. I mean, if you instrumented them all, the data volume would be completely unreasonable, right? But the decision about what to instrument and not to instrument is pretty critical. And unfortunately, at the moment, you can do some filtering in the collector and things like that, but we don't really allow people much dynamic control over what's actually instrumented and what's not. And that's a big weakness, I think, with the basic design of Otel today: the only filtering you can do is outside of the process, in the collector. I would love there to be a world where, maybe from a developer standpoint, you don't even see it in your IDE, but there's some layer, and again, this seems ready-made for an AI, because it's way too much work for a person but pretty obvious work, that just goes in and adds a whole bunch of default-off trace points and log points and instrumentation points all through a code base. Points that can be flicked on and off dynamically, and that could provide a lot more detail when there's an incident occurring, all dynamically controlled.

Ben Sigelman: Something like that seems really exciting to me, actually. And again, I think part of the reason there's a lot more logging than tracing is that doing correct tracing instrumentation is not hard, hard, as I was saying. If an engineer put their mind to it, they could definitely figure it out, but they probably don't, because it's not their job. Whereas you could have an LLM going through and doing that work. I've actually prototyped that, just for fun. It's not difficult at all to get an LLM to add good-enough tracing instrumentation to basically anything. I really like the idea of having that kind of thing ready and available without doing a redeploy, and that type of work could significantly change the root cause analysis experience, both for humans and for some agentic type of system, because you just have a lot more to grab onto. Right now, when you're debugging something, you often have this one span or this one mystery log line that tells you something bad is happening, but not why. And then there's nothing. There's no next step. I think that's a solvable problem, I really do. But it requires richer instrumentation and better dynamic control over the telemetry, just to handle costs. So anyway, yeah, I'm excited about that.

Mirko Novakovic: It's very similar to what you do with logs, right? You have the log level, where you say this is info, this is error, this is debug. In production, you don't turn on debug, because it's just too much data. But those systems allow you to attach to the process and say, now turn on debug, because I really want to debug. And then you get all the information. Why not do the same for tracing? Have a debug tracing mode, and only turn it on if you really need it.
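The "default-off trace points, flicked on like a debug level" idea could look something like the following sketch. It assumes a simple in-process flag registry stands in for real dynamic control, and every name (`ENABLED_POINTS`, `trace_point`, `EMITTED`) is hypothetical rather than any real SDK's API.

```python
import functools
import time

# Instrumentation points are compiled in everywhere but emit nothing until
# enabled at runtime (e.g. during an incident), so steady-state cost stays low.
ENABLED_POINTS = set()   # mutated at runtime; no redeploy needed
EMITTED = []             # stand-in for a telemetry exporter

def trace_point(name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if name not in ENABLED_POINTS:       # default off: near-zero cost
                return fn(*args, **kwargs)
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:                             # emit a timed, span-like record
                EMITTED.append((name, time.perf_counter() - start))
        return wrapper
    return decorator

@trace_point("db.query")
def run_query():
    return 42

run_query()                      # point is off: nothing emitted
ENABLED_POINTS.add("db.query")   # incident: flick the point on dynamically
run_query()                      # now a timed record is emitted
```

The key design point is that enabling happens inside the process, before any data egresses, which is what collector-side filtering cannot give you.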

[00:18:39] Chapter 6: The Role of Business Metrics and Observability

Ben Sigelman: Right, yeah. There are lots of ways to think about it, exactly. And this is one of my big regrets with Otel: we didn't really invest much in dynamic control, and that's all happening in the collector right now. Which is okay, it's better than not doing it at all, but it's actually too late. The amount of data that would be coming out of the process if you wanted verbose, v equals 2, egress from every process into the collector is just not tractable. It's way too much data.

Mirko Novakovic: Yeah, you have to do it on the instrumentation side, right? A lot of languages also support dynamic instrumentation, so you could even add that dynamically, with AI, to say: hey, this is my problem, dynamically add instrumentation to the code so that I can analyze it more. I remember one of the first versions of Dynatrace had this PurePath, which is essentially also something like a distributed trace. And when you had a call, you could click on it and say "add instrumentation," and it would show you all the methods that were called inside the call you had selected. It would add that instrumentation to the code dynamically: it would connect to the agent and do dynamic instrumentation. It was only Java-based, but it was a cool feature, because that way you could manually add more and more visibility for things you wanted to see. It doesn't really scale, as you said; you have to remove it later on because it's just too much data. But it is a very interesting feature.
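The click-to-instrument flow described above, attaching instrumentation to a running call path and removing it again, can be sketched in stdlib Python with runtime method wrapping. The class and method names are invented for illustration; a real JVM agent would do this with bytecode rewriting rather than monkey-patching.

```python
import functools
import time

CALLS = []  # stand-in for the collected method-level timing data

def instrument(cls, method_name):
    """Wrap an existing method at runtime to record its calls; return an undo."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            CALLS.append((method_name, time.perf_counter() - start))

    setattr(cls, method_name, wrapper)
    # "You have to remove it later": hand back a function that restores the original.
    return lambda: setattr(cls, method_name, original)

class OrderService:          # hypothetical service under investigation
    def place_order(self):
        return "ok"

uninstrument = instrument(OrderService, "place_order")
OrderService().place_order()     # recorded while instrumented
uninstrument()                   # too much data otherwise: take it out again
OrderService().place_order()     # no longer recorded
```

The undo function is the part that matters for the scaling concern: the extra visibility is temporary by construction.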

Ben Sigelman: Absolutely, yeah. I remember that as well. Dynatrace, especially in JVM environments, had some really cool tricks. It still does.

Mirko Novakovic: On my flight over here, I was sitting next to one of their evangelists, and we had a chat on the plane, because he came from Austria and I came from Germany, and it was interesting. I always thought they have one of the best engineering teams in the space down there in Austria and do some really, really good stuff, especially with regards to instrumentation. Though I have to say, one of the things I also heard from him is that more and more customers actually say: we don't want to use your agent anymore. We really want to use OpenTelemetry and the collector, because we want to make sure we can decide where to send the data and how to handle it, and we don't want that proprietary instrumentation anymore. I don't have an exact percentage, but from all the data I see, I would say around 20% of the data coming in today is OpenTelemetry and 80% is still proprietary. But it's growing pretty fast, so I do think that in the next years we will get pretty quickly to 50% of the data being OpenTelemetry.

Ben Sigelman: I think that's absolutely right. And I heard that over and over and over again, especially when Lightstep was acquired by ServiceNow. I spent three years there; I'm not there anymore. But I talked with a lot of ServiceNow customers, which, and this is a generalization, were mostly very large enterprises with brownfield environments and a lot of different applications that were often acquired and had different tech stacks and so on and so forth. So they were really challenged with trying to come up with processes and systems that could work across a very wide surface area of technologies. And there was a pretty strong push at a lot of those organizations to figure out a strategy that was kind of Otel-first. It wasn't so much about a specific concern with any one particular agent, whether Dynatrace or Datadog or something else; it was more a desire to have a common data layer they could build off of. But I have to say, it was always a little bit interesting to me, because I'd ask them, what's your observability strategy? And the answer often was, basically, OpenTelemetry.

Ben Sigelman: I'm like, that doesn't actually make any sense to me. I would say it to them in a nicer way, but I mean, OpenTelemetry can absolutely be part of the strategy. If all you do is take your proprietary agents and move everything over to Otel, I do see how you've gained some optionality, but realistically, at a big enterprise with lots of priorities and resource issues, that's going to take years. And at the end of it, what do you have to show for it? You haven't really saved any money, and you definitely haven't improved MTTR or SLOs or what have you, in terms of business metrics. I would always say: look, you should totally move to Otel, that's great, obviously I like Otel, but you need to have some wins on the business side that are coupled to this migration on a quarterly basis. Or at some point someone's going to say, what have you been doing for the last two years, and just pull the plug on the project. Moving to Otel is not an observability strategy. And unfortunately, I think there was quite a bit of that going on. I don't know if you saw the same thing.

Mirko Novakovic: I was a consultant for a long time, and I did software development in large enterprises. And tragically, I've seen that not only in observability but in many other dimensions: a lot of projects, even big software development projects, were not driven by any business outcome. They were just technology decisions, right? Hey, I want to move from, I don't know, COBOL to Java, or from Java to Node.js, or I want to turn this monolith into microservices. The things they were doing were really driven by technical decisions, and not by asking: what will actually change for the customer if this is now not a monolith but microservices? What's the benefit of a five-year project with hundreds of people migrating it, which brings a lot of problems too? So I totally agree. You should always have a strategy; you should always have a business outcome. Which, by the way, brings me to a question I would also like to ask you. I was always thinking about connecting observability with business metrics. I once was at Ryanair, and they used New Relic. I'm not sure if they still do, but in every office they had a big screen with a KPI, and that was how many tickets they were selling, because that's essentially their business, right? They sell flight tickets; it's the biggest European airline now. And that was a New Relic dashboard. But it was not really connected to the trace. Especially if you trace it, you could actually use the instrumentation to really count, and maybe even, with the payload, count the amount of money you're collecting. And then if something drops, if you see your ticket-sales metric drop, you could click on it, deep-dive directly, and get into troubleshooting mode, right? So I always liked the idea of connecting business with observability data. Have you seen that somewhere?
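A minimal sketch of that ticket-sales idea, assuming spans carry the price as a business attribute: the same span stream used for troubleshooting yields the business KPI directly, and a drop in the KPI points straight back at the failing transactions. The field names (`ticket.price`, `checkout`) are illustrative, not any standard's semantic conventions.

```python
# Hypothetical span records, as an observability backend might store them.
spans = [
    {"name": "checkout", "status": "ok",    "attributes": {"ticket.price": 59.99}},
    {"name": "checkout", "status": "ok",    "attributes": {"ticket.price": 120.00}},
    {"name": "checkout", "status": "error", "attributes": {"ticket.price": 89.50}},
]

def ticket_kpis(spans):
    """Derive business KPIs (tickets sold, revenue) from successful checkout spans."""
    sold = [s for s in spans if s["name"] == "checkout" and s["status"] == "ok"]
    revenue = sum(s["attributes"]["ticket.price"] for s in sold)
    return {"tickets_sold": len(sold), "revenue": round(revenue, 2)}

print(ticket_kpis(spans))
```

Because the KPI is computed from spans rather than a separate counter, the errored span above is both the revenue gap and the trace you would click into to troubleshoot.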

Ben Sigelman: I mean, I completely, totally agree with you. When it doesn't happen, which unfortunately is pretty common, it's usually for kind of dumb organizational reasons, more than that anyone thinks it's a bad idea. I think from a business standpoint, people are often kind of flying blind, especially at that next level of detail about the success or failure of a transaction. I'll tell a story in a second about how meaningful I think it can be. But yeah, I completely agree. And I will say, from a vendor standpoint, it makes a ton of sense to push in that direction, because if you succeed, you're delivering a much larger value proposition to your customer. And it's really a win-win, because the customer is genuinely happier and better off once that integration has happened. It's just one of these things where you have to get that leader on the other side of the business to care about this tooling decision and some instrumentation or a dashboard or whatever, and that can be challenging. I think some vendors have made that easier than others, and it makes perfect sense to me. I will say, actually, going back to Google, where people prided themselves on being very rational and data-driven, it was quite a challenge to move Dapper up the stack, as it were: out of the server back-end world and into clients, where we could actually measure the things that matter to our users and customers.

Ben Sigelman: And it was fascinating. There was so much pushback about that. But we eventually did do it, and we got to a point where Dapper traces, or at least a sample of them, had instrumentation going into web browsers and mobile devices. And when we did that, it was just fascinating what ended up happening. It was in order to do what you're saying, because at Google there was a very, very clear and completely obvious correlation between latency and money. The lower the latency, the more money you made, period, in terms of ads and all sorts of other things: user retention, etc. So user latency was everything, and people would spend huge amounts of time optimizing the critical path in their backends to shave off like 1% of latency and so on and so forth. But then, when we actually pushed instrumentation into the client, it turned out, and I'm making up the numbers, but it was on the order of 30 to 50%, that much of the latency was just network latency for users in non-US regions, further from Google. That ended up leading to a full-scale redesign of the way Google even thought about its data centers.

Ben Sigelman: They went to a model where they built satellite data centers all around the world that were physically close and network-wise close to their users. All those satellites really had was a front-end load balancer and then a completely privately owned fiber-optic connection to one of Google's real data centers somewhere else, with a much shorter hop than whatever the internet had decided to give those people otherwise. And it reduced latency by a huge amount and increased revenue, of course, as well. But it was really the result of finally measuring what the customers and users were actually experiencing. And the thing that's most remarkable to me about this was just how much pushback there was about doing that measurement in the first place. There were a lot of people saying, oh, we don't need to do that; it's all kind of linear, it's all sort of the same. And that was just totally untrue. Then, with the data in front of us, it was obvious: oh, wow, okay, this is a serious problem. We can optimize our back end all we want, but when you see that 20 or 30% of the pie chart is just that hop to get to Google's data centers, well, you've got to solve that problem first. So anyway, I thought it was interesting and absolutely beneficial. I wish more companies prioritized that.

Mirko Novakovic: Yeah, and I can see that Google is much better than traditional companies at bringing their real business KPIs together with something like observability, because it's almost a tech company overall, right?

Ben Sigelman: And for product people, it allows you to write better SLOs, really user-centric SLOs and things like that, which I think are a much better way of tracking uptime than MTTR and stuff like that anyway. So I think it enables a lot of really good habits in terms of both product, engineering, and the business. So yeah, totally agree.
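The user-centric SLO idea Ben mentions can be boiled down to a simple service-level indicator: what fraction of users actually got a fast response, compared against an objective. This is a minimal Python sketch with illustrative numbers, not any vendor's implementation; the threshold and target are arbitrary assumptions.

```python
def slo_compliance(latencies_ms, threshold_ms=300):
    """Fraction of requests at or under the latency threshold --
    a simple user-centric SLI: 'what share of users got a fast response?'"""
    if not latencies_ms:
        return 1.0  # no traffic, vacuously compliant
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

# Illustrative client-side latency samples (ms) and a 99% objective.
samples = [120, 250, 180, 900, 210, 140, 95, 310, 200, 175]
sli = slo_compliance(samples)       # 8 of 10 under 300 ms -> 0.8
slo_met = sli >= 0.99               # objective not met
```

The key contrast with MTTR is that this measures what users experienced, not how quickly operators reacted.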

[00:29:52] Chapter 7: LLMs and Future Disruptors in Observability

Mirko Novakovic: And that's something where AI can probably also help, right? If you go more to the business side and you need dashboards or something, I can see that. I mean, I've already seen solutions doing some cool stuff, creating dashboards and UIs with just natural language. If you could just say, hey, create a dashboard for the ticket sales, including revenue, and it would automatically figure out which traces and which payload fields to use to build those metrics, that would be awesome, right? For people to use actual observability data to create those business metrics.

Ben Sigelman: Totally.

Mirko Novakovic: So where do you see the future of observability generally going? Or let me put it this way: are LLMs disrupting the observability space to the point that it will be totally different? What's your thought on that? Because I've seen companies like Flip AI and others that are really built on the idea of being, I would say, LLM-first. For me personally, I don't have a clear opinion on that yet. I haven't seen a really big change so far; I see a lot of optimization potential. But I would love to get your view. I mean, you're closer to Silicon Valley, and there's probably a big hype right now around AI and the idea that it will change everything. So it would be interesting to hear what you are seeing and how you see that space evolving.

Ben Sigelman: It's a great question. I'm hesitant to even provide a response because I'm worried about how wrong I'm going to be. It does feel like early days, but I'll take a stab at it. When LLMs first came on the scene, and I guess it's kind of shocking to me that it's been almost two years since ChatGPT was released, but I guess that's correct, right? I was totally blown away. Completely blown away by them, and thought they were capable of just about anything. And then at some point I decided to roll up my sleeves and actually try these things out on some observability use cases. So I did some prototyping, and I guess this is where I'm at right now, again with the caveat that maybe we'll reach AGI in two years and I'll look like an idiot. But for LLMs, and even, broadly speaking, more agentic approaches to things, I do see a lot of upside for observability when it comes to finally doing the toil that humans weren't willing to do to make observability better. So actually, at the very beginning of our conversation, I was talking about how I, as a developer who was trying to get Google Weather out the door, just didn't feel like doing the right thing for instrumentation and monitoring and observability.

Ben Sigelman: I think human developers will continue to feel that way forevermore. But there's absolutely no reason that we couldn't have an LLM that was handling that side of things for them, maybe even as a post-compilation step or something like that, or at least something that you don't see in your IDE if it's going to clutter up the code too much. But I think that we could end up in a world, very easily, where you have kind of near-perfect instrumentation of everything, with verbosity levels set correctly so that you don't preempt, you know, the actual instrumentation that humans have added, or whatever. That seems like a very solvable problem. And I think that could significantly improve the signal-to-noise ratio of observability. A lot of observability right now is bad because the instrumentation isn't there or is not very good. So I think we could see a significant improvement on that side. Actually, I'm not really involved in OTel right now, but if I was, I would be strongly advocating to push the project in that direction. So I think there's a lot of upside there. And then on the root cause analysis side, I also think that there is a lot of upside in having some kind of agentic system. I'm not convinced it's actually a large language model specifically, but some kind of agentic system sitting there and testing hypotheses in the background.

Ben Sigelman: We experimented with some stuff like that at Lightstep, without the smart AI, just doing some basic statistical correlation, and there's upside there for sure, but I think we didn't do a very good job of it. I think if you could build something that would just follow a whole bunch of trails concurrently and then tell the human which ones are most promising, that feels like a supercharger. What I'm not expecting, though, and I would love to be wrong about this, is an AI that actually does RCA for you. And I'm definitely not expecting an AI that closes the loop, does the remediation step, and also, like, submits the PR that, you know, truly fixes the root cause in the right way. I think that we're pretty far from that. I think it's still going to require a human. I'm more expecting the AI to be like a very, very patient but not very talented intern. Or maybe like a thousand of them. That's what I'm expecting. And I think that such a fleet of interns could do a lot of good on the instrumentation side and on just weeding out hypotheses during an incident. So that's what I'm expecting.

Mirko Novakovic: Yeah. And I would agree. I still think that for root cause analysis, something like BubbleUp from Honeycomb is still the best feature in the market, right? It has nothing to do with AI. It's basically taking two sets of data and comparing where you have more or fewer of which tags, right? And that's what developers do when they troubleshoot, and it helps you, because machines are good at it. And I can see that enhanced a bit. By the way, our first feature with AI, which we are just releasing, is making structured logs out of unstructured logs. So we basically look at the logs, and then we create the tags for you, and then we map them to the semantic conventions of OpenTelemetry, right? So that it's correctly mapped to the OTel spec. The first experiments we did are actually pretty good.
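The comparison Mirko describes, taking a baseline set and an anomalous set of events and ranking which tag values are over-represented in the anomalous one, can be sketched in a few lines. This is a minimal Python illustration of the idea, not Honeycomb's actual BubbleUp implementation; the event data and tag names are made up.

```python
from collections import Counter

def rank_tags(baseline_events, anomalous_events):
    """Rank (tag, value) pairs by how much more frequently they appear
    in the anomalous event set than in the baseline set."""
    def frequencies(events):
        counts = Counter()
        for tags in events:
            counts.update(tags.items())  # each (tag, value) pair counted once per event
        total = max(len(events), 1)
        return {pair: n / total for pair, n in counts.items()}

    base = frequencies(baseline_events)
    anom = frequencies(anomalous_events)
    # Positive score means the pair is over-represented among anomalous events.
    scores = {pair: anom.get(pair, 0.0) - base.get(pair, 0.0)
              for pair in set(base) | set(anom)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up example: slow requests cluster on a new deployment version.
baseline = [{"region": "us", "version": "v1"}] * 90 + [{"region": "eu", "version": "v1"}] * 10
anomalous = [{"region": "eu", "version": "v2"}] * 8 + [{"region": "us", "version": "v1"}] * 2
top = rank_tags(baseline, anomalous)
# top[0] surfaces ("version", "v2") as the most suspicious tag value
```

The point Mirko makes holds here: no AI is involved, just a frequency comparison that machines do faster and more exhaustively than a human eyeballing tag distributions.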

Ben Sigelman: That sounds both very practical and very powerful and completely aligned with that first category in my mind. Just like make the data itself, the telemetry itself, higher quality and higher signal. And that sounds awesome.

[00:35:40] Chapter 8: Reflections and Closing Thoughts

Mirko Novakovic: Well, thank you. I know you left ServiceNow a few months ago. When I left, I also took time off, and I enjoyed it a lot with my kids and my family. I wish you a great time and some relaxation from tech. So thank you for being here on my podcast.

Ben Sigelman: My pleasure. Mirko. Thank you so much.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
