Episode 22 · 34 mins · 4/3/2025

Debugging the Future: How Lumigo Modernizes Developer Observability

host
Mirko Novakovic
guest
Erez Berkner
#22 - Debugging the Future: How Lumigo Modernizes Developer Observability with Erez Berkner

About this Episode

Lumigo CEO Erez Berkner joins Dash0’s Mirko Novakovic to discuss the evolution of cloud observability and the critical role of AI in debugging modern architectures. They dive into the challenges of tracing serverless and hybrid environments, why traditional observability tools weren’t built for developers, and how AI-driven insights are shaping the future of automated troubleshooting.

Transcription

[00:00:00] Chapter 1: Introduction and Guest Background

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0. And welcome to Code RED: Code, because we are talking about code, and RED stands for Requests, Errors and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Erez Berkner from Lumigo, CEO and co-founder, and before founding Lumigo in 2017, Erez was leading the cloud products of Check Point, a security company in Israel. So welcome to Code RED, Erez.

Erez Berkner: Thank you, Mirko. Thank you. Great to be here.

[00:00:48] Chapter 2: Code RED Moment and Initial Observability Challenge

Mirko Novakovic: Yeah, great having you here. And I always start with my first question. What was your biggest Code RED moment?

Erez Berkner: It's a very interesting question, because I actually remembered one, you know, from the days when I was a developer: a big production outage for one of our customers due to our product. And it happened to be a big credit card company. We managed somehow to basically shut off all network communication to their production server and to their HA, to their availability. And then I was the developer that was called to figure out what's going on in that production environment. And that was, you know, I mention this because it was a while ago, but it was stressful, because for a large company to have downtime, it's millions of dollars every minute. Because, you know, when your credit card doesn't work for some reason, you just pull out another card or cash, so the transaction is lost.

Erez Berkner: So that was a very stressful environment and a stressful situation, actually measured in minutes. And I'm mentioning this not because it was so horrible, but also because I was so frustrated. Because when I wanted to try to understand what's going on, where everything is failing, I started looking for logs, and, you know, in network environments there's a huge amount of traffic and packets going back and forth. So identifying the stream of requests, and making the story of what comes before and after, is critical for me as a developer to understand what's going on in the production environment. You know, I need to provide a very quick answer and resolution, but at the same time, the right logs weren't there, so I couldn't really tell. It was like a black box. And the logs that were there were, you know, billions of logs in a matter of minutes, and now trying to make sense of those, and trying to go from one execution part to the next one and the next one, trying to connect those dots, was possible, but it took a lot of time. I ended up writing additional code on the fly to add logs, deploying that into the customer's production environment ad hoc. And overall, it took a couple of hours. But that was a very stressful environment and experience.

[00:03:21] Chapter 3: Founding Lumigo and Addressing Serverless Challenges

Mirko Novakovic: Yeah. And every developer knows those moments, right? When you start hacking, redeploying things on the fly. Yeah, that's a stressful situation, and you need observability for that. And so then you founded Lumigo. And by the way, I'm an angel investor, so I'm a big believer in Lumigo and the story. But when you started, when I also invested, it was pretty much a company built around a new concept of serverless functions, I would say, like AWS Lambda, and you built an observability tool that was specifically built for that type of applications, right? Serverless applications. There were, I think, at the time, multiple startups, a lot of them out of Israel, actually, multiple companies like Epsagon, which was acquired by Cisco, and others that built that. So tell me what you saw as an opportunity there, and how observability for Lambda, or for functions, is different.

Erez Berkner: Yeah, 100%. So I think it's a straight continuation of that story. So the thought of how frustrating the monitoring and observability and tracing of an environment is, I felt it as a developer. And then later on, toward 2016, 2017, I got to deal a lot with, I mean, actually my co-founder, who's our CTO, and I dealt a lot with modern cloud environments, and specifically serverless. But when I say serverless, 100%, it's the Lambdas, it's function as a service, but we're looking at an even wider definition: what I define as managed services. So it's Lambda, but it's also managed databases like DynamoDB, and managed queues like SNS or SQS, API Gateway and Kinesis. And I'm giving AWS examples. But I think the realization was that we were in an environment that was so advanced, with managed databases, managed services of AWS as an example, the cutting-edge technology, and yet we still needed to go and debug it like we used to debug 30 years ago, with console.log, you know, trying to piece together what's going on, with black holes in the middle, like: what happened within my managed message queue? That was a huge discrepancy for me. A huge conflict. How come such an advanced technology has such basic abilities when it comes to debugging, monitoring, tracing? So that's kind of what drove it: there must be a better solution. And we couldn't find a better solution. And that's what drove us to start working on Lumigo and figure out what we can do to help the developer community.

[00:06:17] Chapter 4: Instrumentation and the Role of Managed Services

Mirko Novakovic: Yeah, absolutely. I can tell you from a different perspective, because I was still CEO at Instana when you founded Lumigo, and when AWS Lambda came out, we looked at it and said, oh, I don't know, is that a big enough market? And then we figured out that our whole agent technology would not work there, right? Because you can't really install an agent, right? And it's super lightweight. And there are problems around start times, because a function is spun up pretty quickly, then it's going away; there are warm-ups, and there's a ton of different things. And this is, as you said, cutting-edge technology, but the classical things, even the ones we built for containers at the time, didn't really work for Lambda, right? So we didn't really do it. We said, okay, it's too small. And then it grew pretty fast. But that was the chance for companies like Lumigo, right? So how did you address all those problems that Lambda had, with different instrumentation technology? How did you do that?

Erez Berkner: So maybe taking one step back: what do we want to get to? We want to get to full observability. We want to get to the ability for a developer, for an SRE, to understand the end-to-end story of every request. And it's especially important in the modern environment, because you have microservices; you have ten, 15, 20 services in a single request. And it's critical for you to piece those together and build the story of what happens step after step. That's how you debug: in a monolith you have a stack trace; in microservices, you need to build your virtual stack trace, let's call it. And then when you look at a modern environment, and let's look at, you know, a typical AWS environment, you have a mix of compute services, containers, Lambda, EC2, and managed services. So the nice thing about compute is that there are different ways to connect and to trace, especially when we're talking about agents, like you guys did in Instana. That works great for the containers and works great for EC2; they are a very big market. But, as you said, it doesn't work for managed services. For Lambda there were some pretty cool technologies by AWS, for example Lambda Layers and Lambda Extensions, that we worked on in order to integrate our agent, our code library, into the Lambda, so the Lambda can emit the right spans and traces and tell us what's happening. The other side, and that's really a hard one, is: what do you do with those managed services, the managed services that you cannot deploy an agent on, like you said? We honestly cannot plug into them.
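The idea Erez sketches here, a code library wrapped around the Lambda handler so every invocation emits a span, can be illustrated roughly like this. This is a hedged, minimal sketch; the decorator, span fields, and service names are illustrative, not Lumigo's actual library or AWS's API.

```python
import json
import time
import uuid
from functools import wraps

def traced_handler(service_name):
    """Wrap a Lambda-style handler so each invocation emits a span
    recording service name, start time, duration, and outcome."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event, context=None):
            span = {
                "span_id": uuid.uuid4().hex,
                "service": service_name,
                "start": time.time(),
            }
            try:
                result = handler(event, context)
                span["status"] = "OK"
                return result
            except Exception as exc:
                span["status"] = f"ERROR: {exc}"
                raise
            finally:
                span["duration_ms"] = (time.time() - span["start"]) * 1000
                # A real tracer would ship this to a collector;
                # here we just print the span as JSON.
                print(json.dumps(span))
        return wrapper
    return decorator

@traced_handler("order-service")
def handle(event, context=None):
    # Hypothetical business logic for the example.
    return {"statusCode": 200, "body": event.get("order_id", "none")}
```

In a real Lambda Layer or Extension, the wrapping happens automatically at load time instead of via an explicit decorator, which is what makes the instrumentation "no code changes" for the developer.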

[00:09:06] Chapter 5: OpenTelemetry and Developer-Centric Observability

Erez Berkner: You cannot deploy code into them. You know, it's a closed garden, and it's a really big problem. And by the way, it exists across the board, and it's just getting worse, in the sense that more and more developers are using more and more managed services. The cloud providers adopted this concept of pushing forward managed services because it makes sense for everybody, for the cloud provider, for the developer, and it's the right approach. But how do you observe that? Observability stayed behind in terms of the infrastructure supporting those services. So what we ended up doing is basically looking at, let's say there's a queue in the middle, let's say an SQS, which is managed by AWS. We ended up getting the traces of, let's say, what the Lambda before the SQS sent to the SQS, and what was the next hop of the SQS: SQS sent a message to another Lambda. So we're able to see before and after, and through what we developed in our correlation engine, we're able to infer that this was actually a call from one Lambda to the SQS that raised the call to the other Lambda. So basically we managed to figure out the black holes in between the managed services. And that's a major, major part of modern cloud architecture with managed services. I think that's the aha moment, even beyond the Lambda, that we managed to achieve: to, you know, show at the end the full trace, including the managed services, instead of this being cut into different islands that are not connected, if that makes sense.

Mirko Novakovic: Yeah, absolutely. I mean, I'm a big fan of instrumentation and these things, so I totally get the problem, right? So say you have two Lambdas, and in the middle is SQS. But SQS is managed, so you can't install anything on it, right? And you can't do anything. So what you have to do is, on the one side, you have to instrument something into the message header or whatever, right? You have to understand where to put something so that on the other side, where you read it, you can have that metadata with the correlation ID, so that you can have this end-to-end correlation, right? Which means you have to literally, for each of the services, understand the details of the protocol, of where to inject something. And it's actually a pretty challenging technical problem. And you have to do that for different languages, right? Because the Lambda could be a Go function, could be JavaScript, could be anything, right? And that's kind of the thing you have to build as a vendor, right, to support all the technologies. You have to build all the libraries to do that correlation, so you have an end-to-end view, correct?
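The header-injection technique Mirko describes, writing a correlation ID into message metadata on the producer side and reading it back on the consumer side, might look roughly like this for SQS-style message attributes. A simplified sketch under assumptions: the attribute name `TraceId` and these helper functions are hypothetical, though the `{DataType, StringValue}` shape follows SQS's MessageAttributes format.

```python
import uuid

def inject_trace_context(message_attributes, trace_id=None):
    """Producer side: stamp a correlation ID into the message
    attributes before the message is sent to the queue."""
    trace_id = trace_id or uuid.uuid4().hex
    message_attributes["TraceId"] = {
        "DataType": "String",
        "StringValue": trace_id,
    }
    return trace_id

def extract_trace_context(message_attributes):
    """Consumer side: recover the correlation ID so the receiving
    span can be linked back to the sending span in one trace."""
    attr = message_attributes.get("TraceId")
    return attr["StringValue"] if attr else None

# Producer injects before calling send_message(...)
attrs = {}
sent_trace_id = inject_trace_context(attrs)

# ...and the consumer extracts it from the received message.
received_trace_id = extract_trace_context(attrs)
assert received_trace_id == sent_trace_id
```

As the conversation notes, a vendor has to repeat this for each service's protocol (SQS attributes, SNS message attributes, HTTP headers, Kinesis partition metadata, and so on) and for each runtime language.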

Erez Berkner: 100%. And I think the really cool thing that we found out, and it took us a lot of work and years and the expertise of the team, is that we're able to correlate most of the managed services without changing anything in the headers and the payload. And that's the cool stuff, because we prefer, of course, not to change anything. And we found out that if you drill down, you know, into the headers enough, in most of the cases you're able to find and hold on to some unique ID of the record, of the message, that exists on both ends of the message queue. So if you have that, it's not across the entire transaction, but let's say it connects these kinds of three-service pieces, and out of those pieces we were able to build, you know, long transaction chains. So that's, I think, one of the cool things: figuring out how you can do this without changing anything in the data or headers of the actual traffic.
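The payload-untouched correlation Erez describes, finding an ID that already exists on both sides of the queue (for example the message ID the queue itself assigns) and joining spans on it, can be sketched like this. The span shapes and service names here are hypothetical, not Lumigo's actual data model.

```python
def correlate_spans(producer_spans, consumer_spans, key="message_id"):
    """Join producer and consumer spans on an ID that naturally exists
    on both ends of the queue, without touching headers or payload."""
    consumers_by_id = {}
    for span in consumer_spans:
        consumers_by_id.setdefault(span[key], []).append(span)
    chains = []
    for span in producer_spans:
        for match in consumers_by_id.get(span[key], []):
            chains.append((span["service"], match["service"]))
    return chains

producers = [{"service": "checkout-lambda", "message_id": "m-1"}]
consumers = [{"service": "billing-lambda", "message_id": "m-1"},
             {"service": "audit-lambda", "message_id": "m-2"}]

# Only the spans sharing message ID m-1 are linked into one chain.
print(correlate_spans(producers, consumers))
# → [('checkout-lambda', 'billing-lambda')]
```

Chaining many such pairwise joins, hop by hop, is how short producer/consumer segments can be assembled into the long end-to-end transaction chains mentioned above.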

Mirko Novakovic: And that's actually good. I mean, I have done agent development and you always think it has no side effects, but then at some point it turns out that everything you add has some side effects, right? It's always better to not do it. I totally get it. Out of curiosity, how do you see OpenTelemetry on the AWS side? Do you see a way that they will add the OpenTelemetry instrumentation into those services, so that at the end you can use OpenTelemetry and the standards there? Or is it still very early at AWS?

Erez Berkner: Yeah, I think it's a great question. I love OpenTelemetry. I think it's definitely the way to go, and I think more and more organizations are aligning on OpenTelemetry. So, 100%, I think at some point OpenTelemetry is going to figure this out and be able to identify all the, you know, the hundreds of services out there and close that gap. The community is going to do that over time, but the community didn't do it in the last five years, for example. So there's still a very big gap in OpenTelemetry, I think, in two main things. Number one is the actual services that support it. So let's suppose I have DynamoDB and DynamoDB Streams; I don't see both sides of that in OpenTelemetry, and there are many of those services. So there are significant gaps that are going to get closed, but I think it's going to take several years at least. And the second thing: you know, we're meeting developers every day. With OpenTelemetry, getting the right libraries, the right runtime, the stable versions, it's still very hard today. You really need to know what you're doing, and you need to own this, and you need to keep maintaining it. And that's still a big, big challenge today when you come and you want to deploy OpenTelemetry on your own project. So that's the second thing where I think we still have a way to go with OpenTelemetry.

Mirko Novakovic: No, absolutely, absolutely. And talking about developers, you mentioned that you're primarily focused on developers, right? Where I would say most of the observability tools are more focused on SREs and the operations part, right? Which, by the way, historically, I can tell from my side, was also because the budgets were there, right? And the vendors always go where the budget is, right? And developers didn't have too much budget for that. But that's changing, right? The more we shift left, this is changing. So how do you see the difference between a developer-focused observability like Lumigo and a more SRE-focused, dashboard-focused tool like, I don't know, Grafana, for example?

[00:15:29] Chapter 6: Expanding Lumigo's Reach to Kubernetes and Beyond

Erez Berkner: Yeah, I think it's a great question, because, you know, when we started Lumigo, one of the things was, we couldn't find a tool. You know, there were the Datadogs and the New Relics of the world, but we felt like this was not good enough for us as developers. You always had to fall back on the logs to figure out what's going on. As a developer, you know, an SRE has a lot of ability to create SLAs, etc., and load and CPU and memory graphs and alerts, and many of the issues are just that: just, you know, increase the size of your cluster. But when it comes to actual bugs and issues you need to debug, you have to actually see the data. So one of the cornerstones of Lumigo is not just building the end-to-end story of a request, so you can see it from beginning to end, but also allowing the developer to see the data that actually passes from one service to another. So if I have a database or a message queue in the middle, in Lumigo I can actually see what that message was. And I can see that this was, let's say, somebody made an order in UberEats, and I can see that order ID passing as part of the data stream. And then I can see all of a sudden that, whatever, the ID is something that is illegal, just as an example. So I can actually look at this and understand what went wrong in the data in motion, beyond just seeing the services that are talking to each other. So I think one major thing that differentiates developers from SREs, and from other tools, is the ability to see the HTTP request, the payload, the HTTP response, you know, the query to the database, the response from the database. This is all available for the developer postmortem. So you can say: one hour ago, this specific request, go in, show me what happened, show me what was the response of every service, so I can actually debug. That's maybe one of the biggest things we do differently.

Mirko Novakovic: Okay. You probably know Ilan Peleg from Lightrun. He was on my show, and they have a very developer-focused observability also. I think what you built is something based on classical tracing, etc., that adds payload to those traces so that you have more visibility, where you can even go a step further, which I like, right? Lightrun is more like debugging, where you see all the little details of the fields on a method level. But I like the way you approach it. I would say it's still low level, right? But you can scale it, right? It can be under load, and you add the payload, and the developer gets the context of a trace at a more detailed level, right? Because you have the HTTP request, you see the payload of a message, etc. That's pretty cool.

Erez Berkner: You touched on two very important points. One is, it would be good to find a way for this to be on by default, because I take you back to my credit card story: I know where to go and see, and I want to see what happened, you know, ten minutes ago. If you have this on by default, then, you know, you're not sending too much data about every line, but you are able to see what happened postmortem. So I think that's a very good position to be in if you can get this. And second, and you mentioned this as well, is what happens in terms of scaling or deployment. This is another very painful thing in observability in general, with OpenTelemetry but with most vendors out there too. And I think when you ask about developers, the second thing, beyond just having, you know, the payload, is the ability to deploy this quickly and show value quickly. Lumigo is first and foremost a product-led tool. So, you know, it's word of mouth; it's people coming in, trying this for free, and getting value in their environment after 5 or 10 minutes. So the idea is, it's not a deployment project of two months. It can be as simple as ten minutes, automated, no code changes, because it's all based on modern automation and APIs. And then when you scale, it attaches to the new workloads. So you don't need to keep running after it, remembering to add this here or OpenTelemetry over there. There's automation that keeps you safe and protected all the time. So it's a lot about: you don't need to maintain it. Go, you know, develop your stuff; we'll be there to make sure you're covered in terms of observability. So that's another thing that I think is different when it comes to developers.

Mirko Novakovic: No, absolutely. Also, I mean, developers just don't want to do these two-month POCs anymore, right? They want to log in, register, and, as you said, get up and running in ten minutes, right? If they can't, they will switch to the next tool and try it out, right? It's just the new way developers approach software, in my point of view. And then, just going back to your story: what I saw is you started on the Lambda side, on the managed side. But, I mean, you can tell me, but probably you saw that applications are more complicated than that. And that's what everybody figures out who's in observability, right? And then the whole microservices world; Kubernetes is always a part of it, right? So probably people have some Lambda functions, they use managed services, but they also have their own services running on Kubernetes, and then they have something on EC2 and whatever. So then you moved up the stack, or you widened the stack, and you built an operator for Kubernetes to automate that part too, right? Tell us a little bit about the story, why you moved there and what the next step was for Lumigo.

Erez Berkner: Yeah. We started in 2018, by basically covering the biggest, most immediate problem we saw in that desert, and that was serverless tracing, serverless troubleshooting. It was so bad that we felt we could create a very big impact in a very short time, and I think we did, alongside other companies. And then, I remember, in 2017, 2018, there were some debates whether the world was going to be 100% serverless or whether it was going to be a mix of technologies. And I think, like always, nothing is absolute, and the world is a mix of technologies. So serverless doesn't replace everything; it just adds abilities to existing architecture. So we've seen a lot of requests coming from our customers, serverless customers, saying: yeah, but, you know, this is only part of our environment, or we have the other environment over there, which is Kubernetes or ECS, or even EC2 or, God forbid, on premise. And it would be great if we could extend and see the entire environment. So that was, I think, a natural development: to support Kubernetes and containers in general, and just plain virtual machines. And the concept was, we want to make sure that we keep providing what Lumigo does best, and that's observability and troubleshooting for developers, so the payloads, the connectivity and the full tracing, now beyond the serverless environments. So we support containers and Kubernetes, we support EKS, we support most of the main managed and non-managed services out there. And you can trace with Lumigo much beyond your serverless environment. You can practically trace all the common services out there, regardless of where they reside.

Mirko Novakovic: Yeah. So it's the obvious move, to go to the microservices world. And I think we are also seeing that, right? That customers are using every technology, right? I mean, Lambdas have really good use cases, right? Actually, the whole serverless world is becoming more and more popular still, I think, right? Also on the database side. But then you will have a mixture of microservices, on-prem applications, cloud applications, etc., right? So that's kind of the thing: as an observability vendor, over time you have to support more and more, right? Because you have to grow into the customer environments.

Erez Berkner: Correct.

[00:24:18] Chapter 7: AI in Observability: Lumigo's Approach and Future Prospects

Mirko Novakovic: And now there's also a whole world that's called AI coming up.

Erez Berkner: We managed to get, Mirko, 28 minutes without mentioning AI. I think that's a new record. It's a record.

Mirko Novakovic: I try to do it always more towards the end, right. But, I mean, you are mentioning AI, right? I think you're using AI in Lumigo. And so I wanted to get your take on it. So first of all, how do you see AI inside of your customer base, and what does that mean for observability? But secondly, how are you leveraging AI in Lumigo to create value for customers?

Erez Berkner: From what I'm seeing in our customers, the hype is a bit more advanced than the actual usage. I'm saying that because, in many cases, there is an interest in AI, to employ AI and the features it brings with it. But we see different customers, especially the big ones, that have, you know, this AI committee, for example, that needs to approve AI for developers to use, for their organization to use. And in some cases, it's actually blocking usage inside the organization until they figure it out. And in many cases, they haven't yet figured out the procedure to approve AI. So I think a lot of organizations are still in experimental mode, even in the approval cycle of: can we use AI in our organization, and what does it take, and who approves it, and what is the process? I think in SMEs and startups it's much easier, naturally, like everything else. And over there, I think you'll see more and more usage, by developers first and foremost. So I think that's what we're seeing in many of our customers. I believe that AI is going to change the observability landscape drastically; it's probably going to change almost every domain. But AI wasn't created equal, in the sense that AI in observability is different than AI for e-commerce in general. The main thing that we're doing with AI is trying to digest and help the developer do what he needs to do: find the issue and fix it faster.

Erez Berkner: As simple as that. We built our AI agent, which is now in beta; we're going to go into general availability pretty soon. We tried to build it to be, let's call it, your architect friend, or a very experienced developer. It's going to help you, guide you toward what's the issue and how to fix it. So, you know, finding the root cause through AI, through natural language: what happened here? What is the root cause? We found out that the AI is very good at explaining what happens step by step in a failure. And I think the reason for that is not because, you know, we developed some crazy, unique LLM that nobody else has seen. It all boils down to the data. And I think, you know, we see the race to AI now, but I think very quickly this will turn into a race to data, because your AI is always going to be bounded by, and only as good as, your data. And because in Lumigo we're developer oriented, and we already provide the payloads of hundreds of services and the HTTP requests and responses, and we're able to feed that into our AI agent, the AI agent can be very strong and give you insights that it couldn't have if you didn't have that data. So I think this is where Lumigo AI is different than just adding an AI to a tool: it has access to a lot of data that the developer needs in order to find the issue, and then it can help you figure out what happened.

Mirko Novakovic: Yeah. So the fact that you have the payload, right, and you have the concrete fields and data and messages, now helps you to provide data to an AI that can then help the developer do the work of analyzing that data, right? And if you don't have the data, which most tools don't, right, that's also important to understand: normally you don't have all the payloads, so then you can't really do that, right? So I like it, right? I agree with you. I think, at the end, if you hear and read what people say, right, there's essentially only one internet to train an LLM on, and the current LLMs are basically trained on the whole internet. Then you need to get to some other data, right? Observability data, in your case, can be that data that's not publicly available to anyone. Only you have that data, and you can use it for an LLM. That makes total sense to me. How do you deal with the cost side? Is that an issue? I mean, obviously we also use LLMs at Dash0, but I always think it is sometimes hard to predict the cost level, right? If you use it too much, or you can't use it for every request, right? So you have to be careful how you use those things. How do you see that?

[00:29:34] Chapter 8: Challenges and Potential of AI in Observability

Erez Berkner: We're still experimenting on this, in the sense that we haven't yet figured out the exact pricing model for it. I can say a couple of things. Number one, our approach currently is to charge by the number of answers, basically the number of requests, to make sure that there's always a good alignment between our cost model and what the customer pays, and allowing it to grow over time as needed. The idea is that, you know, if you get value out of it, you're going to use it much more. So it's valuable, so you're able to pay more, and then you can buy a bunch of credits to use it. I think the other interesting part is that we see, at the end of the day, that the AI really reduces the time that it takes for a developer to work. In almost all cases, with a couple of dozen people using it as part of the beta, the statement is: this is saving me a significant amount of debugging time. As long as we can articulate that, and as long as we can maintain that, it will allow us to, you know, help articulate why this is something that equals money for the organization. And I think this is how this is going to move forward.

Mirko Novakovic: Yeah, absolutely. I also still think, if you look at something like ChatGPT, sometimes you read that every request costs $10, right? If you would pay $10 per ChatGPT request, it is kind of questionable whether you would really do it. So at the end, we have to align cost always with the value. So that will be an interesting part. You said that you think that AI will really transform observability, right? So this is the start; what do you think is the end scenario, in your point of view? Is that agents doing the whole work and even fixing things in production? What do you think? What will be the result over the next years?

Erez Berkner: I'm very bad at guessing what's going to happen in seven years, so allow me to stick to the next year or two. Sure. I think what we're going to see is the AI basically replacing, let's call it, the manual hard work that the developer had to do. For example, there was something that somebody wrote to the database two days ago. It's not part of this transaction, but somebody wrote it two days ago. The developer today needs to go and piece together that there was a transaction that was written two days ago and this one that is now failing. This takes time, but all the data is there. So the AI, and we actually see this today, can tell the developer: listen, this is failing because somebody wrote something wrong to the database, but I'm not stopping there. Two days ago, there was a customer named X that actually used this, and as a result of that, that was written to the database. That's the root cause. So piecing this together, I think, is the next step that we're going to see.

Erez Berkner: And this is really taking the developer to the next level. But following that, and I can tell you this is something that we're now doing in our labs: if the AI can have access to your code environment, to your repo, to GitLab, to GitHub, then there's no reason why it's not able to actually suggest a fix for you. Because basically you're combining the production runtime environment, you have the data, you have the payload, and now, if I see the actual code, that's where the magic happens, because we can actually overlay them together. And then you can actually suggest fixes that you cannot arrive at by just looking at the code, or by just looking at the production data. And we actually have it running in our lab: as I said, here is a suggested fix for the problem at hand. I don't see this, by the way, being fully automated in the next year or two. I think there's always going to be the need for the developer to look, to confirm, to code review it. But I think it's going to be much easier as we move forward.

[00:33:48] Chapter 9: Conclusion and Closing Thoughts

Mirko Novakovic: Yeah, that sounds really good. I appreciate you sharing what you have already tested in the lab, and we will look forward to seeing that coming out as a feature in Lumigo. Thanks for joining the podcast. It was really fun talking to you, and I hope to see the next big steps of Lumigo soon!

Erez Berkner: Amazing! Thank you very much, Mirko. It was great fun and, as always, great chatting with you.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
