[00:00:00] Chapter 1: Introduction and Code RED Moment
Mirko Novakovic: Hello, everybody. My name is Mirko Novakovic. I'm co-founder and CEO of Dash0, and welcome to Code RED. Code, because we are talking about code, and RED stands for requests, errors, and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Ilan Peleg. Ilan is the co-founder and CEO of Lightrun, which is based in Israel. Lightrun is a developer observability tool, a continuous debugging platform that allows you to add logs, metrics, and traces to your production code in real time. I'm excited to speak with him today about that topic. Ilan, welcome to Code RED.
Ilan Peleg: Thank you so much, Mirko. Happy to be here, and thank you for spending the time with me.
Mirko Novakovic: Yeah, absolutely. I'm super excited because I really love that topic. I know we have talked many times already, but I think you are creating something special here. But my first question is always, what was your Code RED moment that really brought you into that space?
Ilan Peleg: So actually, it's a true story. It's really a true story. Back then, I used to work at a big cybersecurity company, and one of the new product lines we were launching was deployed in POC mode at one of the biggest telecom companies out there. And apparently, during the POC the system was crashing twice a day, and we simply didn't have any clue why it was happening. Now, we had the best of observability in place. I can name a few of them, Datadog for example, and, you know, logging simply was not sufficient, and there was not any exception shedding light on the potential root cause. We were kind of going in the dark. And I remember I was troubleshooting this issue for about ten days, eventually pulling in other team members to help me, because it became a super escalated situation. And we were simply missing visibility. So, you know, I wished I had a very specific log line over there, or I wished I could capture a snapshot at this very specific line of code where I wasn't sure what was going on inside, and there were so many different assumptions that I simply couldn't verify because I was missing the right data. And actually, following those tedious ten days of those servers crashing twice a day without us knowing why or when, I reached out to my co-founder and partner in crime, Leonid Blouvshtein, and we started whiteboarding, thinking, hey, there should be a much more agile way to define observability, but also to consume observability, from a developer perspective, because the issue was escalated quite quickly to us developers. So actually this Code RED moment eventually led to establishing Lightrun, and here we are, almost five years after that moment.
Mirko Novakovic: Yeah, that's a great story. And I really like it, because, as you know, I'm also an angel investor, I think you also do angel investments, and one of the things I really love about startups is when they solve a problem they had themselves. Right? Because then you really understand the pain. If you have been in this troubleshooting mode for ten days, as you have, with everybody standing around pushing you, asking why it's crashing, then you know it's a real problem and real pain, right? And then you want to solve it.
[00:03:38] Chapter 2: Importance of Developer-Focused Observability
Ilan Peleg: You know, I used to be a developer, and for the first time I felt like I was super critical to the business. If I failed, it was going to be a huge failure, because it was, and still is, considered one of the biggest telco companies out there. But at the same time, if and when I succeeded in troubleshooting those issues, I was going to save the day, save this business that was suddenly at risk. A lot of times when we talk about observability, we are speaking about resilience of live applications, live production workloads serving existing customers. Here it was a bit different, because it was a pre-sale process. I truly felt that I was becoming part of the sales team in a way, supporting this transaction, at least from the technical perspective. It was key for everything, laying the foundation for the later discussions. So yeah, exactly.
Mirko Novakovic: I think that's why I really like your approach of kind of developer focused observability, because, I mean, one of the stories we are hearing for more than ten years now is this shift left. Right. It's kind of what you described with developers becoming more business critical. Right? Because more and more tasks are being shifted left towards the developer. When I started as a developer, I coded and then I threw the code over a wall and I didn't care anymore. Right? And that's not how it works these days, right? You are also on call. You have to troubleshoot. You are responsible for security, for performance, for reliability. And then you need the right tools to troubleshoot, right. And to get visibility into the code when there is a problem.
Ilan Peleg: Exactly, Mirko. We see that trend coming from different operational perspectives. Think about it: developers are the ones who build everything that is running in the cloud infrastructure, yet they're completely disconnected from the operational cost, reliability, and security of those applications. So the whole idea of shifting left, not only observability but the whole operational context, is at some point making them much more accountable. First, bringing this visibility so they can become much more accountable, but then becoming much more proactive, finding issues much earlier in the software development life cycle, hopefully before they reach production, and containing the impact before they affect our customers. So I think this is one of the main principles around shifting left. I assume we can talk about it more later in this conversation.
[00:06:12] Chapter 3: Overview of Lightrun’s Features
Mirko Novakovic: Absolutely. I think we should start by getting a bit of an overview. I'm also a developer, I'm really technical at some point, and I would really like to understand more about how it works, because I saw a few cool things: you have a sandbox environment, you can attach to a process at runtime, you can change things, add logs in your IDE, and it's automatically applied to the running code. Some really cool, magical features which I would have loved as a developer. But I also always ask myself, okay, I understand how instrumentation works, but how are they doing it, right? So how did you come up with the idea? And give us a little bit of an overview of what Lightrun is and what you can do with it.
Ilan Peleg: Of course, of course. So indeed, Lightrun is about shifting observability left, basically. Observability is currently still considered quite an operational practice, dominated by the gigantic players such as Splunk, Dynatrace, Datadog, Elastic, and all of those. By the way, we do not claim to compete with them, rather the opposite: we can definitely take you, as a customer, one step further in your observability maturity. But we're going to get there a bit later. In short, we shift observability left in the sense that we bring observability data directly into the developer's IDE, where they edit their code and eventually start coding new features. In that space, over the last year alone, billions have been invested by some of the gigantic players such as Microsoft, and following those, some other great new vendors are rising up, such as Poolside and the like, which are all about shifting left and developer-centric GenAI solutions. So we've been here before, shouting out loud that observability must shift left directly into the IDE, but suddenly it's happening from other perspectives as well. So first, it's about bringing observability directly into the IDE to make sure that, getting back to your previous point, developers are accountable for whatever is going on on top of live applications. Second, it's about helping developers proactively define and consume observability from within the IDE against, or on top of, live production workloads. I will explain. As of today, developers are kind of running in silos, exactly what you just mentioned.
Ilan Peleg: Developers are required to define all logs, metrics, and traces, the so-called three pillars of observability, during development, then ship this code to production using CI/CD pipelines, whatever. Then, once the code is running and operating, it obviously emits logs, metrics, and traces, and the more traditional observability tools collect those, try to make sense out of them, analyze them, and monitor the health of your production workloads. Once things go wrong, developers are forced to find a needle in a haystack, assuming the needle exists, and a lot of times they compromise on what's called MTTR, mean time to resolve, simply because data is missing. Developers cannot really find the right needle. This is one symptom, I would say, which leads to this compromise on MTTR. So developers try to reproduce issues in their local, production-like environments. That's becoming more and more complex, especially in the cloud native era where you have a lot of moving pieces, a lot of non-deterministic state, so reproducing issues from a heavy, scalable production environment in a local one is something that is not always possible. Or, vice versa, the nightmare which is called hotfix versioning: let's have a hotfix only for the sake of adding visibility. So basically, all of those operational processes that are happening quite daily in the enterprise landscape are symptoms of the non-agility of how observability is being defined and consumed, at least as we see it.
[00:10:14] Chapter 4: Technical Aspects of Lightrun
Ilan Peleg: So using Lightrun, we streamline the process, allowing developers to connect from within the IDE to all of those live applications running across the SDLC, starting from everything remote like QA and staging up until production, and to add logs, metrics, and traces to any single point of execution in real time, on demand. Just choose a very specific line in your code and add this new log that is missing, add this metric that is missing, and so forth. But at the same time, while doing so, you can also stream this data back into the IDE. So we are removing any context switch between dev tooling and operational tools, which, as said before, are considered the more traditional observability tools. So just to sum it up, the whole shift-left approach is first about this super expensive real estate of the developer, which is the IDE, bringing observability there. Second, it's about transforming observability to be not only developer-native but also real-time and free of operational overhead, and bringing it into much earlier development life cycle stages rather than mainly post-deployment, so developers will adopt much more of a proactive and preventive approach: not only handling incidents much faster, slashing mean time to resolution, which is the holy grail metric of the whole industry, but also preventing those incidents before they reach production.
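(To make the mechanics a bit more concrete for JVM readers: the standard building block for this kind of on-the-fly instrumentation is the java.lang.instrument API, which lets an attached agent retransform classes that are already loaded. The sketch below only illustrates that general mechanism under my own assumptions; the target class name is made up and the actual bytecode rewriting is elided, so this is not Lightrun's implementation.)

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Minimal sketch of a dynamic-attach JVM agent. The agent JAR would need
// "Agent-Class" and "Can-Retransform-Classes: true" in its manifest.
public class DynamicLogAgent {

    // Invoked when the agent is loaded into an already-running JVM.
    public static void agentmain(String args, Instrumentation inst) throws Exception {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain domain, byte[] classfileBuffer) {
                // A real agent would rewrite the bytecode here (e.g. with ASM)
                // to insert the requested log/snapshot at the chosen line.
                // Returning null means "leave the class unchanged".
                if ("com/example/CheckoutService".equals(className)) {
                    System.out.println("[agent] could instrument " + className);
                }
                return null;
            }
        }, /* canRetransform = */ true);

        // Run already-loaded classes through the transformer again, without a redeploy.
        for (Class<?> c : inst.getAllLoadedClasses()) {
            if (c.getName().equals("com.example.CheckoutService") && inst.isModifiableClass(c)) {
                inst.retransformClasses(c);
            }
        }
    }
}
```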
Mirko Novakovic: Yeah, totally makes sense. Totally makes sense. But let me try to say it in my words, if I understood it correctly. Normally, observability tools as I know them work with an agent, be it an OpenTelemetry agent or something else. It will instrument the code based on some, I would say, global knowledge, right? This is a database call, this is an HTTP call, this is this framework. And it's very generic, right? This is called distributed tracing. It gives you a good sense of how microservices call each other, but it doesn't really give you a sense of the exact execution in the code. There could be different if-then-else branches, right? And also variables, the variables that you have in your debugger. So as far as I understand it, if I want to figure something out, I can take Lightrun and connect it, for example, to my JVM. I know that there is a problem in a microservice, for example, because one of my observability tools told me, and then I can go into my IDE, click into that method, add logs or other things, spans, traces. And then what would I get? The execution time, but I also get the variables, like in an IDE, and it will show me the variables out of production.
[00:13:00] Chapter 5: Integration with OpenTelemetry
Ilan Peleg: Exactly. Think about the following fact: every developer in the world uses a debugger.
Mirko Novakovic: Yeah.
Ilan Peleg: Right. But a debugger definitely meets the requirements of local-machine development; it doesn't hold for cloud, cloud native, everything that is running remotely. And we all know that not only production but development itself is moving to the cloud. Using a debugger there is simply not applicable, for numerous reasons. Leave aside all the operational aspects of running it in production, from a security standpoint, reliability, and so forth, which simply don't apply; even much earlier, it doesn't support microservices, it doesn't support cloud native architecture. It's mainly one-to-one in terms of connectivity to the process itself. Second, it halts the process, right? Once you set the breakpoint, it breaks the service. So you cannot really gain the same debugging experience on top of live, remote, distributed applications. So what we do, at least what we started with, is offering the first cloud native debugger, which means setting snapshots rather than breakpoints, meaning we capture the data for you. Everything that you used to gain using a traditional debugger, you will still gain with Lightrun. We will not halt the service.
Ilan Peleg: We will not stop or break the service. This is how we started, this was the first feature we ever released, but then it really became a dynamic, developer-focused, real-time observability solution by which we allow developers to add not only snapshots but all three pillars of observability, for different troubleshooting purposes. So just to sum it up, think about a debugger-like experience for live production workloads, adjusted for everything that you can think of from an enterprise perspective. That is: how do you guarantee safe, secured access? How do you handle sensitive data? You see variables, right? A variable might be PII or some sort of customer serial number; you wouldn't want your developers to gain access to it. So we take this quite native approach that developers are used to when they develop their code, but extend it and make it much more reliable, much wider, or I would say much deeper, into the SDLC, both from the developer experience side and from the enterprise side.
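(As a rough illustration of the "snapshot instead of breakpoint" idea, here is a conceptual sketch in plain Java, assuming a simple bounded buffer and a manual capture() helper; a real tool would inject the capture automatically and ship the data out of process, and nothing here reflects Lightrun's internals.)

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A "virtual breakpoint": record the interesting state and keep running,
// instead of suspending the thread the way a classic debugger would.
public class Snapshots {

    public record Snapshot(Instant time, Map<String, Object> variables, StackTraceElement[] stack) {}

    // Bounded buffer so capturing can never exhaust memory in production.
    private static final BlockingQueue<Snapshot> BUFFER = new ArrayBlockingQueue<>(1000);

    public static void capture(Map<String, Object> variables) {
        Snapshot snap = new Snapshot(Instant.now(), variables,
                Thread.currentThread().getStackTrace());
        // offer() drops the snapshot if the buffer is full; never block the app.
        BUFFER.offer(snap);
    }

    // Example usage at the line a developer is interested in:
    static double applyDiscount(double price, double discount) {
        capture(Map.<String, Object>of("price", price, "discount", discount));
        return price * (1 - discount); // execution continues immediately
    }

    public static void main(String[] args) {
        System.out.println(applyDiscount(100.0, 0.2));
        System.out.println("captured snapshots: " + BUFFER.size());
    }
}
```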
Mirko Novakovic: Yeah, and I can see that you must go very deep into the runtime environment, right? I mean, you have to really understand the runtime, and it would also be very different if you do that for Java or .NET. So which runtimes are you supporting? And I guess you probably started with one, Java or something, and then you added more. So how was that experience?
Ilan Peleg: Exactly. So we definitely had to develop a proprietary agent or SDK for each runtime, because each runtime has its own instrumentation, its own bytecode mechanisms that you have to build on, and you also have to optimize those for highly scalable, enterprise-grade requirements. So as of now, we support most of the back-end runtimes out there, which are the JVM, Python, Node.js, and obviously .NET, and most of the IDEs accordingly, meaning the JetBrains family, VS Code, Visual Studio, and so forth. To your point, it's not only about allowing developers to instrument new additional observability pillars at runtime, on the fly; it's also about making sure that this product is read-only. You wouldn't want to break any SDLC principles, any change management, or any compliance requirements. So you must verify, or we must verify, and we do; we have some patents on that technology, which is about making sure that the product is read-only. In short, under the hood we have literally developed some sort of mini interpreter which is in charge of validating that all of those changes are read-only, which is quite an interesting technology by itself.
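(The read-only guarantee described here is patented Lightrun technology, so the following is only a toy illustration of the general idea, under my own assumptions: a whitelist-based check over injected expressions that rejects anything that could mutate state, such as assignments, increments, or non-whitelisted method calls, before the expression is allowed to run.)

```java
import java.util.Set;
import java.util.regex.Pattern;

// Toy illustration of a "read-only" guard for dynamically injected expressions.
// A real implementation would analyze the parsed expression or its bytecode,
// but the idea is the same: reject anything that could mutate program state.
public class ReadOnlyExpressionGuard {

    // Only side-effect-free accessors are allowed to be called.
    private static final Set<String> SAFE_CALLS = Set.of("size", "length", "get", "toString", "isEmpty");

    // Matches ++, --, or a lone '=' (assignment) that is not part of ==, !=, <=, >=.
    private static final Pattern MUTATION = Pattern.compile("\\+\\+|--|(?<![=!<>])=(?!=)");

    private static final Pattern CALL = Pattern.compile("(\\w+)\\s*\\(");

    public static boolean isReadOnly(String expression) {
        if (MUTATION.matcher(expression).find()) {
            return false; // assignment or increment/decrement
        }
        var calls = CALL.matcher(expression);
        while (calls.find()) {
            if (!SAFE_CALLS.contains(calls.group(1))) {
                return false; // unknown method call could have side effects
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isReadOnly("order.items.size() > 10"));   // true
        System.out.println(isReadOnly("counter++"));                 // false
        System.out.println(isReadOnly("cache.put(key, value)"));     // false
        System.out.println(isReadOnly("price == 100"));              // true
    }
}
```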
Mirko Novakovic: Yeah, I can see that. By the way, I don't know if you ever ran into this, it's a very technical thing, but we also did dynamic bytecode instrumentation with my former product for some things, which can be tricky. And we sometimes had problems with other agents that also add code at runtime and did not do a good job, for example by relying on fixed numbers in the thread, right? And then you add something, and the other agent breaks because you added something and it's not correct anymore. I could imagine that somebody could run you together with a New Relic or Datadog agent, and then you have two agents and both do instrumentation. Did you ever run into an issue with other agents?
Ilan Peleg: It's not something that happened very commonly, but I must admit that one of the key areas where we shine, or that we are proud of, is that simply because we deal with the runtime, you need to see a lot of environments in order to have, I would say, robust coverage of so many different edge cases in that domain. To be somewhat general here: so many different configurations, so many different runtimes, especially talking again about enterprises. What's happening there is that on the one hand they're still running some of the most legacy code out there, but at the same time they're trying to modernize. So suddenly you can be supporting an enterprise customer with everything from an on-premise deployment with an IBM stack up to a serverless architecture running in AWS, so many different configurations, or permutations of configurations, that you have to see in order to make this solution quite robust. And I think it took us at least two years to get to a maturity where we feel comfortable with where we are now. And it's something we are now super proud of, because a lot of things cannot be tested internally.
Mirko Novakovic: Yeah, I mean, we always joked at Instana that, even though we had a few hundred customers, I think we never saw an environment that was even similar to another customer's environment. Right? You would think that there should naturally be something like, okay, I have this Kubernetes version, this Java version, this database, and there is somehow a common stack. Nope. Never, ever. There was always something special, something different, and some edge case. And I can just second that: it took us really a few years and a lot of engineering effort from some super specialized experts to figure those things out and harden the agent so that it's really working well.
Ilan Peleg: Exactly. So we have quite a solid process for how we mature new releases and new features, how we adopt more of a progressive-rollout kind of delivery. So it's becoming more and more interesting as architectures keep changing. But yeah, Lightrun has been around for four to five years already, so we've seen, you know, everything. No, not everything, that is exactly the point. But we've seen some of the biggest enterprise landscapes out there, and I think this is what makes us quite unique: the knowledge of those specific environments for our specific context.
Mirko Novakovic: And so with Dash0, I've decided to go OpenTelemetry native. Right. I will not develop my own agents. I will use the OpenTelemetry standard. How do you think about OpenTelemetry? How does that affect your product? Are you supporting it? What is your view on this new standard?
Ilan Peleg: So we are totally aligned with this trend. First, even with OpenTelemetry we will still face some similar challenges, so the need will probably be there, and we will still be a complementary solution to existing vendors that build on OpenTelemetry. The thing is that we are thinking about integrating with OpenTelemetry. So first, you will be able to enrich existing spans dynamically and add any further cardinality into them. Second, we're about to release, very soon, a dynamic tracing feature. I think last time we met I showed you something, but you will be able to truly define code-level tracing from within your IDE, saying, hey, line 17 in a file named Mirko and line 19 in a file named Ilan, please capture the data over there and tie them back together under the same transaction, and also connect it to a span that I'm already tracking.
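(The Lightrun side of this integration isn't shown here, but the OpenTelemetry side is standard API: any code executing inside a recorded span can enrich it with attributes and events, or open a child span around a code region of interest. A minimal OpenTelemetry Java sketch; the tracer, attribute, and span names are just examples.)

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Standard OpenTelemetry Java API: enriching the currently active span and
// creating a child span around a region of interest.
public class SpanEnrichment {

    private static final Tracer TRACER =
            GlobalOpenTelemetry.getTracer("example-instrumentation");

    static void handleCheckout(String orderId, double total) {
        // Enrich whatever span is currently active (e.g. the HTTP server span).
        Span.current().setAttribute("order.id", orderId);
        Span.current().setAttribute("order.total", total);
        Span.current().addEvent("checkout.validated");

        // Or open an explicit child span around a code region of interest.
        Span child = TRACER.spanBuilder("apply-discount").startSpan();
        try (Scope scope = child.makeCurrent()) {
            // ... business logic under observation ...
            child.setAttribute("discount.rate", 0.2);
        } finally {
            child.end();
        }
    }
}
```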
Mirko Novakovic: Nice.
Ilan Peleg: The idea of adding further dynamic tracing using Lightrun, but also enriching existing spans, is something that we are already working on here at Lightrun.
[00:22:54] Chapter 6: Addressing Performance and Business Logic Issues
Mirko Novakovic: Yeah, that makes total sense. I mean, we have this discussion also, right? Just last week we had a discussion about, I wouldn't say it's a trend, but at least there are discussions. As always in our industry, people go from monolith to microservices: first they had one monolith, then they have 1,000 microservices, then they figure out that 1,000 microservices maybe is also not the solution. So now there's a trend that people are talking again about monoliths, right? Maybe in some cases. And to be honest, the more monolithic an application is, the less distributed tracing is really a good solution for it. Then you need more code-level observability, which you get with your tool. There's never black and white, right? It's somewhere in the gray zone how these applications look. But I can totally see how you use distributed tracing to get an end-to-end view, and then, if you need to, you can use Lightrun to make the boxes a little bit more light, right? Get more light into the box. And that's a really cool solution, right?
Ilan Peleg: Exactly. I'm going to tell you something: Microsoft recently released a paper in which they scanned all their incidents internally, quite widely, and then categorized them. They found that around 30% of the incidents were code-level. By the way, it was the biggest chunk, the biggest part of the cake; the other parts were configuration issues, infrastructure issues, network issues. So code-level issues were the biggest chunk. But what's even more interesting is that the TTD, time to detect, or the time to resolve those issues was significantly longer, by orders of magnitude. And it makes sense when you think about it, because finding what's going wrong in your code probably requires more people to be involved, especially, again, if we're talking about a huge enterprise landscape where you have everything from microservices and serverless to monolithic and legacy applications where some of the developers have left the organization. So finding a root cause at the code level might be more complex, and so is eventually pushing the change and verifying the new behavior. So I think this is something that we're quite proud of: existing observability solutions will give you a great sense of which service is probably behaving unexpectedly, some abnormalities in SLAs, SLOs, some alerting, so they'll guide you toward what might be the root cause, which service, which API, which gateway, and so forth. But when you truly need to debug those and get to the single line of code that is responsible for the root cause, assuming it's an application-level issue, this is where we shine. And basically we promise our customers: everything that is code related, this is where we can be super helpful. Again, getting back to your point: spans and tracing will assist you in verifying which service, which component in your system; when you need to dig inside and find the very specific line of code that is responsible for the issue, this is where you would probably use Lightrun.
Mirko Novakovic: Honestly, I would have thought that the number of code-related issues would go down. How I see it is, I mean, I started in an environment where you had 64 kB of RAM and a very slow CPU, and at that time almost every issue was a code issue, because if performance was low, you had to optimize the code really hard. Really hard, like down to, I was working on 3D engines where you thought about whether a shift left is faster than a multiplication, right? These little things. I think you don't need that anymore, right? These days, infrastructure and memory and CPU and GPU and network are almost unlimitedly available in the cloud, and it's also super cheap compared to what it was. So I always have the feeling that people these days don't spend too much time thinking about the code anymore, and they just say, okay, if it's slow, then I just spin up three more containers or something like that, right? But you say a lot of issues are still code issues; maybe this comes with new issues, right? That's probably what's happening.
Ilan Peleg: First, new issues and code changes. This is obviously one of the biggest reasons for issues in production in general. But I'm going to give you another perspective. I think what you mentioned is more related to performance issues, but what about business logic? What about: hey, I just made a transaction, and suddenly, instead of a thousand USD, there was some business logic mistake and we just added two zeros. Those things are very hard to identify using traditional observability, because you truly need to narrow down into the very specific business logic. So this is just to give you a better sense of the other spectrum of issues that might arise, which are not necessarily related to so-called traditional observability or to performance degradations.
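(A made-up example of the kind of bug he means: a charge path that converts to cents twice keeps returning 200s with healthy latency, so request/error/duration metrics stay green, and only looking at the variables at the right line reveals the extra zeros.)

```java
// Invented example of a business-logic bug that RED-style metrics miss:
// the service is fast, returns success, and still charges 100x too much.
public class InvoiceService {

    // Caller already passes the amount in cents...
    static long toCents(long amountInCents) {
        // ...but this conversion multiplies by 100 again: 1,000 USD becomes 100,000 USD.
        return amountInCents * 100;
    }

    static long chargeCustomer(String customerId, long amountInCents) {
        long charged = toCents(amountInCents);
        // A conditional dynamic log/snapshot on this line (e.g. when charged > expected)
        // would expose the wrong variable value without any redeploy.
        return charged;
    }

    public static void main(String[] args) {
        System.out.println(chargeCustomer("cust-42", 100_000)); // expected 100000, prints 10000000
    }
}
```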
Mirko Novakovic: No, no, it makes total sense.
Mirko Novakovic: You see that I'm too much in the classical observability mindset, where it's more about performance. But you're right: with the data that you have, you can also check for correctness, right? Because you have the variables, you understand what's happening, and you can actually see whether something is correctly executed or calculated. That comes with the granularity that you have, right?
Ilan Peleg: Exactly. And in fact, we see a growth in, I would say, the need to truly look inside your code running in production, because of a number of things that are happening. For one, there is the reliance on third-party code or services that you consume, integrations and so forth, which means a lot of data-related issues happen only in production, with live traffic. Those are very hard to reproduce in pre-production or even anticipate ahead of time in testing, because you have millions of transactions every time you reach out to another service, for clearing purposes or for other reasons. The whole idea is that you're using some third party's resources, and it's very hard to mimic those ahead of time or, again, reproduce that in some other environment. So this is one thing that is rising: the reliance on third-party code. By the way, let's talk about generative AI, suddenly there's reliance on, you know, LlamaIndex and LangChain, for example. Go figure what's happening in that black box. And the other thing is progressive delivery workflows and the whole DevOps culture that is emerging, simply leading to more and more testing-in-production practices. So even before you're facing a huge P1 outage, a lot of developers are getting closer to production simply because they're rolling out their features progressively. They'd like a more dynamic way of testing out those features, especially those that are not fully covered in earlier testing phases. So you're rolling out a new feature, you're rolling it out to 5% of your traffic, and you'd like to verify how it behaves: that the code flow behaves correctly or as expected, the performance, and the general correctness of this piece of code that you just pushed to production. You're receiving a lot of new traffic and you'd like to verify that the behavior is as expected. So these practices of testing in production are also on the rise.
[00:31:13] Chapter 7: Observability in High-Frequency Deployment
Mirko Novakovic: Yeah. That makes sense. Total sense.
Ilan Peleg: Yeah.
Mirko Novakovic: One of the things besides shift left is this high frequency of deploying code to production, right? When I go back to my career, when we started, I worked for a lot of enterprise companies; you had one or two releases a year, right? It was a very diligent process with a lot of quality assurance at the end, and then it still failed. And as you said, these emergency deployments, right? Or hotfixes. That was the funny part: you had two releases, but essentially you already had a lot of other deployments, because you released something and then you had kind of a side-by-side process, which was the hotfix, and then you had to hotfix a lot of stuff because it was not working as expected. It seemed like one big release, but it essentially was already a lot.
Ilan Peleg: Exactly, exactly. It's interesting, because I call it observability meets pipelines, right? You need a way to verify the correctness of those new features across the SDLC, especially when you just start this progressive rollout into production, and the dynamic nature of Lightrun allows you to add those observability pillars dynamically. You shouldn't really persist this data, because you're just testing out a hypothesis, you're just testing out your current rollout, your current deployment, and so forth. Obviously some of the existing logging and metrics should be persisted over time for historical purposes, for audit, product events, so many other use cases. But everything that is related to, hey, I'd just like to test it out, I'm not sure we'll have the same hypothesis to test tomorrow, I'm just running a very specific test right now, that's a different kind of use case. It's not directly linked to mean time to resolution of P1, P2 incidents; it's more around releasing new features confidently.
Mirko Novakovic: Yeah, makes sense. It gives you more security in pushing the button to deploy something, because you have a feedback loop now, right? Without that, you don't have that feedback loop. That makes total sense. Who are the customers that are using Lightrun? Are these more enterprise-type customers, bigger ones?
Ilan Peleg: Yeah. So we started with mid-market; you know, you want to move fast, you want to have fast product iterations and reach product-market fit. So at the beginning we started with some of the unicorn companies here in Israel, but quite quickly we started supporting some of the biggest enterprises in the world. We have a few Fortune 10 customers; I cannot name one of them, but we all use their products day to day. The enterprise market is interesting because, on the one hand, the cycles are much longer, but at the same time the impact is enormous, given the scale at which those organizations operate. You truly see that every minute of downtime at some of those organizations is valued at sometimes hundreds of thousands of USD.
Mirko Novakovic: It's also the size and number of developers, right? Last week I had a call with one of the big American banks, and the guy said, oh yeah, we have 40,000 developers. I mean, think about it: it's a bank, and they have 40,000 developers coding for them. And then you understand why developer-focused observability addresses a huge market, and it's also a huge impact, right? Make those developers, I don't know, 2% more effective. Think about 40,000 developers.
Ilan Peleg: Yep.
Mirko Novakovic: Making them a little bit more effective or faster in troubleshooting, not losing so much time, that's a massive ROI, right? So I can definitely see how you do your sales process.
[00:35:24] Chapter 8: Lightrun's Customer Base and Impact
Ilan Peleg: Yep, exactly. In fact, I might guess which bank you are talking about; I have a few names on my guess list. But the whole idea is that we're now prioritizing the financial market. We just completed some successful POCs and technical evaluations, and we are truly doubling down on the financial market, simply because, from a business perspective, their IT budgets are enormous and the cost of downtime is enormous. And exactly to your point, they employ not thousands but tens of thousands of engineers, so everything related to accelerating productivity suddenly becomes even more significant, and the MTTR and operational efficiency story is quite clear. Think about downtime for trading or markets or wealth management kinds of systems. The combination of all of those simply creates a super powerful impact that we can bring, and yeah, this is definitely one of the key markets or industries that we are now targeting.
[00:36:35] Chapter 9: Israel’s Unique Tech Ecosystem
Mirko Novakovic: Yeah. And to finalize this conversation, I have a very specific question about Israel, because it's interesting. I'm an investor also, I did some investments in Israel, and there are actually a lot of observability companies there. Just off the top of my head: Heptio, Lumigo, Coralogix, Lightrun. So there are a lot, and it's actually a pretty small place, right? Especially because most of it is in Tel Aviv. So my question is, I mean, I understand there's a lot of security coming out of the military and such, but what makes Israel so special for this type of startup? And also, because you don't have such a big market, is that part of the secret sauce, that you have to be global from day one? Because, I mean, you said you sold to a few unicorns in Israel, but let's be fair, you don't have too many of these large companies you can sell to. So at the end of the day, you have to go international pretty quickly, right?
Ilan Peleg: So I'm going to tell you a good story, and then we can extrapolate from it. When we founded Lightrun, one of the key areas we were looking at was how Lightrun would be adjusted to the new cloud native era, specifically to Kubernetes. So sorry, Leonid, but I will say it here: Leonid didn't have a lot of experience with Kubernetes back then, but apparently he had just finished developing an internal container orchestration system inside the IDF, because they didn't have access to Kubernetes, so they had to develop an internal Kubernetes-like solution. So within a few days he became more of a Kubernetes expert than most of the Kubernetes experts that I know, and I know quite a few. The same with databases: there were specific off-the-shelf databases he couldn't use, so those guys at the IDF, in the Intelligence Corps, had to develop their own internal databases matching their specific needs. What I'm trying to say is that the Israeli ecosystem is proud of a lot of talent that develops cybersecurity-related products internally in the IDF and then commercializes them, but it is also proud of developing its own internal infrastructure, simply because they don't have access to the internet, they don't have access to the outside world. A lot of those solutions are simply developed internally, and a lot of those, say, enterprise-level requirements also originate in their environments, without calling it an enterprise, without having off-the-shelf vendors supporting them. So they simply need to find a solution, they find the solution, and then they go outside and commercialize it. It happens quite similarly in the infrastructure landscape if we compare it, to your point, to the cybersecurity vendors, the cybersecurity startups that are super shiny out of here.
Mirko Novakovic: That's interesting. I was not thinking about it that way. So the IDF is not using off-the-shelf components for security reasons.
Ilan Peleg: Probably it does sometimes, and sometimes it's not possible. For example, that very specific unit, I'm not from there, I was serving as a professional athlete, by the way, that's another story for another podcast. But my point is, ask Leonid, my co-founder: they didn't even have access to Stack Overflow, they didn't have internet. So they developed everything they were developing without access to the outside world. Think about the skill set they had to develop by themselves, you know?
Mirko Novakovic: Yeah, that's cool. And that's a good point also for other countries, right? I'm always thinking, we have these discussions in Germany: how can we catch up? Basically, by understanding more of the fundamentals, which these days, to be really honest, you mostly only get at the big hyperscalers. The real fundamentals you learn at Google, Amazon, or Meta or somewhere, and you can't really learn them in Germany, because the companies here are building on top of that stack. But what you are saying is, basically, because in the IDF, in some places, they have to redevelop the whole stack, you get this real knowledge about the infrastructure, and that's why there are so many startups coming from there, because they just learned the hard things, right?
Ilan Peleg: I would say it's a good example of the challenges that those talented guys face quite early in their lives, which make them quite talented and skilled early in their journey, and together with some more character-based skills that they also gain early on, the combination is eventually quite successful in creating new ventures, speaking more generally now.
[00:41:41] Chapter 10: Closing Remarks
Mirko Novakovic: Ilan, it was great talking to you. It was really fun. And I enjoyed the conversation and I'm looking forward to come to Israel. I mean, I promised to come to Tel Aviv soon and meet. And then we should.
Ilan Peleg: You know that I'm waiting for you.
Mirko Novakovic: Grab some food and drinks. Thanks, Ilan.
Ilan Peleg: Thanks so much, Mirko.
Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.