Episode 12 · 26 mins · 11/14/2024

Code First: Developer-Led Testing and the Future of Monitoring Automation

Host: Mirko Novakovic
Guest: Hannes Lenke
#12 - Code First: Developer-Led Testing and the Future of Monitoring Automation with Hannes Lenke

About this Episode

Checkly CEO Hannes Lenke joins Dash0’s Mirko Novakovic to discuss his “monitoring as code” philosophy, how Checkly lets developers take ownership of end-user testing, and his view on AI’s role in the future of APM tools.

Transcription

[00:00:00] Chapter 1: Introduction and Guest Background

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I'm co-founder and CEO of Dash0. And welcome to Code RED: code because we are talking about code, and red stands for requests, errors and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Hi everyone! Today my guest is Hannes Lenke. Hannes is the CEO and co-founder of Checkly, a code-first synthetic monitoring tool. He previously founded TestObject and has worked in the testing and observability field for 15 years. I'm glad to have him on board today. Hannes, welcome to Code RED.

Hannes Lenke: Thanks, Mirko. Thanks for having me. Super excited to be here.

Mirko Novakovic: Yeah, it's great, two Germans talking on a podcast in English. That will be fun.

Hannes Lenke: Yeah, totally.

Mirko Novakovic: We always start the conversation with the same first question: what is your Code RED moment?

[00:01:04] Chapter 2: The Code RED Moment

Hannes Lenke: Yeah, I mean the Code RED moment. That's a good question, right? It kind of goes back to where I'm coming from. You already introduced me as working in testing and observability for a while — it's almost two decades now. When I founded my first business, TestObject, which later became part of Sauce Labs, and later on as a general manager at Sauce Labs, I worked with lots of customers, customers doing test automation on production. What did we do at Sauce Labs? We enabled customers to scale their test automation. That was meant for pre-production testing, though we saw lots of customers using Sauce Labs for production testing. Why was that? Back then and still today, staging environments are complex, right? It's quite complex to spin up a staging environment which is production-like, so if you do end-to-end testing it might be easier to do it on production, even though the system is already rolled out. Another reason was that even for larger enterprises like banks, insurances, etc., the staging environments were often not the most performant environments, so it was kind of hard to run hundreds of tests against that staging server, because it just wasn't able to handle all these requests.

Hannes Lenke: So in the end, development teams and QA teams back then decided, okay, let's run some of these tests against the production environment, really to make sure our app performs well. I've seen that happening quite a lot, and that kept me thinking: it's not only about testing, it's about automation, right? As an engineer, you should understand how your app is doing. Is it reliable, is it performing — even before it's rolled out to production? And is it still reliable later on, while it's in production? You might want to reuse the automation, the tests that you have already implemented for testing, later on for production monitoring: run these tests on a schedule and really understand if your app works from a user perspective. I've seen that happening for companies like SalesSource, and this was my moment to think, okay, there needs to be a tool which is optimized for production monitoring, which takes your existing end-to-end tests and uses them for production.

[00:04:00] Chapter 3: Synthetic Monitoring and Playwright

Mirko Novakovic: Yeah, that makes sense. And I don't know if I remember correctly, but I think Sauce Labs was basically JMeter automation in the cloud, right? They used JMeter scripts.

Hannes Lenke: Or rather, Selenium automation in the cloud.

Mirko Novakovic: Selenium. Yeah. Selenium. Yeah.

Hannes Lenke: Not JMeter. But you know, they're still around, and they're running lots of other frameworks now, etc. But it started with Selenium automation in the cloud.

Mirko Novakovic: Yeah, Selenium I remember very well too. As a developer in the early days you used it to automate your tests, right? You scripted it, or recorded it in the browser via a plugin, and then you played it back to see if the things still worked that you thought should work. I have to say, we were proud users of Checkly at Instana. Then I became an angel investor in Checkly because I'm a big fan, and with Dash0 we are again a customer of Checkly — we use you for our synthetic checks. Maybe you can describe a little bit what Checkly exactly does. And you are not using Selenium, you are using something called Playwright, right? Maybe you can also describe a little bit what that is and what you're doing, and then later on we can talk about your latest, I would say, evolution of the product towards tracing and how you integrate with tracing. But I think the synthetic part is what we should start with, right?

Hannes Lenke: Yeah. I mean, there's a lot to unpack here. What does Checkly do at its core? In the end, it enables you to understand: how is my API, my service, my web app doing from an end user perspective? We allow our customers — which might be DevOps engineers, might be SREs, might be developers — to run their automated test scripts as tests, and then, for production monitoring, continuously from 21 data centers worldwide. That's the core: we're testing your application from an end user perspective. That's what synthetic monitoring does. It's pinging your APIs, it's going through transactions on your website, in your web app, etc., and if something goes wrong, we surface that something is wrong from the end user's perspective. That's, on a high level, what we do. And as you said, we're using a framework called Playwright, so it's not Selenium anymore. Playwright is the latest automation framework, adopted by thousands of organizations worldwide, developed by Microsoft, actually, and it's open source. Developers already know how that automation framework works, so they can use their existing knowledge to write end-to-end tests and use these end-to-end tests for production monitoring as well.
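A browser check of this kind is just a regular Playwright test. A minimal sketch (the URL and selectors are illustrative placeholders, not taken from the episode):

```typescript
// homepage.spec.ts — a minimal Playwright browser check
import { test, expect } from '@playwright/test';

test('homepage loads and shows the main heading', async ({ page }) => {
  // Navigate the way a real user would, from the browser's perspective.
  const response = await page.goto('https://example.com'); // placeholder URL
  expect(response?.ok()).toBeTruthy();

  // Assert on something the user actually sees, not on internal telemetry.
  await expect(page.getByRole('heading', { level: 1 })).toBeVisible();
});
```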

Mirko Novakovic: I remember with Selenium you had basically two modes. One was you program it — it was a scripting language — and the other way was you used your browser, did what you wanted to test in the browser, and it recorded it, and at the end you had the script, which you could change. So it was basically two modes, right? One was very easy — I record something and get a script — and the other was I program it. Is Playwright similar?

Hannes Lenke: Playwright supports these two modes as well. Playwright has a tool called Playwright Codegen which you can use to record the first actions, and then it spits out code which you can later on use and execute. And the truth is, with Playwright Codegen you can get started quickly — you can create a first script — but then as a user you might want to make some adjustments to make it really fast, reliable, etc. So it's more a kickstart for you as a developer, and then, in the end, it's code in your IDE which you would maintain there.
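Playwright Codegen is started from the command line and records browser actions into a script you then refine by hand. A hedged sketch of what that looks like (the page and form fields are invented for illustration):

```typescript
// Start the recorder from a terminal: npx playwright codegen https://example.com
// Codegen emits roughly this kind of script, which you then refine by hand.
import { test, expect } from '@playwright/test';

test('recorded login flow', async ({ page }) => {
  await page.goto('https://example.com/login');               // recorded navigation
  await page.getByLabel('Email').fill('user@example.com');    // recorded input
  await page.getByLabel('Password').fill('not-a-real-secret');
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Usually hand-added afterwards: codegen records actions, you add the assertions.
  await expect(page.getByText('Welcome')).toBeVisible();
});
```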

Mirko Novakovic: And is it TypeScript, or what is it?

Hannes Lenke: Yeah, it supports TypeScript and JavaScript. It also supports Python and various other languages, but we support TypeScript and JavaScript. And Playwright is known for browser automation, but it can also help you automate APIs in a similar fashion. We're using that as well, executing these scripts in a similar fashion.
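The same Playwright test runner also covers API checks via its request fixture. A small sketch, with a placeholder endpoint and a made-up latency budget:

```typescript
// api.spec.ts — an API check written with Playwright's request fixture
import { test, expect } from '@playwright/test';

test('status endpoint responds quickly and correctly', async ({ request }) => {
  const started = Date.now();
  const res = await request.get('https://example.com/api/status'); // placeholder endpoint

  expect(res.status()).toBe(200);
  expect(Date.now() - started).toBeLessThan(2000); // crude latency budget in ms

  const body = await res.json();
  expect(body.status).toBe('ok'); // assert on the payload, not just the status code
});
```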

[00:08:43] Chapter 4: Use Cases and Ease of Integration

Mirko Novakovic: And then, just for me to understand — when I used Selenium, and I have to be very honest here, the last time I coded was 2009, and at that time Selenium was the cool shit — I mainly used it for testing whether my code works as expected. I used it in my CI/CD environment. At that time it was called Hudson; later I think it turned into Jenkins. So I had my Jenkins server and I executed my unit tests, but I also had my Selenium tests to basically make sure, okay, what I developed is still working as I expect. They were running inside of this CI/CD cycle. But at that time — I tried to remember — I never used it for uptime monitoring. So there is a difference between testing your code in a CI/CD environment and testing your application in production to see if it's still up and, as you said, doing it from 21 data centers. Because it could be that it works if I'm in Spain, but because there is a DNS problem in, whatever, India, I can't access the page anymore. Or the transatlantic cable is broken, so I can't access the page from the US because my data center is in Europe, right? That's kind of uptime monitoring, and that's what you do. Or do you do both? Are you a CI/CD tool, or...?

Hannes Lenke: I mean, we're not a CI/CD tool, right? We allow our customers to pretty much execute these scripts also on demand, so also for testing. You can use us for CI/CD cases, and the truth is, lots of our customers are doing that: they're writing these test scripts, using them once in their CI/CD environment and then later on, on a schedule, for uptime monitoring. If you use us in a CI/CD environment, you want to make sure that your app works from a functional perspective, more than catching issues in various places around the world. But if you schedule checks for monitoring, you might want to make sure that your application is reachable from Germany, from the US East Coast, West Coast, etc., and we capture that as well. The lines are kind of blurry from my point of view, though. For me it's automation. And what we've seen throughout the last years is that the lines between operations and developers became much more blurry, and the teams are not so siloed anymore. What our tool enables you to do, as someone sitting in a cross-functional development team, DevOps team or SRE team, is work together with the engineers who traditionally created these automation tests and then use these tests, or a subset of these tests, for production monitoring. All of a sudden you're speaking the same language that your developers are speaking today, using the same programming language, etc. And that's what we enable our customers to do.
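With the Checkly CLI, that split between on-demand CI runs and scheduled monitoring is driven by a project config plus two commands: `npx checkly test` for ad-hoc runs in CI, `npx checkly deploy` to turn the same specs into scheduled monitors. The sketch below approximates that config; exact field names may differ from the current CLI, so treat it as illustrative:

```typescript
// checkly.config.ts — sketch: the same Playwright specs run on demand in CI
// (npx checkly test) and on a schedule from multiple regions (npx checkly deploy).
import { defineConfig } from 'checkly';
import { Frequency } from 'checkly/constructs';

export default defineConfig({
  projectName: 'Web Shop Monitoring',        // illustrative project name
  logicalId: 'web-shop-monitoring',
  checks: {
    frequency: Frequency.EVERY_10M,          // schedule once deployed as monitors
    locations: ['eu-central-1', 'us-east-1'], // run from Germany and US East
    browserChecks: {
      testMatch: '**/__checks__/*.spec.ts',  // the Playwright tests to pick up
    },
  },
});
```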

[00:12:04] Chapter 5: Customer Adoption and Product-Led Growth

Mirko Novakovic: And how do you help me — sorry, but this is my old developer side coming up — because I'm always thinking, okay, if I do tests in my CI/CD staging environment, I can see myself doing use cases, changing data, etc., and having my test users. But now I'm doing this in production. So are you helping me somehow with the data — which users and passwords I can use to log in — and then, if I make changes, I kind of have to roll them back? How do I manage not to mess up the data in production? Or is that a problem I have to solve, because it's something you can't help me with?

Hannes Lenke: Yeah, it's an interesting problem, right? We help you with parts of it, because we have setup and teardown scripts, etc., so you can make changes to your database before such a script runs and maybe tear them down again later on. So we help you with that. But in the end, you need to come up with a concept for how to monitor your production system. You might want to have a test user on production which is able to perform, I don't know, a checkout in your application, just to enable you to understand: is my checkout actually working for my users, if you're an e-commerce shop? That part we can't do for you.
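One way to express that setup/teardown idea directly in a Playwright script is with beforeAll/afterAll hooks around a dedicated test user. Everything here — endpoints, fixture routes, the user name — is a hypothetical illustration:

```typescript
// checkout.spec.ts — setup/teardown around a dedicated synthetic test user
import { test, expect, request, APIRequestContext } from '@playwright/test';

let api: APIRequestContext;

test.beforeAll(async () => {
  // Seed the data the check needs before it runs (hypothetical fixture endpoint).
  api = await request.newContext({ baseURL: 'https://example.com' });
  await api.post('/api/test-fixtures/cart', { data: { user: 'synthetic-user-1' } });
});

test.afterAll(async () => {
  // Tear down whatever the check created so production data stays clean.
  await api.delete('/api/test-fixtures/cart?user=synthetic-user-1');
  await api.dispose();
});

test('synthetic user can reach checkout', async ({ page }) => {
  await page.goto('https://example.com/cart');
  await expect(page.getByRole('button', { name: 'Checkout' })).toBeEnabled();
});
```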

Mirko Novakovic: And I mean, this is kind of synthetic, right? You normally also mark those transactions as synthetic: you give a certain tag or HTTP header to the transaction so that somebody can detect that this is actually a synthetic transaction. Just out of curiosity: if I'm using something like PayPal or some other service in the background and I want to test whether it is working, do these APIs today support synthetic checks? Do they allow me to basically say, hey, I'm calling you now, but I don't actually want to pay, I'm just testing if this payment service is working — Stripe or whatever. Is that possible?

Hannes Lenke: Some APIs do, others don't. If your system does support that, then you might want to do a real transaction. But what we see customers doing is going up to the end of the flow and not doing the actual checkout — so not doing the actual transaction — or having a little feature flag somewhere to enable a synthetic transaction to pass. So yes, there's some complexity involved in acting like a real user.
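Marking synthetic traffic, as discussed above, can be as simple as attaching an extra HTTP header that the backend (or a payment provider's test mode) recognizes. The header name and the stop-before-payment flow below are assumptions, not a standard:

```typescript
// payment-flow.spec.ts — tag synthetic traffic and stop before the real payment
import { test, expect } from '@playwright/test';

test.use({
  extraHTTPHeaders: {
    'x-synthetic-check': 'true', // hypothetical marker header the backend can detect
  },
});

test('checkout flow up to the payment step', async ({ page }) => {
  await page.goto('https://example.com/cart');
  await page.getByRole('button', { name: 'Checkout' }).click();

  // Walk the flow to the very end, but do not trigger the actual transaction.
  await expect(page.getByRole('heading', { name: 'Payment' })).toBeVisible();
});
```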

Mirko Novakovic: As far as I know — I haven't used it myself, but I know from my developers — it's super easy to set up. That's why they love it, right? It also works in their existing environment, and they love it because Playwright and all these things are familiar to them. I can just pull up my Slack: yesterday there was a deployment at Dash0 and something didn't work, and they had to roll back that deployment. And the next comment was, oh, I made some Checkly tests, so next time this happens we get informed. And to their credit, it seemed to be a very natural thing that the developer said, hey, this was not good, and to make sure it doesn't happen again, I'll create these two or three — I think it was three — synthetic tests that make sure that, I remember, some ClickHouse thing is up and running. And they did exactly that. So it's a pretty cool service. Yeah.

Hannes Lenke: Yeah, thank you, thank you. I mean, it's interesting. Naturally you would say, hey, something like Checkly, synthetic monitoring, might just be testing your UI. But in the end, what it does is run end-to-end tests on production. So even if you mess up your ClickHouse, which is maybe several layers below, you're able to catch it, and what you're catching is a real problem for an end user. Compared to various other tests that you might set up, or looking at your ClickHouse telemetry — there might be alerts in there which don't really have an impact on the end user — while when our check fires, you have a real problem. Then you need to wake up your engineers in the middle of the night, right?

Mirko Novakovic: And as far as I know, you have more than a thousand customers, so we're not the only ones — congrats on that. How did you get to a thousand customers? Is it a PLG motion, so did customers just sign up themselves? I mean, you probably can't do enterprise sales with a thousand customers at this point in time, but it sounds like it's a tool people find and then just use.

[00:17:12] Chapter 6: Differentiation from Traditional Observability Tools

Hannes Lenke: Yeah. I mean, we have pretty much a PLG motion, a product-led growth motion, and kind of layered on top we have sales supporting you if you're a larger customer. So you might be a startup, a very innovative team like Dash0 — then you might just create an account directly, put in a credit card and let it run. If you're an organization like LinkedIn, you might want to run a larger volume of synthetic checks and enable hundreds of engineers, and then we help you onboard your teams. We have almost 1,100 paying customers now. Lots of them are coming from the PLG motion, and others find Checkly, are happy with Checkly, maybe with a side project, and then approach us and our sales team, and we help them onboard.

Mirko Novakovic: And just out of curiosity, do they normally start with Playwright and then search for something that supports it, or do they start with Checkly and then get to Playwright? Can you see a pattern there? Is it Playwright driving the customers to you, or the other way around?

Hannes Lenke: We see both, right? We do a lot of education around Playwright because we think it's a very cool framework. So even if you aren't using Checkly today, we're going to help you get started with Playwright — there's a lot of content, and a YouTube channel where we have thousands of subscribers, etc. We want to enable your engineers to create Playwright scripts first, and then they might start thinking about, okay, can I catch that bug that we rolled out yesterday? Can I catch that with a Playwright script running on Checkly? But we also see customers coming to us saying, hey, we have downtimes, or we have Datadog in place but we didn't catch that — I don't have visibility into that problem from my end users' perspective. And the truth is, my end users and customers are hammering my support, and my support knew about the problem before I knew. So I want to have some visibility from an end user perspective. That's also how customers find us. It's pretty much these two ways: either you have the initial adoption of Playwright and think about, okay, what else can I do with it, or you really come at it from a pain perspective — we have downtimes, we want to be faster at detecting problems, we want to resolve problems faster, and to resolve these problems we need to detect them first. That's how customers come to us.

Mirko Novakovic: I can see that. Because what you just said — you mentioned Datadog as an example, right? Most of the observability players also have synthetics. But what you're saying is basically that they are not specialized in synthetics and can't do some of the things you can. Maybe you can explain what Checkly does that others can't, or that is harder for them — whether they use Playwright, for example, I don't really know what Datadog is doing. But what's the difference? If I'm a Datadog customer, why would I switch? We all know it's super expensive, that's probably one of the things. But besides the price — because I kind of think price is never a really good argument, at least on its own — you probably can also do something that others can't. So what is your pitch to say, hey, use Checkly instead of Datadog Synthetics?

[00:21:04] Chapter 7: Monitoring as Code and Developer Empowerment

Hannes Lenke: Yeah, I mean, you mentioned one reason: price. But I agree with you, it's never the only argument. You might think about, hey, how do I enable my engineers to reuse their tests? How do I enable them to have some insight into operations? You could call that developer-owned operations. We have customers who came to us saying, hey, we need synthetic monitoring which fits into our modern DevOps approach — coming from Datadog or from Pingdom or other such services. What we do differently is we enable you to take your existing tests, keep them in the code repository, and then use a subset of these tests to run on Checkly. We have a Terraform provider that enables you, pretty much with two or three commands, to take a subset of these tests and run them against a production system. So you can stay in your IDE. We're not asking you to do ClickOps — to go into a UI somewhere and click together automation that you then don't understand when it actually fires. The thinking behind it is: you as an engineer get alerted in the middle of the night.

Hannes Lenke: 3 a.m. How do we make sure that you really understand why that alert fires? The first thing is, if you or your team were responsible for creating the actual monitors, then you're much more likely to understand, okay, this fires for a reason. And as I said, we're doing monitoring as code — in the end it's just using your existing test scripts. Compared to legacy providers that force you into a UI, into a tool you might not even have access to as a development team — traditional APM tools where only operations people have access, etc. So now you can use these test scripts, but it goes even further. We call it monitoring as code because we enable you to configure pretty much everything as code: be it status pages on Checkly, dashboards on Checkly, alert channels, et cetera. The whole stack you can have in the code repository, with code versioning behind it, and what that does is make monitoring and observability kind of a team sport, because you use your existing processes to configure your monitoring tool. And you can also roll back.
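As a rough illustration of the monitoring-as-code idea, here is how a check and its alert channel might live in the repository next to the Playwright test they reuse. The construct and field names approximate Checkly's constructs API and may differ from the current version:

```typescript
// __checks__/checkout.check.ts — monitoring as code: the check, its schedule and
// its alerting live in the repo and are reviewed and rolled back with the app code.
import { BrowserCheck, SlackAlertChannel, Frequency } from 'checkly/constructs';

// Alert routing as code instead of ClickOps (placeholder webhook URL).
const onCallSlack = new SlackAlertChannel('on-call-slack', {
  url: new URL('https://hooks.slack.com/services/T000/B000/XXXX'),
  channel: '#on-call',
});

new BrowserCheck('checkout-flow-check', {
  name: 'Checkout flow',
  frequency: Frequency.EVERY_10M,
  locations: ['eu-central-1', 'us-east-1'],
  code: { entrypoint: './checkout.spec.ts' }, // reuse the existing Playwright test
  alertChannels: [onCallSlack],
});
```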

Mirko Novakovic: I think where we are super aligned with Dash0 is exactly that. We think you should use the tools that developers are already using; you shouldn't reinvent the wheel. So there's something like Playwright, or we, for example, use Perses for dashboards and PromQL as the query language and for defining alerts, like Prometheus alerts, that are already out there. And we also agree that it should be possible to version the dashboards, the alerts, or in your case the test scripts with your code inside your tool set, like in GitHub, and then deploy them with your code. Because not only is it aligned with your processes — what I really like about it is: when you roll back, like we did that night when something didn't work, you actually make sure that the dashboards and alerts that belonged to that version of the code are also deployed again. If you do this with ClickOps — I've created a new dashboard in my UI, and then the code is rolled back — now it doesn't work anymore, because maybe with the old version of the code a metric is not available or the alert doesn't work anymore. So in this case you have everything versioned and living in one bucket, which is what developers actually care about: the code.

Hannes Lenke: The repository saves these monitors, right? I mean, if you're testing an API — developing a new API endpoint — you might create a monitor, you should create a monitor for it, just to make sure that API endpoint is up and running and delivering the data that it should. A monitor is a check, a check is a test, right? And then that check might need to change again. If you roll back, the API endpoint is not there anymore; if you don't roll the test or the check back as well, then you're going to alert engineers, and you don't want that to happen. So it needs to be integrated in your software development lifecycle. And I think we're all agreeing there, right?
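Following that thought, the monitor for a new API endpoint can sit right next to the endpoint's code, so a rollback reverts both together. A hedged sketch, again approximating Checkly's constructs API, with a placeholder endpoint:

```typescript
// __checks__/orders-api.check.ts — the monitor for a new endpoint lives next to the
// endpoint's code, so rolling back the endpoint rolls the monitor back with it.
import { ApiCheck, AssertionBuilder, Frequency } from 'checkly/constructs';

new ApiCheck('orders-api-check', {
  name: 'GET /api/orders is up and returns data',
  frequency: Frequency.EVERY_5M,
  request: {
    method: 'GET',
    url: 'https://example.com/api/orders', // placeholder endpoint
    assertions: [
      AssertionBuilder.statusCode().equals(200), // up and running
    ],
  },
});
```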

[00:26:10] Chapter 8: Shift Left and Future of AI in Testing

Mirko Novakovic: I think in almost every podcast we were talking about shift left, which is kind of scary too, because it means everyone in this industry is talking about shift left, which means we put more and more stuff onto developers — hey, here's your monitoring, security, testing thing, right? That sounds good and bad at the same time, but I totally get why it's needed, because the developer actually controls the code and is also in control of the testing and the configuration. And I think with AI I've seen a few things where you can now support developers in creating those things more easily. I'm not sure what your AI story is, if you have one, and whether you see things happening where you use LLMs to create test cases or Playwright scripts. But I can definitely see that in the future: you create code and, for example, say, hey, adapt the Playwright tests to this new code, and it will do 80% of the stuff that you normally would do, right? That's probably something that will happen pretty soon.

Hannes Lenke: Totally, totally. I mean, the benefit of monitoring as code is that it's in your code repository, it's Playwright, so your copilot knows how to create Playwright scripts. And your copilot might in the future also know how to create Checkly API checks. We are more thinking about integrating into existing code generators for developers than coming up with our own code generator. What shift left is doing, for me, is pretty much just automation: thinking about, okay, how can I automate a process which was a manual process before — using that word again, ClickOps. And how can I enable — we're saying developers, but I think what most of the industry means is engineers, so also SREs, operations, etc., who know how to code — how can I enable them to automate certain aspects. I think that's an important movement. Traditionally that wasn't really automated, and it should be automated.

Mirko Novakovic: Yeah, I agree. By the way, one of the reasons why we have also chosen open standards, like PromQL for querying, is that we think the future relies on LLMs and AI — and I think that's the same for Playwright. Something that's well documented all over the net, where you have hundreds of things documented on Stack Overflow, etc., and where there is public documentation, means that these LLMs can also learn about it and will be much better at it. If you have a proprietary test script language, it will be much harder for a model to really get that knowledge, because it's not in the public documentation and material that's available. So that's a big advantage for publicly documented things — there's just so much available.

Hannes Lenke: Interesting, right? In a way, AI is like a developer. Developers push back if there's technology used that they can't find on the internet, and now we're having the same conversation about why we're using certain technology to enable AI. So I think well-documented open standards are mandatory, and I would push that even further: we should use open standards, also standards like OpenTelemetry, etc. Proprietary technology everywhere is a thing of the past, right?

[00:29:59] Chapter 9: Open Standards and AI Integration

Mirko Novakovic: Yeah. Talking about OpenTelemetry and tracing, I recently saw an announcement that you started moving into that space. Tell me about it — what is the thing you're doing there?

Hannes Lenke: So the high-level thinking here is, Mirko: I wake you up in the middle of the night at 3 a.m. and tell you something is wrong. If that problem is coming from your UI, or something on the surface of your API, you might understand that just from the alert. But there might be problems with your ClickHouse that you can't understand from the alert alone. So what I want to enable you as a user to do is wake up in the middle of the night at 3 a.m. and not only understand that there's a problem, but also where the problem is coming from, to resolve it faster. The thinking is: okay, what kind of telemetry do we need to correlate with synthetics to enable you to find these issues full stack, pretty much. So what we did is partner with companies like Coralogix, etc., to correlate our synthetics with their tracing. And we also enable customers that don't have an APM suite — we talked about having a thousand customers; some of them, you would be surprised, don't have Datadog deployed or any other APM solution — to send us traces as well, so that we can correlate them too. The thinking is really to enable you to understand: okay, Checkly surfaced the problem — where is the problem actually coming from? Backend traces enable you to see deep into your stack and to understand: this is the layer in my stack I need to look at — the database, say, there's a slow database query or anything, right?
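Conceptually, the correlation works because the synthetic request and the backend spans share a trace ID, for example via a W3C traceparent header; the integration described above handles that wiring for you. The manual sketch below, with a placeholder endpoint, is only meant to illustrate the mechanism:

```typescript
// health.spec.ts — illustrate the correlation mechanism: synthetic request and
// backend spans share a trace ID via a W3C Trace Context traceparent header.
import { test, expect } from '@playwright/test';
import { randomBytes } from 'node:crypto';

test('checkout API is healthy and traceable', async ({ request }) => {
  const traceId = randomBytes(16).toString('hex'); // 32 hex chars, per the spec
  const spanId = randomBytes(8).toString('hex');   // 16 hex chars

  const res = await request.get('https://example.com/api/checkout/health', {
    headers: { traceparent: `00-${traceId}-${spanId}-01` }, // version-traceId-spanId-flags
  });

  expect(res.ok()).toBeTruthy();
  // The monitoring side can now look up backend spans that carry this trace ID.
  console.log(`backend traces for this check share trace ID ${traceId}`);
});
```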

Mirko Novakovic: No, that makes sense, totally. I mean, we call it context, right? If you have a problem, you want context, because with a problem without context you have to create that context yourself, which can be very complicated. If the tool is able to create all the context for you — and in the best case, which I don't think anyone does really well so far — you have a problem and you get all the relevant data: here are the metrics of that part, here is the log that's important for you, here's the trace. If you get all that data for your failed synthetic check, and maybe you even get an explanation of why it happened, that would be awesome, right? Yeah.

Hannes Lenke: Maybe you even get to understand that a certain commit caused the problem, etc., right? That's not possible today, but that's what we're thinking about: how can we help you understand, hey, a certain action led to the situation you're in now?

[00:33:15] Chapter 10: Introduction to Tracing and Correlation with Synthetics

Mirko Novakovic: It's hard, right? I mean, there are a lot of tools that claim they can do it — and I built such a tool in the past — but it is very hard. Because the other problem you always have is false positives. What you also don't want to do is wake somebody up at 3 a.m. and present them with data that makes absolutely no sense. Then they will say, oh my God, Checkly woke me up, gave me these traces, told me it's ClickHouse, but it actually was not. And this is my learning from Instana times: even if you are 98% correct, there's still the 2%, right? It's a little bit like autonomous driving — if you are 99% correct, but the 1% means you're driving into a wall, nobody will accept that. It's very similar in this observability environment: people will not accept you telling them false data. That's the big problem — you have to be accurate if you wake somebody up at 3 a.m.

Hannes Lenke: That's a huge problem, right? And it's a problem that was also there in testing: you don't want to prevent your app from rolling out just because there were false positives when executing your tests. So that's a problem we work on every day. We monitor our own systems quite heavily to make sure that our test execution runs well, because we're executing hundreds of millions of these test scripts every day, and of course we don't want you to wake up in the middle of the night because there was a problem with a test execution, right?
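At the Playwright level, one common way to keep flaky runs from turning into false alarms is to retry before reporting a failure; Checkly has its own retry settings on top of this. An illustrative config sketch with made-up values:

```typescript
// playwright.config.ts — retry failing runs before treating them as real failures
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 2,                    // re-run a failed test up to twice before reporting
  timeout: 30_000,               // fail fast instead of hanging on a stuck page (ms)
  expect: { timeout: 10_000 },   // per-assertion budget (ms)
  use: {
    trace: 'on-first-retry',     // capture a Playwright trace when a retry happens
  },
});
```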

Mirko Novakovic: Yeah, I agree.

Hannes Lenke: So it's a really hard problem. Even harder if you then try to make sense of traces and logs, correlate synthetics with logs, and not only take the clear signal that a test failed — maybe the test passed, but you see certain errors coming from your database. Are you alerting someone because of that?

Mirko Novakovic: I like what you're doing in tracing. I would have a lot more questions, but I'm super happy that we have a really good observability vendor out of Germany, with somebody who has more than a thousand customers internationally and recently raised a $20 million round on top of the round you had before. So congrats on that. Not only as an angel investor am I cheering for you to be successful; I'm also watching closely because I think you have some very interesting, innovative approaches — very open, very focused on automation and on developers. I love that. So Hannes, thank you for joining my podcast.

[00:36:03] Chapter 11: Conclusion and Final Thoughts

Hannes Lenke: Thank you Mirko. It was great to be here. And yeah, let's catch up soon. Right.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
