Episode 15 · 40 mins · 12/19/2024

The Open Source Frontier: Dynatrace's Evolving Vision

Host: Mirko Novakovic
Guest: Alois Reitbauer

#15 - The Open Source Frontier: Dynatrace's Evolving Vision with Alois Reitbauer

About this Episode

Alois Reitbauer, Chief Technology Strategist at Dynatrace, joins Dash0’s Mirko Novakovic to discuss the role of open source standards in advancing the observability industry, even for vendors with proprietary solutions. They explore Dynatrace's AI capabilities, innovations like Keptn and OpenFeature, and the company's role in initiatives like OpenTelemetry.

Transcription

[00:00:00] Chapter 1: Introduction to Code RED Podcast

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0, and welcome to Code RED: "code" because we are talking about code, and "RED" stands for Requests, Errors, and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Hello everyone! Today my guest is Alois Reitbauer. Alois is the Chief Technology Strategist at Dynatrace, where he leads cloud native research and open source initiatives like OpenTelemetry. He also helps develop industry-wide standards in the W3C, including web performance and distributed tracing. Alois, welcome to Code RED.

Alois Reitbauer: Thank you for having me. Long time no see. It's been a year.

[00:00:54] Chapter 2: Alois Reitbauer’s Code RED Moment

Mirko Novakovic: Yeah, actually, I will come to that point later; I checked out a few things from the past. But my first question is always about your biggest Code RED moment. Can you share a story with us that is still in your mind?

Alois Reitbauer: I mean, obviously, in the early days of Dynatrace we had a lot of customers in outage situations. At the beginning of APM, as we called it back then, we sold the product basically by solving people's problems. With PurePath we had cases like people not getting their term sheets at the end of the school year, and many like those, or people not getting their parcels delivered. But the one that really stuck in my mind was me sitting in one of these very traditional operations rooms at the very beginning of my career. The day started really nice. We went to the office, and then somebody called us. It was an application for, more or less, retrieving your goods, and they told us: we can find the goods, but we can't find the person who brought them there. And you think about your database schema: that can't happen. That's a master-detail relationship; there are foreign key constraints on it. So you tell the people in support, and back then, in a startup, support was me: no, this can't happen. The only way this could ever happen would be if the referential integrity of the database had been violated.

Alois Reitbauer: But you can't see why this actually happened. Obviously, what do you do? You go to the database, you use the database tools, you check it. Okay, the referential integrity of the database was violated. What's going on here? So that ended up being a very long day, because we obviously first needed to restore the database to a valid state. We used binary search: restore it to half, okay, that's good, then take the next half, to get back to the last known good state. It was an online application; people were constantly entering the goods they were looking for and wanted to get them back, so we wanted to do as little reconstruction as possible. We eventually did this. That was the first part of the day. But then we needed to figure out what was going on, because suddenly it happened again. And you're like, what's going on here? We just fixed this problem. We ended up calling the database vendor: do you know of any issues with your current database? Because that's weird, the same thing happening twice a day. And then, after a longer investigation, we figured out what actually really happened.

Alois Reitbauer: That was me in the data center, waiting for it to happen again. I remember sitting at that screen, it was a Windows machine back then, and suddenly the clock slipped backward. We realized that somebody did not believe in the usual time sync mechanisms that make your clock go faster or slower. What he did instead was just randomly change the time on the machine, which a database transaction log obviously really loves. And that's what ended up in our database. It was really my first experience of this: in that data center at night, you have no idea what's going on, everybody's home, it was well after midnight, and suddenly you're thinking, if I had just known a couple of hours ago what was going on. Understanding that where you're looking is not necessarily where the problem is. We were looking at the database vendor, but it came from a totally unexpected, totally different area. That was really the Code RED that I will always remember. It was back then, in my first year after I graduated from university, the first application I was really running in production and responsible for.

Mirko Novakovic: That's a fun experience; everyone has one. And I always ask myself, okay, how would you monitor this? It's not that easy, right?

Alois Reitbauer: Actually, it's a challenging one, but still an interesting one to figure out. You really have to look everywhere to understand what's going on.

[00:04:27] Chapter 3: Observability and Performance Anti-patterns

Mirko Novakovic: So when we started, we were talking about, okay, long time no see. And I actually checked: 15 years ago, approximately, around 2009, I was with my company codecentric, and we were a reseller partner of Dynatrace. And you and me did a lot of content together. We had an article series in a local Java magazine here in Germany, and I checked out the names of the articles: enterprise performance anti-patterns, database anti-patterns. It was about Java memory and garbage collection; it was about continuous performance monitoring. So actually, a lot of those things didn't age like milk, let me say it that way. It's not that bad.

Alois Reitbauer: It's kind of interesting, yeah. You might think that people have overcome these issues by now, like the typical database anti-patterns and the N+1 query problem. A lot of those, like wrong-sizing infrastructure, or underestimating remote communication, workloads or data loads, are still there in the same way. The way we run applications didn't change that much, even though we're talking about totally different environments. Back then it was J2EE, not even Java EE, not even the fancy Java enterprise stuff, and today we talk about microservices and cloud and serverless. But still, the reasons why applications fail at the application level are, to a great extent, the same as they were back then.
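[Editor's note: the N+1 query anti-pattern Alois mentions, as a minimal JDBC sketch. The orders/customers schema and the in-memory H2 connection URL are made up for illustration; the point is the shape of the query pattern.]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class NPlusOneSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical in-memory database with pre-created orders/customers tables.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:shop")) {
            // Anti-pattern: 1 query for the list, then 1 additional query per row.
            try (ResultSet orders = con.createStatement()
                    .executeQuery("SELECT id, customer_id FROM orders")) {
                while (orders.next()) {
                    try (PreparedStatement ps = con.prepareStatement(
                            "SELECT name FROM customers WHERE id = ?")) {
                        ps.setLong(1, orders.getLong("customer_id"));
                        ps.executeQuery().close(); // N extra database round trips
                    }
                }
            }
            // Fix: a single JOIN returns the same data in one round trip.
            con.createStatement().executeQuery(
                "SELECT o.id, c.name FROM orders o " +
                "JOIN customers c ON c.id = o.customer_id").close();
        }
    }
}
```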

Mirko Novakovic: I would even argue that they became even more relevant, right? The anti-patterns for calling too many services, for example, I think you can see those today even more. Back then in 2009, we didn't have these microservice, container-based applications; at that time it was service-oriented architectures. Which, by the way, is one of the areas where Dynatrace really started. I think it was the first tool that could do distributed tracing, right? I remember seeing PurePath for the first time and finding it super cool. It was an Eclipse-based UI, right?

[00:06:30] Chapter 4: Evolution of Dynatrace and PurePath

Alois Reitbauer: I think the very first one wasn't even based on the Eclipse UI framework; there was one even before that. But when you saw it the first time, it was Eclipse-based already.

Mirko Novakovic: Yeah. And the core capability, as far as I remember, was really tracing. It was PurePath: you had the list of PurePaths, you instrumented the code. At that time it was mainly Java applications, and you could see an end-to-end trace of your applications, even if it was distributed across multiple JVMs. And as far as I remember, Bernd Greifeneder was the founder, and you and others came from Segue, the company that built Silk Performer.

Alois Reitbauer: Yeah, I came in somewhere in between, but yes, that's the story. So one thing was distributed tracing back then, which for a lot of people was, oh, this is what my application really looks like. The second one was actually doing it in production. It wasn't that tracing didn't exist, but doing it in production was one thing, and also being able to remotely configure it. Because back then, if you looked at other products like Wily, you had these files that you had to locally deploy and manage. We had this remote management, and with bytecode instrumentation we were even able to instrument at runtime. So if you figured out something wasn't working as expected and you were missing one part of the code, you just threw an additional sensor, as we called it back then, into the code and instrumented it at runtime.

Mirko Novakovic: I remember that you could right-click on a method or a function call, and then you could see which functions got called and add instrumentation for them, right there in the UI. A pretty cool feature that would be useful today too, because a lot of distributed tracing tools see the calls between services, but you don't see much anymore inside the services. I had this discussion with Ben Sigelman in my last podcast, where he said he is looking for something like a trace level, similar to a log level, where you could say: I have a trace, and if I turn on debug mode, I get many more spans than in normal production mode. Because I don't want to have all that data all the time, but once I run into a problem, I want to turn that debug tracing on and get a much deeper understanding. And that's basically what you could do with Dynatrace back then, right? You could add instrumentation on the fly to get more visibility.

Alois Reitbauer: Yeah. And we even had these different levels that we shipped. We were proposing: this is your debug-level instrumentation, only turn on the debug-level instrumentation when you need it, and then obviously filter which transactions you want covered, because you can't slow down an application entirely, for each and every user, when something is going wrong; then it would basically stop working. That would happen even today in most cases, because there's just so much instrumentation overhead that you would eventually run into. I think that's also where tools like live debugging come in today. It's kind of the same idea, just doing a bit more to some extent, like capturing entire stack frames and so forth. But the idea is the same: okay, I need to understand exactly why this code is executed here. First you isolate the problem, the root cause, and then you want to understand why it's actually happening. After you understand what is happening, the actual why is becoming more and more important, the more we use feature flags and so forth. Even if you know the feature flag was triggered right now, it shouldn't have been triggered right now.
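[Editor's note: the "trace level" idea discussed here can be sketched with today's OpenTelemetry Java API: fine-grained spans are only created while a debug switch is on. The flag and span names are made up; a real implementation would tie this to configuration or sampling.]

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TraceLevelSketch {
    // Hypothetical switch; in practice a config value, feature flag, or per-request attribute.
    static volatile boolean debugTracing = false;

    static final Tracer tracer = GlobalOpenTelemetry.getTracer("trace-level-sketch");

    static void handleRequest() {
        // Always-on, coarse production span.
        Span request = tracer.spanBuilder("handleRequest").startSpan();
        try (Scope ignored = request.makeCurrent()) {
            parseInput();
        } finally {
            request.end();
        }
    }

    static void parseInput() {
        // Extra fine-grained span only while debug tracing is switched on,
        // so the overhead is paid only while investigating a problem.
        Span span = debugTracing
                ? tracer.spanBuilder("parseInput").startSpan()
                : null;
        try {
            // ... actual parsing work ...
        } finally {
            if (span != null) span.end();
        }
    }
}
```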

Mirko Novakovic: Actually, yes. So what happened to PurePath? Does it still exist?

Alois Reitbauer: Yes, PurePath does still exist. I mean, PurePath itself went through various iterations. Obviously we started with full instrumentation, then we added profiling data to it, we added memory profiling, we extended it to mobile and to RUM capabilities, and we linked it to the entire stack. So it's a core building block and a core concept of distributed tracing. Obviously, today you can use OpenTelemetry traces as well; native OpenTelemetry support in Dynatrace exists too. We spent a lot of time working inside OpenTelemetry on the semantic conventions and other topics. You can even link your PurePath and OpenTelemetry data seamlessly together, depending on the environment you run in. So it still exists. Traces are traces; there are just two fundamental ways of looking at them, more span-based or more end-to-end. But the concepts are still around, the capabilities are still around. It just now blends more seamlessly with OpenTelemetry, which obviously did not exist back then.

[00:11:09] Chapter 5: Understanding Tracing and OpenTelemetry

Mirko Novakovic: We always struggle when we have these discussions; we never really know how to structure the concept of the trace. For me, it was kind of an end-to-end thing, right? But today, if you look at the spec, the way I see it, or we see it at the moment, is that you have spans, which are kind of the core component of OpenTelemetry. And then you link spans: one span calls the other, and you get kind of a span tree, and the trace is actually the whole tree. Or you could say the trace starts at the root span. Everyone does it a little bit differently, right? It is not really easy. Some call it a request, but then it sounds more like HTTP. So the wording is something I'm still not really satisfied with. I liked PurePath because you had your own name and you could say, it's a PurePath. In the OpenTelemetry world, I wouldn't say I'm struggling, I know what a trace is, but it's somehow weird to explain to users what it is, because you're actually looking at a span, and essentially the root span is the trace with all the components below it. So it's not easy to describe that.

Alois Reitbauer: I mean, you could always say a trace is whatever you want it to be. Yeah. And we actually started to do this in one of the iterations of PurePath, because we started to realize, literally, once you add end user monitoring and you start to link more and more enterprise applications together, it's really true: a trace is what you want it to be. And we even had different people deciding where a trace starts for them. If I'm only responsible for a payment system, my trace might start at the entry to the payment system, which again is going to be multiple microservices. If I'm responsible for an entire application, from the mobile front end back to everything else it's calling, my trace starts in the mobile app when the user clicks on something. So it really depends on your scope of influence, your sphere of influence, what that trace really is. But I think it's also worth deciding what makes a trace different from a profile, because a lot of people understand what profiling is and wonder how tracing is actually different. There's the distributed nature: it links together multiple tiers, and I think that's key. And it separates individual requests, which I think is key as well.

Alois Reitbauer: And you can assign values to individual requests, so you really understand where they start and end. That becomes more and more fluid based on the applications we are building. To some extent you will even have security boundaries: calling into a payment system, as an application developer, you might just see the call that's going in, but not any of the internal workings of some of those systems. So I think it's more these characteristics that differentiate the trace from other means, and profiles are the most prominent of those, rather than the end-to-end idea, because end-to-end is, as I mentioned, really fluid in many cases. There are even layers. I remember, at the beginning, when we discussed tracing with cloud providers: for them, end-to-end is suddenly very different, because there's a lot of internal cloud provider infrastructure running that they would never expose to you as an end user of their cloud. So even there it was different, and they were actually tracing twice, because you can't really relate those to each other as the same trace.
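[Editor's note: the span tree Mirko and Alois are circling around, as a minimal OpenTelemetry Java sketch. The tracer and span names are made up; with no SDK configured the API calls are no-ops, but the parent/child structure is the point.]

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class SpanTreeSketch {
    public static void main(String[] args) {
        Tracer tracer = GlobalOpenTelemetry.getTracer("span-tree-sketch");

        // The root span: it has no parent, and "the trace" is the whole
        // tree of spans sharing its trace ID.
        Span root = tracer.spanBuilder("GET /checkout").startSpan();
        try (Scope ignored = root.makeCurrent()) {
            // A child span: parented automatically via the current context,
            // one node further down the tree.
            Span db = tracer.spanBuilder("SELECT orders").startSpan();
            db.end();
        } finally {
            root.end();
        }
    }
}
```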

[00:14:34] Chapter 6: Dynatrace’s Open Source Strategy

Mirko Novakovic: I totally agree with you that the trace is somehow a fluid thing in these kinds of architectures, but still, it doesn't become easier for the user to understand the concept, right? I always had the feeling that a lot of the users in our space are power users who really understand the space and work on these topics every day. But we also see, at least at Dash0, that a lot of users log in who are not that familiar with the concepts, and you somehow need a very easy abstraction over those concepts to help them find their way into the tool easily, without having to understand every core concept. What is a root span, for example? For someone who is familiar with OpenTelemetry it's kind of obvious, but if you are new to the space, it's not really obvious what a root span is, what a service call is, or what the two spans actually are, right, the server span and the client span. These things are not that trivial. So we always try to figure out whether there are forms of abstraction we can use to make it easier for the user to understand.

Alois Reitbauer: I mean, we hide most of the technical underpinnings, actually. That's very interesting when you build an observability product, because you have users at very different levels. That's also why we built our latest data storage, Grail, without indexes and with schema-on-read storage. And when I was working on some topics around AI observability, it was almost like you need a different view on the same type of data. If you have somebody who works on OpenTelemetry instrumentation, they want to see all of these details at the span level. But if you move up to a developer or an architect, they look at their services, their service methods; they might look at a web request, they might look at specific user interactions, in many cases for customer support. It's really like, okay, somebody who wants to buy something. I remember somebody even built a dedicated UI for support people on top of Dynatrace: you enter the name of the person, and it shows more or less just the business steps that this person was doing. So suddenly the trace, in their concept, is actually multiple traces linked together at the user-session level. So I think people should usually see what they feel most familiar with. And again, as an application developer, I know the services that I'm writing, I know what my database is, I know the cloud services that I'm using.

Alois Reitbauer: The further you move away from the actual implementation work, the more abstract the concepts get. And that's where I see the power in modern observability tools: they can transform this data at query time into whatever the user wants to look at specifically, and omit a lot of the other details that you don't need. We had another example, staying in the technical domain. When we were working on CI/CD observability back then, we had Keptn, an open source project that was more or less tracing through Kubernetes deployments, progressive delivery and deployment in Kubernetes. And people were shown: okay, this is the span, this is the name of this span. And they said: I just ran a deployment on Kubernetes, I don't care about spans; that's not the nomenclature I even want at all, we use this for continuous delivery. That's also why semantic conventions in OpenTelemetry are so important: semantic conventions actually tell you what people want to see. There's still a root span, but you trigger the CI run and then you have the little steps or actions, depending on the tool you're using in the CI run. And this is something that recently came into OpenTelemetry with CI/CD observability. So it's less about the technological underpinnings once you show data to people; it's more about the view on top of the data that you actually see.
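[Editor's note: semantic conventions are, mechanically, agreed-upon attribute keys on spans. A minimal sketch using two keys from the stable HTTP conventions; the CI/CD conventions Alois mentions follow the same mechanism but are newer, so treat this only as an illustration of the idea.]

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class SemconvSketch {
    public static void main(String[] args) {
        Tracer tracer = GlobalOpenTelemetry.getTracer("semconv-sketch");

        Span span = tracer.spanBuilder("GET /orders").startSpan();
        // Well-known keys from the HTTP semantic conventions let any backend
        // recognize this span as an HTTP request without guessing.
        span.setAttribute("http.request.method", "GET");
        span.setAttribute("url.path", "/orders");
        span.end();
    }
}
```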

Mirko Novakovic: Yeah, and you mentioned Keptn. I'm not sure everyone listening to the podcast is familiar with it. I know you started that project at Dynatrace; it's an open source project, and it's part of the CNCF today, I think. Can you give us a quick overview of what it is, what it does, and why we should use it?

Alois Reitbauer: Yeah. So the idea behind Keptn was really: let's build observability for CI processes and also for deployment processes in Kubernetes. In the beginning we started without Kubernetes; it was more or less about getting OpenTelemetry into your deployments. And the idea was also that you could hook in certain steps or introduce quality gates: after a deployment, you want to check whether it actually works, so you attach, in this case, Kubernetes jobs that run a test after the deployment, and you do this natively. Today a lot of it has actually been absorbed into OpenTelemetry, into the semantic conventions work. But the idea was giving you the visibility into the way you deploy software that you usually only get for your running software. That was the idea behind it. And yeah, Keptn was the first one we started that way; the second one was OpenFeature, and the first one we ever got involved in was obviously OpenTelemetry. So it was kind of our own maturity curve for launching open source projects.

Mirko Novakovic: So what's your rationale around it? When do you do something open source, and when do you do something closed source? What's the decision process? I mean, you're running strategy for that. So how do you decide when it's open source and when it's not?

Alois Reitbauer: So I think there's open source and collaborative open source. I mean, there are also things we build in the open where the source code is just available, like plugins and so on for Dynatrace, which you have access to and customers can adjust, and sometimes that helps them. I wouldn't call this fully open source; it's source-available, you can freely use it and modify it. For us, the point is always: does it make sense if more people, even competitors, collaborate? That's always my number one question. Does it make sense if others help you out? Does it accelerate what you're trying to do? And this was actually great with OpenFeature. We could have built a feature-management SDK on our own, we could have added observability on our own, but then we would have had to do it just for Dynatrace. We decided to go for a unified API across vendors, bring a lot of people on board, and today a lot of people are working on it and we ourselves do very little. One of my goals in open source is this: I'm tracking how much we contribute to an open source project relative to the overall code base, and this metric needs to go down, or the overall contributions need to come up. Actually, somebody from codecentric is even contributing to OpenFeature. That's one of the things I'm tracking.

Alois Reitbauer: And the other very important question I have is: when the open source project becomes successful, how does this make your business more successful? If you can't answer that question, open source is not going to work for you. In these cases, it was obvious for us. OpenTelemetry: more telemetry data we can get without writing instrumentation. OpenFeature: having a unified API to get feature-flag-level detail for all the applications that are running out there. That's usually where it's a no-brainer for us. I put this together in a version of the business model canvas that I call the open source canvas; it's at opensourcecanvas.org, and it outlines those questions. But this is key for me: does it benefit from more people contributing, because you need industry-wide agreement for it, and does the success of the open source project make you as a business more successful? Especially on the second one, you have to be really honest with yourself. If you look at the history of open source projects, some are actually struggling with this and then trying to make changes to their licenses or to the core model around it so that it eventually plays out. I think those are the two key questions we usually need to answer for ourselves. Yeah.
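[Editor's note: the "unified API" point, as a minimal sketch with the OpenFeature Java SDK. The flag key is made up; with no provider configured the SDK simply returns the supplied default, and a vendor's provider can be plugged in later without touching this code.]

```java
import dev.openfeature.sdk.Client;
import dev.openfeature.sdk.OpenFeatureAPI;

public class FlagSketch {
    public static void main(String[] args) {
        // Vendor-neutral entry point; a concrete provider would be registered
        // once via OpenFeatureAPI.getInstance().setProvider(...).
        Client client = OpenFeatureAPI.getInstance().getClient();

        // Application code evaluates flags against the same API regardless of
        // which feature-flag vendor (or in-house system) sits behind it.
        boolean newCheckout = client.getBooleanValue("new-checkout", false);
        System.out.println("new-checkout enabled: " + newCheckout);
    }
}
```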

[00:22:14] Chapter 7: The Role of AI in Observability

Mirko Novakovic: When I was flying over to KubeCon, I was sitting next to Andreas Grabner, by accident, and we talked a little bit about it as well. He said that, especially for Keptn, he thinks that what you need as Dynatrace is almost done, and that now it's more and more about getting the community involved and pushing the project further, right?

Alois Reitbauer: And even the ideas: to some extent, a lot of the initial ideas we had in Keptn made it into some of the core Kubernetes tracing, and some made it into OpenTelemetry. That's how you define success for a project. It's not necessarily that the project itself should be super successful, but the idea you want to bring into the world, or the agreement in the wider community you want to bring about. You know, I have this background in standards, and the Trace Context standard is a great example. It would not make sense if a wide variety of people did not agree on something like trace context. The big problem was that no tracing formats were compatible: cloud providers had built their internal ones, different tools had others. People might complain that it's not perfect, because it's an agreement between multiple parties, but it's working. And I think that's really what you have to keep in mind: what is your goal? If your goal is moving the industry forward, it's not even about your project. It's really: okay, this is where we want to be, that's the technology advancement we want to see. Because keep in mind, implementing something like trace context, for a company like us as well as for everybody else who had their own proprietary trace or context forwarding format, actually meant a lot of work. Bringing something like this to the market came, to some extent, with a million-dollar investment on our side. But still we pushed it as a company, because we thought it was the only reasonable way forward, even though we knew we had to touch massive amounts of our code.

Mirko Novakovic: Yeah, I can see that. I remember the discussion about the ID length and stuff like that. That's not easy for a vendor who has already built it one way and relies on certain things, and then you have to agree with multiple vendors on it.

Alois Reitbauer: Yeah, yeah. There's a famous quote from Steve Jobs where somebody was asking him about OpenDoc back then, and it is true: whenever somebody brings up a technical argument that you were wrong, there's always at least one reason why they are right. This is also the case with standards. Whatever you build, some people will not think it's the best solution. I think eventually adoption tells the story: are people using it? And we saw people using the standards. It was actually, back then, the first standard in the W3C that was application-focused and not web-focused. We talked to Philippe from the W3C, and he said, yeah, kind of, because the application-focused standards they had actually done were things like SOAP and others, and after that there wasn't really a lot happening there. I think that's really how you should measure success. And yeah, some people will always say you did it wrong, but come on, if millions of people are using it, it can't be that wrong. Yeah.
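[Editor's note: the ID-length discussion refers to the W3C Trace Context traceparent header, which fixes a 16-byte trace ID and an 8-byte parent span ID, hex-encoded. A minimal parsing sketch; the sample value is the one from the spec.]

```java
public class TraceParentSketch {
    public static void main(String[] args) {
        // Example traceparent value from the W3C Trace Context specification.
        String traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

        String[] parts = traceparent.split("-");
        String version  = parts[0]; // "00"
        String traceId  = parts[1]; // 32 hex chars = 16-byte trace ID
        String parentId = parts[2]; // 16 hex chars = 8-byte parent span ID
        String flags    = parts[3]; // "01" = sampled

        System.out.printf("version=%s traceId=%s parentId=%s sampled=%b%n",
                version, traceId, parentId, flags.equals("01"));
    }
}
```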

Mirko Novakovic: No, no, we love it. It's great to have it for interoperability, also for being able to work with traces that do not come from your agent and your instrumentation. That's the only way to really do it. I'm a big fan, right? I always loved the Dynatrace agent, the OneAgent. It would be interesting to get your opinion: how do you think this will evolve? Will proprietary agents go away? Will customers eventually say, I don't want proprietary agents anymore, I only want the open agent? Or will it be hybrid? Because I know, I mean, we invested a lot at Instana, we had so many developers on agent-level instrumentation, so I know that it's a lot of effort to build. And the magic that the Dynatrace OneAgent does, the discovery, the auto-instrumentation, and being able to merge OpenTelemetry with your own tracing, requires a lot of magic on the instrumentation side. So how do you see that evolving?

Alois Reitbauer: I mean, obviously, as the one who is literally building both the OneAgent and OpenTelemetry, I see coexistence, and I think you need to look at it in a bit more differentiated way. I'm separating out some capabilities that the agent has, like automatic installation, automatic load balancing, encryption, compression. Some of those things will make it into OpenTelemetry at some point, but people will still use the agent even with OpenTelemetry. I think a little-known fact is that you can actually use the OneAgent only for OpenTelemetry data, so we would only instrument the SDKs. Even this works: we roll it out and just fetch OpenTelemetry data. And there are other scenarios where an agent doesn't even make sense. If you look at cloud services that are totally third party, that are multi-tenant, I think one of the biggest inventions, one that is sometimes really undersold, is OTLP. I now have a standard protocol for getting trace information out of systems that I could never instrument, and there OpenTelemetry, especially OTLP, plays a massive role; how the data is collected in this case doesn't even matter that much. In other areas, for some middleware components, it goes back to open source. For Istio and Envoy, we invested a lot of time actually adding OpenTelemetry instrumentation to those open source code bases. You could never do that for the proprietary agent protocol we were running, because nobody would add that to an open source project; it would be kind of weird. An area where the agent is still used is for very specific instrumentation needs, where you add custom instrumentation.

Alois Reitbauer: Plus, we obviously have a 17-year library of support for technologies which, as we discussed before, are super, super old, but it covers a very wide range, and that obviously adds value. So I think agents will stay to some extent. We also see in OpenTelemetry that the concept of an agent, even if you might not call it that, is emerging more and more: we talk about auto-instrumentation, even at the eBPF level, and you could argue that this looks very much like an agent, even if you don't call it that. I think that concept will stay and provide some more extensibility. But obviously there are customers who want to go fully OpenTelemetry. I think the biggest challenge that OpenTelemetry still has, and that we have to figure out once you start to ship OpenTelemetry, is this: there are two kinds of people who want to buy observability. There are those who have a lot of OpenTelemetry code built into their environments and just want to send it somewhere and analyze it somewhere. And there are those who also want instrumentation libraries and full support for those instrumentation libraries. I think that is still an open challenge, almost like the Red Hat business model, but for OpenTelemetry sensors. Because for us, the biggest cost is not really building a sensor. There was always the argument at the very beginning that it's so expensive to build a sensor; it's actually not. It's setting up a test environment.

Alois Reitbauer: With all the permutations in which it can work, and building self-diagnostics into those capabilities. That is, today, actually the reason why people opt for a commercial solution. I think that's also where the OpenTelemetry project needs to evolve at some point: how do we test this, how do we build a test infrastructure, and how do we, as all the vendors in this space, agree to actually pay for it at the end of the day? Because this is really expensive. It gets you to the Linux distribution model: you can run Linux, but most people still opt for a commercial distribution of Linux, simply because if it's not working, you want to be able to call somebody. And this is, I think, still something the market needs to figure out: okay, this agent or this instrumentation is not working, who can I call? If I'm not the one actively contributing there, who's my proxy? The proxy who does this work for you is usually a vendor that you want to hold accountable. So that's, I think, the last obstacle for really mass adoption of OpenTelemetry that the market needs to figure out. We have our own collector distro where we handpick bits and pieces, test them on all of our target platforms, and provide full support. So if you're a Dynatrace customer, you get full support for our version of the OTel Collector and all the components we built into it, and we test them for upgradability and everything.
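[Editor's note: the OTLP point in practice: any OTLP-speaking backend or collector can receive spans if the application exports to its endpoint. A minimal OpenTelemetry Java SDK sketch; the localhost endpoint is a placeholder for whatever collector or vendor endpoint you run.]

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtlpSketch {
    public static void main(String[] args) {
        // Placeholder endpoint: an OTel Collector or any OTLP-capable backend.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        // ... create spans via sdk.getTracer("..."); they are batched and
        // shipped over OTLP. Flush everything on shutdown ...
        tracerProvider.shutdown();
    }
}
```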

Mirko Novakovic: Let's talk about AI. I just need to bring that up, and I think Dynatrace is one of the most vocal around AI. If you look at the website, everything you do, your booth at KubeCon: AI is a really big topic for Dynatrace. I know that you don't see AI as only LLMs and GenAI; you have had AI forever, for a long time, and you also had kind of a chatbot years ago, if I remember correctly. So what do you see as the next big evolution? What do you see today, and what do you think is the next big thing with AI and observability?

Alois Reitbauer: Yeah, we really look at three different types of AI; that's what the Dynatrace hypermodal AI is about. There's predictive AI, like running models, running predictions, and anomaly detection, where an anomaly is basically a failed prediction. Then there's causal AI, doing the fault-tree type of isolation based on your understanding of the environment. And then the LLMs. On the one hand, we see the LLM as a productivity tool. My favorite example is building dashboards automatically for you; nobody ever wakes up in the morning wanting to build some dashboard. Nobody ever does. Also opening things up more to people on the query side, obviously being able to query in natural language, which I think has two advantages. One is obviously that you don't need to write the query. And the second advantage, which I think is massively undersold, is that you don't need to know where the data is. Is it in a trace? Is it in a log? Is it in a metric? "Show me how many database statements I did in the last two hours" is a very simple question, but it's still very ambiguous when you look at where the data is coming from. It might be a metric that has it, or you might have to go through a trace, and very often it's hard to figure out. It's the same with a BI tool: BI tools are great at having all the data, but if you don't understand the schema and don't know where to find the data, they are literally worthless.

Alois Reitbauer: I think that's one advantage there as well. When I look at it, I always look at it in a couple of phases; it goes along with the history of Dynatrace. At the beginning, collecting the data was the challenge. That is solved right now, I mean, even for very large environments. But then retrieving that data started to be a big problem; it was easier to store it than to retrieve it. And the first wave of AI we worked on was really predictive anomaly detection. That was step number one, the first thing we built: this metric is behaving in a wrong way. Even the first generation of Dynatrace had some predictive AI capabilities in it. The second one, where we introduced Davis, was really the causal one: tell me what the impact is and what the root cause is, tell me the fault chain, be able to explain back to me why you think something failed. The usual example: service A calls B, B calls C, C fails, and A and B will throw alerts as well. If you don't put this into a causal order, you see all three services failing, without realizing that two of them actually failed as a consequence of something else failing.

Alois Reitbauer: So at the beginning, it really helped us to detect what's not right. In the second step, it helped us figure out why it's actually not right, and explain that back to us. Where GenAI now comes into play is helping us resolve issues more automatically. That's usually where APM or observability stops: here's your problem, now figure out what to do. The change that's coming into the mix now is that we can do something about it. I remember, even in the early days when we showed the AI, people asked: okay, if you know what the problem is, why don't you fix it? Which is a very fair question. The problem back then was that it was all custom scripts deploying stuff on machines. There was no automated way; we had no infrastructure as code, no CloudFormation templates, no Kubernetes manifests or anything like that, which we have today. And there generative AI is super helpful. Once you have the root cause, you can say: okay, modify this Kubernetes deployment and scale up the instances from 2 to 5. If you stay in this very limited domain, you really don't have an issue with hallucinations or anything; you provide very specific input and very specific ways to change it. And I think it's especially in this remediation, helping people fix problems, proposing changes to your environment, or even in a security context.

Alois Reitbauer: Hey, these two servers are allowed to talk to each other, but they never do. Or these two microservices: this is the policy that you need to deploy there. Yes, you could write it manually, but eventually it brings us from "just figure out what's going on" at the very beginning to more and more reviewing the proposals that we get. I like the analogy of a GPS. The GPS gets you from A to B if you trust it, but every now and then you still need to validate whether it's the quickest way, or you still want to have a final look at it; if it tells you to turn right and you're going off a cliff, you're most likely not going to do it. I think that's really the quality we get: this additional employee that says, okay, this is what I think went wrong, this is how I would solve it, do you want to accept the PR on the infrastructure-as-code repository or not? That's really, I think, where AI is eventually going to help us and take over more and more of those tasks down the road. And I think GenAI fits perfectly into that story.
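[Editor's note: a remediation like "scale this deployment from 2 to 5" is, mechanically, a one-line Kubernetes API call. A hedged sketch using the fabric8 Kubernetes client; the namespace and deployment name are made up, and a real setup would gate this behind the human approval Alois describes.]

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ScaleSketch {
    public static void main(String[] args) {
        // Uses the local kubeconfig; the names below are illustrative only.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.apps().deployments()
                    .inNamespace("shop")
                    .withName("checkout")
                    .scale(5); // the "2 to 5" remediation from the example above
        }
    }
}
```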

[00:35:49] Chapter 8: Dynatrace’s Culture and Innovation

Mirko Novakovic: Yeah, it makes sense. And I mean, kudos to Dynatrace, you have done so much in that space of AI and observability. When I was talking to Andreas, one of the things that struck me, as a CEO and also a founder, is that nobody leaves Dynatrace, right? When I remember back to when I was there the first time: I think you were there, Florian from product is still there, the founder, Bernd Greifeneder, is there. It's 15, 16, 17 years of constantly building and rebuilding the product. I think you reinvented the product multiple times; there was Ruxit, which turned into Dynatrace again, and you're probably rebuilding it right now with this App Store concept and everything. So how do you keep the motivation? How do you make sure that all these people stay? Basically most of your engineering is still in Austria, probably, right? Hundreds of engineers in Linz and...

Alois Reitbauer: In Austria, and there's Barcelona, so we have spread it out more. We try to keep our engineering mostly in the same time zone because it makes collaboration easier. It also keeps the motivation of developers higher if they don't have to work in the middle of the night. That's, by the way, why some of our engineering, especially open source engineering, is in the US: because of the overlap between the West Coast and Europe, the East Coast is actually perfect for this. Yeah, how do you keep people motivated? A lot of people ask me, why don't you get another job?

Mirko Novakovic: Yeah.

Alois Reitbauer: Like, after all this time. And I think the answer is: if you always start to build something new, it's like you get a new job every two years. And that's been the case for a lot of us. You have a market that changes, and that requires you to do something new as well. A bit more than one year ago, obviously, we started to work on how we can integrate AI; there are always new topics coming up. At some point we started to talk about microservices, then we started to talk about containers. We moved from data collection to analytics to automatic remediation to self-healing and AI. For example, we built a video, I think eight-plus years ago, about how we envisioned the future to be. It was this really well-produced video that we used internally for motivation: this is how life could look if AI took over the whole process from identification to remediation of a problem, fully automatically, with people just more or less monitoring or interacting with the AI. The interesting thing is that you can keep building on this, and the industry as a whole keeps giving you new opportunities for what to do there.

Alois Reitbauer: And you're always driven by building the next thing anyway, more or less with the convenience of not having to leave your job. So I think that's really the point: if you can do what you want, if it's constantly something new, and it feels like you're in a new startup every two, three, four years anyway in the area you're working on, it goes pretty well. Plus, especially for the very hardcore engineers, there are the very technical challenges. If you had talked to somebody seven years ago, or even five years ago, and said, hey, figure out how to process one exabyte of logs in one day, and not just once, but for 500 customers, they would have called you crazy, right? Today, that's a reasonable demand, at least for the next two to three years. So I think it's the challenge and the opportunity you provide to people, because you're not doing the same thing over and over; that obviously would get boring over a short period of time. Yeah.

[00:39:19] Chapter 9: Conclusion and Farewell

Mirko Novakovic: Thank you. Really impressive; congrats on the whole journey. I will be watching from the sidelines again what you're building, and it's interesting. Yeah, it was nice talking to you again.

Alois Reitbauer: Nice talking to you as well. I think the next time I will see you is most likely at KubeCon Europe.

Mirko Novakovic: KubeCon, yeah. See you there. Thanks for listening. I'm always sharing new insights and knowledge about observability on LinkedIn; you can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
