Episode 1 · 48 mins · 6/20/2024

Making Software That Doesn't Suck

Host: Mirko Novakovic
Guest: Michele Mancioppi
#1 - Making Software That Doesn't Suck with Michele Mancioppi

About this Episode

In this episode of Code RED, host Mirko Novakovic talks with Michele Mancioppi, Head of Product and founding engineer at Dash0, about their shared journey through the world of observability. From their first meeting during an SAP RFP to building Instana and now Dash0, the conversation explores the evolution of observability tools, OpenTelemetry, and their passion for crafting products developers love.

Transcription

[00:00:00] Chapter 1: Introduction and Background

Mirko Novakovic: Hello everybody. I'm Mirko and welcome to Code RED. Code because we are talking about code, and RED stands for requests, errors, and duration, the core metrics of observability. On this podcast, we will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my first guest is Michele. Michele is the Head of Product and founding engineer here at Dash0, and we have worked together already at Instana, where he was the Senior Technical Product Manager. And now we are reunited here at Dash0. It's safe to say he's one of the world's foremost experts on observability, and I'm very excited and proud to have him on here today. Hi, Michele. Hi, Mirko. Good to see you. You and I, we go back a while, even before. When I founded Instana, one of the first RFPs that we took part in was SAP. And you were on the other side. You were actually building that RFP. We got called by one of the SAP teams, and we had a meeting there. Even our CRO came over from the US, because it was this big deal. We drove down to Walldorf and were in the meeting room. And that's the first time I saw you, because you entered the room and told us that we had to leave, because it's actually against the rules of the RFP that a vendor comes in and gives a presentation. And so you said: oh, you can't present. You actually have to leave, right? That's how we met. So how do you remember that?

Michele Mancioppi: I mean, you caught me in a difficult moment. Doing an RFP of that size at a company like SAP was like trying to build an airplane while flying it. It was also a rather politically charged one, with different divisions trying to position their favorite tools, to the extent that I think I caught somebody forging documents to support a vendor. And then I get a message from one person of the team like: oh, we're talking with a vendor, sounds cool, we're thinking of giving them a contract. And I'm thinking: what? So I had to rush over from the other side of the building and say: slow down. That's not how it can work.

Mirko Novakovic: Yeah, so we left. We drove back to Solingen, and we lost the RFP, or rather we pulled out of the RFP, to be honest, because we also saw that SAP at that time, and the requirements, were just too enterprisey, too big for a startup like Instana. But surprisingly enough, a few weeks later we got to know each other even better, because you applied for a job, right? So you saw something in Instana which attracted you, so that you thought: I will not go to the winner, I will go to the one who pulled out and who I threw out of the meeting room. So why that?

[00:03:11] Chapter 2: Observability with Instana

Michele Mancioppi: I didn't throw you out right away, I gave you folks a chance of showing what you were doing. And you told me one of the most romantic things I've ever heard in my life: how many bytes you could add with instrumentation without screwing up the JVM's just-in-time compilation. I was smitten. The point is, however, that a company like SAP, with everybody and their mothers with different environments having a bunch of different requirements, required tools that were much more enterprise ready, much more scalable, and with support for Cloud Foundry, to even be considered.

Mirko Novakovic: But then you joined us. Why?

Michele Mancioppi: My observability journey started with my observability villain origin story, the one that made me realize that logs are not enough to know what your application is doing. And I started looking, you know, what are the things out there. And that's how the RFP happened. And it took like 18 months of my life, together with my partner in crime at the time, Thorsten Fuchs, to get the RFP done and the tool contracted. And then I spent, I think, roughly another year evangelizing the use. And then I realized that I probably liked doing observability more than using observability. And I remembered that small, scrappy startup with the CEO that knew something about the JIT. I said: yeah, why not? And speaking of which, do you remember our first interview, when I came to Oregon for the job?

Mirko Novakovic: Absolutely. I mean, intimidating as you are, right? But one thing was pretty clear: you are an enthusiast. You are into the detail as I am, knowing how many bytes you can add without screwing up the JIT, a lot into the details, and loving the story around automation. I remember that, definitely. This was the basic idea. If you talk about Code RED, right, what was the basic concept? The ideas when we founded Instana were a few, but one of the core concepts was automation, especially in the context of microservices and containers. When you have a lot of moving components, you couldn't really configure everything, right? You couldn't configure all the containers, all the business transactions, all the different languages, all the different dashboards. And so the basic idea of Instana was to have one agent that discovers everything automatically, instruments everything automatically, and kind of magically gives you a view on your application stack, even if it's super complex. Right? And that was always the magic moment for customers when we showed this: when you started the agent and things started popping up almost in real time, because we sent the data through a stream. That was another thing that we did right: real time. And I think that was the thing that you liked too, I mean.

Michele Mancioppi: Working with SAP, a company very, very large, with very, very different levels of skill across different teams, different divisions, one of the biggest challenges, even after contracting the observability tool, was to make it so people could roll it out. And especially at the time, if I were to send over a 20-page manual of "this is how you collect metrics and send them over to that observability solution", nobody would have done it. So it really needed to be a point-and-click thing. It's also a thing that is still very much alive today, especially in larger companies, that very often the people that need observability are not those empowered to configure it. At SAP it happened very often that, for example, if you had to use a proprietary SDK to emit telemetry towards a solution, the operators would want it, because they needed the data to maintain production, and the developers would say: no, never. I'll never do that. I have other priorities. I have features to ship. That is something that the kind of extreme automation of the installation solved pretty beautifully, because it became just very simple for people that did not have access or specialized expertise about the application to be able to inject telemetry collection, especially tracing. That was fantastic. And in fact, I joined Instana as the PM for that.

[00:07:48] Chapter 3: Post-Acquisition Journey

Mirko Novakovic: Absolutely. Yeah. You joined as the PM for the application monitoring, for the agents, for the instrumentation, the different languages. And yeah, I think we did an amazing job. I'm going through that a little bit faster now, because we want to talk about Dash0 and why we do it again. We had a good journey. We went on until we got acquired by IBM. I have to say, you didn't stay for long.

Michele Mancioppi: I was typing "I quit" as you were saying on Zoom that, oh, we got acquired. And I was like: no, please, not IBM. Please, not IBM.

Mirko Novakovic: Exactly. You don't like the big companies, right?

Michele Mancioppi: Three letter companies are bad for my health.

Mirko Novakovic: And so we got acquired. This basically ended our journey. And you started a new journey, first at Canonical, I think. Yep. And then you went to a serverless observability company called Lumigo, out of Israel. Tell us a little bit about Lumigo, because I think that's also special, right? It's serverless. It's kind of a different approach to observability here.

Michele Mancioppi: The nature of observability, what you need to know to tell whether your application is doing well or not, is not very different in the serverless world than it is in the more container-, virtual-machine-, or host-based, more mainstream cloud-native Kubernetes world. But the collection of telemetry and the processing of telemetry are remarkably different. For example, it is very seldom in a world of containers that you will not be able to transmit trace context, so that when your HTTP client is sending a request to a server, it sends over effectively a pointer to where within the trace it is, and the other side picks it up and starts adding more information to the trace as it goes on. In serverless, it's actually the exception to be able to do that. For example, when we talk about AWS, even with basic mechanisms like notification via SNS, the default settings will actually strip away all that metadata: SNS raw delivery throws away your trace context, and then you break the trace. So the technology needs to be built completely differently. And that is a pretty interesting challenge.
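The "pointer to where within the trace it is" that Michele mentions is, in OpenTelemetry, the W3C `traceparent` header. As a minimal sketch (not any vendor's actual implementation; the ids below are made-up example values), this is roughly what travels between client and server, and what gets lost when an intermediary strips metadata:

```python
# A minimal sketch of W3C trace context propagation: the client sends a
# "traceparent" header, the server parses it and continues the same trace.
# Field layout per the W3C Trace Context spec: version-traceid-spanid-flags.

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,                # currently "00"
        "trace_id": trace_id,              # 16-byte id, hex-encoded (32 chars)
        "parent_span_id": parent_span_id,  # 8-byte id, hex-encoded (16 chars)
        "sampled": flags == "01",          # sampling decision travels along
    }

def continue_trace(header: str, new_span_id: str) -> str:
    """What the receiving side does: keep the trace id, become the new parent."""
    ctx = parse_traceparent(header)
    flags = "01" if ctx["sampled"] else "00"
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = continue_trace(incoming, "b7ad6b7169203331")
```

If a hop in the middle (such as SNS with raw delivery) drops this header, the receiving side starts a brand-new trace instead, and the two halves can no longer be joined.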

Mirko Novakovic: Yeah, I totally get it. I mean, we know how hard it is to build a tool that covers everything, right? That's not easy. So specialized tools for serverless are pretty obvious, and I think they attract a space of users that use serverless a lot. But these days you see more of a mixture, right? People use everything, and they use it as a tool set, right?

Michele Mancioppi: I mean, there is a constituency of companies, especially the small ones that are starting out, that are pure-play serverless. But eventually most of them grow into containers, for example for pricing reasons. A few months ago there was this paper, I think it was from Amazon Prime Video, about the cost of Lambda at scale and why they moved to ECS and containers. That's a fact: you can use resources much more efficiently, if you have a consistently high-scale workload, by managing your own containers. And the technologies that you use, the way you do tracing, are different, much simpler. The technologies are remarkably different, to the extent that even in OpenTelemetry, when you look at how, for example, the Node.js SDK works in just an Express.js container, it's quite different than in Lambda. In Lambda you need to do extra work to make sure that you do not block the function as you send data out. The way you collect incoming requests is different. So, a lot of fun.

Mirko Novakovic: Yeah. And then going back to, or coming back to, Dash0. Right. Can you.

Michele Mancioppi: Started without me? Mirko? Explain yourself.

[00:11:59] Chapter 4: Founding Dash0

Mirko Novakovic: I started actually pretty alone, because I was on a nice island called Mallorca, and I was a little bit bored. And to be honest, when I left Instana at IBM, I told my wife I will never do a startup again and I will never go into observability again. Right? That was like 2021. And how many.

Michele Mancioppi: Weeks did that last?

Mirko Novakovic: Actually, it lasted for more than two years, right? I really didn't look into it. I hadn't really missed it, to be honest. I did other stuff, not in IT. I opened a restaurant and a wine bar and some other stuff. But then, I would say, at some point I noticed that I was missing building a product, building software, and also selling software. I like that a lot, working in a team. And so at the beginning of 2023 it was obvious for me that I had to do a startup again, right? So I told my wife: I have to do a startup again. And she was like: I knew that, right? And by accident I looked into an observability tool at that time, and I just felt how excited I was about the topic. And I looked at OpenTelemetry, which had already started when we had Instana, but it was now five years old. It was much more mature. It had three signals basically specified and implemented: logs, traces, and metrics. I liked it a lot. I loved the semantic conventions. We can talk about that later, what that means. But I just said: oh, this is the thing, right? And so I called Ben. Who?

Michele Mancioppi: Mr. Blackmore.

Mirko Novakovic: Yeah. And said: we need to have lunch. And so he also got excited about the idea of building something around OpenTelemetry, building something that's really useful. And so I got the team together. Right. And we needed someone for product and. Yeah.

Michele Mancioppi: Then who you gonna call?

Mirko Novakovic: Who you gonna call? Michele. Right. And I think at the beginning you were a little bit skeptical, I remember.

Michele Mancioppi: No, I had a different... See, I have always been in love with observability tools. And that's because it's the best way to help people make software that is better. I hate software that sucks. I take it personally. And making a good observability tool is a way to help people help themselves. That was the whole point why I went fighting windmills for 18 months at SAP: because I wanted to give tools to my peers to actually do a better job. What I wanted to do in the beginning, the idea there, was: do we make another SaaS tool, or do we try to bridge the gap with development environments? We already saw the signs of open core being on its last legs, and we saw the death battle of that with the relicensing of Redis recently, and what happened with HashiCorp and Terraform and the drama around it. So going down the dev tool route, maybe not. The SaaS route, on the other hand, is still working out really nicely.

Mirko Novakovic: Absolutely. So we started, you joined us. And basically the core idea, getting back to Code RED, is that we want to be OpenTelemetry native: really 100% OpenTelemetry support done right, especially on the semantic side, especially on the context side, bringing context to all the signals. That's one of the foundations of Dash0. But then, and that's something I know you and me discussed a lot, also at Instana, there is what I call product love. Really build a product that is thought through, that works well, that you just like using. And that's a lot about the details. Sometimes you don't even know why you love a tool. Let's take Linear as an example for task management. We just love it. It's so good. You don't really know why, but it has keyboard support, it looks nice, it's super responsive, super fast. And that's really difficult to build, right? It's like Apple products: sometimes you don't know why a product of Apple is so much better, even if it looks very similar, right? But it just feels better, because it's thought through into the detail. And that's something we want to do for observability, because we still think that not a lot of users really get the value out of it, because tools are too complicated. They are not easy to use, not easy to understand. And so we want to really bring this into the mass market, with a tool that's easy to use.

[00:16:33] Chapter 5: Product Philosophy and Challenges

Michele Mancioppi: The user experience, and specifically the gap in the quality of the user experience in observability tools, is something you feel really in your bones during an outage, when the tool is working against you, preventing you from troubleshooting, instead of laying out small pebbles or breadcrumbs in front of you, like in the fairy tales, to lead you to the source of the problems. If you get stuck in overly prescriptive workflows, or if the tool is inconsistent and two views use different cues to tell you what to do next, or if the experience of dealing with the data is just not tight enough, the ability to interact with the data and quickly prove or disprove your hypotheses by filtering stuff in and out and grouping very fast, then, yeah. In an outage, when the sirens are blaring and your boss is asking you every ten seconds if the checkout service is up yet, you already lose ten points of IQ just out of stress by default. So the tool needs to help you. And that is what we want to do.

Mirko Novakovic: Absolutely. But let's start with OpenTelemetry. OpenTelemetry is a standard under the CNCF community. It's the second biggest project after Kubernetes now, so it's really active. The purpose is to really standardize the format for the signals, but also the way that you acquire the data: the SDKs, the APIs. At the moment, three signals: logs, traces, and metrics. Profiles are almost there.

Michele Mancioppi: They're well on the way.

Mirko Novakovic: And real user monitoring is also inside, right? So overall, we can expect OpenTelemetry to be the specification for almost all the relevant signals in observability. And the benefit obviously is: it's open, it's standardized, you're not dependent on one vendor. Every vendor that supports OpenTelemetry can read OTLP, the protocol, the format. And that means not only that you can switch tools more easily, but, and I think this is the biggest benefit, that cloud platforms, serverless platforms, frameworks, and applications can add the instrumentation themselves, so that you don't even need an agent, right? The data just comes out from the core.

Michele Mancioppi: Yeah. That is actually a very big departure from what we did at Instana, where we needed to create bespoke instrumentation, our own tracing format, for every single library and version that customers would bring to the table, which is all of them at some point. I remember chatting with Fabian, the VP of engineering at Instana, complaining: how can it be that we managed to find customers that use 50 different HTTP client APIs in Java? First of all, who needs that many? And second, why does everyone use a different one? That actually turned out to be one of the cardinal sins of the JDK, solved only in version nine: not having a standard HTTP client. There was only HttpURLConnection, which was not enough. We didn't see the same problem, for example, in Node, because virtually all HTTP clients in Node were built on the standard http and https packages. But OpenTelemetry is more than just a way to get the data out. There is also the Collector, which is an agent, but an agent with a lot of flexibility built in. You can configure the Collector in so many ways to enrich your data, collect additional data, add context through resource attributes and semantic-convention attributes on spans, on logs, on metrics, on the scope. There is everything. It's effectively a Swiss Army knife that allows you to pre-process your telemetry before you send it to a database for querying and analysis.

[00:20:42] Chapter 6: OpenTelemetry and Semantic Conventions

Mirko Novakovic: Can you help a little bit with understanding what you just said, like the attributes? Because I really love the standardization of the semantic conventions, also the resource attributes. So can you just explain what it is and why it's so exciting?

Michele Mancioppi: I'll start with a story. It's a cautionary tale. One day, at one of the companies I worked at in between, I started looking at the way metrics, logs, and traces were manually annotated by developers reporting interactions with specific S3 buckets. Well, I found not one, not two, not three, but four different keys used to effectively specify which bucket we're talking about. And that was just the keys. The values had every single casing: uppercase, lowercase, camel case, kebab case. All of them. And that actually is a huge problem when you want to query across different databases, different types of telemetry, because the kinds of information that the logs give you, that profiling gives you, that tracing gives you, need to complete each other. But if you annotate the telemetry in ways that are effectively incompatible between signals, you lose most of the usefulness. I mean, then you need to go and fumble around: okay, start with the global time frame, fine, that we get, and then, oh yeah, how did we annotate that particular attribute here? That case was different applications by the same team using different keys for the same thing. That's a classic, right? So much so that pretty much every large observability team that we saw, the large customers at Instana, actually had problems standardizing across their organizations even a couple of little tags that they would add to say, for example, who you gonna call when this explodes. Now, OpenTelemetry solves this pretty beautifully with the concept of semantic conventions.

Michele Mancioppi: Semantic conventions: think of them as a sort of dictionary. It's a dictionary of keys and what type of values you're supposed to put in there. So, for example, if you're tracing an HTTP client that issues a request towards an HTTP server, the span that represents the outgoing request is going to have the HTTP method (GET, POST, uppercase please), the URL, the version of the protocol, how many bytes the incoming response is, what the status code is. And on the other side, the server, if it follows the semantic conventions, is going to have the same data. And it's not just that particular pair of client and server. Because OpenTelemetry has these standards, they apply to all the instrumentations that you get out of the box, and there are many. Not having to go and manually instrument yet another HTTP client, because somebody else has already contributed that instrumentation for you, and that instrumentation follows best practices in how to annotate data, that's amazing. That is like 90% of the job that we were doing at Instana for instrumentation after we had nailed the basics and the agent, and it was a continuous source of costs for us. At one point, I think we had 60 engineers in total, and 30 were working with me on the agent and the tracers. It was the biggest cost center for us.
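As a sketch of the client/server pair Michele describes, here are two spans annotated with keys from the stable OpenTelemetry HTTP semantic conventions (the URL and values are made-up examples; exact key names have varied across semantic-convention versions):

```python
# Client and server spans annotated with the same semantic-convention keys
# (http.request.method, http.response.status_code, ...), so they can be
# matched without guessing how each library spelled its attributes.

client_span = {
    "span.kind": "client",
    "http.request.method": "GET",        # uppercase, per the conventions
    "url.full": "https://api.example.com/orders?id=42",
    "http.response.status_code": 200,
    "http.response.body.size": 5120,     # bytes coming in
    "network.protocol.version": "1.1",
}

server_span = {
    "span.kind": "server",
    "http.request.method": "GET",        # same key, same casing: joinable
    "url.path": "/orders",
    "http.response.status_code": 200,
    "network.protocol.version": "1.1",
}

# Because both sides draw from the same dictionary of keys, a query like
# "all requests where http.response.status_code >= 500" works across every
# instrumented client and server, whatever library produced the span.
shared_keys = set(client_span) & set(server_span)
```

The point is not these particular values but the shared vocabulary: every out-of-the-box instrumentation emits the same keys with the same casing.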

Mirko Novakovic: No, absolutely. And just to make sure that it's clear, as an example: there could be a simple attribute like the hostname. Before the semantic conventions you could name it "hostname" in one word, "host_name" with an underscore, all lowercase, or with a capital letter. There are multiple ways of naming it. And then, if you have different signals on different servers, it is very hard: a log could have "hostname", then you have a trace with "host_name", and you can't bring the stuff together.

Michele Mancioppi: Or it could be that, even if the key is the same for some instrumentation, on some hosts you're collecting just the name, and on some others you're also attaching the DNS name.

Mirko Novakovic: So yeah, the.

Michele Mancioppi: ...or the IP address of the host. They are effectively different values when you're trying to match them in a database. So you need to know the differences, and it becomes really complicated, right?

Mirko Novakovic: And so OpenTelemetry standardizes that, so that you basically speak the same language wherever you are. And even if you get data from a third party, it's the same language. And that comes back to one of the big issues that we both see in observability: what we call context. It is this: if you have a problem, you want to see the data that you actually need to solve the problem, right? And one of the big issues of observability, in my point of view, is that during the past years we became better and better at collecting data. We have better databases in the cloud, serverless databases, you can store more data much cheaper. So what happened? Data exploded. But guess what? I always talk about finding the needle in the haystack, and what we did is we made the haystack bigger. And so it's very hard to find a problem if you have a lot of data. So what we try to do, and what OpenTelemetry supports with the semantic conventions, is making it much easier to find the needle in the haystack by giving the hay context. And once you know what you are looking for, you can bring everything together by basically joining through that semantic convention. And the tool actually has to support this.
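The "joining through the semantic convention" idea can be sketched in a few lines. Once a log and a span carry the same standardized attributes (here `host.name` and the trace id; the records themselves are invented for illustration), correlating them is a plain equality join instead of guesswork across spellings:

```python
# Logs and spans annotated with the same semantic-convention keys,
# so correlating them is a simple equality join.

logs = [
    {"body": "payment failed", "host.name": "node-7", "trace_id": "abc123"},
    {"body": "cache miss",     "host.name": "node-2", "trace_id": "def456"},
]
spans = [
    {"name": "POST /checkout", "host.name": "node-7", "trace_id": "abc123"},
    {"name": "GET /health",    "host.name": "node-2", "trace_id": "def456"},
]

def correlate(logs, spans):
    """Join the two signals on the standardized keys they share."""
    return [
        (log["body"], span["name"])
        for log in logs
        for span in spans
        if log["trace_id"] == span["trace_id"]
        and log["host.name"] == span["host.name"]
    ]

pairs = correlate(logs, spans)
```

With "hostname" on one signal and "host_name" on the other, as in Mirko's example, the same join silently returns nothing; that is the failure mode the conventions eliminate.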

Michele Mancioppi: You know the saying, right? Telemetry without context is just data. I was giving a presentation a few weeks ago where I put a chart on screen, a chart that went up and down, and it was red. And I challenged the audience: what does this mean? And they say: oh, it's red, so it must be about errors. Oh, it has spikes, spikes are bad. In reality, it was a chart from Reddit that plotted the number of likes on an animal versus cuteness. And I still don't know why there is a value for seven, but there you go. And that means: without knowing what the data means, how it is collected, what is good or bad... For example, is a duration of 200 good or bad? First of all, let's talk about the unit of measurement. 200 seconds for a request? Probably not great. If it's milliseconds or nanos, probably fine. But then again, is it a request that needs to be fast? Or maybe it's some batch job actually running overnight. So context is everything there.

[00:27:37] Chapter 7: Context and Contextual Relevancy

Mirko Novakovic: Absolutely. So context is a big story for us, and we try to build that into the tool. Which brings us to the second topic: ease of use, product love. It is essentially about building a tool that makes it super easy, in a world where we have a lot of data and we need to find the needle, the root cause of the problem. That's the hard problem to solve: build a UI that supports the user, really quickly, under pressure, not being an expert, not using the tool every day, in getting there and finding it fast. That's what we want to build, and that's what takes a lot of effort on the UI side. And I think we discussed it this morning, when we had a meeting: if you ask how you do that, it's culture, right? One of the most important things for a company that is building a product is culture: creating an atmosphere where people understand that every detail matters, and that you cannot be satisfied with something that's not working as expected. Plus creating a culture where nobody is afraid, even if the CEO says "oh, I like that feature", to say: no, no, that's not good. And having open discussions around it. I think so far we did a pretty good job assembling a team that is vocal, people who like to explain their thinking and who are also vocal in chats, saying when something is not running well.

Michele Mancioppi: Not only vocal, but passionate. Passionate about the problem they're solving, passionate about the people they're solving it for. I mean, if it were just a tool for us, we could have a great time. But the fact is that by making a great tool, we're going to help hundreds of thousands of people out there make software that is better, and by doing so, indirectly help their end users. For example, when I told my wife, yeah, I'm going to join another observability company, she said: really? Isn't that like the fourth in a row? It's like: yeah, but it's important. Tell me why. Because imagine the case where our son really wants to get a ticket for a concert, but the web application for buying the ticket sucks, for lack of a better word, and it crashes in front of our son, and he cannot go to the concert. Won't he be pissed? Incredibly so. Yeah. So what if you give the people making the application tools to make it better? Okay. I like it.

Mirko Novakovic: Absolutely. So usability is important. We are talking about that all the time. I like the discussion about the color green with you, right? Because I know that.

Michele Mancioppi: Tell me why you didn't call the podcast Code Green. Why do you use Code Red? Why not green?

Mirko Novakovic: Yeah, Code Red. It's one of those discussions. If you think about it, there is the idea that a lot of things are red, green, and yellow, where red means something is wrong and green means something is good. But we think, for example, that this actually distracts you from the most important things, the red ones. If everything is colorful, then it actually has no meaning anymore, right? Nothing stands out. And so we, for example, removed the color green for most of the product. So if something is normal.

Michele Mancioppi: There's going to be just one place in Dash0 that is going to have the green color, and that's the status page, because we want our users to go to the status page, see green, say "Dash0 is fine", and move on.

Mirko Novakovic: Absolutely. But other than that, we really try to make it as obvious as possible to see the things that are wrong. That's so important. The things that are red, that are yellow, those are what we put emphasis on, and the rest should just look normal. If normal is white, it will stay white. If normal is gray, it will stay gray. If it's black, it's black.

Michele Mancioppi: The motto is to use color sparingly. Specifically, those colors are red and yellow, and gray for everything else. Because it's really nifty that you can log into Dash0 and your eye automatically goes to the brightest color, and you see immediately: oh, that thing is not going fine. That's shaving time. That is improving the way that people navigate the system: instead of forcing people to read a lot of stuff to figure out what's good and what's not, it lets the automatisms inside them kick in. Pretty much all of us are trained to drive, and the traffic light, as defined in Baltimore in the 1920s, in terms of color schemes, is what we go by as an industry to describe what's good and what's not. So let's use that to the largest extent.

Mirko Novakovic: No, absolutely. And there are many other things that we think are important, right? Connecting the data. So if you have a log, you should see the context: you should see the trace context where the log was created, you should see the resource, for example the pod where the log was created. So you have everything in context, and you are able to navigate easily, just by clicking, because everything is linked. We are big fans of the keyboard, especially because of developers. So everything in this tool should be quickly navigable and searchable and filterable with the keyboard. Like, in a span view or in a log view, you press Shift-F, a filter opens, you just put in the filter, and it works. And it's fast, it's responsive, it's fun to use. Just so you know, we do not have a released product yet, but we already have design customers onboarded, and we will release the product at KubeCon in Salt Lake City in November. But it is fun. It's fun using it.

Michele Mancioppi: One of the things that OpenTelemetry has done that was a drastic departure from the past is actually dealing with multiple signals. You remember, OpenTracing was just traces. Everything was seen from the point of view of spans. You would not have a specific way to describe where the span is coming from. The metadata describing, for example, the pod that runs the application that generates a span would be mixed with the metadata about the span itself. Not great, but viable. But the fact that, for example, we have logs, spans, and metrics under the same roof has given us two more ways of correlating data. In the past we would correlate by time frame: when did it happen? Stuff happening roughly at the same time could be correlated, and then you would have to go and invent your own semantic conventions, for lack of a better word, to describe where the data was coming from, hoping that everybody is on the same page spelling hostname the same way. But OpenTelemetry gave us two more ways of correlating data. One is the resource, which is effectively a set of metadata, very well specified and always increasing; there are always more ways of describing more systems, to specify where the data is coming from.

Michele Mancioppi: What system is this telemetry talking about? And then there are point-to-point correlations. When you emit a log, you can annotate on it what the current span is. When you emit a metric data point, you can provide exemplars, which are pointers to one or more spans. That roughly means: this span was happening when the metric changed. Very useful, because it allows you to effectively look at what operations were in the process of being done when the metrics changed. Profiling is going to be the same here. The format for OpenTelemetry profiling is effectively an extended version of pprof, which is the profiling format for Go. And it's not going to be very forward or backward compatible, technically, but there were two good reasons to break from the Go format. One was more optimization, so more denormalizing, but the big one was to be able to link samples with spans. And so spans become the central pillar of the entire OpenTelemetry system: everything can relate to spans, and through spans, everything can be related together.
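The two correlation mechanisms Michele describes — shared resource metadata and point-to-point trace context — can be sketched as a minimal data model. This is an illustrative simplification, not the actual OpenTelemetry SDK or protobuf schema; the field names are assumptions for the sake of the example:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Resource:
    # Shared metadata: every signal emitted by the same pod carries
    # the same resource attributes (e.g. k8s.pod.name).
    attributes: Dict[str, str]

@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    resource: Resource

@dataclass
class LogRecord:
    body: str
    resource: Resource
    trace_id: str  # point-to-point link: the span active when the log was emitted
    span_id: str

@dataclass
class Exemplar:
    # A metric data point can carry exemplars pointing at spans that
    # were in flight when the metric changed.
    value: float
    trace_id: str
    span_id: str

@dataclass
class MetricPoint:
    value: float
    resource: Resource
    exemplars: List[Exemplar] = field(default_factory=list)

def correlated(log: LogRecord, span: Span) -> bool:
    # A log relates to a span either via explicit trace context
    # or because both carry the identical resource.
    same_context = (log.trace_id, log.span_id) == (span.trace_id, span.span_id)
    same_resource = log.resource == span.resource
    return same_context or same_resource
```

The design point is that correlation no longer relies on timestamps or ad hoc naming conventions: the identifiers travel with the data itself.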

Mirko Novakovic: And coming back: why is that important? It is important because you want to find the needle in the haystack. And the relationship between the data makes it easy to create context, to reduce the amount of data that a developer sees, and that reduces the time to repair. So if something is broken and a developer needs to fix it, and you get them the data they need, then it's so much faster. And you are losing less revenue: you fix the problem faster or even make sure that it doesn't happen. And that's why we are building it. Context is for developers, to make it easy to fix problems.

Michele Mancioppi: You know, they say context is for kings. And for us, the end users using observability tools are our kings and queens. It's for them. I remember firsthand the frustration of having to sift through too much data and going down too many rabbit holes to find out what the reason was, because the data was too much and too poorly annotated. Hopefully we are well on the way to fixing that. It's lovely.

[00:37:32] Chapter 8: Cost Transparency

Mirko Novakovic: Yeah, and talking about data, I want to make one last point, which is also important for us, and it is the topic of cost. And I want to mention it because normally cost is very tied to the amount of data you are collecting, storing, and processing. And I don't really like to say we want to be cheaper, because I think at the end of the day data is stored on S3, it's compressed with very similar algorithms, and there is not such a big differentiator in how you process and store the data. But that said, there is a big differentiator in how you give the customer control over the cost and visibility into the cost, right? One of the biggest pain points we hear today from customers is that with their existing observability solution, they get surprised at the end of the month by how much they have to pay. We had this over the last weekend with a design customer: something broke and the web server emitted the "connection reset by peer" log. The log volume exploded. It was 30 times the amount of log data, basically always the same log message. So you store 30 times the amount of logs, and you pay for it, right?

Michele Mancioppi: Well, we're not done. First of all, that log was pretty useless, because although it was labeled as an error by nginx, in reality the client recovers from it most of the time. The only thing that changed was the IP address. But instead of storing maybe just the IP address, which in most cases is even useless because, yeah, it's hard to know where it really comes from, all the rest of the log, which was like 80% of the bytes sent over, was always the same. Storing 30 times more logs that are completely useless, and having to pay for it: it's effectively insult after injury.

Mirko Novakovic: Yeah, absolutely. And so a smart tool could just detect that and say: hey, I create a metric, I just count those logs, but I throw them away and keep only one or two. Or only the IP addresses, if necessary. And then you don't store them, but you still have all the information about when, and how many, logs of that type happened. And that's something we want to build in: give the user the power to say, if this happens, I don't want to store and process the logs, because I pay too much. It's enough to have a metric and an example of the logs. And, this said, if you look at our whole space, there is kind of a new category. Cribl is, by the way, leading that. At the beginning, I was not even sure what Cribl is doing, but now it's a company with a multi-billion dollar valuation. And basically what they did is reduce the amount of data you are storing in tools like Splunk or Datadog, or give you the option to put the data into a different place where you don't pay so much. And that's basically the use case. And customers are buying it because they have that pain, and Cribl is solving that pain. In my point of view, it is something the tool itself should solve, because you have all the context, you know what the data is doing. But then it's an innovator's dilemma, right? If you provide that feature to your customer, you will lose revenue, right? Because you are.
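The "count the flood, keep a sample" idea Mirko describes can be sketched in a few lines. This is a toy illustration, not Dash0's implementation; in particular, the digit-stripping "template" heuristic is a stand-in for a real log-template miner:

```python
from collections import defaultdict

def aggregate_logs(logs, keep_samples=2):
    """Collapse a flood of near-identical log lines into a counter,
    keeping only the first few full records as samples."""
    counts = defaultdict(int)
    samples = defaultdict(list)
    for record in logs:
        # Crude template: drop digits so lines differing only by an
        # IP address or port collapse into the same key. A real system
        # would use a proper log-template mining algorithm.
        template = "".join(c for c in record if not c.isdigit())
        counts[template] += 1
        if len(samples[template]) < keep_samples:
            samples[template].append(record)
    return counts, samples
```

For the incident above, 30 copies of the same nginx error would become one counter with value 30 plus two stored sample lines, instead of 30 billable log records.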

Michele Mancioppi: Paying. It's effectively asking for the salespeople to come over to your desk with torches and pitchforks. It's effectively like cutting the branch from under you. In the long term, it's probably great for you; I don't think any tool would actually survive a reputation of being unfair, expensive, and greedy. It's good for the users. You just have to power through a few bad quarters, and companies that are very large are really not made for that. It takes some special act of dedication by founders or leadership to actually take the blow and say: look, this is going to be worth it. But guess who is not going to have that problem?

Mirko Novakovic: No, I just want to say that we have this as one of our core principles. We call it cost transparency. We want to be as transparent as possible, to show the user where observability data is generated and how much you pay for it, and to give you options to decide. I'm not saying only to reduce, but at least to decide how much you want to spend for a certain service or for a certain log type like we just discussed, or give you options to intelligently reduce the cost by compressing, by throwing data away, by aggregating it into a metric. So there are many options for doing it, especially today, with the advent of things like AI and LLMs, which can also help in understanding what's similar or what's the same and how to aggregate data. And I think that's a big topic for us: to make it, again, easy for the user to control and optimize the cost of their observability.

Michele Mancioppi: But also fair. Systems are not necessarily alike. The logs of a debug or development environment I care about far less than production. If I have a spike in logs for debug, I'm probably not going to sift through all of them. I will not need to recompense my users for something that went wrong. So I'll probably just throw them away, figure out what the bug is, and move on. The fact that no two services are alike in importance to the end user means that the way telemetry is stored, processed, and compressed, and its retention periods, could also vary.

[00:43:08] Chapter 9: Dash0's Core Principles and Release Plans

Mirko Novakovic: Yeah. And to summarize, the basic ideas of Dash0 are: being OpenTelemetry-native. We are 100% committed and will build the best tool for OpenTelemetry. That's number one. Number two is product love. We will build the best user experience in the observability space and make it super easy for developers and SREs to find problems and get the right context, the right data, at the right time. And three: no surprises. Full cost control, giving the user the power to decide how long to keep data and which data to keep, and intelligently making sure that we only collect what's really needed to solve problems.

Michele Mancioppi: By the way, what is your observability villain origin story?

Mirko Novakovic: It goes back to '99. I was finishing my studies and started at IBM. And at that time, if you remember, that was kind of the start of the dot-com bubble, and it was basically the start of the internet, especially in Germany. I was part of projects, for example, building the online banking for the largest German bank, or building the first website for booking flights. And so we were building that with Java on WebSphere, if you remember that.

Michele Mancioppi: Memories.

Mirko Novakovic: Oh, yeah. That was the first time there was scalability, parallelism; it was a totally different way of programming. And I was a young developer in those projects, and we had massive scalability and performance issues. And I tried to debug that, and I needed visibility. I came across a tool called Wily Introscope, built by Lew Cirne, who later founded New Relic. And I was a user of Introscope, which already had tracing in it, so we could trace the WebSphere applications, the Java applications. And that's when I fell in love with the whole topic, because I understood that I could actually get insights into production environments. And then it happened that I was the one inside of IBM here in Germany who knew how to do that. And so I went from project to project, and later on I founded my own company doing that, solving performance issues. And I worked with teams from Dynatrace and AppDynamics early on. And, as I said, I fell in love with that category and then built Instana, my first own tool. So it's a really interesting space. It's changing all the time, it's evolving very fast, and it's technically challenging, which is also an interesting thing to solve. That's why I really like it.

Michele Mancioppi: I mean, the purpose is very meaningful. You make stuff better. I had something very similar. One day I was working as a full-stack engineer at SAP on the cloud platform cockpit. It was like the cloud of SAP; now it's called something else. And my product owner, Robert, comes over with a laptop and says: you need to see this. Concern on his face. Like, okay, which bug did I ship today? He puts the laptop in front of me, and it's the LinkedIn page of some bloke in Brazil that said: I think I logged into somebody else's account in SAP Cloud Platform. Alarm bells ring in my head. I said, okay, I'll take care of this. I go download the access logs for the cockpit from every single landscape I could think of. And then I take the code base, 30,000 lines and growing, written by a bunch of different teams, because all the teams were contributing. And then I spent two days effectively tracing with my finger, through logs and code, what I thought that particular user was doing. And there was nothing wrong. I could not find anything: the guy had logged in, spun up a couple of apps, torn them down, and moved on with his life. And then I said, oh, it cannot be. So I take the laptop, move across the corridor to the one person running the authentication database for the entire thing: I need your help. What did this guy do? And then he runs a couple of queries against MaxDB with the command line, which nobody else was allowed to do because he was the only one with access, and says: oh yeah, first he logged into production, I think it was the EU production data center, and then into the test environment. Come on. Really? Because it was a completely different setup, the guy had logged in with a different URL, did not recognize the stuff, and thought: it's not mine. After that, I swore never to have to do that again. And that's the series of unfortunate events that brought us together in the end.

Mirko Novakovic: Absolutely. Yeah. Michele, thank you for the conversation. I think we will have fun building this product together. And we.

Michele Mancioppi: Have.

Mirko Novakovic: Yeah. I'm really looking forward to release it at KubeCon in Salt Lake City in November.

Michele Mancioppi: Yeah. Come visit us.
