Episode 16 · 40 mins · 1/9/2025

Observability 2.0: The Birth of Modern Observability

Host: Mirko Novakovic
Guest: Christine Yen
#16 - Observability 2.0: The Birth of Modern Observability with Christine Yen

About this Episode

Christine Yen, CEO of Honeycomb.io, joins Dash0’s Mirko Novakovic to discuss the code red moment that inspired Honeycomb, the importance and coining of the term "high cardinality data", how the modern understanding of observability was formed, and more.

Transcription

[00:00:00] Chapter 1: Introduction to Code RED and Guest Introduction

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0. And welcome to Code RED: code, because we are talking about code, and RED stands for Request, Errors, and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Hello everyone! Today my guest is Christine Yen. She is co-founder and CEO of Honeycomb, and Honeycomb was founded in 2016 and is basically the inventor of the observability word and industry, right? So I'm very glad that you are here, Christine. Welcome to Code RED.

[00:00:48] Chapter 2: Christine Yen's Code RED Moment

Christine Yen: It's very kind of you to say. Thank you for having me. I'm excited to be here.

Mirko Novakovic: Yeah. We always start with the first question, which is what was your biggest Code RED moment?

Christine Yen: I was working at a company that was a mobile backend as a service. So all of our customers interacted with us through our API and through our SDKs, and the product that we sold abstracted away a lot of the complexity of building backends and persistent storage for these mobile developers. I was building the analytics component of this; we wanted to offer some visibility into all the traffic that was flowing through our servers. Part of building out this analytics component involved standing up a Cassandra cluster and doing a lot of pre-aggregated analytics, in order to predict for customers what analytics they'd be curious about looking up. And I was the lead engineer on this project. My now co-founder Charity was then the lead infra engineer. I was really excited to be building this new feature on a platform that had a lot of traffic, and it would be turned on immediately for everyone on the platform. And, you know, we did a lot of planning and thinking and got it rolled out. About a month after its release, I got a call, I think it was on a Saturday, saying: Christine, something is wrong with the Cassandra cluster. What did you do? What do you mean, what did I do? I built the feature. It should just be running.

Christine Yen: What do you mean, something's wrong? As if it were my fault. Some folks were in the office, and they pulled up their graphs, right? The graphs that told them that something was wrong, and it was write throughput of the Cassandra cluster, and CPU utilization, and memory, and all of these very ops-y, very infrastructure ways of thinking about a software system. And I had spent very little time on the ops side of the world. But I was a newer engineer, and this was my thing, and all the engineers just seemed, like, very cool. So I wanted to pretend like I knew what I was doing. They were standing there frowning at these charts, and I was like, okay, well, I will stand here in front of these charts too. We were trying to figure out what was going on. To give you a sense of scale, I don't remember the 2012 numbers, which is when this was happening, but in 2013 we were serving traffic for maybe 60,000 unique mobile apps. Each one of these different apps had a different usage pattern of our platform. Some were write heavy, some were read heavy, some were, you know, syncing their entire app state from every mobile client every second, some of them were very judicious in their use of the API. So, a totally heterogeneous mix of traffic. And we were trying to figure out which of these apps was causing the Cassandra cluster to struggle, because it had been going fine for a month.

Christine Yen: We ended up using tcpdump and, like, looking at the data flowing through to try to figure out which app was overrepresented. And we were doing this with our eyeballs, because it was live and these were the tools that we had. We had our dashboards, but our dashboards were saying the system overall was fine; it was just the Cassandra cluster that was struggling. And in the logs, we didn't know what to look for. So we were stuck tcpdump-ing and just, like, reading with our eyeballs. And in the meantime, this new feature that I'd been very proud of building was just completely broken. It was not accepting new data for any customers, and it was deeply stressful. What we ended up finding out was that that day, it was the heyday of mobile, right? That day, a Russian dating app had launched, and they were sending these incredible payloads that were translating to a really intense load on the analytics cluster. And once we were able to isolate that and blacklist it, pause and throttle their traffic, everything stabilized for everyone else. But it was horrifying for me, because it was my thing, I was the lead engineer, and I was horrified by how large a gap there was between the data that the infra team looked at and the data that I wanted to live in.

Christine Yen: Right. As an application engineer, I want to talk about customers and endpoints and SDKs and the business logic of the code. And that wasn't what was represented in our charts. Life went on for a couple of years beyond that. But that one stands out in my mind, because when Charity and I started talking about building something new, it was really rooted in this sort of exchange that we'd had in the past, where ops and dev just spoke different languages, felt like they were on opposite teams almost, even though, you know, we were just there to deliver a great experience to our customers. And really, the inspiration for Honeycomb, the realization that life could be different when trying to deal with heterogeneous traffic on a, you know, multi-tenant platform with polyglot storage and a high rate of change, the real source of inspiration was this internal tool at Facebook. That company was eventually acquired by Facebook, and we were exposed to their internal tools. It was a tool that made a set of trade-offs that we'd never seen before. It felt like it had all the flexibility of logs. It had the speed of metrics. It allowed us to have a sense that something was happening at a very high level, but drill down and isolate a problem, even though we didn't know what was going on.

[00:07:05] Chapter 3: Inspiration and the Birth of Honeycomb

Christine Yen: 60,000 mobile apps: in a sort of traditional monitoring world, you just can't say things like, show me which one out of the 60,000 is sending more traffic than I expect. That's just not a question you can formulate with pre-aggregated metrics. The deep irony here is that the problem was with a system that was providing pre-aggregated metrics to our customers. Once we were exposed to this tool inside Facebook, we were like, holy cow, we never want to try to build this mobile backend as a service again without a tool like this. The thing that this internal tool did, that we've baked into Honeycomb from the very beginning, was the ability to handle whatever nouns, whatever language, was necessary or useful for these two halves of an engineering team to talk to each other. Because the traditional monitoring tools talked in the language of infrastructure. At the time, my error tracking tools spoke exclusively in sort of developer language, but we needed a tool that could talk about app ID, mobile version, mobile operating system, these nouns that only existed and mattered uniquely for our platform, our system, our business, our product, and that would have allowed Charity and myself to understand quickly: oh, this is one app doing this one thing with our API. Okay, now we know how to stop the problem.

[00:08:36] Chapter 4: The BubbleUp Feature and Its Impact

Mirko Novakovic: Yeah. And it makes total sense. So that Code RED moment basically led to Honeycomb then, right? That learning that ops and dev are not connected, don't speak the same language, don't have a tool that they can basically communicate with. So as far as I understand, you wanted to bring dev and ops closer together with Honeycomb. And the other part was how you use the data. I mean, I remember using something similar like Wireshark and then looking at the data and trying to figure out what's happening, because that was the only thing we had. And today, and we can talk about that, I love BubbleUp in Honeycomb, which is essentially, if I get it right: you see a spike, as you described, or something gets slow, you mark it, you click a button that says bubble up, and then it analyzes what data is different in that spike. And in your case it would actually tell you, because it has the high-cardinality data, which also has tags like the country, it would tell you: oh, these are actually all from Russia, from that dating app, right away. You would get that answer with a click now, right?

Christine Yen: Precisely.

Mirko Novakovic: It's pretty complicated, right? I mean, to get there from a technology platform standpoint, right?

Christine Yen: Exactly. The world that we're in today is one where no one is sitting around going, gosh, I wish I had more data. We're all drowning in it. The question is: how do we make that data useful? How can we make it easier to know what questions you're even trying to answer with your data? And that was really the impetus behind BubbleUp. We noticed that with Honeycomb, and with the data that folks were sending to us, they might be sending country and they might be sending an app ID, all of these useful nouns, but they would have this very iterative flow of: okay, something's going on with, let's say, the latency talking to Cassandra. Okay, latency for talking to Cassandra is up. iOS versus Android, is one of these significantly higher than the other? Okay, let's break down by that, then zoom in and filter to just iOS, and okay, can we break down by which endpoint it's impacting? They end up in this loop of looking at their data through a different lens and honing in. And at first we were like, that's awesome. That's iterative, that's exploratory. That's actually the workflow that we want people to have when they're tracking down these unknown unknowns in their systems.

Christine Yen: Then we were like, how can we shortcut this? We called it the breakdown dance, because at the time, you know, SQL calls them GROUP BYs, but we called them breakdowns in our UI. And it was just this dance that you'd get into, and for people who knew how to do it, it would feel very natural and kind of fun, because you just keep getting different views of your data. But for folks who didn't know how to do the dance, you're kind of lost and trying to get your feet under you. And BubbleUp, you know, is effectively anomaly detection: here's this weird area, what's different about it relative to everything else? And what we found is that by being able to leverage this high-cardinality data that you describe, no matter which engineer on the team you were talking to at the time, anyone knew: oh, app ID, I know what that means. I know what it means for our customers. I know how to figure out whether it matters to us or not. I know how to look up more information about this. It gives every engineer so much more ownership over their software and context for what signals matter. And that is incredibly powerful and transformational for how engineering teams work.

Mirko Novakovic: It is. And what I also find really interesting: I think it's one of the most powerful features in the industry, but it's not AI, right? It's actually not AI doing the root cause analysis. It's basically a comparison of two large sets of data across very many different parameters. So what you essentially need is a very powerful database where you can do that fast, for a lot of data. That's kind of the challenge. But it's not AI, right?
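To make that concrete, here is a minimal sketch in Python of the kind of comparison being described: take the events inside a marked spike, compare how often each field value appears there versus in the baseline, and rank what stands out. It is an illustration of the idea only, not Honeycomb's implementation; the event shape and field names (country, app_id, platform) are assumptions.

```python
# Illustration only: compare attribute-value frequencies inside a selected
# spike against the baseline, and rank what is overrepresented in the spike.
from collections import Counter, defaultdict

def bubble_up(selected_events, baseline_events, top_n=5):
    """Rank (field, value) pairs by how much more common they are in the selection."""
    def frequencies(events):
        counts = defaultdict(Counter)
        for event in events:
            for field, value in event.items():
                counts[field][value] += 1
        return counts, max(len(events), 1)

    sel_counts, sel_total = frequencies(selected_events)
    base_counts, base_total = frequencies(baseline_events)

    scored = []
    for field, counter in sel_counts.items():
        for value, count in counter.items():
            sel_share = count / sel_total
            base_share = base_counts[field][value] / base_total
            scored.append((sel_share - base_share, field, value, sel_share, base_share))
    return sorted(scored, reverse=True)[:top_n]

# Toy data: the spike is dominated by one app from one country.
baseline = [{"country": "US", "app_id": f"app-{i % 50}", "platform": "ios"} for i in range(1000)]
spike = [{"country": "RU", "app_id": "dating-app", "platform": "ios"} for _ in range(200)]
for diff, field, value, sel_share, base_share in bubble_up(spike, baseline):
    print(f"{field}={value}: {sel_share:.0%} of spike vs {base_share:.0%} of baseline")
```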

[00:13:00] Chapter 5: Complementary Role of AI and the Journey to Building a Database

Christine Yen: It's not AI, it's statistical analysis. To be honest, I have been a real AI skeptic in the past, and this current wave of tech is actually pretty cool, and I'm very curious to see where it goes. But our philosophy, and you and I are both in this space: none of our customers can afford the type of false positives or false negatives that are possible with an AI tool. And I think there's a previous generation of AIOps companies that really tried to use AI on the data in a way that exposed customers to that risk. I think the application of AI going forward has been generally more shrewd and thoughtful, which is awesome. But our philosophy has always been not, what can the AI do instead of the people, but how can it complement what the humans do? AI, ML, statistical analysis, Honeycomb, whatever: how can the software complement what the humans bring? Because as good a job as you and I do as vendors, the customer is always going to bring more context about what their software is doing. We just see a spike; the customer knows whether it was a load test or a DDoS. Customers bringing that type of context, being able to tap into that judgment, that's what humans can uniquely do. What machines are really good at is churning through tons and tons of data really fast. And so that's the fun part about building software for humans, right? We get to figure out where to draw that line. Our field CTO, Liz Fong-Jones, likes to talk about building mecha suits instead of robots. We want to give the humans exceptional powers, not try to replace them with something that's going to, you know, run into a wall a bunch of times before finding the right path.

Mirko Novakovic: And to do that, you had to build this database, right? Which is based on the concept of events. Maybe you can tell us a little bit about it. I'm also curious because I've gone through this myself, though I've never built a database for my companies. I can imagine that it's a long-running task, and at the beginning you also have to convince your venture capital folks that there is serious engineering needed before things get running, right? So I'm curious how you did that: building that database that's based on events and can work with this high-cardinality data to do those tasks. How long did it take for a first version, and what is kind of the magic behind it?

Christine Yen: So Charity and I are the sort of engineers who tried really, really hard to not build our own database. I will just say that it is not anything we would recommend to anyone. It is something where we looked at many, many, many options to try and avoid it. At the time, ClickHouse didn't exist, or was not production ready for folks outside of Yandex, but we had a very clear picture of the type of experience we wanted to provide through Honeycomb. As I mentioned, this had come through our own use of this internal tool. So we knew that we wanted to optimize for speed above all else, and that we'd be, you know, okay using approximation algorithms where possible and making other sorts of clever trade-offs to maximize speed. We knew that one of the really magical things about using Scuba was that for any developer, the process of adding new metadata was frictionless. You just start sending an additional field, and it would show up as a column in your data, able to be queried on like anything else. Whereas with the other, more traditional logging and monitoring tools inside of Facebook, you had to file a ticket, and there were really tight controls, and that just added a lot of friction to those tools being useful to developers. So we were like, this is a requirement for how we want people to use our software. And we knew that it was really important to us that you be able to get graphs really quickly, but always be able to drop down to look at the raw data that comprised that graph. So if you imagine that mobile backend debugging example, we wanted someone to be able to get to a place where they were looking at a graph that was filtered to just iOS requests in Russia, hitting our analytics endpoint.

Christine Yen: Broken down by app version. We wanted someone to look at a graph that granular and then say: show me the raw data that was churned through to view this graph. Show me the logs underneath the graph, if you will. We wanted all of that to happen really quickly. And that combination of things, speed, flexibility of the raw data, and the ability to update schemas easily, did not exist in any of the off-the-shelf solutions at the time. Because of that, we knew in our bones there was just not a way that we would be able to provide this value, this transformative experience, to the market without building something specifically for that experience. Fortunately, Facebook had published a white paper about Scuba's architecture, and none of it was a secret. So we were able to build, as pragmatically and as simply as possible, our own column store that could back the types of queries we wanted to run. You mentioned our database and the concept of events. Events are just a fancy word for spans, for structured logs. Talking about data types makes me very sad, because from my perspective, it's all just data, right? Logs, metrics, and traces are not different types of data because they're fundamentally the best way to ever talk about software systems. They are different types of data because of where we were 30 years ago, when the software we built wasn't meant to run on other people's machines.

Christine Yen: The constraints of how we stored and processed data back then meant that we had counters and we had logs, and we had to make that trade-off. Today, the world is a little bit different. You actually can derive metrics from your logs in a way that is virtually indistinguishable. Events are just a word for a type of data. In the end, we're a column store custom built for really fast analysis over cloud-scale data. And you are right to call out that it was an adventure to convince our VCs it was necessary. I think we've been very lucky so far to have investors who in some cases were technical, and so could sort of understand what the characteristics were and why it was necessary to invest in building something on our own. And in some cases they were convinced enough by these two crazy founders, who seemed like they had such a clear picture of something weird, that they were willing to give us a little bit of time and let us run. I will say, again, we were pragmatic. We knew just what we needed to build to get things off the ground, and we were probably able to onboard our first couple of customers before a year was out. So, you know, we were always figuring out the minimum necessary we could build to support just this fast-analysis, access-to-the-raw-data experience that, up until then and still today somewhat, people are cobbling together with logs over here and metrics over here, and hoping somehow that the humans will bridge the gap between the two.
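As a rough illustration of the trade-offs described here, the toy column store below keeps one column per event field, lets new fields appear at any time without a schema ticket, and returns both an aggregate and the raw rows behind it. This is a sketch under those assumptions, not Honeycomb's actual storage engine, and the field names are made up.

```python
# Illustration only: a toy column store over wide events. Each field becomes a
# column, new fields can be added at any time (older rows read back as None),
# and every aggregate can be traced back to the raw rows behind it.
class TinyColumnStore:
    def __init__(self):
        self.columns = {}   # field name -> list of values, one slot per row
        self.num_rows = 0

    def append(self, event):
        for field, value in event.items():
            if field not in self.columns:
                # Frictionless schema change: backfill the new column with None.
                self.columns[field] = [None] * self.num_rows
            self.columns[field].append(value)
        for values in self.columns.values():
            if len(values) == self.num_rows:   # field missing from this event
                values.append(None)
        self.num_rows += 1

    def query(self, where, group_by, value_field):
        """Median of value_field per group_by value, over rows matching `where`."""
        rows = [i for i in range(self.num_rows)
                if all(self.columns.get(f, [None] * self.num_rows)[i] == v
                       for f, v in where.items())]
        groups = {}
        for i in rows:
            groups.setdefault(self.columns[group_by][i], []).append(self.columns[value_field][i])
        summary = {key: sorted(vals)[len(vals) // 2] for key, vals in groups.items()}
        return summary, rows   # the chart, plus pointers to the raw events behind it

store = TinyColumnStore()
store.append({"platform": "ios", "country": "RU", "endpoint": "/analytics", "duration_ms": 950})
store.append({"platform": "android", "country": "US", "endpoint": "/analytics", "duration_ms": 40})
store.append({"platform": "ios", "country": "RU", "endpoint": "/analytics", "duration_ms": 1200,
              "app_version": "2.1"})   # a brand-new field, no ticket required
print(store.query({"platform": "ios", "country": "RU"}, "endpoint", "duration_ms"))
```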

[00:21:09] Chapter 6: Observability vs. Monitoring

Mirko Novakovic: When we come to this topic of observability versus monitoring, I will try to explain what that actually means for a metric. If you have a service, as you explained in your example, and it has an API with a response time: in a classical monitoring system you would have a metric that says response time of that service API, and you have a chart. In a system like yours you have the same chart, it looks the same, but because it's not a time series, not just this one data point, you can actually say: now show me that same graph, but only for the requests that come from Russia, for that customer ID, and it will dynamically take all the data and recalculate that graph. You will see the same graph, just for your query. Whereas in the classical monitoring approach you can't do that, because it's only a number, a time series, and this high cardinality doesn't exist, so you can't slice and dice it anymore. If you know upfront what you want, you could create different time series, but in most cases, because you have so many variables, you can't know upfront. And as far as I understood, that was kind of the invention of the observability term, together with high-cardinality data. You came out into the market, and then we can talk about what the competitors did with that term, but I think that was the real idea, right? It's really observing and understanding the internals of the system, using high-cardinality data that you can slice and dice.

Christine Yen: Yeah.

Mirko Novakovic: Am I right?

Christine Yen: That's exactly correct. We came from a world where we knew how important it was for us to break down by app ID, and how absurd it was that, at the time, modern monitoring tools said: no, you cannot break down like that, you cannot send this kind of data. You know, you'd look at the docs for, let's just say, open source monitoring solutions, and they would have a giant warning in a red box that said: watch out for high-cardinality fields, because you will blow up your cluster. And the reason, and again, Charity and I are pragmatic engineers, not folks who love to define buzzwords, the reason we realized we had to start talking about observability, or that we had to find a new word to talk about what we were building, was that we looked at the tools that people were used to using, and we understood that if we tried to say something like, we're a logging tool but really fast, or we're a monitoring tool but really flexible, we would forever be dragged down by the baggage and expectations that people had for this tool or that set of tools. I've tried the "we're like a monitoring tool, but we handle high cardinality, we're really flexible." I've had conversations with smart engineering leaders about this for like half an hour, and at the end of it still had them frown at me and go: but how do I take a metric and then find out what went into it? And, you know, one of the most humbling parts about starting Honeycomb was wanting to build Honeycomb because it was a cool technology that customers needed, and realizing how much of the work was in words and marketing and talking about our ideas.

Christine Yen: But this high-cardinality piece, you know, we've gone back and forth on how much to lean on that phrase. Because I think for people who have felt that pain, who've run into that wall, who've seen that red box in monitoring tool documentation, they know high cardinality is important but expensive, or important and impossible. And so it was useful to identify those folks early on, who were like: what do you mean I can get my graphs with high-cardinality data? But we find, I don't know, you know, when you spend a lot of time explaining technical concepts or obscure data terms to your customers, I think it takes you a little bit further away from the pain they're actually feeling. So we go back and forth.
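A small sketch of the slicing Mirko describes above: a pre-aggregated time series can only be read back as it was stored, while raw events can be re-aggregated on the fly for any ad-hoc filter. The event shape and field names here are illustrative assumptions, not any vendor's schema.

```python
# Illustration only: recompute a latency chart from raw events for any filter,
# which a single pre-aggregated time series cannot do.
from collections import defaultdict

def latency_series(events, bucket_seconds=60, **filters):
    """Average duration_ms per time bucket, recomputed from raw events for any filter."""
    buckets = defaultdict(list)
    for event in events:
        if all(event.get(field) == value for field, value in filters.items()):
            bucket = event["timestamp"] // bucket_seconds * bucket_seconds
            buckets[bucket].append(event["duration_ms"])
    return {t: sum(v) / len(v) for t, v in sorted(buckets.items())}

events = [
    {"timestamp": 10,  "duration_ms": 40,   "country": "US", "app_id": "app-1"},
    {"timestamp": 30,  "duration_ms": 1100, "country": "RU", "app_id": "dating-app"},
    {"timestamp": 90,  "duration_ms": 35,   "country": "US", "app_id": "app-2"},
    {"timestamp": 100, "duration_ms": 1300, "country": "RU", "app_id": "dating-app"},
]
print(latency_series(events))                                      # the chart everyone sees
print(latency_series(events, country="RU", app_id="dating-app"))   # the same chart, sliced
```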

Mirko Novakovic: But you did a great job, right? You, Charity, Liz, today also Austin Parker, they're kind of the voices of observability in this space. But what happened, looking at it a little bit from the outside, is that a lot of people saw: oh, this is actually a thing that's taking off, observability, and it's something new. It became kind of a synonym for modern observability and monitoring for these microservice, distributed apps. And then a lot of vendors just said observability is logs, metrics and traces, right? The three pillars of observability, three...

Christine Yen: Pillars.

[00:26:09] Chapter 7: Evolution and Adoption of the Observability Term

Mirko Novakovic: It became kind of another term for it. Right. Or another definition that was a little bit different. All of the vendors basically jumped on it and basically redefined it. Right?

Christine Yen: That's not wrong. And I have a lot of respect for the very clever go-to-market and product marketing teams that recognized that opportunity. Because in one sense, you know, a consolidation was inevitable. APM, logging, and monitoring shouldn't have been separate industry categories. They're all trying to solve the same problem. And as I mentioned, the technical constraints that required them to be separate approaches 30 years ago are no longer really the case. So consolidation was inevitable, and perhaps then it was inevitable that folks would glom on to a new term to describe the consolidation of the previous generation of approaches. But I think the door is cracked. I certainly see and hear in the market enough conversations between engineering leaders about how, hey, I've done the consolidation and I still don't feel like I'm getting value out of it, and I feel like I'm still trying to apply the old practices to this new world. And they're starting to question what they should be able to ask for out of their tools in this space. I think that cracking open of the mindset is exciting. I'm not here to fight battles around words and definitions. Charity can do that; she's much better at it and has much more patience. I would much rather stick my finger in that crack and say: cool, let's talk about how your current tools are frustrating. Let's talk about how the questions you're trying to answer are really hard to answer with your current tools. Let's talk about that. And down that road, I think you'll see that Honeycomb was built for that future.

Mirko Novakovic: That said, you recently established observability 2.0, right? There are a few white papers and things. Can you briefly explain what that means? What is, I would call it, the next iteration of observability? Why have you come up with this term, which is a term again, right?

Christine Yen: It is a term again.

Mirko Novakovic: Why? And what does it mean?

Christine Yen: About a year ago, Charity spent a lot of time talking to folks in the industry about what observability meant to them. And she was surprised to meet a lot of people who were like: I'm doing observability. I'm doing observability. Great. We have all the observability tools. Look at how much money I'm spending on observability. It must be a thing that we are doing. What she found was that a lot of them were doing observability in the three-pillars sense. They had consolidated, you know, put their logs, metrics and traces into one vendor, into one tool, had invested in making sure they had all the data flowing, and were still struggling with these unknown unknowns. They were still struggling to answer the questions they hadn't predicted ahead of time. They still had these huge silos in their engineering teams; their ops teams still couldn't speak to their developers, because they were looking at dramatically different graphs that had no connective tissue. And Charity was like: okay, at some point I can't keep arguing with everyone that they're not doing observability. That's just not a good use of energy. But their definition is different from where we're trying to go, where we've been trying to bring the industry all along. And you know what, what if the first version of observability really was just consolidation, because that was inevitable? We'll call that 1.0. And this world that we're trying to bring folks to, of not having to worry about how to correlate your metrics and logs because they are derived from the same core data, let's say that that is observability 2.0, because we want people to be able to put in the metadata that matters to them.

Christine Yen: We want people to feel confident that their signals are telling them the same thing. The whole point of defining observability was to create distance, right, between logging, monitoring, and what we were trying to do. Okay, if now the three pillars are called observability 1.0, the 2.0 distinction is again trying to create some distance. Not necessarily to say that one is better than the other, although I imagine you can guess my bias, but to allow people to get the perspective to see: how are these tools solving my problems today, and how do I want them to be solving my problems? Have my problems changed in the time since I committed to this first set of tools? Is it time to ask for more from my solutions? That's the intention behind observability 2.0. And again, I'll admit I was a skeptic at first. I was like, it feels buzzwordy again, and I don't know about this industry, everyone seems so jaded still about observability as a term. But I've been surprised, and I think I was wrong about the industry's appetite for words that help create this distance and clarity in what it means to have a tool that is built for this more complex, cloud native world.

[00:31:50] Chapter 8: Observability 2.0

Mirko Novakovic: But if I think about 1.0 being the consolidation of logs, metrics, and traces, and then you have 2.0: is there a path to upgrade? Because it kind of sounds like, if I have 1.0 and I want to get to 2.0, it's kind of an upgrade, right? Does it help customers if they have done 1.0 and consolidated, and now they want to go to 2.0? Do you see a clear kind of evolution path to 2.0?

Christine Yen: I think one of the very exciting things about OpenTelemetry becoming such an industry standard now is that, unlike before, today customers see themselves as owning their own data. And I think that's so cool, right? In the past it was like: oh, this APM vendor produced my data; oh, I use this other vendor's SDKs to produce my monitoring data. No, today customers own their data, and that's the way it should be. For me, when I think about the path from 1.0 to 2.0, it is going to be through your data. There is an evolution of your data. I'm not going to pretend and say, you know, there's a one-button solution somewhere, because customers know their data best. Engineering teams know their data best; they know what matters and what doesn't. But I think there are some pretty clear, fairly well-trodden paths that get folks to this more single-source-of-truth world. First, it is recognizing that logs kind of are the basic building block of all these things, right? A span and a structured log are effectively interchangeable. What are traces, if not just logs that have been tied together into a hierarchy with timestamps and durations, some standard fields, and parent IDs? You have this structure; you can turn it into a tree. What are metrics, if not these structured logs that have been aggregated into charts? So the first thing I recommend people do is structure their logs. Sometimes they're already there; a lot of folks aren't.

Christine Yen: Start with structuring your logs. No matter what tool you're using, you're going to get more out of that tool if the data is structured, and those tools can leverage that structure for performance, navigation, exploration, all those things. I think that, by the way, OTel semantic conventions are going to be really exciting here, where from the get-go engineering teams are going to find themselves encouraged to build naming conventions, because naming and caching, right, are the two hard problems. So, starting with structured logs, and this is assuming a world where they don't have traces but they're tracing-curious: start to take these logs and look at what I think of as units of work. What units of work are interesting? Logs tend to just be a paper trail of anything, anywhere, that your software does. Once you start thinking about them in terms of units of work, you can start to group them. When do you want to see clusters of logs together? Oh, probably if you're handling a given request or you're processing a certain job. Start doing that structuring; you can still access them as logs, you've just added these parent IDs. It's not like you can't still grep through them or tail them or something, but you start to have additional options for how to look at and make sense of your data. Probably along the way, engineering can start to recognize that they have some redundant data, so there's often some cleanup that can be done along the way, although it's not required.
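A brief sketch of that progression: a structured log line that carries a trace ID, a parent ID, and a duration is already effectively a span, and a group of them forms a trace you can render as a tree. The field names below loosely echo OpenTelemetry-style conventions but are assumptions for illustration, not a prescribed schema.

```python
# Illustration only: a structured log line that already behaves like a span.
import json, time, uuid

def emit(trace_id, parent_id, name, duration_ms, **fields):
    """Emit one unit of work as a structured log line; it can still be grepped or tailed."""
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id,        # ties related units of work together
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,      # turns the flat log stream into a tree
        "name": name,
        "duration_ms": duration_ms,
        **fields,                    # app.id, country, http.route: whatever nouns matter to you
    }
    print(json.dumps(record))
    return record["span_id"]

trace_id = uuid.uuid4().hex
root = emit(trace_id, None, "POST /analytics", 980, **{"app.id": "dating-app", "country": "RU"})
emit(trace_id, root, "cassandra.write", 910, **{"db.system": "cassandra"})
```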

Christine Yen: And the last part, I think, can involve the most turning your brain sideways, but can often be the most impactful. This is looking at your metrics, and looking at your logs plus traces, to see where, again, there's redundancy. Because what we see a lot out there, for folks who have invested in their logs and metrics, their 1.0 scheme, for a while, is that they have fallen in love with the speed of metrics, with dashboards. They want everything that is important to them up there. So despite metrics not being great for things like high-cardinality data, for, you know, per-customer metrics, they found ways to make it work. In some cases this is by paying a vendor an arm and a leg for custom metrics, just to have their high-cardinality world expressed in a tool that's not meant for it. Sometimes it is just a proliferation of dashboards and fields from the natural growth of engineering teams. And so the first step is going in and saying: what things are useful to us, but run counter to what metrics are built for? Metrics really are great for certain things; it will never make sense to have the outside temperature not be represented as a gauge. Metrics are good for some things. But latency per customer? Metrics aren't good for that. So identify those things, identify the field that's contributing to the cardinality in that metric, and make sure that field is represented in your existing logs and traces; chances are it already is.

Christine Yen: They've just never had a tool that was fast enough to derive metrics, in a way that felt useful, from their logs and traces. But that step of identifying redundancies, of moving things over where there aren't existing redundancies, of realizing how much more flexible it can be when the data that is important to you is just represented as fields on a structured log, shoved into a column store that can give you that fast analysis, that transformation becomes such an unlock for spend, because you're not spending an arm and a leg on custom metrics anymore. You just get to tack it onto the data you're already sending to a tool like Honeycomb. In our case, you can just add fields for free; we don't charge for columns in the column store, we don't care if you add more columns, it's fine. Along the way, we're talking to these customers about adding that language that connects ops and dev, that language that connects it to the business: the ability to understand not just what software or what systems are being impacted, but which customers and which parts of your business. That, we see, is all part of that 2.0 transformation. It's not just data and software, it's how your teams use data and how they connect it back to what ultimately matters, which is making sure you're serving your customers and your users.
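As a sketch of that last step, the snippet below derives a per-customer latency percentile at query time from fields assumed to already exist on the structured events, rather than paying for a high-cardinality custom metric. The event shape, the customer_id field, and the helper name are illustrative assumptions.

```python
# Illustration only: derive a high-cardinality "metric" (p95 latency per
# customer) on demand from structured events, instead of pre-aggregating it.
from collections import defaultdict

def p95_by(events, key_field, value_field="duration_ms"):
    """Approximate p95 of value_field for every value of an arbitrary field."""
    groups = defaultdict(list)
    for event in events:
        if key_field in event and value_field in event:
            groups[event[key_field]].append(event[value_field])
    result = {}
    for key, values in groups.items():
        values.sort()
        result[key] = values[min(len(values) - 1, int(0.95 * len(values)))]
    return result

events = [{"customer_id": f"cust-{i % 200}", "duration_ms": 20 + (i % 7) * 15}
          for i in range(5000)]
per_customer_p95 = p95_by(events, "customer_id")
print(len(per_customer_p95))   # one latency figure per customer, no custom-metric bill
```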

Mirko Novakovic: I can totally relate to that. I really appreciate your work for the industry and all the energy that you put into it, especially Charity's. Sometimes when I read the X threads, you can feel that she is really passionate about it, right? So thank you for that. And yeah, I hope we talk again soon.

[00:39:05] Chapter 9: Closing Remarks and Future Outlook

Christine Yen: Thank you so much for having me. I can't wait to see what you all build with Dash0. Congratulations on your recent launch, and I'm looking forward to continuing the conversation.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
