host

Mirko Novakovic

guest

Ariel Assaraf

Episode 7 | 41 mins | 8/29/2024

Own Your Stack: How Flexible Observability Empowers Engineers and Shrinks Downtime

About this Episode

Coralogix CEO Ariel Assaraf joins Dash0’s Mirko Novakovic to break down his team’s engineer ownership model, how Coralogix has grown to serve as both a product and a flexible platform, and the future of OpenTelemetry in the observability industry.

Transcription

[00:00:00] Chapter 1: Introduction to Code RED and Ariel Assaraf

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I'm co-founder and CEO of Dash0. And welcome to Code RED: code because we are talking about code, and RED stands for requests, errors and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Ariel Assaraf. Ariel is the co-founder and CEO of Coralogix. He has 15 years of experience in observability, and I'm excited to talk with him today about the future of our industry. Ariel, welcome to Code RED.

Ariel Assaraf: Thank you very much for having me, Mirko. Always good seeing you.

Mirko Novakovic: Yeah, absolutely. And by the way, before we start, I think we both met through a guy out of India called Saket, and he was actually your partner in India. And when I was at Instana, you actually recommended me to Saket. And we did really good business with him. So he's normally specialized, I think, in taking Israeli companies to India. But he also took us to India, and he's an awesome guy and it was an awesome partnership, and that's how we came together. And we also integrated Coralogix at the time.

Ariel Assaraf: Exactly. So Saket brought a ton of Israeli companies, but also other very good companies into India, and I think we ran together in quite a few customers. And then we met, if you remember, in San Francisco, that was probably 2017 or so, right at the beginning of Coralogix. And you guys were a little bit more scaled at that point.

Mirko Novakovic: Yeah, I know, I know. And then we integrated Coralogix at that time, when you were a logging tool only and we were an APM tool. And we will talk about how these categories merged together, and how today we talk about observability. But I always start with one question, and that is your Code RED moment. Do you have a special moment that led you to Coralogix?

[00:02:05] Chapter 2: A Code RED Moment in Observability

Ariel Assaraf: Well, I don't know if that moment led me to Coralogix, but one of my biggest Code RED moments was in my previous job. It was a company that does kind of civil intelligence, if you look at, you know, police intelligence, internal security intelligence, homeland security. And we had a big client in Europe that had an issue, the slightest issue, with how calls were recorded. And we wanted to understand that issue. So we, you know, obviously open up our logging infra, we up the debug level, we start looking at the packets that are going through UDP to understand why there are slight issues. And then suddenly we're starting to see a decrease in the sound and then an increase in CPU, and boom, the recording server crashes. And so obviously at that point in time, talking to Homeland Security, having no voice for the conversations that we were recording was terrible. There's still the call data, so you can still see locations, you can see who's talking to whom. And I call some people, like, guys, it's worse than what we thought. It's crashing. And we start investigating it. And after 3 or 4 hours, we called the team back home. We're like, guys, you got to come here. They jump on a plane, they come there by midnight.

Ariel Assaraf: I'm sitting in front of my computer with the head of the police behind my back, trying to understand what's happening, because currently there's a kidnapping. They know that someone was kidnapped, they see the locations, but they can't hear the conversation. At 3 a.m., we're, like, exhausted. We go to sleep. We come back in the morning at like 9 a.m. and we're like, oh, it didn't crash throughout the night, like six hours was quiet. Maybe because there's no phone calls. Maybe it's a load thing. So we up the load, we start monitoring. We're looking at it. Crashes again, crashes again, crashes again. Long story short, because it's a long story until we figured it out and we started rebalancing from one node to another: it was the log data that we switched on in debug, which printed every single packet into the log file as a log, that killed the CPU. So basically a slight issue turned into us deep diving, debugging everything, and that started to crash the server and made it into a huge crisis. So that was a good lesson for me to understand, first of all, how we scope issues and what risks we take to fix minor stuff.
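
The failure mode in this story, one debug log line per packet saturating the CPU, is commonly mitigated by sampling debug output instead of emitting all of it. The following is a minimal sketch using Python's standard `logging` module; the `SamplingFilter` and `ListHandler` classes, the logger name and the 1-in-N policy are illustrative assumptions, not what the team in the story used.

```python
import logging


class SamplingFilter(logging.Filter):
    """Pass every record above DEBUG, but only 1 in `every` DEBUG records."""

    def __init__(self, every: int = 1000) -> None:
        super().__init__()
        self.every = every
        self._seen = 0  # number of DEBUG records observed so far

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True
        self._seen += 1
        # Keep the 1st, (every+1)th, (2*every+1)th ... DEBUG record.
        return (self._seen - 1) % self.every == 0


class ListHandler(logging.Handler):
    """Collects messages in memory so the effect of sampling is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def simulate_packets(count: int, every: int) -> list[str]:
    """Log one DEBUG line per simulated packet and return what survived sampling."""
    logger = logging.getLogger(f"recorder.sample.{every}")  # hypothetical name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = ListHandler()
    handler.addFilter(SamplingFilter(every))
    logger.handlers = [handler]
    for seq in range(count):
        logger.debug("packet seq=%d", seq)
    logger.warning("finished %d packets", count)
    return handler.messages
```

Because the filter sits on the handler, dropped records skip formatting and I/O, which is where per-packet logging usually spends its time. Warnings and errors are never sampled away.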

Mirko Novakovic: Yeah. That's interesting. Right? It's like Heisenberg. The deeper you look into it, the less you see because it crashes. Yeah. No, that's interesting story, especially having the chief of police behind you and kidnapping going on.

Ariel Assaraf: That's stressful. Exactly.

[00:05:01] Chapter 3: Evolution of Coralogix

Mirko Novakovic: That's stressful. Yeah. So then you started a company that actually deals with these logs, right? It's a log management and analytics solution. I think at the beginning it was based on Elastic. Right?

Ariel Assaraf: Yeah. It was a whole journey we went through. It was initially log analytics, or initially anomaly detection on top of existing tools. And then we said, okay, we got to do the full log analytics journey. And we based that off ELK. We had a ton of issues there with, you know, mapping and performance and cost, obviously. And with time, we started by rebuilding the query engine into what we call today querying from remote. So, you know, querying the customer's data from his own storage, that was a big benefit that we brought. And then what do we do with the high-concurrency queries? You know, alerts and stuff that can't run on that remote storage. So we started developing in-stream services. You know, in-stream parsing, enrichment, generation of metrics, but especially the stateful stuff in-stream, like alerting, anomaly detection, data clustering. And then we got to this model where we analyze everything in-stream and query from remote. And then of course the convergence of everything. Okay. So now we got to do metrics. We got to do tracing. We got to do APM. We got to do databases, real user monitoring. And we found ourselves with the full stack observability that we're running today.

Mirko Novakovic: So streaming means, just for me to understand, you keep the data inside of the customer, but you stream the data through your platform to understand if there is an anomaly or a problem, etc., but you don't keep the data.

Ariel Assaraf: It goes through the Coralogix Kafka cluster. That only holds the state that it needs for long-term analysis, because some alerts, you know, they require a longer time. So the customer sends the data, it's all analyzed in the stream, and then it's written back to his own storage in a format that allows us to query it. This is also an open format today. We're basically building dynamic Parquet with a lot of small index files, and then we run queries on it with our query engine.
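
The "analyze in-stream, keep only the state an alert needs, then write to the customer's storage" model can be sketched in a few lines. This is an illustrative Python sketch only, not Coralogix's implementation: the `ErrorRateAlert` rule, the event fields `ts` and `level`, and the in-memory `storage` list (standing in for customer-owned object storage) are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateAlert:
    """Stateful in-stream rule: fire when `threshold` errors fall within `window_s` seconds."""

    window_s: float
    threshold: int
    _error_times: deque = field(default_factory=deque)

    def observe(self, event: dict) -> bool:
        ts = event["ts"]
        if event["level"] == "error":
            self._error_times.append(ts)
        # Drop timestamps that aged out of the window; this is the only state kept.
        while self._error_times and self._error_times[0] <= ts - self.window_s:
            self._error_times.popleft()
        return len(self._error_times) >= self.threshold


def process_stream(events, alert: ErrorRateAlert, storage: list) -> list:
    """Evaluate each event in-stream, then hand it to storage for later querying."""
    fired_at = []
    for event in events:
        if alert.observe(event):
            fired_at.append(event["ts"])
        storage.append(event)  # stand-in for writing to customer-owned storage
    return fired_at
```

The point of the design is that the stream processor never needs the full history: the alert carries a bounded window of timestamps, and everything else lives in the remote store.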

Mirko Novakovic: That's nice. What kind of query language do you have? Is it a pipe one? What is your philosophy about querying?

Ariel Assaraf: We actually found ourselves supporting multiple different ones. So we started by mimicking Lucene. Right? Because we came from Elastic, people knew that, and we didn't want to create too much of a breaking change. So we support Lucene. We support SQL. We support PromQL, because a lot of people wanted Prometheus syntax. And then we created our own syntax called DataPrime, which is that SQL kind of a, you know, pipeline search. And that way we give our customers both schema on read and schema on write, both an easy type of querying, like Lucene, which is very intuitive, and also a lot more complex querying, you know, kind of the Splunk-like query language with joins and lookup tables and extractions and so on.

Mirko Novakovic: Yeah, that's pretty cool. It's always a trade-off, right? Of being easy or being more powerful and how to do the querying. We had a really long discussion at Instana, but also here at Dash0: what do we want to do with querying? Right. Especially if you have different telemetry types. Right. How do you query logs compared to spans compared to metrics? Is that different languages or is it only one language? Do you want to combine them? That's a very interesting discussion. You see very different approaches to that.

[00:08:25] Chapter 4: Balancing Platform Flexibility and User Requirements

Ariel Assaraf: Exactly that. Now add to that complexity the customer's legacy. So you come into a customer. A lot of customers use Grafana today with Prometheus or whatever that is in the back. So they're used to PromQL, and a lot of customers are used to the Elastic syntax. A lot of customers want to keep that. They don't want breaking changes. So on one hand we offer DataPrime on all data types. So, you know, we're now going into what we call Explore V2, kind of the new version of the exploration screen, where a single query will run on all data types: logs, metrics, tracing, security. On the other hand, we already know that a lot of customers are going to say, that's great. You know, we want to get there someday, but we need you guys to support PromQL, because I have a thousand dashboards and 5000 alerts, and my users are not going to learn a new syntax. And when I have a production issue, I want to have people, you know, being able to explore immediately. So that's a big trade-off, I'd say even beyond simple and advanced: you know, simple but limited versus advanced but complicated.

Ariel Assaraf: I think a lot of the questions that we have on the product org is, is it more of a platform or is it more of a product, like, do I dictate how you're using it? Or do I give you a really wide set of capabilities and say, you know, build whatever you need. I think the two in our world, the two extreme examples to that would be kind of Grafana that says, you know, build whatever you want. This is a canvas and create your use case or Datadog that will very much dictate how you're using the product, how data is being displayed, how data is being collected. But then, on the other hand, if you want to do custom stuff, custom metrics, high cardinality stuff, then the cost and performance start to become an issue. So where's the balance between both? And should you balance or should you just go for one and go all the way? That's one of the biggest questions that we have here in the product org.

Mirko Novakovic: Well, that's a very interesting discussion. We had exactly the same discussion. It is also because you have different types of users, right. If you look at a core developer, not all, but a lot of developers just like to fiddle around, right. They are okay with, hey, give me the API, give me the code, give me the lib and I will figure it out. But then you will have a different type of user that just wants to use the product in a very simple way. They don't want to learn a language like PromQL. They don't want to understand all the details. They want a lot of automation, predefined dashboard views, and the tool being opinionated. Right. So this balance between being highly opinionated about how you should use the tool and giving all that flexibility is really hard to build in a product, right?

[00:11:21] Chapter 5: Addressing Customer Experiences across Market Segments

Ariel Assaraf: It's very hard to strike right. And then even if you did find a path, it's very hard to understand who's the customer you're working with, who you are standing in front of when you're trying to sell, and where you should put the emphasis. So today we try to bucket it by what we replace. But then a lot of the times I'll get to a customer that has something that's very opinionated, and I try to replace it by showing, hey, you got all these presets and stuff that are very opinionated on my platform too, so it's going to be easy. But then he says, no, the reason I'm switching is I need more flexibility. Or the other way around. A lot of customers will come and say, I can do everything. So, you know, some customers, I show them kind of what the customer journey in Coralogix is. How do you go from telemetry to kind of business-level insights, and how do you mature yourself and your organization together with the product? And many of them tell me, listen, this looks great. I just don't know how I'm going to get there. Like, for me this looks so far away. I remember one time I did this demo to this large corporate, like super massive corporate, to the CIO. And I'm running the full demo. And again, you know from Instana what full stack is. It's like, how do I go from the service catalog to the calls to the DB, to the infrastructure? And then I go to the logs and I define an alert and it'll trigger to my PagerDuty.

Ariel Assaraf: And there's an automation and I show him everything. And he's like, this is great. Now let me tell you what you did. And he told me, we're like a village somewhere on a desert island, and you just landed a spaceship in the middle of my island. And I have no idea what to do with it. It looks great, but I'm gonna need something simpler. And then, on the other hand, you go to people and they ask you questions about flexibility that you've never imagined. Like, how do I create an automatic anomaly alert on the user journey in my system throughout the different components, multi-cloud? And you're like, well, you know, that requires a great amount of flexibility. And, you know, the struggle is where do we invest in order to find that golden path? And then how do we sell? And then how do we even market ourselves? Like, what do I put on a website? So if you go to the site today, you'll see: your data, you know, any scale, any use case, something like that. Sort of do whatever you want. But then you go to the actual customer and sometimes they'll say, I looked at your website and didn't understand anything. So where's the database monitoring? Right. And I'm assuming you ran into the same issue.

Mirko Novakovic: It's the same. And I have to say, I mean, we at Dash0 are now a very small startup. We are just beginning. We haven't released a product yet, and the advantage is you can be very opinionated, right, about your ICP. You can still choose because you're not that big in terms of ARR, and that means you can still pick the customers pretty well. I think I read it somewhere, or you said it somewhere, that you are getting to $100 million in ARR approximately. So somewhere around that. So you are a pretty decent size, one of the bigger observability platforms now. Congrats, by the way. That's awesome. And I really know how hard it is to get there. So that's not easy, but that also means you have to deal with very different types of customers, right? Large enterprises, smaller startups with very different requirements. So you actually have to build a platform which is not black and white. It should be both. Right. Both worlds, and somehow balance that. That is kind of a different challenge from the one we have at the moment, where you can still be very opinionated and very focused, and then over time, everyone will grow into that problem. Of course, we had that at Instana for sure.

[00:15:17] Chapter 6: Convergence of Observability and Security

Ariel Assaraf: Yeah. And now, you know, we talked about how it gets beyond observability even. Right. Like the use cases that you need to cater to start to get cross-department within the organization. I assume, you know, you guys are familiar with that.

Mirko Novakovic: Absolutely. And I mean, we began by saying we met in 2017. At that time we called ourselves APM for microservices. And you were a logging solution, right? So that was actually the time where things were merging together, right? That's why we built the integration. Because as a startup, you couldn't build everything at once. So you kind of saw, okay, customers are asking me for logs, metrics, traces, APM, user monitoring. So I have to partner with others to build that platform that the Splunks and Datadogs of the world already had. And so that's how we started. But then I think that was a trend from probably 2017 to the beginning of 2021, when all the vendors basically covered the three pillars, right? At the time, logs, metrics and traces, everyone had them, right. And I think you also did that, right. You implemented tracing and metrics and everything else. So, a full observability platform.

Ariel Assaraf: And then into security. Right. So a lot of companies, if you look at our space today, actually, almost without us noticing: Datadog calls itself monitoring and security. Splunk actually primarily makes its revenue from security, but a lot also from observability. And Sumo does a ton of security. Dynatrace got into security. Elastic is massively growing in security. Suddenly those observability companies have tens of percents of their revenue coming from security. We looked at it in Coralogix and we said, well, with logs, you know, at that point in time being our only pillar, the easiest path to expand is into security. And actually SIEM and SIEM use cases were the first ones that we sold outside of log analytics. So even before APM, metrics, you know, database monitoring, we started doing SIEM. And then what happened in the past year, even more so, is that a lot of CISOs are saying, I need to have the operational data. Like that concept that we say, every log is a security log. Okay. And so for them, whether it's application, edge, you know, CloudFront, WAF and so on, they got to have everything. And so it is necessary for them that the observability vendor will also be their SIEM vendor, which makes this battle even more complicated. And it also explains, by the way, in my view, why Splunk, even under the Cisco umbrella, continues to invest in observability, because they have to have that portion of the business to sell the security side with it.

Mirko Novakovic: Yeah. That's interesting. Actually, I mean, it was interesting at the last KubeCon. I visited the Wiz booth. Right. The company that had this offer for 23 billion, an Israeli company, pretty young, I think founded in 2020, so it's four years old, grew massively. I read 600 million in four years. And so I was curious, right. Because I'm not a security expert and I wanted to see the product. And one thing that struck me is that one of the core principles of the product, as far as I understood, is that they built this graph, mapping the different components and understanding the dependencies, which is a core observability thing. Right. And then what they basically do is they say, if they have a threat somewhere or a problem in the database, they can actually check who is affected, right? Because now they have the dependency tree. They can say, oh, this database has a problem, but it's actually not connected to the internet, so it's not as problematic as a different thing. Right. But as you said, I was directly thinking, wow, we had this dynamic graph thing at Instana. It was actually very much the same. And all the observability vendors do that, right. You use tracing and logs and try to map out the infrastructure in a tree or a directed graph to understand dependencies, to understand root cause and impact and these things. So I totally agree. I directly thought, oh, that's very connected, right?
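
The dependency-graph reasoning described here (who is affected by a problem in a component, and is that component reachable from the internet) comes down to reachability on a directed graph. Below is a minimal Python sketch under stated assumptions: the `CALLS` graph, its component names and the special `internet` node are invented for illustration and do not reflect how any vendor models this.

```python
from collections import deque


def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search: every node reachable from `start`, excluding `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph: dict[str, list[str]]) -> dict[str, list[str]]:
    """Flip every edge so that callers can be found from a callee."""
    rev: dict[str, list[str]] = {}
    for src, dsts in graph.items():
        rev.setdefault(src, [])
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


# Hypothetical "caller -> callees" edges, e.g. as derived from traces.
CALLS = {
    "internet": ["web"],
    "web": ["orders"],
    "orders": ["orders-db"],
    "batch": ["reports-db"],
}


def assess(graph: dict[str, list[str]], component: str) -> dict:
    """Report which components depend on `component` and whether it is internet-exposed."""
    upstream = reachable(reverse(graph), component)
    return {
        "exposed": "internet" in upstream,
        "affected": sorted(upstream - {"internet"}),
    }
```

With this graph, a problem in `orders-db` is flagged as exposed because a path from `internet` leads to it, while the same problem in `reports-db` only affects the `batch` job.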

[00:19:39] Chapter 7: Ownership and Service-Driven Observability Models

Ariel Assaraf: Because at the end of the day, in the world of cloud, a lot of the security stuff actually fell on the engineers and the DevOps teams. They're the ones that are running the cloud. You can have the best CISO and threat hunting team in the world. If a DevOps engineer leaves a bucket open to the internet and you don't have, you know, the right way to not just see it (by the way, let's say you have Wiz and you saw that, or Orca, or one of your security engineers saw that), the person who's going to have to go to that bucket and close it or delete it or change permissions is again that DevOps engineer. It's not going to be your team. And so security teams are understanding that they can't say no to everything anymore with the cloud. You know, the initial answer was just no, you cannot send data anywhere, you cannot operate anything like that. Now they understand they cannot slow the organization down. So they have to cooperate with the engineering team. And then the tools need to start speaking that same language. It's going to be very interesting to see what's going to happen to the more traditional security companies if they don't go that way. They may find that within ten years their buyers have retired and they're going to have to face new buyers. Those VPs of IT and those CISOs, they've retired, they made, you know, a good career. But now decision-making power has moved towards the platform. We see that in a lot of smaller organizations: CISOs are there, but they don't have a big organization. They dictate policies. They define the tools that need to be implemented. But then it all comes down as R&D tasks. And so that convergence, I think, is inevitable.

Mirko Novakovic: Yeah, absolutely. And I think you mentioned two things that I found interesting. One is you said moving to platform. Let's talk about that a little bit later. But first you were talking about ownership, right. And I read this blog post that you have written about ownership models and how basically things are shifting left, right. And the way you described it in the blog is, okay, the developers every day get a new dashboard. Right. First they say, hey, here are your RED metrics, right? Watch them and see that everything is running well. And then they get the next dashboard for infrastructure. They get the next for security. So more and more responsibilities, and kind of disconnected, right. That's how I understood it. So you suggested a service-driven view on the model, which I totally like. Right. Because I can totally see it. Can you explain a little bit how you think about a service? And from my point of view, I'm a UX guy, you have the nicest dashboards in the observability scene. I would say they really look nice, with the little maps on them, and I just like them.

Ariel Assaraf: First of all, thank you. I think our CTO, Yoni, as you know, kind of formed this concept around ownership, which I think is how we operate internally. The concept is there's hundreds of microservices, and then those microservices form products, and those products have features. So here in Coralogix, the way we map our internal organization, the engineering organization is broken down (CX stands for Coralogix) into CX products and then CX features and then CX services. And then there are teams that are mapped to services that are mapped to features that are mapped to products. Because there are multiple products, multiple features, multiple services. And then I as an engineer understand what services I own, what features they impact and what product they relate to. When I open Coralogix, all I need to do is go to the service catalog and in one place see everything that's related to my service health. The first thing that I'd look at would be my SLOs. So SLOs will be driven by the SLAs that the organization at the business level decides, whether internal or external. And then I define SLOs for my services together with my manager. The first thing is error budget, latency and so on. I make sure that I'm in the green, and then I want to see what incidents are open for my service.
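
The team, service, feature and product mapping described above is essentially a small catalog that can be queried in both directions. Here is a minimal Python sketch of such a catalog; every service, feature, product and team name in it is hypothetical, and the structure is an assumption made for illustration rather than Coralogix's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    team: str                   # the team that owns the service
    features: tuple[str, ...]   # features this service contributes to


# Hypothetical catalog entries.
FEATURE_TO_PRODUCT = {
    "alerting": "logs",
    "archive-query": "logs",
    "span-search": "tracing",
}

SERVICES = [
    Service("alert-evaluator", "team-streams", ("alerting",)),
    Service("query-engine", "team-query", ("archive-query", "span-search")),
    Service("ingest-gateway", "team-streams", ("alerting", "archive-query", "span-search")),
]


def owned_by(team: str) -> list[Service]:
    """Engineer's view: which services does my team own?"""
    return [s for s in SERVICES if s.team == team]


def products_impacted(service: Service) -> list[str]:
    """Which products does a service touch, via the features it serves?"""
    return sorted({FEATURE_TO_PRODUCT[f] for f in service.features})


def teams_for_product(product: str) -> list[str]:
    """Product-level view: which teams do I talk to when this product is unhealthy?"""
    return sorted({s.team for s in SERVICES if product in products_impacted(s)})
```

The same mapping answers the engineer's question ("what do I own and what does it impact?") and the product owner's question ("who do I call?").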

Ariel Assaraf: So today in Coralogix we relate incidents to a service. So I can see if there's an open, acknowledged, closed or ongoing incident for my service. Now I get the RED metrics, etc. Now I can drill in, understand, you know, upstream and downstream effects. But also, everyone is talking about full stack observability, single pane of glass. We used to actually laugh at it and call it a single glass of pain, where, you know, you take someone and say, hey, you got a ton of logs there, here are some more metrics, add APM and real user monitoring, you know, single pane of glass. But then the idea around single pane of glass is to have everything that I need, in the context that I need, in front of me. So now I see that I'm breaching an SLO. I get into the service, I see my RED metrics. Obviously one of them is off because of that specific SLO, or not. Then I see the operations. I see how I affect upstream and downstream. I click a button, I get the logs from my service. I click a button, I get the host metrics, so I can see the infrastructure for my service. One thing that we're going to add is cost.

Ariel Assaraf: How much is my service costing? Now I can add dimensions on my own. Those dimensions could be per customer, per region, per account. And then I as a person gain ownership. I cannot gain ownership if you don't show me what I own. I gain ownership, and then the manager can come to me, or, you know, the team leader, and say, hey, you're responsible for those two services. They're, you know, 50% of our incidents, they're breaching their SLOs, and they cost us more than any other service in production. And I can say, okay, I see all that data and I can own that part. And exactly like you're saying, if you have just another dashboard and another dashboard, those don't add up to ownership. They add up to kind of issues where you're chasing them with your hammer, but you don't have a holistic view. But then one step upwards, the product level can look at all the services that build the product. So kind of a higher level. And now, what is the health of my product? Am I breaching any SLO for them? Okay, which one? And now I know what people I'm talking to, because it's mapped to those, what we call, teams.
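
The SLO and error budget check that anchors this ownership view is simple arithmetic. The sketch below uses the common request-based definition (allowed failures = (1 - target) x total requests); the function names and the sample numbers are assumptions for illustration and are not taken from the conversation.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent; a negative value means the SLO is breached.

    slo_target is the success-rate objective, e.g. 0.999 for "three nines".
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        # No traffic yet, or a 100% target: any failure at all is a breach.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def is_breaching(slo_target: float, total: int, failed: int) -> bool:
    """True once more failures occurred than the budget allows."""
    return error_budget_remaining(slo_target, total, failed) < 0
```

For example, with a 99.9% target and 100,000 requests the budget is about 100 failed requests, so 25 failures leaves roughly three quarters of the budget, while 150 failures is a breach.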

[00:26:06] Chapter 8: Customer Success as a Key Differentiator

Mirko Novakovic: Yeah, it makes total sense. I love the concept, to be honest. I can tell you how we think about it. We change it a little bit at Dash0. But the first question I have is how do you define a service and how do we make sure that the data gets that context information?

Ariel Assaraf: A few things that we found. One, you know, discovering the services with the tracing part, that's relatively easy. But now how do we really define them? And this is work that we do with the customer. By the way, a lot of customers expect kind of an everything-out-of-the-box experience. But I'm telling them, you guys have to define what are the products, what are the features, what are the services, and what people are responsible for what service. There's no way for me to do that. And when they talk about, oh, but you got to have it out of the box, I always give this example that customers, you know, they laugh at, but then they kind of reflect and say, you know what, it makes sense. I tell them, these kind of ready-made dashboards and ready-made alerts are nice, but it's like you go to buy a picture frame at the store and you buy that frame and it has a picture of another family that's not your family. This is a great idea of what you can do with that picture, but you'd be dumb to just take it as is and put it on your counter in your home and say, oh, look at this beautiful picture, I got it, it's out of the box with a picture in it. This picture frame, it's not out of the box. It's something that measures KPIs that are not your business. That kind of gives you an idea of what you can do and helps you, you know, imagine where you can get.

Ariel Assaraf: But at the end of the day, it's on the customer to do that mapping. I can help them with the discovery of the services and the service catalog, and the discovery of databases and the DB catalog, which is something you guys did at Instana years ago. Now it is about mapping them to the responsibilities. But then when it comes to correlations, we found out that it's going to be very difficult to create the correlations only at the agent level. Some customers collect data that is not coming from their application. So I want to create a correlation between the CDN data or a managed service, where I get the logs, or I get an API response from someone else, but I still want to correlate it with the service. Right? I own a payment service and I get an API response from a payment gateway. I want to see this in the service that I own. And so we created correlations that happen in the user interface. And you can correlate, like, there are defaults that will use the trace ID, you know, span ID, and then OTel will just bring you everything. But we also allow the user to click a button, say create a correlation, and say this log key maps to this metric key, maps to this span key. And then on query we just do the correlation. So they can define what ownership is for them. And that's been greatly successful.
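
Query-time correlation with default keys, a fallback order and user-defined key mappings can be sketched as follows. This is a hedged Python illustration, not Coralogix's implementation: the priority list, the `aliases` mapping (a record's own field name to a canonical key) and all field names such as `x_gateway_trace` are assumptions.

```python
# Keys are tried in this order; the first key both records carry decides the match.
PRIORITY = ["trace_id", "span_id", "service_name"]


def normalize(record: dict, aliases: dict[str, str]) -> dict:
    """Rename user-mapped fields to canonical keys (e.g. a CDN log's own trace field)."""
    return {aliases.get(key, key): value for key, value in record.items()}


def matches(anchor: dict, candidate: dict, aliases: dict[str, str]) -> bool:
    """Correlate two records, falling back from trace ID to span ID to service name."""
    a = normalize(anchor, aliases)
    c = normalize(candidate, aliases)
    for key in PRIORITY:
        if a.get(key) and c.get(key):
            return a[key] == c[key]
    return False  # no shared key, so nothing to correlate on


def correlate(anchor: dict, candidates: list[dict], aliases: dict[str, str]) -> list[dict]:
    """Return every candidate record that belongs with the anchor record."""
    return [c for c in candidates if matches(anchor, c, aliases)]
```

A payment gateway log that only carries its own trace field still lands on the owning service once that field is aliased to `trace_id`, and a host metric with no trace context falls back to the service name.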

Mirko Novakovic: It is very interesting because I had this concept in Instana called application perspectives, which is essentially the same. You basically mapped how you want to have the signals, the traces, the logs, etc. correlated together. Right? It's basically defining a few queries which then defined on the fly. Right. How your component looks like. Right.

Ariel Assaraf: Yeah. Well even a fallback if it doesn't match this check this if it doesn't match this.

Mirko Novakovic: Exactly. Yeah. But what we figured out at the time, and we had to do a lot of work on it, is that it is kind of hard for people to understand the magic that you can do by putting something like, I would say, a dynamic view on your data, dynamic because it's not physically there, right, essentially. And we figured out that some people really get it right away, but it needs some explanation. It needs some help to understand that concept, to understand how powerful it is. Right. Because you can also use very custom data. We had an e-commerce shop, they were selling stuff for the NBA and MLB, and they could basically use that attribute and split the application into different applications so that they have all the data separately. Because the seasonality is different. You will mostly sell stuff in the finals, or more stuff. And so to see that, they could just create these dynamic views on it and say, oh, now I have an NBA store and I can monitor that and learn from that and learn what's normal behavior. And it was powerful. But I always had the feeling that we never really managed to explain how powerful it actually is.

Ariel Assaraf: It's extremely strong. And I'll tell you something. In general, we've learned that in observability, one of the greatest things that your customer can get from you as a vendor is your aggregate experience with other customers. And we created a customer success organization that is fanatic about response time, under one minute human response time over chat, and then dedicated TAMs for any account that's over $25,000. And so they have joint Slack channels, there are QBRs, there are customer events. And again, you know, it comes back to the same thing, opinionated versus platform. Okay. And then do we create everything out of the box or do we customize things to your business? Now you look at Instana. Instana was extremely flexible and had a lot of power. So at the enterprise, it was successful. But then you understand that you have to sit with the customer and explain, here's how you do this, here's how you do that, and this is what's going to fit your organization. The easiest path to go, and we started doing that by default, by the way, at least for the small accounts, is to just go by service name and correlate: try trace ID, span ID, and if you don't have them, go by service name. But then for the bigger accounts you really want to sit with them, understand, like you said about those dimensions, what is this correlation for you, and set it up. But in general, throughout the organization, concentrating on meeting the customer every other week and actually trying to uncover use cases and come and add value is something that we found to increase engagement. And the customers, a lot of the times, will say, the greatest feature that I'm getting is the fact that I have someone teaching me how to do observability.

Mirko Novakovic: It's a very strong point, I think. I can totally understand. We also had dedicated TAMs and a customer success team at Instana. We don't have that at Dash0 yet. But what I also found is that you could see the quality of those people, right? If you had really good ones, because you need people who really understand technology. Right. And observability, because it's a complex topic these days. You have Kubernetes clusters. Yeah. It's not easy anymore. It's not like 20 years ago when you had one application server and two different databases. Right? It's complicated. And so you need really good, talented customer success people to make that actually happen. What you said, it's almost like a consulting team, right? Really great consultants who can help.

Ariel Assaraf: All are DevOps engineers with years of experience. That's the only thing that we found to help. We tried to take normal customer success engineers and teach them observability. But then it was, hey, what about cloud infra? What about Kubernetes? What about serverless? It never ends.

Mirko Novakovic: No, exactly. That's what we saw. We actually also had people from customers who were using the product. They joined us and they had this experience from the customer side. And they were DevOps people or SREs, and they used the product and they could really teach the customers the best practices and these things. Yeah. That's a really good secret point, right? How do you make a complex tool really successful? By having a very dedicated, highly trained and performant customer success organization and tools. Yeah.

Ariel Assaraf: Exactly.

[00:34:03] Chapter 9: The Role and Future of OpenTelemetry

Mirko Novakovic: And then on OpenTelemetry, you named it, right? How are you thinking about OpenTelemetry? How has it impacted your tools? Is it just one different data source for you, or do you think it's something more? Are people using it more and more, or how do you see it and where do you see it in five years?

Ariel Assaraf: I'll tell you what I think, and then I'll move the question back to you on that. I'm interested to learn. I think we started by treating it like just another integration, right? Because we had like 200 different integrations, with Logstash and Prometheus and StatsD, and we said, okay, there's another one. But then, as OpenTelemetry developed as a community a few years back, we understood that this is actually going to close the biggest gap that we have as kind of latecomers to the game, when it comes to APM especially, but in general to observability, versus the giants, versus Splunk, Datadog, everyone. Why? One of the biggest advantages these guys have is the ability to deploy an agent, manage that agent, instrument themselves, send data, integrate with a thousand different technologies, and customers are just like, you know, they see the catalog, they add one, two, three, four. Boom. Data is flowing. It's correlated. Everything's good. Suddenly OpenTelemetry is starting to close that gap, because as a community it develops. Now, you know, they started with tracing, then metrics, then logs. Now they're looking at profiling, which is a huge load of development if we wanted to go into profiling ourselves. Suddenly OpenTelemetry does that for us. But then we found some gaps in OpenTelemetry. How do you map out full flows, not just one trace, right? Because sometimes there'll be a trace and then a call to a database, and then it goes back to Kafka and then another.

Ariel Assaraf: How do you map out a trace of traces? So we needed to make a slight change. We wanted to go with kind of vanilla, but then we needed to make a slight change. And then for DB monitoring we needed another slight change. And then a lot of customers wanted fleet management, which we're adding now. Like, I want to control all of my configurations remotely. I want to change configurations. I want to, you know, understand the throughput and errors that are happening. So we started adding that. So, you know, gradually we found ourselves with kind of an agent that's based on OpenTelemetry. You can work with the standard vanilla OpenTelemetry. But if you really want to get the power of Coralogix, you can use that. And as we've invested in it, we're really making it our primary tool. And we see a lot of customers actually coming to us where it's not like they're saying, we want to move to another vendor, oh, we need to use OpenTelemetry. It's actually the other way around. They want to move to OpenTelemetry as a mission that the R&D has. They want to decouple and not have that vendor lock-in they used to have, and not have, you know, stranger agents running in their production, and then they're testing out vendors that contribute to OpenTelemetry to see which one they're going to move to.

Mirko Novakovic: Yeah, I think that's one of the biggest things. Right. Put power in the hands of the customer, essentially. Right. Which some vendors will love, like you. And some others, like the big ones who want you to just use their technology, will not like it too much, I guess. But I think it's done, right? I think the world will change towards OpenTelemetry on the agent side. And I also agree with you that probably you will always need some sort of a distribution. It's not an out-of-the-box thing that you just use. It's more a standard, right? And a few components that you can configure. And maybe you add things that you need. I like that approach, and I think OpenTelemetry is strong enough to deal with that and to support this. What I really like about it, and that's coming back to what we discussed with ownership and service views, is the concept that they have with regards to the semantic conventions. So basically standardizing the metadata and things like service. Right. And they call it resources, where service is one perspective on it.

Mirko Novakovic: But if you're more like a platform guy and you're operating your Kubernetes cluster, then you have a different perspective, right? Then you maybe don't care too much about each application service. You are more interested in how is my cluster operating? Do I have an issue in an AWS zone in the US? What are my costs? How is it utilized? Right? And then you have a different perspective. You want to have a Kubernetes-focused perspective on it, or a cloud-focused perspective on it, or something very different, like a Vercel view on your infrastructure or a Lambda version of a view. So I think it's not only your service. A service is really cool for product, for developers, for programming and having responsibility. But I think there are different views with the same idea, right? That you are responsible for something that you own, right. And I think that's what I like about the resource concept, that it's kind of flexible, and it gives you the power to aggregate and give context in a way that the user actually wants it.
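
Editor's note: one way to picture the "different perspectives on the same resource" point is to aggregate the same telemetry by different resource attributes. The sketch below uses attribute keys from the OpenTelemetry semantic conventions (`service.name`, `k8s.cluster.name`, `cloud.region`), but the records, the values, and the `error_counts` helper are made up for illustration and do not use any OpenTelemetry SDK.

```python
from collections import defaultdict

# Hypothetical telemetry records, each carrying a resource attribute map.
RECORDS = [
    {"resource": {"service.name": "checkout", "k8s.cluster.name": "prod-a",
                  "cloud.region": "us-east-1"}, "error": True},
    {"resource": {"service.name": "checkout", "k8s.cluster.name": "prod-b",
                  "cloud.region": "eu-west-1"}, "error": False},
    {"resource": {"service.name": "cart", "k8s.cluster.name": "prod-a",
                  "cloud.region": "us-east-1"}, "error": True},
    {"resource": {"service.name": "cart", "k8s.cluster.name": "prod-a",
                  "cloud.region": "us-east-1"}, "error": False},
]


def error_counts(records, attribute):
    """Count errors per value of one resource attribute.

    Changing `attribute` changes the perspective; the data stays the same.
    """
    counts = defaultdict(int)
    for record in records:
        key = record["resource"].get(attribute, "unknown")
        counts[key] += int(record["error"])
    return dict(counts)


if __name__ == "__main__":
    # Service owner's view, platform team's view, cloud team's view.
    for attribute in ("service.name", "k8s.cluster.name", "cloud.region"):
        print(attribute, error_counts(RECORDS, attribute))
```

The service team, the platform team, and the cloud team each get their own breakdown without any change to instrumentation, which is the flexibility being described.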

[00:39:18] Chapter 10: Final Thoughts and Closing Remarks

Ariel Assaraf: That's brilliant, by the way. Of course. Like, building it the way that it's built allows that structure, first of all, to standardize the entire world, and then secondly, to enable you to use that as a building block to create the view that you want. So, for instance, exactly what you said about that infra engineer: we use OpenTelemetry to build a service catalog, but we also use it now to build the infrastructure catalog. Because maybe I, as an engineer, want to have a catalog of all the resources broken down by, you know, region, account, cluster, VPC, machine, whatever that is. And just like the service view, my ownership now is this VPC in this region. And I want to see how much this VPC is costing, how many incidents I have, and so on. And then I own that infra. So that's a great example, what you just said.
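
Editor's note: an infrastructure catalog of the kind described here could be approximated by rolling up resources along whichever dimensions define ownership. This is a rough sketch under assumptions: the inventory rows, the field names (`region`, `account`, `vpc`, `monthly_cost`, `incidents`), and `build_catalog` are hypothetical and do not reflect how Coralogix builds its catalog.

```python
from collections import defaultdict

# Hypothetical inventory rows with made-up cost and incident figures.
RESOURCES = [
    {"region": "us-east-1", "account": "prod", "vpc": "vpc-1",
     "monthly_cost": 1200.0, "incidents": 2},
    {"region": "us-east-1", "account": "prod", "vpc": "vpc-1",
     "monthly_cost": 300.0, "incidents": 0},
    {"region": "us-east-1", "account": "prod", "vpc": "vpc-2",
     "monthly_cost": 500.0, "incidents": 1},
    {"region": "eu-west-1", "account": "prod", "vpc": "vpc-3",
     "monthly_cost": 700.0, "incidents": 0},
]


def build_catalog(resources, dimensions):
    """Roll up cost, incidents, and resource count per ownership unit.

    `dimensions` is the tuple of fields that defines what someone owns,
    for example ("region", "vpc") for "this VPC in this region".
    """
    catalog = defaultdict(
        lambda: {"monthly_cost": 0.0, "incidents": 0, "resources": 0}
    )
    for row in resources:
        key = tuple(row[d] for d in dimensions)
        entry = catalog[key]
        entry["monthly_cost"] += row["monthly_cost"]
        entry["incidents"] += row["incidents"]
        entry["resources"] += 1
    return dict(catalog)


if __name__ == "__main__":
    for key, entry in build_catalog(RESOURCES, ("region", "vpc")).items():
        print(key, entry)
```

Passing a different tuple, such as `("region",)` or `("account", "region")`, gives a coarser or differently shaped catalog from the same rows.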

Mirko Novakovic: Yeah, absolutely. Because we also see the shift. You mentioned it, that you now have platform teams who are providing a platform. And then you have application or service teams on top of it that are building the logical application infrastructure on top of it. Right. That's a big shift. Ariel, time was really flying. I think we could talk forever. It was really interesting talking to someone who has built a great observability company. Super fun. Congrats on everything you have built. I think you can be super proud of the company and your team. And I hope that we see each other in person pretty soon.

Ariel Assaraf: And I can say just the same to you, man. For me, when you texted me about the podcast, I said, an opportunity for an hour with Mirko? I'm up for it. So thank you very much for having me. And you know, good luck with Dash0.

Mirko Novakovic: Thanks, Ariel. Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
