
Episode 37 · 40 mins · 2/5/2026

#37 - Prevention Over Alerts: How OtterMon AI Reimagines Observability with Checo

Host: Mirko Novakovic
Guest: Checo

About this Episode

Checo, CEO and Founder of OtterMon AI, joins Dash0’s Mirko Novakovic to argue that modern observability is rife with noise, reactivity and human bottlenecks. Drawing on years of frontline SRE and product experience, Checo explains why most telemetry is wasted, how signal distillation and “fingerprinting” can surface real risk earlier, and why observability must shift from dashboards and alerts to prescriptive, prevention-first intelligence.

Transcription

[00:00:00] Chapter 1: Introduction to Code RED and Guest Background

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0, and welcome to Code RED: Code, because we are talking about code, and RED stands for Requests, Errors and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Checo. Checo is the CEO and co-founder of OtterMon AI, based in Boston. He previously spent more than a decade managing observability at an enterprise hosting company, and then joined New Relic's product innovation team in 2022. He founded OtterMon in 2025 to tackle the three biggest problems in observability: cost, complexity and reactivity. He is also an award-winning fantasy author. That's interesting. So excited to have you on board, Checo. Welcome to Code RED.

Checo: Hey, so great to be here. Mirko. You're a legend in the field, so it's great to chat with you.

[00:01:04] Chapter 2: A Major Outage and War-Room Triage

Mirko Novakovic: Thank you. So I always start with the first question, which is what was your biggest Code RED moment in your career?

Checo: Oh, goodness. So we had a situation at one of the companies I worked at where a vendor released a live patch to our platform, and it took down almost every site that we had. And it created this chaos that I'd never seen in my entire career where, you know, obviously, our biggest customers were calling. Our executives were in the war room asking for updates by the minute. And it was one of those situations where the entire company grinds to a halt to try and watch this incident and figure out when it's going to be solved. When does business resume? I care about this customer, I care about this customer. And meanwhile we're just trying to figure out what caused it, how do we fix it, and how do we minimize the collateral damage? It was a wild experience to live through.

Mirko Novakovic: Do you remember what it was?

Checo: So essentially what happened was a patch had been released as part of the service offering. A patch had been released live to all customers. But the patch had, I believe the situation was the patch had not been properly tested against certain scenarios. So when it went live across the entire fleet, almost every customer went down at once. And so naturally in that scenario, you go, what did we change? What did we do? We look at monitoring. We see a bunch of red, we're checking our code deploys, we're asking who did something that they shouldn't have done. And then what ended up happening was we realized that a vendor patch had rolled out without any testing, without anything. And then it was a question of, hey, can we get the vendor to roll out a fix as quickly as possible? Or, B, is there a change we can make? And then it became a bit of triage, right, where you're simultaneously trying to leverage executive support to get a change deployed as quickly as possible from the vendor, while at the same time inventing on the fly a patch that can restore operations, if not fleet-wide then customer by customer. It took us, I want to say, three days to fix across the entire fleet.

[00:03:06] Chapter 3: The Human Side of Incidents and Fear of the Unknown

Mirko Novakovic: That's painful. Right?

Checo: Oh yeah.

Mirko Novakovic: Sounds like the CrowdStrike problem, right? When they released a patch and all the Windows machines went down, right. That's a horrible experience, probably, for the team. Right.

Checo: Well, anytime you have a situation like that, it's the fear of the unknown, right? Panic sets in, and the people on the front lines are asking, how do I fix this? The people just behind them are saying, did I cause this? And the executives are saying, oh, my God, is this the end? Is this the thing that's going to cause our biggest customers to churn and tank the company? So in those situations, they have so much panic in them and so much fear and uncertainty. And then when you add the pressure of, you know, these brilliant folks on the front lines, the SREs, the DevOps folks, the frontline managers who are just trying to fix the problem. And you've got support, you know, begging them, please get them back up. Sales teams begging, executives begging. They want every minute updates. You know, it's a situation I think all of us have to live through once to appreciate it. But it's unfortunate that that's the world we live in now that that type of panic can happen just from one change.

[00:04:16] Chapter 4: Joining New Relic’s Innovation Team

Mirko Novakovic: Yeah. Absolutely. Absolutely. Yeah. And then you joined New Relic, which we all know is one of the, I would say, veteran tools. Right? I mean, Lew Cirne first built Wily with Introscope, sold it to CA, then he built New Relic, first for basically monitoring Ruby on Rails apps. Right. And today, it's one of the largest observability players. And you joined the innovation team. So how did that happen? And what did you do, if you can share?

Checo: Yeah. No, I'll share what I can. You know, I mean, New Relic is, like you said, legendary, right? Lew Cirne built an incredible product line with New Relic, and it evolved over time to include more than just APM, right? It evolved to include all the different aspects of monitoring. As a part of my role at Acquia, I had been working with New Relic for a decade, so learning how to use the tool. You know, I was a huge advocate for giving direct access to New Relic to our customers, because there's so much value in that type of monitoring. So, you know, when I had finished my last big project at Acquia, the question became, what's next? What's something new and exciting I could do? And New Relic was amazing. You know, it was a team that I trusted and, you know, had always admired over the years for what they had accomplished. They reached out to me and said, hey, we've got a team that does innovations on the side. You would be tasked with finding opportunities to improve the product, create new features, figure out how we make this even better than it is. And it was just an opportunity I couldn't refuse. You know, I'm the type of person, I grew up in technical support. I've been on the front lines handling the types of incidents that ruin people's days or careers. And it was a huge motivator for me to go work on a project that could potentially improve quality of life for millions of developers around the world. And it was a huge learning experience to actually see how things work under the hood. It was so fascinating. You know.

[00:06:08] Chapter 5: Founding OtterMon AI to Shift from Reaction to Prevention

Mirko Novakovic: No, absolutely. You can see that we also do it, by the way, we hire product managers from customers who operate large observability platforms because they have the experience and they know what's needed in those environments. So it makes total sense, right?

Checo: Yeah. I really feel like there's a need in observability to balance the experts who understand observability as a science and the practitioners who have lived and breathed the problems, because I feel like that synergy is what creates a solid product line that developers want to use and see value in, versus something that's just there to be used as a tool. Right.

Mirko Novakovic: No. Exactly, exactly. Yeah. And that led you to essentially found your own company, right? OtterMon AI? Yes. Tell me a little bit about it, because, I mean, I looked at the website. I think you are still able to sign up for a waitlist, so you are in the launching phase. So tell me a little bit about it: why did you start it? What is it? And then we can deep dive a bit.

Checo: Yeah. So the origin of OtterMon is actually an interesting story. So I've always been a strange person. Like, I've always been able to see patterns in data that other people could not see. And this, you know, was a huge function of my roles at Acquia: I was always looking at the types of tickets customers were filing and seeing patterns in what was causing their issues. And, you know, it became a relentless focus of mine, not just to understand the underlying root causes, but to use those patterns to design the ideal solution that would prevent this from happening again. Everything I do in my career is about how do I improve quality of life for people. So, you know, I spent a decade working on doing that as product features and documentation and stuff, and that worked to an extent. And then, you know, we gave customers access to free monitoring. That worked to an extent. But what really moved the needle for us at that time was building intelligence on top of the monitoring, on top of the platform, and saying, we are looking for very specific signals. We interpret those signals in very specific ways. And this allows us to be prescriptive in how we recommend you solve these problems, like do this, do this in advance, before there's fire, before there's smoke. Do these things. And it was game changing for us as a business. Every time we built tools like that, that intelligence on top, we saw orders of magnitude improvement in customer quality of life.

Checo: So, you know, that was something that I wanted to do at New Relic. It was something that I think was hard for people to wrap their heads around. In any observability organization, you know, I'd try to talk to other PMs across the big observability providers, and everyone had a hard time wrapping their heads around what prevention is, because all we know in traditional observability for prevention is: I see an alarming trend, so I'm going to react to that trend. But real prevention, as my co-founders and I experienced it at multiple companies, was knowing what signals to look for, and then saying this is indicative of a flaw in the system that, if not addressed, will be exploited, whether that's security, performance, uptime. So OtterMon was built to solve that problem: to know what signals to look for, and then to interpret them in a way that the user doesn't look at dashboards and graphs and charts, they're just looking at: these are your action items. We've identified these flaws in your system, you should do X to fix them. And what we're hearing from customers, what we're hearing from the research we do, is this is the experience developers are hoping for. They want to escape from the dashboards and alerts, and they want to move towards: just tell me where to go, tell me what to fix. So that's what made us want to start this company. It's just different.

[00:10:02] Chapter 6: Redefining “Signal” and Business-Centric Health

Mirko Novakovic: What is a signal for you?

Checo: You know, it's funny, in the broad observability context a signal is something that I as a human can recognize and I know what to do with. And that's a very flawed model, right? Because it's subjective. Me as Checo, I'm going to see different signals than you are from your experience. And therein lies the flaw, right? We could look at elevated 500s, right, high error rates, and say this is an obvious problem. A really deep SRE might look at a single page on a website and say, this one's going to be really popular in a week when it's Black Friday, or on Christmas, or on this major event day. And I'm seeing a slow query. That's fine right now, but when we have 100x traffic, it's going to take the site down. So for me, signals are the necessary measures that can be interpreted in the context of whether or not the system will be healthy under key business operations. Right. And by that, I mean, what are the things that indicate that this business will stay online when it needs to be online? And that's a very hard thing for any one person to prescribe. It's why you often need a team or a group of SREs, or a group of subject matter experts, to tweak and tune dashboards and alerts until they're showing the right signals. But I think if we step back and we look at how do we know the business is healthy, we look beyond storage, I/O, memory, CPU, error rates, and we start looking at deeper signals that we can use to say, this business is going to stay online when we need it to stay online, or we have flaws that are going to be exploited under high traffic or attack. And those are the measures that matter most.
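
As a rough illustration of this idea of a signal being a measurement plus the business context it will be judged under, here is a minimal Python sketch. The thresholds and the naive linear scaling assumption are invented for illustration; this is not OtterMon's method.

```python
from dataclasses import dataclass

@dataclass
class QueryStats:
    name: str
    p95_latency_ms: float    # current 95th-percentile latency
    requests_per_min: float  # current traffic on the page that runs it

def latent_risk(query: QueryStats,
                expected_traffic_multiplier: float,
                latency_budget_ms: float = 500.0) -> bool:
    """Flag a query that looks fine today but would likely breach the latency
    budget under an expected business event (e.g. Black Friday).

    The linear scaling model is deliberately naive; the point is that the
    verdict depends on business context (the multiplier), not on whether the
    metric looks alarming right now."""
    projected = query.p95_latency_ms * expected_traffic_multiplier
    return projected > latency_budget_ms

# A query that looks healthy today...
slow_checkout_query = QueryStats("checkout_lookup", p95_latency_ms=40.0, requests_per_min=120)

# ...becomes an action item once we know Black Friday brings roughly 100x traffic.
print(latent_risk(slow_checkout_query, expected_traffic_multiplier=100))  # True
```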

[00:11:55] Chapter 7: Limits of RED-Only Anomaly Detection and Reactivity

Mirko Novakovic: It makes sense. I'm asking because at my former company, Instana, we calculated those signals on the RED metrics of services. So we essentially said there are services in your system, and they have RED metrics: requests, errors and duration. And so in your terms, having Black Friday would be a request signal, right? Because the number of requests would go up, and then we would expect that the duration goes up too, right? And the 500s would be a signal on the E, on the errors, which would say, okay, I have a high 500 rate, which is not normal. So we basically applied machine learning on top of the three RED metrics, learned what the normal behavior is, and then created a signal, right? We called it an incident whenever something happened that would be not normal, right? Yeah. So that's what we did at Instana. And it kind of worked, but it also didn't really work. Right? Yeah. And the reason was that a lot of these changes come without any upfront warning, right? You are putting out a Super Bowl ad and nobody knows about it, and that ad creates ten x the traffic, and then it spikes immediately, and then it's too late. Right? So it's not slowly getting there where you could say, hey, I see a trend. Normally it's like, boom, something happens. Or like you said, I deploy a patch, now I have errors, right? So the kind of issue was that it was not really proactive. It was mostly reactive on something that already changed and is happening, right? How do you see that reactive mode come into play here?
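
For readers who want a concrete picture of the approach Mirko describes, here is a minimal, hypothetical sketch (not Instana's actual model): learn a rolling baseline per RED metric and flag deviations. It also shows the limitation he points to; a sudden Super Bowl style spike is only flagged after it has already shown up in the data.

```python
import statistics
from collections import deque

class RedBaseline:
    """Toy anomaly detector for a single RED metric (e.g. error rate).

    Keeps a sliding window of recent values and flags a new value as an
    'incident' when it deviates from the learned mean by more than k standard
    deviations. Purely reactive: it cannot warn before the deviation exists."""

    def __init__(self, window: int = 60, k: float = 3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        if len(self.values) >= 10:
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values) or 1e-9
            anomalous = abs(value - mean) > self.k * stdev
        else:
            anomalous = False  # not enough history yet to judge
        self.values.append(value)
        return anomalous

detector = RedBaseline()
for minute, error_rate in enumerate([0.01] * 30 + [0.25]):  # sudden spike at the end
    if detector.observe(error_rate):
        print(f"minute {minute}: incident - error rate {error_rate} is not normal")
```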

[00:13:42] Chapter 8: From Pull to Push—Prescriptive, Risk-Tolerant Observability

Checo: So from my experience working across over a thousand companies, right, as customers and the like, what we saw is incidents break down to two things. Either someone changed something and it caused an issue, or there is an inherent flaw in the security or performance of a website or an architecture that can be exploited under certain conditions. The first is fairly self-corrective, right? Like if the team keeps making changes that break stuff, they're going to be incentivized to do better CI/CD. But on the latter side, the detection of flaws, what I found is that development teams are absolutely brilliant people. Right. Some of the smartest you'll ever meet. But humans are flawed, and they can't know all best practices. They can't know all tribal knowledge about how the platform was built, despite their best efforts, and they can't understand in a reasonable fashion or with reasonable tooling how changes they make aggregate over time to create bigger vulnerabilities and gaps in their system. So what I found is that if you are able to detect how a site was built, where these flaws are, right, if you're able to know which signals to look for to indicate, like, caching is not working well, or there are slow queries in very specific pages that have some seasonality to them, there's a little bit of: do you know what best practices you might have overlooked in your rush to get this feature out, to make these changes, right? None of these are necessarily the engineer's fault, right? Their priorities are dictated by product managers and executives and everything.

Checo: So it's not them not knowing better. It's not them deliberately making these choices. It's them and their PMs and their managers not necessarily having the information to prioritize the right work that protects the site. So again, we say there are specific signals that indicate you are not caching correctly, and that's going to be different if you're using Varnish versus a CDN versus, you know, memcache, etc., and knowing what those signals are. It's also a matter of doing push instead of pull, right? Right now observability is very pull oriented. Once there is a problem, we're going to pull you into the platform and you're going to diagnose it yourself. What if observability was constantly looking for ways that your site could fail, and is pushing them to you in a prioritized fashion, saying, I've detected that there's this slow trace, right, that you're running, there's this slow code, and traffic is ramping up on it. And based on seasonality, next Friday, when you get your recurring end-of-week spike, this is going to be problematic. It's saying that you don't have a CDN. It's saying that maybe you haven't configured Varnish correctly, or NGINX, based on best practices, right? And now what you're doing is you're looking less at the pattern of a signal.

Checo: You're looking at the presence of a very specific signal or the lack of a very specific signal. And then what you're able to do is push and say, I think that you are vulnerable in these ways. Have you run a load test? Do you have a major event coming up that could exploit this? There's also a function that we've been looking at around risk tolerance. Right. So if we use security as an example, this is one that I really love. If you use security as an example, everyone has a tolerance of what they'll allow for their security. Right? There are certain things that are just too hard to solve, and you won't be able to close that one. You won't be able to close it. But the odds of it being exploited are 1 in 10,000, right? Every company has a certain risk tolerance when it comes to security. If they didn't, every customer would be on FedRAMP cloud. So I think that there's an opportunity, using that push model where observability is recommending things to you, to say: what is your risk tolerance around downtime? What is your risk tolerance around security vulnerabilities? And then if people subscribe to, I at least want to hit this level, it allows the system to operate more intelligently about what it's recommending you change and how it recommends implementing those changes.
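
To make the "push plus risk tolerance" idea concrete, here is a hypothetical sketch; the names, scoring model and thresholds are invented for illustration and are not OtterMon's API. Detected flaws carry a likelihood and an impact, and the team's declared tolerance decides which ones get pushed as action items.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    likelihood: float  # estimated probability of being hit or exploited (0..1)
    impact: float      # estimated business impact if it happens (0..1)

    @property
    def risk(self) -> float:
        return self.likelihood * self.impact

def action_items(findings: list[Finding], risk_tolerance: float) -> list[Finding]:
    """Push only the findings whose risk exceeds what the team said it will
    tolerate, highest risk first. Everything below the line stays visible
    but is not turned into an alert or a ticket."""
    urgent = [f for f in findings if f.risk > risk_tolerance]
    return sorted(urgent, key=lambda f: f.risk, reverse=True)

findings = [
    Finding("No CDN in front of image-heavy landing page", likelihood=0.9, impact=0.6),
    Finding("Slow query on a page with weekly traffic spikes", likelihood=0.7, impact=0.8),
    Finding("Obscure auth edge case, very hard to exploit", likelihood=0.0001, impact=1.0),
]

for f in action_items(findings, risk_tolerance=0.2):
    print(f"{f.risk:.2f}  {f.title}")
```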

[00:17:50] Chapter 9: Unifying Data Silos and Holistic Analysis

Mirko Novakovic: Yeah, it makes sense, right? Some user inputs on what kind of level of performance you would allow or not. Right. Yeah.

Checo: Exactly.

Mirko Novakovic: And if I look at the website, OtterMon essentially is a software AI that you connect to all your systems, right? You connect it to your observability system, to your security systems, ticketing, GitHub, CI/CD, and then feed all the data in. Or does it pull data? What is the approach? Do you get the right data to do an analysis and then pull more and analyze it? What is the concept?

Checo: So, you know, there's two things happening there. One thing I found when I was working with all those companies is most teams have never achieved the dream of single pane of glass, right? It's the big illusion, the delusion of observability: that if you just standardize on one vendor, just use Datadog, just use New Relic, just use whoever, right, then you'll be able to see everything. But what we've learned in 25 years is that that almost never happens. It very rarely happens, because there's always some other monitoring tool. There's your bespoke solution, there's your security tool, right? Yes, we're using Datadog, but we also use Wiz for this and Splunk for that and CloudWatch over here and a little bit of Prometheus, right? So what you end up with is no holistic picture of everything you're monitoring. Let's say you're using Datadog and Wiz. Datadog is going to have their own conclusions. Wiz is going to have their own conclusions. So on the one hand, you have this issue of all this data: data silos are preventing you from understanding what is my real health. But then you also have this lack of interpretation and this lack of aggregation, right? If I look at my Datadog data independently, it may paint one picture of my system health and security. If I look at my Wiz data, it's going to be a different picture of my health and security. If I pull them together, a signal over here combined with a signal over here might indicate something much worse, right? So we are lacking the ability to pull in those insights between vendors, or to pull in the data and measure them holistically.

Checo: So at OtterMon, we see that as an opportunity where if we pull that data in both the conclusions and the raw telemetry, that allows us to start doing some more holistic analysis and stack ranking of the issues we're finding. So we might be able to self detect something that's a clear indicator of security vulnerabilities from your Datadog data. But then Wiz is going to have a completely different interpretation of where your security vulnerabilities are. We create that single pane that should have existed. And then there's a second layer on top of that, which is we are seeing consistently that even the best DevOps teams don't know what data they need and what data they should get rid of. There are really powerful companies out there, like one of my favorites, where they're consistently seeing that they can reduce customer costs by 90 to 99% just by getting rid of junk logs or condensing them. Right. So what we see is an additional opportunity to pull in this data, distill it down by 99%, because we are specifically looking for critical signals that indicate gaps, vulnerabilities, optimization opportunities. And we have a thesis that if you are only looking at those things, it gets rid of all the noise and bubbles up just the action items, and then suddenly there's less of an urgency around having custom dashboards, custom alerts. So this is all part of a broader effort to demonstrate that observability can be more efficient if it's conclusion based instead of feature based.
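
The correlation-and-distillation idea described here can be sketched in a few lines. Everything below is illustrative: the tool names, finding shapes and severity cutoff are assumptions, not OtterMon's data model, and in practice the findings would come from each vendor's API rather than hard-coded lists.

```python
from collections import defaultdict

# Hypothetical, already-normalized findings pulled from two tools.
datadog_findings = [
    {"component": "checkout-api", "signal": "p95 latency trending up", "severity": 2},
    {"component": "auth-service", "signal": "noisy debug logs", "severity": 1},
]
wiz_findings = [
    {"component": "checkout-api", "signal": "publicly exposed admin port", "severity": 3},
]

def correlate(*sources):
    """Group findings from all connected tools by the component they touch.
    A component flagged by several tools at once often indicates a bigger
    problem than any single tool would report on its own."""
    by_component = defaultdict(list)
    for source in sources:
        for finding in source:
            by_component[finding["component"]].append(finding)
    return by_component

def distill(by_component, min_severity=2):
    """Keep only the signals worth acting on; drop the rest as noise."""
    report = {}
    for component, findings in by_component.items():
        keep = [f for f in findings if f["severity"] >= min_severity]
        if keep:
            report[component] = sorted(keep, key=lambda f: -f["severity"])
    return report

for component, findings in distill(correlate(datadog_findings, wiz_findings)).items():
    print(component, [f["signal"] for f in findings])
```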

[00:21:38] Chapter 10: Beyond AI SRE Reactivity—Role-Based Interpretation

Mirko Novakovic: But that means you are, on the one hand, what I would call a little bit of an AI SRE agent. So you connect to the different tools, including security, and you come up with intelligent solutions for problems. But the other part is, if I understood it correctly, you are kind of an intelligent observability pipeline, where I feed the data through your pipeline and then you reduce it, or you look at the tools and remove it from the tools. How do you do the reduction of the data to the signals? Do I have, as a customer, to send the data first to your tool, or how does that work?

Checo: Right, so, what we learned as a team over the years is that obviously a telemetry pipeline tool can be super powerful, right? We've seen why Cribl grew to a $2.5 billion valuation in just five years. We've seen the value of solutions like Edge Delta and beyond. But there's inherent problems and challenges around telemetry pipelines, right? You have to have teams that understand how to build them. You are inserting a service into the critical path. And again, in a telemetry pipeline solution, the data has to flow through it for it to understand it, and not all types of data can flow through a telemetry pipeline. So what we're doing, in a similar approach to the AI SRE, is we're creating this kind of layer of intelligence on top of observability that pulls. You can push to us if you'd like, but primarily we pull from existing data sources, because we want to create that holistic picture and interpret it above. It's not putting us in the critical path. It's providing those insights above and beyond what you already have. But then there's a second thing that's a really important distinction. You know, there are a lot of AI SREs, and there are more and more every single day.

Checo: Every time I check my LinkedIn feed, it's like, hey, we've got an AI SRE. So the thing with AI SREs is that they can be super powerful, right? They can look at all the best practices, they can pull in your documentation, your past incidents. They can look at your observability data and they can mathematically assume that trends are bad. They can mathematically detect, especially the really good and big ones that are coming out now, the results, the traversals, they're building really expensive, really powerful models to look at: how do we detect outlier signals mathematically that indicate you may be having a problem? And I think there is a lot of power in that. You know, we're also seeing, I went to Grafana, I love Grafana, they do such a great job, but the biggest feature that they announced this past year was the AI assistant. Right. So many developers were excited at the notion of not having to build their own dashboards or have something customized for them. But my fear in the AI SRE approach is that it is still reactive. You're pulling in all this data to look for mathematical patterns or signals that may indicate you have a problem, and I feel like we as an industry have to be more prescriptive at this point.

Checo: We know that CPU high, for example, CPU high may not actually be a problem. Why are teams still creating alerts on it? Right? But CPU high plus negative sentiment, that's a problem. Specific signals indicating that Varnish is not configured correctly, or your database isn't right. All of these things, there's a subtlety there that requires interpretation up front, things that only people like you and I, Mirko, who have lived in the trenches and seen this across hundreds, if not thousands, of use cases can look at and say: under these circumstances, X may be fine, but keep an eye out for these correlated signals, because that's when you really have an issue, right? So for us, we are looking at adding in a new framework that typical AI SREs won't have, around role-based analysis. And what I mean by that is, if you look at the architecture of websites and their underlying infrastructure as a living, breathing organism, you're going to see recurring roles in the sites. This one's e-commerce, this one's brochureware, this one's something else, right? You're going to see recurring roles in the sites and you're going to see recurring roles in the infrastructure. This is for cron. This is for APIs.

[00:25:52] Chapter 11: Measuring Success Through True Prevention

Checo: This is a transaction. This is an email. And when you start understanding the role an organ plays in the larger organism, that allows you to narrow down the signals that actually matter even more. And now you're looking for, okay, high CPU doesn't matter on a cron server, because it's cron and it's going to have spikes every hour on the hour. But if I notice the cron jobs stop running, if the number of cron jobs changes, that's my key signal, right? And it's building that deep SRE knowledge into the platform in a way that it can take these signals, interpret them for you, and start providing those prescriptive recommendations. That's where we see the big difference between us and AI SREs. We're not waiting for the smoke or the fire. We're predicting where the most likely smoke and fire will come from and saying, don't do that, don't do that, don't do that, go change this. That way, for us, success is if the DevOps team can start planning their prevention into sprint plans two weeks ahead, a quarter ahead. That is prevention, not: I got alerted an hour before storage filled up, or an hour before the 500s reached an untenable rate. That's success. That's what we have to strive for as a community.
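
Here is a minimal sketch of the role-based interpretation and correlated-signal ideas from the last two turns. The roles, signal names and rules are illustrative assumptions, not OtterMon's actual rule set; the point is that the same raw measurement is read differently depending on the role a component plays.

```python
def evaluate(role: str, signals: dict) -> list[str]:
    """Interpret the same raw signals differently depending on the role a
    component plays in the larger system. Rules here are illustrative only."""
    issues = []
    if role == "cron":
        # High CPU on the hour is expected; silent or missing jobs are not.
        if not signals.get("jobs_ran_last_hour", True):
            issues.append("cron jobs stopped running")
        if signals.get("job_count_delta", 0) != 0:
            issues.append("number of scheduled jobs changed unexpectedly")
    elif role == "web":
        # High CPU alone is not actionable, but combined with user-facing pain
        # (e.g. negative sentiment in support tickets or logs) it is.
        if signals.get("cpu_high") and signals.get("negative_sentiment"):
            issues.append("high CPU is now impacting users")
    return issues

print(evaluate("cron", {"cpu_high": True, "jobs_ran_last_hour": True}))   # [] - hourly spike is normal
print(evaluate("cron", {"jobs_ran_last_hour": False}))                     # job stoppage flagged
print(evaluate("web",  {"cpu_high": True, "negative_sentiment": True}))    # correlated issue flagged
```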

[00:27:07] Chapter 12: Distillation, Fingerprints, and Efficient Context

Mirko Novakovic: No, absolutely. I like your approach of defining roles for certain types of, I would say, application loads, right? E-commerce or a cron job or whatever, and then applying what you call knowledge to it, which is essentially kind of a best practices model, right? Where you know, if it's a cron job, then these signals are relevant; if it's an e-commerce site, then these signals are relevant. I would probably have a payment provider I need to check, right? These are the things I can share with the system. That makes a ton of sense. I also like the approach of pulling, right? I would have been a little bit skeptical if I had to feed all the data through OtterMon, right? So it's better if you pull it. But you have storage, right? You keep some of the signals for yourself, or not?

Checo: What we're doing across the board is distillation, right? We genuinely believe, and our experience has taught us, that, depending on whether it's your guys' research or, you know, any other company's, right, somewhere between 80 and 95 percent, sometimes even higher, of observability data never actually gets used. Right. That is consistent across so many companies. So we're creating all this telemetry that has very low value. So we see it as an opportunity to say we're not going to keep everything, right? It would be horribly inefficient for us to just take all of your Splunk logs and all of your metrics from Datadog and all of your APM from New Relic, and just duplicate it, right? So for us, it's about building the technologies that allow us to take those signals in and retain the key points that indicate: this is normal in general, this is normal for you, these are the types of anomalies we see in general, these are the types of anomalies we see for you. And forcing ourselves to use those types of what we call fingerprints to then make informed decisions about next steps and analysis. And it shouldn't just apply to metrics, logs, traces, alerts, events, sessions, you know, all those key measures, right? We should have a distillation process for incidents. What are the types of incidents that are faced by these technologies, these types of software? It should apply to code. What is the customer trying to accomplish? What are their coding methodologies? Are they using blue-green, right? Understanding and bubbling down everything into the smallest possible footprint, not just because it eliminates one of the biggest cost drivers in observability, but because we are in a world where that finite context is essential to AI algorithms operating efficiently.

Checo: Right? There's essentially two modes we're in right now. There are companies that are just feeding obscene amounts of data to AI models and hoping that they pick up the right patterns and trends, and it's very expensive and it's very time consuming. You have to have a lot of data. The alternative, which is something a lot of companies, not a lot, some companies are starting to do now, is pre-distillation, where you're being very selective and making sure that only high quality data gets fed into the models. And if you do something like that, it is wildly more efficient. The results are more accurate, and it allows you to do this thing that we've been pioneering, which is instantaneous training, where the model doesn't need to have extensive background on this IP address that hit you for two years, or this specific path URL and how often it gets hit, right? It doesn't have to understand all that nuance up front. If you can feed it that context in seconds or a minute, now suddenly you're able to work with pretty much any scenario on demand, and it doesn't require the model to be pre-trained, at least not in the stuff that matters, right? So we think there's huge opportunity there. We've been experimenting with it for the past few months, and we're seeing some encouraging progress, because a customer can spin up an OtterMon AI instance in a minute, and then it automatically knows the past hour of what's normal for them. Then it learns the week, the month, right? It's something you can't do readily with a lot of other approaches.
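
One way to picture the "fingerprint" and "knows the past hour of what's normal" ideas is a handful of running statistics kept instead of the raw points. The sketch below is an illustrative assumption about what such a distilled representation could look like, not OtterMon's actual data structure.

```python
import math

class Fingerprint:
    """Tiny, illustrative 'fingerprint' of a metric stream: a few running
    statistics instead of raw history, enough to answer 'is this value normal
    for you?' without storing or replaying every data point."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def add(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def is_normal(self, value: float, k: float = 3.0) -> bool:
        if self.count < 10:
            return True  # not enough context yet to judge
        stdev = math.sqrt(self.m2 / self.count) or 1e-9
        return abs(value - self.mean) <= k * stdev

# 'Instant' onboarding: replay just the last hour of per-minute latencies...
fp = Fingerprint()
for latency_ms in [120, 118, 125, 119, 121, 117, 123, 122, 120, 119, 124, 118]:
    fp.add(latency_ms)

# ...and the fingerprint can already judge new values for this customer.
print(fp.is_normal(121))   # True - within the learned band
print(fp.is_normal(480))   # False - worth interpreting further
```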

[00:31:04] Chapter 13: Complexity of Identity, Re-Identification, and State

Mirko Novakovic: And you auto-detect those things. You called it fingerprinting, I like it. We have a lot of terms in common. When we built Instana, we had things like the signaling, the fingerprinting. And I know from our experience, the way we saw it, it's super complicated. Yeah.

Checo: Oh, it is. I mean, that's what's taken us the longest time.

Mirko Novakovic: Yeah. Simple tasks like re-identifying a process, right? So you have a process in a container, it goes down or crashes, or you spin up another one. How do we know that it's the same one and aggregate it to one service? Or how do you re-identify that something comes back, could be gone for two days and then comes back, right? Or you lost the network connection, right? How do you know if you lost a network connection or if this thing is really gone? And then fingerprinting it, understanding it, seeing it coming back and mapping that back, it's complicated. Right. Technically it sounds easy, but it is, oh, it is super complex.

Checo: Exactly. But what you just described is exactly the point, right? You and I can look at a scenario and say, this is a case of an errant process, right? Or a malicious process, or a mount point has become disconnected. But we as humans are very limited in the number of things that we can keep track of, the things we can learn, the things that we can keep front and center as we're working with these complex systems. But here's the thing. Humans are designing these systems, and there's only so many different ways to build these architectures. There's only so many ways to break these architectures. And being able to design the system to know that a mount point can go away, or a process can either be malicious or just fail, to know that containers can have all sorts of different issues, right? Building a system that knows this is a thing that exists, and knowing what signals to look for to detect it and how to react to it. That is the opportunity and that is the challenge. The fingerprinting is a technological one, but it is just a hurdle in delivering that next-stage logic, right? It's just about how do I make sure, mathematically, I'm storing this in the smallest possible footprint so that it can then be fed through a reasoning engine that knows what to look for and how to interpret it. That's the point we have to get to, because generating mountains of data and storing mountains of data and feeding AIs mountains of data is ultimately not going to work. We have to find a more efficient approach as a community.
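
The re-identification problem Mirko raises can be made concrete with a small sketch: derive identity from attributes that survive a restart and ignore the ones that change every time. The attributes chosen here are assumptions for illustration only; real re-identification, as both speakers note, is far more involved.

```python
import hashlib

def identity_fingerprint(proc: dict) -> str:
    """Build a stable identity from attributes that survive a restart
    (image, command line, deployment, exposed port) while ignoring ones
    that change every time (PID, container ID, start time)."""
    stable = "|".join([proc["image"], proc["cmdline"], proc["deployment"], str(proc["port"])])
    return hashlib.sha256(stable.encode()).hexdigest()[:12]

before_crash = {"pid": 4711, "container_id": "a1b2c3", "image": "shop/checkout:1.4",
                "cmdline": "gunicorn app:api", "deployment": "checkout", "port": 8080}
after_restart = {"pid": 902, "container_id": "f9e8d7", "image": "shop/checkout:1.4",
                 "cmdline": "gunicorn app:api", "deployment": "checkout", "port": 8080}

# Same logical service, so telemetry from both incarnations aggregates to one
# entity instead of appearing as a brand-new process after the crash.
print(identity_fingerprint(before_crash) == identity_fingerprint(after_restart))  # True
```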

[00:33:32] Chapter 14: From Human Dashboards to Automated Reasoning

Mirko Novakovic: Absolutely, absolutely. And I like the way you think about signaling and these things. What I also learned, and I would be happy to hear your view on this: you said you worked a long time and you could see things, right? You could see patterns. And that's also my experience. We humans are actually really good at looking at dashboards if we look at them for a long period of time. I mean, you know what I mean, right? You're sitting in a room and you look at them every day, then you actually know what the charts look like. And then someday you look at the dashboard and say, oh, something is not normal, right? You can see it. You don't really know why, but you just know that it doesn't look right. Right. And that's, by the way, something machines are not really good at. Machines are sometimes not really good at finding those patterns that we humans can see. That's why we use charts and we are not looking at numbers, right? Because we are good at interpreting a lot of data when we put it on a chart.

Checo: Right. You know, but therein lies one of the biggest flaws in our ecosystem, right? Somebody has to be as smart as Mirko who can build the chart. Somebody has to be as smart as Mirko who can interpret the chart correctly, and notice that that spike looks a little bit different than the ones that we normally have, or comes at a slightly different time, right? Also, Mirko can only be online so many hours of the day. Like, your job cannot be to sit and stare at that graph or that dashboard, right? So the question is, how do we build this intelligence in such a way that, without being able to necessarily look at a graph, it understands the context that would lead to a spike, or the context that would lead us as humans to interpret that spike as problematic? And I think that's where it's going to be really critical for us to build systems. Again, this is where I love Edge Delta: they've kind of, from my perspective, pioneered the notion of negative sentiment analysis. I really believe that it's going to come down to understanding not just that there are bad signals in the sense that they indicate negativity.

Checo: It's also understanding that this, correlated with another signal, another anomaly, another bit of seasonality, taken as a whole indicates a risk, an active problem, or an optimization opportunity that could turn into an issue later, right? That's where the real innovation is going to lie. We can't scale in a world where AI is creating instantaneous threats and you no longer need to be a tenured blackhat hacker to take down hundreds or thousands of websites. We as a community have to develop the technologies to react instantaneously. Humans looking at dashboards is flawed; it's what my ex-CEO would have called a plan to fail. We have to figure out how to take the humans out of it and train the machines and build the logic in such a way that they can draw the conclusions for us, they can diagnose it for us, they can recommend a remediation that we believe in, and in nine out of ten cases, ideally, the system designs, deploys, tests and redeploys the right fix for us without the human in the loop. That's where the puck is going, and that's what we have to get towards together. We are far away from that.

[00:36:52] Chapter 15: OtterMon’s Initial Focus and Roadmap

Mirko Novakovic: I would love to hear where you are in terms of releasing and in terms of building the product. How is it going so far?

Checo: It's been awesome. So right now what we're doing is we're working with over a dozen companies that are using Prometheus, because I feel like that's the most abandoned use case, right? In the sense that everyone starts out generally with Prometheus and the Grafana stack self-hosted, or they'll use CloudWatch, and a lot of enterprises don't think about how we can make their lives easier, right? So there's thousands and thousands of companies out there that are using this do-it-yourself monitoring, and they don't have the engineering capacity to make it really robust. They're kind of afraid to go to Datadog or New Relic because it could be a huge investment. Right. So we've been working with those early stage companies to test out our logic, see if we can distill all their data correctly, auto-detect the system, the roles, and then, once we've got that squared away (we're looking at doing really robust testing with that in January), it's about building out that next-order logic that isn't just: I can see your system and I understand it, and you can have a conversation with your data. It's about that next layer of proactivity. We're detecting issues for you without the need for custom alerts. We're able to visualize key data points for you without the need for custom dashboards. And we can draw conclusions on where your attention should be focused without a human having to look at a chart. That's where we're going after January, and we're looking for anyone who's using Prometheus and wants to help test with us. Reach out, and we'd love to partner.

[00:38:29] Chapter 16: Industry Comparisons and Closing

Mirko Novakovic: Well, that's great. I actually worked with Manoj from Asserts AI, I knew him from AppDynamics times. I don't know if you're familiar with Asserts AI, but they got acquired by Grafana.

Checo: I am,

Mirko Novakovic: But they were also building AI on top of Prometheus, right? A graph model, service model. And it was a really good tool. I haven't followed what happened inside of Grafana with it, but the team was really good. So check them out.

Checo: Yeah. No, Asserts was doing data distillation before it was cool. But they're also one of the only companies that have pulled it off successfully. So it's still a huge challenge for observability. And the question is how we bring that to the open source model in a way that many companies can use.

Mirko Novakovic: Absolutely. So that was really great talking to you, and I'm really looking forward to seeing it in action. So hopefully you will give me a demo once you're ready. And I would love to see it.

Checo: Absolutely, anytime. Mirko, you guys are doing great stuff over at Dash0. Keep it up.

Mirko Novakovic: Thank you. Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
