Episode 9 · 25 mins · 9/26/2024

The Observability On-Ramp: Early Monitoring Made Easy

Host: Mirko Novakovic
Guest: Micha Hernandez van Leuffen
#9 - The Observability On-Ramp: Early Monitoring Made Easy with Micha Hernandez van Leuffen

About this Episode

Fiberplane founder and CEO Micha Hernandez van Leuffen joins Dash0’s Mirko Novakovic to walk through his new API debugging tool, his view on metrics’ role in observability and why he’s investing in the ‘building blocks of tomorrow.’

Transcription

[00:00:00] Chapter 1: Introduction and Guest Background

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I'm the co-founder and CEO of Dash0. And welcome to Code RED. Code because we are talking about code, and RED stands for requests, errors and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Micha Hernandez van Leuffen. Micha is founder and CEO of Fiberplane. He previously founded Wercker, a container-native CI/CD platform acquired by Oracle in 2017, and he is also an investor at NP Hard Ventures. He's focused on the building blocks for tomorrow, and we have much to discuss today. Micha, welcome to Code RED.

Micha Hernandez: Thanks for having me. Great to be here.

Mirko Novakovic: Yeah. And before we start with my first question that I normally ask, I think there are two things to mention. One is that I'm also an investor in Fiberplane, so happy as an angel. And the second is we met the first time when I basically took over your apartment in San Francisco. You had sold your company and you were moving back to Europe, and I came over for Instana in 2018 or '19, I don't know, and I bought an apartment. Turned out it was your apartment. So, right, we shared an apartment. My first question is always: what was your biggest Code RED moment?

[00:01:32] Chapter 2: Code RED Moment and Early Challenges

Micha Hernandez: In the Wercker days. So, as you said, it was a container-native CI/CD platform. We were very early in the container journey, so the original version was built on LXC, and this was before Kubernetes came out. We migrated towards CoreOS even, we were using fleet at the time and flannel for networking, and then obviously Docker as a container format. And I think we were on one of the first versions of Kubernetes. And we had this weird bug where the Docker daemon would run out of memory and then stop our job system for the CI jobs, as in, stop accepting new jobs. And this thing would just pop up every now and again, very non-deterministically. And that always kept us up at night, because we just never knew when this thing would fall over and our CI jobs would fail.

Mirko Novakovic: How did you fix it? How did you figure out what the problem was?

Micha Hernandez: Honestly, I think eventually we just went to a newer version that came out, and it was solved via that. It was probably just too early to adopt Docker at the time.

Mirko Novakovic: Yeah. And that's also the truth, right? Some issues you can't really debug with observability tools, or they're hard to debug, especially if it's something so core to the operating system, as in this case. Right.

Micha Hernandez: Yeah, exactly. So it was just waiting until they shipped the fix, and then we were able to adopt it.

Mirko Novakovic: Yeah. And then you sold the company to Oracle. You probably did as I did. And then you also figured out that you get bored, right?

[00:03:30] Chapter 3: Post-Acquisition and New Ventures

Micha Hernandez: Exactly. Yeah. So as I said, the company was acquired by Oracle. I spent a couple of years there as a VP of software development, focusing on their cloud native efforts and a bit of open source. And then indeed also started doing angel investing, which later culminated in an actual fund. But I also had the entrepreneurial itch and wanted to do another company. And basically, you know, during the Wercker days there were these bugs that I just mentioned, but also inside of Oracle, one of the things that we noticed was that there's no real collaborative software for observability, and that, you know, we were very used to, maybe to this day still are, these sort of static dashboards. Right? You set them up in advance. You think you know what to monitor and what to measure. And then when push comes to shove, something else falls over and you need to debug something else. And that kind of inspired the collaborative notebooks that we were working on with Fiberplane, which is more of an explorative form factor, right? You pull in different types of observability data, be it metrics, be it logs, be it traces, and then try to build up this narrative, the investigation of what's going on. I was very much inspired by going back and forth between all these different dashboards: what if we could put this into one explorative notebook form factor, obviously very much inspired by the data science space with Jupyter notebooks, and make that collaborative, where, you know, people can work on these things together? I think this was also around Covid, when we started the company, so people were working remotely. You know, I think the classic way is always, something's up and you swivel your chair over to your colleague and you look at your monitor together. But now with remote work, that just gets increasingly harder. So that also kind of inspired this more collaborative form factor.

[00:05:52] Chapter 4: The Value of Collaborative Observability Notebooks

Mirko Novakovic: Yeah. And to be honest, one of the things I really learned over the years in observability, and that's why I really like the idea of Fiberplane, by the way, is that if you have problems, especially in larger organizations, at some point you understand what the problem was, and you do a postmortem or something, and then that knowledge should be shared between teams. Right? Because if that problem occurs again, if you understand what the issue was, you really shorten the time frame to fix it. But people use wikis or Notion or Google Docs or whatever, and whenever you have used such a tool, it's really hard to find things and to document things. And when I saw Fiberplane, I really liked that you have this markup-style way of documenting, plus you can easily integrate, for example, a Prometheus metrics chart or a log or a trace into the document. Right? So it is really a collaborative notebook that you can use to either do something like a postmortem, or look at a certain scenario, or work on a problem together. And that's absolutely targeting a problem that I hear over and over again, and that traditional observability tools do not really have a solution for.

Micha Hernandez: Yeah, exactly. I do think people tend to, you know, share screenshots that they take from dashboards over Slack, which you then need to search to find what you were talking about. It does, of course, assume that you've instrumented your application in the correct manner in the first place. And that's just simply not always true, right? That's the other thing that we're seeing as well: the world might not be as sophisticated, at least not every organization. And there's always this friction between SREs, on the one hand, trying to convince developers to instrument their code properly, while for developers it's sort of toil work, like writing unit tests. Right?

Mirko Novakovic: Yeah. And it's interesting. I mean, then you established a second product which goes in that direction, right? Which is around metrics and supporting developers to instrument code better, to collect those metrics. Can you tell us a little bit about how you came up with the idea, or which problem you saw and wanted to solve?

[00:08:25] Chapter 5: Developing an Open Source Metrics Framework

Micha Hernandez: Yeah, I think it's kind of interesting. I was sort of going through these various products that we've iterated on. It's been a continued journey in the same space, trying to solve pieces of the same problem. So yeah, we created this open source framework called Autometrics, implemented in different programming languages. We started off with a Rust version, because some of the people on the team are Rust engineers. There's a Python version, there's a TypeScript version, and there's a Go version. And the whole idea is that you use metaprogramming, like macros in Rust or decorators in Python, to auto-instrument all your functions. If you add the decorator, by default you get the top-level metrics such as latency, error rate and request rate out of the box. And not only do we track these metrics; on top of that, because we settled on the naming conventions, we know exactly what PromQL query is attached to visualize and show these metrics. The nice thing as well is that you can also attach SLOs, so service level objectives, to each function. So at the top level, you could say, hey, I'm interested in 99% uptime or error rate, right? Or every request should respond under, you know, 300 milliseconds or something. That's something you could define as a top-level service level objective and then attach to every function that you've instrumented with Autometrics. And once you have that, you can have alerts that go off if one of these thresholds is not met or, you know, if SLOs are violated.
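The decorator pattern Micha describes can be sketched in a few lines of Python. This is a hand-rolled illustration of the mechanics, not the actual Autometrics API: the real libraries emit Prometheus-compatible metrics with fixed naming conventions, whereas this sketch just keeps RED counters (request count, errors, duration) in memory.

```python
import time
from collections import defaultdict

# In-memory stand-in for a Prometheus registry (illustration only).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})

def autometrics_sketch(func):
    """Wrap a function so every call records request count, errors, and duration."""
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Any raised exception counts toward the function's error rate.
            METRICS[func.__name__]["errors"] += 1
            raise
        finally:
            # Runs on success and failure alike: request count and latency.
            METRICS[func.__name__]["calls"] += 1
            METRICS[func.__name__]["total_seconds"] += time.monotonic() - start
    return wrapper

# `create_user` is a made-up example function, not from any real codebase.
@autometrics_sketch
def create_user(name):
    if not name:
        raise ValueError("name is required")
    return {"name": name}

create_user("ada")
try:
    create_user("")          # triggers the error path
except ValueError:
    pass

stats = METRICS["create_user"]
print(stats["calls"], stats["errors"])   # 2 calls, 1 error
```

From counters like these, the RED rates fall out directly: request rate is calls over time, error rate is errors divided by calls, and average latency is total_seconds divided by calls, which is what the generated PromQL queries compute on the real metrics.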

Mirko Novakovic: Yeah. But what I really like is, you said auto-instrumenting, instrumenting the code to generate metrics, right? So if I have an API or something, I get a metric for each call, and I understand how many errors there are. Can you explain a little bit what you are instrumenting, and how you generate those metrics, or which metrics?

[00:10:51] Chapter 6: The Future of Metrics and Observability

Micha Hernandez: Yeah, yeah. So it's all Prometheus compatible. Every function gets the error rate, so we track the errors, if it's, you know, a 404. We track the response time for every function, and then also the concurrency, so the request rate as well. And these are all Prometheus-compatible metrics. If you have Prometheus running, they would get scraped, and you can visualize them, of course, in Grafana. We've actually got a set of Grafana dashboards out of the box. Or we actually built our own explorer interface, which has some nice spark charts and shows the PromQL query that is attached to each metric. And then the nice thing that you can do as well, because these are function-level metrics: if, for instance, a function is not behaving, we know exactly what line of code is misbehaving.

Mirko Novakovic: And I'm not sure if you're following the current discussion around metrics in the observability space. I think Honeycomb is taking that to the next level, basically saying, okay, there's observability 1.0, which is kind of metric based, but it doesn't really work. And then there's observability 2.0, which is basically event based, where everything is events, spans, logs, even the metrics are derived from that, and then you can correlate it all. What is your view on the metrics part, and why have you chosen metrics? And how do you see that with customers? I mean, we all know that Prometheus is super popular and it's a metric-based tool, right? And I do think, if I would put an order on it, though I'm more from the tracing side: if you talk to customers, it's really metrics, then logging, and then traces, in that order, if you look at what somebody is implementing. So how do you see that discussion?

[00:13:22] Chapter 7: FiberPlane's New Developer Tool - Hana

Micha Hernandez: Yeah, I read that article as well, by Charity. I think metrics are great for having a signal, right? They're very cheap to store. You know that something is wrong and you can act on that. I think the thing with logs is, of course, that they capture a lot of data; they have more context. And I think what we've seen, also with metrics: in an ideal world, yeah, you might be attaching metrics to all of your functions. But again, I think the world is not that sophisticated. There are just a few organizations that invest heavily into capturing these metrics and then having some kind of correlation to the logs as well. Right? Like, great that you got a signal, but now you need to figure out the context, which function was misbehaving. So I'm actually pretty on board with that article, to be honest. And I think also less sophisticated companies, like startups, right, they don't think about metrics. They might not even know what a metric is. What they do know is logs. That's the first thing that a developer does: let's console.log or println what I'm seeing, print that to my terminal, and then I can go ahead and fix it.

Micha Hernandez: Which kind of brings us to the third product that we're iterating on, focusing on a framework called Hono. So we're going pretty narrow on a TypeScript framework for building APIs. It's a bit like Express or Fastify. We're building OTel middleware for that, and then a sort of local debugging companion. So the question that we asked ourselves was: hey, what if we build a developer tool for local-first development? You're in the very early stages of building your API. What does a tool look like that helps you debug your application as you're building it? Because what we've seen is that you're going back and forth between different tools for that workflow. You might have Postman to, you know, send requests and see how your API is behaving. You probably have a bunch of terminals open to see what you're printing out and what errors you're getting. And obviously you've got your IDE open as well, to maybe go ahead and fix those things and generate tests. So we were thinking: what if we fold all those tools into one, and effectively build a supercharged Postman, powered by OpenTelemetry data, to help you quickly debug and build out your APIs?

[00:16:01] Chapter 8: Auto-Detecting APIs and AI-Generated Insights

Mirko Novakovic: I like that. Postman powered by OTel.

Micha Hernandez: Yeah, yeah. And it's actually a bit in the same vein, right? Whereas with Postman, you just do a request and you get something back, a response. Here, because we're using OTel middleware, we actually know the behavior, right? We know the context of what's going on. Especially, I think, in this day and age where we're building all these different APIs that connect to other third-party services, like an OpenAI or an Anthropic, it's actually quite valuable to see: hey, I'm shelling out to OpenAI. How long is that request, that piece, taking? Why is it slow? What's going on? So that's kind of our third piece of the puzzle.

Mirko Novakovic: But in this case, you have more than the metrics. You also have the other signals, traces and logs?

Micha Hernandez: Exactly.

Mirko Novakovic: And how do you get that data?

Micha Hernandez: So that's done via OpenTelemetry middleware. It's just a two-liner, assuming you're using this framework. We will obviously be expanding to different API frameworks, such as Express or Fastify in TypeScript land, and we've gotten some requests for FastAPI in the Python world, but we're going deep and narrow on this relatively new kid on the block called Hono. So it's a two-line middleware that you instrument your application with: you import our middleware, and then you instrument your app. It's, I'd still call it, debugging data, powered by telemetry data.

Mirko Novakovic: And it's very semantic on APIs. So I will see my APIs, and see the calls of the APIs, and where I have errors? Or how do you design the UI for that?

Micha Hernandez: Yeah, it's a good question. So the interesting thing that we can do versus Postman, because we're using these middlewares, is that we actually auto-detect your entire API, right? We know every route, we know the method, we know the function signature, which leads to new superpowers that we can give you. Because we know the function signature and every route, we can use AI to generate payloads, right? One of our example APIs is an API called Goose Quotes. It's like a character AI, only consisting of geese. And maybe you want to create a goose. Using AI, we know what the payload should be: it has a name, it has a bio, it has a favorite programming language, and we can generate all of these JSON bodies for you. Then you can hit send, and you get a response back, and that might fail. And now you're ready to go ahead and debug that.
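The "signature to payload" idea can be sketched without any AI at all: if the tool knows a handler's signature, it can map annotated parameter types to placeholder values. Micha's tool uses an LLM to produce more realistic bodies; this Python sketch just shows the shape of the trick, and `create_goose` is a hypothetical handler echoing the Goose Quotes example, not real API code.

```python
import inspect

# Type-to-placeholder mapping; an LLM would produce richer, realistic values.
PLACEHOLDERS = {str: "example", int: 0, float: 0.0, bool: True}

def generate_payload(handler):
    """Build a sample JSON-ish payload from a handler's annotated signature."""
    payload = {}
    for name, param in inspect.signature(handler).parameters.items():
        # Unannotated parameters fall back to None.
        payload[name] = PLACEHOLDERS.get(param.annotation, None)
    return payload

# Hypothetical route handler, modeled on the Goose Quotes example above.
def create_goose(name: str, bio: str, favorite_language: str):
    ...

print(generate_payload(create_goose))
# {'name': 'example', 'bio': 'example', 'favorite_language': 'example'}
```

The point is that the route's signature alone determines the keys and value types of a valid request body, which is exactly what makes auto-generating test payloads possible.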

Mirko Novakovic: Okay. Because you know the signature of the actual method behind the API call, you extract it basically from the method?

Micha Hernandez: Yeah.

Mirko Novakovic: Cool. And you use it for testing, or what's the use case?

[00:19:16] Chapter 9: Channels of Integration and User Interfaces

Micha Hernandez: Yeah. So the first use case is very much, I would say, local on the laptop, debugging your API as you're building it. And again, I would say Fiberplane was a pretty new mental model, right? The notebook product was a new mental model around debugging observability data. This is more, I think, in the vein of a thousand paper cuts. Right? All of these small tasks that you would do, all these different screens that you would have open whilst developing your API: we're shaving off seconds in your development time. So it's local first, very much focused on when you're early with your team, building your API and about to hit production. Obviously, once you do hit production, we could store this data as well, right, and give you a more collaborative interface for you and your team to debug everything.

Mirko Novakovic: Yeah. It's also very interesting to generate synthetic tests. Right? Because if you have those APIs and you know the signature, you know the payload, you could actually generate synthetic tests for those services, to see if the uptime is there, if the performance is there. A very interesting use case, from my point of view.

Micha Hernandez: Yeah. The other use case is around integration tests. So say, you know, one of the examples that we have is: you do a request, you get an ID back, but that ID is expecting an int, right? But you put in a string, and you should probably account for that edge case. But say you don't; you get a 500 back instead of a 422, right? What we can do then is, again, we've got the entire context, and you can generate a prompt. And if you're using one of these newer AI IDEs, you can take that prompt, put it in your IDE, and it will not only generate integration tests for the specific test runner that you're using in that framework, but it will probably also give you a fix for that edge case as well.
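The 422-versus-500 edge case above is easy to see in a framework-free sketch. `get_item` is a hypothetical handler, not from Hono or any real framework: without the explicit check, the `int()` conversion would raise and surface as a 500; with it, the client gets the 422 validation error the spec intends.

```python
# Hypothetical route handler: the route expects an integer ID, but a client
# may send a non-numeric string. Returning a dict stands in for an HTTP
# response (status code + body).
def get_item(raw_id):
    try:
        item_id = int(raw_id)
    except (TypeError, ValueError):
        # The fix for the edge case: validation failure -> 422, not a crash.
        return {"status": 422, "error": "id must be an integer"}
    return {"status": 200, "id": item_id}

print(get_item("42"))    # well-formed ID -> 200
print(get_item("abc"))   # malformed ID -> 422 instead of an unhandled 500
```

An AI-generated integration test for this would simply send both inputs and assert on the status codes, which is exactly the kind of test the prompt described above can produce.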

Mirko Novakovic: That's pretty cool. So basically you generate the fix for making sure that if there's a string coming, that it either converts it into an int or throws an error or whatever. Right. Yeah.

Micha Hernandez: Yeah. And again, like it's one of those paper cuts. Right. Like that you just run into as you're developing your API.

Mirko Novakovic: And where does it fit in? Is it integrated into the IDE, or is it a separate web interface? Where do you see that? Because it sounds a little bit like an IDE feature, right?

Micha Hernandez: To some extent, yeah. I think we could definitely capture this in a VS Code extension, but right now it is a separate web interface, similar to Postman. It looks quite similar to one of those HTTP testing clients: it has your routes on the left, your payload in the middle, and then your response on the right.

Mirko Novakovic: I would say now you have three tools, right? You have the notebooks, you have Autometrics, and you have the API debugging tool. Do they fit together? Do you see a way to basically bring them into one Fiberplane, or are they three different tools for different problems? How does that work?

[00:22:41] Chapter 10: Holistic Approach to Observability

Micha Hernandez: Yeah, I think, again, we've been in this space for a long time, and we're chipping away at this problem of, I would say, there being no on-ramp for observability, right? There are sort of these two discrete worlds. And we've experienced as well that it's sort of a series B plus type of company, in my opinion, where, hey, you're making revenue, you've got users, and you hire your first set of SREs to set up the dashboards and set up observability. But if you're at seed stage or series A and you're still figuring things out, like getting to product-market fit, you don't have time to invest in observability. What do you use then? Right? And unless you maybe have the discipline from a previous company, where you learned that you need to set all of this up in advance, it's hard to get started with this stuff. And I think there's a piece of that Charity Majors article that speaks to this: the world's gotten pretty complex in observability land. How do you get started, right? We've got all these different data types for observability; maybe, you know, having well-structured logs or traces as the format is the way in. So coming back to your question, I do see a world, maybe in the near future, where we get an alert, and instead of you going to a dashboard and figuring things out, there's this report, via the notebook, that's generated from that alert, that says: hey, I've looked at the previous data, I've looked at the previous notebooks.

Micha Hernandez: This is what I think is going on. Mirko actually solved it a month ago. This is what he did, this is the runbook that he went through, and here are some charts, right, that showcase the data. So I think eventually that notebook form factor, those reports, that actual intelligence, right, that makes sense, and that's where it comes back. I think there's still, with the more recent product, this lack of an on-ramp for developers at the earliest stages. How do they get started in this world? And I think the way in is not through observability; the way in is through a dev tool that is powered by observability. I think the Autometrics thing was more of an experiment where we got some learnings, right, around function-level intelligence, you know, being able to go to the line of code that's misbehaving based off an alert. So there's quite some knowledge and pieces that we can use from that framework. But I would say the future is probably more geared towards OpenTelemetry and traces, right, than it is towards metrics.

[00:25:43] Chapter 11: Industry Trends and Serverless Challenges

Mirko Novakovic: Yeah, I would totally agree with that. But I also think there are two things. One is, I totally like the idea; I'm just thinking it through, right? Because now with LLMs, you could actually not only learn from your notebooks, you could generate them, right? That's a pretty cool use case, because then you can see an issue on your API level and then generate a notebook for the team that's based on the knowledge from this and other customers, right, because it learned from it. And it also makes it very understandable, because there's text, there are graphs. It gives you kind of a journey through the problem, right? Not only a graph.

Micha Hernandez: Yeah, I agree. I do think there are some tricky issues. I think this can work maybe based off just OpenTelemetry data. I think it's harder with how Fiberplane started out: oh, you've got Elasticsearch, you've got Prometheus, maybe you've got CloudWatch going on. You've got all these different sources of data that might be very different for each customer. How do you make sense of all that, and how do you counter, I would say, the blank canvas problem? You do need a place to start.

Mirko Novakovic: Yeah, absolutely.

Micha Hernandez: And having a common form factor such as OpenTelemetry, I think, probably helps there.

Mirko Novakovic: Yeah, especially for APIs. Also, I agree with your statement that probably something like a span or a log will work better than a metric, right? Because there are cases, and that's something I'm sometimes missing in this whole conversation, where you just have metrics. CPU, memory, some other stuff: you only have metrics, right? This is not derived from a trace or from a span. But then you have other things, like counting the number of calls to an API. You can either have a metric, with Autometrics, right, or you have all the spans, and then you do a count of the spans to that call, and you also get the count, but it's derived from the spans. And what that adds is context, right? Because now you not only have a metric, you can also understand how many of those calls were by customer X, or had this parameter set, or came from this country. So you get this power of what's called high cardinality, and then you can really debug things. And what Honeycomb calls BubbleUp, right, is the feature that then automatically does the analysis and compares: oh, there was a problem, and here are the attributes that look suspicious, because all the problematic ones had country code China, right, as an example.
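Both ideas in this exchange, deriving a count metric from spans and the BubbleUp-style attribute comparison, can be sketched in a few lines. This is an illustrative toy, not Honeycomb's actual algorithm, and the span dicts are made-up sample data: each span carries a high-cardinality attribute (`country`) and an error flag, and we surface attribute values that are over-represented among failing requests.

```python
from collections import Counter

# Made-up spans: each has a high-cardinality attribute and an error flag.
spans = [
    {"country": "DE", "error": False},
    {"country": "DE", "error": False},
    {"country": "CN", "error": True},
    {"country": "CN", "error": True},
    {"country": "US", "error": False},
]

def bubble_up(spans, attribute):
    """Return attribute values more common in failing than healthy requests."""
    failing = Counter(s[attribute] for s in spans if s["error"])
    healthy = Counter(s[attribute] for s in spans if not s["error"])
    # A value seen more often in errors than in successes is "suspicious".
    return [value for value in failing if failing[value] > healthy.get(value, 0)]

# The request-count "metric" is just derived from the spans themselves.
print(len(spans))                   # derived call count: 5
print(bubble_up(spans, "country"))  # ['CN']
```

Because the count is derived from spans rather than pre-aggregated, the same data answers both "how many calls?" and "which attribute values are correlated with failures?", which is exactly the high-cardinality advantage Mirko describes.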

Micha Hernandez: Yeah, I do think your point on CPU, right, the infrastructure-level metrics, are often a bit forgotten in this conversation. There are obviously still customers and companies that are not building on serverless, right, that still have regular servers, or containers even, where these metrics are important to understand.

Mirko Novakovic: Yeah, absolutely. Or whole Kubernetes environments, right? You have tons of metrics on each of the elements of the Kubernetes stack, which are especially important for the SREs and the platform team to understand sizing issues, utilization, etc.

Micha Hernandez: And these metrics need to be correlated with the application level metrics as well. Right.

Mirko Novakovic: Exactly, exactly. That's why I'm a big fan of not saying, oh, metrics are kind of bad, right? I think the best tool can correlate the metrics with the logs and traces, because then you get the full view, and you can even pinpoint infrastructure-related problems. It could be that your API call gets slow because the underlying infrastructure is under-provisioned on Kubernetes and it doesn't get enough CPU; that's why it's not responding, right? Let's be fair, that's a problem that's not that common anymore, because you can just automate, spin up more resources, or use serverless, where that doesn't matter as much. But it's still a very complex environment where you need to correlate metrics with logs and traces, in my point of view.

[00:30:20] Chapter 12: Building Blocks of Tomorrow

Micha Hernandez: Yeah. And of course, serverless is an entirely different beast to debug as well.

Mirko Novakovic: Oh, yeah. Oh, yeah. And is that important for you?

Micha Hernandez: Well, yeah. So the framework came out of Cloudflare, I think, not officially, but as sort of a side project from one of the engineers there. But obviously it works well in serverless environments. It works in Cloudflare Workers, it works on Deno, it works on Fastly as well. So yeah, that's a big thing for us as well, to help you debug.

Mirko Novakovic: Yeah. It's getting bigger and bigger. We hear it from more and more customers: they are switching at least parts of their workloads to serverless platforms in their cloud environment, or to other platforms, absolutely, or things like Vercel, right, where you get closer to the edge. And that's really interesting. And finally, you said that you are very interested, especially with your venture capital investments, in the building blocks of tomorrow. I like that. Can you give a good answer about what the building blocks of tomorrow are?

Micha Hernandez: So, yeah, the fund is called NP Hard Ventures, and the theme is building blocks for tomorrow. That could be tools and platforms that help us build the future. Obviously developer tools and new layers of infrastructure are a good example of that, and we've done quite a few of those. But then also, more and more, robotics companies. We invested in a company called Monumental, and they're actually building a bricklaying robot. Imagine a wall for a house that needs to be built: they've built autonomous robots that can lay that brick wall, brick by brick, and do that quite accurately. So that's a literal building block of tomorrow.

Mirko Novakovic: Yeah. I think robotics is one of the biggest and most interesting areas, especially combined with AI, right, if you get this autonomous part right. I think that's pretty interesting. And I can tell you from Germany, I'm a little bit in the construction business, and getting people who actually know how to do those things is getting more and more difficult. I think we need to have alternatives in the near future, because there will not be enough people.

Micha Hernandez: We'll just lose the skill set. Yeah.

[00:32:54] Chapter 13: Conclusions and Reflections

Mirko Novakovic: Perfect. Micha, that was really interesting. It was fun talking to you. It was also very interesting to see, and it's clear that you're an experienced founder, how you started with one idea and then, by collecting feedback and doing experiments, got to a totally different product, essentially by utilizing the feedback of customers.

Micha Hernandez: Yeah. As I said, we like to keep iterating on this problem set around observability. So, yeah.

Mirko Novakovic: Yeah, great. As an investor, I'm really happy to watch you succeed with it. Thanks, Micha.

Micha Hernandez: Appreciate it.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
