[00:00:00] Chapter 1: Introduction to Code RED and Langfuse
Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0. And welcome to Code RED: code because we are talking about code, and RED stands for requests, errors and duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Marc Klingen. Marc is co-founder and CEO of Langfuse, an open source LLM engineering platform founded in 2023. Langfuse focuses on helping teams accelerate their LLM application development. Today, it's used by thousands of teams including Twilio, Samsara and Khan Academy. I'm excited to learn more. Welcome to my show, Marc.
Marc Klingen: Yeah. Hi, Mirko. Thanks for having me.
Mirko Novakovic: Yeah, absolutely. And we always start with the first question, which is what was your biggest Code RED moment.
Marc Klingen: I think the biggest Code RED since founding Langfuse was definitely last year when the project took off. It's a very data intensive product, and we had this version two, which was running on Postgres in a single Docker container, very simple, how a team gets started to just build things and see whether they work. Then we were working on version three, which runs on ClickHouse, a multi-container setup, caching, queuing, everything you would imagine for a data intensive product in the observability space. As we were working on version three, we neglected version two a bit, and the SLOs were degrading, but we didn't know because we didn't have the setup. Some customers churned, but luckily they turned to self-hosting on their own infra, because they just forked and improved the tiny bit that caused issues for them in the whole setup. But yeah, this was painful. We learned our lesson to always monitor SLOs and never ignore warning signs, even though there might be the hope that a new version will fix all of this in the future. So yeah, this was a big Code RED and a big learning for us.
Mirko Novakovic: Yeah, absolutely. But I always say this is kind of a good Code RED, right? If as a startup you crash because you have too many customers. That's a good problem to have, right?
Marc Klingen: Yeah. I think for us it was really the lesson that it's great to get started as crappy as possible, just to validate that people actually want this. And if these teams then scale, and many of them operate B2C applications with a lot of scale, and that causes scalability problems on our end, then we prioritized in the right way by not focusing on scalability from the get-go.
[00:02:34] Chapter 2: The Genesis of Langfuse
Mirko Novakovic: Yeah, absolutely. So how did you get the idea? How did you get started with Langfuse?
Marc Klingen: Yeah, my two co-founders, Max and Clemens, and I initially worked on other topics. But then we got into Y Combinator early 2023 and realized our other idea was quite generic. We had a couple of customers, but the GPT-3 API was available, and there was this big new solution vector to all sorts of problems where you don't need to call up customers and figure out what their problems are; it's very obvious what kinds of problems exist, people want automated code changes or automated market research. So we shut down our previous product and started building LLM based applications, and then realized it's really easy to get to a demo, but it's really difficult to build a meaningful application that's more complicated. As we were going through a couple of these projects using LLMs, this was a continued problem, and all of the teams around us had the same issue. We felt like we were late to this market, but it turns out 2023 was still early, and our tooling was more impressive than our GitHub issue to pull request automation, which with the GPT-3 API was a very impressive demo but never worked on actual applications back then. So I think it was a good call to just pivot to our tooling.
Marc Klingen: Initially, we sold it to a couple of YC companies, so more early stage startups. But then we went open source, and quickly after launch we had a large install base in more of an enterprise setup, and then grew from tracing and observability into evals, prompt management and adjacent tooling. And yeah, today there are tens of thousands of teams using it, some of them very big brands. Generally we just believe that on the application level there are so many new challenges that need to be figured out that are very distinct from traditional observability. So super excited to chat. And thanks, I think we once had a conversation on OTel and your view on the space, which has been very helpful for us. So excited to be back.
[00:04:27] Chapter 3: Y Combinator Experience
Mirko Novakovic: Yeah, absolutely. And tell me a little bit more about this Y Combinator story. Did that all happen during that, the three month period, right?
Marc Klingen: Yeah, it's more like a four month period. What happened within four months was basically joining, shutting down one product, then iterating through 3 or 4 different products, and then realizing that our fourth product, this GitHub issue to pull request bot, is cool but won't work for the foreseeable future, and that our tooling is interesting. But back then, we just had a prototype. Over the next 1 to 2 months, we got to something that we could launch. YC went on until April and we launched, I think, in June, so two months later. That was the point where we felt like, okay, now a team that doesn't know us can pick this up. So that was the timeline.
Mirko Novakovic: Still sounds like a pretty crazy speed of iteration and work in that short period of time.
[00:05:22] Chapter 4: Evolution from Observability to Engineering Platform
Marc Klingen: Yeah. I think our technical founding team really helps to just get something out. Because in the end, when developers are the customers and it's a very new market, it's not like your market, where you have an idea of what the product is going in, it's a big technical lift from the get-go, and you have a clear path to what to build initially. Here it's way more about understanding what the workflows of these teams really are and how this works. And nobody knows. So it's really a closed loop with customers.
Mirko Novakovic: And now, I mean, maybe I'm wrong, but as far as I got it, at the beginning you said you were an LLM observability company, and now I read it's an LLM engineering platform. So what is the difference? What does it mean, and what are the additional things? Tell us what Langfuse is doing.
Marc Klingen: Initially we went in with the problem that if you build a GitHub issue to pull request bot, then when you want to debug it, for example when the wrong files were chosen or a diff wasn't applied correctly, you want to see logs of the individual LLM calls or tool calls which manipulate files. Now, there are options to do this in your traditional observability setup, but long text blobs don't really look good or don't really work there, meaning you either consume them as logs or you need to write them to some sort of application database and then build your own rendering tooling around viewing these logs. So just a lot of text to process. And also you don't know whether it works based on status codes, because the bot will return a 200, but maybe the code change doesn't make sense. So you have an asynchronous point in time where you might get user feedback or some sort of evaluation mechanism and then realize it doesn't work. So what we started with was basically reimplementing OTel tracing with an LLM application specific UI, which was really helpful for teams starting out to just understand how the application works. But then why did we change the description? Basically because people need traces, and traces are a great abstraction to describe what's going on in an application.
[00:07:27] Chapter 5: Understanding the Langfuse Platform
Marc Klingen: But what you really need then is a structured way to evaluate whether it's working well and whether it's secure and safe. And this is very distinct from traditional observability: here you usually want to keep traces for years to then fine tune down the road, you have different evaluation patterns that overlap and that you then need to contrast to understand what's good and bad and what's working, maybe an intent classifier to understand what your users do with the application, because it's a string-input bot and the input can be anything. So many new challenges. And then lastly, we went down the route of helping cross-functional teams engineer together, because usually you now have more of a split: an application team building the platform that's then scalable, and domain experts, PMs, executives being in the product, iterating on prompts and seeing how it impacts the application. And yeah, observability basically doesn't capture this kind of end to end workflow tool that we built. If you have a good idea what we could call this, let us know, because engineering platform sounds quite generic and sounds like platform as a service, which it isn't, because we are fully async and generally don't fiddle with the runtime or what's going on there.
Mirko Novakovic: But when we spoke last time, we also talked a lot about tracing and how well it works. As a first version, can you walk me through what a trace looks like for an LLM application? So I prompt something; is the prompt the first span in the trace? Or how does the trace look?
Marc Klingen: Yeah. Let's take a simple example. Some kind of support chatbot is probably something everyone can visualize. What would you assume? You have a customer message, and then you do one LLM call to respond to the user. That might work as a demo, but usually there are a couple of things that you want to add, like security checks: is the user question in the distribution of things that you would expect and that you want to respond to, is it safe, for example? Second, you usually want to retrieve additional context, like standard operating procedures or knowledge documents, so you have a retrieval problem. And then lastly, you often have multiple LLM calls to actually produce the response, for example first distill a large context down, then summarize the response, maybe then rewrite it to fit the style that you have within your organization. So there are tons of different things that people could optimize to improve quality, latency or cost. And how would it look for this application? One trace is basically a single message: you get a customer message in, and then whatever goes into producing the response is a single trace. But usually you have spans for the security check and the embedding process, and the embedding process then has subspans. For example, you need to embed the user message to then do retrieval against your vector store; you get documents back; then maybe you want to summarize the documents and rank them again, because maybe the vector similarity alone isn't enough to get to a good ranking of what's relevant; then you summarize them again, and then maybe do a final security pass to make sure that you haven't suggested any kinds of discounts that don't exist or something. So there are these step by step issues. And why are traces interesting? Because you want to evaluate each of these steps individually.
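To make that structure concrete, here is a minimal sketch of how such a pipeline could be instrumented as one trace with nested spans, using the OpenTelemetry Python API. All helper functions (run_safety_check, embed_query, retrieve_docs, rerank, generate_answer) are stand-ins, not real library calls, and the span and attribute names are illustrative.

```python
# Minimal sketch: one trace per customer message, with nested spans for each step.
from opentelemetry import trace

tracer = trace.get_tracer("support-chatbot")

# Placeholder implementations so the sketch runs on its own.
def run_safety_check(text): return True
def embed_query(text): return [0.0] * 8
def retrieve_docs(vector): return ["doc-1", "doc-2"]
def rerank(question, docs): return docs
def generate_answer(question, docs): return "Here is how you can reset your password."

def handle_customer_message(message: str) -> str:
    with tracer.start_as_current_span("handle_customer_message") as root:
        root.set_attribute("app.input", message)

        with tracer.start_as_current_span("security_check"):
            run_safety_check(message)          # is the question in distribution / safe?

        with tracer.start_as_current_span("retrieval") as retrieval:
            with tracer.start_as_current_span("embed_query"):
                vector = embed_query(message)  # embed the user message
            with tracer.start_as_current_span("vector_search"):
                docs = retrieve_docs(vector)   # query the vector store
            with tracer.start_as_current_span("rerank"):
                docs = rerank(message, docs)   # similarity alone may not rank well
            retrieval.set_attribute("retrieval.document_count", len(docs))

        with tracer.start_as_current_span("generate_answer") as gen:
            answer = generate_answer(message, docs)  # one or more LLM calls
            gen.set_attribute("app.output", answer)

        with tracer.start_as_current_span("final_safety_pass"):
            run_safety_check(answer)           # e.g. no invented discounts

        return answer
```

Each nested block becomes a span that can later be evaluated on its own, which is the point Marc makes about traces as the abstraction.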
[00:10:31] Chapter 6: Use Cases and Users
Mirko Novakovic: Though it sounds a little bit different structure-wise. It sounds more like it looks like a chat, right? Something here talks to something, it comes back. So it's not like a classical trace where you say something and then there's a deep trace that does some database calls and comes back with a response. It sounds more like it's going back and forth. Is that something you represent in the visualization, or do you use a normal trace visualization for it?
Marc Klingen: I think the back and forth happens within more of a conversation that these applications have with users, or if it's a threaded kind of application. We have a session ID, which is basically a correlation ID that you can add to traces, which then rolls up again into a chat view. Because if, for example, a customer support agent or domain expert wants to replay what went on in the application, it's way more helpful to see the root span level input and output of these traces and have them rolled up into a conversation. So that's something that's very application specific. But the individual trace is a very classical trace UI: seeing the different spans, being able to navigate them, seeing attributes. They just all have these specific input and output fields that can hold, for example, ChatML, which is the chat message data structure that many LLMs use. So we have some opinionated UI for things that we expect in this context, but the single trace view is quite generic, I'd say.
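As a hedged illustration of that correlation ID idea, the sketch below tags every trace of a conversation with the same session identifier so a backend can roll the turns up into a chat view. The attribute names (session.id, session.turn) are illustrative, not a fixed convention.

```python
# Sketch: each conversation turn is its own trace; a shared session identifier
# is the correlation key that lets a backend reassemble the chat view.
import uuid
from opentelemetry import trace

tracer = trace.get_tracer("support-chatbot")

session_id = str(uuid.uuid4())  # created once when the conversation starts

def handle_turn(turn_index: int, message: str) -> None:
    with tracer.start_as_current_span("handle_customer_message") as span:
        span.set_attribute("session.id", session_id)    # same for every turn
        span.set_attribute("session.turn", turn_index)  # ordering within the session
        span.set_attribute("app.input", message)

handle_turn(0, "My invoice looks wrong.")
handle_turn(1, "It was charged twice.")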
Mirko Novakovic: No, no, I get it. I saw this whole session as one trace, but now I get it: there is a session, but each individual call is a trace, and then you can roll it up, basically like RUM, real user monitoring, for this whole conversation where it goes back and forth. You just show the chat conversation, but each individual call is a trace where you can deep dive and see what's actually happening underneath. That's pretty cool. And also, who are the users? I can imagine that's also something the business users can look at, because they want to understand how users interact with the system, right?
[00:12:25] Chapter 7: Role and Benefits of Evaluation in LLM Development
Marc Klingen: Yeah. Generally, two parts to the response. Initially we mostly worked with startups building new things, but now, as the platform is more mature, it's generally very appealing to platform teams and larger companies, because many platform teams are tasked with having some sort of infrastructure that scales and generalizes well across hundreds of different AI use cases to accelerate these teams. So these are generally the people who bring us in. But the core audience is usually initially engineers who are tasked to build something that's useful. Quickly, though, it becomes a very cross-functional problem: engineers working on it, ML engineers or researchers trying to figure out how to improve a specific step, product managers who come from the domain side, people who maybe do the job right now, so often support or sales people who then own the use case, and then often executives who really wonder what's going on and want to understand and learn. These topics now have boardroom level attention, and people really want to push their agenda, so understanding how this application works is something many people are curious about. It's a very cross-functional problem, which for us is kind of a challenge, because the UI needs to work for all of these people who usually don't spend all of their day in Datadog or Grafana. So we have different UI components that appeal to a more technical or less technical audience.
Mirko Novakovic: Yeah, that makes total sense. By the way, when we search for AI engineers at Dash0, we figured out that that's a very wide range of different types of people, right? There are people who are more on the data science side, right, Python; then there are people who do product engineering; and then there are people who can actually bring that to production at large scale. Very different types. Do you see the same? Is there kind of a new AI engineer type, or do you also see these very different types of engineers working on these AI projects?
[00:14:21] Chapter 8: Managing LLM Applications in Production
Marc Klingen: I think in startups or small companies, where maybe one AI-related feature is delivered by a single person, this might be an AI engineer, might be a new kind of profile. But in larger teams, I would say any software developer who has done AI projects before is then an AI engineer. They would always have a PM who is very experienced in these kinds of things, or a data scientist or ML engineer, because what software engineers usually aren't that used to is this kind of iterative development based on the data they see in production: evaluating what's going on, taking things back, having very robust data sets for testing. It's not like you can spec something, for example "we want to automate customer support", and get from spec to something that works immediately. It's more: how do we learn iteratively about the problem, which is usually something software developers aren't used to. So I would probably say an AI engineer is everyone who has done this before in another project. But the teams we see move fast are usually more cross-functional, where someone brings this experience in, and the software engineers bring in more of the pragmatism and the experience in scaling the actual application, because many of the ML engineers haven't done that before. So both sides need to learn.
Mirko Novakovic: And if I start as an engineer and I start building my AI projects, when is actually the point where I look for something like Langfuse? Is that when I bring it into production and have the first problem? Or is it something I already use during the development phase to figure out how I build the product? What do you see? Because with observability in my space, for the last two years I've always tried to get the tool into the early phase, but nobody's using it there. Everybody's using it when something crashes, right? Then they need something, but nobody uses it during development. Is it similar here, or how do you see that?
Marc Klingen: It depends on the use case. I mean, we think a lot about this, because obviously we want to create good documentation for these kinds of points in time. We see three big points in time. One: I have a very complex application, and while I'm building it I struggle with understanding how it works, even on a few test cases, and I just want to see a nice graph or trace view to understand what's actually going on. Because I've built so many abstractions, so many prompt templates that feed into each other, I want to be able to see why it failed when it fails, because even just having the logs in my terminal becomes not digestible. That's one point in time. Second: if my application is scrutinized a lot, if I for example have a regulated environment or work in a larger company where the risk appetite is low and I need to demonstrate that my application actually works before releasing it, then usually teams put a lot of effort into development data sets and then evaluations to prove that on the expected set of inputs, the application behaves correctly. That would be the second point. And the third: if my risk appetite is higher, then usually I launch something into production. Let's assume I built a customer support chatbot tool for agents, for actual humans. Then probably these humans have a high risk tolerance for it just creating a suggestion, because they still review it before sending it out to a customer. But for these kinds of suggestions that I generate, I then want to learn how many of them are accepted by the support agents, to iteratively see what works in production. Because if I have thousands of users, I need tooling around this. So that would be the third moment in time: you have something, and then you want to learn from the mixed feedback that you get.
[00:17:57] Chapter 9: Interaction with Traditional Observability Tools
Mirko Novakovic: Yeah. And do you also help with the second and third use case? So do you help me understand if, for example, an answer was too rude or not compliant with some rules that I have? Can you help with that too?
Marc Klingen: Yeah. Evaluation, which would be the term for this, is generally the core value proposition of what we do. Traces are how you describe what's going on, both in development and production, and evaluation is how you label and annotate them to make them useful and to be able to drill down on things that are interesting. Evaluation is usually a mix of direct user feedback, people manually labeling data to enrich it, and some automated techniques. Automated can mean: I generated SQL and it didn't work, I couldn't execute it, something you can understand at runtime. But it can also mean I use a larger model to have a look at this again and understand, for example, does my therapy conversation match what I expect therapy to look like? So there are different ways to evaluate, and we have some documentation on this and have published a couple of blog posts. It really depends on the application that people build. But as with all features that we build, we have a very open API, where usually the teams have an idea of what they want to achieve, and then they build this around Langfuse and use us as modular blocks for the things they don't want to implement themselves.
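As a rough illustration of that mix, here is a hedged Python sketch of the three evaluation sources Marc lists: a deterministic runtime check, an LLM-as-a-judge pass, and direct user feedback, all attached to a trace as scores. `judge_llm` and `score_trace` are placeholders, not real SDK calls; a real setup would route scores through whatever score API the platform exposes.

```python
# Hedged sketch of async evaluation: checks and feedback become scores tied to a trace ID.

def judge_llm(prompt: str) -> float:
    return 0.9  # placeholder: a larger model rating the output from 0 to 1

def score_trace(trace_id: str, name: str, value, comment: str = "") -> None:
    # Placeholder for the platform's score/annotation API.
    print(f"score {name}={value} -> trace {trace_id} {comment}")

def evaluate_trace(trace_id: str, user_input: str, output: str) -> None:
    # 1. Deterministic runtime check (e.g. generated SQL failed to execute).
    score_trace(trace_id, "non_empty_output", 0 if output.strip() == "" else 1)

    # 2. Model-based check, run async after the fact.
    rubric = (
        "Rate 0-1 how well this support answer addresses the question.\n"
        f"Question: {user_input}\nAnswer: {output}"
    )
    score_trace(trace_id, "answer_quality_llm_judge", judge_llm(rubric))

def record_user_feedback(trace_id: str, thumbs_up: bool) -> None:
    # 3. Direct user feedback collected from the UI.
    score_trace(trace_id, "user_feedback", 1 if thumbs_up else 0)

evaluate_trace("trace-123", "Why was I charged twice?", "You were charged once for each seat.")
record_user_feedback("trace-123", thumbs_up=True)
```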
Mirko Novakovic: But you are just monitoring, you're not blocking it, right? Or would you also block? So if you see an answer is wrong, it still got answered, and you would alert on it or show that there were wrong answers. Or would you also interact or interfere with the answer and say no, this one is not compliant, so I block it?
Marc Klingen: Yeah. There's generally a difference between evaluation and guardrails, where guardrails are this kind of at-runtime evaluation to take action and change what's happening at runtime. For guardrails the requirements are very different: you want to run them with low latency directly within your application stack, whereas evaluations happen async after the fact, and you don't know at runtime whether things actually worked. You usually do a mix of both. Langfuse generally focuses on evaluation, because for guardrails you don't want something async with a big footprint in your application; a smaller classifier that you can run within your application is preferred. And there's a lot of open source tooling there, like NeMo Guardrails or LLM Guard; we have a list of these things. But the problem with guardrails is that, given the priorities they have, like a low memory and CPU profile and running within the application, they might not work, so you need to evaluate them again. We generally recommend trying a couple of these open source solutions and then monitoring in Langfuse, by scrutinizing them manually, whether they actually reflect the correct conversations. So we rather focus on evaluating the guardrails, and on logging the guardrail results to us so you can monitor whether they work.
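A small sketch of the split Marc describes: a cheap, low-latency guardrail runs inline and can block, while its verdict is also recorded on the trace so it can be evaluated asynchronously later. `cheap_toxicity_classifier` is a placeholder for something like a small local model, not a specific library, and the attribute names are illustrative.

```python
# Inline guardrail + logging its decision for later offline evaluation.
from opentelemetry import trace

tracer = trace.get_tracer("support-chatbot")

def cheap_toxicity_classifier(text: str) -> float:
    return 0.02  # placeholder: probability that the text is unsafe

def guarded_reply(draft_answer: str) -> str:
    with tracer.start_as_current_span("output_guardrail") as span:
        risk = cheap_toxicity_classifier(draft_answer)
        blocked = risk > 0.5
        # Record the guardrail decision so it can itself be evaluated later
        # (did the guardrail block the right things?).
        span.set_attribute("guardrail.risk_score", risk)
        span.set_attribute("guardrail.blocked", blocked)
        if blocked:
            return "Sorry, I can't help with that. A human agent will follow up."
        return draft_answer

print(guarded_reply("You can reset your password from the account settings page."))
```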
Mirko Novakovic: Yeah, that makes total sense. And it's pretty interesting, I mean, you can see that it's very different from a normal application and from what you would normally do. So it's a very different type of tool. But at the end of the day, it will also be part of a larger application in a lot of cases, right? So how do you interact with observability tools? Because if you are doing this part, as far as I understand, it's interacting with the outside world, which is a larger application stack. So somehow there must be a combination of data that you send to those tools or retrieve from those tools. How do you see that working out?
Marc Klingen: Usually there are some shared IDs, so using the same trace ID and then deep linking between tools, or somehow correlating this. Some of our users ask us: do you do logs, can you do metrics, can you have a Prometheus endpoint that we can scrape? Generally we invest into compatibility there. But the problem I described earlier is very different: it's not an ops team that wants to get alerted when OpenAI is down, but rather a team that struggles with users having mixed feelings about the application and with what to do to improve it. So generally Langfuse is used alongside the observability stack, because the problem is way more cross-functional and people spend way more time in the app; it's not an alerting use case, it's more an iterative understanding-what's-going-on use case. How do we then interact when teams pick up more OpenTelemetry instrumentation, also for the observability use case? Usually they just report these traces and export them both to the observability stack and to Langfuse. In the observability stack they might report overall usage numbers or latencies of their providers or gateways. But the quality aspect, understanding where it works, is really difficult to make work in an observability stack, and the providers that offer observability capabilities usually stop at aggregating metrics that are available at runtime. Because what we need to do to make this work just doesn't scale like an observability tool in the classical sense: running Langfuse is way more expensive on a per-span basis to enable these kinds of workflows. So I think it's just two distinct problems and two distinct kinds of workflows.
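A minimal sketch of that "export to both" setup, assuming the OpenTelemetry Python SDK with the OTLP/HTTP exporter: the same tracer provider gets two span processors, one per backend. The endpoint URLs and auth headers are placeholders, not real values for any particular vendor.

```python
# Sending the same OpenTelemetry traces to two backends by attaching two
# span processors to a single tracer provider.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Classic observability backend (latency, errors, usage dashboards).
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="https://otlp.observability.example.com/v1/traces",   # placeholder
    headers={"Authorization": "Bearer <token>"})))                 # placeholder

# LLM engineering platform (evals, annotation, prompt iteration).
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="https://llm-platform.example.com/otel/v1/traces",    # placeholder
    headers={"Authorization": "Basic <key>"})))                    # placeholder

trace.set_tracer_provider(provider)
# Both backends now receive the same spans; deep links can correlate on trace ID.
```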
[00:23:07] Chapter 10: The Future of LLM Instrumentation
Mirko Novakovic: Yeah, you already answered my next question. I would have asked, because you can read that almost every observability vendor is releasing some sort of LLM monitoring, where's the difference? But I understand you're saying the observability vendors do what we would also do: classical monitoring, observability metrics, analyzing it. What you're doing is more on the evaluation side, which is something totally different from normal observability, and also very expensive. I can see that, because you probably use LLMs to evaluate, too, in that stack.
Marc Klingen: Yeah. And we have a lot of workflow queues on top, for example to decide, rule based, for every trace that's ingested whether it should be evaluated, all of these kinds of things. In an observability sense, you just collect a ton of data and fetch it, right? You never update. In our kind of stack you usually want to update: you might have a data scientist who wants to fetch a million traces, cluster them, and report tags back to Langfuse, and then we need to update these records. Usually you wouldn't be able to update a trace after it happened, but we need to make this work, because otherwise you can't work with the data and learn what you want to do with it. So yeah, that's the difference. I think cost reporting and latency of APIs usually happen within Dash0 or Grafana, but our workflows, I think, are very distinct.
Mirko Novakovic: Then I actually like the LLM engineering platform name, because I think it is better than just saying observability. With observability you would think more of the metrics, logs, traces use case, where your use case is different, right? It really gives you a way of iterating and understanding the application better. Makes total sense to me.
Marc Klingen: Yeah, the big similarity is that we also need to instrument, and you can view traces in the product. Historically, MLOps focused more on logs, on single model invocations, whereas now these applications are more complicated: they do many things before returning something to the user. So traces are the abstraction. I think that's why we now see a convergence of observability and MLOps tooling, which then looks like observability but is more like MLOps tooling.
Mirko Novakovic: Yeah. And talking about instrumentation, I think last time we discussed that, and also discussed OpenTelemetry. If I remember correctly, you said OpenTelemetry and the libraries that are there are not there yet, in terms of semantic conventions being specified for the different models. So it's not yet standardized enough to really build upon, which is why you build your own instrumentation, essentially. If that's still the case, how do you see your instrumentation, OpenTelemetry, and the future of instrumentation of LLMs?
Marc Klingen: Generally, I'm very excited about OpenTelemetry. As the space is moving quickly, many teams operate all sorts of different application stacks. Especially in the enterprise, they usually won't just switch to a JavaScript or Python stack because all of the LLM products are built on those stacks; they rather want to do it in Spring or Go, so having compatibility with these stacks and having standard exporters is really, really helpful. Also, the number of frameworks that people use to build these applications is increasing a lot. Initially we built native instrumentation for many of the core frameworks, like the OpenAI SDK, LlamaIndex, the Vercel AI SDK. But the space is growing quickly, and historically every vendor in our space built their own instrumentation. I think it's very exciting that now it's rather the frameworks building their own native instrumentation and just reporting OTel spans to whoever wants to collect them. The current state is that there's decent coverage of the GenAI semantic conventions within OpenTelemetry, which makes it usable. But at the same time, the space is still moving extremely quickly. Initially it was just token counts, for example, that go into cost reporting, which is input and output.
Marc Klingen: And now we have cached tokens, multimodal tokens, so many different types like reasoning tokens; or for multimodality, we now have images that are blobs you need to upload directly, for example to some sort of blob storage. So things are moving very quickly. But at the same time, the things that have been around for a year are increasingly standardized. So what we do is basically both: we continue to improve our more vendor specific yet open source instrumentation, while increasing compatibility with whatever is standardized already. And we are looking into switching our own SDKs in the next major version, maybe also to use OpenTelemetry more natively under the hood, and then we'll just carry a couple of attributes that are Langfuse specific and roll them over to the semantic conventions whenever something is standardized. Generally I'm very excited about it, because otherwise we basically fiddle with problems like how to export traces from AWS Lambdas, which is something that's solved in the AWS Lambda stack already, and we would need to reinvent the wheel on many of these things, which is something we are not excited about.
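For readers who want to see what that looks like in practice, below is an illustrative sketch of an LLM call span carrying attributes from the (still incubating) OpenTelemetry GenAI semantic conventions, written with the OpenTelemetry Python API. The attribute names reflect the convention at the time of writing, but, as Marc notes, the spec is moving quickly, so treat them as an example rather than a stable contract; the model names and token counts are made up.

```python
# Sketch: an LLM call span annotated with GenAI semantic convention attributes.
from opentelemetry import trace

tracer = trace.get_tracer("llm-client")

def traced_chat_completion(prompt: str) -> str:
    with tracer.start_as_current_span("chat gpt-4o") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")

        completion = "..."  # placeholder for the actual provider SDK call

        # Token usage feeds cost reporting; newer token types (cached,
        # reasoning, multimodal) are still being standardized.
        span.set_attribute("gen_ai.usage.input_tokens", 250)
        span.set_attribute("gen_ai.usage.output_tokens", 120)
        span.set_attribute("gen_ai.response.model", "gpt-4o-2024-08-06")
        return completion

traced_chat_completion("Summarize this support ticket.")
```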
[00:28:25] Chapter 11: Frameworks in LLM Development
Mirko Novakovic: No, I can tell you, from Instana times, when we supported all the different environments, I had, I don't know, 50 or 60 engineers working only on custom instrumentation, and it breaks all the time because stuff changes and then you have to change your instrumentation. So yeah, that's why I love OpenTelemetry: let's really put that into the hands of the community and not do it for every framework yourself. But as far as I understood, you're saying that some of the vendors of the models already integrate observability based on OpenTelemetry, or did I mishear that?
Marc Klingen: It's less the models. Yes, there are standard OpenTelemetry instrumentations for many of the LLM vendor SDKs, and for a vendor SDK that's interesting. But what's even more interesting is when people use some sort of framework to build their applications, because those are way more complicated, so having native instrumentation for these kinds of frameworks is the even bigger lift. But yeah, there's increasing coverage of decent instrumentation in the space.
Mirko Novakovic: That's good to hear. And the frameworks, I'm not that familiar with the space, but what do they do? What is the framework doing on top of the model? Is that kind of an application framework to build applications on top of the models, or what do they do?
Marc Klingen: Going back to the earlier point that many applications are more than just a single LLM call. Take, for example, applications like deep research by OpenAI, where you have an input question and a model searches the web over and over until it can provide a research report. For how to build something like that, something that continuously loops, does tool calls to browse the web, and feeds the results back into some sort of memory or state, there are common patterns. And for teams that want to jumpstart the development of these common usage patterns, some frameworks developed that just have standard templates for these kinds of things, which are interesting to get started, or interesting if you're a large organization with tens or hundreds of different application teams where you don't want to invest internally in building abstractions but rather adopt some sort of standard from outside. I think the take on these frameworks is mixed in our customer base, because some just feel like they have a big external dependency in a space that's moving really quickly. And there's good writing on the idea that, with increasing model capabilities, maybe agents are just, I think the meme is, a model call in a for loop: you just tell the model, this is what already happened, these are your tools, do you want to do anything, or are we done yet? I think there's no final verdict on the value of these frameworks.
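To make the meme concrete, here is a hedged sketch of that "model call in a for loop" pattern in Python. `call_model` and `web_search` are stubs standing in for any chat-completion API that can request tools and for a real search tool; a framework would mostly add structure around the message history, tool dispatch, and stopping condition.

```python
# The "agent = model call in a for loop" meme as a runnable sketch.
def web_search(query: str) -> str:
    return f"(search results for: {query})"  # placeholder tool

TOOLS = {"web_search": web_search}

def call_model(messages: list) -> dict:
    # Placeholder for a real chat-completion call. A real response would either
    # request a tool (name + args) or return a final answer; this stub finishes
    # immediately so the sketch runs on its own.
    return {"final": "Example research report.", "tool": None, "args": None}

def agent(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        # "This is what already happened, these are your tools."
        decision = call_model(messages)
        if decision["tool"] is None:          # "Are we done yet?" -> yes
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["args"])
        # Feed the tool result back into the model's memory/state.
        messages.append({"role": "tool", "name": decision["tool"], "content": result})
    return "Stopped after max_steps without a final answer."

print(agent("Summarize recent changes in LLM tracing conventions."))
```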
[00:33:04] Chapter 12: Current and Future Trajectory of AI Applications
Mirko Novakovic: Yeah, it's similar, I remember, I mean, you're too young for that, but when the whole web started, there were thousands of web frameworks, right? Everybody built a web framework because it was always the same pattern. And whenever you chose a web framework, a year later you were outdated and you wanted to use the next one, right? It was really this catch-up game, where at the end of the day you figured out that it was not so much about the frameworks; it was more about patterns, how you use certain things, and if you focused on that, you had some sort of stability in your application. But always trying to catch up with the frameworks is a tough game at the end of the day.
Marc Klingen: Yeah, I think frameworks are interesting in this space to quickly learn as an application team how something could be built. It's nice at this kind of macro level for exploring options, seeing how they behave in a specific domain and whether they could be a solution. And then you can still decide: do I eject and build this abstraction myself, or am I happy with the framework and stick with it? From an observability standpoint it's interesting to use a framework because it gives you instrumentation. Some of them support custom span attributes, which then allow you to roll up a trace into a graph that can be cyclical, which is really difficult otherwise; you would need to develop your own markup to produce graphs. So there's definitely value in frameworks for a team that's small and wants to move quickly.
Mirko Novakovic: You said tens of thousands of teams are using your tool already, probably both open source and closed source. But that also means you see basically an explosion of AI apps, right? Is that what you see happening? Is it already in production? Is it in development? I mean, you read a lot about it, but I'm trying to understand how fast this is moving. Do you see a lot of these in production or still in an evaluation phase, and how do you see the market growing and moving forward?
Marc Klingen: Yeah, I think over the last year it's been impressive to see how many move to production. When we talk to, for example, an AI/ML platform team in a large organization with thousands of developers, they usually talk about a three digit number of LLM-related use cases that they are exploring. It varies, but still, that's a large set of different projects going on. And if they are internal and low risk, then many of them go into prod and deliver some sort of value. I think the number of projects that are more customer facing will continue to increase as people get experience from the internal use cases. But we work with many enterprises that run customer-facing communications or back office work and have a semi or fully automated kind of experience already, which helps them accelerate these workflows and improve the user experience. I'm very bullish that this continues to increase, and we see big excitement. The knowledge that builds up in these platform teams, around how to do things, how to structure these projects, how to iteratively learn and then get some sense for metrics to actually deploy use cases, keeps improving. And these conversations are very exciting for me in the sense that, compared to two years back when we started, they confirm the hypothesis that this is a workflow problem. It's not easy to build something that you can launch and where, if management asks, you can defend that it's safe to deploy, to, I don't know, run your customer support through it. So I'm very excited. We see lots of interesting applications, lots, every day.
Mirko Novakovic: No, absolutely. I think you were at the right spot at the right time with your tool. In my 25 years in this business, I've never seen something like that: some technology taking off like the ChatGPT moment, where everybody was like, oh my God. And then really getting this into the hands of developers building new things, it's so fast. I can just second that: when we talk to our customers, we see it everywhere. Everyone, especially enterprises, now has an AI budget to just go out and roll out as many AI applications as possible, test them, and try to figure out where they can automate, save money, and do new things. It's amazing. And as you said, I can see that everyone needs something like Langfuse, because you want to have these evaluations, you want to understand how your things are running. Especially with these kinds of use cases, you can't let them run without observability, essentially, right? Because otherwise you really have problems if you give wrong answers to your customers, for example. That's something you don't want to have.
[00:35:48] Chapter 13: Impact of Open Source and Community
Marc Klingen: Yeah, as long as you don't try to build a ChatGPT where people just expect raw model output. I think the same thing applies here as to observability and OpenTelemetry: you have a big community that crowdsources improvements to the standard. I feel very grateful for the community that we've built up, where for example GitHub discussions has worked extremely well to stay closely in touch with what our community needs. From a startup perspective, I'd say the product has strong product market fit with our community right now, but at the same time demands and requirements change on a monthly basis. So being backed by a big community, and then partnerships with other open source projects, for example LiteLLM, which built an LLM gateway, or other frameworks, having close ties with these projects to then together deliver some sort of AI platform, LLM platform, or whatever people need to adopt, as a joint offering. I'm very grateful for all of these partnerships and the bigger community, which helps shape the product.
Mirko Novakovic: Yeah, and open source is probably the best way of doing it, right? Because things are moving so fast that if you are closed source, you can't really understand what's happening. So give it to the hands of a lot of engineers and then get their feedback. We also see the same when we use open source parts or agents: developers tolerate things differently if it's open source than if it's closed source and paid; then it's just a very different conversation.
Marc Klingen: Yeah. I think our open source strategy was quite distinct in the way that the whole back end, with the full scalability that we also run in our cloud, is fully MIT licensed, and we just have some in-UI features that add on to it. For example, a playground: you don't really need it, but it's nice to have, and that's something you can get under a commercial license. But all of the core aspects that you need for more of a platform, where you want to build your custom workflows around it, everything that's API-bound, is licensed without being kneecapped on anything; even the IdP integration for enterprises is open source as well. So yeah, I think we have a good sense for what's good in OSS and what's a commercial model to make this a big company as well.
[00:38:08] Chapter 14: Concluding Thoughts
Mirko Novakovic: Yeah, perfect. I will cheer from the sideline and watch what you're doing. Marc, this was super interesting for me, because this is such an interesting and fast evolving space. Thanks for the conversation, and I'm looking forward to talking to you again in a year or so, because I think then it will be totally different, right? It's such a fast moving space.
Marc Klingen: Yeah, thanks for having me, Mirko. This was fun.
Mirko Novakovic: Absolutely. Thanks, Marc. Thanks for listening. I'm always sharing new insights and insider knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.