Episode 32 · 32 mins · 8/28/2025

#32 – Data Observability at the Source: Ido Bronstein on Upriver, Bad Data, and How To Monitor A Future Full of AI Systems

Host: Mirko Novakovic
Guest: Ido Bronstein

About this Episode

Upriver co-founder and CEO Ido Bronstein joins Dash0’s Mirko Novakovic for a deep dive into the hidden risks of bad data and why “shift-left” observability is becoming essential. Ido shares why catching data issues at the source is critical for reliable pipelines, how Upriver helps engineers take ownership of data quality, and why AI adoption is making data accountability non-negotiable.

Transcription

[00:00:00] Chapter 1: Introductions and Code RED kickoff

Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0. And welcome to Code RED: code because we are talking about code, and RED stands for Requests, Errors, and Duration, the core metrics of observability. On this podcast, you will hear from leaders across our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Ido Bronstein. Ido is the co-founder and CEO of Upriver, a startup committed to preventing bad data and ensuring data accountability across organizations. He previously led engineering teams in Israeli Military Intelligence, where they built a central platform to extract insights from vast amounts of data. Ido, glad to have you here.

Ido Bronstein: Glad to be here. Glad to be here. Yeah. Very exciting.

Mirko Novakovic: And I always start the conversation with the first question, which is what was your biggest Code RED moment in your career?

Ido Bronstein: So of course there were a few, but I decided to choose one that happened a couple of months ago. Funny story. My partner, the CTO of the company, and I were at a bachelor party in Greece, having fun with our friends, our very best friends from years back. We landed on Tuesday night, and then on Friday morning our phones, our Slack, went crazy. Our customer support was constantly calling us. We started to investigate a huge slowdown of our platform. A bit about our platform: behind the scenes we are running Databricks, and Databricks had a major incident of their own in a couple of regions. And because we didn't handle a timeout correctly in our API server, it completely blocked all ability to access our platform. So while we are at the bachelor party in the villa, everybody is having fun, drinking, and we are outside debugging everything with our engineers in Slack, and our customer success team is updating the customers the whole time. We finished debugging, understood what the problem was, and fixed the incident. I think it took us five hours, something like that. But it was crazy: a huge incident at the wrong time, with very little ability to understand easily what had happened.

Mirko Novakovic: Can I ask, did you catch up on the beers later on?

Ido Bronstein: This was our payment. So we had to.

[00:02:47] Chapter 2: Defining shift-left data observability

Mirko Novakovic: Yeah. Let's talk about Upriver. I mean, I found it very interesting, right, the whole data observability space. And you coming from the military, I think you're now the fourth or fifth Israeli ex-military founder in the observability space, and it makes sense, right? I think you have worked on data intelligence. So tell me a little bit how you came up with the idea of Upriver and what the core of it is.

Ido Bronstein: So Upriver is a shift-left data observability platform that creates a layer of observability for the data, specifically for data pipelines. Shift left means that we find the problems at the source, when wrong or bad data is ingested into the pipeline, very close to the producer of the data. Unlike other tools in this space that find the problems far downstream, when the data has already passed through a couple of stages and the ownership of the incident is lost. Our goal is to bring the ability to understand data problems to the producers, who are usually software engineers, and let them take ownership of those problems. That is, in a sense, what Upriver is. I will start by explaining what data observability is, for the part of the audience that does not know the category. Data observability means that you need a layer of observability for the data. In data-heavy systems, everything can look very good from an infrastructure or application perspective, but due to a small modification in the data, the downstream data product can be unreliable, can be wrong, can be a massive incident for the company. I will give a couple of examples. Say you have a gaming application that extracts a lot of metadata and data about the user, so it can recommend to the user his next stage or his next boost, and there is AI behind the scenes doing this process.

Ido Bronstein: And now there is a new release of the iOS or Android version, and a couple of the fields that the application usually collects aren't collected anymore, or the format has changed. So let's look at the flow of the data. The data starts from the application. Then it goes to the backend. Then it goes to some lake or some ingestion process. Then you have a pipeline, an ETL, that will transform, clean, and model the data. Only then does it arrive at the AI engineers who try to train the model. They will train the model on something, and the data that the model receives in production will come through a really different pipeline. So for example, in this new iOS version someone changed how the timestamp is formatted. That is going to affect the transformation, the model, the performance, the product. Our ability is to find those changes at the source, connect them to the downstream impact, and explain to the data engineers, the data scientists, and ultimately the software engineers that create the data what the effect of this change is. This is, for me, data observability: the ability to look at and understand how data changes affect my system.
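
The kind of drift Ido describes, a producer silently changing a timestamp format, can be sketched as a simple check at ingestion. This is a toy illustration with an assumed format, not Upriver's implementation:

```python
from datetime import datetime

# Assumed producer contract: ISO-8601-like timestamps (hypothetical format).
EXPECTED_FORMAT = "%Y-%m-%dT%H:%M:%S"

def format_match_rate(values, fmt=EXPECTED_FORMAT):
    """Fraction of values in a batch that parse under the expected format."""
    ok = 0
    for v in values:
        try:
            datetime.strptime(v, fmt)
            ok += 1
        except ValueError:
            pass
    return ok / len(values) if values else 1.0

old_batch = ["2025-08-01T10:00:00", "2025-08-01T10:00:05"]
# A new app release started emitting US-style timestamps for some events:
new_batch = ["08/01/2025 10:00:00", "2025-08-01T10:00:05"]

assert format_match_rate(old_batch) == 1.0
assert format_match_rate(new_batch) == 0.5  # drift caught at the source
```

A real system would learn the expected format from history instead of hardcoding it, and would alert on a sustained drop in the match rate rather than a single bad batch.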

[00:06:54] Chapter 3: Where Upriver observes data

Mirko Novakovic: Where do you look at the data? Only on the database or also inside of the application code?

Ido Bronstein: Today, what usually happens is that all of those events are thrown into Kafka or an S3 bucket, and only then do they start to be monitored by the data engineers. So today we look at the data as it is ingested into the pipeline, from Kafka or from S3. Ideally, we would want to connect it also to the application layer, to how the data is passed through the microservices themselves.

Mirko Novakovic: Okay, that makes sense, right? You look at Kafka messages, you analyze those, and you see any difference in the messages. But are you also looking at the data inside the database, or not?

Ido Bronstein: Yeah. Of course.

Mirko Novakovic: Yeah.

Ido Bronstein: When you are living in the world of microservices, you have a lot of governance: you have APIs and you have tests for the data. But once you throw this data into the data land, you have no guardrails. You have no idea what the data engineers, analysts, and data scientists are doing downstream, and you have no idea that some change is going to affect them, going to break their production systems. So I think we are sitting at the point in the middle that makes the connection between the two domains.

[00:08:26] Chapter 4: Market awareness and AI-driven demand

Mirko Novakovic: I remember when I was an engineer 20 years ago, you already had the same problems, right? You changed the data, and then all the ETL pipelines that shifted that data into a different data store were broken. Once you change it, they don't work anymore. But I found it interesting: how do people find you? Are they searching for the problem? Because for a normal developer this is not a typical tool you're familiar with, right? Do you even know that it exists?

Ido Bronstein: Companies whose products are heavily based on data know they need data observability and data quality in place. So there is some education in the market that this layer is needed. And what is happening right now is that more and more companies are shifting to be very data-based. Just as a nice side story: at SREcon, which is happening in October, there will be a main track on data and AI and a lot of different lectures on data observability. I know because I'm on the technical committee. So there is a shift where companies are becoming more aware of the need for data observability, and that education is going to accelerate with the adoption of AI.

[00:10:10] Chapter 5: Relevance for GenAI and context quality

Mirko Novakovic: Yeah, absolutely. So how does it play out in practice? In classical machine learning I can definitely see it. But with GenAI, is it still that important? Because, I mean, the model is trained on literally text, more or less, and interprets the text. So is it still as important there as in areas where you really rely on the format of the data?

Ido Bronstein: We understand more and more that what is different between how one company uses AI and another is not the model, it's the context, and the context is the data that the company has and uses. This is the knowledge that you bring into this generalized model. If the knowledge that you bring to the model is wrong, it's not going to act correctly. And you can see that there is another category today called LLM observability; I'm sure you know it. They're trying to trace all the prompts along the way. So in the end, everything will converge. When you are looking at a chatbot, and we already have such a customer, we check the data before it goes to the chatbot, because without being able to say the data is reliable, the data is correct, they cannot release that chatbot to production.

[00:11:45] Chapter 6: Users, roles, and shift-left ownership

Mirko Novakovic: What is your typical user then, a developer?

Ido Bronstein: Our users are engineers: data platform engineers and data engineers. When you look at data engineers, a lot of them came from the BI side, and a lot came from heavy data pipelines, the Spark area. So usually we work with data engineers who are more engineering in character. For example, one of our customers is a huge edtech company. They are building data pipelines, they have models, but by title they are all engineers. In the end, we are helping any engineer that builds data pipelines, whether data engineers or software engineers. And when you are doing shift left, usually we bring the incident, the understanding of what happened in the data, to the engineer responsible for the data.

[00:12:42] Chapter 7: Deployment and data connections

Mirko Novakovic: But you need a running application, right? You need the application to be running in an environment so that you can actually monitor it. Or can you already do that by looking at code?

Ido Bronstein: So we are in production and we do production monitoring. I will take two minutes to talk about how our product works; maybe it will give the whole picture. We connect to all of our customer's data: the stream on Kafka, the bucket on S3, the data warehouse, the operational DB, the Mongo, the Postgres, everything.

Mirko Novakovic: And is that an agent, or what is it? Or do you connect from the outside?

Ido Bronstein: So we have both deployments: we can run this compute in our environment or in the customer's environment. And then we start to analyze the data. We extract semantic and statistical insights from the data, and then we automatically create monitoring rules in four different layers. The first one is schema: just helping companies manage schema in a complex pipeline. The second is format, the semantic format of the data: whether it's a URL, a country, a phone number. The third is metadata: freshness and volume of the data. And lastly, metrics on the data itself: histograms, cardinality. All these metrics we put into our AI workflows. The rules learn what the usual behavior of the data is, and when there is an incident and our AI recognizes it, we send an alert and connect the alert to the lineage. The lineage is like the APM of the data: the flow of the data, how the data moves from one table to another table to a model.
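
The four layers Ido lists (schema, semantic format, metadata, metrics) can be illustrated with a toy batch profiler. The regex and field names below are assumptions for the sketch, not Upriver's actual rules:

```python
import re

URL_RE = re.compile(r"^https?://\S+$")

def profile_batch(records):
    """Profile one batch of events across the four layers described above."""
    schema = set()
    for rec in records:
        schema |= rec.keys()
    profile = {"schema": sorted(schema),  # layer 1: schema
               "volume": len(records),    # layer 3: metadata (volume)
               "columns": {}}
    for col in schema:
        vals = [r[col] for r in records if col in r]
        profile["columns"][col] = {
            # layer 2: semantic format (does every value look like a URL?)
            "looks_like_url": all(isinstance(v, str) and URL_RE.match(v) for v in vals),
            # layer 4: metrics on the data itself
            "cardinality": len(set(vals)),
        }
    return profile

batch = [{"user": "a", "link": "https://x.io/1"},
         {"user": "b", "link": "https://x.io/2"}]
p = profile_batch(batch)
```

Comparing successive profiles against a learned baseline is what turns this from a report into a monitor: a vanished schema field, a URL column that stops looking like URLs, or a cardinality jump each maps to one of the four layers.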

[00:14:39] Chapter 8: Lineage construction and standards

Mirko Novakovic: You detect the flow automatically, how the data is moving?

Ido Bronstein: Yeah, it depends. If it's a table that is built by SQL, we know how to create the lineage automatically from the SQL logs by parsing the dialect. Or we use OpenLineage, which is an open-source standard; I don't know if you are familiar with it. It's like OpenTelemetry: they standardize the telemetry on changing data. Our users adopt this open-source SDK, and we know how to collect it and show an end-to-end view of the data.

Mirko Novakovic: Okay. I didn't know this. OpenLineage is the name?

Ido Bronstein: OpenLineage, yes.

Mirko Novakovic: Okay.

Ido Bronstein: Yes, an amazing project. I think they really took inspiration from OpenTelemetry.
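
For readers who haven't seen it: an OpenLineage run event is a small JSON document tying a job run to its input and output datasets. This sketch follows the core shape of the spec, abridged (no facets), and the job and dataset names are invented:

```python
import json
import uuid
from datetime import datetime, timezone

# Core shape of an OpenLineage RunEvent (abridged; consult the spec for
# facets and the producer/schemaURL fields). All names here are made up.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "orders_daily_rollup"},
    "inputs": [{"namespace": "kafka://broker:9092", "name": "orders"}],
    "outputs": [{"namespace": "warehouse", "name": "fact_orders"}],
}
payload = json.dumps(event)
```

Because every pipeline step emits events in this shared shape, a collector can stitch the input/output dataset pairs into the end-to-end lineage graph Ido describes.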

[00:15:35] Chapter 9: Incident detection examples

Mirko Novakovic: And so then, what is an incident? How do you detect an incident? Is it when there's a change in the data format, or?

Ido Bronstein: I will give you a few things that happened to our customers. It's easy to illustrate with very simple incidents, but of course it gets more complex. For example, one of our customers had a name column and a description column, and suddenly they pushed the name into the description and the description into the name. So the data flows, everything looks okay, but the data itself is wrong. Our ability to look at the distributions of the string lengths enabled us to say: hey, something wrong happened here, you have a shift in the distribution here and here, and it's going to affect you in those data products, using the lineage. Another example: a URL format, something very simple. One customer concatenated the domain twice into the URL when building it in the backend. So suddenly we saw, in some records, the HTTPS prefix appearing twice, a wrong format. It happened in only 10% of the data. So we told them: you have a shift in the format histogram of your data.

Mirko Novakovic: Yeah, that's pretty amazing. I'm just trying to understand it, but that's pretty cool, because that happens a lot, right? If you change code, you can introduce a bug that's only visible in the data, as you explained. I add the HTTP prefix twice, and normally I can't see it, because from a programmatic perspective, as a normal observability guy, everything is fine: the performance is fine, there's no error code. You're injecting a string, and a string is a string, and you don't have a semantic check. But what you are doing is monitoring the data. You know what that field looks like, it's a URL, and now you see a change; you call that a shift in the data. And now you can inform me: hey, something has changed significantly. You don't really know if it's an incident, but you can at least tell me there is either a problem or a deliberate change.

Ido Bronstein: Exactly.

[00:18:01] Chapter 10: Alerts, learning, and onboarding

Mirko Novakovic: And then what do the users do? Do they create alerts on it?

Ido Bronstein: We create the alerts automatically for them. We learn all the thresholds with our AI. The experience for the customer is very simple: run our CloudFormation, give us the ability to connect to the data in our environment or in his environment. We learn the data for a week, or, if we can backfill the data, we learn from the past; we need a week in order to handle seasonality. And then we just start to send him an alert anytime something weird happens.
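
Why a week of data? Thresholds learned per day of week stop normal weekend dips from firing alerts. A toy version of that idea, using simple mean/stddev bands rather than Upriver's models:

```python
from collections import defaultdict
from statistics import mean, stdev

def learn_weekly_thresholds(history, z=3.0):
    """history: (day_of_week, value) pairs covering at least one full week.
    Learn a (low, high) band per weekday so weekly seasonality, like low
    weekend volume, does not trigger false alerts."""
    by_day = defaultdict(list)
    for dow, value in history:
        by_day[dow].append(value)
    return {dow: (mean(vs) - z * stdev(vs), mean(vs) + z * stdev(vs))
            for dow, vs in by_day.items() if len(vs) > 1}

def is_anomalous(thresholds, dow, value):
    low, high = thresholds[dow]
    return not (low <= value <= high)

# Mondays (0) see ~100 records, Sundays (6) ~10: both are normal for their day.
history = [(0, 100), (0, 102), (0, 98), (0, 101),
           (6, 10), (6, 12), (6, 11), (6, 9)]
bands = learn_weekly_thresholds(history)
```

With a single global threshold, every Sunday would look like a volume drop; per-weekday bands only fire when a value is unusual for that day.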

Mirko Novakovic: So tell me about your AI. I mean, I love talking about AI at the moment.

Ido Bronstein: Yeah.

Mirko Novakovic: Everybody's talking about it. Just before this talk, we talked about the AI native company, right? Everybody's talking about it, nobody knows what it is, but it doesn't matter. Everyone is an AI native company these days. So you too, probably?

[00:19:00] Chapter 11: AI components in Upriver

Ido Bronstein: Well, of course, an AI native company. Joking aside, we really are using AI. I personally use it all the time; it's just such a powerful tool, for everything from brainstorming onward. I don't know about "native"; it just integrates into what my developers do, what my marketing does, and what I do myself. But in our product we have smart AI components in a couple of places. First of all, the semantics: how do I understand that a column is a name? How do I know that the first name is connected to the last name? What we do is connect the raw data to the semantics: we extract the semantic entities from the raw data, and this is what we want to monitor. So the extraction of the semantics is one place. Then we use a combination of AI and classical statistics in order to understand what the thresholds are, what the usual behavior of the data is. And lastly, we use it in our root cause analysis. We need to understand what the regular correlations are between places, so we take all the metadata that we collect, lineage data, sample formats, shifts in distribution, and put it as context to an LLM in order to create a very good root cause analysis.
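
The semantic extraction step can be approximated, very roughly, with pattern heuristics; real systems use far more signal (column names, value statistics, ML models), so treat these regexes as placeholders:

```python
import re

# Toy semantic classifiers (assumed heuristics, not Upriver's model).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "url": re.compile(r"^https?://\S+$"),
    "phone": re.compile(r"^\+?[\d\s\-()]{7,}$"),
}

def infer_semantic_type(samples, min_ratio=0.9):
    """Label a column by the first pattern matching at least min_ratio of samples."""
    for label, rx in PATTERNS.items():
        hits = sum(1 for s in samples if rx.match(s))
        if hits / len(samples) >= min_ratio:
            return label
    return "unknown"
```

Once a column is labeled "url" or "phone", the monitor knows which format checks apply to it, which is how the URL histogram example earlier becomes possible without any manual configuration.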

[00:20:55] Chapter 12: Scale, sampling, and efficiency

Mirko Novakovic: And I can imagine, I mean, we do something similar on logs, right? We also check the semantics, the patterns. One of the challenges is that if you have millions, and I can see the same for you, there could be millions of messages on Kafka per second, per minute, whatever. So you have to do this at very high scale. You have to look at those messages, and we do something like fingerprinting: we fingerprint messages so that we don't have to do the heavy lifting all the time but still understand patterns, and then we run the real AI only on messages that we haven't fingerprinted yet, which are new to us. Do you do something similar, or can you really analyze all the messages?

Ido Bronstein: We sample the data from the start, because we work with massive data pipelines; our biggest customers have something like 200 billion records per day, and if you don't sample the data, the compute will be enormous. So we know how to sample the data smartly, in order to be efficient on resources but without creating wrong predictions for the statistics and the thresholds and the rules that we create. We needed to handle these problems at the very beginning, where we actually extract the data from the customer environment.

Mirko Novakovic: Yeah. I mean, I know sampling is not easy, right? You can do naive sampling, which just means, hey, I keep every 10th message, but that normally is not sufficient. Intelligent sampling is not easy to build. That's interesting, because it's one of the problems in OpenTelemetry too: in the Collector you have different sampling approaches. And one of my ideas on sampling, by the way, and I can see it here too, is that you need to apply the usage pattern to the sampling algorithm. If you understand better how the data is used, what data is used, where it's used, what is queried, then you can apply that knowledge to the sampling statistics. For us in observability it means: if I understand which data is queried, which data is put on dashboards, which data is used in alerts, that data gets a higher priority in the sampling pipeline than data that is not used, because you know it's on dashboards, you know it's queried. We are currently evaluating how to make a sampling algorithm more intelligent based on the usage of the data.

Ido Bronstein: Of course, I completely agree with you, and this is exactly what we are handling. I will say that in our sampling algorithm, the user has the ability to segment the data. Tell me, okay, I want to look differently at the US and Europe, and then I know to sample those areas differently. If the US is 90% of the data and Europe is only 10%, and you tell me you want to look at those as different segments, I will create all the statistics and the sampling on those specific segments of data.
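
Segment-aware sampling like the US/Europe example can be sketched as per-segment rates, so a 10% segment is not drowned out by the 90% one. The rates and segment key here are illustrative:

```python
import random

def stratified_sample(records, segment_key, rates):
    """Keep each record with its own segment's rate, so a small segment
    (say Europe at 10% of traffic) still yields enough rows for statistics."""
    rng = random.Random(42)  # seeded only to make the sketch reproducible
    return [r for r in records if rng.random() < rates.get(r[segment_key], 0.0)]

records = [{"region": "US"}] * 5 + [{"region": "EU"}] * 3
# Rates chosen to make the example deterministic: drop all US, keep all EU.
sample = stratified_sample(records, "region", {"US": 0.0, "EU": 1.0})
```

In practice the rates would be derived from each segment's volume and the precision the learned statistics need, not hardcoded.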

[00:24:33] Chapter 13: Shift-left workflow and CI/CD integration

Mirko Novakovic: It's like a tariff. You put a tariff on the EU data?

Ido Bronstein: Yeah. Yeah.

Mirko Novakovic: That's cool. And you talked about the shift left. How does it change the way those developers and data engineers work if they have your tool? I also saw on your website you have CI/CD integration. So how does it change my day-to-day work? Is that data fed back into my pipeline?

Ido Bronstein: It's so amazing.

Mirko Novakovic: Amazing.

Ido Bronstein: For the software engineers, it gives them the ability to be proactive about those data incidents, instead of getting a call from a data engineer telling them that yesterday's deployment broke their models. Now the software engineers get the alert in Slack, or in their CI/CD if they run our checks and monitors during the process. You are right, we have this ability. You will get the notification, you will know what the problem is and what its effect is. And instead of waking up in the middle of the night because a data engineer or the CRO complains that the prediction of the company's revenue is not working right now, and he needs to do the quarterly report now, he will get this in advance. He will be proactive about solving it, because he will understand how his changes affect the data that feeds a business asset.
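
A CI/CD hook of the sort described can be as simple as validating sample events against the fields downstream consumers expect. The contract below is invented for the sketch, not Upriver's format:

```python
# Hypothetical data contract: fields downstream consumers expect.
EXPECTED = {"user_id": int, "event_ts": str, "country": str}

def check_event(event):
    """Return a list of contract violations for one produced event."""
    errors = [f"missing field: {k}" for k in EXPECTED if k not in event]
    errors += [f"wrong type for {k}" for k, t in EXPECTED.items()
               if k in event and not isinstance(event[k], t)]
    return errors

# In CI, a non-empty list would fail the build before the change ships.
good = check_event({"user_id": 1, "event_ts": "2025-08-28T10:00:00", "country": "DE"})
bad = check_event({"user_id": "1", "event_ts": "2025-08-28T10:00:00"})
```

Failing the build on `bad` is the shift-left moment: the producing engineer sees the breakage before the data ever reaches the pipeline, instead of the data engineer finding it downstream.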

[00:26:03] Chapter 14: Company stage and product vision

Mirko Novakovic: Upriver is pretty early, right? You're still early in the journey.

Ido Bronstein: Yeah, pretty early. We've been running for about a year and a half now.

Mirko Novakovic: Wow. And how big is the team, if I may ask?

Ido Bronstein: So we are around 15 people.

Mirko Novakovic: Oh nice. So how do you see it going forward? What are the things you want to build, what's your vision for the next steps of the product? And how do you see it working, especially in the context of AI? Because I think for every company, data will become, as you said, context is king, data quality is king. So this will become a major topic for every company, because every company will use AI, and every company has to provide context, provide their data, to that AI model. If it's not already the case for larger data-driven companies. But I know that in Germany, for example, a lot of companies are not there yet; it's still at the beginning. So how do you see the product evolving, and how do you want to help those companies?

Ido Bronstein: So there are two parts to this question: how AI will change my product, and how AI will change my customers' day-to-day.

Mirko Novakovic: Exactly.

[00:27:27] Chapter 15: Convergence with SRE and observability platforms

Ido Bronstein: I will start with the first one. For my product: we are going not only to detect problems, but also to remediate them automatically. We are pushing our abilities forward to not just understand what happened, but also understand how to fix it. This is where we're going.

Mirko Novakovic: And sorry if I interrupt, but how do you do this? Are you providing an MCP server and connecting it also to some other tools? How do you want to solve that?

Ido Bronstein: We do that today, but we also connect to Git and analyze the changes, correlate them with the data, understand how a code change affects the data, and connect it to our incident response platform. And the second part, which for me is more important, is how it will affect my customers. More and more companies are going to be AI companies, AI native, as you said, and they're going to adopt AI as a core part of the product, which means they will need to trust the data. It means they will need a data observability tool. So the market of companies that need my product is only going to increase. And the tipping point will be when the SRE groups in those organizations adopt data observability. It's happening, it's shifting. Today data observability is usually sold to the data organization, but I see more and more SRE groups searching for data observability tools as part of the tools they have to acquire. You can see it from a couple of things. First of all, as I mentioned, SREcon and how it puts data and AI in focus. And secondly, just a quarter ago, I think, Datadog acquired a data observability company, Metaplane; I don't know if you know it. So Datadog is going to bring to the market the first combined observability and data observability value proposition, and I believe it will resonate with a lot of customers. So in the end there will be, as I said, a merge between those categories. Right now there is not, but if I look one year forward and see how AI affects companies and what they need to thrive, the SREs will need data observability tools in order to continuously keep their systems reliable.

Mirko Novakovic: I know that Datadog is investing heavily in the whole AI space, right? So it makes sense for them to integrate that. And if you look back, those observability platforms are becoming wider and wider: security use cases, data observability use cases. I can totally see that. Today I would say it's still a category of its own, right? Monte Carlo and the like.

Ido Bronstein: You asked me about the future.

[00:30:31] Chapter 16: Partnerships and closing remarks

Mirko Novakovic: Yeah, yeah, and I think I agree with you. I think it will probably merge together, because with AI, data is coming closer to the core of any application, and it will heavily influence whether an application is running correctly or not. So you will need it, absolutely. So what's your plan? Do you want to integrate with vendors like us? What are your thoughts on it?

Ido Bronstein: So we have a couple of partnerships with observability platforms. The Israeli ecosystem is full of good observability platforms; there are plenty, as you mentioned. groundcover, yes.

Mirko Novakovic: Yeah, had a lot of them here. Coralogix and EVGO. Lumigo, right. It's also there. So we have, we had a lot of them here on the podcast already.

Ido Bronstein: Yeah, yeah, I know. Good guys.

Mirko Novakovic: Hey Ido, I really enjoyed it. For me, this was kind of a new topic. I know the space a bit, but now it clicked. I really understood how you go from, I would say, the Kafka topic, following the data into the database, generating this lineage, which I hadn't heard of before, then understanding how the data flows through different systems, and then detecting semantic changes, which I like: not only syntactic but also semantic changes at the field level. Pretty amazing technology. I wish you all the best and hope that you will kill it.

Ido Bronstein: I really enjoyed being here. Thank you very much for inviting me.

Mirko Novakovic: Thanks for listening. I'm always sharing new insights and knowledge about observability on LinkedIn. You can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.
