[00:00:00] Chapter 1: Introduction and Guest Background
Mirko Novakovic: Hello everybody. My name is Mirko Novakovic. I am co-founder and CEO of Dash0, and welcome to Code RED: code, because we are talking about code, and RED stands for Request, Errors and Duration, the core metrics of observability. On this podcast, you will hear from leaders around our industry about what they are building, what's next in observability, and what you can do today to avoid your next outage. Today my guest is Marcin Wyszynski. Marcin is technical co-founder and chief R&D officer at Spacelift, a platform designed to enhance infrastructure as code management. He's also the co-founder and main sponsor of OpenTofu, which we will talk about, and he previously worked at both Google and Facebook. Marcin, happy to have you on board here today.
Marcin Wyszynski: Thanks for having me here, Mirko. Looking forward to our conversation.
[00:00:55] Chapter 2: Code RED Moment at Google
Mirko Novakovic: Yeah, and I always start the conversation with one question, which is what was your biggest Code RED moment where something went totally wrong?
Marcin Wyszynski: Just one? I probably have a whole book of those. I can give you the most expensive one. I was back at Google, working on cold storage, tape libraries. Imagine that: a high-tech company like Google using tapes to back up data. And we were backing up a lot of data. A lot of that data was actually Gmail. Gmail has so much data that it occupied a lot of our tapes, and some of that data is quite time sensitive. So what we thought, or what Google thought, was: we should probably have a way to very quickly get the right data retrieved from the tapes. Now, if you have a lot of data, then you have a problem: how many tapes do I need to retrieve to get a part of my Gmail back? Luckily, Gmail was sharded, and shards mapped to logical and physical data centers. What they figured out is that when something goes wrong, it generally goes wrong in one or two shards, and there are like 50 of them, right? So the logical approach is to essentially back up each shard separately. There's enough data per shard that it doesn't actually make a difference. And we assumed: okay, if a tape is already open in the library for a shard, then we just prioritize it for that shard. So each shard gets its own tapes.
Marcin Wyszynski: But because there's so much data, let's say you're shard 21 of Gmail: there's always going to be a tape in the library for shard 21, right? And we had multiple libraries per region. They're behemoths, a $10 million behemoth sitting in the middle of a data center, with 64 drives each, and like ten libraries for storing tapes for quick retrievals. Beasts, really. Eight robots per library. So here's the question: Gmail is such a big part of what we do, will this increase costs, or will it decrease the capacity for everyone else? My job, essentially, was to do the calculation: to replay how it would impact things if we started sharding Gmail backups, how it would impact our costs and our capacity for other users. So what I did was take all the historical data and replay it with different parameters to simulate a few options, and I made the assumption that if a tape is open in one library in a region, then the data would be prioritized there, right? It's an obvious assumption to make. And the assumption was wrong. The code that was meant to do this was commented out.
Marcin Wyszynski: And essentially, when the data came in for a given combination of project and shard, it wouldn't care whether there was a tape open in any library in the region. It would open a tape in another library. So first you'd choose a library and then you'd choose a tape, rather than choose a tape first and then choose a library. It's so stupid that we never assumed it would be the case, and the code was there to make it different, but it was commented out. So we start backing up Gmail data, we're so happy, it goes live, and everyone starts complaining: the libraries are not backing up anything but Gmail. Each Gmail tape is now at about 10% capacity, because there's not enough data, because we're opening a Gmail tape for every shard in every library. And suddenly all of the operations came to a halt. All of them. So now you have two options. Obviously you can't continue; obviously you have to stop it. And then what do you do? Either you fix the code, or you essentially tell the Gmail people: no, we're probably not going to do that. And it was actually the latter. We told them: look, we're not going to fix the code, we're actually building a new system. For now, you probably want to back up to disk and to tape, with tape being your last resort.
Marcin Wyszynski: You're probably going to have more luck doing a restore from disk first, and then if you're missing some data, we'll retrieve it from tape. We essentially said: look, we don't even understand the code. Why don't we just uncomment it? Well, it was probably commented out for a reason, and because it was a legacy system and we were working on a new system, it was more like: okay, what are the unintended consequences of uncommenting this code? We actually didn't know. I know it sounds crazy that we didn't just fix it, but we didn't know what the unintended consequences were. It's like Chesterton's fence, right? If you don't understand why there's a fence, I will not let you remove it. I'll let you remove the fence once you explain to me why it existed in the first place and why it no longer applies. So that's why we chose to roll back fully and not try to do some shenanigans. But it was pretty drastic. It was: Google tape backup is completely broken, and right now all we do is back up some Gmail data, nothing else, until something happens in the meantime. Oh, my.
Mirko Novakovic: Yeah. I mean, it's not the first Google Code RED moment in this series; we had a few from Google. But there are two things I always learn. One is that the scale is just crazy, right? So the problems are very specific but can have a really high impact. The second thing, which I really like, is that you kind of think Google is doing things very differently, but then you hear: hey, they have tapes and tape robots, and they do backups on tapes like a lot of companies, right? It's normal technology, just on a different level and scale.
[00:07:39] Chapter 3: Lessons Learned from Failure and Tech at Scale
Marcin Wyszynski: Different scale, for sure. But again, it's still humans, and humans will make mistakes and robots will break. So you'll get the same failure modes at Google as everywhere else. Google is not magical. I mean, it's a very well-run company, SRE at Google is very strong, there's a culture of reliability and there's competence. Still, humans will be humans. Yeah.
[00:08:05] Chapter 4: From Big Tech to Founding Spacelift
Mirko Novakovic: Yeah. And how did you get from Google and Facebook to becoming a co-founder of Spacelift? Did you first do the OpenTofu step and then found Spacelift, or was it Spacelift first?
Marcin Wyszynski: So I left Facebook because I wanted to move back to my home country, to Poland, when my first child was born. I really wanted them to grow up with family around. I was with Facebook in Ireland, not just in Ireland, but mainly based in Ireland, and I thought: I miss my family, and I'd love the kid to be raised with family. So for two years I worked in my friend's software studio on an internal startup that they were trying to kick-start. It was static code analysis. It's still around, it's called Codebeat, but it didn't really take off. After two years of it not taking off, I kind of gave up. I saw it going nowhere, and I wasn't really interested in software studio work. So I was like: well, I love cloud, so I will do some cloud consulting. I built clouds at Google, I built clouds at Facebook, I'll build on the clouds now. And so I was consulting for numerous companies, mainly in Europe: for example, a Berlin-based mobility company (I can hear a German accent, so you probably know it), but also Deliveroo, which is UK-based and pretty popular.
Marcin Wyszynski: It's like a DoorDash in Europe. And at Deliveroo especially, I was leading a major replatforming off Heroku, which was not scaling. It was the largest installation of Heroku in the world, in history, and at some point Heroku could just not ship what Deliveroo wanted. Heroku was actually begging Deliveroo to get off Heroku as soon as possible. And so we did. As part of that, we needed to replatform to AWS, and a large part of that was Terraform. On behalf of that client, I got approval for Terraform Enterprise, which was the official HashiCorp product. TL;DR: it was so bad that it nearly wiped out our production, and I had to go back to CircleCI. I said sorry to my boss: I didn't know, I meant well. I understand that I'm on a very good day rate, and I still cost you a few hundred quid for Terraform Enterprise; I'll make it up to you. And she's like: no, there's no need, humans make mistakes. And I'm like: no, honestly, I feel I can do better. So I built my own Terraform Enterprise as a way of saying sorry, and I retained the IP rights.
Marcin Wyszynski: And what I noticed when I built it was that it was an instant success. What I also noticed is that people were leaving Deliveroo for other companies, and in the City of London you have a lot of people moving around, right? People moving to all the different banks, all the different hedge funds. The first thing they would do when they looked at how infrastructure was done in those new companies was come back to me and ask whether I could implement this in their new company. And I thought: maybe some of the ideas I had were not that bad. So what I did is I went to HashiCorp and applied for a product manager role for Terraform Enterprise. I said: look, I had an experience with your product. The experience was such that I built something different. I wanted to show you some of the ideas, and maybe we can collaborate. They were like: you're crazy. Well, I don't think so. And I like proving to people that I'm not crazy. So I started my own business.
[00:12:06] Chapter 5: Open Source, Terraform, and the Birth of OpenTofu
Mirko Novakovic: That's a good way to start it. But at that time it was still Terraform? It was not yet OpenTofu?
Marcin Wyszynski: No, OpenTofu was two years ago. For nine years, HashiCorp had been benefiting from the open source community. Terraform was open source, the provider ecosystem was open source; they built the whole company around open source. And when the business was not going great, they said: joke, we were joking, we're closing the source now. It's going to be BSL, the Business Source License. So it's no longer open source, it's pay to play: we get to choose who is and who isn't allowed to use it. And we thought, I mean, we were not using Terraform directly, our customers were using Terraform directly, but it's horrible for a user, right? It's putting the future of the whole community in jeopardy, and it's putting our business model in jeopardy. So when that BSL change happened, we got together with our beloved competitors, and we said: look, we need to keep the open source infrastructure alive, not just for our own businesses, but for the whole community that depends on this being available. You can't imagine how many different open source but also commercial systems have Terraform as some part of their dependency chain; a good part of the CNCF ecosystem is based on Terraform in one shape or another. It's one of those very foundational technologies. And it turns out that if you do a rug pull on a key technology that underpins a lot of clouds, then a lot of people get very nervous, and a lot of business models and a lot of products become endangered. That's why OpenTofu has been such a resounding success.
Mirko Novakovic: Yeah, it makes sense. And then there is Spacelift, which you built on top. You told me before this conversation that it's kind of a multiplayer mode for infrastructure as code, right? It's built for enterprises.
[00:14:20] Chapter 6: Spacelift as the Multiplayer Mode for Infrastructure as Code
Marcin Wyszynski: Mainly yeah.
Mirko Novakovic: Yeah. So give us a little bit of a quick overview of what it does and why companies choose it.
Marcin Wyszynski: Yeah. So Spacelift is for IaC, infrastructure as code, what GitHub is for Git. You can run Terraform or Pulumi or CloudFormation from your command line, from your laptop, and as long as you're the only person working in the company, that's probably fine. The moment you start working as a team, it's like trying to build software without some central server. Technically speaking, you could send each other patches, git patches or git diffs; you could review them; you could rebase on top of patches. But no one works this way, right? Everyone works with GitHub because it's just more efficient. You want to centralize this workflow. You don't want to step on each other's toes, you don't want to have unresolved conflicts. This is where centralization gives you flexibility and control and speed. And essentially this is what Spacelift is. Technically speaking, we're a CI/CD platform for infrastructure as code, but from a functional point of view, we serve as a multiplayer mode for any sort of infrastructure work. We connect different pieces of your infrastructure. We make sure that changes are reviewed and that they're compliant with your company policies; policy as code is now a huge layer, not just Terraform itself but also Open Policy Agent. So you get a lot of automated review, automated notifications, incidents being fired if something goes wrong. Really, just automation around the process of building, deploying, and monitoring the deployment of infrastructure.
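Spacelift's policies are indeed written in Open Policy Agent's Rego language. As a rough illustration of the policy-as-code idea, here is a minimal Go sketch that evaluates a made-up deny rule against a toy plan; the rule, the `list_stacks`-style input schema, and the package name are invented for this example, not Spacelift's actual format:

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/rego"
)

// A hypothetical plan-approval policy: deny any run that deletes resources.
const policy = `
package plan

import rego.v1

deny contains msg if {
	change := input.changes[_]
	change.action == "delete"
	msg := sprintf("deleting %s is not allowed", [change.address])
}
`

func main() {
	ctx := context.Background()

	// Compile the Rego policy and prepare a query over its deny set.
	query, err := rego.New(
		rego.Query("data.plan.deny"),
		rego.Module("plan.rego", policy),
	).PrepareForEval(ctx)
	if err != nil {
		panic(err)
	}

	// A toy stand-in for a parsed Terraform/OpenTofu plan.
	input := map[string]interface{}{
		"changes": []map[string]interface{}{
			{"address": "aws_s3_bucket.logs", "action": "delete"},
		},
	}

	results, err := query.Eval(ctx, rego.EvalInput(input))
	if err != nil {
		panic(err)
	}
	fmt.Println(results) // a non-empty deny set would block the run
}
```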
Mirko Novakovic: Okay. So there are also a lot of workflows you can define and things like that.
Marcin Wyszynski: Yeah, and dependencies between different parts of your infrastructure. Your infrastructure is always a graph. But if you're a large company, that graph would comprise hundreds of thousands of elements, so you can't really manage it as one graph anymore. You need to manage them as individual graphs. But then you have the problem of a graph of graphs, right? Because those things are always connected; there's always some form of connection between the pieces. It's very rare that a piece of infrastructure is entirely self-contained. Yes, those exist, but most of the time a company's infrastructure is a graph of graphs. And then you go from "how do I synchronize a huge graph" to "how do I synchronize multiple small graphs". It's a better problem to have, but it's still a problem, still a challenge that you need to solve. And this is where Spacelift comes in.
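To make the graph-of-graphs idea concrete: each stack is itself a graph of resources, and stacks consume each other's outputs. A minimal sketch, with hypothetical stack names and no cycle detection, of ordering dependent stacks with a topological sort:

```go
package main

import "fmt"

// Each key is a stack (an independently managed graph of resources);
// its value lists the stacks whose outputs it consumes.
var deps = map[string][]string{
	"network":  {},
	"database": {"network"},
	"cluster":  {"network"},
	"app":      {"cluster", "database"},
}

// topoSort returns an order in which stacks can be safely applied:
// every stack appears after all of its dependencies.
func topoSort(deps map[string][]string) []string {
	visited := map[string]bool{}
	var order []string
	var visit func(string)
	visit = func(s string) {
		if visited[s] {
			return
		}
		visited[s] = true
		for _, d := range deps[s] {
			visit(d) // apply dependencies first
		}
		order = append(order, s)
	}
	for s := range deps {
		visit(s)
	}
	return order
}

func main() {
	fmt.Println(topoSort(deps)) // e.g. [network database cluster app]
}
```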
Mirko Novakovic: Yeah. By the way, that's very similar to what we have in observability. We also maintain a graph of the infrastructure components and of the components that run on top of the infrastructure, the application components and services, so that we understand the dependencies. If you have a problem in a cloud zone, say, you can understand what the impact is. Without the dependencies, which you can derive in many ways, but tracing is a good example, understanding the flow and the impact wouldn't work, right?
[00:17:45] Chapter 7: Observability, Graphs, and the Role of Dependencies
Marcin Wyszynski: I think a graph of graphs in Terraform infrastructure is not very different from something like a service map in observability.
Mirko Novakovic: Exactly, yeah, that's true. And talking about observability, I know you're a Datadog customer, right? So how do you use observability internally?
[00:18:01] Chapter 8: Observability Stack at Spacelift
Marcin Wyszynski: So we use two main observability tools. We use Datadog for almost everything, and we use Bugsnag (I think they're called differently now, but they used to be called Bugsnag); it's like Sentry, right, for exception tracking. I do understand that Datadog does exception tracking, but it was not exceptional, pun intended, for us. So the only thing we use outside of the Datadog suite is Bugsnag. Within Datadog, we use everything: we use metrics, we use logs, we use APM. We're very heavy on APM. We don't do unstructured logging all that much; unstructured logs are mainly for forensic purposes. For any sort of day-to-day monitoring, there will be metrics, and we also have APM, and we build metrics from APM spans. We also use some of the Datadog value-added services, like API testing, which they call synthetic testing. We plug into CI/CD observability a little bit, and we do database monitoring, although we also have a separate database monitoring solution. The reason we use Datadog for that is mainly that we set it up once and forgot to turn it off, I guess. I can't remember the name of that dedicated Postgres solution, but it's pretty good. What I like about the current setup is that everything is connected, so you can cross-reference different parts of your infrastructure: logs and metrics and spans and hosts. You can easily navigate through the whole thing. Is it magical? No. It's not that we use some special Datadog magic there; we just like the fact that it's connected. And since we set it up five years ago, we haven't seen much reason to really change. Well, now, with the cost, maybe we do, but it's always been an opportunity cost, right? Do we focus on moving off Datadog because they cost us too much, or do we build something that can give us better revenue?
Mirko Novakovic: No, absolutely. I mean, they are the leader in this space for a reason, right? It's a good product with a wide range of capabilities. And I like that you said you are a big APM user, because I've been in the APM space for 25 years now and I'm a big believer in APM. But a lot of the customers we talk to are still not using tracing and APM that much; they are still very much in the logging era, right? I don't really know why, but I think a lot of developers tend to prefer logs over tracing, even though tracing and APM can be much more powerful, especially for getting...
[00:21:12] Chapter 9: The Value of APM Versus Logging
Marcin Wyszynski: Context, right? Like, logs are logs. You understand maybe the scale at which something is happening, because you can group logs and see what the common log lines or patterns are. With APM, you get the whole understanding of what the call stack was: how did we even get to calling this particular log line? It's frequently not obvious how I got here.
Mirko Novakovic: I personally think it's really about control, right? I think developers love to have full control, and with logs, you use a logging framework and you can basically log wherever you want, whereas with APM, and we are turning now into the Otel discussion, before Otel it was pretty much proprietary, and it was kind of magical. You had an agent which did auto-instrumentation. I had this company Instana, which, talking about HashiCorp, was also sold to IBM; we sold to IBM in 2020. Our approach was to have auto-discovery on the agent side: you install the agent, it discovers, oh, there's Node.js running, or Java, or .NET, and then it injects an agent and traces out of the box. You don't have to do anything, which I always loved. But what I figured out is that some developers don't like it, because the magical part also means you are kind of losing control over what's happening. Developers like it when they have full control, right?
Marcin Wyszynski: It might be a bit noisy, the way I think about it, but yeah, I've worked with systems like this. I believe New Relic did something like this for Ruby applications: you didn't touch anything and it was auto-instrumented for you, back in my Ruby days. I haven't used New Relic for ages now, but yeah.
Mirko Novakovic: Yeah, I mean, I would say it's still very similar to Datadog these days. It's now also big, probably more than a billion in revenue. So it's a huge platform, and it's a good tool, too.
[00:23:13] Chapter 10: Migrating from Proprietary Observability to OpenTelemetry (Otel)
Marcin Wyszynski: Yeah. It's just I haven't had a chance to use it. I used it back when I was doing a bit of Ruby.
Mirko Novakovic: Yeah. They started with the magic, right?
Marcin Wyszynski: They're magical. They're magical.
Mirko Novakovic: Exactly, exactly. And I heard that you also switched to Otel internally, in parts, right? Give me some of your thoughts around it. Why did you do that?
Marcin Wyszynski: Sure. So the very immediate reason why we switched to Otel is that we started as pure SaaS, and in that case nobody cared what we were using internally for observability. But then we started getting requests for self-hosted. When we first did self-hosted, because we were on AWS (the SaaS is still on AWS), the easiest way to ship to customers was to ask them to give us an AWS account: we will give you scripts to deploy almost the same thing, minus some public stuff, on your AWS account. And then we can't assume they have Datadog anymore. They might, but they might also not. However, if they have AWS, then they definitely have CloudWatch Logs, they definitely have CloudWatch metrics, and they can enable X-Ray, right? So in order to ship to self-hosted AWS, we said: okay, if you have Datadog, here is the magic incantation to use Datadog telemetry, but if not, then out of the box we will use the AWS building blocks. And what we had to do was take all the code that was used to instrument things with Datadog and make it work with the AWS tooling. Luckily, we always did middlewares, so there was a well-defined interface between the code and Datadog. The code didn't even know it was dealing with Datadog. We had our own abstraction layer that passed the context around, and in the context you had the ability to call your current flavor of observability, be that logs or APM or metrics.
Mirko Novakovic: Like an interceptor or something.
Marcin Wyszynski: Yeah, exactly. So we had it for Datadog, and we had it for X-Ray. And at that point, people started asking: okay, but we want full autonomy. We want to set up on GCP; we want to set up in a fully air-gapped environment. Okay, so I can't assume Datadog anymore, and I can't assume AWS anymore. What can I assume? Well, there's just one thing you can assume, and luckily Datadog can use it, AWS can use it, everything can use it. And if your observability system doesn't speak Otel, frankly, that's your observability system's problem, not mine; then you're running something really obsolete, at which point I can't really help you much other than spitting everything to standard out, which I do anyway. If that's observability for you, then fine. But otherwise, Otel is what you can use. It turns out everyone speaks it. These days, even legacy vendors have some form of Otel compatibility. Since we implemented Otel, it's never been a problem.
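Spacelift hasn't published this layer, but the middleware pattern Marcin describes, an interface the application codes against with the chosen backend carried in the context, might look roughly like this in Go; all names here are hypothetical:

```go
package telemetry

import "context"

// Telemetry is the narrow interface the application codes against.
// Concrete implementations would exist per backend: Datadog,
// AWS X-Ray/CloudWatch, or the Otel SDK.
type Telemetry interface {
	StartSpan(ctx context.Context, name string) (context.Context, func())
	Count(ctx context.Context, metric string, value int64, tags map[string]string)
	Log(ctx context.Context, msg string, fields map[string]any)
}

type ctxKey struct{}

// WithTelemetry stashes the chosen backend in the context, so call
// sites never import a vendor SDK directly.
func WithTelemetry(ctx context.Context, t Telemetry) context.Context {
	return context.WithValue(ctx, ctxKey{}, t)
}

// From retrieves the current backend, falling back to a no-op.
func From(ctx context.Context) Telemetry {
	if t, ok := ctx.Value(ctxKey{}).(Telemetry); ok {
		return t
	}
	return noop{}
}

type noop struct{}

func (noop) StartSpan(ctx context.Context, _ string) (context.Context, func()) {
	return ctx, func() {}
}
func (noop) Count(context.Context, string, int64, map[string]string) {}
func (noop) Log(context.Context, string, map[string]any)             {}
```

With this shape, adding Otel support means writing one more implementation of the interface, without touching any call sites.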
Mirko Novakovic: Oh, it makes sense. But how was the implementation process? Was it hard or was it pretty straightforward?
Marcin Wyszynski: Because of the abstraction layer we were using, the implementation was just about re-implementing interfaces, and Otel and Datadog actually follow very similar principles, so it wasn't crazy. We didn't have to rewrite anything. We use push-based metrics, we don't use Prometheus, so it wasn't that crazy to port to Otel collectors: a few hundred lines of code, tops. Then, of course, you need to test it with a few solutions, and then you have the ability to choose which telemetry system you want to use. I believe we still have the ability to use native Datadog, but I wouldn't encourage it; there's absolutely zero value in using that over Otel. And then the documentation is the same for everyone.
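For flavor, push-based metrics over OTLP from Go look roughly like this; the meter and counter names are made up, and the collector endpoint comes from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Push-based: an OTLP exporter ships metrics to a collector over gRPC.
	exporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// A periodic reader pushes on an interval instead of waiting
	// to be scraped, Prometheus-style.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(10*time.Second))),
	)
	defer provider.Shutdown(ctx)
	otel.SetMeterProvider(provider)

	// Application code only sees the vendor-neutral API.
	meter := otel.Meter("spacelift-example")
	runs, err := meter.Int64Counter("runs.finished")
	if err != nil {
		log.Fatal(err)
	}
	runs.Add(ctx, 1, metric.WithAttributes(attribute.String("state", "finished")))
}
```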
[00:28:09] Chapter 11: Implementation Details and Dashboards
Mirko Novakovic: No, it makes sense. And how do you deal with things like dashboards and stuff? Is that on the customer side, or how do you do that?
Marcin Wyszynski: We tell them what the dashboards are. If they keep using Datadog, we have a giant JSON that they can import, and that gives them a dashboard similar to what we have, but for a single-tenant installation. For other vendors, we just show them which metrics and which spans might actually be interesting for them.
Mirko Novakovic: Yeah. We use Perses, right? Perses is the new CNCF standard for dashboards. You should look into it.
Marcin Wyszynski: I don't know it, right.
Mirko Novakovic: Yeah. It's a pretty new standard. It's called...
Marcin Wyszynski: Perses. Perses. Mythological.
Mirko Novakovic: Right.
Marcin Wyszynski: Yes.
Mirko Novakovic: Okay, yeah. It essentially is a standardization of the dashboarding format. Not too many vendors are supporting it yet, to be honest; I think Chronosphere does, we do, and one or two others. But at least you now have a standard where you could do the same, and it supports infrastructure as code: you can deploy dashboards with code and do stuff like that, right?
Marcin Wyszynski: You know what I really like about standards? It shifts the blame. You can't tell me that I don't support something: I'm using the most common standard, and if you're not using the most common standard, it's not my fault. I know it's a bit cynical, but you can't support every vendor that is out there. That's why I really like open standards. It shifts the conversation towards "we all need to implement the same standards"; otherwise it's just a Cartesian product of all the interfaces and all the protocols you could be using, right? If you're not implementing open standards in your product, in your observability stack, then there's only so much I can do. Of course, if you're going to pay me $2 million to interface with your proprietary logging system, then I will consider it. But the responsibility is on you, to some extent.
[00:30:30] Chapter 12: The Power and Benefit of Open Standards
Mirko Novakovic: No, and it's like what you said, right? The value of using something proprietary like Datadog's format is just not there anymore. If OpenTelemetry is good enough and provides everything you want, why would you not go with the standard instead of something proprietary? There's no reason.
Marcin Wyszynski: There's no reason. And then no vendor has a way to keep you using a product that is suboptimal, or to lock you into pricing that is not fair. If you can quickly take your toys and move to another sandbox, then the owner of the sandbox has a good reason to actually maintain it in good shape. It shifts the balance of power in the consumer's favor.
[00:31:21] Chapter 13: Envisioning Observability for Infrastructure as Code Tools
Mirko Novakovic: So how do you see OpenTelemetry becoming part of the whole infrastructure as code, Terraform, OpenTofu stack? In our pre-discussion I learned that a terraform apply can take 45 minutes, and you don't have too many insights into what's happening and what's failing. So what is needed to get that kind of visibility, observability, into that stack?
Marcin Wyszynski: Yeah. Sadly, even today, even with tools like Spacelift, Terraform and similar tools are black boxes. You put your code in, and it spits out either very little or a ton of completely unstructured logs, and best of luck understanding what it does without some crazy ETL and analysis and linking that stuff together. It's insane. You can't really understand what that black box is doing. So what OpenTofu has started working on, and they implemented it on their end, is Otel: instrumenting the call stacks in OpenTofu to understand how operations proceed through the whole dependency graph, the whole call stack in OpenTofu, where things are coming from, how pieces of infrastructure come together from the code that was fed to it. So it's no longer a black box inside OpenTofu. Sadly, it's still a black box between OpenTofu and the providers. OpenTofu, and Terraform before it, has this interesting architecture where the workflow manager is separate from the providers. You have the central workflow manager, which essentially coordinates providers, and the providers are connectors between the central workflow manager and individual APIs. You have an AWS provider, you have a GCP provider, you have an Azure provider for Microsoft, and those providers abstract the API. So even though you have OpenTofu nicely instrumented, in order to get the full value you probably also want all the providers instrumented. And because the ecosystem is so vast, and because it's third parties supporting those plugins, it'll sadly take a bit longer to get there. You try to provide increasing value by instrumenting the most common providers; there's a big five, obviously, and we'll try to instrument them to the point where you get the value. But the long tail, you can't possibly afford.
Mirko Novakovic: And instrumentation means you add some kind of logging, or you add tracing to it?
Marcin Wyszynski: Tracing, I think; it's more tracing, right? There is limited value in logging for understanding how things are tied together. It's really about where things come from, step by step, that same graph. If you can see that, this is where the value is: why certain things are being repeated. Maybe your API is rate limiting, and then you can see which calls were being rate limited, and you can slice and dice however you want. You get, and I love graphs, this full view of what was being retried, when it was being retried, how long it took. But sadly, a lot of that is on the provider. Now, you can still get a lot of value from seeing what OpenTofu was calling, right? If OpenTofu was calling your provider and the provider was responding with an error, then you can probably go to the logs, see what that error was, and you already have some value. But if you could make an APM for everything, it would be the ultimate observability insight. We're not there yet. It's an aspirational goal.
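The kind of provider-side span instrumentation Marcin is describing might look like the following Go sketch; the operation names, attributes, and retry logic are illustrative, not OpenTofu's or any provider's actual conventions:

```go
package provider

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("example-provider/aws")

var errRateLimited = errors.New("rate limited")

// callAWS is a stand-in for the real cloud SDK call.
func callAWS(ctx context.Context, bucket string) error { return errRateLimited }

// createBucket wraps one cloud API call in a span, so retries and
// rate-limit errors show up on the trace instead of in opaque logs.
func createBucket(ctx context.Context, name string) error {
	ctx, span := tracer.Start(ctx, "aws.s3.CreateBucket")
	defer span.End()
	span.SetAttributes(attribute.String("bucket.name", name))

	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		span.SetAttributes(attribute.Int("retry.attempt", attempt))
		if err = callAWS(ctx, name); err == nil {
			return nil
		}
		span.RecordError(err) // each failed attempt is visible on the span
		if !errors.Is(err, errRateLimited) {
			break
		}
	}
	span.SetStatus(codes.Error, err.Error())
	return err
}
```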
[00:35:38] Chapter 14: Tracing, Instrumentation, and Visualization
Mirko Novakovic: Yeah, but it's good that we now have the standard, so people can do it, right? And if a provider supports it, it would automatically plug in, right?
Marcin Wyszynski: As I said, it shifts the discussion. You're no longer asking "why am I supposed to implement this standard and not that standard": this is the standard. If you don't want to implement Otel, then what, you're arguing that observability is unnecessary? What is your point if you're refusing to go with the standard? You're arguing against common sense.
Mirko Novakovic: I guess I really have to look into it, because I could also see that it might require some sort of specialized user experience, compared to a standard trace in an application, probably.
Marcin Wyszynski: Not really. I mean, I think you just pass IDs through the gRPC calls and through the command line, like when you're starting...
Mirko Novakovic: Just in terms of visualization. Right?
Mirko Novakovic: You think a normal trace is good enough?
Marcin Wyszynski: It's no different than distributed tracing.
Mirko Novakovic: Yeah. Okay. That's good, that's good.
Marcin Wyszynski: Then each provider becomes like a separate service, right? You have the workflow orchestrator as one service and then each provider as a separate service, and then you visualize it as a distributed trace.
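Mechanically, that usually means propagating W3C trace context across the gRPC boundary so the orchestrator and each provider report as separate services on one trace. A hedged sketch using the Otel gRPC instrumentation; the service names and collector endpoint here are assumptions, not OpenTofu's actual wiring:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// initTracing names this process. Run it with "opentofu-core" on the
// orchestrator side and e.g. "provider-aws" inside the provider plugin,
// and the two appear as separate services on one distributed trace.
func initTracing(ctx context.Context, service string) *sdktrace.TracerProvider {
	exp, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}
	res, _ := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String(service)))
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	return tp
}

func main() {
	ctx := context.Background()
	tp := initTracing(ctx, "opentofu-core")
	defer tp.Shutdown(ctx)

	// The stats handler injects the trace context into gRPC metadata,
	// so the provider process picks it up with the matching server handler.
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... provider RPCs issued on conn now carry W3C trace context.
}
```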
Mirko Novakovic: Yeah, it makes total sense. Do they have some special semantic conventions for the tags already in place?
Marcin Wyszynski: Oh, yeah.
[00:37:19] Chapter 15: Instrumentation Standards and Semantic Conventions
Mirko Novakovic: Yeah, that's good. I have to look into it. Because we are slowly coming to the end of the conversation: I love to talk about AI with everyone, and I saw that there's not too much talk about AI on your website, and I guess that's not an accident, right? So how do you see AI evolving in your space, infrastructure as code and these things? What do you see working, and where do you see a lot of FOMO?
[00:37:54] Chapter 16: AI in Infrastructure as Code—Opportunities and Hype
Marcin Wyszynski: I think it's not FOMO. I think it's real, but I think we don't know how to use it yet. What I see is people using AI in a very shallow way, and I don't want to name names, but some of our competitors have implemented code generation: "I will generate Terraform for you." I'm like: Claude Code can do that for you. What is the value of Spacelift doing it for you? Look, I'm not against AI. Internally, we put a lot of effort into understanding AI and boosting our productivity with AI. But if you really want to be strategic about AI, if you really want your product to live in this AI ecosystem, then you should maybe shift your thinking away from "how do I do something shiny with AI". Generating Terraform is a parlor trick, and it's a stupid parlor trick, because every IDE extension that deals with Claude Code or ChatGPT or Gemini can do it for you. What's the point? Don't do a parlor trick. Think about how your tool, your platform, can live in the AI-driven ecosystem. So what we, for example, have been investing in is MCP. You can expose Spacelift and the Spacelift API through MCP to agent workflows. We have a CLI, so you get a standards-based MCP server, and then you can have agents talk to Spacelift using MCP, or program on top of Spacelift.
Marcin Wyszynski: Because we also expose our programming interface: if you want to vibe-code something on top of the Spacelift API, we give you the full suite of tools to do it. That is meaningful, as opposed to using AI to generate code. Or the other thing we've seen: "oh, I will provide a magical interface so that you can click on your Terraform code and I will explain this Terraform code to you." Well, yeah, but I can literally take my snippet of Terraform, paste it into ChatGPT, and it will do a fine job explaining it to me. It's a stupid parlor trick. What are you, an interface to ChatGPT? What's the value you're adding? Is it because you want to impress your investors with some AI sprinkles, or are you treating this seriously as a potential paradigm shift? I treat it seriously. That's why you don't see all that much about AI on Spacelift: I think we're still in discovery mode on what it means to be a good citizen of the AI ecosystem. And my feeling is that it's not about showing something shiny, "oh, look, I made ChatGPT do this for you, I made Claude do this for you." That's shallow. I think there's much more potential that we can use, but I don't think we've found it yet.
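MCP itself is JSON-RPC 2.0 over a transport such as stdio. The toy sketch below shows only the shape of such a server: it handles the real MCP methods tools/list and tools/call with a hypothetical list_stacks tool, and it skips the initialize handshake, parameter parsing, and error handling. It is in no way Spacelift's actual implementation:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// request is a minimal JSON-RPC 2.0 envelope, the wire format MCP uses.
type request struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Method  string          `json:"method"`
}

func reply(id json.RawMessage, result any) {
	out, _ := json.Marshal(map[string]any{"jsonrpc": "2.0", "id": id, "result": result})
	fmt.Println(string(out)) // stdio transport: one JSON message per line
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var req request
		if json.Unmarshal(scanner.Bytes(), &req) != nil {
			continue
		}
		switch req.Method {
		case "tools/list": // the agent discovers what it can call
			reply(req.ID, map[string]any{"tools": []map[string]any{{
				"name":        "list_stacks", // hypothetical tool
				"description": "List infrastructure stacks and their drift status",
				"inputSchema": map[string]any{"type": "object"},
			}}})
		case "tools/call": // the agent invokes a tool; a real server would
			// dispatch on the tool name and hit the platform API here
			reply(req.ID, map[string]any{"content": []map[string]any{{
				"type": "text", "text": "stack core-network: up to date",
			}}})
		}
	}
}
```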
Mirko Novakovic: No, but it is funny, because we talked about this before and came to the exact same conclusion, and we did exactly the same thing. We just launched our MCP server, and we have integrations with all the coding agents. We came to the same conclusion: why would we build a chat interface into our tool? Why would we build another "hey, build me a query" or "fix my code"? The way it should be is that in Claude Code or Cursor or whatever, you just connect to the MCP server and say: okay, take that information from Dash0, and be natively integrated. We came to the exact same conclusion.
Marcin Wyszynski: Yeah, and I think it's the future, right? In a sense, you're working with AI; you're not trying to build AI. And interestingly, AI itself, I think, is becoming a commodity. Whether you end up using Claude or ChatGPT or Gemini, I honestly had the exact same success connecting to all of them. The difference now is like 2%, 3%; some are better at this, some are better at that. To me, it almost feels like plugging into an electricity socket: I plug my workflow into one agent or another agent. Do I see different models? Same thing, honestly. Probably AI aficionados will call me crazy, but I get almost the exact same results.
Mirko Novakovic: And I mean, when Anthropic released MCP, they said it's like USB for AI, right? It's a standard plug. And I like that idea.
Marcin Wyszynski: Yeah, and they also have remote MCP now. Both options are good; sometimes you want local, sometimes you want remote. But I think remote gives you the sort of SaaS friendliness with MCP.
[00:43:07] Chapter 17: Future of AI Integration and Closing Thoughts
Mirko Novakovic: Yeah. Marcin, thank you for joining Code RED. It was fun talking to you, and it went by very quickly. I learned a lot, which was really interesting. I loved the final discussion around AI, because we see exactly the same thing: a lot of our competitors are adding these AI sprinkles, and we totally agree that you should instead be fully integrated where it's meaningful. It's impressive what you can do with those agents, and you have to become part of that ecosystem.
Marcin Wyszynski: Yes. That's our strategic goal as well.
Mirko Novakovic: Thank you. Marcin.
Marcin Wyszynski: Thank you.
Mirko Novakovic: Thanks for listening. I'm always sharing new insights and knowledge about observability on LinkedIn; you can follow me there for more. The podcast is produced by Dash0. We make observability easy for every developer.