The Minecraft Server repository on GitHub provides step-by-step setup instructions you can follow along with.
A Minecraft server to call my own
I want a Minecraft server for multiplayer, where I can do mischief with the kids. And hopefully not embarrass myself too much by being repeatedly killed by angry chickens or whatever.
Kids these days think Java is not a cool programming language. Little do they know that it powers one of the games they love most: Minecraft.
(Well, technically the “original” Minecraft server is written in Java. Microsoft made things confusing by adding the Bedrock server, which reportedly uses a combination of C, C# and Java, and is different in terms of gameplay from vanilla in subtle ways. Opinions on Reddit differ on why Bedrock needed to exist in the first place.)
There are so many ways to host a Minecraft server when one considers the multiple launchers (ATLauncher, CurseForge, Bukkit, Fabric, Forge and the fifty more you are likely already typing in the comments). But I am a man of simple tastes, and running the “vanilla” Minecraft server as a Systemd unit on a Linux VM in the cloud is exactly my cup of tea.
And if I have learned anything about providing IT infrastructure to my family, it is that their expectations on SLOs and system reliability are up there with NASA’s Moon exploration programs. So, the Minecraft server should work reliably and, if it goes down, I should know well before they do.
Hence: I need monitoring.
Lots and lots of monitoring.
The monitoring setup at a glance
The monitoring setup for my Minecraft server is shown in this diagram:
There are three components that collaborate in collecting the telemetry to send to Dash0:
- The OpenTelemetry Java Agent runs inside the Java Virtual Machine powering the Minecraft server itself, and it reports runtime telemetry about the JVM to the OpenTelemetry Collector.
- The Minecraft Exporter for Prometheus collects Prometheus metrics that are specific to Minecraft like player count, how many blocks have been mined, and most importantly: how many cake slices have been eaten.
- The OpenTelemetry Collector, well, collects more telemetry: it reads Systemd logs for the Minecraft server and the other components, receives telemetry from the OpenTelemetry Java Agent, scrapes the Minecraft Exporter, normalizes the telemetry (most importantly, adding resource metadata to neatly categorize which telemetry comes from what), and sends it all off to Dash0.
Runtime metrics with the OpenTelemetry Java Agent
The OpenTelemetry Java Agent has an extensive set of automatic instrumentations for distributed tracing across many application protocols like HTTP, databases, messaging queues, etc. But none of them are quite applicable to a Minecraft server: Minecraft clients talk over a TCP socket with a protocol specific to Minecraft, and there is no distributed tracing instrumentation for it in the OpenTelemetry Java Agent.
And to be honest, I am not missing it: I don’t see value in creating a span, say, every time I get manhandled by an Enderman. First of all: I don’t need that recorded for posterity. And secondly: that is way too much entropy for the good of the universe.
But what we do get out of the box with the OpenTelemetry Java Agent is runtime metrics about the Java Virtual Machine itself, especially CPU and memory. In Dash0, it’s one click to import the Java integration and get out-of-the-box visibility into the key JVM metrics.
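Attaching the agent is a matter of passing `-javaagent` to the JVM. Here is a sketch of what the Systemd unit could look like; the paths, memory settings, and Collector endpoint are assumptions to adapt to your own setup:

```ini
# Sketch of a minecraft-server.service Systemd unit.
# All paths and the OTLP endpoint below are assumptions.
[Unit]
Description=Minecraft server
After=network.target

[Service]
WorkingDirectory=/opt/minecraft
Environment=OTEL_SERVICE_NAME=minecraft-server
Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
ExecStart=/usr/bin/java -Xmx4G \
    -javaagent:/opt/minecraft/opentelemetry-javaagent.jar \
    -jar server.jar nogui
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The `OTEL_SERVICE_NAME` environment variable is what later lets us filter telemetry by `service_name = "minecraft-server"`.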
Minecraft-specific metrics with a Prometheus exporter
Collecting JVM metrics will tell us a lot about whether the server is running, or why in some situations it may feel slow, for example CPU usage spiking because of garbage collection. But what about the fun stuff, like how many players are connected, or how many blocks are mined?
Besides, there are things I will make sure do not get recorded, like how many times I died (the minecraft_deaths_total counter). Luckily, Dash0’s Spam filters will allow me to bury my shame with just a couple of clicks.
Many, many Prometheus exporters
First of all, there is a lot of software out there that can expose Prometheus metrics about a Minecraft server. A search on GitHub yielded:
- The Minecraft Prometheus Exporter, which uses the extensibility of Bukkit to add a Prometheus endpoint via an additional JAR file. I wanted to use a vanilla Minecraft server, so anything relying on Bukkit is not an option for me.
- The minecraft-prometheus-exporter (names in the Prometheus ecosystem tend to be pretty to the point, which leads to name clashes) which uses Fabric, another way to run Minecraft servers with mods. Like Bukkit, Fabric was not an option for me.
- The minecraft-exporter, written in Python. I truly had no wish to wrangle Python and its package ecosystem, so I gave this a pass.
- And finally, the one I went with: the Minecraft Prometheus Exporter by Engin Diri, which ticked all the boxes for me: written in Go, easy to download from the GitHub releases page, and exposing a lot of cool telemetry.
Now, finally armed with a Prometheus exporter that tickles my fancy, let’s have a look at how Prometheus exporters generally work. Because, if you come from the world of OpenTelemetry, it may not be what you expect.
Push vs Pull
Collecting metrics about the Minecraft-specific aspects of a Minecraft server is a bit more convoluted than collecting telemetry about the Java Virtual Machine it runs on. While both the OpenTelemetry Java Agent and the Minecraft Exporter collect metrics, there is a fundamental difference in how they ship the metrics to a backend. Specifically: the OpenTelemetry Java Agent pushes metrics towards a destination (in our case, the OpenTelemetry Collector), while the Minecraft Exporter needs to have its metrics pulled.
The pull model is a distinctive aspect of the Prometheus ecosystem, where you need something to scrape (i.e., pull at regular intervals metrics out of) your Prometheus endpoint. In our case, scraping too is something that the OpenTelemetry Collector can do with its prometheusreceiver.
Note, the name “receiver” can be misleading here: usually the OpenTelemetry Collector receives telemetry, as in, it is sent telemetry from something else. So components that “add telemetry” into the Collector are generally called receivers, even when they “actively get” telemetry from somewhere else, like in the case of scraping Prometheus endpoints or, as we will see later, collecting Journald logs.
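To make the pull model concrete, here is a minimal, stdlib-only Python sketch of how an exporter behaves: it does nothing until a scraper asks, and then merely renders its current state. The metric name is illustrative, not the Minecraft Exporter’s actual output.

```python
# Minimal, stdlib-only sketch of the Prometheus pull model. The metric
# name below is illustrative, not the Minecraft Exporter's real output.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

METRICS = {"minecraft_player_online_total": 2}

def render_metrics(metrics):
    """Render a metric map in the Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The exporter never pushes: it just renders its current state
        # in response to an HTTP GET from whoever scrapes it.
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output clean

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # The "scrape": the client (normally the Collector) pulls the
    # metrics at its own cadence.
    url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
    print(urllib.request.urlopen(url).read().decode(), end="")
    # prints: minecraft_player_online_total 2
    server.shutdown()
```

The roles are inverted compared to OTLP: the scraper, not the exporter, decides when data moves.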
Collecting logs
The last piece of telemetry we need for our setup is logs. Those are not just the logs of the Minecraft server itself, which include information about crashes, slowdowns, player activity and so on, but also the logs of the other components of the monitoring setup, namely the OpenTelemetry Collector and the Minecraft Exporter. Since the OpenTelemetry Java Agent runs inside the Minecraft server, that is, inside the same Java Virtual Machine, the logs collected from the Minecraft server include the OpenTelemetry Java Agent ones.
In my setup, I run each of the components as a Systemd unit. The logs generated by Systemd units are collected by Journald, and the OpenTelemetry Collector can get the logs from it.
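Putting the pieces together, a sketch of the Collector configuration for this setup could look like the following. The exporter’s port, the unit names, and the Dash0 endpoint and token are assumptions, not the exact values from my setup:

```yaml
receivers:
  otlp:              # telemetry pushed by the OpenTelemetry Java Agent
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317
  prometheus:        # scrapes the Minecraft Exporter (port is an assumption)
    config:
      scrape_configs:
        - job_name: minecraft-exporter
          scrape_interval: 30s
          static_configs:
            - targets: ["127.0.0.1:9150"]
  journald:          # Systemd unit logs via Journald
    units:
      - minecraft-server
      - minecraft-exporter
      - otelcol

processors:
  resourcedetection:
    detectors: [system]

exporters:
  otlp/dash0:        # endpoint and token are placeholders
    endpoint: ingress.eu-west-1.aws.dash0.com:4317
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [resourcedetection]
      exporters: [otlp/dash0]
    logs:
      receivers: [journald]
      processors: [resourcedetection]
      exporters: [otlp/dash0]
```

Note that the journald receiver ships in the Collector “contrib” distribution, so the plain core distribution will not do here.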
Alerting
For now, I am going to keep it simple: I want to be notified if the server is down. Ideally, before my son sends me some sternly-worded messages.
Is the server running?
In Dash0, I can check for that with the following PromQL expression:
```promql
absent({
  otel_metric_name = "jvm.cpu.time",
  process_command_args =~ ".*server\\.jar.*",
  service_name = "minecraft-server",
  service_namespace = "minecraft"
})
```
The alert fires if there is no CPU usage reported by the JVM running the Minecraft server, which is a pretty good proxy for “the server is not running”.
Notice that this also nicely doubles as a dead man’s switch for the entire setup: for example, if the Minecraft server is running, but the OpenTelemetry Collector is not, I’d still get paged.
Is the server restarting?
Knowing whether no server is running is a good start, but not nearly enough. Specifically, I should check for restarts of the JVM, which Systemd is going to do if the server crashes. This can be accomplished with a rule on logs, which in Dash0 I can query with PromQL using the “magic” dash0.logs metric:
```promql
sum by (otel_log_severity_range) (
  increase({
    otel_metric_name = "dash0.logs",
    service_name = "minecraft-server",
    otel_log_body =~ "^Starting minecraft server.*"
  }[10m])
) > $__threshold
```
This rule will trigger when the server restarts, and the alert will resolve itself at the next evaluation without restarts. In Dash0 I can set this to be a warning and route those alerts to Slack, so that I get a ping when there is some downtime. The $__threshold symbol is an optional Dash0 feature that allows you to specify the severity of an alert per threshold. This is an extension of the Prometheus alerting model, where severity is modeled only in labels: that means a rule can have only one severity, and one ends up having multiple copies of a rule, differing only in the hard-coded threshold and label.
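For contrast, here is a sketch of that duplication in plain Prometheus alerting rules; the metric name and thresholds are illustrative, not real:

```yaml
groups:
  - name: minecraft-restarts
    rules:
      # The same expression twice: only the hard-coded threshold
      # and the severity label differ between the two copies.
      - alert: MinecraftServerRestarting
        expr: sum(increase(minecraft_restarts_total[10m])) > 1
        labels:
          severity: warning
      - alert: MinecraftServerRestarting
        expr: sum(increase(minecraft_restarts_total[10m])) > 5
        labels:
          severity: critical
```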
Is the server crashing?
But there is an even more interesting alert I can raise based on logs, and that is when Systemd fails to start the Minecraft server altogether, which was happening a lot as I was working out the setup:
```promql
sum by (otel_log_severity_range) (
  increase({
    otel_metric_name = "dash0.logs",
    service_name = "minecraft-server",
    otel_log_body =~ "^Failed to start the minecraft server.*"
  }[1m])
) > $__threshold
```
Besides, I do not need to know PromQL to create this rule: Dash0 has a query builder for counting logs matching specific filters:
Why use logs instead of metrics?
In the Prometheus ecosystem, the traditional way to know if a server is up is to check the up metric associated with scraping the server itself. I felt like I would get much more bang for my buck checking logs instead, and in Dash0 that is also accomplished via PromQL.
What I do miss, however, is metrics about the status of Systemd units. While there is a systemdreceiver in the OpenTelemetry Collector, when I looked into its code hoping to find a metric reporting the status of Systemd units (which would have been perfect for my alert rule), I was surprised to find out that the receiver seems to do precisely nothing.
Conclusions
Setting up a Minecraft server and monitoring it with OpenTelemetry, a Prometheus exporter and Dash0 was a really fun project. Having an excuse to dust off my Java and Linux sysadmin muscles was very welcome.
I could have spent more time on the setup, and come up with dashboards. But honestly, that is not the important bit to me: if the server is up, I’ll be busy playing. So, instead of a dashboard, I did this: