Last updated: September 28, 2025

Mastering the OpenTelemetry Filelog Receiver

The OpenTelemetry Collector is the Swiss Army knife of modern observability, but not all applications are born cloud-native. Many critical systems, from legacy applications and databases to infrastructure components like NGINX, still write their most valuable diagnostic data to local log files. This is where the filelog receiver comes in.

It's a component that tails log files, parses their contents, and transforms them into structured OpenTelemetry Log Records. While the official documentation provides a reference of its many configuration options, it can be difficult to grasp the key concepts needed to use it effectively.

This guide will take you from the basics of tailing a file to building a reliable, production-grade pipeline for ingesting, parsing, and enriching file-based logs. You'll learn how to handle complex formats like multiline stack traces, manage log rotation gracefully, and ensure no data is lost when the Collector restarts.

How the Filelog receiver works

Before we get into configuration details, it helps to picture how the receiver handles a log file throughout its lifecycle. You can think of it as a simple repeating four-step loop:

  1. Discover: The receiver scans the filesystem at regular intervals, using the include and exclude patterns you've set, to figure out which log files it should pay attention to.

  2. Read: Once a file is picked up, the receiver opens it and begins following along as new lines are written. The start_at setting decides whether it reads the file from the beginning or only tails new content from the end.

  3. Parse: Each line (or block of lines, if multiline parsing is used) runs through a series of operators (if configured). These operators parse the raw text, pull out key attributes, assign timestamps and severity levels, and ultimately structure the log data.

  4. Emit: Finally, the structured log records are passed into the Collector's pipeline, where they can be filtered, transformed further, or exported to your backend.

This Discover -> Read -> Parse -> Emit loop forms the foundation of everything the receiver does.

Quick Start: tailing a log file

One of the most common cases is when your application is already writing logs in JSON format. For example, imagine you have an app logging to /var/log/myapp/app.log:

json
{"time":"2025-09-28 20:15:12","level":"INFO","message":"User logged in successfully","user_id":"u-123","source_ip":"192.168.1.100"}
{"time":"2025-09-28 20:15:45","level":"WARN","message":"Password nearing expiration","user_id":"u-123"}

Here's the minimal Collector configuration to read and parse these logs:

yaml
receivers:
  filelog:
    # 1. DISCOVER: Find all .log files in /var/log/myapp/
    include: [/var/log/myapp/*.log]
    # 2. READ: Start reading from the beginning of new files
    start_at: beginning
    # 3. PARSE: Use the json_parser operator
    operators:
      - type: json_parser
        # Tell the parser where to find the timestamp and how it's formatted
        timestamp:
          parse_from: attributes.time
          layout: "%Y-%m-%d %H:%M:%S"
        # Tell the parser which field contains the severity
        severity:
          parse_from: attributes.level

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [debug]

Here's a breakdown of the above configuration:

  • include: Points the receiver to all .log files in /var/log/myapp/.
  • start_at: beginning: Ensures the receiver processes the entire file the first time it sees it. By default (end), it would only capture new lines written after the Collector starts.
  • operators: In this case, there's just one: the json_parser. Its job is to take each log line, interpret it as JSON, and then promote selected fields into the log record's core metadata.
  • timestamp and severity: Within the json_parser, we pull the time and level fields out of the JSON and promote them to the top-level Timestamp and Severity fields of each log record.

With the debug exporter, you can see the parsed and structured output. The JSON fields are now promoted to the log record's Attributes, and the time and level values populate the top-level Timestamp and SeverityText fields:

text
LogRecord #0
ObservedTimestamp: 2025-09-28 20:48:36.728437503 +0000 UTC
Timestamp: 2025-09-28 20:15:12 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
Body: Str({"time":"2025-09-28 20:15:12","level":"INFO","message":"User logged in successfully","user_id":"u-123","source_ip":"192.168.1.100"})
Attributes:
-> user_id: Str(u-123)
-> source_ip: Str(192.168.1.100)
-> log.file.name: Str(app.log)
-> time: Str(2025-09-28 20:15:12)
-> level: Str(INFO)
-> message: Str(User logged in successfully)
Trace ID:
Span ID:
Flags: 0

Now the Collector isn't just tailing a file; it's transforming raw JSON into structured OpenTelemetry log data that seamlessly flows through the rest of your pipeline.
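
One thing worth noting in the output above is that Body still holds the original raw JSON string. If you'd rather make the human-readable message the body instead, a minimal sketch (assuming the message attribute produced by the parser above) is to append a move operator after the json_parser:

yaml
operators:
  - type: json_parser
    timestamp:
      parse_from: attributes.time
      layout: "%Y-%m-%d %H:%M:%S"
    severity:
      parse_from: attributes.level
  # Optional: replace the raw JSON body with the parsed message text
  - type: move
    from: attributes.message
    to: body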

Parsing unstructured text with regular expressions

Most infrastructure logs don't come neatly packaged as JSON. More often, they're plain text strings that follow a loose pattern, such as web server access logs, database query logs, or custom application messages. These logs are human-readable but difficult for machines to work with until they're given some structure.

To bridge that gap, the Collector provides the regex_parser operator. By applying regular expressions with named capture groups, you can slice a raw log line into meaningful pieces and promote them into structured fields.

For example, if you're tailing an NGINX access log file that contains entries in the common log format:

text
127.0.0.1 - - [28/Sep/2025:20:30:00 +0000] "GET /api/v1/users HTTP/1.1" 200 512
127.0.0.1 - - [28/Sep/2025:20:30:05 +0000] "POST /api/v1/login HTTP/1.1" 401 128

You can use the regex_parser to break each line into structured fields:

yaml
receivers:
  filelog:
    include: [/var/log/nginx/access.log]
    start_at: beginning
    operators:
      - type: regex_parser
        # Use named capture groups to extract data
        regex: '^(?P<client_ip>[^ ]+) - - \[(?P<timestamp>[^\]]+)\] "(?P<http_method>[A-Z]+) (?P<http_path>[^ "]+)[^"]*" (?P<status_code>\d{3}) (?P<response_size>\d+)$'
        # Parse the extracted timestamp
        timestamp:
          parse_from: attributes.timestamp
          layout: "%d/%b/%Y:%H:%M:%S %z"
        # Map status code ranges to severities (keys are severity levels)
        severity:
          parse_from: attributes.status_code
          mapping:
            info:
              - 2xx
              - 3xx
            warn: 4xx
            error: 5xx

The core of this setup is the regex field with named capture groups. Each group labels a slice of the line so the parser can turn it into an attribute: client_ip grabs the remote address, timestamp captures the bracketed time string, http_method and http_path pull the request pieces, status_code picks up the three-digit response code, and response_size records the byte count.

Once those attributes exist, the timestamp field parses the timestamp string into a proper datetime value, and the severity block translates status codes into meaningful severity levels using an explicit mapping: 2xx and 3xx responses as INFO, 4xx as WARN, and 5xx as ERROR.

The debug output confirms our success:

text
LogRecord #0
ObservedTimestamp: 2025-09-28 21:17:42.31729069 +0000 UTC
Timestamp: 2025-09-28 20:30:00 +0000 UTC
SeverityText: 200
SeverityNumber: Info(9)
Body: Str(127.0.0.1 - - [28/Sep/2025:20:30:00 +0000] "GET /api/v1/users HTTP/1.1" 200 512)
Attributes:
-> status_code: Str(200)
-> response_size: Str(512)
-> log.file.name: Str(access.log)
-> client_ip: Str(127.0.0.1)
-> timestamp: Str(28/Sep/2025:20:30:00 +0000)
-> http_method: Str(GET)
-> http_path: Str(/api/v1/users)
Trace ID:
Span ID:
Flags: 0

With a single expression and a couple of parsing steps, a flat NGINX access log is transformed into structured OpenTelemetry data. From there, your pipeline can enrich it further—for example, by mapping the captured fields to the OpenTelemetry Semantic Conventions for HTTP attributes.
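
As a rough sketch of what that enrichment could look like, the transform processor can copy the captured attributes onto semantic-convention names (the target attribute keys below are assumptions based on current HTTP conventions; adjust them to the version your backend expects):

yaml
processors:
  transform:
    log_statements:
      - context: log
        statements:
          # Copy the regex-extracted fields to semantic-convention attributes
          - set(attributes["http.request.method"], attributes["http_method"])
          - set(attributes["url.path"], attributes["http_path"])
          - set(attributes["http.response.status_code"], Int(attributes["status_code"]))
          - set(attributes["client.address"], attributes["client_ip"])

Remember to add transform to the processors list of your logs pipeline for it to take effect.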

Handling stack traces and multiline logs

Not all log entries fit neatly on a single line. A stack trace is a classic example:

text
2025-09-28 21:05:42 [ERROR] Unhandled exception: Cannot read property 'foo' of undefined
TypeError: Cannot read property 'foo' of undefined
    at Object.<anonymous> (/usr/src/app/index.js:15:18)
    at Module._compile (node:internal/modules/cjs/loader:1254:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1308:10)
    at Module.load (node:internal/modules/cjs/loader:1117:32)
    at Module._load (node:internal/modules/cjs/loader:958:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:81:12)
    at node:internal/main/run_main_module:17:47

If you feed this straight into the Collector, the receiver will treat each line as its own log entry. That's not what you'd want here since the error message and every stack frame belong to the same record.

The fix is to use the multiline configuration, which tells the receiver how to group lines together:

yaml
receivers:
  filelog:
    include: [/var/log/myapp/*.log]
    start_at: beginning
    multiline:
      # New entry starts when a line begins with "YYYY-MM-DD HH:MM:SS"
      line_start_pattern: ^\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}
    operators:
      - type: regex_parser
        regex: (?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[(?P<severity>[A-Za-z]+)\]\s+(?P<message>.+)
        timestamp:
          parse_from: attributes.timestamp
          layout: "%Y-%m-%d %H:%M:%S"
        severity:
          parse_from: attributes.severity

Here, the line_start_pattern acts as the anchor: a new log entry begins only when a line starts with a date in the form YYYY-MM-DD HH:MM:SS. Any line that doesn't match is automatically folded into the body of the previous entry.

The result is that the entire stack trace, from the error message down through each at ... line, gets captured as one structured log record. This way, you don't lose context when analyzing errors.
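
An alternative worth knowing about is the recombine operator, which performs the same grouping inside the operators pipeline and can be useful when merging needs to happen after another parsing step. A rough, untested sketch using the same date-prefix rule as the first-line marker:

yaml
operators:
  - type: recombine
    combine_field: body
    # A line that starts with a date begins a new log entry
    is_first_entry: body matches "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"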

Handling log rotation seamlessly

Log files don’t grow indefinitely. At some point, they'll get rotated. The filelog receiver is built to handle common rotation patterns (like renaming app.log to app.log.1) automatically and without losing data.

It works by tracking files with a unique fingerprint (taken from the first few kilobytes) rather than just the filename. When a file is rotated, the receiver recognizes that the old file has been renamed, finishes reading it to the end, and then begins reading the new file from the start.

There’s no special configuration required for this behavior; it works out of the box.
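
If you ever do need to tune it, two related settings are worth knowing about; the sketch below simply spells out their documented defaults:

yaml
receivers:
  filelog:
    include: [/var/log/myapp/*.log]
    # How often to rescan for new, renamed, or rotated files (default: 200ms)
    poll_interval: 200ms
    # Bytes read from the start of a file to compute its identity fingerprint (default: 1kb)
    fingerprint_size: 1kb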

How to avoid lost or duplicate logs

What happens if the Collector process restarts? Without care, you risk either re-ingesting old data or skipping over new logs. If you set start_at: beginning, the receiver will reread all your log files and create massive duplication. If you set start_at: end, it will miss any logs written while the Collector was down.

The solution is checkpointing. By configuring a storage extension, you instruct the filelog receiver to save its position (the last read offset for each file) to disk.

yaml
extensions:
  file_storage:
    directory: /var/otelcol/storage

receivers:
  filelog:
    include: [/var/log/myapp/*.log]
    start_at: beginning
    # Link the receiver to the storage extension
    storage: file_storage

# ... processors, exporters

service:
  # The extension must be enabled in the service section
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [filelog]
      # ...

With the storage extension enabled, the receiver will:

  1. On startup, check the storage directory for saved offsets.
  2. Resume reading from the saved offset for any file it was tracking, ensuring no data is lost or duplicated.
  3. Periodically update the storage with its latest progress.

This is an essential best practice for any production deployment.

Filelog receiver tips and best practices

When troubleshooting the filelog receiver, a few issues come up again and again. Let's look at a few of these below:

  • The most common issue is that the logs don't show up. In almost every case, the cause is permissions. The fix is to ensure the user running the Collector can read not just the log files, but also the directories that contain them.

  • Another frequent culprit is the start_at setting. By default it is set to end, which means the receiver will only collect new lines written after startup. If you are testing against an existing file that isn’t actively being written to, change it to beginning so the entire file is ingested. Finally, double-check your glob pattern. If you are trying to match files in nested directories, remember to use ** (for example, /var/log/**/*.log), as shown in the sketch after this list.

  • Another common frustration is when your regular expression doesn't match the log lines. When in doubt, test it outside the Collector first. Tools like Regex101 are invaluable for verifying your expression, especially if you select the "Golang" flavor to match the Collector’s regex engine. Subtle whitespace or hidden characters are often the reason a pattern fails.

  • Finally, if your logs are being duplicated on restart, you need to enable a storage extension that allows the receiver to checkpoint its position in each file and resume cleanly, without data loss or duplication.
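
As a quick illustration of the glob advice above, here's a sketch that matches logs in nested directories while excluding paths you don't want (the paths are placeholders):

yaml
receivers:
  filelog:
    # ** matches files in nested directories
    include: [/var/log/**/*.log]
    # Skip anything you don't want, such as the Collector's own log file
    exclude: [/var/log/otelcol/*.log]
    start_at: beginning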

Final thoughts

The filelog receiver is an essential bridge between traditional file-based logging and the world of modern, structured observability. By mastering its core concepts of discovery, parsing with operators, and stateful checkpointing, you can reliably ingest data from any application that writes to a file.

Once you have transformed your raw text into well-structured OpenTelemetry logs, the full power of the Collector is at your disposal. You can now filter, enrich, and route this data to any observability backend, turning forgotten log files into a rich source of actionable insight.

Authors
Ayooluwa Isaiah