How to Monitor NGINX Performance

Monitor NGINX performance using stub_status, access log metrics, and the OTel Collector nginx receiver. Learn which signals matter and what each one tells you.

NGINX exposes its internal state through two native signals: a real-time status endpoint and access logs. They measure different things, and using only one of them leaves you blind to a whole category of problems.

The status endpoint tells you about connection saturation: how many connections are open, how many the server dropped, whether workers are approaching their limits. The access logs tell you about request-level performance: latency, status codes, upstream response times. You need both.

Enable the stub_status endpoint

First, check that your NGINX build includes the ngx_http_stub_status_module:

bash

nginx -V 2>&1 | grep --color -- --with-http_stub_status_module

Most package-managed builds include it. If the output shows the flag, you're good. If not, you'll need to rebuild with --with-http_stub_status_module added to the configure script.

Add this location block inside a server block, ideally one that only listens on localhost:

nginx

server {
    listen 127.0.0.1:8080;

    location /nginx_status {
        stub_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

Restrict access strictly. This endpoint doesn't require authentication by itself, so a misconfigured allow/deny will expose connection counts to the internet. Reload the config and check the output:

bash

curl http://127.0.0.1:8080/nginx_status

You'll see something like:

Active connections: 82
server accepts handled requests
 18432 18432 24007
Reading: 0 Writing: 4 Waiting: 78

Here's what each number means:

Active connections: All open connections, including those in the waiting state.
accepts / handled: Cumulative counters. Under normal conditions these are equal. If handled falls behind accepts, NGINX is dropping connections. That usually means worker_connections or file descriptor limits have been reached.
requests: Total requests served. Divide by handled to get requests per connection, a rough measure of keep-alive efficiency.
Reading: Connections where NGINX is reading the request header. A persistent spike here often points to slow clients or a Slowloris-style connection flood.
Writing: Connections actively writing a response. This is where real work happens.
Waiting: Idle keep-alive connections. High numbers are usually fine; they just consume memory and connection slots.
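
The counters above reduce to two derived values worth watching: dropped connections and requests per connection. Here is a minimal sketch, assuming the stub_status output format shown earlier. The function name summarize_stub_status is hypothetical, and the sample text is the example output from this article rather than live data.

```bash
# Hypothetical helper: reads stub_status text on stdin and prints
# dropped connections (accepts - handled) and requests per connection.
summarize_stub_status() {
    awk '
        /^server accepts handled requests/ {
            getline                      # the counters are on the next line
            accepts = $1; handled = $2; requests = $3
        }
        END {
            printf "dropped=%d\n", accepts - handled
            if (handled > 0)
                printf "requests_per_connection=%.2f\n", requests / handled
        }
    '
}

# Sample input copied from the example output above (not live data).
SAMPLE='Active connections: 82
server accepts handled requests
 18432 18432 24007
Reading: 0 Writing: 4 Waiting: 78'

printf '%s\n' "$SAMPLE" | summarize_stub_status
```

Against a live server you would pipe curl -s http://127.0.0.1:8080/nginx_status into the function instead of the sample text. For the sample, it reports zero dropped connections and roughly 1.30 requests per connection.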

To calculate requests per second, sample the requests counter twice and divide the difference by the elapsed time:

bash

R1=$(curl -s http://127.0.0.1:8080/nginx_status | awk '/requests/ {getline; print $3}')
sleep 10
R2=$(curl -s http://127.0.0.1:8080/nginx_status | awk '/requests/ {getline; print $3}')
echo "RPS: $(( (R2 - R1) / 10 ))"

Configure access logs for latency

The default NGINX log format doesn't include request latency or upstream response time, both of which are essential for diagnosing slow responses. Override the format in your nginx.conf:

nginx

log_format monitoring_format '$remote_addr - $remote_user [$time_local] '
                              '"$request" $status $body_bytes_sent '
                              'rt=$request_time urt=$upstream_response_time';

access_log /var/log/nginx/access.log monitoring_format;

The two fields that matter most:

$request_time: Total time from when NGINX received the first byte from the client to when it sent the last byte of the response. This is your end-to-end latency from NGINX's perspective.
$upstream_response_time: Time the backend took to respond. If $request_time is high but $upstream_response_time is normal, the problem is between the client and NGINX. If both are high, the backend is slow.
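
The comparison between the two fields can be sketched as a small log filter. This assumes the monitoring_format defined above. The function name, the 1-second threshold, and the sample log lines are all made up for illustration.

```bash
# Hypothetical helper: reads access log lines in monitoring_format on stdin
# and labels each slow request by where the time was likely spent.
# THRESHOLD (seconds) is an assumed value; tune it to your own SLO.
# Limitation: if urt lists several upstreams ("0.1, 0.2"), only the first is used.
classify_slow_requests() {
    awk -v threshold="${THRESHOLD:-1.0}" '
        {
            rt = ""; urt = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^rt=/)  rt  = substr($i, 4)
                if ($i ~ /^urt=/) urt = substr($i, 5)
            }
            if (rt == "" || rt + 0 < threshold) next   # not slow, skip
            if (urt == "" || urt == "-")               # served without an upstream
                print "no-upstream", rt, urt
            else if (urt + 0 >= threshold)
                print "backend-slow", rt, urt
            else
                print "client-side-or-nginx", rt, urt
        }
    '
}

# Invented sample lines: one fast, one slow backend, one slow client download.
SAMPLE_LOG='203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /api/a HTTP/1.1" 200 512 rt=0.120 urt=0.118
203.0.113.6 - - [10/Oct/2024:13:55:37 +0000] "GET /api/b HTTP/1.1" 200 2048 rt=2.500 urt=2.480
203.0.113.7 - - [10/Oct/2024:13:55:38 +0000] "GET /big.bin HTTP/1.1" 200 9000000 rt=3.100 urt=0.050'

printf '%s\n' "$SAMPLE_LOG" | classify_slow_requests
```

The fast request is skipped, the second line is labeled backend-slow, and the third is labeled client-side-or-nginx because the upstream answered quickly while the total request time was high.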

To get a quick p99 of $request_time from recent logs:

bash

awk '{for (i = 1; i <= NF; i++) if ($i ~ /^rt=/) print substr($i, 4)}' /var/log/nginx/access.log | sort -n | awk '{a[NR] = $1} END {if (NR) {i = int(NR * 0.99); if (i < NR * 0.99) i++; print a[i]}}'

For ongoing latency tracking, a log shipper (Fluent Bit, Vector) parsing rt= and urt= fields into your time-series store is more practical than running awk against log files. Dash0's NGINX logs guide walks through the full pipeline from JSON log format to a queryable stream.

Collect metrics with the OpenTelemetry Collector

If you're already running the OpenTelemetry Collector, the nginx receiver (nginxreceiver, shipped in the Collector's contrib distribution) scrapes stub_status on a configurable interval and emits standardized metrics:

yaml

receivers:
  nginx:
    endpoint: "http://127.0.0.1:8080/nginx_status"
    collection_interval: 10s

exporters:
  otlp:
    endpoint: "https://ingress.REGION.dash0.com:4317"  # copy your endpoint from app.dash0.com → Settings → Endpoints
    headers:
      Authorization: "Bearer ${DASH0_AUTH_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [nginx]
      exporters: [otlp]

The receiver emits these metrics:

Metric	What it tracks
`nginx.connections_accepted`	Cumulative accepted connections
`nginx.connections_handled`	Cumulative handled connections
`nginx.requests`	Cumulative requests
`nginx.connections_current`	Current connections by state (active/reading/writing/waiting)

The difference between nginx.connections_accepted and nginx.connections_handled, computed as a rate, is your dropped connection rate. Alert on any nonzero sustained value here.
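
Your metrics backend would normally compute this rate for you, but the arithmetic is simple enough to sketch by hand. The helper name drop_rate and the counter values below are hypothetical.

```bash
# Hypothetical helper: dropped connections per second between two samples.
# Arguments: accepts1 handled1 accepts2 handled2 elapsed_seconds
drop_rate() {
    awk -v a1="$1" -v h1="$2" -v a2="$3" -v h2="$4" -v dt="$5" 'BEGIN {
        if (dt + 0 <= 0) exit 1           # elapsed time must be positive
        # Growth of the accepts-handled gap, divided by the sample interval.
        printf "%.2f\n", ((a2 - h2) - (a1 - h1)) / dt
    }'
}

# Made-up counters: the gap grows from 0 to 30 over 10 seconds.
drop_rate 18432 18432 19000 18970 10
```

With these made-up numbers the result is 3.00 dropped connections per second. The counters reset when NGINX restarts, so a negative result means the two samples straddle a restart and should be discarded.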

Common pitfalls

The accepts/handled gap goes unnoticed until it's too late. Most teams alert on error rate and latency but skip dropped connections. By the time latency spikes, NGINX has often been silently dropping connections for minutes. Track the rate of accepts minus handled as a leading indicator: it fires before users notice anything.

worker_connections is half its advertised capacity when proxying. When NGINX acts as a reverse proxy, each client request consumes two connection slots: one for the client-to-NGINX connection and one for the NGINX-to-upstream connection. Your effective capacity is (worker_processes * worker_connections) / 2. Teams routinely size worker_connections without accounting for this and wonder why connections drop at half the expected load.
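
The capacity formula is easy to check with a few lines of shell. The helper name effective_capacity and the values 4 and 1024 are assumptions chosen for illustration.

```bash
# Hypothetical helper: estimate how many concurrent clients NGINX can hold.
# Arguments: worker_processes worker_connections [proxy]
effective_capacity() {
    total=$(( $1 * $2 ))
    if [ "${3:-}" = "proxy" ]; then
        echo $(( total / 2 ))    # each proxied request holds two slots
    else
        echo "$total"
    fi
}

effective_capacity 4 1024         # no proxying: 4096
effective_capacity 4 1024 proxy   # reverse proxy: 2048
```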

File descriptors hit the ceiling before worker_connections does. NGINX warns at startup when worker_connections exceeds the open file limit, but it starts anyway. In containerized environments, the container runtime often applies its own FD limit, which can cap what your NGINX config is able to request. Check the actual limit on a running worker with prlimit -n -p <worker_pid>. If the FD limit is lower than worker_connections, the FD ceiling is what actually drops connections, and stub_status won't tell you that. Look for accept4() failed (24: Too many open files) in your error logs.
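
On Linux, the same limit is also readable from /proc/<worker_pid>/limits, which makes the comparison scriptable. The sketch below assumes that file's usual column layout. The helper name check_fd_limit and the sample values are hypothetical.

```bash
# Hypothetical helper: reads the text of /proc/<pid>/limits on stdin and
# compares the soft "Max open files" limit with worker_connections ($1).
check_fd_limit() {
    awk -v wc="$1" '
        /^Max open files/ {
            soft = $4                # fields: "Max open files", soft, hard, units
            if (soft != "unlimited" && soft + 0 < wc + 0)
                printf "WARN fd_limit=%s worker_connections=%s\n", soft, wc
            else
                printf "OK fd_limit=%s worker_connections=%s\n", soft, wc
        }
    '
}

# Sample text in the /proc/<pid>/limits layout (values are made up).
SAMPLE_LIMITS='Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 524288               files'

printf '%s\n' "$SAMPLE_LIMITS" | check_fd_limit 4096
```

For a running worker, redirect /proc/<worker_pid>/limits into the function instead of the sample text. An OK result only means the limit is not below worker_connections; when proxying, each request can hold two descriptors, so leave headroom.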

High Waiting isn't always a problem, but it eats your headroom. Keep-alive connections in the waiting state hold a connection slot but do nothing. On busy servers with long keep-alive timeouts, the waiting count can consume a large fraction of worker_connections before a traffic spike actually arrives. Monitor the ratio of Waiting to total Active connections. If it's consistently above 80%, reduce keepalive_timeout. The default of 75 seconds is generous.
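
The ratio can be read straight off the stub_status output. This is a minimal sketch that assumes the output format shown earlier; the helper name waiting_ratio is hypothetical, and the input reuses the example numbers from this article.

```bash
# Hypothetical helper: reads stub_status text on stdin and prints the share
# of active connections that are idle keep-alives, as a percentage.
waiting_ratio() {
    awk '
        /^Active connections:/ { active = $3 }
        /^Reading:/            { waiting = $6 }   # "Reading: N Writing: N Waiting: N"
        END { if (active > 0) printf "%.1f\n", 100 * waiting / active }
    '
}

printf 'Active connections: 82\nReading: 0 Writing: 4 Waiting: 78\n' | waiting_ratio
```

For the example output, 78 of 82 connections are waiting, which is about 95.1% and well above the 80% mark.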

$request_time and $upstream_response_time diverge when TLS termination is expensive. If you're handling TLS at NGINX, the TLS handshake cost is included in $request_time but not in $upstream_response_time. For HTTPS-heavy traffic, a gap between the two doesn't necessarily mean the backend is fast. It might mean clients are negotiating new TLS sessions frequently. Correlate with connection reuse rate (requests / handled) to check.

Final thoughts

The two signals serve different purposes. stub_status tells you whether NGINX itself is healthy. Access logs tell you whether the experience it's delivering is healthy. Neither gives you the full picture on its own.

The practical alert set: a nonzero rate of dropped connections (accepts - handled), p99 request time above your SLO threshold, a rising error rate on 5xx status codes, and active connections above 70% of worker_processes * worker_connections.
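
Of these alerts, the 5xx error rate is the one you can spot-check directly from the access log. The sketch below assumes the monitoring_format defined earlier and a request line that contains no literal double quote. The helper name error_rate_5xx and the sample lines are made up.

```bash
# Hypothetical helper: reads access log lines on stdin and prints the
# percentage of responses with a 5xx status.
# Splitting on double quotes puts " $status $body_bytes_sent ..." in field 3.
error_rate_5xx() {
    awk -F'"' '
        {
            split($3, rest, " ")         # rest[1] is $status
            total++
            if (rest[1] ~ /^5[0-9][0-9]$/) errors++
        }
        END { if (total > 0) printf "%.1f\n", 100 * errors / total }
    '
}

# Invented sample lines: one 502 out of four requests.
SAMPLE_LOG='203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512 rt=0.010 urt=0.008
203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /api HTTP/1.1" 502 157 rt=0.004 urt=0.003
203.0.113.6 - - [10/Oct/2024:13:55:38 +0000] "GET /a HTTP/1.1" 200 90 rt=0.020 urt=0.019
203.0.113.7 - - [10/Oct/2024:13:55:39 +0000] "GET /b HTTP/1.1" 404 153 rt=0.001 urt=-'

printf '%s\n' "$SAMPLE_LOG" | error_rate_5xx
```

The sample yields 25.0, since only the 502 counts and the 404 does not. In practice you would feed it a recent slice of the log, for example the output of tail -n 10000 /var/log/nginx/access.log.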

Dash0 ingests NGINX metrics via OTLP alongside logs and distributed traces from the rest of your stack, so you can correlate an NGINX latency spike with what's happening in your backends from a single view. Dash0 also parses NGINX log formats natively, so the rt= and urt= fields you add to your access logs become queryable without custom parsers.

Start a free trial to get NGINX metrics, logs, and traces in one place. If you're running NGINX as a Kubernetes Ingress controller, the Observing Ingress-NGINX with OpenTelemetry and Dash0 post covers the full three-signal setup (traces, metrics, and logs) with a ready-to-run demo.