Docker Container Isolation

Docker container isolation is the set of Linux kernel features that makes a process running inside a container believe it has its own machine, while keeping it from interfering with the host or other containers. The two primitives doing the work are namespaces, which control what a process can see, and cgroups, which control what it can use.

The isolation is real, but it's not the same boundary a virtual machine gives you, and a few of its defaults leave sharper edges than people expect. This article walks through how the isolation works at the kernel level, where the boundary actually sits, and what Docker doesn't isolate unless you ask it to.

How namespaces isolate what a container sees

A Linux namespace is a kernel-level partition of a global system resource. When you put a process inside a namespace, it gets its own private view of that resource, while processes outside the namespace continue to see the original. The Linux kernel provides eight namespace types. Docker creates fresh PID, mount, network, UTS, and IPC namespaces for every container by default, plus a private cgroup namespace on hosts running cgroups v2 (on cgroups v1 hosts the container shares the host's cgroup namespace unless you ask otherwise). User namespaces are opt-in, and a new time namespace is not created by default. The eight, with what each isolates:

  • PID namespace. The container has its own process ID (PID) tree starting at PID 1. Processes inside can't see or signal processes on the host. The host can see container processes, but with different PIDs than the container sees internally.
  • Mount namespace. Each container gets its own filesystem mount table. The image's root filesystem appears as /, and the host's filesystem is invisible unless explicitly bind-mounted in.
  • Network namespace. Each container gets its own network stack: interfaces, routing table, iptables rules, port numbers. Two containers can both listen on port 80 without conflict because they're in different network namespaces.
  • UTS namespace. The name is historical (UNIX Time-sharing System), but today this namespace controls hostname and domain name. Each container has its own, so running hostname inside the container returns the container's hostname (by default, the short container ID) rather than the host's name.
  • IPC namespace. Each container has its own inter-process communication (IPC) objects, including System V shared memory segments and POSIX message queues.
  • User namespace. Each container can have its own mapping of user IDs (UIDs) and group IDs (GIDs). This is the boundary that, when enabled, prevents root inside the container from being root on the host. Docker does not enable this by default.
  • Cgroup namespace. Hides the host's cgroup hierarchy from the container, so the container can't see resource limits or accounting for other processes.
  • Time namespace. Lets a container have its own view of system time. Available since Linux 5.6 in 2020, but Docker doesn't create a new one by default.

You can see the namespaces of a running container directly. The kernel exposes them as symlinks under /proc/[pid]/ns/:

bash
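# Start a container and look up the host PID of its main process
docker run -d --name demo nginx:alpine
pid=$(docker inspect -f '{{.State.Pid}}' demo)
# Reading the namespace links of a root-owned process requires root
sudo ls -l "/proc/$pid/ns/"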

The output looks like this:

lrwxrwxrwx 1 root root 0 May 22 10:24 cgroup -> 'cgroup:[4026532163]'
lrwxrwxrwx 1 root root 0 May 22 10:24 ipc -> 'ipc:[4026532161]'
lrwxrwxrwx 1 root root 0 May 22 10:24 mnt -> 'mnt:[4026532159]'
lrwxrwxrwx 1 root root 0 May 22 10:24 net -> 'net:[4026532164]'
lrwxrwxrwx 1 root root 0 May 22 10:24 pid -> 'pid:[4026532162]'
lrwxrwxrwx 1 root root 0 May 22 10:24 pid_for_children -> 'pid:[4026532162]'
lrwxrwxrwx 1 root root 0 May 22 10:24 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 22 10:24 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 22 10:24 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 22 10:24 uts -> 'uts:[4026532160]'

Each number in brackets is the unique inode for that namespace. Compare with the host's own namespaces in /proc/1/ns/ (you need root to read them) and you'll see most of them differ. The exceptions are user (inode 4026531837 here) and time (inode 4026531834), which match the host because Docker doesn't unshare either by default. The user case is the one with serious security implications, and we'll come back to it.
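The comparison described above can be scripted. The following is a minimal sketch, not an official tool: compare_ns is a hypothetical helper name, and it simply compares the symlink targets under /proc/PID/ns/ for two processes. It assumes a Linux host, and reading another user's namespace links usually requires root.

```shell
#!/bin/sh
# compare_ns is a hypothetical helper: it compares the namespaces of two
# processes by the inode shown in each /proc/PID/ns/ symlink target.
# Usage: compare_ns PID_A PID_B
compare_ns() {
  for ns in cgroup ipc mnt net pid time user uts; do
    a=$(readlink "/proc/$1/ns/$ns" 2>/dev/null)
    b=$(readlink "/proc/$2/ns/$ns" 2>/dev/null)
    if [ -z "$a" ] || [ -z "$b" ]; then
      # Missing permissions, or a kernel that lacks this namespace type
      echo "$ns unreadable"
    elif [ "$a" = "$b" ]; then
      echo "$ns shared"
    else
      echo "$ns private"
    fi
  done
}

# Sanity check: a process shares every namespace with itself,
# so no line here should say "private".
compare_ns "$$" "$$"
```

Run as root with 1 as the first argument and the container's PID from docker inspect as the second. On a setup like the one shown above, you would expect user and time to report as shared and the rest as private.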

How cgroups limit what a container uses

Namespaces handle visibility. Cgroups handle resource accounting and limits. A cgroup (control group) is a kernel mechanism for grouping processes and applying constraints on CPU time, memory, block I/O, and a handful of other resources.

When you run a container with resource flags, Docker creates a cgroup for that container and enforces the limits there:

bash
docker run -d --name limited \
  --memory="512m" \
  --cpus="0.5" \
  nginx:alpine

The --memory=512m flag sets the cgroup's memory limit to 512 MiB. If the container's processes try to use more than that, the kernel first tries to reclaim memory from the cgroup and, if that isn't enough, invokes the out-of-memory (OOM) killer on a process inside the cgroup. The --cpus=0.5 flag sets a CPU quota: the container gets at most 50% of one CPU core, averaged over a scheduling period.
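To make the arithmetic behind those two flags concrete, here is a small sketch that converts them into the raw numbers a cgroup would hold. The helper names mem_to_bytes and cpus_to_cpu_max are made up for this example, and the CPU conversion assumes the default 100000 microsecond scheduling period.

```shell
#!/bin/sh
# Hypothetical helper: convert a Docker-style memory string to bytes.
# Suffixes b, k, m, g are treated as powers of 1024.
mem_to_bytes() {
  num=${1%[bBkKmMgG]}
  case "$1" in
    *[kK]) echo $((num * 1024)) ;;
    *[mM]) echo $((num * 1024 * 1024)) ;;
    *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
    *)     echo "$num" ;;
  esac
}

# Hypothetical helper: convert a --cpus value to "<quota_us> <period_us>",
# assuming the default period of 100000 microseconds (100 ms).
cpus_to_cpu_max() {
  awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

mem_to_bytes 512m      # prints 536870912
cpus_to_cpu_max 0.5    # prints 50000 100000
```

On a cgroups v2 host, these are the values you would expect to find in the memory.max and cpu.max files of the container's cgroup: a quota of 50000 microseconds of CPU time per 100000 microsecond period is the "50% of one core" described above.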

Modern Linux distributions ship with cgroups v2, which unifies the resource controllers into a single hierarchy and gives more consistent accounting for things like buffered I/O. Docker picks up v2 transparently on those systems.
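If you're not sure which version a host is running, one common heuristic is to look for a cgroup.controllers file at the top of the cgroup mount, which is present on a v2 unified hierarchy. The sketch below uses a hypothetical helper name, cgroup_version, and takes the mount point as an optional argument so it can be pointed at any directory.

```shell
#!/bin/sh
# Hypothetical helper: guess the cgroup layout mounted at a given root
# (default: /sys/fs/cgroup). A cgroup.controllers file at the top of the
# mount indicates the v2 unified hierarchy.
cgroup_version() {
  root=${1:-/sys/fs/cgroup}
  if [ -f "$root/cgroup.controllers" ]; then
    echo "v2"
  elif [ -d "$root" ]; then
    echo "v1 or hybrid"
  else
    echo "unknown"
  fi
}

cgroup_version
```

This is a heuristic rather than an authoritative check. Recent Docker versions also report the cgroup version in the output of docker info.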

Without cgroups, a single misbehaving container could exhaust host memory, peg every CPU core, or saturate disk I/O, dragging down everything else on the box. The resource flags are the difference between "isolation" and "isolation that holds under load."
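One way to check whether a memory limit has actually been enforced is the oom_kill counter in the cgroup's memory.events file (cgroups v2). The sketch below parses that counter. oom_kills is a hypothetical helper name, and the sample file contents are invented purely to show the format.

```shell
#!/bin/sh
# Hypothetical helper: print the oom_kill counter from a cgroup v2
# memory.events file, or 0 if the file has no such line.
# Usage: oom_kills /path/to/memory.events
oom_kills() {
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) print 0 }' "$1"
}

# Demo against a made-up sample in the "key value" format the kernel uses.
sample=$(mktemp)
printf 'low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n' > "$sample"
oom_kills "$sample"    # prints 1
rm -f "$sample"
```

Inside a container on a cgroups v2 host the file is typically /sys/fs/cgroup/memory.events. The path on the host side depends on the cgroup driver, so treat any specific location as an assumption to verify on your system.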

Why this isn't the same as a virtual machine

Here's the part people miss when they treat containers as lightweight VMs: every container on a host shares the same kernel.

A virtual machine runs its own kernel inside a hypervisor-managed boundary. To escape from a VM into the host, an attacker has to find a vulnerability in the hypervisor, a comparatively narrow interface that is backed by hardware virtualization and built specifically to be defended. A container, by contrast, is just a process. The boundary is a set of kernel data structures saying "this process can only see these PIDs, this network, this filesystem." If you find a kernel bug, that boundary is gone, because the kernel is what was enforcing it in the first place.

This isn't theoretical. Kernel bugs that pierce or weaken container isolation show up regularly. Early 2022 alone brought three: CVE-2022-0185 in fs_context, CVE-2022-0492 in cgroups v1's release_agent, and the Dirty Pipe vulnerability (CVE-2022-0847) in the kernel's pipe handling, reachable through splice(). They're rare in well-patched systems, but the attack surface a container sits behind is the entire Linux kernel running on the host, not a thin hypervisor.

The practical implication is that Docker's isolation is fine for separating cooperating workloads in a single trust domain: your services, your team's apps, your dev environments. It is not the right boundary for running untrusted code or multi-tenant workloads where one tenant must not be able to affect another. For those cases you want a stronger boundary. gVisor puts a user-space kernel between the container and the host kernel, while Kata Containers and Firecracker run each workload in a lightweight microVM, giving you the container UX with a hardware-enforced isolation boundary. The microVM model is what AWS Lambda uses under the hood, via Firecracker.

What Docker doesn't isolate by default

Even when your threat model is the standard one, a few defaults catch people out.

Root in the container is root on the host. Unless you explicitly enable user namespace remapping, UID 0 inside the container is UID 0 outside. If a process breaks out through a kernel bug or a misconfigured bind mount, it does so as root on the host. Setting USER in your Dockerfile to a non-root UID, or running with --user 1000:1000, mitigates most of this without needing to enable userns-remap (which has its own compatibility tradeoffs).
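You can check how container root maps to the host by reading the uid_map file of the container's main process, where each line has the form "inside-ID outside-ID length". The sketch below is illustrative only: root_mapping is a hypothetical helper name, and it assumes the file is read from the host, so the second column is a host UID.

```shell
#!/bin/sh
# Hypothetical helper: report how UID 0 inside a process's user namespace
# maps to the outside. Each uid_map line is: inside-ID outside-ID length.
# Usage: root_mapping /proc/PID/uid_map
root_mapping() {
  awk '$1 == 0 {
         if ($2 == 0) { print "container root is host root (UID 0)" }
         else         { print "container root maps to host UID " $2 }
         found = 1
       }
       END { if (!found) print "UID 0 is not mapped" }' "$1"
}

# Without user namespace remapping, the map is the identity mapping
# "0 0 4294967295", so this reports that root is host root.
root_mapping /proc/self/uid_map
```

For a container, pass /proc/PID/uid_map using the PID from docker inspect. With userns-remap enabled you would expect the second column to be a high, unprivileged host UID rather than 0.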

--privileged removes nearly all isolation. It disables seccomp, AppArmor and SELinux, capabilities filtering, and device cgroup restrictions. A privileged container is essentially a process running as root on the host with the namespace cosmetics still applied. Use it only when you actually need it (Docker-in-Docker, certain debugging tools), and never for application workloads.

Bind mounts can hand the host filesystem to a container. docker run -v /:/host ... gives a container read-write access to the entire host filesystem. Mounting /var/run/docker.sock is just as dangerous: it is equivalent to root on the host, because the container can ask the Docker daemon to launch another, more privileged container.

Host namespace flags opt out of isolation per dimension. --network=host puts the container in the host's network namespace. --pid=host lets it see and signal every process on the box. --ipc=host shares System V IPC. Each of these flags trades a specific isolation guarantee for some convenience, and each should be deliberate, not default.
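Flags like these are easy to miss in a long docker run line, so a pre-flight check in a wrapper script or CI job can help. The following is a deliberately simple sketch: audit_run_args is a hypothetical helper, not a Docker feature. It only recognizes the --flag=value spellings plus a bare host-root or docker.sock mount argument, so treat it as a starting point rather than a complete audit.

```shell
#!/bin/sh
# Hypothetical helper: warn about docker run arguments that weaken isolation.
# Usage: audit_run_args ARG...  (the arguments you would pass to docker run)
# Prints one warning per risky argument and returns 1 if any were found.
# Limitation: space-separated forms such as "--network host" are not detected.
audit_run_args() {
  risky=0
  for arg in "$@"; do
    case "$arg" in
      --privileged|--privileged=true)
        echo "WARN $arg: disables most isolation layers"; risky=1 ;;
      --network=host|--net=host|--pid=host|--ipc=host)
        echo "WARN $arg: shares a host namespace"; risky=1 ;;
      /:*|--volume=/:*)
        echo "WARN $arg: bind-mounts the host root filesystem"; risky=1 ;;
      *docker.sock*)
        echo "WARN $arg: exposes the Docker daemon socket"; risky=1 ;;
    esac
  done
  return "$risky"
}

# Example: prints two warnings, then the summary line.
audit_run_args --rm -v /:/host --pid=host nginx:alpine \
  || echo "risky flags found"
```

The check works on the argument list rather than on a running container, which keeps it usable before anything is started. For containers that already exist, docker inspect exposes the same settings under HostConfig.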

The kernel attack surface is shared. Every container on the host uses the same syscall interface, the same filesystem drivers, the same network stack. A kernel bug exploitable from one container is exploitable from all of them. This is the argument for keeping the host kernel patched and for keeping container images minimal: fewer binaries inside a container means fewer interesting tools available to an attacker who manages to get a shell.

Final thoughts

Docker container isolation is namespaces and cgroups doing real work in the kernel, with seccomp, capabilities, and AppArmor or SELinux as additional layers on top. It's strong enough for the common case of co-locating your own services on a host. It's not strong enough to treat as a security boundary against untrusted code, and the defaults leave a few sharp edges (root mapping, privileged mode, host mounts, shared kernel) that are worth knowing about before they bite you in production.

Once you understand where the isolation boundary actually is, the next question is whether your containers are behaving the way you expect inside it: running as the right user, staying within their cgroup limits, not making syscalls you didn't anticipate. That's where runtime observability earns its keep.

Dash0's infrastructure monitoring tracks per-container resource usage against cgroup limits alongside real-time logs and distributed traces, so you can catch a container approaching its memory limit or behaving oddly before it becomes an incident.
