Cgroups & Namespaces

Why This Matters

Every time you run a Docker container, Kubernetes pod, or LXC system container, two Linux kernel features do all the heavy lifting behind the scenes: namespaces and cgroups. Namespaces provide isolation -- making a process believe it is alone on the system. Cgroups provide resource control -- ensuring a runaway process cannot eat all your CPU or memory.

Understanding these two primitives is the difference between "I run containers" and "I understand how containers work." When a container breaks, leaks memory, or behaves strangely with networking, knowing namespaces and cgroups lets you diagnose the problem at the kernel level instead of blindly restarting things.

In this chapter, we will not just read about these concepts. We will use unshare to create namespaces by hand, manually set up cgroups to limit CPU and memory, and -- in the final exercise -- build a mini container from scratch using nothing but standard Linux utilities.


Try This Right Now

See how many namespaces your current shell process belongs to:

$ ls -la /proc/$$/ns/
lrwxrwxrwx 1 user user 0 Feb 21 10:00 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 user user 0 Feb 21 10:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 user user 0 Feb 21 10:00 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 user user 0 Feb 21 10:00 net -> 'net:[4026531840]'
lrwxrwxrwx 1 user user 0 Feb 21 10:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 user user 0 Feb 21 10:00 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 user user 0 Feb 21 10:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 user user 0 Feb 21 10:00 uts -> 'uts:[4026531838]'

Each line is a namespace. The numbers in brackets are inode numbers; two processes share a namespace of a given type exactly when these numbers match. That is literally how the kernel tracks namespace membership.
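You can verify this sharing without root: a child process inherits its parent's namespaces, so the symlinks for both resolve to the same inode.

```shell
# A child process inherits its parent's namespaces, so both
# /proc/<pid>/ns/uts links resolve to the same inode
parent=$(readlink /proc/$$/ns/uts)
child=$(sh -c 'readlink /proc/$$/ns/uts')   # inner $$ = the child's own PID
echo "parent: $parent"
echo "child:  $child"
[ "$parent" = "$child" ] && echo "same UTS namespace"
```

The final line prints "same UTS namespace". After an unshare --uts (shown later in this chapter), the two links would differ.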

Now check cgroups:

$ cat /proc/$$/cgroup
0::/user.slice/user-1000.slice/session-1.scope

That tells you which cgroup your shell belongs to (on a cgroups v2 system). Every process on Linux is in a cgroup.


Linux Namespaces: The Isolation Engine

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of that resource.

Linux provides eight types of namespaces:

Namespace   Flag               What It Isolates
---------   ----               ----------------
PID         CLONE_NEWPID       Process IDs -- process sees itself as PID 1
Network     CLONE_NEWNET       Network stack -- own interfaces, routing, firewall
Mount       CLONE_NEWNS        Filesystem mount points
UTS         CLONE_NEWUTS       Hostname and domain name
IPC         CLONE_NEWIPC       System V IPC, POSIX message queues
User        CLONE_NEWUSER      User and group IDs (UID/GID mapping)
Cgroup      CLONE_NEWCGROUP    Cgroup root directory
Time        CLONE_NEWTIME      System clocks (since kernel 5.6)

Normal Linux:                         With Namespaces:

All processes see:                    Process in container sees:
  - All PIDs (1-30000+)               - Only its own PIDs (1-50)
  - All network interfaces             - Its own eth0, lo
  - All mount points                   - Its own /proc, /sys, /tmp
  - The real hostname                  - Its own hostname
  - All users                          - Mapped UIDs (root=0 inside)

PID Namespace

The PID namespace gives a process its own view of the process tree. The first process in a new PID namespace becomes PID 1 (the init process for that namespace).

# Create a new PID namespace and run bash in it
$ sudo unshare --pid --fork --mount-proc bash

# Inside the new namespace:
root# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   7236  4016 pts/0    S    10:00   0:00 bash
root         2  0.0  0.0  10072  3344 pts/0    R+   10:00   0:00 ps aux

Notice there are only two processes. Your bash is PID 1. The host's thousands of processes are invisible. But from the host, this bash process has a normal PID (say, 12345).

# Exit the namespace
root# exit

Network Namespace

A network namespace gives a process its own network stack: its own interfaces, routing table, firewall rules, and port space.

# Create a named network namespace
$ sudo ip netns add testns

# List network namespaces
$ sudo ip netns list
testns

# Run a command inside the namespace
$ sudo ip netns exec testns ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Only the loopback interface exists, and it is DOWN. This namespace is completely isolated from the host network. Let us connect it:

# Create a veth pair (virtual ethernet cable)
$ sudo ip link add veth-host type veth peer name veth-ns

# Move one end into the namespace
$ sudo ip link set veth-ns netns testns

# Configure the host end
$ sudo ip addr add 10.0.0.1/24 dev veth-host
$ sudo ip link set veth-host up

# Configure the namespace end
$ sudo ip netns exec testns ip addr add 10.0.0.2/24 dev veth-ns
$ sudo ip netns exec testns ip link set veth-ns up
$ sudo ip netns exec testns ip link set lo up

# Test connectivity
$ sudo ip netns exec testns ping -c 2 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.048 ms

This is exactly how Docker and other container runtimes set up container networking.

┌─────────────────────────┐    ┌────────────────────────┐
│     Host Namespace      │    │   testns Namespace     │
│                         │    │                        │
│   veth-host             │    │          veth-ns       │
│   10.0.0.1/24 ─────────┼────┼──────── 10.0.0.2/24   │
│                         │    │                        │
│   eth0 (real NIC)       │    │   (no real NIC)        │
│   192.168.1.100         │    │                        │
└─────────────────────────┘    └────────────────────────┘
         veth pair = virtual cable

Clean up:

$ sudo ip netns delete testns
# Deleting the namespace also destroyed the veth pair, so this may
# harmlessly report that the device does not exist
$ sudo ip link delete veth-host 2>/dev/null

UTS Namespace

The UTS namespace isolates the hostname. Containers use this so each one can have its own hostname.

$ sudo unshare --uts bash

root# hostname container-demo
root# hostname
container-demo

root# exit

# Host hostname is unchanged
$ hostname
your-real-hostname

Mount Namespace

The mount namespace gives a process its own view of the filesystem mount points. Mounts made inside the namespace are invisible outside.

$ sudo unshare --mount bash

root# mkdir /tmp/private-mount
root# mount -t tmpfs tmpfs /tmp/private-mount
root# echo "secret" > /tmp/private-mount/file.txt
root# cat /tmp/private-mount/file.txt
secret

root# exit

# From the host, the mount is not visible
$ ls /tmp/private-mount/
# The directory itself exists (mkdir wrote to the real filesystem),
# but it is empty -- the tmpfs mount was private to the namespace

User Namespace

The user namespace maps UIDs and GIDs. A process can be root (UID 0) inside a user namespace while being an unprivileged user on the host. This is the foundation of rootless containers.

# As a regular user, create a new user namespace
$ unshare --user --map-root-user bash

root# id
uid=0(root) gid=0(root) groups=0(root)

root# whoami
root

# But on the host, you are still your regular user
root# cat /proc/$$/uid_map
         0       1000          1
# Format: <first UID inside> <first UID outside> <count>
# UID 0 inside maps to UID 1000 outside (a range of one UID)

root# exit

Think About It: User namespaces are what make rootless Podman and rootless Docker possible. The container process believes it is running as root (UID 0), but on the host it is actually your regular unprivileged user. If the container is compromised, the attacker only has your user's privileges, not root.


Cgroups: The Resource Control Engine

While namespaces provide isolation (what a process can see), cgroups provide limitation (what a process can use). Cgroups control:

  • CPU -- how much processor time a group of processes gets
  • Memory -- how much RAM (and swap) processes can consume
  • I/O -- disk read/write bandwidth limits
  • PIDs -- maximum number of processes
  • Network (indirectly) -- via traffic shaping

Cgroups v1 vs Cgroups v2

Linux has two versions of cgroups, and the distinction matters.

Feature               Cgroups v1                                  Cgroups v2
-------               ----------                                  ----------
Hierarchy             Multiple hierarchies (one per controller)   Single unified hierarchy
Filesystem            /sys/fs/cgroup/<controller>/                /sys/fs/cgroup/ (unified)
Status                Legacy (still supported)                    Current standard
systemd integration   Works but messy                             Native integration

Check which version you are using:

$ stat -fc %T /sys/fs/cgroup/

  • cgroup2fs = cgroups v2 (unified)
  • tmpfs = cgroups v1 (or hybrid)

# On cgroups v2, list the available controllers
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

Most modern distributions (Fedora 31+, Ubuntu 21.10+, Debian 12+, Arch) default to cgroups v2.
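The version check above can be wrapped in a small script -- a sketch that branches on the filesystem type reported by stat:

```shell
# Detect the cgroup version from the filesystem mounted at
# /sys/fs/cgroup (cgroup2fs = v2, tmpfs = v1 or hybrid)
case "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)" in
    cgroup2fs) echo "cgroups v2 (unified hierarchy)" ;;
    tmpfs)     echo "cgroups v1 (or hybrid)" ;;
    *)         echo "could not determine cgroup version" ;;
esac
```

Handy at the top of provisioning scripts that need to write to different paths depending on the cgroup layout.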

The Cgroup Filesystem

Cgroups are managed entirely through a virtual filesystem. Creating directories creates cgroups. Writing to files configures limits. It is elegant in its simplicity.

/sys/fs/cgroup/                          (root cgroup)
├── cgroup.controllers                   (available controllers)
├── cgroup.subtree_control               (enabled controllers for children)
├── user.slice/                          (user sessions)
│   └── user-1000.slice/
│       └── session-1.scope/
│           ├── cgroup.procs             (PIDs in this cgroup)
│           ├── memory.current           (current memory usage)
│           └── cpu.stat                 (CPU statistics)
├── system.slice/                        (system services)
│   ├── sshd.service/
│   ├── nginx.service/
│   └── docker.service/
└── init.scope/                          (PID 1)

Hands-On: Creating Cgroups and Setting Limits

Let us manually create a cgroup and limit memory.

Step 1: Create a cgroup (cgroups v2):

# Enable the cpu, memory, and pids controllers for children of the root
$ sudo sh -c 'echo "+memory +pids +cpu" > /sys/fs/cgroup/cgroup.subtree_control'

# Create a new cgroup
$ sudo mkdir /sys/fs/cgroup/demo-group

# Verify which controllers are available (only those enabled in the
# parent's cgroup.subtree_control appear here)
$ cat /sys/fs/cgroup/demo-group/cgroup.controllers
cpu memory pids

Step 2: Set a memory limit:

# Limit to 50MB of memory
$ sudo sh -c 'echo 52428800 > /sys/fs/cgroup/demo-group/memory.max'

# Verify
$ cat /sys/fs/cgroup/demo-group/memory.max
52428800

Step 3: Add a process to the cgroup:

# Move the current shell into the cgroup ($$ expands in your shell
# before sudo runs, so it writes your shell's PID)
$ sudo sh -c "echo $$ > /sys/fs/cgroup/demo-group/cgroup.procs"

# Verify
$ cat /proc/$$/cgroup
0::/demo-group

Step 4: Test the limit:

# Try to allocate more than 50MB
$ python3 -c "
data = []
try:
    while True:
        data.append('A' * 1024 * 1024)  # 1MB at a time
        print(f'Allocated {len(data)} MB')
except MemoryError:
    # Rarely reached under a cgroup limit: the kernel normally
    # SIGKILLs the process before Python can raise MemoryError
    print(f'Hit memory limit at {len(data)} MB')
"
Allocated 1 MB
Allocated 2 MB
...
Allocated 45 MB
Killed

The kernel's OOM killer terminated the process because it exceeded the cgroup memory limit. That is exactly how Docker's --memory flag works.

Safety Warning: Be careful when moving your current shell into a resource-limited cgroup. If you set the memory limit too low, your shell itself may be killed.

Step 5: Set CPU limits:

# Limit to 20% of one CPU core
# Format: $MAX $PERIOD (in microseconds)
# 20000 out of 100000 = 20%
$ sudo sh -c 'echo "20000 100000" > /sys/fs/cgroup/demo-group/cpu.max'

# Verify
$ cat /sys/fs/cgroup/demo-group/cpu.max
20000 100000

Step 6: Set a PID limit:

# Maximum 10 processes in this cgroup
$ sudo sh -c 'echo 10 > /sys/fs/cgroup/demo-group/pids.max'

Cleanup:

# Move our shell back to the root cgroup first
$ sudo sh -c "echo $$ > /sys/fs/cgroup/cgroup.procs"

# Remove the cgroup (must be empty)
$ sudo rmdir /sys/fs/cgroup/demo-group

Think About It: When Docker runs a container with --memory=256m --cpus=0.5, it is literally creating a cgroup, writing 268435456 to memory.max, and writing 50000 100000 to cpu.max. The container runtime is just an automation layer over these kernel primitives.
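That translation is simple arithmetic, and you can reproduce it in the shell (the flag values below are the ones from the example; the 100000-microsecond period is the kernel's default):

```shell
# Translate --memory=256m and --cpus=0.5 into the values a runtime
# writes to memory.max and cpu.max
mem_mb=256
period=100000            # default cpu.max period in microseconds
cpus_hundredths=50       # 0.5 CPUs expressed as hundredths
echo "memory.max: $(( mem_mb * 1024 * 1024 ))"
echo "cpu.max:    $(( cpus_hundredths * period / 100 )) $period"
# Prints:
#   memory.max: 268435456
#   cpu.max:    50000 100000
```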


systemd and Cgroups

systemd uses cgroups extensively. Every service, user session, and scope gets its own cgroup. This is how systemd tracks all processes belonging to a service (even after forks) and applies resource limits.

# View the cgroup hierarchy as a tree
$ systemd-cgls
Control group /:
-.slice
├─user.slice
│ └─user-1000.slice
│   ├─session-1.scope
│   │ ├─1234 bash
│   │ └─5678 vim
│   └─user@1000.service
│     └─init.scope
│       └─1111 /lib/systemd/systemd --user
├─init.scope
│ └─1 /sbin/init
└─system.slice
  ├─sshd.service
  │ └─800 sshd: /usr/sbin/sshd -D
  ├─nginx.service
  │ ├─900 nginx: master process
  │ ├─901 nginx: worker process
  │ └─902 nginx: worker process
  └─docker.service
    └─1000 /usr/bin/dockerd

Set resource limits for a service via systemd:

# Edit a service's cgroup limits (opens an editor -- add a drop-in
# with these lines)
$ sudo systemctl edit myapp.service

[Service]
MemoryMax=512M
CPUQuota=50%
TasksMax=100

# View current resource usage for a service
$ systemctl status nginx.service

The "CGroup" line shows which cgroup the service belongs to. You can also use:

# Real-time resource usage by cgroup
$ systemd-cgtop
Control Group                          Tasks   %CPU   Memory
/                                        150    5.2     1.8G
/system.slice                             45    3.1   800.0M
/system.slice/docker.service              12    1.5   400.0M
/user.slice                               30    1.2   500.0M

How Containers Use Namespaces + Cgroups

A container is not a special kernel feature. It is the combination of namespaces (for isolation) and cgroups (for resource control), orchestrated by a container runtime.

What a container runtime does:

1. Create namespaces:
   └─ PID namespace    → container gets its own PID 1
   └─ Network namespace → container gets its own eth0
   └─ Mount namespace   → container sees its own filesystem
   └─ UTS namespace     → container gets its own hostname
   └─ IPC namespace     → container gets isolated IPC
   └─ User namespace    → UID mapping (rootless containers)

2. Create cgroup:
   └─ Set memory.max   → --memory flag
   └─ Set cpu.max      → --cpus flag
   └─ Set pids.max     → --pids-limit flag

3. Set up filesystem:
   └─ Mount container image as root filesystem
   └─ Set up overlay filesystem (layers)
   └─ Mount /proc, /sys, /dev

4. Apply security:
   └─ Drop capabilities
   └─ Apply seccomp filters
   └─ Apply SELinux/AppArmor labels

5. Execute the container's entrypoint process

Hands-On: Build a Mini Container

Let us build a minimal container using only unshare, chroot, and cgroups. No Docker, no Podman -- just raw Linux primitives.

Step 1: Create a minimal root filesystem:

# Create a directory for our container's filesystem
$ mkdir -p ~/minicontainer/rootfs

# Use debootstrap to create a minimal Debian filesystem
# (On Debian/Ubuntu)
$ sudo apt install -y debootstrap
$ sudo debootstrap --variant=minbase bookworm ~/minicontainer/rootfs

Distro Note: On Fedora/RHEL, you can use dnf --installroot instead:

$ sudo dnf --releasever=39 --installroot=$HOME/minicontainer/rootfs \
    install -y bash coreutils procps-ng iproute

On Arch, use pacstrap from the arch-install-scripts package.

Step 2: Enter the container with namespaces:

$ sudo unshare \
    --pid \
    --fork \
    --mount \
    --uts \
    --ipc \
    --mount-proc \
    chroot ~/minicontainer/rootfs /bin/bash

Step 3: Explore the container:

# You are now inside your mini container!
root# hostname minicontainer
root# hostname
minicontainer

root# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   7236  3904 ?        S    10:00   0:00 /bin/bash
root         8  0.0  0.0  10072  3360 ?        R+   10:00   0:00 ps aux

root# ls /
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

root# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

You are looking at PID 1, inside a Debian filesystem, with your own hostname, isolated from the host -- and you did it without any container runtime.

Step 4: Exit the container:

root# exit

Step 5: Add cgroup limits (from the host):

To limit the resources of our mini container, we can combine unshare with cgroups:

# Create a cgroup for our container
$ sudo mkdir /sys/fs/cgroup/minicontainer
$ sudo sh -c 'echo 104857600 > /sys/fs/cgroup/minicontainer/memory.max'  # 100MB
$ sudo sh -c 'echo "50000 100000" > /sys/fs/cgroup/minicontainer/cpu.max'  # 50% CPU
$ sudo sh -c 'echo 50 > /sys/fs/cgroup/minicontainer/pids.max'  # 50 processes

# Launch the container and add it to the cgroup
$ sudo unshare --pid --fork --mount --uts --ipc --mount-proc \
    sh -c "echo \$\$ > /sys/fs/cgroup/minicontainer/cgroup.procs && \
    exec chroot $HOME/minicontainer/rootfs /bin/bash"

Congratulations -- you just built a container by hand. It has:

  • Isolated PIDs (PID namespace)
  • Isolated filesystem (mount namespace + chroot)
  • Isolated hostname (UTS namespace)
  • Isolated IPC (IPC namespace)
  • Memory and CPU limits (cgroups)

This is, at its core, what Docker and Podman do. They just add layers of convenience: image management, networking, storage drivers, and a nice CLI.


Inspecting Namespaces and Cgroups of Running Containers

When troubleshooting containers, you can inspect their namespaces and cgroups directly.

# Find a container's PID on the host
$ docker inspect --format '{{.State.Pid}}' my-container
12345

# View its namespaces
$ sudo ls -la /proc/12345/ns/
lrwxrwxrwx 1 root root 0 Feb 21 10:00 cgroup -> 'cgroup:[4026532456]'
lrwxrwxrwx 1 root root 0 Feb 21 10:00 ipc -> 'ipc:[4026532389]'
lrwxrwxrwx 1 root root 0 Feb 21 10:00 mnt -> 'mnt:[4026532387]'
lrwxrwxrwx 1 root root 0 Feb 21 10:00 net -> 'net:[4026532391]'
lrwxrwxrwx 1 root root 0 Feb 21 10:00 pid -> 'pid:[4026532390]'
lrwxrwxrwx 1 root root 0 Feb 21 10:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Feb 21 10:00 uts -> 'uts:[4026532388]'

# View its cgroup
$ cat /proc/12345/cgroup
0::/system.slice/docker-abc123def456.scope

# Check memory limit
$ cat /sys/fs/cgroup/system.slice/docker-abc123def456.scope/memory.max
268435456

# Check current memory usage
$ cat /sys/fs/cgroup/system.slice/docker-abc123def456.scope/memory.current
52428800

# Enter a container's namespace directly (like docker exec)
$ sudo nsenter --target 12345 --mount --uts --ipc --net --pid -- /bin/bash

The nsenter command relies on the same mechanism as docker exec: the setns() system call, which joins the namespaces of an existing process.


Debug This

A developer reports their container keeps getting OOM-killed even though the host has plenty of free memory.

$ docker logs my-app
... application started ...
Killed

$ dmesg | tail
[12345.678] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),
  task=java,pid=3456,uid=0
[12345.678] Memory cgroup out of memory. Kill process 3456 (java)
  total-vm:2097152kB, anon-rss:262144kB, file-rss:0kB

Diagnosis: The key phrase is CONSTRAINT_MEMCG -- this means the OOM kill was triggered by a cgroup memory limit, not system-wide memory pressure. The container has a memory limit set.

Investigation:

# Find the container's cgroup memory limit
$ docker inspect --format '{{.HostConfig.Memory}}' my-app
268435456

# That is 256MB. The Java app likely needs more.

Fix:

# Restart with more memory
$ docker run --memory=1g my-app

# Or for a Java application, also cap the JVM heap (assuming the
# image reads JAVA_OPTS)
$ docker run --memory=1g -e JAVA_OPTS="-Xmx768m" my-app

The lesson: a cgroup memory limit is a hard wall. The kernel OOM killer will terminate processes that exceed it, even if the host has gigabytes of free RAM.


What Just Happened?

┌─────────────────────────────────────────────────────────────┐
│                    CHAPTER RECAP                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Namespaces provide ISOLATION:                              │
│    PID    → isolated process tree                           │
│    Net    → isolated network stack                          │
│    Mount  → isolated filesystem mounts                      │
│    UTS    → isolated hostname                               │
│    IPC    → isolated inter-process communication            │
│    User   → UID/GID mapping (rootless containers)          │
│                                                             │
│  Cgroups provide RESOURCE CONTROL:                          │
│    memory.max  → RAM limit                                  │
│    cpu.max     → CPU time limit                             │
│    pids.max    → process count limit                        │
│    io.max      → disk I/O limit                             │
│                                                             │
│  Cgroups v2 is the modern unified hierarchy.                │
│  systemd uses cgroups to track and limit services.          │
│                                                             │
│  A container = namespaces + cgroups + filesystem image      │
│                + security policies.                         │
│                                                             │
│  unshare creates namespaces; nsenter joins them.            │
│  The /sys/fs/cgroup filesystem manages cgroups.             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Try This

  1. PID namespace exploration: Use unshare --pid --fork --mount-proc bash to create a PID namespace. Run ps aux inside and outside. Verify that the host cannot see the namespace's PID 1 as PID 1, but can see it under its real host PID.

  2. Network namespace lab: Create two network namespaces and connect them with a veth pair. Assign IP addresses and verify you can ping between them.

  3. Memory cgroup limit: Create a cgroup with a 30MB memory limit. Write a script that allocates memory in a loop. Observe it being killed when it exceeds the limit. Check dmesg for the OOM message.

  4. CPU throttling: Create a cgroup with a 10% CPU limit. Run a CPU-intensive process (like stress --cpu 1) inside it. Use top to verify it stays around 10%.

  5. Bonus Challenge: Extend the mini container exercise. Add a network namespace to your hand-built container. Create a veth pair, assign it an IP, and add NAT on the host so the container can reach the internet. You will have essentially rebuilt what Docker does for networking.