7 min read

Protecting Docker with User Namespaces

In today's article, we are going to discuss the benefits of using user namespaces for securing our Docker environments. The official Docker documentation contains a page explaining how to isolate containers with a user namespace in order to prevent potential privilege-escalation attacks, and even though the instructions on that page are fairly detailed, I still feel it's not entirely apparent what user namespaces can protect you from exactly, especially if you haven't used them before.

My goal here is to make this clear through a practical example, and hopefully this will solidify your understanding of user namespaces, how they work, and why it's important to have them properly configured in a Docker environment.

Without further ado, let's go ahead and dive into our example.

Exploring a usual Docker setup

Let's imagine a typical scenario: engineers from different teams at your organization share a virtual private server to run Docker containers in a collaborative environment. Each engineer has a dedicated user account on the server, and for convenience, every account has been added to the docker group, allowing users to issue docker commands without having to prefix them with sudo.

Docker runs with its default configuration, so there's no user namespace isolation enabled, and the dockerd daemon runs as root. No user besides the main system administrator has sudo access on the server, and to the casual observer, the system appears relatively secure in terms of user privileges and access control.

Everything runs smoothly, until one day a malicious actor manages to gain access to a user account on the server. In this case, unless Docker containers are protected with a user namespace, the attacker can easily infiltrate other user accounts on the server by exploiting the default Docker configuration.

Exploiting the default Docker configuration

To understand why exploiting the default Docker configuration is possible, let's recap how Docker launches containers in the first place.

Suppose you issue a command such as:

docker run --rm hello-world

When you do that, you are not really launching the container on behalf of your user ID. You are actually making use of the Docker client to tell the Docker daemon running in the background to launch a new container for you. We won't delve further into the mechanics of how this happens precisely, but the main point is that the docker command doesn't launch the container. It only expresses the intention, and then it's dockerd that organizes the actual launch:

Because the dockerd daemon runs as root by default, any containers that it launches can also execute commands as root (technically, dockerd relies on another daemon called containerd to setup the container, but it also runs as root, so you get the point).

You can verify this by running:

ps -f -p $(pgrep 'dockerd|containerd')

Both daemons run as root:

root        1945       1  0 10:01 ?        Ssl    0:03 /usr/bin/containerd
root        2704       1  0 10:01 ?        Ssl    0:17 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

At the same time, adding users to the docker group allows them to issue docker commands without having root privileges because the Docker client and the Docker daemon communicate through a Unix socket, which is configured to provide read/write permissions to all users in the docker group.

You can verify this by running:

ls -l /var/run/docker.sock

The output confirms that all members of the docker group can read and write to the socket:

srw-rw---- 1 root docker 0 Dec 14 11:19 /var/run/docker.sock

Essentially, this allows any user in the docker group to exploit the system and escape the confinement of a container, gaining full root access to the host.

It's as easy as running:

docker run -v /:/host --rm -it alpine chroot /host sh

The command starts a new Docker container from the Alpine image and drops the attacker into a root shell, allowing them to perform any changes they want to the underlying system.

For example, there's nothing stopping the attacker from modifying the user account files (e.g. /etc/passwd, /etc/shadow, and so on) in /etc. We can prove this by creating a new file inside /etc:

touch /etc/test.txt

Instead of seeing a permission error, the new file is created successfully, despite the fact that the folder is owned by root on the host and not by the user who started the container (who, as you remember, didn't have any root permissions on the host).

Performing an ls both inside and outside of the container further confirms this:

ls -l /etc/test.txt

In both cases, root appears as the file owner:

-rw-r--r-- 1 root root 0 Dec 14 14:46 /etc/test.txt

Since we issued chroot /host sh, which effectively started a new shell session, you can verify that the shell open inside the Alpine container runs as root by typing:

ps -f -p $(pidof sh)

The output states:

root      242796  242774  0 14:48 pts/0    Ss+    0:00 sh

So even though the user who executed the original docker command was neither root nor a member of the sudo group on the server, they were still able to gain full root access to the underlying system and create a file in a folder that they weren't supposed to have access to.

Protecting Docker with User Namespaces

The previously described scenario would not have been possible if user namespaces were utilized. With user namespaces, you can instruct Docker to map the root user inside the container to a non-privileged user on the host system. So while the processes still appear to be running as root inside the container, they actually use a completely different user ID on the host.

Anyone with root access to the server can enable this feature by making a small adjustment to the Docker daemon configuration file. On Linux, this file is usually located at /etc/docker/daemon.json (if the file doesn't exist, it's safe to create it and Docker will pick it up automatically on the next restart).

Add the following instruction to daemon.json:

  "userns-remap": "default"

Then restart the Docker daemon:

systemctl restart docker

Several things happen as a result. First, Docker creates a new user and group on the system, both named dockremap.

To verify this, run:

id dockremap
uid=110(dockremap) gid=117(dockremap) groups=117(dockremap)

Additionally, Docker adds an entry for dockremap in both the /etc/subuid and /etc/subgid files on the host system:

grep dockremap /etc/subuid
grep dockremap /etc/subgid

On Linux, the files /etc/subuid and /etc/subgid are used for configuring subordinate user and group ID ranges for user namespaces. The expression dockremap:165536:65536 basically means that the user account dockremap is allocated a subordinate user ID range of 65536 total IDs, starting at 165536 and ending at 231071 (165536+65536-1).

This means that Docker can now map the root user ID inside the container to the very first ID from the dockremap subordinate range (i.e., 165536).

To see this in action, run the same command:

docker run -v /:/host --rm -it alpine chroot /host sh

Then try touching the test.txt file once again:

touch /etc/test.txt

This time, the output says:

touch: cannot touch '/etc/test.txt': Permission denied

That's because user namespace isolation is enabled.

Run the following command:

ls -l / | grep etc

You will notice that the /etc folder appears to be owned by nobody:nogroup inside the container:

drwxr-xr-x 102 nobody nogroup  4096 Dec 14 12:39 etc

On the host, however, the owner is root:root:

drwxr-xr-x 102 root root  4096 Dec 14 12:39 etc

This is because any user or group ID not explicitly mapped to the user namespace is reported as nobody (respectively, nogroup) inside that namespace.

You can see the user namespace mapping by running the following command inside the container:

cat /proc/self/uid_map

The output indicates that user ID 0 (i.e., root) in the current user namespace maps to user ID 165536 on the host. This mapping continues for all subsequent 65536 user IDs (so 1 in the container maps to 165537 on the host, 2 in the container maps to 165538 on the host, and so on):

         0     165536      65536

If the container was running without user namespace mapping, cat /proc/self/uid_map would have reported this instead:

         0          0 4294967295

Let's try something else and create a file inside the /tmp folder from within the container (remember that on Linux the /tmp folder is world-writable):

touch /tmp/test.txt

Now run an ls inside the container:

ls -l /tmp/test.txt 

From the output, it appears that the file is owned by root:

-rw-r--r-- 1 root root 0 Dec 14 13:30 /tmp/test.txt

Run the same command on the host itself:

ls -l /tmp/test.txt 

This time, you will see that the file actually belongs to user 165536 on the host:

-rw-r--r-- 1 165536 165536 0 Dec 14 13:30 /tmp/test.txt

In other words, the root user inside the container (ID 0) actually corresponds to user ID 165536 on the host.

With the container still running, try executing the following command on the host:

ps -f -p $(pidof sh)

You will see that the sh process running inside the container runs as user 165536 on the host, which further confirms that the user namespace mapping works:

UID          PID    PPID  C STIME TTY          TIME CMD
165536      4064    4046  0 12:43 pts/0    00:00:00 sh

With that, we can conclude our exploration of user namespaces.

Final thoughts

User namespaces provide an effective mechanism to manage and secure user privileges within running containers. By mapping user IDs from the container to different user IDs on the host, we can ensure that processes inside the container cannot access resources or perform actions outside their defined boundaries.

This adds an extra layer of security and allows for better management of user permissions in containerized environments, making user namespace isolation a valuable mechanism for protecting system integrity and ensuring that containerized applications are securely isolated from the host.

I hope this article has provided a clear understanding of the importance of user namespace isolation in containerized environments. Thanks for reading, and until next time!