Docker Security Cheat Sheet (owasp.org)
325 points by soheilpro on March 13, 2021 | 49 comments


A big :+1: for running Docker containers with `--read-only`, forcing you to use explicit writable volume/bind mounts for all writable data. It's not just the security benefits; you also avoid entire classes of problems like the following (a minimal invocation sketch follows the list):

* minimizing the difference between `docker restart` (preserves overlayfs changes) and container re-creates (resets overlayfs back to the image state)

* surprise data loss on container redeployments because data was unexpectedly being written to the overlayfs instead of a volume

* unexpectedly running out of disk space in `/var/lib/docker` because data was being written outside of a volume

* performance issues caused by excessive overlayfs writes (storage drivers and /var/lib/docker not necessarily designed for IO performance)
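
A minimal sketch of the invocation (image, volume, and path names are made up): every writable location is named explicitly, and everything else stays immutable, so stray writes fail immediately instead of silently landing in the overlayfs.

  $ docker run --read-only \
      -v app-data:/var/lib/app \
      myorg/myapp:1.2.3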


I was just thinking the other day about how best to manage third party containers that write various logs, preventing them from filling up disk space over time and crashing. This is a perfect solution, assuming the third party container is good about telling you every volume it needs write access to, etc.


If you're running into containers that are writing logs to files, IMHO that's a code smell and something is off with how they're using containers. Most of the container ecosystem has settled on stdout and stderr as the de facto log locations, leaving it up to the container orchestrator (docker-compose, k8s, etc.) to deal with the log data. There might be some good reasons to use files for logs, but in 2021 they're likely exceptions and not the norm.
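
If disk usage from those stdout/stderr logs is the concern, the default json-file driver can at least be capped per container; roughly like this (the image name is a placeholder):

  $ docker run --log-driver json-file \
      --log-opt max-size=10m --log-opt max-file=3 \
      myorg/myapp:1.2.3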


I am a big fan of a read-only rootfs, but for some third-party containers it can be challenging to apply. In most cases it is not a big deal, but sometimes you end up with a bunch of tmpfs mounts for the paths where write access is required.
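
For what it's worth, when a third-party image insists on writing to a handful of paths, something along these lines is usually enough (paths and image are illustrative):

  $ docker run --read-only \
      --tmpfs /run --tmpfs /tmp:rw,size=64m \
      some-vendor/some-image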


I never understood why read-only isn't the default option for volume mounts.


All of these isolation techniques (and many more) can be used inside systemd units. Write your .service as usual, then run this command on it:

  $ systemd-analyze security service-name
It prints a long list of hardening flags that can be applied to your service, like so:

  NAME                                          DESCRIPTION                                          EXPOSURE
  PrivateNetwork=                               Service has access to the host's network                  0.5
  User=/DynamicUser=                            Service runs as root user                                 0.4
  CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP)  Service may change UID/GID identities/capabilities        0.3
  CapabilityBoundingSet=~CAP_SYS_ADMIN          Service has administrator privileges                      0.3
  ...
Here's what I typically use for a .NET 5 application:

  WorkingDirectory        = /opt/appname/app
  ReadWritePaths          = /opt/appname/data
  UMask                   = 0077
  LockPersonality         = yes
  NoNewPrivileges         = yes
  PrivateDevices          = yes
  PrivateMounts           = yes
  PrivateTmp              = yes
  PrivateUsers            = yes
  ProtectClock            = yes
  ProtectControlGroups    = yes
  ProtectHome             = yes
  ProtectHostname         = yes
  ProtectKernelLogs       = yes
  ProtectKernelModules    = yes
  ProtectKernelTunables   = yes
  ProtectSystem           = strict
  RemoveIPC               = yes
  RestrictAddressFamilies = AF_UNIX AF_INET AF_INET6
  RestrictNamespaces      = yes
  RestrictRealtime        = yes
  RestrictSUIDSGID        = yes
  SystemCallArchitectures = native
  ProtectProc             = invisible
  CapabilityBoundingSet   =
  SystemCallFilter        = ~@clock @module @mount @raw-io @reboot @swap @privileged @cpu-emulation @obsolete
ReadWritePaths should be replaced with a combination of DynamicUser + writing local persistent data to $STATE_DIRECTORY, but I'm too lazy to do that yet.
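
For anyone who does want to go that route, a rough sketch of the DynamicUser variant (the state-directory name is made up): systemd creates /var/lib/appname, hands it to the transient user, and exposes the path to the service as $STATE_DIRECTORY.

  DynamicUser    = yes
  StateDirectory = appname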

See systemd.exec(5) for more.


This is excellent advice. Together with systemd-nspawn it's all one needs to run a container in a strong sandbox.


Maybe a bit off-topic for this article, but what does systemd-nspawn add compared to the aforementioned isolation options, and how can you combine the two?


systemd-nspawn does the same things as docker, but with a much smaller attack surface, and it can be sandboxed very effectively by the unit files.

The isolation provided by unit files is orthogonal to running containers.

nspawn doesn't even run a dedicated daemon. Plus, it's no secret that docker was not designed with security in mind and its isolation is bolted on. [1]

Furthermore, systemd is already installed and running on most systems (like it or not).

[1] https://www.cvedetails.com/vendor/13534/Docker.html
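
To give a flavour, a bare-bones invocation might look like this (the rootfs path is just an example; see systemd-nspawn(1)):

  $ systemd-nspawn -D /var/lib/machines/mycontainer --boot --private-network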


What's the systemd equivalent to docker-compose? The real value-add of Docker imo is not the security, it's the easy packaging, distribution, and running of otherwise complicated apps.


Writing N systemd service files with After=, etc., to match your docker-compose.yml that has N services in it.

I have ported a few docker-compose.yml files this way, because I don't understand docker well enough to troubleshoot issues with getting my firewall rules to apply to docker traffic. Dependency hell is not an issue for these projects, so I feel happier with systemd.
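
The dependency part of such a port is mostly boilerplate; a sketch with invented service names:

  # myapp.service
  [Unit]
  Requires=postgresql.service
  After=postgresql.service
  [Service]
  ExecStart=/usr/local/bin/myapp
  Restart=on-failure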


At this point isn't it better to just run it inside a container like lxc?


Well sure, but I'm not using systemd. If I wanted a container runtime written in C, I would use crun. But I'd rather not use that either.


> Well sure, but I'm not using systemd.

I hear you but fighting systemd in 2021 is like pushing water uphill. With a fork.


I'm not fighting anything. Just don't use it. Easy as that. I don't fudge with configs, carefully read manpages like systemd.exec, or put any effort into how I keep services running on my personal devices.


I use Hadolint [1] as a CI job to check whether my Dockerfiles follow the good "rules". But the rule that annoys me the most, and which is also present in this article, is the pinned OS package version rule [2]. While I understand its purpose, I struggle to deal with it in practice.

When a new image build fails because the pinned version is no longer available, I have to dig through the Debian or Ubuntu package sites to find the new one, since they don't keep old packages online.

I know I could tell Hadolint to ignore this rule, but I don't like that, and I think it's important to stick to a certain version of a package to avoid problems. I'm just trying to find any tip that could let me use pinned versions and avoid this manual search every time. Does apt-get install allow wildcards, for example?

1: https://github.com/hadolint/hadolint

2: https://github.com/hadolint/hadolint/wiki/DL3008
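
For context, a Dockerfile install line that satisfies DL3008 looks roughly like this (package and version string are purely illustrative), and it breaks exactly as described above as soon as that version drops off the mirrors:

  RUN apt-get update && \
      apt-get install -y --no-install-recommends curl=7.64.0-4+deb10u2 && \
      rm -rf /var/lib/apt/lists/*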


You could use renovate [1] to watch your repository and open a PR for you when there is a new version. I don't know if it supports watching Debian repositories out of the box, but that's probably doable with some tweaking.

[1] https://github.com/renovatebot/renovate


I had a pretty serious issue because I didn't pin versions [1] and countless small problems.

[1] https://nicolasbouliane.com/blog/nextcloud-docker-upgrade-er...


One of hadolint authors here.

You can do `apt-get install my_package=4.1.*`, for instance, and that will pass the validation while still being close to the ideal of having reproducible builds.


Awesome, thanks for your answer and for this great tool!


snapshot.debian.org?
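
i.e. pin the whole apt source to a fixed point in time instead of pinning individual packages; a sketch (the timestamp is just an example):

  # /etc/apt/sources.list inside the image
  deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20210301T000000Z/ buster main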



The CIS Docker Benchmark is a very extensive rule set for assessing the Docker host, daemon, images, and containers from a security perspective.

It comes with a very handy tool as well: https://github.com/docker/docker-bench-security


Yet another list of potentially extremely useful concepts that I'll never actually have the time to fully verify.

I wonder where all this complexity ends. If a human can't fully grok the systems we work on, then there is no way we can hope to not be misled and taken advantage of. Does anyone else share these concerns?


A large majority of people in tech seem to have convinced themselves that you solve complexity by adding even more complexity on top.

Yes, linux security is complicated. So why not work on improving this instead? Optimise and simplify what we already have.

Docker hasn't made the complexity of linux security go away. It has just added a whole other dimension of potential security issues people now need to manage in addition to the security of the base system.

Programmers need to shift their mindset. We already have far too much complexity. Stop thinking about what new things you can create, start thinking about how you can improve and simplify the software that we already have.


It has gotten simpler, especially for things like testing safely. If you don't think about it as docker, or containers, the ability to sandbox things with namespaces and cgroups is almost magical. It's effectively instant and much more effective than just chroot.

"Run something with no -- or very specific -- network access" was a really annoying problem to solve (LD_PRELOAD?) in the before times.


The abstraction that docker provides you is simpler, but the whole system has gotten more complex.

> Run something with no, or very specific, network access

If this was such a problem in linux then why didn't people focus on improving this instead? We could have solved this problem in linux and made the whole system better for everyone.

Instead, people left the problem there and just piled more crud on top. We added an entire new layer of abstraction that everyone now has to spend weeks learning how to use instead of just fixing the original problem. The whole system is now far more complex, and the original problem is still there.


Docker solves a few other problems too, like conflicting dependencies in user space. Still, I agree with the broader point that you can't solve too much abstraction with more abstraction.


I absolutely share these concerns.

The art of software engineering is about managing complexity. The best way to manage something is to reduce the amount of it you have to worry about. For the rest, paying really close attention, accepting ownership and directly engaging is key. Reducing the number of parties you have to trust is typically a happy side-effect of reducing complexity.

I look at containerization as an attempt to hand-wave away critical responsibilities around ownership of complexity in an application. There are tons of examples that illustrate this throughout the ecosystem, but I think this one is the most apt:

One day you discover your application is a difficult mess to reconstruct from source each time. You have reached a fork in the road because management is complaining that it takes 2 weeks to configure a new QA environment from scratch. Do you:

A) Review fundamental assumptions about the problem domain, technology choices and teamwork. Potentially consider rewriting your application from scratch using fewer tools & computers, and with more focus on driving the actual business value equations.

Or,

B) Decide that Tom's computer is the new golden image of production and bless it as such. Now let's find a way to manage a whole farm of these things!


The actual answer people come up with is neither of those. The process of moving your application to run in containers generally doesn't involve snapshotting Tom's computer - it usually wouldn't even work inside Docker for myriad reasons.

And the Dockerfile format is a good way of capturing what it actually takes to reconstruct your application/environment from source in a manageable, version-control-able text file.

I do agree that the rush to containerization is generally a rush to draw expansive abstraction boundaries, but this particular story isn't what people are doing. (At least not with containers - it's certainly what people were doing with VM images ten years ago!)


As a docker user, most of this aligns with my understanding; it's mostly straightforward stuff and not inherently complex.

Try SELinux and see if you think this is still complicated ;)

ADD vs COPY was new to me, though.

...More generally, I think it's not the systems that are complex but our minds/brains that are limited in dealing with them. Just look at the complexity in nature.


As a sysadmin that worked on securing linux apps before docker was a thing, imo doing it with docker is a lot less complex than before.

No tech is perfect, but docker has dramatically reduced the surface area of tunables that my application devs have to care about in order to get our product shipped. Even if it hasn't brought it down to zero, it's a step in the right direction in terms of the UX complexity exposed to the developer.


As always, it depends?

If you're only using docker for development to bundle things in a different OS base image then there's not that much need for paranoia. You probably would trust the other OS vendor with your host system too. Running docker without security may be perfectly fine in that case.

For random scripts from github, running them in docker is mostly a precaution so they don't screw up your host system or so some malicious dependency's low-effort attempt at exfiltrating your ~/.ssh dir or whatever goes nowhere. In that case it makes sense to understand the basics, e.g. not running docker --privileged willy-nilly just because some README asks you to.

If you're running some PaaS container service where users can execute arbitrary code on shared hardware, then you'd better have a full-time security engineer or two who are paid to understand all the details involved.


I think what you should take away from this is that containers are not the abstraction you want for running mutually distrusting workloads, and you can continue to use separate VMs or even separate physical networks of bare metal servers. Some people will geek out about this quixotic undertaking; you don’t have to be one of them.

Linux is not very good at security boundaries anyway, just run one thing in each VM and don’t leave anything else there to privilege-escalate into.


I feel this way every time I read an OpenSCAP or CIS lockdown policy. They're hundreds of pages of "recommendations" that I think I can trust but don't fully understand. I don't know if I should worry about my lack of understanding or not.


My Grandad was an illiterate farmer. He didn't understand the technology he used either, but it didn't matter as long as it worked or he could call someone to fix it. Somebody who lives this stuff 100% will abstract it and solve the problem for you, just as others did back then.

But do consider the list when some junior says they can get the company billing system running on Docker on prod.


Yes. Stay away from Docker or anything so complex. systemd-nspawn does the same things and can be sandboxed very effectively by the unit files.


Rule #2 needs lots of nuance stacked on it.

Host users and guest users must be explicitly mapped by whoever starts the container, so this issue is not a security threat outside of the container. That said, if the guest OS is running as root and someone compromises it, they have at their disposal all the powers of root in that container.

OWASP has also mixed general container guidance in with "Docker", as well as some Docker-specific things; I'd like to maintain a delineation between them now and into the future, because they are different. I understand most uneducated people are looking for "how do I do X in Docker", but subtly educating the user matters for the broader discourse.
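
If the explicit mapping in question is Docker's user-namespace remapping, enabling it is a one-liner in /etc/docker/daemon.json; root inside the container then maps to an unprivileged UID range on the host:

  {
    "userns-remap": "default"
  }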


“Don’t run as root in the container” definitely gets parroted without the nuance you describe. Gaining root in a container does not generally mean anything has been compromised on the host. Unless you do something odd (like mounting the host docker socket, weird setuid stuff, or run as privileged) it should be fine to run the container processes as root. It’s kind of the main point of running a process in a docker container.

I understand the motivation of marking scripts and binaries as non-writable inside the container as an extra layer of assurance (along with a non-root user that can only execute). But it’s a disservice to developers if you don’t explain why. A lot of people walk away from this thinking they’re protecting the host OS and wind up cargo-cult-creating a container user with full write/execute permissions.


An option not given in the article is to start as root, then step down to an unprivileged user using `gosu`. The official Postgres image does this so that appropriate permissions can be set on the data dir before becoming an unprivileged user for normal operation.
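
A sketch of that pattern in an entrypoint script (the paths and user are modeled on the Postgres example, but treat it as illustrative):

  #!/bin/sh
  set -e
  # still root here: fix ownership of the (possibly bind-mounted) data dir
  chown -R postgres:postgres /var/lib/postgresql/data
  # then drop privileges for the actual workload
  exec gosu postgres "$@"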


Quick tip to find out if any of the tests in your project are actually making outbound calls: run them without networking (--network=none).
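
e.g. something like (image and test command are placeholders):

  $ docker run --rm --network=none myorg/myapp:test npm test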


You can also do this without Docker. I do this as part of my build system (which is just a bash script) for a Linux distribution. Each package has a download phase, which permits network access. After that phase completes, network access is dropped.

https://chiselapp.com/user/rkeene/repository/bash-drop-netwo...


Good article for some basic rules to keep your containers safe. That being said, I violate rule #1 all the time. If you want containers that launch other containers (we do that for Apache Airflow), you don't have much of a choice. Also, cAdvisor requires that mount for monitoring.

Of course, if you're running containers in an ephemeral VM, no biggie.


Towards the end (Rule #11) "Avoid the use of apt/apk upgrade". Why? (assuming base OS image version is pinned)


I'm surprised no one has commented that once someone is able to get to your runtime, it's already too late to do anything.

If your Docker is running in a private network and someone got in, it already means your whole system is compromised.

If the application you exposed inside Docker has a zero-day vulnerability, it really does not matter; your system is already compromised.


Sure. I still try to put in layers of security anyway. Rather than just: fuck it.


Rule #12

Know what you are doing!

Don't just run an unknown public container without checking its Dockerfile first. If you don't understand what's happening, chances are high that you don't know what you're doing.


This seems a little outdated because --link is now a legacy feature: https://docs.docker.com/network/links/

Correct me if I'm wrong, but by default, containers can't communicate with each other even if ICC isn't disabled because the daemon gives them unique "default" networks. Only if you specify the same network for different containers can they communicate...

Edit: this behavior is specific to docker-compose. If you do docker run without specifying a network, it does use the docker0 bridge.
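
If you want the communication to be explicit in compose, attaching services to a named network looks like this (names invented):

  services:
    web:
      image: myorg/web
      networks: [backend]
    db:
      image: postgres:13
      networks: [backend]
  networks:
    backend: {}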


Everyone running docker in their homelab should be reading this.



