
[rootless] question: plan for supporting cgroups? #1429

Closed
AkihiroSuda opened this issue Sep 8, 2018 · 50 comments
Labels: locked - please file new issue/PR, rootless

Comments

@AkihiroSuda
Collaborator

Rootless mode could support cgroups when pam_cgfs.so is available (opencontainers/runc#1839, cc @cyphar), but it is not available on Fedora (AFAIK).

Is there a plan for supporting pam_cgfs.so or any equivalent of it?

(This question is not specific to podman, and I'm not sure this repo is the right place to ask this question :p)

@rhatdan
Member

rhatdan commented Sep 9, 2018

Is there a way to handle this via communications with systemd? It would seem that systemd probably provides a mechanism for user apps to manipulate the cgroups available to the user?

@brauner

brauner commented Sep 9, 2018

systemd won't do unprivileged cgroup delegation on v1 hierarchies since there's no way to do it safely, so it needs to be up to the administrator to switch this on. On v1 hierarchies there is no way to get this going without pam_cgfs.so; that's why we at LXC wrote it in the first place. The trick is to limit delegation to the cgroups you really, really care about for your runtime, which is why the pam module takes arguments that the administrator needs to set explicitly.
Now, the story is different for cgroup v2. You can talk to systemd via D-Bus, if you feel like linking against a bunch of XML. Or you request the Delegate= property in a service file, or - if you have a daemon - the daemon requests the delegation, creates two parallel cgroups on the same level of the hierarchy, and moves itself into one and the container into the other. In any case this requires that the runtime never escapes to the root cgroup.
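To make the service-file route concrete, a delegated unit might look like this (a sketch only; the unit and binary names are invented, and which controllers to delegate is up to the administrator):

```ini
# mycontainers.service (hypothetical unit name)
[Service]
# Ask systemd to delegate a cgroup subtree to this service.
# On cgroup v2 this hands the service its own writable subtree.
Delegate=yes
# Optionally restrict which controllers are delegated, e.g.:
# Delegate=memory pids
ExecStart=/usr/local/bin/my-container-runtime
```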

@cyphar

cyphar commented Sep 10, 2018

@brauner Does systemd remount /sys/fs/cgroup with nsdelegate now? Or is this something that systemd has yet to come up with a proper setup for -- since it would technically allow cgroupv2 delegation without systemd integration by just using a cgroup namespace.

@rhatdan
Member

rhatdan commented Sep 10, 2018

From the devel list on Fedora, this is @poettering's response.

I am not sure what pam_cgfs.so precisely does, but do note that on
systemd systems (which includes Fedora) systemd is the owner of the
cgroup tree, and the only means by which other components may manage
their own subtrees is through cgroup delegation (which includes
delegation to less privileged users), which you can request from
systemd.

You can request cgroup delegation from the system service manager, for which
you need to be privileged (but you can request it on behalf of an
unprivileged user).

You can also request cgroup delegation from your private user service
manager instance, for which you do not need to be privileged.

The APIs for requesting cgroup delegation from the system service
manager or your user service manager is the same, the only difference
is whether you do so through the system or user dbus bus.

Note that on cgroupsv1 delegation of *controllers* (i.e. "cpu",
"cpuset", "memory", "blkio", …) to unprivileged processes is not safe
(this is a kernel limitation) and systemd won't do it hence. On
cgroupsv2 it is safe however, and hence you will get "memory" and
"tasks" delegated by default (though not "cpu" by default as the
runtime impact of that is still too high).

Do note however, that Docker is blocking us from switching Fedora over
to cgroupsv2 though, as there is still no working support for
cgroupsv2 in Docker, nor support for requested cgroup tree delegation
from systemd. It's a shame that Docker is hindering us from making the
switch, but it is how it is. Docker currently bypasses systemd
entirely when it comes to cgroupsv2 and considers itself to be owner
of the cgroup tree which is a mess on cgroupsv1 (though you have a
chance of getting away with it) and doesn't work at all on cgroupsv2.

Or in other words: if you are looking for a way to get your own
per-user delegated cgroup subtree, simply ask systemd for it by
setting Delegate=yes in your service unit file, or by asking for
a scope unit to be registered, also with Delegate=yes set. Nothing
else is supported.

Lennart

@rhatdan
Member

rhatdan commented Sep 10, 2018

Of course if Podman is successful we could add better support for V2 CGroups to allow distributions to have the option.

@cyphar

cyphar commented Sep 10, 2018

I thought you guys used the systemd cgroup driver on Fedora/RHEL -- that does exactly what Lennart is referring to as "hindering [him]". In addition, it's untrue that Docker (or runc) is entirely responsible for blocking cgroupv2 adoption -- the lack of a freezer cgroup (and of a usable devices cgroup that doesn't depend entirely on eBPF) is blocking a wholesale switch to cgroupv2.

Most of the points Lennart made are just longer versions of what @brauner said earlier as well (though his disdain of pam_cgfs.so even though he doesn't know what it does ignores that @brauner explicitly said that you need to be careful when using it precisely for that reason -- many cgroupv1 controllers are safe for unprivileged use).

But I guess that answers my question on whether systemd intends to support nsdelegate -- it appears the answer is "no" given that he is discussing asking systemd for permission to delegate cgroupv2 cgroups even though the kernel has an explicit feature to allow this (nsdelegate and CLONE_NEWCGROUP).

@jerboaa

jerboaa commented Sep 24, 2018

How do people feel about mentioning this rootless issue in the man page, or providing some warning when --memory is known to NOT work? Currently I see this in podman-run's man page:

   -m, --memory=""

   Memory limit (format: <number>[<unit>], where unit = b, k, m or g)

   Allows you to constrain the memory available to a container. If the host supports swap memory, then the -m memory setting can be larger than physical RAM. If a limit of 0 is specified (not using
   -m), the container's memory is not limited. The actual limit may be rounded up to a multiple of the operating system's page size (the value would be very large, that's millions of trillions).

It doesn't say that one needs to run podman as root in order for the limit to be effective, nor does the command itself mention any limitation :-( Something like this would have been helpful:

 $ whoami
 someuser
 $ podman run [...] --memory=10m fedora:28 /bin/bash
 Warning: --memory as unprivileged user has no effect.
 bash-4.4#

@brauner

brauner commented Sep 24, 2018

Why not simply fail if a user requests memory constraints, or any other cgroup constraints, that runC can't fulfill? A simple printed warning can easily be overlooked.
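As a sketch of that fail-hard behaviour (illustrative Python, not podman's actual code; all names here are invented):

```python
from typing import Optional


class UnsupportedCgroupConfig(Exception):
    """Raised when a requested cgroup limit cannot be applied."""


def validate_limits(rootless: bool, cgroupv2_delegated: bool,
                    memory_limit: Optional[int]) -> None:
    # In rootless mode without a delegated cgroup v2 subtree, resource
    # limits such as --memory cannot be enforced; refuse to start the
    # container rather than silently ignoring the flag.
    if memory_limit is not None and rootless and not cgroupv2_delegated:
        raise UnsupportedCgroupConfig(
            "--memory is not supported in rootless mode "
            "without cgroup v2 delegation")
```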

@mheon
Member

mheon commented Sep 24, 2018

I agree that failing is probably the appropriate course of action.

On the manpages - they really need a thorough overhaul to show what can and cannot be done with rootless.

@cyphar

cyphar commented Sep 25, 2018

@brauner runc should already do that (we have tests to make sure and everything 😉). I have a feeling that podman is removing cgroup entries or doing something fishy?

@rhatdan
Member

rhatdan commented Sep 25, 2018

@giuseppe PTAL

@rhatdan
Member

rhatdan commented Sep 25, 2018

@brauner Please open a PR for the updated man page. --memory should definitely error out.

@giuseppe
Member

@rhatdan I've opened a PR here: #1547

@rhatdan
Member

rhatdan commented Dec 22, 2018

@AkihiroSuda @giuseppe Where are we on this issue?

@AkihiroSuda
Collaborator Author

@rhatdan It looks like Fedora is not going to adopt pam_cgfs.so (or any equivalent of it for cgroups v1), so we probably need to implement cgroups v2 support in runc instead: opencontainers/runc#654 (which needs consideration for the devices and freezer controllers)

@rhatdan
Member

rhatdan commented Dec 23, 2018

Well I would love to go to CGroups v2, but we always seem to be stuck in V1 world.

@cyphar

cyphar commented Dec 23, 2018

Not to mention that switching to cgroupv2 is effectively a userspace regression for containers, because now programs that understand cgroupv1 cannot work with cgroupv2 (and if you switch partially to cgroupv2, then programs in containers cannot use cgroupv1 and thus would be broken by the switch).

File capabilities v3 was handled by silently doing conversions (in quite a few ways) specifically to avoid this problem. But cgroupv2 has no such work, and as a result there will probably be a split like this for a very long time...

@poettering

Not to mention that switching to cgroupv2 is effectively a userspace regression for containers, because now programs that understand cgroupv1 cannot work with cgroupv2 (and if you switch partially to cgroupv2, then programs in containers cannot use cgroupv1 and thus would be broken by the switch).

Note that it's possible to set up a cgroupsv1 compatible environment for container payloads on a cgroupsv2 host and vice versa (if the kernel supports both APIs). systemd-nspawn supports that for example. It's a bit restricted though, since it means you can't reasonably delegate any controllers to the container, but quite frankly controller delegation on cgroupsv1 is unsafe anyway, and hence not desirable, regardless if the host runs cgroupsv2 or cgroupsv1.

Or in other words: whether the host runs a cgroupsv2 or cgroups1 setup does not necessarily have to have effect on what the container payloads see.

@cyphar

cyphar commented Dec 27, 2018

Note that it's possible to set up a cgroupsv1 compatible environment for container payloads on a cgroupsv2 host and vice versa (if the kernel supports both APIs).

How does this work? Last I checked, you have to enable an entire controller on either cgroupv1 or cgroupv2 and you can't use them in parallel. So if the host is using cgroupv2 controllers, then the container cannot use the cgroupv1 equivalent of the same controller simultaneously. This is what I was referring to.

@poettering

It doesn't manipulate cgroups, but it does read memory.limit_in_bytes and cpuset.cpus (et al) so that it can figure out what limits are actually in place (since /proc/meminfo and /proc/cpuinfo aren't cgroup-aware).

Given that cpuset.cpus on cgroupsv1 is actually the same thing as the normal process affinity mask (yes, they propagate towards each other, it's fucked), there's really no benefit in using cpuset.cpus. If they have a fallback to the affinity mask, that's totally sufficient...

I am not sure I grok why java wants to read that and what for. I mean, does it assume it's the only thing running inside a cgroup? What good is a memory measurement for yourself when you don't know if it's actually you or something else too that is accounted into it? Sounds all very fishy to me...

Either way, this sounds like no big issue to me. A patch fixing this should be pretty straightforward, and it doesn't actually "break" stuff I guess anyway, except some stats...

I'm aware of Delegate=, in fact runc and Docker depend on it quite heavily when users use the "systemd cgroup driver" rather than the default/native one. I do really wish systemd supported nsdelegate (which would allow for cgroup namespaces to actually be used as delegation boundaries under systemd without needing to modify systemd-specific files or have systemd-specific code).

As I wrote countless times elsewhere and here: if you follow those guidelines then your program doesn't need anything systemd specific really: the whole delegation docs just say: you asked for delegation, you got it, now stay within the subtree you got, and you are fine.

Also systemd insists on nsdelegate when it's available. It's not even an option to opt out of nsdelegate, and hasn't been for quite a while now.

@jerboaa

jerboaa commented Jan 21, 2019

It doesn't manipulate cgroups, but it does read memory.limit_in_bytes and cpuset.cpus (et al) so that it can figure out what limits are actually in place (since /proc/meminfo and /proc/cpuinfo aren't cgroup-aware).

Given that cpuset.cpus on cgroupsv1 is actually the same thing as the normal process affinity mask (yes, they propagate towards each other, it's fucked), there's really no benefit in using cpuset.cpus. If they have a fallback to the affinity mask, that's totally sufficient...

Speaking for OpenJDK, yes, it uses sched_getaffinity:
http://hg.openjdk.java.net/jdk/jdk/file/571f12d51db5/src/hotspot/os/linux/osContainer_linux.cpp#l526

Having said that, cpuset.cpus isn't very common to be used in cloud frameworks. E.g. kubernetes uses cpu shares and cpu quotas.

OpenJDK takes cpu shares and cpu quotas into account. In doing so it makes some assumptions about the higher level cloud frameworks, like kubernetes, and how they set up and run containers. Example:
http://hg.openjdk.java.net/jdk/jdk/file/571f12d51db5/src/hotspot/os/linux/osContainer_linux.cpp#l35

I am not sure I grok why java wants to read that and what for.

OpenJDK hotspot has its own memory management. If run in a container with memory limits, it needs to know so as to not run afoul of the OOM-killer. It would otherwise size its heap too big and eventually an OOM-kill would happen.

As for the CPU limits, it does that so it can make a guesstimate of the available CPUs. It's never going to be accurate, but since the JVM sizes its thread pools (JIT threads, GC threads, etc.) based on the CPUs it thinks are available, it works better if it takes cgroup limits into account.
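A rough sketch of that kind of guesstimate (illustrative only; this mirrors the general approach of deriving a CPU count from quota and shares, not OpenJDK's actual code, and the shares-to-CPU mapping is an assumed convention):

```python
import math


def available_cpus(quota_us, period_us, shares, host_cpus):
    """Estimate usable CPUs from cgroup v1 cpu settings.

    quota_us/period_us come from cpu.cfs_quota_us / cpu.cfs_period_us
    (-1 means no quota set); shares comes from cpu.shares, whose unset
    default is 1024.
    """
    limits = [host_cpus]
    if quota_us > 0 and period_us > 0:
        # A quota of half a period means roughly half a CPU; round up.
        limits.append(math.ceil(quota_us / period_us))
    if shares > 0 and shares != 1024:
        # Common convention: 1024 shares correspond to about one CPU.
        limits.append(max(1, round(shares / 1024)))
    return min(limits)
```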

I mean, does it assume it's the only thing running inside a cgroup?

It doesn't. However, that's actually a fairly common thing in cloud containers. Anyhow, it's better off considering the container limits than the actual host values.

What good is a memory measurement for yourself when you don't know if it's actually you or something else too that is accounted into it? Sounds all very fishy to me...

Agreed. There is no perfect answer for this. But considering there is a container limit it can be assumed that the user wanted the entire container (cgroup) to not go beyond that limit. Be it one process or more.
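That heuristic could be sketched like this (illustrative Python, not OpenJDK's code; treating any limit at or above host RAM as "unlimited" is an assumption about how the v1 no-limit sentinel is best handled):

```python
def effective_memory_limit(cgroup_limit_bytes, host_mem_bytes):
    """Pick the memory budget a runtime should size itself against.

    When no limit is set, cgroup v1's memory.limit_in_bytes reports a
    huge sentinel close to 2**63; treat that (or anything at or above
    host RAM) as unlimited and fall back to the host value.
    """
    if cgroup_limit_bytes <= 0 or cgroup_limit_bytes >= host_mem_bytes:
        return host_mem_bytes
    return cgroup_limit_bytes
```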

Either way, this sounds like no big issue to me. A patch fixing this should be pretty straightforward, and it doesn't actually "break" stuff I guess anyway, except some stats...

Not sure if it's related, but we've discovered that with Kernel 4.18 and above the container detection breaks with systemd slices. Last working Kernel was 4.15. See:
https://bugs.openjdk.java.net/browse/JDK-8217338
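For context, that kind of detection boils down to parsing /proc/self/cgroup, where v1 and the unified (v2) hierarchy use different line formats (a hypothetical helper, not the JDK's code):

```python
def parse_proc_self_cgroup(text):
    """Parse /proc/self/cgroup contents into {controller: path}.

    v1 lines look like "4:memory:/user.slice"; unified (v2) lines look
    like "0::/user.slice/...". The v2 entry, which names no controller,
    is stored under the key "".
    """
    out = {}
    for line in text.strip().splitlines():
        hier_id, controllers, path = line.split(":", 2)
        # A hierarchy line may name several comma-joined controllers.
        for ctrl in (controllers.split(",") if controllers else [""]):
            out[ctrl] = path
    return out
```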

@AkihiroSuda
Collaborator Author

https://fedoraproject.org/wiki/Changes/CGroupsV2

Is Red Hat working on cgroup2 support for runc?

@rhatdan @giuseppe @vbatts

@giuseppe
Member

@AkihiroSuda:

Filipe (@filbranden) is working on it: containers/conmon#8

@rhatdan
Member

rhatdan commented Feb 18, 2019

We are trying to support his efforts and make changes in Podman and Conmon to further his testing along.
We are also working with the systemd team to make sure that they work with @filbranden.

Bottom line this is a high priority for us, and anything we can do to help this along, we shall help.
@AkihiroSuda If you can also help that would be great.

@filbranden

@AkihiroSuda Just cc'd you on opencontainers/runc#1991 where I'm starting to fix libcontainer's systemd cgroup driver to actually always go through systemd (using the D-Bus interface) for all the writes.

That first PR is trying to establish an interface for the subsystems to translate their OCI configuration into systemd D-Bus properties, and it implements it for the "devices" controller (as a proof of concept.) Once the interface is approved/merged, we can convert the other cgroups (memory, cpu, etc.) and get all going through systemd.

Once that's in, I already have some code to gather the stats from the cgroupv2 tree (it's a fairly simple patch.)

So... progress! Watch that PR and pitch in if you like!

Cheers,
Filipe

@AkihiroSuda
Collaborator Author

Thanks. Just to confirm: is non-systemd cgroup2 also going to be supported?

@filbranden

No, only through the systemd cgroupdriver.

That's the thing, doing it through systemd gets it for free: we only go through D-Bus and systemd abstracts all that from us. The only remaining piece is getting statistics directly from the tree (memory.stat, cpu.stat, etc.): we need to read them from the proper place, but that's a small detail, a tiny commit; I already have a draft for it.

Frankly, I don't see cgroupv2 on cgroupfs cgroupdriver ever happening, since some controllers (such as "devices") were discontinued on cgroupv2, so systemd is actually installing an eBPF rule to implement device restrictions there. I really don't see libcontainer duplicating that effort... (But I might be wrong about it.)

In any case, I'd say 99% of systems I care about are running on systemd anyways, so going through it makes sense to me.

@AkihiroSuda
Collaborator Author

So it doesn't work with nested containers and Alpine hosts?

@filbranden

I, at least, am only looking at fixing the systemd path to support cgroupv2. So I think it won't work nested, unless you're running systemd inside your container (e.g. like KIND does.) I believe systemd can run in a fairly unprivileged container (at least in the nspawn world...) but I haven't looked a lot into this, so you'd need to double-check that...

@vbatts
Collaborator

vbatts commented Feb 20, 2019 via email

@poettering

poettering commented Feb 20, 2019 via email

@rhatdan
Member

rhatdan commented Feb 20, 2019

I would like to see this get in, but in the rootless case, where we want to modify the cgroups of a container, will this work? Will runc be able to talk to systemd to set up a cgroup for the container?

@poettering

I would like to see this get in, but in the rootless case, where we want to modify the cgroups of a container, will this work? Will runc be able to talk to systemd to set up a cgroup for the container?

I am not sure how unpriv runc precisely works. But note that PID 1 (i.e. the system instance of systemd) will deny delegation of cgroup subtrees to unprivileged clients that have already dropped privs. However, it's fine to delegate cgroup subtrees to programs that start unpriv and drop privs later, as well as to service payloads that use systemd's User= and thus let systemd drop privs for you.

Also note that each regular user generally has their own systemd --user instance. Unpriv users can request a delegated subtree from their instance too, and this is then permitted. The APIs are exactly the same as for the system instance, except that you ask for delegation on the user bus rather than the system bus.

@rhatdan
Member

rhatdan commented Feb 20, 2019

This sounds like exactly what we need. If a user is allocated X% of a resource, then we want them to further subdivide that X% among their containers.

@giuseppe
Member

I've written this message privately to some of you, but I'll report it here as well:

something I've noticed and that will block its adoption for rootless containers is that D-Bus doesn't work from a user namespace if euid_in_the_namespace != euid_on_the_host.

We create the user namespace to manage the storage and the networking before we call the OCI runtime. The OCI runtime for rootless containers can create a nested userns if there are different mappings used but it already runs within a userns with euid=0.

A simple test:

  • we create a user namespace but keep the same uid we had on the host (this works):

 $ bwrap --unshare-user --uid $(id -u) --bind / / dbus-send --session --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.ListNames

  • we use uid=0 in the user namespace (it doesn't work):

 $ bwrap --unshare-user --uid 0 --bind / / dbus-send --session --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.ListNames
 Failed to open connection to "session" message bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

I think it is due to D-Bus including the euid in the AUTH EXTERNAL request:

https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-bus/bus-socket.c#L620
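For background, the EXTERNAL mechanism sends the client's uid as a hex-encoded ASCII decimal string, which the bus compares against the uid it observes on the socket; when a userns remaps the euid to 0, the two disagree. A toy encoder of that line (illustrative only, not systemd's code):

```python
def auth_external_line(uid):
    """Build the "AUTH EXTERNAL <hex>" line a D-Bus client sends,
    where <hex> is the hex encoding of the uid's decimal string;
    e.g. uid 1000 encodes to "31303030"."""
    return "AUTH EXTERNAL " + str(uid).encode("ascii").hex()
```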

@giuseppe
Member

I think it is due to D-Bus including the euid in the AUTH EXTERNAL request:

https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-bus/bus-socket.c#L620

being addressed by: systemd/systemd#11785

@rhatdan
Member

rhatdan commented Mar 8, 2019

Progress continues to be made on cgroupsv2.

@vbatts
Collaborator

vbatts commented Mar 8, 2019 via email

@rhatdan
Member

rhatdan commented Apr 13, 2019

@filbranden Any update on the cgroupsv2 work?

@filbranden

Hi @rhatdan

I just added an update to opencontainers/runc#2007 with a proposed approach.

I think we still need more work on the underlying components, to ensure everything is in place. In particular, we'll need freezer support in cgroup2 in the kernel (last I looked, it was planned for 5.2, but I'm not sure if it's still on schedule) and systemd needs to export more cgroup2 interfaces to userspace via D-Bus (such as freezer, as mentioned, and also cpuset, which I believe made it into kernel 5.0).

Cheers!
Filipe

@rhatdan
Member

rhatdan commented May 2, 2019

Thanks for keeping us up2date. I am watching the runc PRs and keeping up with it as best I can. @filbranden Keep up the good work. Eventually we will get there.

@rhatdan
Member

rhatdan commented Aug 5, 2019

@giuseppe Since we now have cgroupsv2 support, can we close this PR?

@giuseppe
Member

giuseppe commented Aug 6, 2019

@giuseppe Since we now have cgroupsv2 support, can we close this PR?

Yes, I think we can close the issue here and address any future issues separately.

@giuseppe giuseppe closed this as completed Aug 6, 2019
@github-actions github-actions bot added the "locked - please file new issue/PR" label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
10 participants