
[rootless] question: plan for supporting cgroups? #1429

Closed
AkihiroSuda opened this issue Sep 8, 2018 · 50 comments
Labels: locked - please file new issue/PR, rootless

Comments

@AkihiroSuda
Collaborator

Rootless mode could support cgroups when pam_cgfs.so is available (opencontainers/runc#1839, cc @cyphar), but it is not available on Fedora (AFAIK).

Is there a plan for supporting pam_cgfs.so or any equivalent of it?

(This question is not specific to podman, and I'm not sure this repo is the right place to ask this question :p)

@rhatdan
Member

rhatdan commented Sep 9, 2018

Is there a way to handle this via communications with systemd? It would seem that systemd probably provides a mechanism for user apps to manipulate the cgroups available to the user?

@brauner

brauner commented Sep 9, 2018

systemd won't do unprivileged cgroup delegation on v1 hierarchies since there's no way to do it safely, so it needs to be up to the administrator to switch this on. On v1 hierarchies there is no way to get this going without pam_cgfs.so; that's why we at LXC wrote it in the first place. The trick is to limit delegation to the cgroups you really, really care about for your runtime, which is why the pam module takes arguments that the administrator needs to set explicitly.
Now, the story is different for cgroup v2. You can talk to systemd via D-Bus, if you feel like linking against a bunch of XML. Or you request the Delegate= property in a service file, or - if you have a daemon - the daemon requests the delegation, creates two parallel cgroups on the same level of the hierarchy, and moves itself into one and the container into the other. In any case this requires that the runtime never escapes to the root cgroup.
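To make the service-file route concrete, a delegated unit might look like this (a sketch only; the unit and binary names are invented, and which controllers to delegate is up to the administrator):

```ini
# mycontainers.service (hypothetical unit name)
[Service]
# Ask systemd to delegate a cgroup subtree to this service.
# On cgroup v2 this hands the service its own writable subtree.
Delegate=yes
# Optionally restrict which controllers are delegated, e.g.:
# Delegate=memory pids
ExecStart=/usr/local/bin/my-container-runtime
```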

@cyphar

cyphar commented Sep 10, 2018

@brauner Does systemd remount /sys/fs/cgroup with nsdelegate now? Or is this something that systemd has yet to come up with a proper setup for -- since it would technically allow cgroupv2 delegation without systemd integration by just using a cgroup namespace.

@rhatdan
Member

rhatdan commented Sep 10, 2018

From the devel list on Fedora, this is @poettering's response.

I am not sure what pam_cgfs.so precisely does, but do note that on
systemd systems (which includes Fedora) systemd is the owner of the
cgroup tree, and the only means by which other components may manage
their own subtrees is through cgroup delegation (which includes
delegation to less privileged users), which you can request from
systemd.

You can request cgroup delegation from the system service manager, for which
you need to be privileged (but you can request it on behalf of an
unprivileged user).

You can also request cgroup delegation from your private user service
manager instance, for which you do not need to be privileged.

The APIs for requesting cgroup delegation from the system service
manager or your user service manager is the same, the only difference
is whether you do so through the system or user dbus bus.

Note that on cgroupsv1 delegation of *controllers* (i.e. "cpu",
"cpuset", "memory", "blkio", …) to unprivileged processes is not safe
(this is a kernel limitation) and systemd won't do it hence. On
cgroupsv2 it is safe however, and hence you will get "memory" and
"tasks" delegated by default (though not "cpu" by default as the
runtime impact of that is still too high).

Do note however, that Docker is blocking us from switching Fedora over
to cgroupsv2 though, as there is still no working support for
cgroupsv2 in Docker, nor support for requested cgroup tree delegation
from systemd. It's a shame that Docker is hindering us from making the
switch, but it is how it is. Docker currently bypasses systemd
entirely when it comes to cgroupsv2 and considers itself to be owner
of the cgroup tree which is a mess on cgroupsv1 (though you have a
chance of getting away with it) and doesn't work at all on cgroupsv2.

Or in other words: if you are looking for a way to get your own
per-user delegated cgroup subtree, simply ask systemd for it by
setting Delegate=yes in your service unit file, or by asking for
a scope unit to be registered, also with Delegate=yes set. Nothing
else is supported.

Lennart

@rhatdan
Member

rhatdan commented Sep 10, 2018

Of course if Podman is successful we could add better support for V2 CGroups to allow distributions to have the option.

@cyphar

cyphar commented Sep 10, 2018

I thought you guys used the systemd cgroup driver on Fedora/RHEL -- that does exactly what Lennart is referring to as "hindering [him]". In addition, it's untrue that Docker (or runc) is entirely responsible for blocking cgroupv2 adoption -- the lack of a freezer cgroup (and of a usable devices cgroup that doesn't depend entirely on eBPF) is blocking a wholesale switch to cgroupv2.

Most of the points Lennart made are just longer versions of what @brauner said earlier as well (though his disdain of pam_cgfs.so even though he doesn't know what it does ignores that @brauner explicitly said that you need to be careful when using it precisely for that reason -- many cgroupv1 controllers are safe for unprivileged use).

But I guess that answers my question on whether systemd intends to support nsdelegate -- it appears the answer is "no" given that he is discussing asking systemd for permission to delegate cgroupv2 cgroups even though the kernel has an explicit feature to allow this (nsdelegate and CLONE_NEWCGROUP).

@jerboaa

jerboaa commented Sep 24, 2018

How do people feel about mentioning this rootless issue in the man page, or providing some warning when --memory is known to NOT work? Currently I see this in podman-run's man page:

   -m, --memory=""

   Memory limit (format: <number>[<unit>], where unit = b, k, m or g)

   Allows you to constrain the memory available to a container. If the host supports swap memory, then the -m memory setting can be larger than physical RAM. If a limit of 0 is specified (not using
   -m), the container's memory is not limited. The actual limit may be rounded up to a multiple of the operating system's page size (the value would be very large, that's millions of trillions).

It doesn't say that one needs to run podman as root in order for the limit to be effective, nor does the command itself mention any limitation :-( Something like this would have been helpful:

 $ whoami
 someuser
 $ podman run [...] --memory=10m fedora:28 /bin/bash
 Warning: --memory as unprivileged user has no effect.
 bash-4.4#

@brauner

brauner commented Sep 24, 2018

Why not simply fail if a user requests memory constraints, or any other cgroup constraints, that runC can't fulfill? A simple printed warning can easily be overlooked.
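As a sketch of that fail-hard behaviour (illustrative Python, not podman's actual code; all names here are invented):

```python
from typing import Optional


class UnsupportedCgroupConfig(Exception):
    """Raised when a requested cgroup limit cannot be applied."""


def validate_limits(rootless: bool, cgroupv2_delegated: bool,
                    memory_limit: Optional[int]) -> None:
    # In rootless mode without a delegated cgroup v2 subtree, resource
    # limits such as --memory cannot be enforced; refuse to start the
    # container rather than silently ignoring the flag.
    if memory_limit is not None and rootless and not cgroupv2_delegated:
        raise UnsupportedCgroupConfig(
            "--memory is not supported in rootless mode "
            "without cgroup v2 delegation")
```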

@mheon
Member

mheon commented Sep 24, 2018

I agree that failing is probably the appropriate course of action.

On the manpages - they really need a thorough overhaul to show what can and cannot be done with rootless.

@cyphar

cyphar commented Sep 25, 2018

@brauner runc should already do that (we have tests to make sure and everything 😉). I have a feeling that podman is removing cgroup entries or doing something fishy?

@rhatdan
Member

rhatdan commented Sep 25, 2018

@giuseppe PTAL

@rhatdan
Member

rhatdan commented Sep 25, 2018

@brauner Please open a PR for the updated man page. --memory should definitely error out.

@giuseppe
Member

@rhatdan I've opened a PR here: #1547

@rhatdan
Member

rhatdan commented Dec 22, 2018

@AkihiroSuda @giuseppe Where are we on this issue?

@AkihiroSuda
Collaborator Author

@rhatdan It looks like Fedora is not going to adopt pam_cgfs.so (or any equivalent of it for cgroups v1), so we probably need to implement cgroups v2 support in runc instead: opencontainers/runc#654 (which needs consideration for the devices and freezer controllers)

@rhatdan
Member

rhatdan commented Dec 23, 2018

Well I would love to go to CGroups v2, but we always seem to be stuck in V1 world.

@cyphar

cyphar commented Dec 23, 2018

Not to mention that switching to cgroupv2 is effectively a userspace regression for containers, because now programs that understand cgroupv1 cannot work with cgroupv2 (and if you switch partially to cgroupv2, then programs in containers cannot use cgroupv1 and thus would be broken by the switch).

File capabilities v3 was handled by silently doing conversions (in quite a few ways) specifically to avoid this problem. But cgroupv2 has no such work, and as a result there will probably be a split like this for a very long time...

@poettering

Not to mention that switching to cgroupv2 is effectively a userspace regression for containers, because now programs that understand cgroupv1 cannot work with cgroupv2 (and if you switch partially to cgroupv2, then programs in containers cannot use cgroupv1 and thus would be broken by the switch).

Note that it's possible to set up a cgroupsv1 compatible environment for container payloads on a cgroupsv2 host and vice versa (if the kernel supports both APIs). systemd-nspawn supports that for example. It's a bit restricted though, since it means you can't reasonably delegate any controllers to the container, but quite frankly controller delegation on cgroupsv1 is unsafe anyway, and hence not desirable, regardless if the host runs cgroupsv2 or cgroupsv1.

Or in other words: whether the host runs a cgroupsv2 or cgroups1 setup does not necessarily have to have effect on what the container payloads see.

@cyphar

cyphar commented Dec 27, 2018

Note that it's possible to set up a cgroupsv1 compatible environment for container payloads on a cgroupsv2 host and vice versa (if the kernel supports both APIs).

How does this work? Last I checked, you have to enable an entire controller on either cgroupv1 or cgroupv2 and you can't use them in parallel. So if the host is using cgroupv2 controllers, then the container cannot use the cgroupv1 equivalent of the same controller simultaneously. This is what I was referring to.

@poettering

It doesn't manipulate cgroups, but it does read memory.limit_in_bytes and cpuset.cpus (et al) so that it can figure out what limits are actually in place (since /proc/meminfo and /proc/cpuinfo aren't cgroup-aware).

Given that cpuset.cpus on cgroupsv1 is actually the same thing as the normal process affinity mask (yes, they propagate towards each other, it's fucked), there's really no benefit in using cpuset.cpus. If they have a fallback to the affinity mask, that's totally sufficient...

I am not sure I grok why java wants to read that and what for. I mean, does it assume it's the only thing running inside a cgroup? What good is a memory measurement for yourself when you don't know if it's actually you or something else too that is accounted into it? Sounds all very fishy to me...

Either way, this sounds like no big issue to me. A patch fixing this should be pretty straightforward, and it doesn't actually "break" stuff I guess anyway, except some stats...

I'm aware of Delegate=, in fact runc and Docker depend on it quite heavily when users use the "systemd cgroup driver" rather than the default/native one. I do really wish systemd supported nsdelegate (which would allow for cgroup namespaces to actually be used as delegation boundaries under systemd without needing to modify systemd-specific files or have systemd-specific code).

As I wrote countless times elsewhere and here: if you follow those guidelines then your program doesn't need anything systemd specific really: the whole delegation docs just say: you asked for delegation, you got it, now stay within the subtree you got, and you are fine.

Also systemd insists on nsdelegate when it's available. It's not even an option to opt out of nsdelegate, and hasn't been for quite a while now.

@jerboaa

jerboaa commented Jan 21, 2019

It doesn't manipulate cgroups, but it does read memory.limit_in_bytes and cpuset.cpus (et al) so that it can figure out what limits are actually in place (since /proc/meminfo and /proc/cpuinfo aren't cgroup-aware).

Given that cpuset.cpus on cgroupsv1 is actually the same thing as the normal process affinity mask (yes, they propagate towards each other, it's fucked), there's really no benefit in using cpuset.cpus. If they have a fallback to the affinity mask, that's totally sufficient...

Speaking for OpenJDK, yes, it uses sched_getaffinity:
http://hg.openjdk.java.net/jdk/jdk/file/571f12d51db5/src/hotspot/os/linux/osContainer_linux.cpp#l526

Having said that, cpuset.cpus isn't very common to be used in cloud frameworks. E.g. kubernetes uses cpu shares and cpu quotas.

OpenJDK takes cpu shares and cpu quotas into account. In doing so it makes some assumptions about the higher level cloud frameworks, like kubernetes, and how they set up and run containers. Example:
http://hg.openjdk.java.net/jdk/jdk/file/571f12d51db5/src/hotspot/os/linux/osContainer_linux.cpp#l35

I am not sure I grok why java wants to read that and what for.

OpenJDK hotspot has its own memory management. If run in a container with memory limits, it needs to know so as to not run afoul of the OOM-killer. It would otherwise size its heap too big and eventually an OOM-kill would happen.

As for the CPU limits, it does that so it can make a guesstimate of the available CPUs. It's never going to be accurate, but since the JVM sizes its thread pools (JIT threads, GC threads, etc.) based on the CPUs it thinks are available, it works better if it takes cgroup limits into account.
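A rough sketch of that kind of guesstimate (illustrative only; this mirrors the general approach of deriving a CPU count from quota and shares, not OpenJDK's actual code, and the shares-to-CPU mapping is an assumed convention):

```python
import math


def available_cpus(quota_us, period_us, shares, host_cpus):
    """Estimate usable CPUs from cgroup v1 cpu settings.

    quota_us/period_us come from cpu.cfs_quota_us / cpu.cfs_period_us
    (-1 means no quota set); shares comes from cpu.shares, whose unset
    default is 1024.
    """
    limits = [host_cpus]
    if quota_us > 0 and period_us > 0:
        # A quota of half a period means roughly half a CPU; round up.
        limits.append(math.ceil(quota_us / period_us))
    if shares > 0 and shares != 1024:
        # Common convention: 1024 shares correspond to about one CPU.
        limits.append(max(1, round(shares / 1024)))
    return min(limits)
```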

I mean, does it assume it's the only thing running inside a cgroup?

It doesn't. However, that's actually a fairly common thing in cloud containers. Anyhow, it's better off considering the container limits than the actual host values.

What good is a memory measurement for yourself when you don't know if it's actually you or something else too that is accounted into it? Sounds all very fishy to me...

Agreed. There is no perfect answer for this. But considering there is a container limit it can be assumed that the user wanted the entire container (cgroup) to not go beyond that limit. Be it one process or more.
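That heuristic could be sketched like this (illustrative Python, not OpenJDK's code; treating any limit at or above host RAM as "unlimited" is an assumption about how the v1 no-limit sentinel is best handled):

```python
def effective_memory_limit(cgroup_limit_bytes, host_mem_bytes):
    """Pick the memory budget a runtime should size itself against.

    When no limit is set, cgroup v1's memory.limit_in_bytes reports a
    huge sentinel close to 2**63; treat that (or anything at or above
    host RAM) as unlimited and fall back to the host value.
    """
    if cgroup_limit_bytes <= 0 or cgroup_limit_bytes >= host_mem_bytes:
        return host_mem_bytes
    return cgroup_limit_bytes
```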

Either way, this sounds like no big issue to me. A patch fixing this should be pretty straightforward, and it doesn't actually "break" stuff I guess anyway, except some stats...

Not sure if it's related, but we've discovered that with Kernel 4.18 and above the container detection breaks with systemd slices. Last working Kernel was 4.15. See:
https://bugs.openjdk.java.net/browse/JDK-8217338
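For context, that kind of detection boils down to parsing /proc/self/cgroup, where v1 and the unified (v2) hierarchy use different line formats (a hypothetical helper, not the JDK's code):

```python
def parse_proc_self_cgroup(text):
    """Parse /proc/self/cgroup contents into {controller: path}.

    v1 lines look like "4:memory:/user.slice"; unified (v2) lines look
    like "0::/user.slice/...". The v2 entry, which names no controller,
    is stored under the key "".
    """
    out = {}
    for line in text.strip().splitlines():
        hier_id, controllers, path = line.split(":", 2)
        # A hierarchy line may name several comma-joined controllers.
        for ctrl in (controllers.split(",") if controllers else [""]):
            out[ctrl] = path
    return out
```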

@AkihiroSuda
Collaborator Author

https://fedoraproject.org/wiki/Changes/CGroupsV2

Is Red Hat working on cgroup2 support for runc?

@rhatdan @giuseppe @vbatts

@giuseppe
Member

@AkihiroSuda:

Filipe (@filbranden) is working on it: containers/conmon#8

@rhatdan
Member

rhatdan commented Feb 18, 2019

We are trying to support his efforts and make changes in Podman and Conmon to further his testing along.
We are also working with the systemd team to make sure that they work with @filbranden.

Bottom line this is a high priority for us, and anything we can do to help this along, we shall help.
@AkihiroSuda If you can also help that would be great.

@filbranden

@AkihiroSuda Just cc'd you on opencontainers/runc#1991 where I'm starting to fix libcontainer's systemd cgroup driver to actually always go through systemd (using the D-Bus interface) for all the writes.

That first PR is trying to establish an interface for the subsystems to translate their OCI configuration into systemd D-Bus properties, and it implements it for the "devices" controller (as a proof of concept.) Once the interface is approved/merged, we can convert the other cgroups (memory, cpu, etc.) and get all going through systemd.

Once that's in, I already have some code to gather the stats from the cgroupv2 tree (it's a fairly simple patch.)

So... progress! Watch that PR and pitch in if you like!

Cheers,
Filipe

@AkihiroSuda
Collaborator Author

Thanks. Just to confirm: is non-systemd cgroup2 also going to be supported?

@filbranden

No, only through the systemd cgroupdriver.

That's the thing, doing it through systemd gets it for free: we only go through D-Bus and systemd abstracts all that from us. The only remaining piece is getting statistics directly from the tree (memory.stat, cpu.stat, etc.): we need to read them from the proper place, but that's a small detail, a tiny commit; I already have a draft for it.

Frankly, I don't see cgroupv2 on cgroupfs cgroupdriver ever happening, since some controllers (such as "devices") were discontinued on cgroupv2, so systemd is actually installing an eBPF rule to implement device restrictions there. I really don't see libcontainer duplicating that effort... (But I might be wrong about it.)

In any case, I'd say 99% of systems I care about are running on systemd anyways, so going through it makes sense to me.

@AkihiroSuda
Collaborator Author

So it doesn't work with nested containers and Alpine hosts?

@filbranden

I, at least, am only looking at fixing the systemd path to support cgroupv2. So I think it won't work nested, unless you're running systemd inside your container (e.g. like KIND does.) I believe systemd can run in a fairly unprivileged container (at least in the nspawn world...) but I haven't looked a lot into this, so you'd need to double-check that...

@vbatts
Collaborator

vbatts commented Feb 20, 2019 via email

@poettering

poettering commented Feb 20, 2019 via email

@rhatdan
Member

rhatdan commented Feb 20, 2019

I would like to see this get in, but in the rootless case, where we want to modify the cgroups of a container, will this work? Will runc be able to talk to systemd to set up a cgroup for the container?

@poettering

I would like to see this get in, but in the rootless case, where we want to modify the cgroups of a container, will this work? Will runc be able to talk to systemd to set up a cgroup for the container?

I am not sure how unpriv runc precisely works. But note that PID 1 (i.e. the system instance of systemd) will deny delegation of cgroup subtrees to unprivileged clients that have already dropped privs. However, it's fine to delegate cgroup subtrees to programs that start unpriv and drop privs later, as well as to service payloads that use systemd's User= and thus let systemd drop privs for you.

Also note that each regular user generally has their own systemd --user instance. Unpriv users can request a delegated subtree from their instance too, and this is then permitted. The APIs are exactly the same as for the system instance, except that you ask for delegation on the user bus rather than the system bus.

@rhatdan
Member

rhatdan commented Feb 20, 2019

This sounds like exactly what we need. If a user is allocated X% of a resource, then we want them to further subdivide that X% among their containers.

@giuseppe
Member

I've written this message privately to some of you, but I'll report it here as well:

something I've noticed and that will block its adoption for rootless containers is that D-Bus doesn't work from a user namespace if euid_in_the_namespace != euid_on_the_host.

We create the user namespace to manage the storage and the networking before we call the OCI runtime. The OCI runtime for rootless containers can create a nested userns if there are different mappings used but it already runs within a userns with euid=0.

A simple test:

  • we create a user namespace but keep the same uid we had on the host (this works):

 $ bwrap --unshare-user --uid $(id -u) --bind / / dbus-send --session --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.ListNames

  • we use uid=0 in the user namespace (it doesn't work):

 $ bwrap --unshare-user --uid 0 --bind / / dbus-send --session --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.ListNames
 Failed to open connection to "session" message bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

I think it is due to D-Bus including the euid in the AUTH EXTERNAL request:

https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-bus/bus-socket.c#L620
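For background, the EXTERNAL mechanism sends the client's uid as a hex-encoded ASCII decimal string, which the bus compares against the uid it observes on the socket; when a userns remaps the euid to 0, the two disagree. A toy encoder of that line (illustrative only, not systemd's code):

```python
def auth_external_line(uid):
    """Build the "AUTH EXTERNAL <hex>" line a D-Bus client sends,
    where <hex> is the hex encoding of the uid's decimal string;
    e.g. uid 1000 encodes to "31303030"."""
    return "AUTH EXTERNAL " + str(uid).encode("ascii").hex()
```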

@giuseppe
Member

I think it is due to D-Bus including the euid in the AUTH EXTERNAL request:

https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-bus/bus-socket.c#L620

being addressed by: systemd/systemd#11785

@rhatdan
Member

rhatdan commented Mar 8, 2019

Progress continues to be made on cgroupsv2.

@vbatts
Collaborator

vbatts commented Mar 8, 2019 via email

@rhatdan
Member

rhatdan commented Apr 13, 2019

@filbranden Any update on the cgroupsv2 work?

@filbranden

Hi @rhatdan

I just added an update to opencontainers/runc#2007 with a proposed approach.

I think we still need more work on the underlying components, to ensure everything is in place. In particular, we'll need freezer support in cgroup2 in the kernel (last I looked, it was planned for 5.2, but I'm not sure if it's still on schedule) and systemd needs to export more cgroup2 interfaces to userspace via D-Bus (such as freezer, as mentioned, and also cpuset, which I believe made it into kernel 5.0).

Cheers!
Filipe

@rhatdan
Member

rhatdan commented May 2, 2019

Thanks for keeping us up2date. I am watching the runc PRs and keeping up with it as best I can. @filbranden Keep up the good work. Eventually we will get there.

@rhatdan
Member

rhatdan commented Aug 5, 2019

@giuseppe Since we now have cgroupsv2 support, can we close this PR?

@giuseppe
Member

giuseppe commented Aug 6, 2019

@giuseppe Since we now have cgroupsv2 support, can we close this PR?

Yes, I think we can close the issue here and address any future issues separately.

@giuseppe giuseppe closed this as completed Aug 6, 2019
@github-actions github-actions bot added the "locked - please file new issue/PR" label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
10 participants