[rootless] question: plan for supporting cgroups? #1429
Comments
Is there a way to handle this by communicating with systemd? It would seem that systemd probably provides a mechanism for user apps to manipulate the cgroups available to the user? |
systemd won't do unprivileged cgroup delegation on v1 hierarchies since there's no way to do it safely, so it needs to be up to the administrator to switch this on. So on v1 hierarchies there is no way to get this going without |
@brauner Does systemd remount |
From the devel list on Fedora, this is @poettering's response.
|
Of course, if Podman is successful we could add better support for cgroups v2 to give distributions the option. |
I thought you guys used the systemd cgroup driver on Fedora/RHEL -- that does exactly what Lennart is referring to as "hindering [him]".

In addition, it's untrue that Docker (or runc) are entirely responsible for blocking cgroupv2 adoption -- a lack of the

Most of the points Lennart made are just longer versions of what @brauner said earlier as well (though his disdain of

But I guess that answers my question on whether systemd intends to support |
How do people feel about mentioning this rootless issue in the man page, or providing some warning when --memory is known to NOT work? Currently I see this in

It doesn't say that one needs to run podman as root in order for it to be effective. Nor does the command itself mention any limitation :-( Something like this would have been helpful:
|
Why not simply fail if a user requests memory constraints, or any other cgroup constraints, that runc can't fulfill? Printing a simple warning can easily be overlooked. |
I agree that failing is probably the appropriate course of action. On the manpages - they really need a thorough overhaul to show what can and cannot be done with rootless. |
@brauner |
@giuseppe PTAL |
@brauner Please open a PR for the updated man page. --memory should definitely error out. |
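For illustration, here is a minimal Go sketch (not Podman's actual code; the function name and error text are made up) of the hard-fail behaviour being proposed here, as opposed to a warning that is easily overlooked:

```go
// Minimal sketch: reject resource limits outright when running rootless
// without cgroup delegation, instead of printing an easily missed warning.
// Names, exit code, and the error text are illustrative only.
package main

import (
	"errors"
	"fmt"
	"os"
)

// validateRootlessResources mirrors the behaviour proposed above:
// hard-fail instead of silently ignoring --memory.
func validateRootlessResources(rootless bool, memoryLimit int64) error {
	if rootless && memoryLimit > 0 {
		return errors.New("setting --memory is not supported in rootless mode without cgroup delegation")
	}
	return nil
}

func main() {
	rootless := os.Geteuid() != 0
	if err := validateRootlessResources(rootless, 512<<20); err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(125)
	}
}
```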
containers#1429 (comment) Signed-off-by: Giuseppe Scrivano <[email protected]>
@AkihiroSuda @giuseppe Where are we on this issue? |
@rhatdan It looks like Fedora is not going to adopt |
Well I would love to go to CGroups v2, but we always seem to be stuck in V1 world. |
Not to mention that switching to cgroupv2 is effectively a userspace regression for containers, because now programs that understand cgroupv1 cannot work with cgroupv2 (and if you switch partially to cgroupv2 then programs in containers cannot use cgroupv1 and thus would be broken by the switch). File capabilities v3 was handled by silently doing conversions (in quite a few ways) specifically to avoid this problem. But cgroupv2 has no such work, and as a result there will probably be a split like this for a very long time... |
Note that it's possible to set up a cgroupsv1-compatible environment for container payloads on a cgroupsv2 host and vice versa (if the kernel supports both APIs). systemd-nspawn supports that, for example. It's a bit restricted though, since it means you can't reasonably delegate any controllers to the container, but quite frankly controller delegation on cgroupsv1 is unsafe anyway, and hence not desirable, regardless of whether the host runs cgroupsv2 or cgroupsv1. Or in other words: whether the host runs a cgroupsv2 or cgroupsv1 setup does not necessarily affect what the container payloads see. |
How does this work? Last I checked, you have to enable an entire controller on either cgroupv1 or cgroupv2 and you can't use them in parallel. So if the host is using cgroupv2 controllers, then the container cannot use the cgroupv1 equivalent of the same controller simultaneously. This is what I was referring to. |
Given that cpuset.cpus on cgroupsv1 is actually the same thing as the normal process affinity mask (yes, they propagate towards each other, it's fucked), there's really no benefit in using

I am not sure I grok why java wants to read that and what for. I mean, does it assume it's the only thing running inside a cgroup? What good is a memory measurement for yourself when you don't know if it's actually you or something else too that is accounted into it? Sounds all very fishy to me...

Either way, this sounds like no big issue to me. A patch fixing this should be pretty straightforward, and it doesn't actually "break" stuff I guess anyway, except some stats...
As I wrote countless times elsewhere and here: if you follow those guidelines then your program doesn't need anything systemd-specific really: the whole delegation docs just say: you asked for delegation, you got it, now stay within the subtree you got, and you are fine. Also, systemd insists on nsdelegate when it's available. It's not even an option to opt out of nsdelegate. It has been that way for quite a while now, actually. |
Speaking for OpenJDK, yes, it uses

Having said that, cpuset.cpus isn't commonly used in cloud frameworks. E.g. Kubernetes uses cpu shares and cpu quotas. OpenJDK takes cpu shares and cpu quotas into account. In doing so it makes some assumptions about the higher-level cloud frameworks, like Kubernetes, and how they set up and run containers. Example:
OpenJDK hotspot has its own memory management. If run in a container with memory limits, it needs to know so as to not run afoul of the OOM killer. It would otherwise size its heap too big and eventually an OOM kill would happen. As for the CPU limits, it does that so that it can make some guesstimate of the available CPUs. It's never going to be accurate, but as the JVM does some sizing of its threads (JIT threads, GC threads, etc.) based on the CPUs it thinks it has available, it works better if it takes cgroup limits into account.
It doesn't. However, that's actually a fairly common thing in cloud containers. Anyhow, it's better off considering the container limits than the actual host values.
Agreed. There is no perfect answer for this. But considering there is a container limit it can be assumed that the user wanted the entire container (cgroup) to not go beyond that limit. Be it one process or more.
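To make the OpenJDK behaviour described above concrete, here is a rough Go sketch of reading a cgroup v1 memory limit and deriving an effective CPU count from the CFS quota and period (the ceil(quota/period) heuristic a runtime like the JVM uses when sizing GC/JIT thread pools). It assumes the standard /sys/fs/cgroup v1 mounts and skips resolving the process's own cgroup path from /proc/self/cgroup:

```go
// Sketch only: read the v1 memory limit and compute an effective CPU count.
// Assumes /sys/fs/cgroup/memory and /sys/fs/cgroup/cpu are mounted as usual.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// Memory limit the OOM killer will enforce for this cgroup
	// (a very large value means "unlimited").
	if limit, err := readInt("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
		fmt.Println("memory limit (bytes):", limit)
	}

	// Effective CPU count from the CFS quota/period, rounded up.
	quota, qerr := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period, perr := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if qerr == nil && perr == nil && quota > 0 && period > 0 {
		cpus := (quota + period - 1) / period // ceil(quota/period)
		fmt.Println("effective CPUs:", cpus)
	}
}
```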
Not sure if it's related, but we've discovered that with Kernel 4.18 and above the container detection breaks with systemd slices. Last working Kernel was 4.15. See: |
https://fedoraproject.org/wiki/Changes/CGroupsV2 Is Red Hat working on cgroup2 support for runc? |
Filipe (@filbranden) is working on it: containers/conmon#8 |
We are trying to support his efforts and make changes in Podman and Conmon to further his testing along. Bottom line, this is a high priority for us, and we will do anything we can to help this along. |
@AkihiroSuda Just cc'd you on opencontainers/runc#1991 where I'm starting to fix libcontainer's systemd cgroup driver to actually always go through systemd (using the D-Bus interface) for all the writes. That first PR is trying to establish an interface for the subsystems to translate their OCI configuration into systemd D-Bus properties, and it implements it for the "devices" controller (as a proof of concept.) Once the interface is approved/merged, we can convert the other cgroups (memory, cpu, etc.) and get all going through systemd. Once that's in, I already have some code to gather the stats from the cgroupv2 tree (it's a fairly simple patch.) So... progress! Watch that PR and pitch in if you like! Cheers, |
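As a rough illustration of the "all writes go through systemd" approach described above (this is not the actual runc code), the sketch below creates a transient scope over D-Bus and lets systemd apply a memory limit. It assumes the github.com/coreos/go-systemd/v22/dbus and github.com/godbus/dbus/v5 client libraries; the unit name and the limit value are illustrative:

```go
// Sketch: translate an OCI memory limit into systemd D-Bus properties on a
// transient scope, instead of writing to the cgroup filesystem directly.
package main

import (
	"log"
	"os"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

func startScopeWithMemoryLimit(pid int, limitBytes uint64) error {
	conn, err := systemddbus.New() // system bus; NewUserConnection() for systemd --user
	if err != nil {
		return err
	}
	defer conn.Close()

	props := []systemddbus.Property{
		{Name: "PIDs", Value: godbus.MakeVariant([]uint32{uint32(pid)})},
		{Name: "MemoryMax", Value: godbus.MakeVariant(limitBytes)}, // memory.max on cgroup v2
		{Name: "Delegate", Value: godbus.MakeVariant(true)},
	}

	ch := make(chan string, 1)
	if _, err := conn.StartTransientUnit("example-container.scope", "replace", props, ch); err != nil {
		return err
	}
	<-ch // wait for the start job to finish
	return nil
}

func main() {
	if err := startScopeWithMemoryLimit(os.Getpid(), 256<<20); err != nil {
		log.Fatal(err)
	}
}
```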
Thanks, just to confirm, non-systemd cgroup2 is also going to be supported? |
No, only through the systemd cgroupdriver. That's the thing: doing it through systemd we get it for free, we only go through D-Bus and systemd abstracts all that from us. The only remaining implementation is when getting statistics directly from the tree (

Frankly, I don't see cgroupv2 on the cgroupfs cgroupdriver ever happening, since some controllers (such as "devices") were discontinued on cgroupv2, so systemd is actually installing an eBPF rule to implement device restrictions there. I really don't see libcontainer duplicating that effort... (But I might be wrong about it.) In any case, I'd say 99% of systems I care about are running on systemd anyway, so going through it makes sense to me. |
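For the one part that still touches the filesystem, gathering statistics, here is a minimal sketch of reading usage counters straight from the unified (v2) tree. The cgroup path is illustrative; a real implementation would resolve it from /proc/&lt;pid&gt;/cgroup:

```go
// Sketch: collect usage statistics from the cgroup v2 (unified) hierarchy.
// Assumes the standard /sys/fs/cgroup mount point.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func readStat(cgroupPath, file string) (string, error) {
	b, err := os.ReadFile(filepath.Join("/sys/fs/cgroup", cgroupPath, file))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	cg := "system.slice/example-container.scope" // illustrative path
	for _, f := range []string{"memory.current", "pids.current", "cpu.stat"} {
		if v, err := readStat(cg, f); err == nil {
			fmt.Printf("%s:\n%s\n", f, v)
		}
	}
}
```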
So it doesn't work with nested containers and Alpine hosts? |
I, at least, am only looking at fixing the systemd path to support cgroupv2. So I think it won't work nested, unless you're running systemd inside your container (e.g. like KIND does.) I believe systemd can run in a fairly unprivileged container (at least in the nspawn world...) but I haven't looked a lot into this, so you'd need to double-check that... |
On 19/02/19 20:54 -0800, Filipe Brandenburger wrote:

> I, at least, am only looking at fixing the systemd path to support cgroupv2. So I think it won't work nested, unless you're running systemd inside your container (e.g. like KIND does.) I believe systemd can run in a fairly unprivileged container (at least in the nspawn world...) but I haven't looked a lot into this, so you'd need to double-check that...
i suppose nothing is stopping a hook from mounting cgroup v1 for the
container. It sounds gross and i'm not sure how manageable it would be.
|
On Wed, 20.02.19 14:05, Vincent Batts wrote:

> i suppose nothing is stopping a hook from mounting cgroup v1 for the
> container. It sounds gross and i'm not sure how manageable it would be.
note that nspawn actually supports running cgroupsv1 container
payloads on a cgroupsv2 host. It does so by mounting the old
hierarchies internally, and using that, replicating the minimal
hierarchy from the cgroupsv2 tree as necessary. But this is pretty
messy, since nobody maintains that tree and cleans it up afterwards.
|
I would like to see this get in, but in the rootless case, where we want to modify the cgroups of a container, will this work? Will runc be able to talk to systemd to set up a cgroup for the container? |
i am not sure how unpriv runc precisely works. But note that PID1 (i.e. the system instance of systemd) will deny delegation of cgroups subtrees to unprivileged clients if they already dropped privs. However, it's fine to delegate cgroup subtrees to programs that start unpriv and drop privs later, as well as to service payloads that use systemd's User= and thus let systemd drop privs for you. Also note that each regular user generally has their own systemd --user instance. Unpriv users can request their instance for a delegated subtree too, and this is then permitted. The APIs are exactly the same as they are for the system instance, except that you ask on the user rather than the system bus for delegation. |
This sounds like exactly what we need. If a user is allocated X% of a resource, then we want them to further subdivide that X% among their containers. |
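A minimal sketch of the mechanism Lennart describes: an unprivileged process asking its own systemd --user instance, over the user bus, for a delegated subtree it can then subdivide for its containers. Same go-systemd/godbus library assumptions as the earlier sketch; the scope name is made up:

```go
// Sketch: request a delegated cgroup subtree from systemd --user.
// The user-bus API is the same as the system instance, but needs no privileges.
package main

import (
	"log"
	"os"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

func main() {
	conn, err := systemddbus.NewUserConnection() // talk to systemd --user, not PID 1
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	props := []systemddbus.Property{
		{Name: "Delegate", Value: godbus.MakeVariant(true)}, // request a delegated subtree
		{Name: "PIDs", Value: godbus.MakeVariant([]uint32{uint32(os.Getpid())})},
	}

	ch := make(chan string, 1)
	if _, err := conn.StartTransientUnit("rootless-demo.scope", "fail", props, ch); err != nil {
		log.Fatal(err)
	}
	<-ch // the process now lives in its own delegated cgroup and may create children,
	//      as long as it stays within the delegated subtree
}
```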
I've written this message privately to some of you, but I'll report it here as well: something I've noticed and that will block its adoption for rootless containers is that D-Bus doesn't work from a user namespace if euid_in_the_namespace != euid_on_the_host. We create the user namespace to manage the storage and the networking before we call the OCI runtime. The OCI runtime for rootless containers can create a nested userns if different mappings are used, but it already runs within a userns with euid=0. A simple test:
I think it depends on D-Bus including the euid in the AUTH EXTERNAL request: https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-bus/bus-socket.c#L620 |
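For context, the EXTERNAL mechanism sends the client's claimed uid as an ASCII-decimal string, hex-encoded, and the bus compares it with the uid it observes on the socket (SO_PEERCRED). A tiny sketch of what the client sends, which shows why a userns where euid 0 maps to a different host uid fails to authenticate:

```go
// Sketch: the initial line a D-Bus client sends for SASL EXTERNAL auth.
package main

import (
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	euid := fmt.Sprintf("%d", os.Geteuid()) // e.g. "0" inside the user namespace
	fmt.Printf("AUTH EXTERNAL %s\n", hex.EncodeToString([]byte(euid)))
	// The daemon on the host side sees the *mapped* uid (e.g. 1000) on the
	// socket, so the claimed identity ("0") does not match and AUTH fails.
}
```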
being addressed by: systemd/systemd#11785 |
Progress continues to be made on cgroupsv2. |
As you have gaps identified, please report them to the upstream tracker: opencontainers/runtime-spec#1002
|
@filbranden Any update on the cgroupsv2 work? |
Hi @rhatdan I just added an update to opencontainers/runc#2007 with a proposed approach. I think we still need more work on the underlying components, to ensure everything is in place. In particular, we'll need freezer support in cgroup2 in the kernel (last I looked, it was planned for 5.2, but not sure if it's still on schedule) and systemd needs to export more cgroup2 interfaces to userspace via D-Bus (such as freezer, as mentioned, and also cpuset, which I believe made it into kernel 5.0). Cheers! |
Thanks for keeping us up2date. I am watching the runc PRs and keeping up with it as best I can. @filbranden Keep up the good work. Eventually we will get there. |
@giuseppe Since we now have cgroupsv2 support, can we close this issue? |
Yes, I think we can close the issue here and address any future issues separately. |
Rootless mode could support cgroups when pam_cgfs.so is available (opencontainers/runc#1839 cc @cyphar), but it is not available on Fedora (AFAIK). Is there a plan for supporting pam_cgfs.so or any equivalent of that? (This question is not specific to podman, and I'm not sure this repo is the right place to ask this question :p)
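As a hedged illustration of what pam_cgfs.so provides (it chowns a per-user cgroup on the v1 hierarchies at login), the sketch below checks whether the process's current memory cgroup is owned by the calling user, which is roughly what a rootless runtime would need before it can create per-container sub-cgroups. The controller choice and the check are simplified:

```go
// Sketch: detect whether this process owns its own (v1) memory cgroup,
// which is what pam_cgfs-style delegation would arrange at login.
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// currentCgroupPath parses /proc/self/cgroup (lines look like
// "9:memory:/user/...") and returns the filesystem path for a controller.
func currentCgroupPath(controller string) (string, bool) {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		return "", false
	}
	for _, line := range strings.Split(string(data), "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) == 3 && parts[1] == controller {
			return "/sys/fs/cgroup/" + controller + parts[2], true
		}
	}
	return "", false
}

func main() {
	path, ok := currentCgroupPath("memory")
	if !ok {
		fmt.Println("no v1 memory cgroup found")
		return
	}
	fi, err := os.Stat(path)
	if err != nil {
		fmt.Println(err)
		return
	}
	st := fi.Sys().(*syscall.Stat_t)
	// With pam_cgfs the login session's cgroup is chowned to the user;
	// without it (e.g. stock Fedora at the time), it is root-owned.
	fmt.Println("own our cgroup:", int(st.Uid) == os.Getuid())
}
```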