
virtcontainers: prepend a kata specific string to host cgroups path #1518

Merged
merged 1 commit into kata-containers:master on Apr 12, 2019

Conversation

lifupan
Member

@lifupan lifupan commented Apr 11, 2019

Prepend a Kata-specific string to the OCI cgroup path to form a different cgroup path, so that cAdvisor cannot find the Kata Containers cgroup path on the host, preventing it from grabbing the stats data.

Fixes: #1488

Signed-off-by: lifupan [email protected]
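
As a minimal sketch of the idea (the constant and helper names below are illustrative assumptions, not necessarily the exact code in this PR): only the final path component is renamed, so the cgroup stays in the same place in the hierarchy.

package main

import "path/filepath"

// cgroupKataPrefix is an assumed name for the Kata-specific marker.
const cgroupKataPrefix = "kata"

// renameCgroupPath prepends the marker to the last component of the host
// cgroup path, leaving the parent hierarchy untouched, so that tools that
// derive the path from the container ID (such as cAdvisor) no longer find it.
func renameCgroupPath(path string) string {
	name := filepath.Base(path)
	parent := filepath.Dir(path)
	return filepath.Join(parent, cgroupKataPrefix+"_"+name)
}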

@lifupan lifupan changed the title virtcontainers: only create host cgroups for podsandbox in shimv2 virtcontainers: add a kata specific appendix to host cgroups path Apr 11, 2019
@lifupan lifupan force-pushed the fixtop branch 3 times, most recently from a95202c to d608e93 Compare April 11, 2019 12:31
@lifupan
Member Author

lifupan commented Apr 11, 2019

/test

@lifupan
Member Author

lifupan commented Apr 11, 2019

@egernst If we accept this fix, we should update the test cases for the cgroup tests.

@egernst
Member

egernst commented Apr 11, 2019

I think this is the best option right now for fixing this. Let's get the tests updated!

Thanks @lifupan

@egernst
Member

egernst commented Apr 11, 2019

/cc @devimc @mcastelino @krsna1729

@devimc

devimc commented Apr 11, 2019

@mocolic does this change work for you? (I don't think so) 🤔

@egernst
Member

egernst commented Apr 11, 2019

@devimc - can you clarify their assumptions on cgroups? How do they use our cgroups?

@mocolic

mocolic commented Apr 11, 2019

@devimc
If this change doesn't honor the cgroup name we pass to Kata, it will break the behavior, so it won't work for us. Our resource governance implementation relies on the cgroup being honored: resource limits are set at the cgroup level, and the expectation is that Kata creates and activates the container inside this cgroup.

However, recent tests have shown another problem with this check-in - #1189
Please let me know if I should open a new issue for this and we can continue there.

I believe the problem is this code snippet in virtcontainers/cgroups.go, since it governs containers even when they do not have constraints, which is not expected:

// use a default constraint for sandboxes without cpu constraints
if period == uint64(0) && quota == int64(0) {
	// set a quota and period equal to the default number of vcpus
	quota = int64(s.config.HypervisorConfig.NumVCPUs) * 100000
	period = 100000
}
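
For example, with NumVCPUs=2 this sets quota=200000 and period=100000, i.e. an implicit constraint of two full cores even for a container that declared no CPU limits.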

In our clusters the default number of vCPUs is -1 (the entire machine), and since a parent cgroup can't have stricter limits than its child cgroup, container activation fails with this error:

"Could not update cgroup /fabric/SingleInstance_0_App1:myCpuServicePkg@3016e3dc-60b9-8746-a897-4f1bb73dd80a@0c665905-47af-8147-b616-8a0c099ca7a5/crio-5e4f661e9035b89d190a34bcec617a75ac0006b47b784dab207f02bda98736d1: write /sys/fs/cgroup/cpu/fabric/SingleInstance_0_App1:myCpuServicePkg@3016e3dc-60b9-8746-a897-4f1bb73dd80a@0c665905-47af-8147-b616-8a0c099ca7a5/crio-5e4f661e9035b89d190a34bcec617a75ac0006b47b784dab207f02bda98736d1/cpu.cfs_quota_us: invalid argument"
Without the code snippet above, our resource governance would have worked out of the box.

@devimc

devimc commented Apr 11, 2019

@mocolic

If this change doesn't honor the cgroup name we pass to Kata, it will break the behavior, so it won't work for us. Our resource governance implementation relies on the cgroup being honored: resource limits are set at the cgroup level, and the expectation is that Kata creates and activates the container inside this cgroup.

yes, with this change the cgroup path won't be honoured (again)

However, recent tests have shown another problem with this check-in - #1189
Please let me know if I should open a new issue for this and we can continue there.

good finding, please file an issue

@mocolic

mocolic commented Apr 11, 2019

@devimc

However, recent tests have shown another problem with this check-in - #1189
Please let me know if I should open a new issue for this and we can continue there.

good finding, please file an issue

Thanks, opened #1521

@egernst
Member

egernst commented Apr 11, 2019

This forces a pretty tight coupling. As a data point: in Kubernetes, the node-level part of the orchestrator (kubelet) creates a cgroup (kubepods) under which it places containers (in this specific instance, a pod cgroup). Kubelet is then able to modify and restrict/enforce limits by modifying the cgroup (kubepods) that it created.

In the other scenario (@mocolic), the upper layer relies on Kata creating container cgroups which the upper layer can then go and modify. This makes for a rather tight coupling, as can be seen in this scenario.

I’m not sure what the right solution is yet, but the coupling isn’t ideal.
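
For reference, the kubelet-managed hierarchy looks roughly like this (a simplified sketch; QoS-class levels such as burstable are omitted):

/sys/fs/cgroup/cpu/kubepods
└── pod<pod-uid>          # pod cgroup created and managed by kubelet
    ├── <container-id-1>
    └── <container-id-2>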

@mocolic

mocolic commented Apr 11, 2019

@egernst

In the other scenario (@mocolic), the upper layer relies on Kata creating container cgroups which the upper layer can then go and modify. This makes for a rather tight coupling, as can be seen in this scenario.

So, we create our cgroup and change limits at that level; we do not modify Kata's child cgroups. The only thing we rely on is that the container will be under this parent cgroup (as long as the container cgroup is under /fabric/, the child hierarchy doesn't matter to us). Also, all other container runtimes (Docker/Clear Containers) work this way.
On the other hand, with this solution there is no way to create a container that is ungoverned, right? It is implicitly governed at the whole-machine level, even if the customer doesn't want that.

@egernst
Member

egernst commented Apr 11, 2019

can you provide a complete hierarchy example @mocolic?

It sounds like the name of the cgroup we create doesn't matter, so long as we place it in the correct place in the hierarchy, correct?

With this PR, we place it in the same exact location, but provide a slightly modified name.
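
In other words (assuming a "kata"-style prefix purely for illustration):

before: /sys/fs/cgroup/cpu/fabric/crio-89e60b84
after:  /sys/fs/cgroup/cpu/fabric/kata_crio-89e60b84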

@mocolic

mocolic commented Apr 11, 2019

@egernst

can you provide a complete hierarchy example @mocolic?

It sounds like the name of the cgroup we create doesn't matter, so long as we place it in the correct place in the hierarchy, correct?

Correct.
If we pass a cgroup name to Kata, for example fabric, the hierarchy should look like:
/sys/fs/cgroup/cpu,cpuacct
└── fabric
    ├── crio-89e60b84 # Sandbox cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/fabric/crio-89e60b84
    ├── crio-2a95e0f8 # Container A cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/fabric/crio-2a95e0f8
    └── crio-d037bce7 # Container B cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/fabric/crio-d037bce7

If a parent cgroup is not provided, I expect the hierarchy to look something like this (or whatever naming convention you have):
/sys/fs/cgroup/cpu,cpuacct
├── crio-89e60b84 # Sandbox cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/crio-89e60b84
├── crio-2a95e0f8 # Container A cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/crio-2a95e0f8
└── crio-d037bce7 # Container B cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/crio-d037bce7

One example of the problem:
We set limits on the fabric cgroup (for example 1.3 cores, which translates to cpu.cfs_quota_us = 130000). Say the machine has 2 cores (and, as in our configuration, default_vcpus=-1); the code snippet above will then try to set cpu.cfs_quota_us = 200000 on the child cgroup fabric/crio-d037bce7. Linux doesn't allow this, so container activation fails.
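
A minimal sketch of the failing step, under the assumptions of this example (cgroup v1, paths and values from above; not the actual Kata code):

package main

import (
	"fmt"
	"os"
)

func main() {
	const period = 100000      // CFS period: 100ms
	numVCPUs := int64(2)       // default_vcpus=-1 resolved to the 2 physical cores
	quota := numVCPUs * period // 200000, i.e. two full cores

	// The parent cgroup "fabric" is capped at 1.3 cores (cpu.cfs_quota_us=130000).
	// A child cannot be granted more CPU bandwidth than its parent, so this write
	// is rejected with EINVAL and container activation fails.
	path := "/sys/fs/cgroup/cpu/fabric/crio-d037bce7/cpu.cfs_quota_us"
	if err := os.WriteFile(path, []byte(fmt.Sprintf("%d", quota)), 0644); err != nil {
		fmt.Println(err) // e.g. "write ...: invalid argument"
	}
}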

@devimc

devimc commented Apr 11, 2019

@mocolic

One example of the problem:
We set limits on the fabric cgroup (for example 1.3 cores, which translates to cpu.cfs_quota_us = 130000). Say the machine has 2 cores (and, as in our configuration, default_vcpus=-1); the code snippet above will then try to set cpu.cfs_quota_us = 200000 on the child cgroup fabric/crio-d037bce7. Linux doesn't allow this, so container activation fails.

then the constraint for the container should be -1 (no constraint), right?

@mocolic

mocolic commented Apr 11, 2019

@mocolic

One example of the problem:
We set limits on the fabric cgroup (for example 1.3 cores, which translates to cpu.cfs_quota_us = 130000). Say the machine has 2 cores (and, as in our configuration, default_vcpus=-1); the code snippet above will then try to set cpu.cfs_quota_us = 200000 on the child cgroup fabric/crio-d037bce7. Linux doesn't allow this, so container activation fails.

then the constraint for the container should be -1 (no constraint), right?

When I checked the configuration.toml file, it says this:
# Default number of vCPUs per SB/VM:
# < 0 --> will be set to the actual number of physical cores

If I understood correctly, this means that you would internally change this to the entire machine (to 2), right? Also, when I created a container that requires the whole machine (2 cores are set on the fabric cgroup), the container activation was successful.

@devimc

devimc commented Apr 11, 2019

@mocolic

If I understood correctly, this means that you would internally change this to the entire machine (to 2), right?

yes, and currently it's not working because we try to set a constraint equal to the number of physical resources (in this case 2), but that won't be possible if the parent cgroup is smaller (for example 1.3), so we shouldn't apply a constraint

@mcastelino
Contributor

@mocolic since Kata really does not (and cannot really) impose container-level cgroups unless CPU sets are used, can you expand a little on what your container-level cgroup is expected to contain? We will place the kata-shim in this cgroup, but the actual workloads will be in what we designate as the pod cgroup. So any stats or constraints you apply on the container cgroup will apply to the shim and not the actual workload. Do you place anything else in the container cgroup?

└── fabric
    ├── crio-89e60b84 # Sandbox cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/fabric/crio-89e60b84
    ├── crio-2a95e0f8 # Container A cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/fabric/crio-2a95e0f8
    └── crio-d037bce7 # Container B cgroupsPath=/sys/fs/cgroup/cpu,cpuacct/fabric/crio-d037bce7

@egernst
Member

egernst commented Apr 12, 2019

@fupan approach looks good. Prepend would be better than append imo. Let's get the tests fixed.

@lifupan
Member Author

lifupan commented Apr 12, 2019

@fupan approach looks good. Prepend would be better than append imo. Let's get the tests fixed.

@egernst Got it, I'll change it to prepend and then update the test cases.

prepend a kata specific string to the oci cgroup path to
form a different cgroup path, so that cAdvisor cannot
find the kata containers cgroup path on the host,
preventing it from grabbing the stats data.

Fixes: kata-containers#1488

Signed-off-by: lifupan <[email protected]>
@lifupan lifupan changed the title virtcontainers: add a kata specific appendix to host cgroups path virtcontainers: prepend a kata specific string to host cgroups path Apr 12, 2019
lifupan added a commit to lifupan/tests that referenced this pull request Apr 12, 2019
Kata prepended a kata specific string to the cgroup path name
to prevent cAdvisor from picking up stats data; otherwise, kubelet
will use those "zero" data to override the stats data obtained from
the CRI provider.

Fixes: kata-containers#1461

Depends-on: github.com/kata-containers/runtime#1518

Signed-off-by: lifupan <[email protected]>
@lifupan
Member Author

lifupan commented Apr 12, 2019

The changes for the tests are here: kata-containers/tests#1462
Let's merge it first and then trigger the tests for this PR.

@devimc

devimc commented Apr 12, 2019

/test

@devimc devimc merged commit d99693a into kata-containers:master Apr 12, 2019
@mcastelino
Contributor

@chavafg can we have a test case for this using something along the lines of a pod such as

apiVersion: v1
kind: Pod
metadata:
  name: test-metrics
spec:
  runtimeClassName: kata
  containers:
  - name: cont-2cpu-400m
    image: busybox
    resources:
      limits:
        cpu: 2
        memory: "400Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
  - name: cont-3cpu-200m
    image: busybox
    resources:
      limits:
        cpu: 3
        memory: "200Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]   

And then check the actual results via kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/test-metrics"

which should yield pretty high CPU utilization, along the lines of:

  "containers": [
    {
      "name": "cont-3cpu-200m",
      "usage": {
        "cpu": "398739040n",
        "memory": "744Ki"
      }
    },
    {
      "name": "cont-2cpu-400m",
      "usage": {
        "cpu": "200848120n",
        "memory": "852Ki"
      }
    }
  ]

@egernst egernst modified the milestones: cgoup-sprint, cgroup-sprint Apr 12, 2019
ganeshmaharaj pushed a commit to ganeshmaharaj/kata-runtime that referenced this pull request Apr 16, 2019
virtcontainers: prepend a kata specific string to host cgroups path
(cherry picked from commit d99693a)

Fixes: kata-containers#1488

Signed-off-by: Ganesh Maharaj Mahalingam <[email protected]>
@egernst egernst mentioned this pull request Apr 16, 2019
@lifupan lifupan deleted the fixtop branch August 22, 2019 09:08
Development

Successfully merging this pull request may close these issues.

Kata metrics broken in Kubernetes v1.14.0 + Containerd 1.2.5 + kata-runtime shimv2 : 1.6.0