This repository has been archived by the owner on Mar 9, 2022. It is now read-only.

--privileged should add explicit "rw" config for /sys mount #753

Closed
corrieb opened this issue Apr 26, 2018 · 23 comments

@corrieb

corrieb commented Apr 26, 2018

If you create a privileged container using the ctr client, it will explicitly mount /sys using the "rw" flag. See https://github.com/containerd/containerd/blob/9d9d1bc13c107a460212d12ed7ee2f422379a10f/oci/spec_opts_unix.go#L602

cri-containerd simply removes the "ro" flag for --privileged, assuming the container runtime will default to an RW mount, which appears not to be the case. See

clearReadOnly(&spec.Mounts[i])

As a result, running kube-proxy using containerd 1.1 via cri-containerd results in the following error:

E0426 18:40:33.281905 5 conntrack.go:124] sysfs is not writable: {Device:sysfs Path:/sys Type:sysfs Opts:[ro nosuid nodev noexec relatime] Freq:0 Pass:0} (mount options are [ro nosuid nodev noexec relatime])
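For reference, here is a minimal sketch of the behaviour described above, assuming the helper does nothing more than filter "ro" out of the mount's options slice (Mount is the OCI runtime-spec Go type):

package main

import (
	"fmt"

	runtimespec "github.com/opencontainers/runtime-spec/specs-go"
)

// clearReadOnly (sketch) drops "ro" from a mount's options but does not add
// "rw", so the read/write behaviour is left to the runtime's default.
func clearReadOnly(m *runtimespec.Mount) {
	var opts []string
	for _, o := range m.Options {
		if o != "ro" {
			opts = append(opts, o)
		}
	}
	m.Options = opts
}

func main() {
	m := runtimespec.Mount{
		Destination: "/sys",
		Type:        "sysfs",
		Source:      "sysfs",
		Options:     []string{"nosuid", "noexec", "nodev", "ro"},
	}
	clearReadOnly(&m)
	fmt.Println(m.Options) // [nosuid noexec nodev] - neither "ro" nor "rw"
}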

@corrieb corrieb changed the title --privileged should add explicit "rw" config for /sys --privileged should add explicit "rw" config for /sys mount Apr 26, 2018
@Random-Liu Random-Liu added this to the v1.0.1 milestone Apr 27, 2018
@Random-Liu
Member

@corrieb Thanks for the bug report. I'll look into it.

@Random-Liu
Member

Random-Liu commented Apr 27, 2018

It seems fine in my cluster:

# crictl ps | grep kube-proxy
1b5e8f1282b81       sha256:bfc21aadc7d3e20e34cec769d697f93543938e9151c653591861ec5f2429676b   23 hours ago        CONTAINER_RUNNING   kube-proxy
# crictl exec -s 1b5e8f1282b81 /bin/sh -c "cat /proc/mounts | grep sysfs"
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
# crictl inspect 1b5e8f1282b81
...
        {
          "destination": "/sys",
          "type": "sysfs",
          "source": "sysfs",
          "options": [
            "nosuid",
            "noexec",
            "nodev"
          ]
        },
...

And Docker actually does a similar thing, IIUC. https://github.com/moby/moby/blob/master/daemon/oci_linux.go#L679

  1. Are you sure you started kube-proxy with privileged set?
  2. Did you set ReadOnlyRootFilesystem along with privileged? https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems

Can you share your pod yaml?

@corrieb
Author

corrieb commented Apr 27, 2018

Thanks for helping me investigate, @Random-Liu. I'm not using crictl; I'm creating a PodSandboxConfig and ContainerConfig based on the v1.Pod specification for kube-proxy. I'm trying to figure out why I'm seeing this inconsistency. It could be something I've missed, but I'm struggling to see what it might be.

Here are my configs:

PodSandboxConfig{
	Metadata:&PodSandboxMetadata{
		Name:kube-proxy-57x6t,
		Uid:9b225ac1-49dd-11e8-aea8-005056b4fdca,
		Namespace:kube-system,Attempt:0,
	},
	Hostname:virtual-kubelet,
	LogDirectory:/var/log/vk-cri/9b225ac1-49dd-11e8-aea8-005056b4fdca,
	DnsConfig:nil,
	PortMappings:[],
	Labels:map[string]string{
		controller-revision-hash: 1193416634,
		k8s-app: kube-proxy,
		pod-template-generation: 1,
	},
	Annotations:map[string]string{},
	Linux:&LinuxPodSandboxConfig{
		CgroupParent:,
		SecurityContext:
		&LinuxSandboxSecurityContext{
			NamespaceOptions:nil,
			SelinuxOptions:nil,
			RunAsUser:nil,
			ReadonlyRootfs:false,
			SupplementalGroups:[],
			Privileged:true,
			SeccompProfilePath:,
			RunAsGroup:nil,
		},
		Sysctls:map[string]string{},
	},
}

ContainerConfig{	
	Metadata:&ContainerMetadata{
		Name:kube-proxy,Attempt:0,
	},
	Image:&ImageSpec{
		Image:sha256:bfc21aadc7d3e20e34cec769d697f93543938e9151c653591861ec5f2429676b,
	},
	Command:[/usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf],
	Args:[],
	WorkingDir:,
	Envs:[],
	Mounts:[
		&Mount{
			ContainerPath:/var/lib/kube-proxy,
			HostPath:/run/vk-cri/volumes/9b225ac1-49dd-11e8-aea8-005056b4fdca/configmaps/kube-proxy,
			Readonly:false,
			SelinuxRelabel:false,
			Propagation:PROPAGATION_PRIVATE,
		} 
		&Mount{
			ContainerPath:/run/xtables.lock,
			HostPath:/run/xtables.lock,
			Readonly:false,
			SelinuxRelabel:false,
			Propagation:PROPAGATION_PRIVATE,
		} 
		&Mount{
			ContainerPath:/lib/modules,HostPath:/lib/modules,
			Readonly:true,SelinuxRelabel:false,
			Propagation:PROPAGATION_PRIVATE,
		} 
		&Mount{
			ContainerPath:/var/run/secrets/kubernetes.io/serviceaccount,
			HostPath:/run/vk-cri/volumes/9b225ac1-49dd-11e8-aea8-005056b4fdca/secrets/kube-proxy-token-tllzk,
			Readonly:true,SelinuxRelabel:false,
			Propagation:PROPAGATION_PRIVATE,}
	],
	Devices:[],
	Labels:map[string]string{},
	Annotations:map[string]string{},
	LogPath:kube-proxy-0.log,
	Stdin:false,
	StdinOnce:false,
	Tty:false,
	Linux:&LinuxContainerConfig{
		Resources:nil,
		SecurityContext:&LinuxContainerSecurityContext{
			Capabilities:nil,
			Privileged:true,
			NamespaceOptions:nil,
			SelinuxOptions:nil,
			RunAsUser:nil,
			RunAsUsername:,
			ReadonlyRootfs:false,
			SupplementalGroups:[],
			ApparmorProfile:,
			SeccompProfilePath:,
			NoNewPrivs:false,RunAsGroup:nil,
		},
	},
	Windows:nil,
}

@corrieb
Author

corrieb commented Apr 27, 2018

This is a dump of a kube-proxy Pod Spec (not the exact same pod, as you'll see from the ID):

&Pod{
	ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{
		Name:kube-proxy-lvrtc,
		GenerateName:kube-proxy-,
		Namespace:kube-system,
		SelfLink:/api/v1/namespaces/kube-system/pods/kube-proxy-lvrtc,
		UID:dcaa3b96-3d4a-11e8-aea8-005056b4fdca,
		ResourceVersion:1487683,
		Generation:0,
		CreationTimestamp:2018-04-10 22:40:37 -0700 PDT,
		DeletionTimestamp:<nil>,
		DeletionGracePeriodSeconds:nil,
		Labels:map[string]string{
			controller-revision-hash: 1193416634,
			k8s-app: kube-proxy,
			pod-template-generation: 1,
		},
		Annotations:map[string]string{},
		OwnerReferences:[{apps/v1 DaemonSet kube-proxy c8d6018d-383a-11e8-b3c9-005056b4fdca 0xc420248ce9 0xc420248cea}],
		Finalizers:[],
		ClusterName:,
		Initializers:nil,
	},
	Spec:PodSpec{
		Volumes:[{
			kube-proxy {
				nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil 
				ConfigMapVolumeSource{
					LocalObjectReference:LocalObjectReference{
						Name:kube-proxy,
					},
					Items:[],
					DefaultMode:*420,
					Optional:nil,
				} nil nil nil nil nil nil nil nil
			}
		}{
			xtables-lock {
				&HostPathVolumeSource{
					Path:/run/xtables.lock,
					Type:*FileOrCreate,
				} 
				nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
			}
		}{
			lib-modules {
				&HostPathVolumeSource{
					Path:/lib/modules,
					Type:*,
				} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
			}
		}{
			kube-proxy-token-tllzk {
				nil nil nil nil nil 
				&SecretVolumeSource{
					SecretName:kube-proxy-token-tllzk,
					Items:[],
					DefaultMode:*420,
					Optional:nil,
				} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
			}
		}],
		Containers:[{
			kube-proxy k8s.gcr.io/kube-proxy-amd64:v1.10.0 [
				/usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf
			] []  [] [] [] {
				map[] map[]}[
					{kube-proxy false /var/lib/kube-proxy  <nil>} 
					{xtables-lock false /run/xtables.lock  <nil>} 
					{lib-modules true /lib/modules  <nil>} 
					{kube-proxy-token-tllzk true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}
				] 
				nil nil nil /dev/termination-log File IfNotPresent 
				SecurityContext{
					Capabilities:nil,
					Privileged:*true,
					SELinuxOptions:nil,
					RunAsUser:nil,
					RunAsNonRoot:nil,
					ReadOnlyRootFilesystem:nil,
					AllowPrivilegeEscalation:nil,
				} 
				false false false
			}],
			RestartPolicy:Always,
			TerminationGracePeriodSeconds:*30,
			ActiveDeadlineSeconds:nil,
			DNSPolicy:ClusterFirst,
			NodeSelector:map[string]string{},
			ServiceAccountName:kube-proxy,
			DeprecatedServiceAccount:kube-proxy,
			NodeName:virtual-kubelet,
			HostNetwork:true,HostPID:false,
			HostIPC:false,
			SecurityContext:&PodSecurityContext{
				SELinuxOptions:nil,RunAsUser:nil,
				RunAsNonRoot:nil,
				SupplementalGroups:[],
				FSGroup:nil,},
			ImagePullSecrets:[],
			Hostname:,
			Subdomain:,
			Affinity:nil,
			SchedulerName:default-scheduler,
			InitContainers:[],
			AutomountServiceAccountToken:nil,
			Tolerations:[
				{node-role.kubernetes.io/master   NoSchedule <nil>} 
				{node.cloudprovider.kubernetes.io/uninitialized  true NoSchedule <nil>} 
				{node.kubernetes.io/not-ready Exists  NoExecute <nil>} 
				{node.kubernetes.io/unreachable Exists  NoExecute <nil>} 
				{node.kubernetes.io/disk-pressure Exists  NoSchedule <nil>} 
				{node.kubernetes.io/memory-pressure Exists  NoSchedule <nil>}
			],
			HostAliases:[],
			PriorityClassName:,
			Priority:nil,},
		Status:
			PodStatus{
				Phase:Running,
				Conditions:[],
				Message:Node virtual-kubelet which was running pod kube-proxy-lvrtc is unresponsive,
				Reason:NodeLost,HostIP:,
				PodIP:10.244.0.23,
				StartTime:2018-04-12 13:30:50 -0700 PDT,
				ContainerStatuses:[{kube-proxy {nil nil ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,
				Message:,StartedAt:2018-04-12 13:30:50 -0700 PDT,
				FinishedAt:2018-04-12 13:30:50 -0700 PDT,
				ContainerID:,}
			} 
			{nil nil nil} false 0 k8s.gcr.io/kube-proxy-amd64:v1.10.0 k8s.gcr.io/kube-proxy-amd64@sha256:fc944b06c14cb442916045a630d5e374dfb9c453dfc56d3cb59ac21ea4268875 d3334070fac902ff7980307e896987f077e5262c91341d5ed9c11a9e48241bc9}],
		QOSClass:,
		InitContainerStatuses:[],
	},
}

@corrieb
Author

corrieb commented Apr 27, 2018

As you can see, the Pod spec states that it should be run privileged, but it doesn't state anything explicit about /sys or how it should be mounted. I've been assuming that setting Privileged: true in both the sandbox and container security contexts would translate into /sys being mounted RW by implication.

If I run sudo ctr -n k8s.io c info 76b1fa4188f270ab795fbd731ca16627b274d9724ca94cce9886f017281caad4 - the deployed container - and decode the spec, I see the following:

{
    "destination":"/sys",
    "type":"sysfs",
    "source":"sysfs",
    "options":[
        "nosuid",
        "noexec",
        "nodev"
    ]
},

This therefore seems to boil down to the question of what the "default" /sys mount is expected to be when neither "ro" nor "rw" is specified. Here it appears to be interpreted as "ro".

@Random-Liu
Member

I see. What is the runc version you are using? And what is your OS?

Does Docker work in your environment in this case? We do the same thing Docker does.

@corrieb
Author

corrieb commented Apr 27, 2018

runc version 1.0.0-rc5, containerd 1.1.0. The OS is Debian Linux, kernel 4.9.0-6-amd64.

It's interesting to note that the CRI spec is very clear about what Privileged means: https://github.com/kubernetes/kubernetes/blob/a38a02792b55942177ee676a5e1993b18a8b4b0a/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L541

// 3. Any sysfs and procfs mounts are mounted RW.

What's not clear, though, is whether runc should interpret Privileged in this way or whether that should be made explicit by the caller. ctr does the latter - it explicitly adds "rw" for privileged before runc ever sees the spec.

If that's the answer, then that's the answer: the client needs to make the privileged mounts explicit. If so, I'll modify my code to do that.
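To illustrate that approach, here is a sketch of what making the privileged mounts explicit could look like. setPrivilegedMountsRW is a hypothetical name used only for this example, not containerd's API, and the real change would live in the cri plugin's spec generation:

package main

import (
	"fmt"

	runtimespec "github.com/opencontainers/runtime-spec/specs-go"
)

// setPrivilegedMountsRW (hypothetical) strips "ro" from sysfs and cgroup
// mounts and appends an explicit "rw" for a privileged container, instead of
// relying on the runtime's default when neither flag is present.
func setPrivilegedMountsRW(mounts []runtimespec.Mount) {
	for i, m := range mounts {
		if m.Type != "sysfs" && m.Type != "cgroup" {
			continue
		}
		var opts []string
		for _, o := range m.Options {
			if o != "ro" && o != "rw" {
				opts = append(opts, o)
			}
		}
		mounts[i].Options = append(opts, "rw")
	}
}

func main() {
	mounts := []runtimespec.Mount{{
		Destination: "/sys",
		Type:        "sysfs",
		Source:      "sysfs",
		Options:     []string{"nosuid", "noexec", "nodev", "ro"},
	}}
	setPrivilegedMountsRW(mounts)
	fmt.Println(mounts[0].Options) // [nosuid noexec nodev rw]
}

The point is simply that the spec handed to runc never leaves the read/write mode of these mounts implicit.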

@corrieb
Author

corrieb commented Apr 27, 2018

What's curious to me, though, @Random-Liu, is that your mount spec looks just the same as mine as reported by inspect, yet you get a different outcome. That potentially suggests an OS configuration difference.

@corrieb
Author

corrieb commented Apr 27, 2018

@Random-Liu did you ever figure out this: moby/moby#24000?

@Random-Liu
Member

Random-Liu commented Apr 27, 2018

@Random-Liu did you ever figure out this: moby/moby#24000?

@corrieb No. We were never able to figure that out. Can you update your code to explicitly set rw and verify whether that fixes your problem? If it does, you're more than welcome to send a PR, and I'll review it.

I'd like to spend a little more time figuring out why there is such a difference. It could be a kernel version difference, or something else; let me do a search.

But anyway, I do think we need your patch if it fixes the problem for you. :) Thanks a lot for looking into this! It might fix a problem that has existed for a long time in both Docker and containerd/cri.

@Random-Liu
Member

I'll do some research. If we find a default-behavior difference caused by an OS version or configuration difference, we should fix Docker as well.

@corrieb
Author

corrieb commented Apr 27, 2018

I think it's pretty much guaranteed to fix it. When I run sudo ctr run -t --privileged docker.io/library/debian:latest myctr on the same system, /sys is mounted "rw" and this is reflected in the spec. This is because ctr adds it explicitly, per the code referenced above.

@corrieb
Author

corrieb commented Apr 27, 2018

What this seems to boil down to is what happens when neither "ro" nor "rw" is specified. cri-containerd makes sure to remove "ro", but doesn't add "rw". That ultimately makes the behavior ambiguous.

@Random-Liu
Member

@corrieb Yeah, I mean cri-containerd does that because Docker does the same. But it seems this may not work on some OS configurations/versions.

Would you like to send a PR to add rw? I can send one if you don't have time. :)

@Random-Liu
Member

It would be super helpful if you could verify the change, because I can't reproduce it. :)

You can simply run make, scp the binary to your node, and restart containerd.

@corrieb
Author

corrieb commented Apr 27, 2018

Absolutely. I'll need to go through our legal dept to get approved to submit code to containerd, but heck - that's a worthwhile thing to do regardless.

@justincormack curious if you have an opinion on this.

@Random-Liu
Member

Random-Liu commented Apr 27, 2018

@corrieb Will wait for your fix.

If possible, I'd still like to root cause this:

  1. Is your host sysfs mounted ro?
  2. On the same node, does this happen every time, or only sometimes?
  3. Does restarting containerd fix the issue? (Previously, for the Docker issue where sysfs was mounted read-only for privileged containers, moby/moby#24000, restarting Docker did fix it IIRC.)

@justincormack Do you have any idea?

@corrieb
Author

corrieb commented Apr 27, 2018

  1. No - I checked that early on. That would be weird.
  2. It seems to always happen. The cgroup mount seems to be fine - see below.
  3. Restarting containerd did not fix the issue.
cat /proc/mounts
overlay / overlay rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/30/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/29/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/28/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/111/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/111/work 0 0
proc /proc proc rw,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,size=65536k,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
/dev/sda1 /etc/hosts ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/sda1 /etc/resolv.conf ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
tmpfs /var/lib/kube-proxy tmpfs rw,nosuid,noexec,relatime,size=405088k,mode=755 0 0
tmpfs /run/xtables.lock tmpfs rw,nosuid,noexec,relatime,size=405088k,mode=755 0 0
/dev/sda1 /lib/modules ext4 ro,relatime,errors=remount-ro,data=ordered 0 0
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs ro,relatime,size=405088k,mode=755 0 0

@Random-Liu
Member

Random-Liu commented Apr 27, 2018

@corrieb One last thing... If possible, can you exec into the kube-proxy pod and check whether cgroup is ro? The cgroup mount and sysfs mount logic are almost the same, so by checking this we'll know whether the problem is specific to sysfs. Thanks a lot!

@Random-Liu
Member

Random-Liu commented Apr 27, 2018

@corrieb Thanks for helping me verify. The information is very useful for future debugging. This seems specific to sysfs, and it only happens on some nodes.

Let's explicitly set rw for privileged. If you run into any trouble, e.g. with legal approval, feel free to tell me if you'd like me to fix it. :)

Thanks a lot for looking into this!

@Random-Liu
Member

@corrieb Are you having any problems? Perhaps I can send a fix for you? :)

@corrieb
Author

corrieb commented May 8, 2018

@Random-Liu Thanks for doing this. I was at KubeCon last week, so this was by far the fastest way to get it done.

@Random-Liu
Member

@corrieb No problem. Did it for you. :)

Thanks a lot for finding and reporting the bug!
