This repository has been archived by the owner on May 12, 2021. It is now read-only.

Do not pass all host devices in /dev when launching with '--privileged' #1568

Closed
zhiminghufighting opened this issue Apr 19, 2019 · 49 comments

@zhiminghufighting

zhiminghufighting commented Apr 19, 2019

Description of problem

When I launch an Android image with the --privileged parameter, all devices under the host /dev directory are passed into the container's /dev directory. This behavior doesn't make sense.

Expected result

Do not pass all host devices into the Kata container, even with the --privileged parameter.

Actual result

Here is the ls output of the Kata container's /dev directory:

```
root@5ff9d5d81457:/dev# ls
agpgart          dri    input         lp3                 null    raw   shm       tap21605  tty14  tty21  tty29  tty36  tty43  tty50  tty58  tty8    urandom  vcs6   vfio
autofs           fb0    kmsg          mapper              nvram   rtc0  snapshot  tty       tty15  tty22  tty3   tty37  tty44  tty51  tty59  tty9    usbmon0  vcsa   vga_arbiter
bsg              fd     kvm           mcelog              oldmem  sda   snd       tty0      tty16  tty23  tty30  tty38  tty45  tty52  tty6   ttyS0   vcs      vcsa1  vhci
btrfs-control    fd0    loop-control  mem                 port    sda1  sr0       tty1      tty17  tty24  tty31  tty39  tty46  tty53  tty60  ttyS1   vcs1     vcsa2  vhost-net
console          full   loop0         mqueue              ppp     sda2  stderr    tty10     tty18  tty25  tty32  tty4   tty47  tty54  tty61  ttyS2   vcs2     vcsa3  vmci
cpu              fuse   lp0           net                 ptmx    sda3  stdin     tty11     tty19  tty26  tty33  tty40  tty48  tty55  tty62  ttyS3   vcs3     vcsa4  vsock
cpu_dma_latency  hpet   lp1           network_latency     pts     sg0   stdout    tty12     tty2   tty27  tty34  tty41  tty49  tty56  tty63  uhid    vcs4     vcsa5  zero
crash            hwrng  lp2           network_throughput  random  sg1   tap20463  tty13     tty20  tty28  tty35  tty42  tty5   tty57  tty7   uinput  vcs5     vcsa6
root@5ff9d5d81457:/dev#
```

Here is the host view of docker ps:

```
[root@centos-k8s014 home]# docker ps
CONTAINER ID        IMAGE                                COMMAND             CREATED             STATUS              PORTS               NAMES
5ff9d5d81457        sgjesse/legacy-android-base:latest   "/bin/bash"         About an hour ago   Up About an hour                        gracious_wright
619d72f2ad28        android:NAe000047                    "/android-entry"    About an hour ago   Up About an hour                        friendly_ritchie
822fa322b90c        android:NAe000047                    "/android-entry"    About an hour ago   Up About an hour                        dazzling_fermi
```
@zhiminghufighting
Author

Please see the related issue and root cause here:
#1551

@zhiminghufighting
Author

Any ideas about this issue?

@jodh-intel jodh-intel changed the title Adding all host dev under /dev into kata container when luanching with '--privileged' Do not pass all host devices in /dev when launching with '--privileged' Apr 29, 2019
@jodh-intel
Contributor

Hi @zhiminghufighting - I've updated the description and reformatted the details to make this easier to understand. Please can you provide details of which devices you would not expect to see?

@jodh-intel
Contributor

Related: #1558.

/cc @amshinde.

@amshinde
Member

@zhiminghufighting I am not sure why you are seeing this behaviour. Before I take a look, can you tell me which specific Android image you are using?

@leoluk

leoluk commented May 2, 2019

This is highly unexpected and undocumented behavior that just fell on my feet during a CTF competition that uses Kata Containers for isolation :)

I passed --privileged=true such that I could run Docker containers inside the VM, but this also resulted in full host access that some players used to break out of the VMs.

I applied a local hotfix (a no-op filterDevices), but I suggest the default behavior be changed.
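To illustrate the kind of hotfix described above, here is a minimal sketch of a device filter. The type and function names (DeviceInfo, filterDevices) are hypothetical, not Kata's actual API; the allow-list (block devices plus VFIO) follows the behaviour described later in this thread, and a "no-op" variant would simply return nil so no host devices are forwarded at all.

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceInfo is a simplified, hypothetical stand-in for the runtime's
// device descriptor; the real Kata type differs.
type DeviceInfo struct {
	HostPath string
	DevType  string // "b" for block, "c" for char
}

// filterDevices keeps only an allow-listed subset of the host devices
// listed in the OCI spec: block devices and VFIO groups. A no-op variant
// of this filter would return nil, passing nothing through.
func filterDevices(devices []DeviceInfo) []DeviceInfo {
	var kept []DeviceInfo
	for _, d := range devices {
		if d.DevType == "b" || strings.HasPrefix(d.HostPath, "/dev/vfio/") {
			kept = append(kept, d)
		}
	}
	return kept
}

func main() {
	devs := []DeviceInfo{
		{HostPath: "/dev/sda", DevType: "b"},
		{HostPath: "/dev/kvm", DevType: "c"},
		{HostPath: "/dev/vfio/16", DevType: "c"},
	}
	for _, d := range filterDevices(devs) {
		fmt.Println(d.HostPath) // prints /dev/sda and /dev/vfio/16
	}
}
```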

@grahamwhaley
Contributor

@leoluk - saw your PR over at kata-containers/documentation#452, thanks.
Heh, I guess it is a question of 'when is --privileged not really --privileged?'... I have a feeling we may never please all the users all the time on this one ;-)
/cc @amshinde for thoughts and input.

leoluk pushed a commit to leoluk/documentation that referenced this issue May 2, 2019
leoluk added a commit to leoluk/documentation that referenced this issue May 2, 2019
leoluk added a commit to leoluk/documentation that referenced this issue May 2, 2019
@zhiminghufighting
Author

@amshinde I use an Android-based container which runs in Tencent cloud gaming. (This Android-based image was also made by the Intel AIS team for Tencent cloud gaming.)
I tried many combinations and found the problem only occurs in this one: Ubuntu host + Kata + Android-based container.
If the host OS is changed to CentOS (7.3), it never happens. I am not sure if it is related to an OS policy?
By the way, the Ubuntu OS is 18.04 and Kata is 1.6; I haven't tried other combinations.

@jodh-intel Thanks for your comment and corrections.

@amshinde
Member

@zhiminghufighting Sorry for the delay in getting back.
Can you run ls -l /dev on the host and inside the container so that we can compare the devices that are passed to the container? We should be passing only block and VFIO devices from the host to the guest, not all devices. It is not possible to pass all host devices to a virtual machine.

Finally, are you running this under docker or k8s? Could you provide the config.json file that gets passed to the runtime? For docker this should be under /var/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby{containerid}/config.json

If you could point me to the android image that I should use, I can try and reproduce this as well.

@iuridiniz

The number of devices differs between the host, Kata, and runc:

```
iuri@evolux-clx:~$ docker run --runtime=runc --privileged --rm alpine ls /dev -1 | wc -l
229
iuri@evolux-clx:~$ docker run --runtime=kata --privileged --rm alpine ls /dev -1 | wc -l
228
iuri@evolux-clx:~$ sudo ls /dev -1 | wc -l
[sudo] password for iuri: 
242
iuri@evolux-clx:~$ 
```

@iuridiniz

iuridiniz commented Jun 2, 2019

I don't get it. What makes you think that devices are shared in --privileged mode? What is the evidence?

@leoluk

leoluk commented Jun 3, 2019

Can we at least make this configurable (and default it to off)? There are plenty of use cases that need privileged access inside the container only.

@minzcmu

minzcmu commented Jun 27, 2019

Hi @leoluk, can you share a bit more about how you do filterDevices as the hotfix?

@awprice
Contributor

awprice commented Jul 30, 2019

+1 for this issue. Using Kata to isolate and secure Docker in docker is pointless when all of the host devices are simply hot plugged into the VM, including the host's / device. You may as well just do away with the VM and run it as a regular runc container.

I would like to see a configurable option in the Kata config to control filtering of the /dev devices. Or maybe a configurable regex and pass the device name through?

Another possible option that @egernst suggested is to add support for an annotation that, when used, would enable a more Kata-specific privileged mode: support for all capabilities and a writable sysfs, but no host devices. A cut-down version of --privileged.

Thoughts?

cc @egernst @amshinde @ganeshmaharaj @mcastelino

@bergwolf
Member

-1.

--privileged in docker translates into two things:

  1. pass all devices under /dev to the container
  2. give the container all sys capabilities

If you just need the second one, you can list the capabilities you require with docker run --cap-add. --privileged is just a lazy and dumb shortcut; docker does provide precise options. --cap-add and --device are your friends if you want precise control.

OTOH, if users require a simple way to specify that a container needs all sys capabilities but no extra implicit devices, please file issues in the moby project to request such a flag. It is not kata's role to work around the issue.

@awprice
Contributor

awprice commented Jul 30, 2019

@bergwolf

> --privileged in docker translates into two things:
>
>   1. pass all devices under /dev to the container
>   2. give the container all sys capabilities

No, it translates into a bit more than that; see what is provided in containerd: https://github.com/containerd/containerd/blob/172fe90e55c3c3a452f9bec926d87ffda5ed01bf/oci/spec_opts.go#L1090-L1101

@bergwolf
Member

@awprice Still, please file issues in moby/containerd to request such a flag that does not send all /dev devices to the OCI runtime. It is not kata's role to ignore a random part of the container spec. E.g., if we implement what's requested here, there is no way to pass a device to a container when --privileged is specified. What's the point of supporting one specific use case while breaking others?

Generally --privileged is broken in many ways for kata and I would strongly recommend against using it. E.g., what's the point of passing the host's /dev/fuse to the guest? OTOH, it is improper to ignore a container's device list altogether. The problem is that we (the Kata runtime) don't actually know which devices are implicitly added and which are specified by users. IOW, you are fixing the problem in the wrong layer.

At the very least, I'm fine with using an annotation, as @egernst suggested, to tell us that users actually want to ignore all devices. Then at least we know that users do want it and it is not a wild guess about the real use case. But I still think moby/containerd is the right place for the fix.

@leoluk

leoluk commented Jul 30, 2019

Kata is already interpreting the container spec quite liberally due to the extra abstraction layer involved. For instance, capabilities will not allow you to access the host system, but only the guest kernel in the VM. Why make an exception for devices?

@AkihiroSuda

> E.g., if we implement what's requested here, there is no way to pass a device to a container when --privileged is specified. What's the point of supporting one specific use case while breaking others?

This feature can be implemented as a new flag for kata-runtime without breaking compatibility.

e.g.

```
$ cat /etc/docker/daemon.json
{
        "runtimes": {
                "kata": {
                        "path": "/opt/kata/bin/kata-runtime"
                },
                "kata-isolate-dev": {
                        "path": "/opt/kata/bin/kata-runtime",
                        "runtimeArgs": [
                                "--kata-device-mode=isolate"
                        ]
                }
        }
}
$ docker run --privileged --runtime=kata-isolate-dev docker:dind ...
```

@bergwolf
Member

> Kata is already interpreting the container spec quite liberally due to the extra abstraction layer involved. For instance, capabilities will not allow you to access the host system, but only the guest kernel in the VM. Why make an exception for devices?

@leoluk Kata does translate the container spec to apply a different security context. We handle the container spec as much as we can, and there are pieces we have to ignore because there is no way to handle them in a virtual machine context. Here you want devices that are translated today to be ignored. That is a huge difference.

```
docker run --privileged --runtime=kata-isolate-dev --device /dev/sda docker:dind
```

@AkihiroSuda The above is broken, but I guess you just don't care?

All you guys want is to run docker in docker with kata, even in a half-broken way. But I think it is better to provide a proper --privileged translation. How about we add kata's own privileged flag instead? When it is specified:

  1. guest /dev, /proc and /sys are exposed to container in rw mode
  2. container process has full linux capabilities
  3. no selinux, no apparmor, no seccomp

This seems to match containerd's translation in a virtual machine context: https://github.com/containerd/containerd/blob/172fe90e55c3c3a452f9bec926d87ffda5ed01bf/oci/spec_opts.go#L1090-L1101

Then you can do:

```
$ cat /etc/docker/daemon.json
{
        "runtimes": {
                "kata": {
                        "path": "/opt/kata/bin/kata-runtime"
                },
                "kata-privileged": {
                        "path": "/opt/kata/bin/kata-runtime",
                        "runtimeArgs": [
                                "--privileged"
                        ]
                }
        }
}
$ docker run --runtime=kata-privileged docker:dind ...
```

And docker run --runtime=kata-privileged --device /dev/sda busybox still works as expected.

Then we can officially recommend kata-runtime --privileged over docker's --privileged flag. Users who want to use docker run --privileged should use kata-runtime --privileged instead.

@dadux

dadux commented Jul 30, 2019

I'm not against an annotation; we already abuse them for other use cases.
However, we would have to modify the Kubernetes pod spec, as annotations are at the pod level, not the container level (as opposed to privileged).

@bergwolf
Member

Sigh, that is too bad, not having container-level annotations. Then the containerd/cri change (containerd/cri#1213) seems more appealing than before. What do you think about it?

@awprice
Contributor

awprice commented Jul 30, 2019

@bergwolf Sounds like a good approach to make the change in containerd/cri, as that's where the host devices are being listed and appended to the spec.

Only other concern: what about CRI-O?

@bergwolf
Member

I'm not familiar with cri-o, but I think it can make the same change. Right now the translation for privileged containers is runc-specific. It makes sense to have a different translation, as runc is not the only runtime that containerd/cri and cri-o support.

@AkihiroSuda

@dadux Does your use case really need the dind container in the same pod?
I'm wondering if you might be able to connect to dind in another pod (in its own VM) via a headless service.

@awprice
Contributor

awprice commented Jul 30, 2019

> @dadux Does your use case really need the dind container in the same pod?
> I'm wondering if you might be able to connect to dind in another pod (in its own VM) via a headless service.

No, because dind in its own VM still requires privileged, and in its current state it will have the host devices mounted in. If there were a vulnerability in dind or the kernel in the dind pod, access to the host devices could occur.

It's also unnecessarily complicated.

@AkihiroSuda

@awprice My question was about the future status with some alternative to the current --privileged, not about the current status 😅

In other words: does the alternative solution really need to be container-level rather than pod-level?

@dadux

dadux commented Jul 30, 2019

We have hard multi-tenancy requirements. Pods (CI/CD builds) are isolated by namespace, have strict NetworkPolicies enforced, and, when running in Kata, are isolated by separate kernels.
Having the dind container in the same pod (and VM) makes sense; we only expose localhost.

@awprice
Contributor

awprice commented Jul 30, 2019

@AkihiroSuda No, the solution would be pod-level, as that's the lowest level of object that you can place an annotation on.

The logic in containerd/cri, I imagine, would work like this: if the pod is privileged AND there is a special annotation present on the pod to disable host devices, then don't append the host devices to the container spec.
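As a sketch of that decision logic (the annotation key below is hypothetical; containerd/cri ultimately went with a per-runtime config option, privileged_without_host_devices, rather than an annotation):

```go
package main

import "fmt"

// annotationKey is a hypothetical pod annotation for opting out of
// host-device injection; it is not a real containerd/cri key.
const annotationKey = "io.kata-containers.privileged-without-host-devices"

// shouldAddHostDevices mirrors the logic described above: host devices
// are only injected for privileged containers, and even then the pod
// can opt out via the annotation.
func shouldAddHostDevices(privileged bool, podAnnotations map[string]string) bool {
	if !privileged {
		return false
	}
	return podAnnotations[annotationKey] != "true"
}

func main() {
	fmt.Println(shouldAddHostDevices(true, map[string]string{annotationKey: "true"})) // false
	fmt.Println(shouldAddHostDevices(true, nil))                                      // true
}
```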

@dadux

dadux commented Jul 30, 2019

Can we extend the CRI spec to add an option to mount the system filesystems /dev, /proc, ... as rw?
AFAIK that's the only thing missing to emulate privileged without mounting all devices? There are already options for adding capabilities, and for SELinux. The SecurityContext is per container too...

@dadux

dadux commented Jul 30, 2019

It looks like redefining privileged in the CRI has already been discussed:

kubernetes/kubernetes#44503

But nothing has happened. 😢

@smarton6626

> But yeah, the use case is for CI/CD builds, with multiple containers per pod.
> Usually we have a build container, specified by the customer running the job, and we inject a dind container if requested.

Hi @dadux, is it truly unacceptable to run all containers in a pod as privileged?

@dadux

dadux commented Jul 31, 2019

@smarton6626 - yes. In our case we're talking about untrusted containers running arbitrary code.

If the untrusted container is privileged, it has the capabilities to mount/attach devices and, for instance, to access the secrets of other containers in the pod.

@smarton6626

@dadux - Is it avoidable? I mean, how about treating the whole pod as untrusted and not using secrets in it?

@dadux

dadux commented Jul 31, 2019

The secrets were just an illustration; we treat the whole pod as insecure. But we also don't grant extra permissions to containers that don't need them. We drop all caps for those containers.

"privileged" mode just makes it a ton easier to break out of the container. Why would you want to allow that when you don't have to?

There's good progress on the containerd/cri issue, with proposed solutions.

@smarton6626

@dadux - Because I won't care if they break out of the container if I treat the whole pod as insecure. They are all in a VM and can do nothing to my host. Think about the VM instances in a public cloud (e.g. EC2).

@dadux

dadux commented Jul 31, 2019

@smarton6626 - But we do care; the extra layer of kernel isolation is awesome, but it is not bulletproof.
https://www.cvedetails.com/vulnerability-list/vendor_id-7506/Qemu.html

Anyway, I don't think we're adding much value to the initial problem here.

@smarton6626

@dadux - Well, thanks for your input, which gives me a different perspective. I'll think it over, although for now I still believe a VM is secure enough for isolating untrusted workloads.

@AkihiroSuda

AkihiroSuda commented Aug 8, 2019

The privileged_without_host_devices PR for containerd/CRI (containerd/cri#1225) recently got merged.
This will be available in containerd 1.3.

Proposal for porting over this to Moby/Docker: moby/moby#39697

PR: moby/moby#39702
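For reference, the merged option is set per runtime handler in containerd's CRI plugin config. A sketch of what that looks like (the option name comes from the merged PR; the exact TOML section path and runtime_type value vary across containerd versions and setups, so treat this as an assumption-laden example, not authoritative configuration):

```toml
# /etc/containerd/config.toml (containerd >= 1.3; section path may differ by version)
[plugins.cri.containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
  # Do not pass host devices into privileged containers using this handler:
  privileged_without_host_devices = true
```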

@haslersn

haslersn commented Mar 6, 2021

Is this solved?

@leoluk

leoluk commented Apr 7, 2021

> Is this solved?

It's implemented in Kata and containerd, but has not yet propagated to Docker.

@fidencio
Member

fidencio commented Apr 7, 2021

@haslersn, please, see http://lists.katacontainers.io/pipermail/kata-dev/2021-April/001819.html, there you'll find the explanation why the issue was closed.

Hi All,

# TL;DR

We're closing all open GitHub issues for Kata 1.x and 2.x that were created before 1st June 2020.


# Background 

We have too many old open issues that are making it hard to identify the really important newer ones.

# Plan

Since a lot of those issues are now probably "stale", we intend to close all open GitHub issues opened before 1 June 2020.

# When?

This week some time.

# Why?

Closing these old issues will:

- Allow us to better manage the more recent issues

  (We're in the process of categorising current issues with labels to allow us to manage the backlog more efficiently.)

- Act as a forcing function for those who raised the issue

  The originator will get notified when the issue is closed, which will remind them of the issue and the problem. And if the issue is still important, they can re-open it with a single click.

# What is the impact of closing an issue?

Whether a GitHub issue is open or closed is simply a state field - you can open and close an issue as many times as you like: no information is going to be lost by changing the issue state to closed. By closing the old issues, we'll hopefully get the list of relevant open issues down to a more manageable number. And once labelled, will allow us to concentrate on fixing the issues rather than spending lots of time managing the issues themselves.

Of course, if an issue is closed, it's highly unlikely anyone will be working on resolving it, so...

# If one of your issues is closed

As mentioned above, if one of your issues is closed, please consider whether you are happy with this. If you think the issue should remain open, just click the "reopen" button and add a comment for us. It would be helpful if you could also mention whether the problem still exists with the latest 2.x version of Kata.

# Get involved

If you're interested in helping us manage the issue backlog, please get in contact [1], and maybe consider volunteering to help on the Kata rota [2].

Cheers,

James

[1] - https://github.com/kata-containers/community/blob/master/README.md#join-us
[2] - https://github.com/kata-containers/community/wiki/Review-Team-Rota

@leoluk

leoluk commented Apr 7, 2021

This one was rightfully closed, the feature in Kata is implemented and works! :-)
