This repository has been archived by the owner on May 12, 2021. It is now read-only.

Do not pass all host devices in /dev when launching with '--privileged' #1568

Closed
zhiminghufighting opened this issue Apr 19, 2019 · 49 comments

@zhiminghufighting

zhiminghufighting commented Apr 19, 2019

Description of problem

When I launch an Android image with the --privileged parameter, all devices under the host /dev directory are passed into the container's /dev directory. This behavior doesn't make sense.

Expected result

Do not pass all host devices into the Kata container, even with the --privileged parameter.

Actual result

Here is the ls output of the Kata container's /dev directory:

```
root@5ff9d5d81457:/dev# ls
agpgart          dri    input         lp3                 null    raw   shm       tap21605  tty14  tty21  tty29  tty36  tty43  tty50  tty58  tty8    urandom  vcs6   vfio
autofs           fb0    kmsg          mapper              nvram   rtc0  snapshot  tty       tty15  tty22  tty3   tty37  tty44  tty51  tty59  tty9    usbmon0  vcsa   vga_arbiter
bsg              fd     kvm           mcelog              oldmem  sda   snd       tty0      tty16  tty23  tty30  tty38  tty45  tty52  tty6   ttyS0   vcs      vcsa1  vhci
btrfs-control    fd0    loop-control  mem                 port    sda1  sr0       tty1      tty17  tty24  tty31  tty39  tty46  tty53  tty60  ttyS1   vcs1     vcsa2  vhost-net
console          full   loop0         mqueue              ppp     sda2  stderr    tty10     tty18  tty25  tty32  tty4   tty47  tty54  tty61  ttyS2   vcs2     vcsa3  vmci
cpu              fuse   lp0           net                 ptmx    sda3  stdin     tty11     tty19  tty26  tty33  tty40  tty48  tty55  tty62  ttyS3   vcs3     vcsa4  vsock
cpu_dma_latency  hpet   lp1           network_latency     pts     sg0   stdout    tty12     tty2   tty27  tty34  tty41  tty49  tty56  tty63  uhid    vcs4     vcsa5  zero
crash            hwrng  lp2           network_throughput  random  sg1   tap20463  tty13     tty20  tty28  tty35  tty42  tty5   tty57  tty7   uinput  vcs5     vcsa6
root@5ff9d5d81457:/dev#
```

Here is the host view of docker ps:

```
[root@centos-k8s014 home]# docker ps
CONTAINER ID        IMAGE                                COMMAND             CREATED             STATUS              PORTS               NAMES
5ff9d5d81457        sgjesse/legacy-android-base:latest   "/bin/bash"         About an hour ago   Up About an hour                        gracious_wright
619d72f2ad28        android:NAe000047                    "/android-entry"    About an hour ago   Up About an hour                        friendly_ritchie
822fa322b90c        android:NAe000047                    "/android-entry"    About an hour ago   Up About an hour                        dazzling_fermi
```
@zhiminghufighting
Author

Please see the related issue and root cause here:
#1551

@zhiminghufighting
Author

Any ideas about this issue?

@jodh-intel jodh-intel changed the title Adding all host dev under /dev into kata container when luanching with '--privileged' Do not pass all host devices in /dev when launching with '--privileged' Apr 29, 2019
@jodh-intel
Contributor

Hi @zhiminghufighting - I've updated the description and reformatted the details to make this easier to understand. Please can you provide details of which devices you would not expect to see?

@jodh-intel
Contributor

Related: #1558.

/cc @amshinde.

@amshinde
Member

@zhiminghufighting I am not sure why you are seeing this behaviour. Before I take a look, can you tell me which specific Android image you are using?

@leoluk

leoluk commented May 2, 2019

This is highly unexpected and undocumented behavior that just fell on my feet during a CTF competition that uses Kata Containers for isolation :)

I passed --privileged=true such that I could run Docker containers inside the VM, but this also resulted in full host access that some players used to break out of the VMs.

I applied a local hotfix (a no-op filterDevices), but I suggest the default behavior be changed.
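To illustrate the kind of hotfix described above, here is a minimal sketch of a device filter. The type and function names (DeviceInfo, filterDevices) are hypothetical, not Kata's actual API; the allow-list (block devices plus VFIO) follows the behaviour described later in this thread, and a "no-op" variant would simply return nil so no host devices are forwarded at all.

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceInfo is a simplified, hypothetical stand-in for the runtime's
// device descriptor; the real Kata type differs.
type DeviceInfo struct {
	HostPath string
	DevType  string // "b" for block, "c" for char
}

// filterDevices keeps only an allow-listed subset of the host devices
// listed in the OCI spec: block devices and VFIO groups. A no-op variant
// of this filter would return nil, passing nothing through.
func filterDevices(devices []DeviceInfo) []DeviceInfo {
	var kept []DeviceInfo
	for _, d := range devices {
		if d.DevType == "b" || strings.HasPrefix(d.HostPath, "/dev/vfio/") {
			kept = append(kept, d)
		}
	}
	return kept
}

func main() {
	devs := []DeviceInfo{
		{HostPath: "/dev/sda", DevType: "b"},
		{HostPath: "/dev/kvm", DevType: "c"},
		{HostPath: "/dev/vfio/16", DevType: "c"},
	}
	for _, d := range filterDevices(devs) {
		fmt.Println(d.HostPath) // prints /dev/sda and /dev/vfio/16
	}
}
```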

@grahamwhaley
Contributor

@leoluk - saw your PR over at kata-containers/documentation#452, thanks.
Heh, I guess it is a question of 'when is --privileged not really --privileged?'... I have a feeling we may never please all the users all the time on this one ;-)
/cc @amshinde for thoughts and input.

leoluk pushed a commit to leoluk/documentation that referenced this issue May 2, 2019
leoluk added a commit to leoluk/documentation that referenced this issue May 2, 2019
leoluk added a commit to leoluk/documentation that referenced this issue May 2, 2019
@zhiminghufighting
Author

@amshinde I use an Android-based container which runs in Tencent cloud gaming. (This Android-based image was also made by the Intel AIS team for Tencent cloud gaming.)
I tried many combinations and found the problem only occurs in this one: Ubuntu host + Kata + Android-based container.
If the host OS is changed to CentOS (7.3), it never happens. I am not sure if it is related to an OS policy?
By the way, the Ubuntu OS is 18.04 and Kata is 1.6; I haven't tried other combinations.

@jodh-intel Thanks for your comment and corrections.

@amshinde
Member

@zhiminghufighting Sorry for the delay in getting back.
Can you run ls -l /dev on the host and inside the container so that we can compare the devices that are passed to the container? We should be passing only block and VFIO devices from the host to the guest, not all devices. It is not possible to pass all host devices to a virtual machine.

Finally, are you running this under docker or k8s? Could you provide the config.json file that gets passed to the runtime? For docker this should be under /var/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby{containerid}/config.json

If you could point me to the android image that I should use, I can try and reproduce this as well.

@iuridiniz

The number of devices differs between the host, Kata, and runc:

```
iuri@evolux-clx:~$ docker run --runtime=runc --privileged --rm alpine ls /dev -1 | wc -l
229
iuri@evolux-clx:~$ docker run --runtime=kata --privileged --rm alpine ls /dev -1 | wc -l
228
iuri@evolux-clx:~$ sudo ls /dev -1 | wc -l
[sudo] password for iuri: 
242
iuri@evolux-clx:~$ 
```

@iuridiniz

iuridiniz commented Jun 2, 2019

I don't get it. What makes you think that devices are shared in --privileged mode? What is the evidence?

@leoluk

leoluk commented Jun 3, 2019

Can we at least make this configurable (and default it to off)? There are plenty of use cases that need privileged access inside the container only.

@minzcmu

minzcmu commented Jun 27, 2019

Hi @leoluk, can you share a bit more about how you do filterDevices as the hotfix?

@awprice
Contributor

awprice commented Jul 30, 2019

+1 for this issue. Using Kata to isolate and secure Docker in docker is pointless when all of the host devices are simply hot plugged into the VM, including the host's / device. You may as well just do away with the VM and run it as a regular runc container.

I would like to see a configurable option in the Kata config to control filtering of the /dev devices. Or maybe a configurable regex and pass the device name through?

Another possible option that @egernst suggested is to add support for an annotation that, when used, would enable a more Kata-specific privileged mode: support for all capabilities and a writable sysfs, but no host devices. A cut-down version of --privileged.

Thoughts?

cc @egernst @amshinde @ganeshmaharaj @mcastelino

@bergwolf
Member

-1.

--privileged in docker translates into two things:

  1. pass all devices under /dev to the container
  2. give the container all sys capabilities

If you just need the second one, you can list the capabilities you require with docker run --cap-add. --privileged is just a lazy and dumb shortcut; docker does provide precise options. --cap-add and --device are your friends if you want precise control.

OTOH, if users require a simple way to specify that a container needs all sys capabilities but no extra implicit devices, please file issues in the moby project to request such a flag. It is not kata's role to work around the issue.

@awprice
Contributor

awprice commented Jul 30, 2019

@bergwolf

> --privileged in docker translates into two things:
>
>   1. pass all devices under /dev to the container
>   2. give the container all sys capabilities

No, it translates into a bit more than that; see what is provided in containerd: https://github.com/containerd/containerd/blob/172fe90e55c3c3a452f9bec926d87ffda5ed01bf/oci/spec_opts.go#L1090-L1101

@bergwolf
Member

@awprice Still, please file issues in moby/containerd to request such a flag that does not send all /dev devices to the OCI runtime. It is not kata's role to ignore a random part of the container spec. E.g., if we implement what's requested here, there is no way to pass a device to a container when --privileged is specified. What's the point of supporting one specific use case while breaking others?

Generally --privileged is broken in many ways for kata and I would strongly recommend against using it. E.g., what's the point of passing the host's /dev/fuse to the guest? OTOH, it is improper to ignore a container's device list altogether. The problem is that we (the Kata runtime) don't actually know which devices are implicitly added and which are specified by users. IOW, you are fixing the problem in the wrong layer.

At the very least, I'm fine with using an annotation, as @egernst suggested, to tell us that users actually want to ignore all devices. Then at least we know that users do want it and it is not a wild guess about the real use case. But I still think moby/containerd is the right place for the fix.

@leoluk

leoluk commented Jul 30, 2019

Kata is already interpreting the container spec quite liberally due to the extra abstraction layer involved. For instance, capabilities will not allow you to access the host system, but only the guest kernel in the VM. Why make an exception for devices?

@AkihiroSuda

> E.g., if we implement what's requested here, there is no way to pass a device to a container when --privileged is specified. What's the point of supporting one specific use case while breaking others?

This feature can be implemented as a new flag for kata-runtime without breaking compatibility.

e.g.

```
$ cat /etc/docker/daemon.json
{
        "runtimes": {
                "kata": {
                        "path": "/opt/kata/bin/kata-runtime"
                },
                "kata-isolate-dev": {
                        "path": "/opt/kata/bin/kata-runtime",
                        "runtimeArgs": [
                                "--kata-device-mode=isolate"
                        ]
                }
        }
}
$ docker run --privileged --runtime=kata-isolate-dev docker:dind ...
```

@bergwolf
Member

> Kata is already interpreting the container spec quite liberally due to the extra abstraction layer involved. For instance, capabilities will not allow you to access the host system, but only the guest kernel in the VM. Why make an exception for devices?

@leoluk Kata does translate the container spec to apply a different security context. We handle the container spec as much as we can, and there are pieces we have to ignore because there is no way to handle them in a virtual machine context. Here you want devices that are translated today to be ignored. That is a huge difference.

```
docker run --privileged --runtime=kata-isolate-dev --device /dev/sda docker:dind
```

@AkihiroSuda The above is broken, but I guess you just don't care?

All you guys want is to run docker in docker with kata, even in a half-broken way. But I think it is better to provide a proper --privileged translation. How about we add kata's own privileged flag instead? When it is specified:

  1. guest /dev, /proc and /sys are exposed to container in rw mode
  2. container process has full linux capabilities
  3. no selinux, no apparmor, no seccomp

This seems to match containerd's translation in a virtual machine context: https://github.com/containerd/containerd/blob/172fe90e55c3c3a452f9bec926d87ffda5ed01bf/oci/spec_opts.go#L1090-L1101

Then you can do:

```
$ cat /etc/docker/daemon.json
{
        "runtimes": {
                "kata": {
                        "path": "/opt/kata/bin/kata-runtime"
                },
                "kata-privileged": {
                        "path": "/opt/kata/bin/kata-runtime",
                        "runtimeArgs": [
                                "--privileged"
                        ]
                }
        }
}
$ docker run --runtime=kata-privileged docker:dind ...
```

And docker run --runtime=kata-privileged --device /dev/sda busybox still works as expected.

Then we can officially recommend kata-runtime --privileged over docker's --privileged flag. Users who want to use docker run --privileged should use kata-runtime --privileged instead.

@dadux

dadux commented Jul 30, 2019

I'm not against an annotation; we already abuse them for other use cases.
However, we would have to modify the Kubernetes pod spec, as annotations are at the pod level, not the container level (as opposed to privileged).

@bergwolf
Member

Sigh, that is too bad, not having container-level annotations. Then the containerd/cri change (containerd/cri#1213) seems more appealing than before. What do you think about it?

@awprice
Contributor

awprice commented Jul 30, 2019

@bergwolf Sounds like a good approach to make the change in containerd/cri, as that's where the host devices are being listed and appended to the spec.

Only other concern: what about CRI-O?

@bergwolf
Member

I'm not familiar with cri-o, but I think it can make the same change. Right now the translation for privileged containers is runc-specific. It makes sense to have a different translation, as runc is not the only runtime that containerd/cri and cri-o support.

@AkihiroSuda

@dadux Does your use case really need the dind container in the same pod?
I'm wondering if you might be able to connect to dind in another pod (in its own VM) via a headless service.

@awprice
Contributor

awprice commented Jul 30, 2019

> @dadux Does your use case really need the dind container in the same pod?
> I'm wondering if you might be able to connect to dind in another pod (in its own VM) via a headless service.

No, because dind in its own VM still requires privileged, and in its current state it will have the host devices mounted in. If there were a vulnerability in dind or the kernel in the dind pod, access to the host devices could occur.

It's also unnecessarily complicated.

@AkihiroSuda

@awprice My question was about the future status with some alternative to the current --privileged, not about the current status 😅

In other words: does the alternative solution really need to be container-level rather than pod-level?

@dadux

dadux commented Jul 30, 2019

We have hard multi-tenancy requirements. Pods (CI/CD builds) are isolated by namespace, have strict NetworkPolicies enforced, and, when running in Kata, are isolated by separate kernels.
Having the dind container in the same pod (and VM) makes sense; we only expose localhost.

@awprice
Contributor

awprice commented Jul 30, 2019

@AkihiroSuda No, the solution would be pod-level, as that's the lowest level of object that you can place an annotation on.

The logic in containerd/cri, I imagine, would work like this: if the pod is privileged AND there is a special annotation present on the pod to disable host devices, then don't append the host devices to the container spec.
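As a sketch of that decision logic (the annotation key below is hypothetical; containerd/cri ultimately went with a per-runtime config option, privileged_without_host_devices, rather than an annotation):

```go
package main

import "fmt"

// annotationKey is a hypothetical pod annotation for opting out of
// host-device injection; it is not a real containerd/cri key.
const annotationKey = "io.kata-containers.privileged-without-host-devices"

// shouldAddHostDevices mirrors the logic described above: host devices
// are only injected for privileged containers, and even then the pod
// can opt out via the annotation.
func shouldAddHostDevices(privileged bool, podAnnotations map[string]string) bool {
	if !privileged {
		return false
	}
	return podAnnotations[annotationKey] != "true"
}

func main() {
	fmt.Println(shouldAddHostDevices(true, map[string]string{annotationKey: "true"})) // false
	fmt.Println(shouldAddHostDevices(true, nil))                                      // true
}
```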

@dadux

dadux commented Jul 30, 2019

Can we extend the CRI spec to add an option to mount the system filesystems /dev, /proc, ... as rw?
AFAIK that's the only thing missing to emulate privileged without mounting all devices? There are already options for adding capabilities, and for SELinux. The SecurityContext is per container too...

@dadux

dadux commented Jul 30, 2019

It looks like redefining privileged in the CRI has already been discussed:

kubernetes/kubernetes#44503

But nothing has happened. 😢

@smarton6626

> But yeah, the use case is for CI/CD builds, with multiple containers per pod.
> Usually we have a build container, specified by the customer running the job, and we inject a dind container if requested.

Hi @dadux, is it truly unacceptable to run all containers in a pod as privileged?

@dadux

dadux commented Jul 31, 2019

@smarton6626 - yes. In our case we're talking about untrusted containers running arbitrary code.

If the untrusted container is privileged, it has the capabilities to mount/attach devices and, for instance, to access the secrets of other containers in the pod.

@smarton6626

@dadux - Is it avoidable? I mean, how about treating the whole pod as untrusted and not using secrets in it?

@dadux

dadux commented Jul 31, 2019

The secrets were just an illustration; we treat the whole pod as insecure. But we also don't grant extra permissions to containers that don't need them. We drop all caps for those containers.

"privileged" mode just makes it a ton easier to break out of the container. Why would you want to allow that when you don't have to?

There's good progress on the containerd/cri issue, with proposed solutions.

@smarton6626

@dadux - Because I won't care if they break out of the container if I treat the whole pod as insecure. They are all in a VM and can do nothing to my host. Think about the VM instances in a public cloud (e.g. EC2).

@dadux

dadux commented Jul 31, 2019

@smarton6626 - But we do care; the extra layer of kernel isolation is awesome, but it is not bulletproof.
https://www.cvedetails.com/vulnerability-list/vendor_id-7506/Qemu.html

Anyway, I don't think we're adding much value to the initial problem here.

@smarton6626

@dadux - Well, thanks for your input, which gives me a different perspective. I'll think it over, although for now I still believe a VM is secure enough for isolating untrusted workloads.

@AkihiroSuda

AkihiroSuda commented Aug 8, 2019

The privileged_without_host_devices PR for containerd/CRI (containerd/cri#1225) recently got merged.
This will be available in containerd 1.3.

Proposal for porting over this to Moby/Docker: moby/moby#39697

PR: moby/moby#39702
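For reference, the merged option is set per runtime handler in containerd's CRI plugin config. A sketch of what that looks like (the option name comes from the merged PR; the exact TOML section path and runtime_type value vary across containerd versions and setups, so treat this as an assumption-laden example, not authoritative configuration):

```toml
# /etc/containerd/config.toml (containerd >= 1.3; section path may differ by version)
[plugins.cri.containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
  # Do not pass host devices into privileged containers using this handler:
  privileged_without_host_devices = true
```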

@haslersn

haslersn commented Mar 6, 2021

Is this solved?

@leoluk

leoluk commented Apr 7, 2021

> Is this solved?

It's implemented in Kata and containerd, but has not yet propagated to Docker.

@fidencio
Member

fidencio commented Apr 7, 2021

@haslersn, please, see http://lists.katacontainers.io/pipermail/kata-dev/2021-April/001819.html, there you'll find the explanation why the issue was closed.

Hi All,

# TL;DR

We're closing all open GitHub issues for Kata 1.x and 2.x that were created before 1st June 2020.


# Background 

We have too many old open issues that are making it hard to identify the really important newer ones.

# Plan

Since a lot of those issues are now probably "stale", we intend to close all open GitHub issues opened before 1 June 2020.

# When?

This week some time.

# Why?

Closing these old issues will:

- Allow us to better manage the more recent issues

  (We're in the process of categorising current issues with labels to allow us to manage the backlog more efficiently.)

- Act as a forcing function for those who raised the issue

  The originator will get notified when the issue is closed, which will remind them of the issue and the problem. And if the issue is still important, they can re-open it with a single click.

# What is the impact of closing an issue?

Whether a GitHub issue is open or closed is simply a state field - you can open and close an issue as many times as you like: no information is going to be lost by changing the issue state to closed. By closing the old issues, we'll hopefully get the list of relevant open issues down to a more manageable number. And once labelled, will allow us to concentrate on fixing the issues rather than spending lots of time managing the issues themselves.

Of course, if an issue is closed, it's highly unlikely anyone will be working on resolving it, so...

# If one of your issues is closed

As mentioned above, if one of your issues is closed, please consider whether you are happy with this. If you think the issue should remain open, just click the "reopen" button and add a comment for us. It would be helpful if you could also mention whether the problem still exists with the latest 2.x version of Kata.

# Get involved

If you're interested in helping us manage the issue backlog, please get in contact [1], and maybe consider volunteering to help on the Kata rota [2].

Cheers,

James

[1] - https://github.com/kata-containers/community/blob/master/README.md#join-us
[2] - https://github.com/kata-containers/community/wiki/Review-Team-Rota

@leoluk

leoluk commented Apr 7, 2021

This one was rightfully closed, the feature in Kata is implemented and works! :-)
