
Define Linux Network Devices #1271

Open · wants to merge 1 commit into main from network-devices
Conversation


@aojea commented Nov 7, 2024

The proposed "netdevices" field provides a declarative way to specify which host network devices should be moved into a container's network namespace.

This approach is similar to the existing "devices" field used for block devices, but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of a network device by `struct net_device`: https://docs.kernel.org/networking/netdevices.html.

This proposal focuses solely on moving existing network devices into the container namespace. It does not cover the complexities of network configuration or network interface creation, emphasizing the separation of device management and network configuration.

Fixes: #1239
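
For illustration, a minimal sketch of the proposed shape (the interface names here are placeholders; the optional "name" field is the name the device gets inside the container, per the discussion below):

    "linux": {
        "netDevices": {
            "ens4": {},
            "ens5": {
                "name": "eth1"
            }
        }
    }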


aojea commented Nov 7, 2024

/assign @samuelkarp


@aojea force-pushed the network-devices branch 2 times, most recently from 51e5104 to 3a666eb on November 12, 2024

aojea commented Nov 12, 2024

https://github.com/opencontainers/runtime-spec/blob/main/features.md should be updated too

updated and addressed the comments


aojea commented Nov 12, 2024

Action item for @aojea: document the cleanup and destruction of the network interfaces.

@samuelkarp
Member

From the in-person discussion today:

  • Net device lifecycle should follow the network namespace lifecycle
  • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
  • @kad was concerned about restarts and error handling
  • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

config-linux.md Outdated

This schema focuses solely on moving existing network devices identified by name into the container namespace. It does not cover the complexities of network device creation or network configuration, such as IP address assignment, routing, and DNS setup.

**`netDevices`** (object, OPTIONAL) set of network devices that MUST be available in the container. The runtime is responsible for providing these devices; the underlying mechanism is implementation-defined.
Member

The spec says "MUST", but I think this can't be done in a rootless container because a rootless container doesn't have CAP_NET_ADMIN, right?

Member

I'm not sure we should take care of the rootless container.

Member

Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

Member

Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

+1, but it'd be better to clarify that in the spec.

Author

added more explanations about the runtime and network device lifecycle and runtime checks, PTAL


aojea commented Nov 19, 2024

From the in-person discussion today:

  • Net device lifecycle should follow the network namespace lifecycle
  • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
  • @kad was concerned about restarts and error handling
  • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

Pushed a new commit addressing those comments; the changelog is:

  • the network namespace lifecycle will move migratable network devices and destroy virtual devices; the runtime MAY decide to do cleanup actions
  • the runtime MUST check that the container has enough privileges and an associated network namespace, and fail if the check fails
  • removed the Mask field and used the Address field with CIDR notation (IP/Prefix) to deal with IPv4 and IPv6 addresses. Only one IP is allowed to be specified, on purpose, to simplify the operations and reduce risks
  • added a HardwareAddress field for use cases that require setting a specific MAC or InfiniBand address (sketched below)
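
As a rough sketch of what an entry might have looked like in that intermediate revision (the field names and values here are assumptions based on the bullets above; the address and hardware-address fields were removed again in a later revision of the proposal):

    "linux": {
        "netDevices": {
            "ens5": {
                "name": "eth1",
                "address": "192.0.2.10/24",
                "hardwareAddress": "0a:58:c0:00:02:0a"
            }
        }
    }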

@kolyshkin
Contributor

@aojea can you squash the commits? Otherwise LGTM, and this is definitely something we can build upon.


aojea commented Feb 4, 2025

@aojea can you squash the commits? Otherwise LGTM, and this is definitely something we can build upon.

@kolyshkin squashed, thanks

@kolyshkin left a comment

LGTM except for one nit.


tianon commented Feb 5, 2025

for things that are easily available via a few netlink API calls, I don't see any big reason why it would be too complex to implement in the runtime

The problem in my mind is where do you draw the line? At what point does "a few netlink api calls" become "a few too many" and now it's no longer acceptable?

I agree with Kir that we should keep the scope really limited here so we don't accidentally stray too far into network management in the low level runtime, which is something we've traditionally left as a higher level concern for a lot of good reasons.

@kolyshkin
Contributor

One more comment -- why is the data structure a map? A map makes sense when we want to quickly find an element based on a key. Is it needed here?

Perhaps a simple array is easier.


aojea commented Feb 8, 2025

One more comment -- why is the data structure a map? A map makes sense when we want to quickly find an element based on a key. Is it needed here?

Interface names are unique per namespace; a map guarantees this, avoiding user fat-finger problems or misconfiguration. Otherwise we would need to validate that the names in the array are unique.
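
For contrast with the proposed map keyed by host interface name, here is a hedged sketch of the hypothetical array alternative being discussed; the "hostName" field is invented purely for illustration, and such entries would need explicit duplicate-name validation:

    "netDevices": [
        { "hostName": "ens4" },
        { "hostName": "ens5", "name": "eth1" }
    ]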

@kolyshkin
Contributor

Interface names are unique per namespace; a map guarantees this, avoiding user fat-finger problems or misconfiguration. Otherwise we would need to validate that the names in the array are unique.

I don't think the uniqueness requirement justifies using a map, and I don't think we should be much concerned about uniqueness. If a network device is listed twice, the second entry will fail with "no such device" because the first one will already have been moved.

Also, we have other similar fields (Devices, Namespaces, Mounts, etc.) as arrays. I see that time namespace offsets are implemented via a map (and I think it was a mistake, for similar reasons -- an array would work just fine).

Having said that, this is not a deal breaker. Also, I'm sure other maintainers may have different opinions. WDYT @opencontainers/runtime-spec-maintainers ?


pfl commented Feb 11, 2025

I agree with Kir that we should keep the scope really limited here so we don't accidentally stray too far into network management in the low level runtime, which is something we've traditionally left as a higher level concern for a lot of good reasons.

With the IP configuration part now removed, am I correct that the "easy" solution for configuring IP networking on additional interfaces is still via a custom runtime hook? In our case the containers are static and cannot be amended with IP configuration tools or the necessary networking capabilities, and I doubt we are able to inject another container into the pod without causing a stir. IP (and yes, route) configuration would have been a good and standard solution for simple secondary network needs.


kad commented Feb 11, 2025

for things that are easily available via a few netlink API calls, I don't see any big reason why it would be too complex to implement in the runtime

The problem in my mind is where do you draw the line? At what point does "a few netlink api calls" become "a few too many" and now it's no longer acceptable?

My "border line" is about features and fields provided by kernel netlink api. If something is done with it, I don't see big difference between call "move interface xyz to namespace abc" vs. "set property foo to value bar on interface xyz in namespace abc". Or to what it worth, "put value 123 into cgroup foobar".
Things like DHCP that require more than plain kernel api call are clearly outside of the boundaries that can be implemented in runtime.

"linux": {
"netDevices": {
"eth0": {
"name": 23
Contributor

what is 23 here?

Author

an integer to cause a bad config since it is expecting a string

@kolyshkin
Contributor

for things that are easily available via a few netlink API calls, I don't see any big reason why it would be too complex to implement in the runtime

The problem in my mind is where do you draw the line? At what point does "a few netlink api calls" become "a few too many" and now it's no longer acceptable?

My "border line" is about features and fields provided by kernel netlink api. If something is done with it, I don't see big difference between call "move interface xyz to namespace abc" vs. "set property foo to value bar on interface xyz in namespace abc". Or to what it worth, "put value 123 into cgroup foobar". Things like DHCP that require more than plain kernel api call are clearly outside of the boundaries that can be implemented in runtime.

To me, this sounds like a runtime should implement most of what ip(8) is capable of. Which is a lot.

If that's the case, here's a crazy idea.

Considering that ip(8) is a de facto standard on Linux, perhaps such configuration can be achieved by supplying a sequence of ip command arguments in runtime-spec. A runtime can then use the ip binary on the host to get to the needed configuration.

Something like

"linux": {
        "netDevices": [
            "name": "enp34s0u2u1u2",
            "ct_name": "eth0",
            "config": [
                "ip addr add 10.2.3.4/25 dev $IF",
                "ip link set $IF mtu 1350",
                "ip route add 10.10.0.0/16 via 10.2.3.99 dev $IF proto static metric 100"
                "ip route add default dev $IF"
            ]
        ]
    }

This would result in the runtime running the host's ip binary in the container's network namespace (after moving the device into the container's namespace and giving it a new name). It could even be a single ip invocation if -b (batch mode) is supported.

The upsides are:

  • this uses a well known, de facto standard "domain language" for network-related configuration;
  • this allows for a quite complex configuration to be performed (limited to whatever ip understands);
  • this seems trivial to implement for any runtime;
  • this doesn't impose any specific container image requirements (no need for any extra binaries etc.);
  • this should be secure (as ip is running inside a netns);
  • some advanced runtimes can implement a subset of this internally (i.e. without the call to ip);

The biggest downside, I guess, is that the scope is not well defined, as ip is evolving.


aojea commented Feb 12, 2025

I don't like the idea of exec'ing commands, especially in Go implementations, which are problematic with goroutines and namespaces.
I think that once we have just the network interfaces in the runtimes we'll get much better feedback and signal from the implementations to evaluate the next steps and to know if something else is needed and what exactly.

I still think that preventing failures, rather than having to detect them at runtime, is desirable. You can argue that you should not add duplicates, but bugs exist, and if somebody accidentally adds a duplicate interface name it will cause a problem in production that we could have avoided simply by not allowing it to be defined.


kad commented Feb 14, 2025

Considering that ip(8) is a de facto standard on Linux, perhaps such configuration can be achieved by supplying a sequence of ip command arguments in runtime-spec. A runtime can then use the ip binary on the host to get to the needed configuration.

I believe this is a really bad idea compared to having a limited set of config options implemented. exec() is always slower and less reliable than calling native kernel APIs. It opens the door to all sorts of injection vulnerabilities and other potential misuse and scope creep compared to strict API subset usage, as well as the proper error handling concern Antonio mentioned.

To provide more insight into our use case, here is a snippet of the code that we want to get rid of, currently implemented as an OCI runtime hook (btw, already including an unreliable call to ip): https://github.com/HabanaAI/habana-container-runtime/blob/main/cmd/habana-container-cli/netdevices.go#L131-L182

@AkihiroSuda
Member

Let me move the milestone from vNext (v1.2.1?) to vNextNext (v1.3.0?)


aojea commented Feb 20, 2025

Do you have an approximate estimate of how long 1.3.0 might take? I have some dependencies on this feature and it would be nice to be able to account for that time.

@samuelkarp
Member

@AkihiroSuda It looks like 1.2.1 got tagged yesterday: https://github.com/opencontainers/runtime-spec/releases/tag/v1.2.1. Is there anything blocking the merge of this PR into main and setting it up for the 1.3.0 release?


tianon commented Mar 1, 2025

I still don't think it's appropriate to expect the runtime to set up the network interfaces and believe we're opening a can of worms here, but not strongly enough to block it outright (especially with Kir approving it, and thus the implied maintenance of it in runc). ❤️


aojea commented Mar 1, 2025

I still don't think it's appropriate to expect the runtime to set up the network interfaces and believe we're opening a can of worms here,

I don't see a risk as long as we stick to network interfaces; the moment we let networks leak in, we start to create runtime dependencies, and then I fully agree with you. That is also the reason why the last proposal removed the IP initialization bits entirely (#1271 (comment)). I have some ideas on how to solve this problem without modifying the spec, but my priority now is to solve the existing problem in the ecosystem.

The problem this solves is that today there is hardware that needs to use network interfaces, mainly GPUs (but there are other cases). Since there is no way to declaratively move these interfaces to the container, everyone solves the problem from a different angle, by hooking into the Pod/Container network namespace creation using:

  • CNI plugins
  • OCI hooks
  • out of band channels using the Kubernetes API directly
  • ...

All these solutions are brittle, require high privileges that expose security problems, and are hard to troubleshoot since they run in different places during container creation, causing fragmentation in the ecosystem and a bad user experience.

The goal here is that developers can just patch the OCI spec to say "move eth2 to this container along with /dev/gpu0", which is simple to do with CDI and NRI and needs no extra privileges... I personally don't feel that, as long as we stick to this definition, we will open that Pandora's box that is the network (which none of us want).

@rata left a comment

I'm just taking a look; I tried to go through all 137 comments, but sorry if I'm missing something.

I guess most concerns are solved now that there is no reference to the IP address and friends? Is there any unaddressed concern still?

The idea LGTM. I've also quickly checked the runc implementation; it seemed clean and nothing caught my attention (like no weird magic with IPs or anything, that is just not touched at all).

@aojea I don't know this area, but how does this work in practice? Is it expected that the host network interface will be configured by the host (i.e. IP address, MTU, etc.) and then moved into the container? Does all of the configuration "just work" when moved into the container, or will CNI or some other entity need to do something?


This schema focuses solely on moving existing network devices identified by name from the host network namespace into the container network namespace. It does not cover the complexities of network device creation or network configuration, such as IP address assignment, routing, and DNS setup.

**`netDevices`** (object, OPTIONAL) set of network devices that MUST be made available in the container. The runtime is responsible for providing these devices; the underlying mechanism is implementation-defined.
Member

I think what you want here is an array of objects, and each object has a string field named name.

You want an array at some point for the net devices, but here you say object and later you say string. I think here you want to say array of objects, like here: https://github.com/opencontainers/runtime-spec/pull/1271/files#diff-048d23d864e15683f516d2c1768965d546e87f8a59b2606cf2f2d52500ba5a32R127

OHH, you want two names: the host network interface and the name to assign in the container, right? Maybe you want an array of objects where each object has two fields, the name of the interface on the host and the name to assign to the container interface?

Member

I mean, you want an array, and none of the types you list (not the free text, but the name of the field with the type in parentheses) is an array. I think some array is missing :)

Author

If you look at the previous proposal, there were multiple attributes before; leaving this as an object allows the API to be extended in a backwards-compatible way:

    "linux": {
        "netDevices": {
            "eth0": {
                "name": "container_eth0",
                 "new_field_here": "asdasd",
            },
            "ens4": {},
            "ens5": {}
        }
    }

I still think the object representing a dictionary is better than an array because it avoids duplicate names in the configuration that can cause runtime errors: #1271 (comment)

The name of the network device is the entry key.
Entry values are objects with the following properties:

* **`name`** *(string, OPTIONAL)* - the name of the network device inside the container namespace. If not specified, the host name is used. The network device name is unique per network namespace; if a network device with the same name already exists, the rename operation will fail. The runtime MAY check that the name is unique before the rename operation.
Member

Just curious, as I'm not very familiar with NRI and I don't know if this concern makes sense, so please let me know. How can NRI plugins using this decide which container interface name to use? I mean, choose one that won't clash with the names set by potentially other plugins? Can they see what has been done so far by previous plugins? Or is this not an issue at all (in that case, can you explain briefly why? I'm curious :))

Author

In Kubernetes, with both main runtimes (containerd and CRI-O), the name of the interface inside the container is always eth0, so for 95% of the cases in Kubernetes the problem is easy to solve.
There are cases where people add additional interfaces with out-of-band mechanisms as in #1271 (comment); in that case, there are several options:

  • add a random generated name with enough entropy
  • inspect the network namespace and check for duplicates
  • fail with a collision name error

Member

Exactly, but you can't inspect the netns because it hasn't been created yet. So, how can those tools, before choosing a name for the interface inside the container, check which names were used by others? E.g. if NRI has several plugins and more than one adds an interface, how can the second plugin know eth1 is already added and avoid using that name?

A randomly generated name would be an option, but it would be nice to understand whether that is needed or whether people can just choose names that avoid collisions.

Author

In Kubernetes the network namespace is created by the runtime and there will only be an eth0 interface.
If there are more interfaces, it is because some component is adding them via an out-of-band process, which will have exactly the same problem. This works today because cluster administrators only set up one component to add additional interfaces.

This reinforces my point in #1271 (comment): using a well-defined specification will help multiple implementations synchronize, and we need this primitive to standardize these behaviors and to build higher-level APIs... we are already doing it for Network Status (kubernetes/enhancements#4817), and we need to do it for configuration based on this.


aojea commented Mar 3, 2025

@aojea I don't know this area, but how does this work in practice? Is it expected that the host network interface will be configured by the host (i.e. IP address, MTU, etc.) and then moved into the container? Does all of the configuration "just work" when moved into the container, or will CNI or some other entity need to do something?

The goal is to decouple the interface lifecycle and configuration from the OCI runtime; that is the part that SHOULD NOT be handled by the OCI runtimes or the actors of the pod/container creation. I think there are two scenarios:

  • The interface admin preconfigures the interface before moving it, and runc moves it as is. This is already the case most of the time, since interfaces tend to be part of the host. For cases where interfaces are hot-pluggable or dynamically created, the existing preprovisioning hooks in DRA NodePrepareResources can do that; then it is just a matter of telling the OCI runtime to move it inside the container.
  • The container application configures the interface. We are already doing this with VMs (see cloud-init and network configuration), and there are also use cases where the interface is going to be consumed directly by the application and exposed as configurable to the users; see telco use cases that run router- or firewall-like apps in the container.


rata commented Mar 3, 2025

Okay, we were talking over slack and there are two things that we think we still need to answer:

  1. Is the configuration (address, mtu, etc.) lost when moved to the container netns?
  2. For the cases that we need to configure the interface, how can we expect the configuration to happen?

I ask 2 for two reasons: a) to be clear that it is the concern of another part of the stack; b) to understand how this will be used and whether anything else is missing here (i.e. if CNI is expected to handle it, make sure it is not running too late and that it has all the info to realize there is an extra interface to configure; if another component is expected to handle it, see that it can; etc.)


aojea commented Mar 3, 2025

Thanks @rata for your help today, I updated the PR to implement it in runc with integration tests that show the behavior

  1. Is the configuration (address, mtu, etc.) lost when moved to the container netns?
  2. For the cases that we need to configure the interface, how can we expect the configuration to happen?

The interface configuration is preserved, so users can set the interface down in the host namespace, configure it (IP address, MTU, hardware address), and the runtime will move it to the network namespace maintaining that configuration. This removes the need to include network configuration in the runtime and allows implementations to use the preparation of the device to configure it without risks (the Kubernetes use case).

Users can still decide to use a process inside the container to configure the network, use DHCP, or some sort of bootstrap à la cloud-init.

The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.

This approach is similar to the existing "devices" field used for block
devices, but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of a network
device by the `struct net_device`:
https://docs.kernel.org/networking/netdevices.html.

This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.

Signed-off-by: Antonio Ojea <[email protected]>
@aojea force-pushed the network-devices branch from 0ce07ae to af8cc6f on March 3, 2025 23:21

aojea commented Mar 3, 2025

rebased and added this last requirement to preserve the network config

diff --git a/config-linux.md b/config-linux.md
index 6682e16..1f0e808 100644
--- a/config-linux.md
+++ b/config-linux.md
@@ -201,6 +201,8 @@ This schema focuses solely on moving existing network devices identified by name
 
 The runtime MUST check that is possible to move the network interface to the container namespace and MUST [generate an error](runtime.md#errors) if the check fails.
 
+The runtime MUST preserve the existing network interface attributes, like MTU, MAC and IP addresses, enabling users to preconfigure the interfaces.
+
 The runtime MUST set the network device state to "up" after moving it to the network namespace to allow the container to send and receive network traffic through that device.
 
 


Successfully merging this pull request may close these issues.

Proposal: Network Devices