-
Notifications
You must be signed in to change notification settings - Fork 557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Seccomp Notify support using UNIX sockets and container metadata #1074
Add Seccomp Notify support using UNIX sockets and container metadata #1074
Conversation
config-linux.md
Outdated
"ociVersion": "0.2.0", | ||
"seccompFd": 0, | ||
"pid": 4422, | ||
"pidFd": 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about implementing pidFd
as part of seccomp notifications.
There are cases where pidfd is useful also without seccomp notifications, should we have a separate way for passing the pidfd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be fine if the spec define another separate way for passing pidfds.
But I feel it is useful here as well. Note that this pidFd
is not necessarily referring to the "pid 1" of the container, but in the case of "runc exec", it is the pid that entered in the container. I don't see how we can have a separate way for passing the pidfd in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think if we had something not tied to seccomp like?
{
"ociVersion": "0.2.0",
"fds" : {
"seccompFd" : 0,
"pidFd": 1,
"futureFeatureFd" : 2
},
"pid": 4422,
"state": {
"ociVersion": "0.2.0",
"id": "oci-container1",
"status": "creating",
"pid": 4422,
"bundle": "/containers/redis",
"annotations": {
"myKey": "myValue"
}
}
}
Would that work as well?
I am thinking we already have a way to pass the tty fd (with --console-socket
) and we could perhaps cover it as part of the OCI configuration instead of having it as a CLI option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giuseppe maybe this is a silly question, I think I'm a bit lost. How will the seccomp agent know about the new containers? Do you mean to keep the listenerPath and listenerMatadata in the seccomp section but send this more generic state instead of a seccomp state? In that case, how will be useful for other (non-seccomp) uses? Or do you mean to add fields to the state (not as your example json, as they are not part of the state) and use something like this?
I'm not sure I follow what you mean, or how we can use it from a seccomp agent that doesn't know when runc creates a container or "runc exec" is run in a container, so we have two (or more) pidfds to report: the fd for pid 1 of the container and the fd for the process that run "runc exec".
I'm fine with another mechanism, just not sure I understand it. I'll try to wake up and read your message again :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my suggestion was just about using a map for holding the file descriptors and pass the same metadata. In this way we could extend it in future, if we'll need any new kind of fd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds good to me to use a map[string]int
instead of individual fields for seccompFd and pidFd.
If you want to make it more generic, I guess you want to rename type SeccompState
in specs-go/state.go
... Do you have an idea of new name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giuseppe ohh, sorry, I see now! SGTM too, seems like a nice idea. I'll adapt to create a "fds" section in the seccomp state.
Let me know if you want to rename the seccomp state to some other (more generic) thing as alban suggested :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giuseppe I've moved to the map as you suggested. I couldn't find a good name for the struct.
What name do you suggest? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't really think of anything good to suggest. Maybe ContainerState
or ContainerProcessState
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giuseppe I just updated the PR to call it ContainerProcessState
. Please let us know what you think. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am adding some suggestions that I didn't notice before.
config-linux.md
Outdated
"ociVersion": "0.2.0", | ||
"seccompFd": 0, | ||
"pid": 4422, | ||
"pidFd": 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be fine if the spec define another separate way for passing pidfds.
But I feel it is useful here as well. Note that this pidFd
is not necessarily referring to the "pid 1" of the container, but in the case of "runc exec", it is the pid that entered in the container. I don't see how we can have a separate way for passing the pidfd in that case.
173346d
to
3848ed4
Compare
3848ed4
to
2755fc5
Compare
runc has a draft PR (opencontainers/runc#2682) implementing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question about the version negotiation / how many FDs to allocate on the receiving side prior to doing the receive, but otherwise lgtm.
377aeeb
to
c4cdab6
Compare
config-linux.md
Outdated
* **`fds`** (array, OPTIONAL) is a string array containing the names of the file descriptors passed. The index of the name in this array corresponds to index of the file descriptor the `SCM_RIGHTS` array. | ||
* **`pid`** (int, REQUIRED) is the container process ID, as seen by the runtime. | ||
* **`metadata`** (string, OPTIONAL) opaque metadata. | ||
* **`state`** (map, REQUIRED) is the [state](runtime.md#state) of the container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe where you have map
as the type, put the state
as a link to that struct definition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Done.
config-linux.md
Outdated
|
||
The container processs state includes the following properties: | ||
|
||
* **`ociVersion`** (string, REQUIRED) is version of the Open Container Initiative Runtime Specification with which the container processs state complies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: bullet list need not have periods at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it the other way around? I see end periods everywhere 🤔
On the whole, this seems fine and developed enough. (it helps to have worked through the implementation on runc, and related work already done in crun). This is not a breaking change, as the usage and fields are optional. |
@giuseppe @AkihiroSuda @KentaTada @ManaSugi I discussed this very briefly at the last OCI meeting. Since AFAIK you're the ones more interested on this, I'd like to get your feedback and if there are any concerns to move this forward. Thanks! |
config-linux.md
Outdated
|
||
The runtime sends the following file descriptors using `SCM_RIGHTS` and set their names in the `fds` array of the [container process state](#containerprocessstate): | ||
|
||
* **`seccompFd`** (int, REQUIRED) is the seccomp file descriptor returned by the seccomp syscall. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should seccompFd be optional as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind if seccompFd is optional in the spec, but how would it work if the runtime does not pass the seccompFd? Do you want to make it optional for runtimes that don't want to implement seccomp-notify, or are you thinking of an alternative way to obtain the seccompFd?
For runtimes that use seccomp-notify but implements the seccomp agent internally, they can just not set the listenerPath
field.
The "Container Process State" section is separate and does not mandate the seccompFd, but this section is in the description of "listenerPath".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could it be possible to use it for retrieving pidFd
(or anything more that is added in future) but not seccompFd
? e.g. what happens if the profile doesn't set any rule with SECCOMP_ACTION_NOTIFY
? Does the OCI runtime still need to create a seccompFd
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the profile doesn't set any notify rule, we propose this field should be ignored by the runtime. See here: https://github.com/opencontainers/runtime-spec/pull/1074/files/d87dd02ab1435993552f72fd1d98c7779f67228a#diff-048d23d864e15683f516d2c1768965d546e87f8a59b2606cf2f2d52500ba5a32R633.
Our idea is to use this UNIX socket to send to a seccomp agent. Do you want to make this field more "generic"? Be aware that this is in the seccomp section, we can reuse the container state process in other parts of the spec, but this should be about seccomp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so do you think we should move the listenerPath
outside the seccomp
section? I think it should be possible to use pidFd
without seccomp notify, or seccomp in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so do you think we should move the listenerPath outside the seccomp section?
I think we should not; it would make it more difficult to use in Kubernetes (see #1073 (comment)).
I think it should be possible to use pidFd without seccomp notify, or seccomp in general
If you want a mechanism to get the pidFd without seccomp, I'd rather doing that separately in a separate field than seccomp.listenerPath
, even if it reuses this new "Container Process State" type.
There can be only one seccomp agent receiving the seccompFd (by design from #1073 (comment)). But I see use cases where several processes could receive the pidFd, so maybe it could be an array of listeners. Could this be a follow-up PR for this? In your use cases, would you want to receive the pidFd at container-create time or at container-start time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giuseppe friendly ping? :)
0301404
to
984e162
Compare
@AkihiroSuda @KentaTada @ManaSugi Friendly ping? It would be super helpful for us if we can get a review for this :) |
@runtime-spec PTAL |
@mrunalp PTAL |
config-linux.md
Outdated
The runtime sends the following file descriptors using `SCM_RIGHTS` and set their names in the `fds` array of the [container process state](#containerprocessstate): | ||
|
||
* **`seccompFd`** (string, REQUIRED) is the seccomp file descriptor returned by the seccomp syscall. | ||
* **`pidFd`** (string, OPTIONAL) is the process file descriptor (e.g as returned by `pidfd_open(2)` or by `clone(2)` with the `CLONE_PID` flag). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, originally, you needed to support pidfd, because seccompfd left you with no way to determine if the container had been terminated or not. Now that there is tracking for that, it might be required. Do you think we should keep it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it from this PR as it's not strictly necessary anymore.
LGTM. Previously, there was the issue that you needed the PIDFD in order to track the container's lifetime. Now that we no longer need to do that because the epoll call will return EPOLLHUP (https://github.com/torvalds/linux/blob/master/kernel/seccomp.c#L1697-L1698) on the fd if there are no more containers, do we still need that? In addition to that, I fear that if we embed the pid into the metadata, it will be problematic if the notifier is running in its own pid ns (for example a daemonset?). Do we want to ensure that the process that is sending the is the one that matches the pid, therefore enabling the use of SCM_CREDENTIALS to find out the pid of the other side? Or at least being explicit and saying which process will send this? |
"listenerPath": { | ||
"type": "string" | ||
}, | ||
"listenerMetadata": { | ||
"type": "string" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, I'd move this outside the seccomp section so it can be used also without seccomp
Thanks for the review!
Good point. I think it can be useful anyways, though. The pidfd refers to the container 1 process if a container is being created, or to the process doing “kubectl exec”. I think in some cases (specially the pid of the process connected via an “exec” might be tricker to get) is useful, and if the agent will not use it can just close it.
Well, the same thing happens with the hooks. Just running the agent in the host pid will do the trick. In a daemonset, just using Also, I don’t think we want to ensure that the sender is the pid. Runc implementation is not doing that. And that's not doable today if sendmsg() is a syscall blocked by the seccomp policy, for example. In addition, it can also create issues for rootless in the future (in runc, it is sent by the runc parent and you need to have CAP_SYS_ADMIN to send a different pid in SCM_CREDENTIALS, which will complicate rootless, etc.). Furthermore, I don’t think the the spec needs to be explicit about which process is sending the message, it just needs to say that the pid is from the container runtime namespace. The PR is already explicit about that, so IMHO it is fine: https://github.com/opencontainers/runtime-spec/pull/1074/files#diff-048d23d864e15683f516d2c1768965d546e87f8a59b2606cf2f2d52500ba5a32R719 What do you think? |
yes I think this makes sense since we send also all the needed metadata and the same listenerPath can be used both for "seccomp" and future "linux" FDs. |
This adds the specification for Seccomp Userspace Notification and the Golang bindings. This contains: - New fields in the seccomp section to use with seccomp userspace notification. - Additional SeccompState struct containing the container state and file descriptors passed for seccomp. This was discussed in the OCI Weekly Discussion on September 16th, 2020. After review on github, this implementation was changed to the "Proposal with listenerPath and listenerExtraMetadata". For more information see: - opencontainers#1073 (comment) Docs presented on the community meeting (for the old implementation using hooks): - https://hackmd.io/El8Dd2xrTlCaCG59ns5cwg#September-16-2020 - https://docs.google.com/document/d/1xHw5GQjMj6ZKR-40aKmTWZRkvlPuzMGQRu-YpOFQc30/edit Documentation for this feature: - https://www.kernel.org/doc/html/v5.0/userspace-api/seccomp_filter.html#userspace-notification - man pages: seccomp_user_notif.2 at https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif - brauner's blog: https://brauner.github.io/2020/07/23/seccomp-notify.html This PR is an alternative proposal to PR 1038. While similar in nature, the main difference is that this PR adds optional metadata to be sent to the seccomp agent and specifies how the UNIX socket MUST be used. Signed-off-by: Rodrigo Campos <[email protected]> Signed-off-by: Alban Crequy <[email protected]> Signed-off-by: Mauricio Vásquez <[email protected]>
5dddf9c
to
58798e7
Compare
I squashed the commits and fixed the conflict! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This uses the runtime-spec from latest iteration from upstream PR: opencontainers/runtime-spec#1074 To reference the new commit I just run: go mod edit -replace=github.com/opencontainers/runtime-spec=github.com/kinvolk/runtime-spec@58798e75e9803d99bff5837ff39e9afe2e2efec8 go mod vendor And commited the changes. I also update the code to match the new spec. While doing that, I rewrote the code to close fds in case of failures (so we don't have open fds we are not using). Signed-off-by: Rodrigo Campos <[email protected]>
Glad to see this evolve and that there is a corresponding implementation for it (opencontainers/runc#2682). LGTM |
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The OCI runtime specs[1] recently gained the support for seccomp notifications. [1] opencontainers/runtime-spec#1074 Signed-off-by: Giuseppe Scrivano <[email protected]>
The runtime-spec changes were already merged in this PR: opencontainers/runtime-spec#1074 To reference the new commit merge commit with the latest fixes I just run: go get 'github.com/opencontainers/runtime-spec@9c848d91e8cf872e7453296832d66de6325e1e25 go mod vendor And commited the changes. I also update the code to match the new spec. While doing that, I rewrote the code to close fds in _most_ cases. It is tricky to close them before we have a reference to the `fds` slice, so that is left as a follow-up improvement. Signed-off-by: Rodrigo Campos <[email protected]>
The runtime-spec changes were already merged in this PR: opencontainers/runtime-spec#1074 To reference the new merge commit with the latest fixes I just run: go get 'github.com/opencontainers/runtime-spec@9c848d91e8cf872e7453296832d66de6325e1e25 go mod vendor And commited the changes. I also update the code to match the new spec. While doing that, I rewrote the code to close fds in _most_ cases. It is tricky to close them before we have a reference to the `fds` slice, so that is left as a follow-up improvement. Signed-off-by: Rodrigo Campos <[email protected]>
The runtime-spec changes were already merged in this PR: opencontainers/runtime-spec#1074 To reference the new merge commit with the latest fixes I just run: go get 'github.com/opencontainers/runtime-spec@9c848d91e8cf872e7453296832d66de6325e1e25 go mod vendor And commited the changes. I also update the code to match the new spec. While doing that, I rewrote the code to close fds in _most_ cases. It is tricky to close them before we have a reference to the `fds` slice, so that is left as a follow-up improvement. Signed-off-by: Rodrigo Campos <[email protected]>
The runtime-spec changes were already merged in this PR: opencontainers/runtime-spec#1074 To reference the new merge commit with the latest fixes I just run: go get 'github.com/opencontainers/runtime-spec@9c848d91e8cf872e7453296832d66de6325e1e25 go mod vendor And commited the changes. I also update the code to match the new spec. While doing that, I rewrote the code to close fds in _most_ cases. It is tricky to close them before we have a reference to the `fds` slice, so that is left as a follow-up improvement. Signed-off-by: Rodrigo Campos <[email protected]>
The runtime-spec changes were already merged in this PR: opencontainers/runtime-spec#1074 To reference the new merge commit with the latest fixes I just run: go get 'github.com/opencontainers/runtime-spec@9c848d91e8cf872e7453296832d66de6325e1e25 go mod vendor And commited the changes. I also update the code to match the new spec. While doing that, I rewrote the code to close fds in _most_ cases. It is tricky to close them before we have a reference to the `fds` slice, so that is left as a follow-up improvement. Signed-off-by: Rodrigo Campos <[email protected]>
Hi!
This is a follow-up of PR #1073 using the alternative discussed here. The reasoning to go this way over the others is here
I’m a co-worker of @alban and the commit has his sign-off too :)
This adds the specification for Seccomp Userspace Notification and the
Golang bindings. This contains:
notification.
descriptors passed for seccomp.
This was discussed in the OCI Weekly Discussion on September 16th,
2020. After review on github, this implementation was changed to the
"Proposal with listenerPath and listenerExtraMetadata". For more
information see:
Docs presented on the community meeting (for the old implementation
using hooks):
Documentation for this feature:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
https://brauner.github.io/2020/07/23/seccomp-notify.html
This PR is an alternative proposal to PR #1038 . While similar in nature,
the main difference is that this PR adds optional metadata to be sent to
the seccomp agent and specifies how the UNIX socket MUST be used.
Signed-off-by: Alban Crequy [email protected]
Signed-off-by: Rodrigo Campos [email protected]
We will make a PR for runc implementing this shortly.
Please also note that this change doesn’t imply a fork+exec, as @giuseppe wanted to avoid
Thanks again and don't hesitate to ask any questions! (I'll answer the ones in Alban's PR shortly :))
cc @KentaTada @ManaSugi @giuseppe @AkihiroSuda @brauner @alban @mauriciovasquezbernal