Failures on libpod tests | podman stop - basic test #2503
The issue seems to be caused by https://github.com/kata-containers/agent/blob/master/grpc.go#L941, where we can see:
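For reference, a sketch of the snippet at that location, reconstructed from the agent code of that period; treat it as illustrative rather than the exact lines. `signal`, `pid`, and `isSignalHandled` are names from the surrounding agent function:

```go
// If the container's init process hasn't installed a handler for SIGTERM,
// the signal would simply be ignored, so replace it with SIGKILL to make
// sure the process actually terminates.
if signal == syscall.SIGTERM && !isSignalHandled(pid, syscall.SIGTERM) {
	signal = syscall.SIGKILL
}
```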
This code was introduced in order to fix kata-containers/agent#525, but it seems to be the reason the libpod test is not passing.
This one is quite interesting, I must admit. At first I was afraid, I was petrified, that the conmon process wouldn't have a SIGTERM handler registered and that this was the reason a SIGKILL would be sent to the process. Turns out I was totally wrong, as the code above happens inside the guest.

Now, focusing on what's going on inside the guest ... I don't see anywhere in the agent a place where we'd register a SIGTERM handler for any process. More than that, "grep'ing" for SIGTERM in the agent code shows that the only mention is in the piece of code shown above.

With this in mind, I start to think that, in the first place, we shouldn't have used the hammer of sending SIGKILL to the process, but instead actually ensured a SIGTERM handler is registered. Right now, no matter what we do, we'll end up in the SIGKILL case.

@lifupan, is my understanding of the problem correct? If not, would you mind shedding some light here? @devimc, @bergwolf, as you reviewed kata-containers/agent#526, your input is also appreciated!
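For context, "handling SIGTERM" here simply means the process installs its own handler instead of keeping the default disposition. A minimal sketch (not from the agent) of what a container init process would need to do in order to take the graceful path:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Register a handler so SIGTERM is caught rather than taking the default
	// disposition; a process doing this shows the signal in its caught mask
	// (the SigCgt line of /proc/<pid>/status).
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	fmt.Println("init process running, waiting for SIGTERM")
	<-sigs
	fmt.Println("got SIGTERM, shutting down gracefully")
}
```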
Hi @fidencio. It depends on the container's init process. Generally speaking, the container's init process should handle the SIGTERM signal, so that "container stop" can terminate the container's init process gracefully; otherwise the "stop" command would send another SIGKILL signal to kill the process after a timeout, such as 10s. If we do know that SIGTERM wouldn't terminate the container process, then why send SIGTERM first, wait 10s, and send SIGKILL later? Why can't we send SIGKILL first and terminate the process in the first step? That's why I introduced the code at https://github.com/kata-containers/agent/blob/master/grpc.go#L941 in kata.
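A minimal sketch of the stop flow being described here, assuming a 10s grace period; this is illustrative, not the actual runtime code, and `stopProcess` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// stopProcess mimics "container stop": send SIGTERM, give the process a
// grace period to exit, then fall back to SIGKILL.
func stopProcess(proc *os.Process, timeout time.Duration) error {
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() {
		_, err := proc.Wait()
		done <- err
	}()

	select {
	case err := <-done:
		// The process exited within the grace period.
		return err
	case <-time.After(timeout):
		// Grace period expired; use the hammer.
		return proc.Signal(syscall.SIGKILL)
	}
}

func main() {
	// "sleep" installs no SIGTERM handler, so the default disposition
	// (terminate) applies and plain SIGTERM stops it within the grace period.
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	fmt.Println(stopProcess(cmd.Process, 10*time.Second))
}
```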
Sorry, I have to ask you for more info here in order to properly understand the situation. I'd like to understand how this behaviour differs between, for instance, runc and kata.
Yes, that's the expected behaviour. But spawning exactly the same pod with kata and with runc, I see different behaviours: when using kata, SIGTERM is never handled by the container's init process, making us always take the SIGKILL path.
Your approach makes sense: if SIGTERM is not handled, just SIGKILL, that's okay. What I'm trying to understand is whether there's any difference between the init process when using kata and when using runc. I'm not even sure this question makes sense, so please bear with me here. All in all, I just want to understand why some tests should be skipped, and if there's something we do differently, I'd like to know why. It'll help the future us understand the different limitations / approaches taken.
The code starting the process is located in libcontainer, right? By "the process" here I mean the first process in the container, such as the container's "cmd" or "entrypoint", executed in the container's namespaces. When we stop a container, what does the runtime exactly do? It tries to send a SIGTERM signal to the container's init process, right?
The difference between runc and kata is that, if the container's init process doesn't handle the SIGTERM signal, then runc needs to send SIGTERM and later send another SIGKILL signal to stop the container successfully, while kata only needs to send a single SIGTERM to stop the container successfully, since kata transforms the SIGTERM signal into SIGKILL automatically.
Kata replaces SIGTERM with SIGKILL because it knows that SIGTERM alone cannot stop the container; if the init process does handle SIGTERM, then kata doesn't do the transformation.
There is no difference in the container's init process between runc and kata. The difference is, from the user's perspective, the way the container is stopped when its init process doesn't handle the SIGTERM signal.
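A sketch of how such a check can be implemented, along the lines of the agent's `isSignalHandled` helper: read the SigCgt (signals caught) bitmask from /proc/<pid>/status and test whether the signal's bit is set. The parsing details below are illustrative, not the agent's exact code:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// isSignalHandled reports whether the process has installed a handler for
// sig, by testing the corresponding bit of the SigCgt mask exposed in
// /proc/<pid>/status.
func isSignalHandled(pid int, sig syscall.Signal) (bool, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "SigCgt:") {
			continue
		}
		mask, err := strconv.ParseUint(strings.TrimSpace(strings.TrimPrefix(line, "SigCgt:")), 16, 64)
		if err != nil {
			return false, err
		}
		// Bit (signum - 1) is set when the signal is caught.
		return mask&(1<<(uint(sig)-1)) != 0, nil
	}
	return false, scanner.Err()
}

func main() {
	handled, err := isSignalHandled(os.Getpid(), syscall.SIGTERM)
	if err != nil {
		panic(err)
	}
	fmt.Println("SIGTERM handled:", handled)
}
```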
Okay, so we should also just skip this test on libpod, for the reasons mentioned / discussed above. @lifupan, thanks for the explanation!
This is one of the issues raised due to the failures running libpod "system tests" using kata as the runtime.
This specific test is part of https://github.com/containers/libpod/blob/master/test/system/050-stop.bats and fails due to:
I understand the test is not part of the list of tests run (as in https://github.com/kata-containers/tests/blob/master/.ci/podman/configuration_podman.yaml), but there's also no mention of the issue in the limitations document.
Note: Maybe this is not the best place for this issue, but I'd like to have it opened somewhere in order to either document the reason for the failure (if it's expected), or have it investigated (in case it's not expected).