This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Failures on libpod tests | podman stop - basic test #2503

Closed
fidencio opened this issue May 4, 2020 · 6 comments
Labels
enhancement Improvement to an existing feature needs-review Needs to be assessed by the team.

Comments

@fidencio
Member

fidencio commented May 4, 2020

This is one of the issues raised due to the failures running libpod "system tests" using kata as runtime.

This specific test is part of https://github.com/containers/libpod/blob/master/test/system/050-stop.bats and fails due to:

 ✗ podman stop - basic test
   (from function `die' in file /usr/share/podman/test/system/helpers.bash, line 274,
    in test file /usr/share/podman/test/system/050-stop.bats, line 23)
     `die "podman stop: ran too quickly! ($delta_t seconds; expected >= 10)"' failed
   $ /usr/bin/podman --runtime=/usr/bin/kata-runtime rm --all --force
   $ /usr/bin/podman --runtime=/usr/bin/kata-runtime images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/alpine_labels:latest   4fab981df737
   $ /usr/bin/podman --runtime=/usr/bin/kata-runtime run -d quay.io/libpod/alpine_labels:latest sleep 60
   261e842b4d64759ae915d5aeafd3413f442469ad7d2be9b62330ff822b2a34fb
   $ /usr/bin/podman --runtime=/usr/bin/kata-runtime stop 261e842b4d64759ae915d5aeafd3413f442469ad7d2be9b62330ff822b2a34fb
   261e842b4d64759ae915d5aeafd3413f442469ad7d2be9b62330ff822b2a34fb
   $ /usr/bin/podman --runtime=/usr/bin/kata-runtime inspect --format {{.State.Status}} {{.State.ExitCode}} 261e842b4d64759ae915d5aeafd3413f442469ad7d2be9b62330ff822b2a34fb
   exited   137
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: podman stop: ran too quickly! (0 seconds; expected >= 10)
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]
   $ /usr/bin/podman --runtime=/usr/bin/kata-runtime pod rm --all --force
   $ /usr/bin/podman --runtime=/usr/bin/kata-runtime rm --all --force
   261e842b4d64759ae915d5aeafd3413f442469ad7d2be9b62330ff822b2a34fb

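For context, the bats test fails because it times how long `podman stop` takes and expects it to block for the 10-second SIGTERM grace period. A scaled-down sketch of that timing check (variable names approximate, with a 2-second stand-in so it runs quickly):

```shell
#!/usr/bin/env bash
# Sketch of the timing check in 050-stop.bats: record wall-clock seconds
# around the stop command and fail if it returned too quickly.
t0=$SECONDS
sleep 2                       # stand-in for: podman stop $cid (10s grace)
delta_t=$((SECONDS - t0))
if [ "$delta_t" -lt 2 ]; then
    echo "FAIL: ran too quickly! ($delta_t seconds; expected >= 2)"
    exit 1
fi
echo "OK: took $delta_t seconds"
```

With kata, `podman stop` returns almost immediately (delta_t of 0), because the SIGTERM is escalated to SIGKILL inside the guest, as discussed below.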
I understand the test is not part of the list of tests run (as per https://github.com/kata-containers/tests/blob/master/.ci/podman/configuration_podman.yaml), but there's also no mention of the issue in the limitations document.

Note: Maybe this is not the best place for this issue, but I'd like to have it opened somewhere in order to either document the reason for the failure (if it's expected), or have it investigated (if it's not).

@fidencio fidencio added enhancement Improvement to an existing feature needs-review Needs to be assessed by the team. labels May 4, 2020
@fidencio
Member Author

fidencio commented May 4, 2020

The issue seems to be caused by https://github.com/kata-containers/agent/blob/master/grpc.go#L941, where we can see:

		// For container initProcess, if it hasn't installed handler for "SIGTERM" signal,
		// it will ignore the "SIGTERM" signal sent to it, thus send it "SIGKILL" signal
		// instead of "SIGTERM" to terminate it.
		if signal == syscall.SIGTERM && !isSignalHandled(pid, syscall.SIGTERM) {
			signal = syscall.SIGKILL
		}
		return emptyResp, ctr.container.Signal(signal, false)

This code was introduced to fix kata-containers/agent#525, but it seems to be the reason the libpod test does not pass.
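For illustration, the check `isSignalHandled()` performs can be approximated from the shell: on Linux, the `SigCgt` bitmask in `/proc/<pid>/status` has bit N-1 set when the process has installed a handler for signal N (SIGTERM is 15, so bit 14). This is a sketch of the idea, not the agent's actual implementation:

```shell
#!/usr/bin/env bash
# Check whether a process has installed a SIGTERM handler by inspecting the
# SigCgt (signals caught) bitmask in /proc/<pid>/status (Linux only).
sigterm_handled() {
    local mask
    mask=$(awk '/^SigCgt:/ {print $2}' "/proc/$1/status")
    # SIGTERM is signal 15, so test bit 14 of the hex mask.
    (( (16#$mask >> 14) & 1 ))
}

# A plain "sleep" installs no handlers, so SIGTERM is reported as unhandled,
# which is exactly the case where the agent escalates to SIGKILL.
sleep 30 & pid=$!
if sigterm_handled "$pid"; then
    result="handled"
else
    result="not handled"
fi
echo "SIGTERM $result for pid $pid"
kill "$pid" 2>/dev/null
```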

@fidencio
Member Author

fidencio commented May 5, 2020

This one is quite interesting, I must admit.

At first I was afraid, I was petrified, that the conmon process wouldn't have a SIGTERM handler registered and that was the reason a SIGKILL would be sent to the process.

Turns out I was totally wrong, as the code above happens inside the guest. Now, focusing on what's going on inside the guest... I don't see anywhere in the agent where we'd register a SIGTERM handler for any process. More than that, grep'ing for SIGTERM in the agent code shows that the only mention is in the piece of code shown above.

With this in mind, I'm starting to think that, in the first place, we shouldn't have used the hammer of sending a SIGKILL to the process, but instead actually ensured we register a SIGTERM handler. Right now, no matter what we do, we'll end up in the SIGKILL case.

@lifupan, is my understanding of the problem correct? If not, would you mind shedding some light here?

@devimc, @bergwolf, as you reviewed kata-containers/agent#526, your input is also appreciated!

@lifupan
Member

lifupan commented May 6, 2020

This one is quite interesting, I must admit.

At first I was afraid, I was petrified, that the conmon process wouldn't have a SIGTERM handler registered and that was the reason a SIGKILL would be sent to the process.

Turns out I was totally wrong, as the code above happens inside the guest. Now, focusing on what's going on inside the guest... I don't see anywhere in the agent where we'd register a SIGTERM handler for any process. More than that, grep'ing for SIGTERM in the agent code shows that the only mention is in the piece of code shown above.

With this in mind, I'm starting to think that, in the first place, we shouldn't have used the hammer of sending a SIGKILL to the process, but instead actually ensured we register a SIGTERM handler. Right now, no matter what we do, we'll end up in the SIGKILL case.

@lifupan, is my understanding of the problem correct? If not, would you mind shedding some light here?

Hi @fidencio

It depends on the container's init process. Generally speaking, the container's init process should handle the SIGTERM signal, so that "container stop" can terminate the container's init process gracefully; otherwise, the "stop" command would send another SIGKILL signal to kill the process after a while, such as 10s. If we know that SIGTERM wouldn't terminate the container process, then why send SIGTERM first, wait 10s, and send SIGKILL later? Why can't we send SIGKILL first to terminate the process in the first step? That's why I introduced the code at https://github.com/kata-containers/agent/blob/master/grpc.go#L941 in kata.
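The generic stop flow described here can be sketched as a small shell helper (the function name is made up for this sketch; the 10-second grace period matches podman's default stop timeout):

```shell
#!/usr/bin/env bash
# Generic "stop" flow: send SIGTERM, poll for up to a grace period,
# then escalate to SIGKILL if the process is still alive.
stop_with_grace() {
    local pid=$1 timeout=${2:-10} waited=0
    kill -TERM "$pid" 2>/dev/null
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
        outcome="escalated to SIGKILL after ${timeout}s"
    else
        outcome="terminated by SIGTERM"
    fi
    echo "$outcome"
}

# "sleep" doesn't catch SIGTERM, but its default disposition still
# terminates it, so the graceful path is taken and SIGKILL is never needed.
sleep 60 &
stop_with_grace $!
```

Kata's agent short-circuits this flow: by converting the first SIGTERM into SIGKILL when no handler is registered, the stop completes immediately instead of after the grace period, which is why the bats timing check fails.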

@fidencio
Member Author

fidencio commented May 6, 2020

@lifupan

It depends on the container's init process.

Sorry, I have to ask you for more info here in order to properly understand the situation.
What's the piece of code starting the process? What's the process we're talking about, specifically?

I'd like to understand how this differs between, for instance, runc and kata.

Generally speaking, the container's init process should handle the SIGTERM signal, so that "container stop" can terminate the container's init process gracefully; otherwise, the "stop" command would send another SIGKILL signal to kill the process after a while, such as 10s.

Yes, that's the expected behaviour. But spawning exactly the same pod using kata and runc, I see different behaviours: when using kata, SIGTERM is never handled by the container's init process, so we always take the SIGKILL path.

If we know that SIGTERM wouldn't terminate the container process, then why send SIGTERM first, wait 10s, and send SIGKILL later? Why can't we send SIGKILL first to terminate the process in the first step? That's why I introduced the code at https://github.com/kata-containers/agent/blob/master/grpc.go#L941 in kata.

Your approach makes sense. If SIGTERM is not handled, just SIGKILL, that's okay.

What I'm trying to understand is if there's any difference between the init process when using kata and runc. I'm not even sure this question makes sense so, please, bear with me here.

All in all, I just want to understand why some tests should be skipped, and if there's something we do differently, I'd like to know why. It'll help our future selves understand the different limitations / approaches taken.

@lifupan
Member

lifupan commented May 7, 2020

@lifupan

It depends on the container's init process.

Sorry, I have to ask you for more info here in order to properly understand the situation.
What's the piece of code starting the process? What's the process we're talking about, specifically?

The code starting the process is located in libcontainer, right? By "the process" here I mean the first process in the container, such as the container's "cmd" or "entrypoint" executed in the container namespace.

When we stop a container, what does the runtime actually do? It tries to send a SIGTERM signal to the container's init process, right?

I'd like to understand how it differs from, for instance, runc and kata.

The difference between runc and kata is that, if the container's init process doesn't handle the SIGTERM signal, then runc needs to send SIGTERM and later resend another SIGKILL signal to stop the container successfully; but for kata, sending a single SIGTERM is enough to stop the container, since kata transforms the SIGTERM signal into SIGKILL automatically.

Generally speaking, the container's init process should handle the SIGTERM signal, so that "container stop" can terminate the container's init process gracefully; otherwise, the "stop" command would send another SIGKILL signal to kill the process after a while, such as 10s.

Yes, that's the expected behaviour. But spawning exactly the same pod using kata and runc, I see different behaviours: when using kata, SIGTERM is never handled by the container's init process, so we always take the SIGKILL path.

Here kata replaces the SIGTERM signal with SIGKILL because it knows that SIGTERM cannot stop the container successfully; if the init process handled SIGTERM, kata wouldn't do the transformation.

If we know that SIGTERM wouldn't terminate the container process, then why send SIGTERM first, wait 10s, and send SIGKILL later? Why can't we send SIGKILL first to terminate the process in the first step? That's why I introduced the code at https://github.com/kata-containers/agent/blob/master/grpc.go#L941 in kata.

Your approach makes sense. If SIGTERM is not handled, just SIGKILL, that's okay.

What I'm trying to understand is if there's any difference between the init process when using kata and runc. I'm not even sure this question makes sense so, please, bear with me here.

There is no difference in the container's init process between runc and kata. The difference, from the user's perspective, is the way the container is stopped when its init process doesn't handle the SIGTERM signal.

All in all, I just want to understand why some tests should be skipped, and if there's something we do differently, I'd like to know why. It'll help our future selves understand the different limitations / approaches taken.
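The behaviour the libpod test exercises can be reproduced without podman or kata: a process whose SIGTERM disposition is "ignore" survives SIGTERM and only dies on SIGKILL, which is reported as exit status 128 + 9 = 137, matching the `exited 137` in the log above. A sketch using plain shell tools:

```shell
#!/usr/bin/env bash
# A process that ignores SIGTERM survives it; only SIGKILL takes it down,
# yielding exit status 137 (128 + 9), as seen in the podman inspect output.
bash -c 'trap "" TERM; exec sleep 60' &   # ignored dispositions survive exec
pid=$!
sleep 1                                   # let the trap + exec settle
kill -TERM "$pid"
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    echo "still alive after SIGTERM"
fi
kill -KILL "$pid"
wait "$pid"
status=$?
echo "exit status: $status"
```

This is the case where runc would wait out the grace period before escalating, while kata's agent sends SIGKILL right away because it can see no handler is registered.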

@fidencio
Member Author

fidencio commented May 7, 2020

Okay, so we should also just skip this test on libpod, for the reasons mentioned / discussed above.

@lifupan, thanks for the explanation!

@fidencio fidencio closed this as completed May 7, 2020
2 participants