virtcontainers: Properly remove the container when shim gets killed
Here is an interesting case I have been debugging. I was trying to understand why "kubeadm reset" was not working with kata-runtime, compared to runc. In this scenario, the only pod started with Kata is the kube-dns pod. For some reason, when this pod is stopped and removed, its containers receive signals: two of them get SIGTERM, which seems to be the proper way to stop them, but the third container receives a SIGCONT. Obviously nothing happens in that case, but CRI-O apparently considers this should be the end of the container, and after a few seconds it kills the container process (the shim, in the Kata case). Because it uses SIGKILL, the signal does not get forwarded to the agent, since the shim itself is killed right away.

After this happens, CRI-O calls into "kata-runtime state", we detect the shim is not running anymore, and we try to stop the container. The code eventually calls into agent.RemoveContainer(), but this fails and returns an error because, inside the agent, the container is still running.

The approach taken here to solve this issue is to send a SIGKILL signal to the container after the shim has been waited for. This call does not check the returned error because in most cases, the regular use cases, it will return an error: the shim not being there usually means the container inside the VM has already terminated. But in the case where the shim was killed without being able to forward the signal (as described in the first paragraph), the SIGKILL will succeed and allow the following call to agent.stopContainer() to proceed with the removal of the container inside the agent.

Fixes #274

Signed-off-by: Sebastien Boeuf <[email protected]>
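A minimal sketch of the stop path described above, in Go. The type and helper names (waitForShim, signalProcess, removeContainer) are illustrative assumptions, not the exact virtcontainers API; they only show where the best-effort SIGKILL fits relative to the shim wait and the agent-side removal.

```go
package sketch

import "syscall"

// Minimal stand-ins for the virtcontainers types involved; the real
// structures are richer, these exist only to make the sketch compile.
type agent interface {
	signalProcess(c *Container, signal syscall.Signal, all bool) error
	removeContainer(c *Container) error
}

type Container struct {
	id      string
	shimPid int
	agent   agent
}

// waitForShim is a placeholder for waiting on the shim process to exit.
func waitForShim(pid int) error {
	return nil
}

// stopContainer illustrates the fix: once the shim has been waited
// for, send a best-effort SIGKILL to the container inside the VM.
func stopContainer(c *Container) error {
	if err := waitForShim(c.shimPid); err != nil {
		return err
	}

	// Deliberately ignore the returned error: in the common case the
	// container has already terminated (which is why the shim is gone)
	// and the kill fails harmlessly. If the shim was SIGKILLed before
	// it could forward any signal, this is what actually stops the
	// container inside the VM.
	c.agent.signalProcess(c, syscall.SIGKILL, true)

	// With the container guaranteed to be stopped, removal inside the
	// agent can now succeed.
	return c.agent.removeContainer(c)
}
```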