
docker rm sometimes hangs up #515

Closed
grahamwhaley opened this issue Jul 25, 2018 · 3 comments

Comments

@grahamwhaley
Contributor

Description of problem

Whilst running the density test under the metrics CI VMs, the docker stop in that test started to hang up regularly (maybe 30% of the time in the CI runs). Jenkins then times out and fails the job after 10 minutes of inactivity.

This feels like a new type of hang up, and not the yamux-related hangups, as the test has a workaround for those that has been 'holding' for a couple of weeks now. These hangups started in the last 2-3 days. Sorry if I've not spotted them before, but the similar yamux-type issues may have been masking them from my view.

Looking at the metrics CI logs (for those with access at present: https://clearlinux.org/cc-ci/computer/vm_ubuntu_16_04/builds), it feels like the problem started with the merge of the VM factory code, but as we have a number of kata components all changing in parallel, it is sometimes hard to be precise.

I've tried to bisect down to check that, and it is looking like the boundary may be between
failing - 8dda2dd: virtcontainers: add a vm abstraction layer
and
working - 28b6104: qemu: prepare for vm templating support

BUT - reading that last commit, it looks benign, so I am not yet convinced that is the precise problem.

The hang is reasonably easy to re-create, but does not happen every time (which is why I'm not 100% sure which commit is causing the problem). I am using this command to run the test:

$ bash ./density/docker_memory_usage.sh 30 5

I only set the number of containers to 30 as I am testing in a VM from the metrics VM scripts, which is limited in memory. If you use more containers you are likely to see the issue sooner.

I also have a tiny patch in place just so I can get some extra info. It is not necessary to add this; it just makes things quicker. (Today I also discovered that docker stop cannot kill the busybox sh, so it times out and does a kill anyway, and thus takes ~10.5s to do the stop/kill - ask me if you want more details on that; it is a docker signal/pid1/sh thing...):

diff --git a/metrics/density/docker_memory_usage.sh b/metrics/density/docker_memory_usage.sh
index 365fd7a..89ad466 100644
--- a/metrics/density/docker_memory_usage.sh
+++ b/metrics/density/docker_memory_usage.sh
@@ -282,7 +282,7 @@ EOF
        # Or, in fact we can leave the '--rm' on the docker run, and just change this
        # to a 'docker stop *'.
        for c in ${containers[@]}; do
-               docker stop $c
+               /usr/bin/time -f "%E" docker kill $c
                sleep 3
        done
 }

@bergwolf - copying you in here as the author of the VM template series of patches that I think may hold the clues. My gut is telling me there might be a lock/race, maybe on the new vm filesystem tree, as that is how the bugs felt when we have had similar issues in the past - but that is just my gut feeling.

I will continue to test this tomorrow. If I can't narrow down the exact patch then I'll move to diagnosing a hung up case to see what is hung where.

If you need any more input/info from me then just note here.
If you have any insights or thoughts, also drop them here please.

Expected result

I expect the docker stop or docker rm to never hang... and the metrics CI to get stable again :-)

@grahamwhaley
Contributor Author

After some more digging, it looks like we are stuck in a runtime kill 9 command, holding our lock while trying to talk gRPC to the agent - so, a very similar situation to #406. My current theory is that the recent PR merges may not be 'at fault' here, but might have shifted some timing that makes the issue more prominent.

I tried running a single-container cycle of create/kill - that ran for 1000 instances, no problem.
I then ran a single cycle of 32 creates followed by single kills - and that hung up on the first round. That is odd. It would seem that somehow the number of running containers affects the bug, even though we are only doing a single 'rm' at a time. This also smells to me like having more containers alive slows something down (iterating a list or filesystem, or just plain slowing down all the QEMUs?), which means we hit a window. A rough sketch of that second scenario is below.
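
For the record, the second scenario is roughly the following. This is an illustrative sketch only, not the exact commands I ran; the image, the count and the --runtime flag are assumptions, so adjust to taste:

#!/bin/bash
# Illustrative repro sketch (not the exact test script): start a batch of
# containers, then kill/remove them one at a time and watch for a hang.
num=32
image="busybox"    # assumed image; the density test uses a similar small image

ids=()
for i in $(seq 1 "$num"); do
        # --runtime kata-runtime is an assumption; drop it if kata is the default runtime
        ids+=("$(docker run -dt --runtime kata-runtime "$image" sh)")
done

for id in "${ids[@]}"; do
        # Time each kill so a hang stands out (Jenkins gives up after 10 minutes)
        /usr/bin/time -f "%E" docker kill "$id"
        docker rm -f "$id" > /dev/null
        sleep 3
done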

Just as a reminder, over at #406 (comment) I posted a patch snippet that adds a timeout to the gRPC calls - so at least we would not hang solid holding the lock, which basically kills the runtime. But, to land that I think we'd need to figure out how we clean up after such a timeout.

I'm starting to run out of ideas. I'm going to check whether the QEMU matching the container is still alive when I get this hangup (sketched below) - and if so, maybe I'll enable the console so I can try to peek into a dead one. I'll do some more comparing of a working kill against a stuck one.
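
For reference, the kind of liveness check I have in mind is roughly this - a sketch only, which assumes the container/sandbox ID appears on the QEMU command line (that may vary between releases):

# Sketch: check whether the QEMU backing a given container is still running.
# Assumes the container/sandbox ID shows up on the qemu command line.
cid="<container-or-sandbox-id>"    # placeholder - fill in the stuck container's ID
if pgrep -af qemu | grep -q "$cid"; then
        echo "QEMU for $cid is still alive"
else
        echo "QEMU for $cid has gone away"
fi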

@grahamwhaley
Contributor Author

Oh, and I found no yamux mention in the system journal ;-)

@grahamwhaley
Contributor Author

Whilst pursuing this (enabling the VM console so I could go and peek at whether the agent was still alive, and use the magic kill command to peek at its stacks if it was...), the metrics CI suddenly got stable again - about 2d19h ago.
@sboeuf suggested, and looking at the logs it aligns well, that kata-containers/agent#263 and related yamux merges are the likely fix.

With that, I will abandon this Issue, and try a run of the soak test over on #406 to check the yamux/timeout and docker rm status.
