docker rm sometimes hangs up #515
After some more digging: I tried running a single-container create/kill cycle - that ran for 1000 instances, no problem. Just as a reminder, over at #406 (comment) I posted a patch snippet that adds a timeout to the gRPC calls - so at least we would not hang solid holding the lock, and thus basically kill the runtime. But, to land that I think we'd need to figure out how we clean up after such a timeout. I'm starting to run out of ideas. I'm going to check if the QEMU matching the container is still alive when I get this hangup (and if so, maybe I'll enable the console so I can try to peek into a dead one). I'll also do some more comparing of a working case against a hung one.
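For reference, a minimal sketch of the kind of single-container create/kill loop described above (not the exact commands used; the busybox image, `docker run`/`docker rm -f` pairing and iteration count are assumptions):

```bash
#!/bin/bash
# Hypothetical sketch: one container at a time, created and force-removed
# repeatedly, to see whether a sequential cycle ever hangs.
set -e
for i in $(seq 1 1000); do
    id=$(docker run -dt busybox sh)
    docker rm -f "$id" > /dev/null
    echo "cycle $i ok"
done
```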
Oh, and I found no yamux mention in the system journal ;-)
Whilst pursuing this (enabling the VM console so I could go peek at whether the agent was still alive, and use the magic …) … With that, I will abandon this Issue, and try a run of the soak test over on #406 to check the yamux/timeout patch.
Description of problem
Whilst running the density test under the metrics CI VMs, the `docker stop` in that test started to hang up regularly (maybe 30% of the time in the CI runs). Jenkins then times out and fails the job after 10 minutes of inactivity.

This feels like a new type of hang up - and not the yamux related hangups, as the test has a workaround for those that has been 'holding' for a couple of weeks now. These hangups started in the last 2-3 days. Sorry if I've not spotted them before, but the similar yamux type issues may have been masking them from my view.
Looking at the metrics CI logs (for those with access at present: https://clearlinux.org/cc-ci/computer/vm_ubuntu_16_04/builds), it feels like the problem started with the merge of the VM factory code, but as we have a number of kata components all changing in parallel, sometimes it is hard to be accurate.
I've tried to bisect down to check that, and it feels like the boundary is between:

failing - 8dda2dd: virtcontainers: add a vm abstraction layer
working - 28b6104: qemu: prepare for vm templating support

BUT - reading that last commit, it looks benign, so I am not yet convinced that is the precise problem.
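For reference, a bisect over that range would look something like this (a sketch only; run it in whichever repo contains those commits, rebuilding the runtime and re-running the repro at each step):

```bash
# Hypothetical bisect between the two commits identified above:
# 8dda2dd is the suspected-failing commit, 28b6104 the known-working one.
git bisect start 8dda2dd 28b6104   # git bisect start <bad> <good>
# rebuild/install the runtime, re-run the repro loop, then mark the result:
git bisect good    # or: git bisect bad
# repeat until git reports the first bad commit, then:
git bisect reset
```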
The hang is reasonably easy to re-create, but does not happen every time (which is why I'm not 100% sure which commit is causing the problem). I am using this to run a test:
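A minimal sketch of that kind of density-test loop (not the exact script; the busybox image, container count and per-container timing are assumptions):

```bash
#!/bin/bash
# Hypothetical density-test sketch: launch a batch of idle containers,
# then stop and remove each one, timing the stop/rm to spot hangs.
NUM=${NUM:-30}     # kept small when testing inside a memory-limited VM
ids=()
for i in $(seq 1 "$NUM"); do
    ids+=("$(docker run -dt busybox sh)")
done
for id in "${ids[@]}"; do
    echo "stopping $id"
    time docker stop "$id"
    time docker rm "$id"
done
```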
I only set the number of containers to 30 as I am testing in a VM from the metrics VM scripts, and it is limited on memory. If you use more containers you are likely to see the issue quicker.
I also have a tiny patch in place just so I can get some extra info. It is not necessary to add this - it is just quicker, as today I also discovered that `docker stop` cannot kill the `busybox sh`, so it times out and does a kill anyway - and thus takes ~10.5s to do the stop/kill. Ask me if you want more details on that - it is a docker signal/pid1/sh thing (a short illustration is sketched below).

@bergwolf - copying you in here as the author of the VM template series of patches that I think may hold the clues. My gut is telling me there might be a lock/race, maybe on the new `vm` filesystem tree, as when we've had such issues in the past this is how the bugs felt - but that is just my gut feeling.

I will continue to test this tomorrow. If I can't narrow down the exact patch then I'll move to diagnosing a hung up case to see what is hung where.
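As an aside on the ~10.5s stop mentioned above, here is a generic way to see the docker signal/pid1/sh behaviour for yourself (an illustration only, not part of the test; the container name is made up):

```bash
# busybox sh installs no SIGTERM handler, and PID 1 in a PID namespace
# ignores signals it has no handler for, so `docker stop` waits out its
# default 10s grace period and then falls back to SIGKILL.
docker run -dt --name sig-demo busybox sh
time docker stop sig-demo        # ~10s before the SIGKILL lands
docker rm sig-demo

# Skipping the grace period avoids the wait:
docker run -dt --name sig-demo busybox sh
time docker stop -t 0 sig-demo   # SIGKILL sent almost immediately
docker rm sig-demo
```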
If you need any more input/info from me then just note here.
If you have any insights or thoughts, also drop them here please.
Expected result
I expect the `docker stop` or `docker rm` to never hang... and the metrics CI to get stable again :-)