Kubernetes Pods stuck in ContainerCreating when using CRI-O #388
Comments
More journal logs:
Hi @jamiehannaford - thanks for reporting. The gist shows that your system is misconfigured:
This implies you have built and installed the runtime manually, but did not disable either the
Related: #332.
@jodh-intel Thanks for responding so quickly. I actually noticed that just after posting, but we're still seeing Pods stuck. Here's a more recent output: https://gist.github.com/jamiehannaford/dd304f3884d3b2fefb50761a15045af0
Please can you enable full debug and re-run:
Clearly qemu is unhappy:
That, coupled with the BIOS error, suggests a corrupt file (or a corrupt FS/HDD). I'd re-install the qemu-lite package.
@jodh-intel I reinstalled qemu-lite and enabled debug mode, but it's still getting stuck.
Some more info:
@jamiehannaford could you reload the driver:
# modprobe -r kvm_intel
# modprobe kvm_intel nested=1
and re-run?
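A quick way to confirm the module option actually took effect, if useful (a standard sysfs check, not something quoted from this thread):

```sh
# Prints "Y" (or "1" on older kernels) once kvm_intel is loaded with nested=1
cat /sys/module/kvm_intel/parameters/nested
```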
And, just to throw some ideas out there and garner a few more details, @jamiehannaford:
Just trying to narrow down things that might affect just that node (it sounds like you are always failing on the same node, yes?)
Before I try other things, I've just noticed something interesting. When I remove resource requests from the Pod manifest, the container is created. So this works:
But this doesn't work:
I just got a new error for the above manifest:
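The original manifests were not preserved in this extract; as a rough sketch of the failing case (the names and image are hypothetical), a Pod with a tiny memory request along the lines of the 20MiB mentioned below:

```sh
# Illustrative only -- not the reporter's actual manifest.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: kata-memtest            # hypothetical name
spec:
  containers:
  - name: app
    image: busybox              # placeholder image
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: "20Mi"          # too small for the Kata VM's kernel + agent to boot
        cpu: "100m"
EOF
```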
@sboeuf It doesn't seem to make a difference:
@grahamwhaley Nope, it's just a standard GCE VM with nested virt enabled. More details:
@jamiehannaford did this
@sboeuf I killed all the kata containers on that box, ran
I'm interested to know if resource requests work -- whether in general or specifically on GCE. FWIW, Pods without resource requests seem to be fine.
@jamiehannaford off the top of my head, I would say that if you request a certain amount of resources (20MiB in your case, from the pod describe you posted), this will ask Kata to run the VM with not enough memory for the kernel+agent to properly boot. Can you check the qemu command line issued when you try to start a Kata Containers pod with specific resources?
Sure! How do I do that?
@sboeuf Oh that's interesting, when I bumped the resource requests up it worked! So your hypothesis is probably correct. What are the minimum CPU/RAM requirements for the VM? We'd like to lock this down.
You can enable full logs by adding
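The option name was lost above; my reading (an assumption about the stock Kata config, not the commenter's exact words) is that this refers to the enable_debug switches in configuration.toml, and the qemu invocation itself can be read straight from the process list:

```sh
# Uncomment every enable_debug switch in the default Kata config
# (stock path assumed; some installs use /etc/kata-containers/configuration.toml instead)
sudo sed -i 's/^# *\(enable_debug\).*/\1 = true/g' \
  /usr/share/defaults/kata-containers/configuration.toml
# While a Kata pod is starting, the full qemu command line (including -m for memory) shows up in:
ps -ef | grep qemu
```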
Oh, yeah, my brain didn't see that as 20MiB... yeah, that is too small to run the minimal Kata requirements I expect - previously we have run small containers in 64MiB. I've not tried to see how small a memory allocation we need for a while though (so, if you happen to be doing a binary/bisect on that @jamiehannaford - then please feel free to post your results ;-) )
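If anyone does want to bisect the minimum, a rough sketch (the memory values, the pod name, and pod-memtest.yaml are all assumptions, reusing a manifest like the illustrative one earlier):

```sh
# Rough bisection sketch -- values are guesses, not verified minima.
for mem in 256Mi 128Mi 96Mi 64Mi 48Mi 32Mi; do
  sed "s/memory: \"20Mi\"/memory: \"${mem}\"/" pod-memtest.yaml | kubectl create -f -
  sleep 60   # give the Kata VM time to boot (or fail)
  echo "${mem}: $(kubectl get pod kata-memtest -o jsonpath='{.status.phase}')"
  kubectl delete pod kata-memtest
done
```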
Ok cool, that's good news, we definitely need to add some padding for the VM...
I'll try to figure it out through trial and error. Is there any way we can improve the error message on that? It seemed like a pretty random and nondeterministic error.
Feel free to open an issue for that ;) |
@jamiehannaford yes, definitely we should look at improving that - perhaps best to open a new Issue with a more accurate title now that we know the problem, and reference this Issue in it (^^ me sees @sboeuf echo that ;-) )
@jon has raised a PR for a GCE install guide, so it might be worth looking at kata-containers/documentation#154.
So I'm wondering how we can improve the communication between kubelet <-> cri-o <-> kata-runtime here so sandboxes aren't forever stuck in ContainerCreating. If we trace the workflow:
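As an aside (these commands are my suggestion, not quoted from the thread), the intermediate state is usually visible from the node itself, assuming CRI-O runs as the crio systemd unit and crictl is pointed at its socket:

```sh
# Pod sandboxes as CRI-O sees them; stuck ones typically show as NotReady
sudo crictl pods
# Inspect one sandbox for its recorded state/error
sudo crictl inspectp <pod-sandbox-id>
# Runtime-side logs: the CRI-O unit plus kata-runtime entries in the journal
sudo journalctl -u crio --since "10 min ago"
sudo journalctl -t kata-runtime --since "10 min ago"
```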
Hi @jamiehannaford, nice and accurate analysis here. I'm not sure either why Qemu is killed, but regarding the last point
I've run Kata on GCE with Kubernetes, but not with CRI-O -- I used containerd for that experiment. From a qemu/kvm-intel perspective I didn't run into any issues at all (and thus didn't have to do any debugging :). My experiments were with Ubuntu 18.04 -- I have not tried 16.04. If the agent is failing at network setup, I wonder if there are some subtle differences between distributions (or Calico versions?). In my experiment I ran Kubernetes 1.10 with containerd 1.1.0 and Calico 3.0. I bootstrapped the cluster using kubeadm (with the package from the Kubernetes Xenial repo). Let me run through that experiment again to verify that it still works as advertised, and then we can maybe tease apart what the differences are.
Small update to my comment from yesterday -- I tried the single-node cluster bits and they no longer produce a functioning Kata cluster. The runtime installer DaemonSet tries to do its thing, but never succeeds, so unfortunately it seems there's currently no one-stop way to do this :( I have not had time to diagnose whether the failure is Kata related, kubeadm related, or containerd related.
Closing this issue since it seems to be resolved, and was related to not passing enough memory to the Kata VM for the kernel+agent to properly boot.
When memory is hot-added, the udev devpath is:
uevent-devpath=/devices/system/memory/memory81
but when the agent checks for the path it expects, it has to look under the /sys filesystem.
Fixes: kata-containers#388
Signed-off-by: Jose Carlos Venegas Munoz <[email protected]>
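As an illustration of the fix's point (my gloss, not part of the commit message): the uevent devpath is relative to the sysfs root, so the agent has to prefix it with /sys before looking it up:

```sh
# devpath as reported by the uevent (relative to sysfs)
devpath=/devices/system/memory/memory81
# the actual node lives under /sys
ls -d "/sys${devpath}"   # -> /sys/devices/system/memory/memory81
```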
config: there is no need to check vhost-vsock for FC
Description of problem
We're running Kata and CRI-O within GCE VMs that have nested virt enabled. Whilst it mostly works, we've been seeing consistent failures with certain pods:
When we dig into the kubelet logs on that node, we see the following:
Earlier, we also saw the following errors:
We then tried to manually set the firmware field to /usr/share/qemu-lite/qemu, but started seeing container create failed: qemu: could not load PC BIOS '/usr/share/qemu-lite/qemu'. So we reverted.
What types of problems might these errors indicate? Perhaps a problem with nested virt on GCE, or is there a common misconfiguration somewhere? Any help you might have would be super appreciated!
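On the firmware error specifically (my reading, not confirmed anywhere in this thread): the firmware option in Kata's configuration.toml expects a firmware image file, which the runtime hands to qemu as -bios, so pointing it at the qemu install path makes qemu try to load that path as a BIOS blob and fail. Assuming the stock config location, the current setting can be checked with:

```sh
# Show the firmware setting in the default Kata config (path may differ per install)
grep -n '^firmware' /usr/share/defaults/kata-containers/configuration.toml
```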
Output of kata-collect-data.sh: https://gist.github.com/jamiehannaford/f3024f6e45ce11c00312636945f397a9