clh: Potential regression test failure VmInfoGet failed
#2864
It is a potential regression test failure observed as `VmInfoGet failed` after block device hotplug and memory hotplug. Fixes: kata-containers#2864 Signed-off-by: Bo Chen <[email protected]>
Interesting/unexpected experiment results from running clh-docker CI jobs on different PRs:
@likebreath thanks for keeping track of these issues!
I still don't have a cloud-hypervisor-only case, but I have now created a script that uses docker to isolate the problem a bit more:

#!/bin/bash
set -x
set -e
# Number of loop devices to create (9 makes the issue easier to hit)
loops=9
loop_list=()
# Create ${loops} backing files, partition them, and attach them as loop devices
create_loop_devices() {
    for i in $(seq 1 ${loops}); do
        loop_name="loop${i}"
        # 50M backing file per device
        dd if=/dev/zero of="/tmp/${loop_name}" count=1 bs=50M
        # Create a GPT label and a single partition spanning the file
        printf "g\nn\n\n\n\nw\n" | sudo fdisk "/tmp/${loop_name}"
        # Attach the file as a loop device, scanning its partition table (-P)
        loop_path=$(sudo losetup -fP --show "/tmp/${loop_name}")
        loop_list+=("${loop_path}")
        sudo losetup -j "/tmp/${loop_name}"
    done
}
# Detach all loop devices on exit
delete_loop_devices() {
    for p in "${loop_list[@]}"; do
        sudo losetup -d "${p}"
    done
}

create_loop_devices
trap delete_loop_devices EXIT
docker_cmd="docker"
docker_cmd+=" run"
docker_cmd+=" --runtime kata-runtime"
docker_cmd+=" --rm"
for p in "${loop_list[@]}"; do
docker_cmd+=" --device ${p}"
done
docker_cmd+=" busybox find /dev -name 'loop*'"
while
eval "${docker_cmd}"
do
echo ok
done
exit
Using 9 loop devices usually makes it easier to reproduce.
It sounds like this is a regression in functionality? Can we attempt to bisect? Is this observable on stable?
@egernst We are working on a reproducer of the problem against upstream master. Would you please share how you encountered the failure on your setup (w/ memory hotplug)?
On a Kubernetes system with the Kata runtime classes already installed and kata-deploy already run, update the binaries to the latest master release:

Then, just run a pod:
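The exact pod spec was not captured above; the following is only a minimal sketch of what such a test pod could look like, assuming kata-deploy created a RuntimeClass named `kata-clh` (adjust the name and sizes to match your cluster):

```bash
# Hypothetical test pod; the RuntimeClass name "kata-clh" is an assumption
# based on what kata-deploy typically installs -- adjust for your cluster.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: clh-memory-hotplug-test
spec:
  runtimeClassName: kata-clh
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    resources:
      limits:
        # A memory limit above the default VM size typically causes the
        # runtime to hot-plug additional memory into the guest
        memory: "2Gi"
EOF
```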
@likebreath @egernst Have we tried against the stable 1.11.2 release to check if we see the issue?
@amshinde I am unable to reproduce with 1.11.2.
A quick update. @jcvenegas and I located the root cause of this issue. @jcvenegas has submitted a patch to CLH to fix it (cloud-hypervisor/cloud-hypervisor#1548), and the patch will be included in clh v0.9.0 (coming out tomorrow). Hopefully, the patch will be enough to cover the system calls required by workloads using clh+kata (w/ virtiofsd 5.0). Note that similar silent failures can be triggered by other workloads (which may exercise new syscalls), which could explain the random/sporadic failures we see in our kata CI.
Could you document how to quickly check and identify this kind of issue? The goal is to avoid wasting too much time the next time we run into this kind of issue, and to quickly identify the missing syscall.
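For reference, one way to spot a seccomp-denied syscall on the host is to look for SECCOMP records in the kernel/audit logs. A rough sketch, assuming auditd (or at least kernel audit logging) is available on the host:

```bash
# Look for seccomp kill/log events in the kernel log
dmesg | grep -i seccomp

# Or, if auditd is running, search recent SECCOMP audit records
sudo ausearch -m SECCOMP -ts recent

# Translate the syscall number reported in the audit record to a name
# (ausyscall ships with the audit userspace tools; 157 is just an example)
ausyscall 157
```

Starting with clh v0.9.0, running cloud-hypervisor with `--seccomp=log` (mentioned in the release notes below) should also log requests that would otherwise have been denied.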
Highlights for cloud-hypervisor version 0.9.0 include:

virtiofs updated to the new dax implementation based on qemu 5.0; fixed random issues caused by seccomp filters.

io_uring Based Block Device Support
If the io_uring feature is enabled and the host kernel supports it, then io_uring will be used for block devices. This results in a very significant performance improvement.

Block and Network Device Statistics
Statistics for the activity of the virtio network and block devices are now exposed through a new vm.counters HTTP API entry point. These take the form of simple counters which can be used to observe the activity of the VM.

HTTP API Responses
The HTTP API for adding devices now responds with the name that was assigned to the device as well as the PCI BDF.

CPU Topology
A topology parameter has been added to --cpus which allows the configuration of the guest CPU topology, letting the user specify the number of sockets, packages per socket, cores per package and threads per core.

Release Build Optimization
Our release build is now built with LTO (Link Time Optimization), which results in a ~20% reduction in the binary size.

Hypervisor Abstraction
A new abstraction has been introduced, in the form of a hypervisor crate, so as to enable the support of additional hypervisors beyond KVM.

Snapshot/Restore Improvements
Multiple improvements have been made to the VM snapshot/restore support that was added in the last release. This includes persisting more vCPU state and, in particular, preserving the guest paravirtualized clock in order to avoid vCPU hangs inside the guest when running with multiple vCPUs.

Virtio Memory Ballooning Support
A virtio-balloon device has been added, controlled through the resize control, which allows the reclamation of host memory by resizing a memory balloon inside the guest.

Enhancements to ARM64 Support
The ARM64 support introduced in the last release has been further enhanced with support for using PCI for exposing devices into the guest, as well as multiple bug fixes. It also now supports using an initramfs when booting.

Intel SGX Support
The guest can now use Intel SGX if the host supports it. Details can be found in the dedicated SGX documentation.

Seccomp Sandbox Improvements
The most frequently used virtio devices are now isolated with their own seccomp filters. It is also now possible to pass --seccomp=log, which results in the logging of requests that would have otherwise been denied, to further aid development.

Notable Bug Fixes
Our virtio-vsock implementation has been resynced with the implementation from Firecracker and includes multiple bug fixes.
CPU hotplug has been fixed so that it is now possible to add, remove, and re-add vCPUs (kata-containers#1338).
A workaround is now in place for when KVM reports available MSRs that are in fact unreadable, preventing snapshot/restore from working correctly (kata-containers#1543).
virtio-mmio based devices are now more widely tested (kata-containers#275).
Multiple issues have been fixed with virtio device configuration (kata-containers#1217).
Console input was wrongly consumed by both virtio-console and the serial device (kata-containers#1521).

Fixes: kata-containers#2864
Signed-off-by: Jose Carlos Venegas Munoz <[email protected]>
@likebreath @jcvenegas Can we throw an explicit error in that case which clearly shows that the system call was not allowed by seccomp? Wdyt @sboeuf @rbradford?
Description of problem
As observed from two PRs (#2840 and #2833), the clh-docker CI job is failing on `run hot plug block devices`. Also, @egernst reported a similar failure of `VmInfoGet failed` after hotplugging memory with kata+clh.

As @jcvenegas and I confirmed that the failure is not related to the changes from those PRs, we believe this is a regression introduced recently. The last CI job that passed was a few days ago, on 07/25, [here](http://jenkins.katacontainers.io/job/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/141/).
I am opening a dummy PR to verify whether it is actually a regression test failure that escaped previous checks/CIs.
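For context, `VmInfoGet` appears to be the runtime's generated client call for the `vm.info` endpoint of the Cloud Hypervisor HTTP API. A rough way to query the same endpoints by hand while a sandbox is running; the socket path below is only an illustrative assumption, not the exact path used by the runtime:

```bash
# Assumed location of the clh API socket for a running kata sandbox;
# substitute the real sandbox id / path on your system.
sock="/run/vc/vm/<sandbox-id>/clh-api.sock"

# Equivalent of the VmInfoGet call that is failing in the reports above
curl -s --unix-socket "${sock}" http://localhost/api/v1/vm.info

# Device/network activity counters (new in clh v0.9.0 per the release notes)
curl -s --unix-socket "${sock}" http://localhost/api/v1/vm.counters
```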