This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

integration: soak: add rm soak test #414

Merged
merged 3 commits into kata-containers:master on Aug 10, 2018

Conversation

grahamwhaley
Contributor

Add an 'rm' soak test. The test was originally written
to capture 'stuck' docker rm's of many containers, but
as it also does a lot of sanity checking of many other
parts of the system (checks for runtime/qemu/proxy/shims
running when they should, and not running when they should
not, and that 'kata-runtime list' matches what we have asked
docker to do, and checks we don't leave dangling mounts around
etc.), it has also been useful for general stability checking.
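The sanity-check idiom described above (compare expected vs. actual component counts, accumulate errors) can be sketched roughly like this. Function and component names here are illustrative, not lifted from the actual script:

```shell
#!/bin/bash
# Illustrative sketch of the soak test's sanity-check idiom: compare how
# many instances of a component are alive against how many we expect,
# accumulating errors rather than bailing on the first mismatch.
# Names are hypothetical; the real script differs in detail.

errors=0

check_count() {
	local name="$1"
	local expected="$2"
	local actual="$3"

	if [ "$actual" -ne "$expected" ]; then
		echo "Wrong number of ${name} running (${actual} != ${expected}) - stopping"
		errors=$((errors + 1))
	fi
}

# After a 'docker rm' of everything, we expect zero of each component.
check_count "qemus" 0 1
echo "Got ${errors} errors"  # → Got 1 errors
```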

Fixes: #195

Signed-off-by: Graham Whaley [email protected]

@grahamwhaley
Contributor Author

Marking RFC/WIP right now, as I'd like to confirm some things with others....
For me, right now, this looks like it hangs up on a docker rm of many containers. @chavafg @jodh-intel - could one or either of you give this a spin for me and confirm if you are seeing that?

I have a sneaking feeling this could be related to kata-containers/runtime#396 - just a feeling. It feels like the old 'stuck lock' in vc issues we had a long long time ago - but, IIRC, @sboeuf rewrote how all the locking works, so I suspect it has similar symptoms but is not the same issue.
Note, once things are stuck, a docker ps works, but a sudo kata-runtime list gets hung up.

/cc @jamiehannaford for completeness.

@grahamwhaley
Contributor Author

Oh, once we've worked this over a bit, I'll add a commit to hook it into the QA CI scripts so it runs on every PR - if we agree (we should look at how long it takes to run, for instance).

Also, this PR is quite strongly related to #215 - that is, much of the sanity-checking code from this PR could maybe be lifted out into a lib, and then we could run it before/after each test.

@sboeuf

sboeuf commented Jun 15, 2018

@grahamwhaley I confirm that running the soak test triggers the error you're describing. I tried on a simple Azure VM with 4 vCPUs and 16 GiB of RAM, running the test with only 20 containers.
The first full run went well, but during the second run it hung during container removal.

The issue seems to be related to locking, but that might not be the root cause. It's worth some investigation.

@grahamwhaley
Contributor Author

OK - I've got some debug from the stuck docker rm's - let me open an Issue to put that data on, and ref it back here...

@jodh-intel
Contributor

As mentioned on this issue, it would be extremely useful to get this landed.

@grahamwhaley
Contributor Author

This'll still be mine then. Things to do are:

  • Test against current HEAD repos
  • Check we are checking all the correct paths (in /var, /lib, /run etc.)
  • Add support to check the vc/vm directory - although, with the factory, it is not clear how to check when that should or should not be empty etc. - will need a touch of research
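One way to do the "check we left nothing behind" part of the list above is a small directory check run after cleanup - a hedged sketch, where the directories checked are placeholders rather than the real kata state paths:

```shell
#!/bin/bash
# Hedged sketch of a "no leftovers" check for runtime state directories.
# The caller supplies the paths; which paths matter depends on the
# runtime configuration, so nothing here is hard-coded.

check_dir_empty() {
	local dir="$1"
	# A directory that was never created is as good as empty.
	[ -d "$dir" ] || return 0
	if [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
		echo "Directory ${dir} not empty after cleanup"
		return 1
	fi
}
```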

@jodh-intel
Contributor

Hi @grahamwhaley - that list sounds good, but I don't think we need to wait until the script checks all those things before landing a basic test. You could add features as we go along in separate PRs. The existing script seems to have proved itself, so the sooner we get it running regularly, the safer we'll be from those nightmare cross-repo bisects, right? 😄

@grahamwhaley grahamwhaley changed the title [RFC] integration: soak: add rm soak test integration: soak: add rm soak test Aug 7, 2018
@grahamwhaley
Contributor Author

OK...
@jodh-intel I got to revisit this. I tested with the HEAD components (it ran the full 110-container, 5-cycle sweep fine), and fixed one minor thing that has niggled me forever - an off-by-one check on the number of containers to run.
And then... I added it into the 'make tests' rules with a config that makes it do a limited-length run, so we can run a basic test in the CIs.
A full run of 110 containers for 5 cycles takes ~17 minutes on my machine. As configured in the Makefile, 20 containers for 2 cycles takes ~1m7s, which maybe we can accept.

I've dropped the RFC and DNM...

/cc @chavafg @GabyCT

@grahamwhaley
Contributor Author

doh - I missed the .PHONY - let me do that...

@jodh-intel
Contributor

jodh-intel commented Aug 7, 2018

@grahamwhaley - nice! Hopefully the CI will be able to crunch through this a bit quicker (I see the jobs are triggered, so we'll need to look at the logs once they finish). So, aside from the Makefile conflict...

lgtm

Approved with PullApprove

@grahamwhaley
Contributor Author

phew, let's try one more time (fingers crossed!). This time:

  • rebased to fix Makefile conflict
  • added a chronic around the call to reduce log noise
  • removed function keywords from the script, as we are moving away from them (and they are not needed).
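For reference, the function-keyword change in the last bullet looks like this, shown with a made-up helper name:

```shell
#!/bin/bash
# The bash "function" keyword adds nothing and is not POSIX, so the
# plain form is preferred. Helper name here is made up for illustration.

# old style, being phased out:
#   function check_containers { ...; }

# preferred style:
check_containers() {
	echo "checking containers"
}
```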

/cc @bergwolf, as #578 may benefit from similar changes as well :-)

@chavafg
Contributor

chavafg commented Aug 7, 2018

lgtm

Approved with PullApprove

@chavafg
Contributor

chavafg commented Aug 7, 2018

I think you'll need to check if docker service is started (and start it if not) before executing the tests.

@grahamwhaley
Contributor Author

Ah, I see:

cd integration/stability && \
export ITERATIONS=2 && export MAX_CONTAINERS=20 && chronic ./soak_parallel_rm.sh
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Checking Kata runtime kata-runtime
Start iteration 1 of 2
Running...
Checking 0 containers have all relevant components
Run kata-runtime: nginx: 
Checking 1 containers have all relevant components
Wrong number of containers running (0 != 1) - stopping
Got 1 errors, quitting
Makefile:62: recipe for target 'docker-stability' failed
make: *** [docker-stability] Error 255
Build step 'Execute shell' marked build as failure

@chavafg - is that the common 'idiom' for our test suites then - that they have to check and start the services they need before they run?
Is there perhaps a set of shell library helper funcs for that, like a start_docker() func or something I can invoke?
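For what it's worth, a minimal helper along those lines might look like this - the start_docker() name is hypothetical, and it just wraps the systemctl idiom other suites already use:

```shell
#!/bin/bash
# Hypothetical start_docker() helper: start the docker service if it is
# not already active. Mirrors the systemctl one-liner used elsewhere in
# the test suites' Makefiles.

start_docker() {
	systemctl is-active --quiet docker || sudo systemctl start docker
}
```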

@grahamwhaley
Contributor Author

OK, I see we do something similar for other test suites - sometimes in the Makefile (swarm), sometimes in the scripts themselves (crio). I've added the relevant systemctl lines copied from the swarm Makefile case to the docker-stability section of the Makefile and re-pushed. Let's see how that goes with the CIs.

@chavafg
Contributor

chavafg commented Aug 8, 2018

network errors on the jobs :(
Sent a rebuild...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  5 99.4M    5 5390k    0     0  5390k      0  0:00:18 --:--:--  0:00:18 12.4M
 47 99.4M   47 46.9M    0     0  46.9M      0  0:00:02  0:00:01  0:00:01 33.4M
 48 99.4M   48 47.9M    0     0  47.9M      0  0:00:02  0:00:01  0:00:01 33.2M
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104

@grahamwhaley
Contributor Author

Heh heh. From the F27 CI:

systemctl is-active --quiet docker || sudo systemctl start docker
cd integration/stability && \
export ITERATIONS=2 && export MAX_CONTAINERS=20 && chronic ./soak_parallel_rm.sh
sudo: kata-runtime: command not found
Unable to find image 'nginx:latest' locally
latest: Pulling from library/nginx
be8881be8156: Already exists
32d9726baeef: Pulling fs layer
87e5e6f71297: Pulling fs layer
87e5e6f71297: Verifying Checksum
87e5e6f71297: Download complete
32d9726baeef: Verifying Checksum
32d9726baeef: Download complete
32d9726baeef: Pull complete
87e5e6f71297: Pull complete
Digest: sha256:d85914d547a6c92faa39ce7058bd7529baacab7e0cd4255442b04577c4d1f424
Status: Downloaded newer image for nginx:latest
sudo: kata-runtime: command not found
Checking Kata runtime kata-runtime
Start iteration 1 of 2
Running...
Checking 0 containers have all relevant components
Run kata-runtime: nginx: 
caa1e8a411ccc5c12e968c76d9d7bb107c962ab12500bbb88d05de6524c9db7d
Checking 1 containers have all relevant components
Wrong number of qemus running (1 != 0) - stopping
Wrong number of 'runtime list' containers running (1 != 0) - stopping
Got 2 errors, quitting
make: *** [Makefile:63: docker-stability] Error 255

and the same for Centos7 it seems.

I'll check what the sudo: kata-runtime: command not found really means (maybe on those CI systems sudo cannot see kata-runtime in its path by default?).

Now, otherwise, in a way this is sort of good, as that is exactly the sort of situation the test is meant to pick up.
I think what I should do is add a dump_diagnostics() func that will give us a bunch of pgrep and docker ps -qa type info to help diagnose failures. wdyt @chavafg @jodh-intel :-)
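A rough sketch of what such a helper might look like - the section layout, component names, and exact commands are my assumptions, not necessarily what eventually landed in the common lib:

```shell
#!/bin/bash
# Sketch of a dump_diagnostics() helper: print a snapshot of which kata
# components are alive, what docker thinks is running, and any kata
# mounts, to aid diagnosis when a sanity check fails.

dump_diagnostics() {
	echo "--- component processes ---"
	for comp in qemu kata-runtime kata-proxy kata-shim; do
		echo "${comp}: $(pgrep -f "$comp" 2>/dev/null | wc -l)"
	done

	echo "--- docker containers ---"
	if command -v docker >/dev/null 2>&1; then
		docker ps -qa
	fi

	echo "--- kata mounts ---"
	mount 2>/dev/null | grep -i kata || true
}
```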

@chavafg
Contributor

chavafg commented Aug 10, 2018

@grahamwhaley, the dump_diagnostics function sounds good.
And you are right, sudo cannot access kata-runtime:

[fuentess@kata-fedora ~]$ which kata-runtime
/usr/local/bin/kata-runtime
[fuentess@kata-fedora ~]$ sudo -E PATH=$PATH kata-runtime
sudo: kata-runtime: command not found

Although not sure what is the best way to solve this. In https://github.com/kata-containers/tests/blob/master/integration/openshift/hello_world.bats, we use:

kata_runtime_bin=$(command -v kata-runtime)
sudo -E "$kata_runtime_bin"

Graham Whaley added 3 commits August 10, 2018 11:37
Add a function to show a number of kata relevant system information
items such as what docker and the runtime thinks is running, and
what components we can see alive.

Useful as a diagnostic tool for if we fail a sanity check during
testing.

Signed-off-by: Graham Whaley <[email protected]>
Add an 'rm' soak test. The test was originally written
to capture 'stuck' docker rm's of many containers, but
as it also does a lot of sanity checking of many other
parts of the system (checks for runtime/qemu/proxy/shims
running when they should, and not running when they should
not, and that 'kata-runtime list' matches what we have asked
docker to do, and checks we don't leave dangling mounts around
etc.), it has also been useful for general stability checking.

Fixes: kata-containers#195

Signed-off-by: Graham Whaley <[email protected]>
Enable the docker soak test in the Makefile.
Over-ride the test default configuration to bring the test time
down to something more acceptable in the CIs.

Signed-off-by: Graham Whaley <[email protected]>
@grahamwhaley
Contributor Author

Added a system info dump function to the common lib. Fixed the sudo RUNTIME invocation. Let's see how the CIs are feeling...

@chavafg
Contributor

chavafg commented Aug 10, 2018

lgtm,

CI happy. Merging

@chavafg chavafg merged commit 623dfba into kata-containers:master Aug 10, 2018