Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

vc: kata_agent: grpc: We should not have infinite timeouts #460

Closed
grahamwhaley opened this issue Jul 3, 2018 · 0 comments
Closed

vc: kata_agent: grpc: We should not have infinite timeouts #460

grahamwhaley opened this issue Jul 3, 2018 · 0 comments

Comments

@grahamwhaley
Copy link
Contributor

Description of problem

Our current grpc code defaults to an infinite timeout. This has been witnessed to hang up when the agent at the other end is not responding (for whatever reason). We should have some sort of safety timeout and/or retry mechanism in place to (at least eventually) notice a dead connection.

There may be implicit complications here involving paused containers or containers that are quiescent for a long time, and maybe some of our grpc calls expect a long delay (but tbh, they are grpc calls and I don't think we parallelise them do we - so, maybe they never expect a very long delay).

Here is a patch I knocked up whilst doing some debug, just for inspiration:

diff --git a/virtcontainers/kata_agent.go b/virtcontainers/kata_agent.go
index b49efae..868f2ae 100644
--- a/virtcontainers/kata_agent.go
+++ b/virtcontainers/kata_agent.go
@@ -967,6 +967,13 @@ func (k *kataAgent) signalProcess(c *Container, processID string, signal syscall
        }

        _, err := k.sendReq(req)
+
+       if err != nil {
+               k.Logger().WithFields(logrus.Fields{
+                       "container-id": c.id,
+                       }).WithError(err).Error("signalProcess() sendReq failed")
+       }
+
        return err
 }

@@ -1230,7 +1237,16 @@ func (k *kataAgent) sendReq(request interface{}) (interface{}, error) {
                return nil, errors.New("Invalid request type")
        }

-       return handler(context.Background(), request)
+       ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
+       defer cancel()
+       i, err := handler(ctx, request)
+
+       if err != nil {
+               k.Logger().WithError(err).Error("sendReq failed")
+       }
+
+       return i, err
 }

Expected result

I don't expect the runtime to ever hang up solid.

Actual result

Over in #406, due to some interesting yamux/qemu timeout death scenarios, we got the runtime to hang up solid in the grpc call dispatch.

zklei pushed a commit to zklei/runtime that referenced this issue Jun 13, 2019
If a proto file is changed then we want a larger number of folks
to know about and review it. Add them to the CODEOWNERS file.

Fixes: kata-containers#460

Signed-off-by: Graham Whaley <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants