This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Terminate NPC application if any of the watches panic #3792

Closed

Conversation

Contributor

@naemono naemono commented Apr 3, 2020

Catch panics within any of the ResourceEventHandlerFuncs and terminate the full NPC application

Per discussions in #3764 and #3771.

This change will cause the whole NPC application to crash when a panic occurs within any of the ResourceEventHandlerFuncs, which are called on add/update/delete events for pods, namespaces, and network policies. I stripped the weave main package down to this gist, https://gist.github.com/naemono/7c47213de6bb2eaab07b66b7ad2b86b8, tested it within one of our K8s clusters, and confirmed that this approach causes the whole application to exit upon an internal panic. This approach is considerably simpler than a full reconcile on restart, which is discussed in #3771. Since the NPC application is in a fully broken state when an internal goroutine crashes, allowing the orchestration system to restart the application/pod seems better long term than letting the application continue running in a failed state.
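For context, here is a minimal sketch of the handler type being discussed, client-go's cache.ResourceEventHandlerFuncs; the bodies are illustrative placeholders, not the NPC code (the real handlers mutate iptables/ipset state, which is where the panics originate):

package main

import "k8s.io/client-go/tools/cache"

func main() {
	// Illustrative only: each controller registers Add/Update/Delete callbacks,
	// and a panic inside any of them is what this PR wants to turn into a
	// whole-process exit.
	handlers := cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* react to a new pod/namespace/policy */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* react to an update */ },
		DeleteFunc: func(obj interface{}) { /* react to a deletion */ },
	}
	_ = handlers // would be passed to cache.NewInformer when building each controller
}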

Suggestions on a better approach are welcome. I'm unsure exactly how to unit test this; suggestions on that are welcome as well.

Contributor

@bboreham bboreham left a comment

Thanks for the PR. I have a few thoughts, mostly around Go coding style.

Does any of this help to understand why the Kubernetes code, which looks like it will exit on panic, doesn't just do that?


go func() {
	nsController.Run(stopChan)
	signals <- syscall.SIGINT
Contributor

What's the intention here? Looks like you are faking a signal arriving from the OS?

Contributor Author

I am, since that's what weave is waiting on to stop the application completely.

Contributor

How about os.Exit(1) ?

Contributor Author

Yeah, but if the weave team wants waitgroups in place, I'll likely have to use a channel of some sort, and this one was already set up to use. I'm not against this, but I'm curious how the weave team feels.

Contributor

On line 222 we know the program has panic'd; it seems best to exit quickly and simply at that point rather than adding channels, fake signals, etc.

Contributor Author

I've updated this to simply call os.Exit when one of the goroutines panics.
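For orientation, a sketch of roughly how that looks wired into one of the controller goroutines, using the stopOnPanicRecover helper from the diff below (paraphrased, not the exact change):

go func() {
	// If anything under Run panics, the deferred recovery exits the whole
	// process instead of leaving NPC running with a dead watch.
	defer stopOnPanicRecover(stopChan)
	nsController.Run(stopChan)
}()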

}()
go func() {
	<-stopChan
	close(stopChan)
Contributor

Isn't it dangerous to close this while there are still goroutines that could send to it?

Contributor Author

Well, the application is ending anyway. I was considering waitgroups, but I'm not sure of the harm here, as the application is completely shutting down at this point regardless. I'm certainly not against adding a waitgroup and waiting until all the others stop completely.

Contributor

Please don't.

Contributor Author

This has been removed.
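For reference, the hazard raised above is that a send on a closed channel panics at runtime; a minimal, self-contained illustration:

package main

func main() {
	stop := make(chan struct{})
	close(stop)
	stop <- struct{}{} // panic: send on closed channel
}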

named function for panic recovery
@naemono
Contributor Author

naemono commented Apr 3, 2020

@bboreham as for the question of why the k8s code isn't handling this, I found the following while digging in:

https://github.com/kubernetes/client-go/blob/master/tools/cache/controller.go#L145

which mentions:

// Until loops until stop channel is closed, running f every period.

That is why I added the goroutine to close the channel when a signal is sent on it: the controllers only stop once that channel is actually closed. I was wondering why sending a struct down the channel didn't kill everything, until I found that.
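To make the distinction concrete: a value sent on a stop channel is received by exactly one goroutine, while close() releases every receiver, which is why the controllers only stop once the channel is closed. A small self-contained illustration (not weave code):

package main

import (
	"fmt"
	"time"
)

func worker(id int, stopCh <-chan struct{}) {
	<-stopCh // blocks until a value arrives or the channel is closed
	fmt.Println("worker", id, "stopped")
}

func main() {
	stopCh := make(chan struct{})
	for i := 0; i < 3; i++ {
		go worker(i, stopCh)
	}

	stopCh <- struct{}{} // wakes exactly one worker
	time.Sleep(100 * time.Millisecond)

	close(stopCh) // wakes every remaining worker
	time.Sleep(100 * time.Millisecond)
}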

@naemono
Contributor Author

naemono commented May 27, 2020

I'll get back to this in the next Sprint and spend some more time here.

var npController cache.Controller
var (
	npController cache.Controller
	stopChan     chan struct{}
Contributor

stopChan seems to be left over and unneeded now.

@@ -218,8 +218,17 @@ func ipsetExist(ips ipset.Interface, name ipset.Name) (bool, error) {
	return false, nil
}

func stopOnPanicRecover(stopChan chan struct{}) {
	if r := recover(); r != nil {
		os.Exit(1)
Contributor

What's the user experience if this happens? Can they understand where the real problem occurred?

@gobomb
Contributor

gobomb commented Aug 3, 2020

It may be a bug in client-go that a panic in an event handler doesn't propagate and the process doesn't crash. I created an issue to report it in the client-go repo. What do you think?

kubernetes/client-go#838

Update: I created a PR against the k8s repo to try to fix it: kubernetes/kubernetes#93646

@bboreham
Contributor

bboreham commented Aug 4, 2020

Brilliant work @gobomb, I agree this is very likely the root cause of the problem.
I have proposed a different fix: kubernetes/kubernetes#93679

If @naemono isn't going to fix the nits I can do that.

@bboreham
Contributor

bboreham commented Aug 4, 2020

Please see my proposed fix-up at master...fail-controller-if-goroutines-fail

(I should add that I consider this stopOnPanicRecover() as a temporary work-around until there is a Kubernetes client-go that fixes the underlying problem)

@bboreham
Contributor

bboreham commented Aug 5, 2020

Replaced by #3841

@bboreham bboreham closed this Aug 5, 2020