This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Terminate NPC application if any of the watches panic #3792

Closed

Conversation

Contributor

@naemono naemono commented Apr 3, 2020

Catch panics within any of the ResourceEventHandlerFuncs and terminate the full NPC application

Per discussions in #3764 and #3771.

This change will cause the whole NPC application to crash when a panic occurs within any of the ResourceEventHandlerFuncs, which are called on add/update/delete events for pods, namespaces, and network policies. I stripped the weave main package down to this gist, https://gist.github.com/naemono/7c47213de6bb2eaab07b66b7ad2b86b8, tested it within one of our K8s clusters, and confirmed that this approach causes the whole application to exit upon an internal panic. This approach is considerably simpler than a full reconcile on restart, which is discussed in #3771. Since the NPC application is in a fully broken state when an internal goroutine crashes, allowing the orchestration system to restart the application/pod seems better long term than letting the application continue running in a failed state.
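For context, here is a minimal sketch of the handler type being discussed, client-go's cache.ResourceEventHandlerFuncs; the bodies are illustrative placeholders, not the NPC code (the real handlers mutate iptables/ipset state, which is where the panics originate):

package main

import "k8s.io/client-go/tools/cache"

func main() {
	// Illustrative only: each controller registers Add/Update/Delete callbacks,
	// and a panic inside any of them is what this PR wants to turn into a
	// whole-process exit.
	handlers := cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* react to a new pod/namespace/policy */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* react to an update */ },
		DeleteFunc: func(obj interface{}) { /* react to a deletion */ },
	}
	_ = handlers // would be passed to cache.NewInformer when building each controller
}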

Suggestions on a better approach are welcome. I'm unsure exactly how to unit test this; suggestions on that are welcome as well.

Contributor

@bboreham bboreham left a comment

Thanks for the PR. I have a few thoughts, mostly around Go coding style.

Does any of this help to understand why the Kubernetes code, which looks like it will exit on panic, doesn't just do that?


go func() {
	nsController.Run(stopChan)
	signals <- syscall.SIGINT
Contributor

What's the intention here? Looks like you are faking a signal arriving from the OS?

Contributor Author

I am, since that's what weave is waiting on to stop the application completely.

Contributor

How about os.Exit(1) ?

Contributor Author

Yeah, but if the weave team wants waitgroups in place, I'll likely have to use a channel of some sort, and this one was already set up to use. I'm not against this, but I'm curious how the weave team feels.

Contributor

On line 222 we know the program has panic'd; it seems best to exit quickly and simply at that point rather than adding channels, fake signals, etc.

Contributor Author

I've updated this to simply call os.Exit when one of the goroutines panics.
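For orientation, a sketch of roughly how that looks wired into one of the controller goroutines, using the stopOnPanicRecover helper from the diff below (paraphrased, not the exact change):

go func() {
	// If anything under Run panics, the deferred recovery exits the whole
	// process instead of leaving NPC running with a dead watch.
	defer stopOnPanicRecover(stopChan)
	nsController.Run(stopChan)
}()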

}()
go func() {
	<-stopChan
	close(stopChan)
Contributor

Isn't it dangerous to close this while there are still goroutines that could send to it?

Contributor Author

Well, the application is ending anyway. I was considering waitgroups, but I'm not sure of the harm here, as the application is completely shutting down at this point regardless. I'm certainly not against adding a waitgroup and waiting until all the others stop completely.

Contributor

Please don't.

Contributor Author

This has been removed.
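For reference, the hazard raised above is that a send on a closed channel panics at runtime; a minimal, self-contained illustration:

package main

func main() {
	stop := make(chan struct{})
	close(stop)
	stop <- struct{}{} // panic: send on closed channel
}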

named function for panic recovery
@naemono
Contributor Author

naemono commented Apr 3, 2020

@bboreham as for the question of why the k8s code isn't handling this, I found the following while digging in:

https://github.com/kubernetes/client-go/blob/master/tools/cache/controller.go#L145

which mentions:

// Until loops until stop channel is closed, running f every period.

That is why I added the goroutine to close the channel when a signal is sent on it: the controllers only stop once that channel is actually closed. I was wondering why sending a struct down the channel didn't kill everything, until I found that.
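To make the distinction concrete: a value sent on a stop channel is received by exactly one goroutine, while close() releases every receiver, which is why the controllers only stop once the channel is closed. A small self-contained illustration (not weave code):

package main

import (
	"fmt"
	"time"
)

func worker(id int, stopCh <-chan struct{}) {
	<-stopCh // blocks until a value arrives or the channel is closed
	fmt.Println("worker", id, "stopped")
}

func main() {
	stopCh := make(chan struct{})
	for i := 0; i < 3; i++ {
		go worker(i, stopCh)
	}

	stopCh <- struct{}{} // wakes exactly one worker
	time.Sleep(100 * time.Millisecond)

	close(stopCh) // wakes every remaining worker
	time.Sleep(100 * time.Millisecond)
}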

@naemono
Contributor Author

naemono commented May 27, 2020

I'll get back to this in the next Sprint and spend some more time here.

var npController cache.Controller
var (
	npController cache.Controller
	stopChan     chan struct{}
Contributor

stopChan seems to be left over and unneeded now.

@@ -218,8 +218,17 @@ func ipsetExist(ips ipset.Interface, name ipset.Name) (bool, error) {
	return false, nil
}

func stopOnPanicRecover(stopChan chan struct{}) {
	if r := recover(); r != nil {
		os.Exit(1)
Contributor

What's the user experience if this happens? Can they understand where the real problem occurred?

@gobomb
Contributor

gobomb commented Aug 3, 2020

It may be a bug in client-go that a panic in an event handler doesn't propagate and the process doesn't crash. I created an issue to report it in the client-go repo. What do you think?

kubernetes/client-go#838

Update: I created a PR against the k8s repo to try to fix it: kubernetes/kubernetes#93646

@bboreham
Contributor

bboreham commented Aug 4, 2020

Brilliant work @gobomb, I agree this is very likely the root cause of the problem.
I have proposed a different fix: kubernetes/kubernetes#93679

If @naemono isn't going to fix the nits I can do that.

@bboreham
Contributor

bboreham commented Aug 4, 2020

Please see my proposed fix-up at master...fail-controller-if-goroutines-fail

(I should add that I consider this stopOnPanicRecover() as a temporary work-around until there is a Kubernetes client-go that fixes the underlying problem)

@bboreham
Contributor

bboreham commented Aug 5, 2020

Replaced by #3841

@bboreham bboreham closed this Aug 5, 2020