Terminate NPC application if any of the watches panic #3792
Conversation
Thanks for the PR. I have a few thoughts, mostly around Go coding style.
Does any of this help to understand why the Kubernetes code which looks like it will exit on panic doesn't just do that?
prog/weave-npc/main.go (outdated)

```go
go func() {
	nsController.Run(stopChan)
	signals <- syscall.SIGINT
```
What's the intention here? Looks like you are faking a signal arriving from the OS?
I am, since that's what weave is waiting on to stop the application completely.
How about `os.Exit(1)`?
Yeah, but if the weave team wants waitgroups in place, I'll likely have to use a channel of some sort, and this one was already set up to use. I'm not against this, but I'm curious how the weave team feels.
On line 222 we know the program has panicked; it seems best to exit quickly and simply at that point rather than adding channels, fake signals, etc.
I've updated this to simply `os.Exit` when one of the goroutines panics.
prog/weave-npc/main.go (outdated)

```go
}()
go func() {
	<-stopChan
	close(stopChan)
```
Isn't it dangerous to close this while there are still goroutines that could send to it?
Well, the application is ending anyway, and I did consider waitgroups, but I'm not sure of the harm here, since the application is completely shutting down at this point regardless. I'm certainly not against adding a waitgroup and waiting until all the others stop completely.
Please don't.
This has been removed.
named function for panic recovery
@bboreham as for the question of why the k8s code isn't handling this, I found this when I was digging into it: https://github.com/kubernetes/client-go/blob/master/tools/cache/controller.go#L145, which mentions:

Which is why I added the goroutine to close the channel when a signal is sent on it. I was wondering why sending a struct down that channel didn't kill everything, until I found that.

I'll get back to this in the next sprint and spend some more time here.
```diff
-var npController cache.Controller
+var (
+	npController cache.Controller
+	stopChan     chan struct{}
```
`stopChan` seems to be left over and unneeded now.
```diff
@@ -218,8 +218,17 @@ func ipsetExist(ips ipset.Interface, name ipset.Name) (bool, error) {
 	return false, nil
 }
 
+func stopOnPanicRecover(stopChan chan struct{}) {
+	if r := recover(); r != nil {
+		os.Exit(1)
```
What's the user experience if this happens? Can they understand where the real problem occurred?
It's maybe a bug in client-go that a panic in an event handler didn't propagate and the process didn't crash. I created an issue to report it in the client-go repo. What do you think? Update: I created a PR to the k8s repo and tried to fix it: kubernetes/kubernetes#93646
Brilliant work @gobomb, I agree this is very likely the root cause of the problem. If @naemono isn't going to fix the nits, I can do that.
Please see my proposed fix-up at master...fail-controller-if-goroutines-fail (I should add that I consider this
Replaced by #3841
Catch panics within any of the ResourceEventHandlerFuncs and terminate the full NPC application.

Per discussions in #3764 and #3771.

This change causes the whole NPC application to crash when a panic occurs within any of the ResourceEventHandlerFuncs, which are called on add/update/delete events for pods/namespaces/network policies. I stripped the weave main package down to just this https://gist.github.com/naemono/7c47213de6bb2eaab07b66b7ad2b86b8, tested it within one of our K8s clusters, and proved that this approach causes the whole application to exit upon an internal panic. This approach is considerably simpler than a full reconcile on restart, which is discussed in #3771. Since the NPC application is in a fully broken state when an internal goroutine crashes, allowing the orchestration system to restart the application/pod seems better long term than allowing the application to continue running in a failed state.

Suggestions on a better approach are welcome. I'm unsure exactly how to unit test this; suggestions on that are welcome as well.