
LoadBalancer vpc-nat-gateway. #4052

Closed
Orfheo opened this issue May 20, 2024 · 17 comments · Fixed by #4092
@Orfheo

Orfheo commented May 20, 2024

Trying to use the kube-ovn LoadBalancer, I ran into weird behaviour when the lb-svc pod deployed by the controller (service_lb.go) is terminated, either manually or because the node where it runs is drained for whatever reason.

The vpc-nat-gateway pod is correctly restarted by the Kubernetes Deployment created by the kube-ovn controller, but the steps logged as:

service_lb.go:305] add eip rules for lb svc pod, [XXXX]
service_lb.go:325] add dnat rules for lb svc pod, [XXXX]

are never executed again, so the vpc-nat-gateway container is never initialized and, of course, doesn't perform its job.

Apparently the only way to reinitialize the vpc-nat-gateway with the correct routing/iptables rules is to delete and re-apply the LoadBalancer service definition.

I've found a very similar behaviour described in this issue:

#3241

about a VPC nat-gateway not being restarted correctly at node reboot, so I thought I might be onto something.

Apologies, I'm new to kube-ovn, so I'm not sure whether I did something wrong or have actually hit a bug.

Thanks for your patience and attention, Regards.

@zbb88888
Collaborator

Please attach your kube-ovn version and image, and the detailed error log, thanks.

@Orfheo
Author

Orfheo commented May 21, 2024


Sure, here is the kube-ovn version, running under k8s 1.30 (stable), installed with helm:

NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
kube-ovn default 2 2024-05-15 10:36:18.299539318 +0200 CEST deployed kube-ovn-v1.12.14 1.12.14

I'm using the Multus thick daemon, installed from git (version 4.01), to allocate an additional interface for a postgres pod on a defined provider subnet.

The service asks the kube-ovn LoadBalancer for a static IP on that subnet. I'm including the log of a complete sequence which I think shows the problem:

ll2.log

The sequence defines the lb-svc service and the vpc-nat-gateway starts correctly; then I manually deleted the vpc-nat-gateway container, and from the log it should be easy to see that the eip and dnat rules are not initialized in the newly spawned vpc-nat-gateway pod.

Following the documentation, it's easy to check that the vpc-nat-gateway is not initialized and can't perform its job:

lb-svc-test-postgres-57d45578f4-c7l9b:/kube-ovn# ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default

lb-svc-test-postgres-57d45578f4-c7l9b:/kube-ovn# iptables -L -n -v -t nat
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination

Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination

By the way, I have the kind kube-ovn development setup up and running, so I should be able to check the problem in a controlled environment aligned with the kube-ovn git repo.

Thanks for your attention.

@zbb88888
Collaborator

(screenshot)

The IP is not right.

You should check the IP and make sure it is in the subnet.

@zbb88888
Collaborator

I am not aware of any other error logs about other things. After the node reboot, is the nat gw pod still on the previous node?

@Orfheo
Author

Orfheo commented May 22, 2024

Oops, sorry, I deleted the previous post while trying to edit it ;-(

Yes, if all the worker nodes are rebooted the kube-ovn controller does the correct job and spawns the vpc-nat-gateway correctly.

The problem arises if the vpc-nat-gateway is "DELETED" and a new one is spawned.

By the way, I have been able to replicate the problem using the kube-ovn kind development environment, and I have a test case, with its log, available.

From what I can understand of the controller Go code, the "pod listener" doesn't handle this event: somewhere in the "pod.go" source the controller function updatePodAttachNets() should be called, as it is called from the "service.go" event listener. It is not.
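
To make the idea concrete, here is a minimal sketch of what such a hook in the pod event handler might look like. This is not kube-ovn code: the Controller receiver, the label used to recognize lb-svc pods, the services lister and the updatePodAttachNets() signature are all assumptions for illustration.

```go
// Hypothetical sketch of a pod.go hook: when a recreated lb-svc pod reaches
// the Running state and its service already has an Ingress, re-apply the
// eip/dnat rules. All names below are placeholders except updatePodAttachNets,
// whose real signature in kube-ovn may differ.
func (c *Controller) handleLbSvcPodRunning(pod *corev1.Pod) error {
	svcName, ok := pod.Labels["service"] // assumed label carrying the service name
	if !ok || pod.Status.Phase != corev1.PodRunning {
		return nil
	}

	svc, err := c.servicesLister.Services(pod.Namespace).Get(svcName)
	if err != nil {
		return err
	}

	// The service handler only runs at creation time; once the Ingress is in
	// place it never fires again, so the rules must be re-applied from here.
	if svc.Spec.Type == corev1.ServiceTypeLoadBalancer && len(svc.Status.LoadBalancer.Ingress) > 0 {
		return c.updatePodAttachNets(pod, svc)
	}
	return nil
}
```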

@Orfheo
Author

Orfheo commented May 22, 2024

This is my test case for the kube-ovn kind cluster:

make kind-init
make kind-install
make kind-install-lb-svc

kubectl apply -f postgres-config.yml
kubectl apply -f postgres.yml

kubectl apply -f lb-attach-subnet.yaml
kubectl apply -f test-postgres.yml

After the vpc-nat-gateway is in the Running state, the lb-svc pod can be manually deleted to replicate the problem:

kubectl delete pod lb-svc-test-postgres-7b477ff76d-klbqt

The test case files can be downloaded from this personal URL:

http://www.vitillaro.org/curr/kind.tar.gz

@Orfheo
Author

Orfheo commented May 22, 2024

Using my test case, just to understand if I'm on the right track, I applied a patch to the kube-ovn controller sources pod.go and service.go, moving the updatePodAttachNets() call from the service listener to the pod listener.

It now looks like the vpc-nat-gateway is correctly configured each time it is spawned after being deleted.

The patch is quick and dirty, a few lines of code; I'm wondering if anyone might find the time to review what I've done.

The patch should be easy to apply to the current git version of kube-ovn and can be tested in the kube-ovn kind environment using the test case.

If I'm right, with a few changes a potential forever loop in the service.go source (for example if worker nodes can't spawn new pods) could also be avoided, since the pod listener knows whether the service pod is running and can guarantee the correct state to initialize the vpc-nat-gateway container.

@zbb88888
Collaborator

zbb88888 commented May 23, 2024

Thanks for your info and hard work. Please attach your PR here, we will review it.

@Orfheo
Author

Orfheo commented May 23, 2024

I hope attaching a zip file is the correct way.
In the kind.zip file there are all the test case YAML files and the file "svc.patch", a git diff against the kube-ovn git repo, with my attempt to correct the problem.

It seems to work; I'm not sure it won't have side effects.

The problem looks complex, but I think it's worth being analyzed by a developer.

Thanks for your attention.
kind.zip

@zbb88888
Collaborator

(screenshot)

@Orfheo you can post a PR first; we will review it and help fix the issue.

Hi @hongzhen-ma, could you please keep an eye on this issue? Thanks.

@Orfheo
Author

Orfheo commented May 24, 2024


Forgive me, I'm not sure I understood what a PR is.

Anyway, it looks like a complex problem; from what I understand, there is a "chicken and egg" issue here.

The service listener creates a Deployment to instantiate the vpc-nat-gateway frontend of the OVN LoadBalancer, the Deployment spawns a new pod to run the nat gateway inside a container, and then the service listener "waits", in a for loop, for the new pod to reach the "Running" state.

Once the pod is "Running", the service listener updates the service status with the requested Ingress and then calls updatePodAttachNets(), implemented in the service_lb.go source, to run the bash script lb-svc.sh to configure the nat gateway.

The service listener is not going to handle any event if the nat gateway pod dies for any reason; only the pod listener gets the event, so this portion of code is never called again while the Ingress is already in place, and newly spawned vpc-nat-gateway containers are not configured at all.

My patch is just a rough and dirty sketch of a solution, but it is definitely NOT a solution.

You are correct, the pod listener shouldn't update the LoadBalancer service in any way; it should be enough to reuse the portion of the code in service_lb.go that calls "eip-add" and "dnat-add" to configure the vpc-nat-gateway when the Ingress is already in place and the service listener is not going to be called again.

Otherwise, as I've already verified, the two threads need to be synchronized in some way or time-dependent errors will probably show up.
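
As a rough illustration of the kind of synchronization meant here (this is not kube-ovn code; the per-service key scheme is an assumption), a small keyed lock could serialize the service handler and the pod handler:

```go
package controller

import "sync"

// svcLocks hands out one mutex per service so the service handler and the
// pod handler never configure the same vpc-nat-gateway at the same time.
// The "namespace/name" key format is an assumption for illustration.
type svcLocks struct {
	mu sync.Mutex
	m  map[string]*sync.Mutex
}

func (s *svcLocks) forKey(key string) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m == nil {
		s.m = map[string]*sync.Mutex{}
	}
	if s.m[key] == nil {
		s.m[key] = &sync.Mutex{}
	}
	return s.m[key]
}

// In either handler:
//
//	l := locks.forKey(svc.Namespace + "/" + svc.Name)
//	l.Lock()
//	defer l.Unlock()
//	// run the eip-add / dnat-add configuration while holding the lock
```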

Thanks, Orfheo.

@zbb88888
Collaborator

PR means pull request.

The nat-gw pod is used for iptables-eip, iptables-fip, iptables-dnat and iptables-snat; please check whether those CRDs have the same issues.

I think the problem should be fixed by referencing some of the code in the nat-gw pod redo logic.

@hongzhen-ma self-assigned this May 27, 2024
@hongzhen-ma
Collaborator

hongzhen-ma commented May 27, 2024


Thanks for the issue and the log files.

I looked at the contents of the svc.patch file, and it really is the correct processing location.

But when the pod is created or updated, the pod status may still not be Running, so the shell command exec failed when I tested it.

So maybe an easy way is to add the svc to addServiceQueue and go through the process again.
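
A minimal sketch of that guard, assuming addServiceQueue is a standard client-go delaying workqueue (the method and field names around it are illustrative, not the actual kube-ovn code):

```go
// Sketch: only configure the gateway once the lb-svc pod is actually
// Running; otherwise requeue the service key so the workflow retries later.
// Everything except the standard core/v1 types is an assumption.
func (c *Controller) gatewayPodReady(svcKey string, pod *corev1.Pod) bool {
	if pod.DeletionTimestamp != nil || pod.Status.Phase != corev1.PodRunning {
		// Not ready yet: retry the whole add-service workflow shortly.
		c.addServiceQueue.AddAfter(svcKey, 3*time.Second)
		return false
	}
	// Safe to exec the eip-add / dnat-add commands inside the pod now.
	return true
}
```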

@Orfheo
Author

Orfheo commented May 27, 2024


You are correct, the version of the rough patch I posted is not correct; as I wrote, it was just a quick and dirty way to check the logic of this issue.

I wrote a new version, I hope correct this time, which should overcome this behavior by using the existing code in service.go when the lb-svc pod is created at service creation time. Then similar code in pod.go, handling the pod update event and locking the service listener without any possibility of races, handles the event dispatched when the lb-svc pod is deleted.

As soon as I can I'll try to post here the new patch for review.

I don't have enough GitHub experience to create a pull request, I hope you will forgive me.

Thanks for the review, you have confirmed what I thought, appreciated.

@Orfheo
Author

Orfheo commented May 27, 2024


Here is the new version, stored in the "kind1.zip" file, named "svc1.patch".

I think I understand what you suggest about using addServiceQueue, but, if I got it correctly, this would call the event handler handleAddService of the service listener again, and this would create a new lb-svc pod, wouldn't it?

But a new lb-svc vpc-nat-gateway would already have been created by the Deployment installed by the service listener, creating a race.

What I've done in the new version is to let service.go perform its job, creating the lb-svc pod and waiting to see it running.

When a new lb-svc pod is created by the lb-svc Deployment, I check that it is the correct pod, in the "Running" state, associated with the correct service name in the correct namespace, and that the service is of type LoadBalancer with one and only one Ingress.

Having identified the lb-svc pod, I update the lb-svc service Ingress (it may be the case that a LoadBalancerIP hasn't been requested and a new IP is attached to the pod); then, holding a service listener lock that cannot cause a deadlock, I just call updatePodAttachNets to run lb-svc.sh and, if needed, update the service with the new Ingress IP.

In the "testcase" file is stored the sequence of yamls and commands I used to debug the code.

I followed the controller logs in both cases, with a LoadBalancerIP requested and not requested, deleting the service pod many times; the code seems to consistently do the correct job.
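
For clarity, a sketch of the matching checks described above (only the "lb-svc-" pod name prefix comes from the pod names visible in this issue; the rest is illustrative and is not the patch itself):

```go
// Sketch: decide whether a newly spawned lb-svc pod belongs to a LoadBalancer
// service that already has exactly one Ingress and therefore needs its
// eip/dnat rules re-applied by the pod handler.
func lbSvcNeedsReinit(pod *corev1.Pod, svc *corev1.Service) bool {
	if pod.Status.Phase != corev1.PodRunning || pod.Namespace != svc.Namespace {
		return false
	}
	// lb-svc pods are named after the service, e.g. lb-svc-test-postgres-...
	if !strings.HasPrefix(pod.Name, "lb-svc-"+svc.Name) {
		return false
	}
	// If the Ingress is already in place, handleAddService will not run again
	// for this service, so re-initialization has to happen here.
	return svc.Spec.Type == corev1.ServiceTypeLoadBalancer &&
		len(svc.Status.LoadBalancer.Ingress) == 1
}
```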

By the way, if I may use this post to ask a question: from what I've seen, the kube-ovn LoadBalancer always masquerades (MASQ), using iptables, the source address of incoming packets.

Am I right in thinking that, if logging the incoming source addresses is a must-have feature, an external LoadBalancer like MetalLB or Cilium, probably configured with BGP, is the only way to go?

Thanks for the attention and the support, Orfheo.

kind1.zip

@hongzhen-ma
Collaborator

hongzhen-ma commented May 28, 2024


Thanks for the details.

> I think I understand what you suggest about using addServiceQueue, but, if I got it correctly, this would call the event handler handleAddService of the service listener again, and this would create a new lb-svc pod, wouldn't it?

When I tested adding the svc to addServiceQueue after the pod is deleted, I found that the resourceVersion of the svc and of the deploy had changed.
The pod was created only once in my test. But it may not match your situation.

Anyway, it really is better to identify the pod and finally call updatePodAttachNets to initialize the pod iptables rules, rather than going through the AddService process again.

> Having identified the lb-svc pod, I update the lb-svc service Ingress (it may be the case that a LoadBalancerIP hasn't been requested and a new IP is attached to the pod); then, holding a service listener lock that cannot cause a deadlock, I just call updatePodAttachNets to run lb-svc.sh and, if needed, update the service with the new Ingress IP.

Yes. It's another bug: when the pod's attached IP changes, the svc LoadBalancer IP does not get updated.
Thanks for the patch code. I will commit the patch code to GitHub.
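
For reference, pushing the pod's newly attached IP back into the service status would look roughly like this with the standard client-go API (attachedIP, kubeClient and the surrounding variables are placeholders, not kube-ovn code):

```go
// Sketch: update the LoadBalancer Ingress so it follows the IP actually
// attached to the lb-svc pod. UpdateStatus is the standard client-go call;
// the variables around it are placeholders.
svc.Status.LoadBalancer.Ingress = []corev1.LoadBalancerIngress{{IP: attachedIP}}
if _, err := kubeClient.CoreV1().Services(svc.Namespace).UpdateStatus(
	context.Background(), svc, metav1.UpdateOptions{},
); err != nil {
	klog.Errorf("failed to update ingress for svc %s/%s: %v", svc.Namespace, svc.Name, err)
}
```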

And about the second question, on the src IP in iptables: it seems that logging is the only way to record the src IP address. If you have any other suggestions, you are welcome to submit a pull request for that.

@Orfheo
Author

Orfheo commented May 28, 2024

> Thanks for the details.

You are welcome.

> I think I understand what you suggest about using addServiceQueue, but, if I got it correctly, this would call the event handler handleAddService of the service listener again, and this would create a new lb-svc pod, wouldn't it?

> When I tested adding the svc to addServiceQueue after the pod is deleted, I found that the resourceVersion of the svc and of the deploy had changed. The pod was created only once in my test. But it may not match your situation.

Oops, OK, I guess I misunderstood the semantics of this piece of code, apologies.

> Anyway, it really is better to identify the pod and finally call updatePodAttachNets to initialize the pod iptables rules, rather than going through the AddService process again.

Glad to read my logic engine is still working ;-)

> Having identified the lb-svc pod, I update the lb-svc service Ingress (it may be the case that a LoadBalancerIP hasn't been requested and a new IP is attached to the pod); then, holding a service listener lock that cannot cause a deadlock, I just call updatePodAttachNets to run lb-svc.sh and, if needed, update the service with the new Ingress IP.

> Yes. It's another bug: when the pod's attached IP changes, the svc LoadBalancer IP does not get updated. Thanks for the patch code. I will commit the patch code to GitHub.

I missed this point too the first time; when using a LoadBalancer I have the bad habit of considering only the case where a static IP is requested. Nice to read that you'll consider my patch for GitHub. The patch messages have to be cleaned up a bit; I have the habit of using the prefix "GVT" to identify log messages when I patch other people's code.

> And about the second question, on the src IP in iptables: it seems that logging is the only way to record the src IP address. If you have any other suggestions, you are welcome to submit a pull request for that.

I think there is an opportunity to implement "externalTrafficPolicy: Local" here; wouldn't it just be a matter of modifying the lb-svc.sh script of the vpc-nat-gateway to add the correct iptables DNAT port commands?

It would end up with asymmetric routing of course, but maybe it would work and would unmask client source addresses where that is required.

Thanks!
