What happened?
Now that #1962 returns the complete error message, I can finally see that it's:
W1203 03:00:02.919551 3323 network_routes_controller.go:608] unable to get subnet for node IP: 10.35.9.6, err: failed to get list of addrs: results may be incomplete or inconsistent... skipping
W1203 03:00:02.951787 3323 linux_tunnels.go:139] failed to get tunnel link by name tun-aa2ffae2d77: Link not found, will attempt to create it
I1203 03:00:02.952835 3323 linux_tunnels.go:222] failed to check if fou is enabled on the link tun-aa2ffae2d77: failed to get link by name: Link not found, going to try to clean up and recreate the tunnel anyway
W1203 03:00:02.986407 3323 linux_tunnels.go:478] failed to list fou ports: no such file or directory
I1203 03:00:02.992879 3323 linux_tunnels.go:278] Creating tunnel tun-aa2ffae2d77 with encap ipip for destination 10.35.9.9
My speculation: during node boot-up, various OS/Kubernetes components are modifying the network configuration (e.g. after a node reboot, pods start up and the CNI creates network interfaces for them). So if kube-router starts concurrently with everything else, there is a chance that it will occasionally clash with those changes and fail.
What did you expect to happen?
A suggestion: what if the LocalLinkQuerier::AddrList implementation were a wrapper over vishvananda/netlink that retried a few times when nl.ErrDumpInterrupted is returned? E.g. retry 10 times with a 1-second interval.
As a temporary (or permanent? :-D) mitigation I'm using an initContainer that runs sleep 10, which is apparently enough to avoid the clashes, but it feels "meh" :-)
UPD: I did some quick research and found that this is indeed what people do: openstack/ovn-bgp-agent@0a666bb
Or, as a slightly controversial alternative: the nl.ErrDumpInterrupted error could simply be ignored.
Why: we're only detecting the host's IP address; the node's own interface should be fully set up by the time kube-router starts, and we don't really care about the configuration state of the other interfaces, right?
How can we reproduce the behavior you experienced?
Steps to reproduce the behavior:
- Step 1
- Step 2
- Step 3
- Step 4
Screenshots / Architecture Diagrams / Network Topologies
If applicable, add those here to help explain your problem.
System Information (please complete the following information)
- Kube-Router Version (kube-router --version): 2.6.3
- Kube-Router Parameters: [e.g. --run-router --run-service-proxy --enable-overlay --overlay-type=full etc.]
- Kubernetes Version (kubectl version): 1.32.x
- Cloud Type: on premise
- Kubernetes Deployment Type: kubeadm
- Kube-Router Deployment Type: DaemonSet
- Cluster Size: [e.g. 200 Nodes]
Logs, other output, metrics
Please provide logs, other kind of output or observed metrics here.
Additional context
Add any other context about the problem here.