Sporadic failed to get list of addrs: results may be incomplete or inconsistent #1965

@zerkms

Description

What happened?

Now that #1962 returns the complete error message, I can finally see what it actually is:

W1203 03:00:02.919551    3323 network_routes_controller.go:608] unable to get subnet for node IP: 10.35.9.6, err: failed to get list of addrs: results may be incomplete or inconsistent... skipping
W1203 03:00:02.951787    3323 linux_tunnels.go:139] failed to get tunnel link by name tun-aa2ffae2d77: Link not found, will attempt to create it
I1203 03:00:02.952835    3323 linux_tunnels.go:222] failed to check if fou is enabled on the link tun-aa2ffae2d77: failed to get link by name: Link not found, going to try to clean up and recreate the tunnel anyway
W1203 03:00:02.986407    3323 linux_tunnels.go:478] failed to list fou ports: no such file or directory
I1203 03:00:02.992879    3323 linux_tunnels.go:278] Creating tunnel tun-aa2ffae2d77 with encap ipip for destination 10.35.9.9

My speculation: during node boot-up, various OS/Kubernetes components are still changing the network configuration (e.g. after a node reboot, pods start up and the CNI creates network interfaces for them). So if kube-router starts concurrently with everything else, there is a chance it will occasionally clash with that activity and fail.

What did you expect to happen?

A suggestion: what if the LocalLinkQuerier::AddrList implementation were a wrapper over vishvananda/netlink that performed several retries when nl.ErrDumpInterrupted is returned? E.g. retry 10 times with a 1-second interval.
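
A minimal sketch of what that wrapper could look like (the function name, package name and the 10×1s retry policy are just my illustration; it assumes the querier ultimately calls netlink.AddrList and that an interrupted dump surfaces as nl.ErrDumpInterrupted):

```go
package utils

import (
	"errors"
	"fmt"
	"time"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netlink/nl"
)

// addrListWithRetry is a hypothetical wrapper around netlink.AddrList that
// retries when the kernel reports an interrupted dump, which
// vishvananda/netlink surfaces as nl.ErrDumpInterrupted.
func addrListWithRetry(link netlink.Link, family int) ([]netlink.Addr, error) {
	const (
		attempts = 10
		interval = time.Second
	)
	var lastErr error
	for i := 0; i < attempts; i++ {
		addrs, err := netlink.AddrList(link, family)
		if err == nil {
			return addrs, nil
		}
		if !errors.Is(err, nl.ErrDumpInterrupted) {
			// Any other failure is not the transient case we care about.
			return nil, err
		}
		lastErr = err
		time.Sleep(interval)
	}
	return nil, fmt.Errorf("addr dump still interrupted after %d attempts: %w", attempts, lastErr)
}
```

Retrying only on that specific sentinel keeps every other failure mode fast and visible.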

As a temporary (or permanent? :-D) mitigation I'm using an initContainer that runs sleep 10, which is apparently enough to avoid the clashes, but it feels "meh" :-)
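
For reference, the init container is nothing fancier than this (the container name and image are just what I happen to use, not anything kube-router ships):

```yaml
# Hypothetical init container added to the kube-router DaemonSet pod spec.
# It only delays startup so the boot-time network churn settles before
# kube-router queries netlink.
initContainers:
  - name: wait-for-network-settle
    image: busybox:1.36
    command: ["sh", "-c", "sleep 10"]
```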

UPD: I did some quick research and found that this is indeed what people do: openstack/ovn-bgp-agent@0a666bb

Or, as a somewhat controversial alternative: I think the nl.ErrDumpInterrupted error could simply be ignored.

Why: we're only detecting the host's IP address, the node's own interface should already be fully set up by the time kube-router starts, and we don't really care about the configuration state of the other interfaces, right?
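
In code terms the idea is just: if the dump was interrupted, keep whatever was returned instead of bailing out. This is a sketch under the assumption that netlink.AddrList still hands back the (possibly incomplete) slice alongside nl.ErrDumpInterrupted, which is my reading of the library, not something I've verified inside kube-router itself:

```go
package utils

import (
	"errors"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netlink/nl"
	"k8s.io/klog/v2"
)

// addrListTolerant is a hypothetical variant that downgrades an interrupted
// dump to a warning and returns whatever addresses were collected so far.
func addrListTolerant(link netlink.Link, family int) ([]netlink.Addr, error) {
	addrs, err := netlink.AddrList(link, family)
	if err != nil && errors.Is(err, nl.ErrDumpInterrupted) {
		klog.Warningf("addr dump was interrupted, continuing with possibly incomplete results")
		return addrs, nil
	}
	return addrs, err
}
```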

How can we reproduce the behavior you experienced?

Steps to reproduce the behavior:

  1. Step 1
  2. Step 2
  3. Step 3
  4. Step 4

Screenshots / Architecture Diagrams / Network Topologies

If applicable, add those here to help explain your problem.

System Information (please complete the following information)

  • Kube-Router Version (kube-router --version): 2.6.3
  • Kube-Router Parameters: [e.g. --run-router --run-service-proxy --enable-overlay --overlay-type=full etc.]
  • Kubernetes Version (kubectl version): 1.32.x
  • Cloud Type: on premise
  • Kubernetes Deployment Type: kubeadm
  • Kube-Router Deployment Type: DaemonSet
  • Cluster Size: [e.g. 200 Nodes]

Logs, other output, metrics

Please provide logs, other kind of output or observed metrics here.

Additional context

Add any other context about the problem here.
