Dual-stack Services with externalTrafficPolicy=Local may lose advertisements due to a race between EndpointSlice events #11502

@sergeimonakhov

Expected Behavior

Current Behavior

The current logic infers the Service IP family from its primary ClusterIP:

svcIsIPv6 := isIPv6(svc.Spec.ClusterIP)

In a dual-stack Service:

  • if ipFamilies: [IPv6, IPv4], the ClusterIP will be IPv6
  • if ipFamilies: [IPv4, IPv6], the ClusterIP will be IPv4

Then endpoint validation is done as:

if isIPv6(address) != svcIsIPv6 {
    continue
}

This is where the mismatch happens:

  1. Service has ipFamilies: [IPv6, IPv4] -> ClusterIP is IPv6 -> svcIsIPv6 = true.
  2. The IPv6 EndpointSlice update event arrives first, and all IPv6 endpoints match:
2025-12-04 12:09:35.495 [DEBUG][61] Checking routes for service advertise=true svc="default/nginx-lb"
2025-12-04 12:09:35.495 [DEBUG][61] Setting routes routes=["2001:db8:20::1/128", "203.0.113.81/32", "2001:db8:900::1/128", "203.0.113.16/32"]
  3. The IPv4 EndpointSlice update event arrives afterwards.
  4. All IPv4 endpoints are skipped because isIPv6(address) != svcIsIPv6:
2025-12-04 12:09:35.504 [DEBUG][61] Skipping service with no local endpoints svc="default/nginx-lb"
2025-12-04 12:09:35.504 [DEBUG][61] Checking routes for service advertise=false svc="default/nginx-lb"
  5. The function returns false and the Service's advertisement is withdrawn.

If the order of update events stays consistent (e.g., the IPv6 EndpointSlice always updates first), the Service may never stay advertised at all: the IPv4 EndpointSlice is always evaluated last against the wrong inferred family, and all of its endpoints are skipped because it never contains an IPv6 address.
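The sequence above can be modelled with a minimal Go sketch. It is not the Calico code: `sliceEvent` and `advertiseAfter` are hypothetical names, `isIPv6` is a stand-in built on `net/netip`, and the model assumes that each EndpointSlice event is evaluated in isolation and that its result replaces the previous advertise decision, which is what the logs suggest.

```go
package main

import "net/netip"

// isIPv6 reports whether s parses as an IPv6 address. Hypothetical stand-in
// for the isIPv6 helper referenced in the snippets above.
func isIPv6(s string) bool {
	addr, err := netip.ParseAddr(s)
	return err == nil && addr.Is6() && !addr.Is4In6()
}

// sliceEvent is a simplified stand-in for one EndpointSlice update. It holds
// only the addresses of endpoints that are local to this node.
type sliceEvent struct {
	localAddresses []string
}

// advertiseAfter replays the events in order. Each event is checked on its
// own against the family inferred from the primary ClusterIP, and its result
// overwrites the previous decision (an assumption of this model).
func advertiseAfter(clusterIP string, events []sliceEvent) bool {
	svcIsIPv6 := isIPv6(clusterIP)
	advertise := false
	for _, ev := range events {
		advertise = false
		for _, address := range ev.localAddresses {
			if isIPv6(address) != svcIsIPv6 {
				continue // wrong family for the inferred Service family
			}
			advertise = true
		}
	}
	return advertise
}
```

With an IPv6 primary ClusterIP, replaying an IPv6 slice followed by an IPv4 slice leaves the decision at false in this model, while the reverse order leaves it at true, so the outcome depends only on event order.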

Possible Solution

A possible fix that uses Service.spec.ipFamilies and EndpointSlice.addressType is provided in PR #11503.
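One way such a check could look is sketched below. This is an illustration of the idea only, not the code from the PR: `endpointSlice`, `sliceMatchesService`, and `sliceHasLocalEndpoints` are hypothetical names, and plain strings stand in for the Kubernetes values of ipFamilies ("IPv4", "IPv6") and addressType ("IPv4", "IPv6", "FQDN").

```go
package main

// endpointSlice is a simplified stand-in for a Kubernetes EndpointSlice,
// reduced to its addressType and the endpoint addresses local to this node.
type endpointSlice struct {
	addressType    string
	localAddresses []string
}

// sliceMatchesService reports whether the slice's addressType is one of the
// families listed in the Service's spec.ipFamilies, instead of comparing
// against the family of the primary ClusterIP only.
func sliceMatchesService(ipFamilies []string, addressType string) bool {
	for _, family := range ipFamilies {
		if family == addressType {
			return true
		}
	}
	return false
}

// sliceHasLocalEndpoints reports whether a slice contributes at least one
// local endpoint for a family the Service actually uses.
func sliceHasLocalEndpoints(ipFamilies []string, s endpointSlice) bool {
	return sliceMatchesService(ipFamilies, s.addressType) && len(s.localAddresses) > 0
}
```

Under this check, an IPv4 slice with local endpoints counts for a Service with ipFamilies: [IPv6, IPv4] regardless of which family the primary ClusterIP belongs to, so the result no longer depends on which slice event arrives last.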

Steps to Reproduce (for bugs)

  1. Create a dual-stack Kubernetes cluster
  2. Create a Deployment with two replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:stable
  3. Deploy a Service with ipFamilies: [IPv6, IPv4]
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb
  namespace: default
  annotations:
    "projectcalico.org/loadBalancerIPs": '["2001:db8:20::1","203.0.113.81"]'
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ipFamilies: [IPv6, IPv4]
  selector:
    app: nginx
  ports:
  - port: 80
  4. Trigger rollouts to reproduce the race between the two EndpointSlice updates
kubectl rollout restart deployment/nginx

Context

We observed this behavior in a production environment, where some dual-stack LoadBalancer Services with externalTrafficPolicy: Local intermittently lost their local route advertisements. This prompted an investigation that led us to identify a race in the Service IP family matching logic.

Your Environment

  • Calico version: v3.31.2
  • Calico dataplane (bpf, nftables, iptables, windows etc.): bpf
  • Orchestrator version (e.g. kubernetes, openshift, etc.): v1.33.6
  • Operating System and version: Ubuntu 24.04.4
  • Link to your project (optional):
