Member

@fasaxc fasaxc commented Nov 28, 2025

Description

Speculative fix and diagnostics for an issue I'm seeing in FV: sometimes the very first felix container that we launch fails with an error accessing the k8s API server's cert file. Double-check that the copy of the file succeeded, then open and sync it just to make sure.

Switch to using --mount on the felix container since it's supposed to have better error reporting than -v.

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

TBD

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Was seeing the very first felix container fail to start in some
runs with a failure to mount in the cert. Check that we can
stat/open/sync the file after copying it out of the API server's
container.
Copilot AI review requested due to automatic review settings, Nov 28, 2025 14:41
@fasaxc requested a review from a team as a code owner, Nov 28, 2025 14:41
@marvin-tigera added this to the Calico v3.32.0 milestone, Nov 28, 2025
@marvin-tigera added the release-note-required and docs-pr-required labels, Nov 28, 2025
@fasaxc added the docs-not-required and release-note-not-required labels and removed the release-note-required and docs-pr-required labels, Nov 28, 2025
Copilot finished reviewing on behalf of fasaxc, Nov 28, 2025 14:42
@fasaxc force-pushed the diagnose-cert-fail branch from 544ddeb to 2b6008d, Nov 28, 2025 14:48
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses intermittent FV test failures where the first felix container occasionally fails to access the Kubernetes API server certificate file. The fix adds additional verification steps after copying the certificate and switches to using --mount for better error diagnostics.

Key Changes:

  • Enhanced cert file copy retry loop with additional verification steps (stat, open, sync)
  • Switched from -v to --mount syntax for Docker volume bindings to improve error reporting
  • Added cleanup of cert file before each retry attempt

  "-e", "KUBERNETES_MASTER=" + kds.Endpoint,
  "-e", "K8S_INSECURE_SKIP_TLS_VERIFY=true",
- "-v", kds.CertFileName + ":/tmp/apiserver.crt",
+ "--mount", fmt.Sprintf("type=bind,source=%s,target=%s", kds.CertFileName, "/tmp/apiserver.crt"),

Copilot AI Nov 28, 2025

[nitpick] The --mount syntax is missing the readonly option. Since this is a certificate file that should not be modified by the container, consider adding readonly for better security:

"--mount", fmt.Sprintf("type=bind,source=%s,target=%s,readonly", kds.CertFileName, "/tmp/apiserver.crt"),

This prevents accidental modification of the certificate file from within the container.
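One way to keep the two call sites consistent is a small helper that builds the `--mount` value, sketched below. `bindMountArg` is a hypothetical name, not something in this PR, and the cert path in `main` is a placeholder. The sketch assumes neither path contains a comma, since Docker parses the `--mount` value as comma-separated fields.

```go
package main

import (
	"fmt"
	"strings"
)

// bindMountArg is a hypothetical helper (not from the PR) that builds the
// value for docker's --mount flag for a bind mount. It assumes source and
// target contain no commas, because --mount fields are comma-separated.
func bindMountArg(source, target string, readOnly bool) string {
	opts := []string{"type=bind", "source=" + source, "target=" + target}
	if readOnly {
		// "readonly" makes the mount read-only inside the container.
		opts = append(opts, "readonly")
	}
	return strings.Join(opts, ",")
}

func main() {
	certFile := "/tmp/example/apiserver.crt" // placeholder path for illustration
	args := []string{
		"run",
		"--mount", bindMountArg(certFile, "/tmp/apiserver.crt", true),
	}
	fmt.Println(strings.Join(args, " "))
}
```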

  "-e", "K8S_API_ENDPOINT=" + kds.BadEndpoint,
  "-e", "K8S_INSECURE_SKIP_TLS_VERIFY=true",
- "-v", kds.CertFileName + ":/tmp/apiserver.crt",
+ "--mount", fmt.Sprintf("type=bind,source=%s,target=%s", kds.CertFileName, "/tmp/apiserver.crt"),

Copilot AI Nov 28, 2025

[nitpick] The --mount syntax is missing the readonly option. Since this is a certificate file that should not be modified by the container, consider adding readonly for better security:

"--mount", fmt.Sprintf("type=bind,source=%s,target=%s,readonly", kds.CertFileName, "/tmp/apiserver.crt"),

This prevents accidental modification of the certificate file from within the container.

@fasaxc force-pushed the diagnose-cert-fail branch from 2b6008d to 2cbf821, Nov 28, 2025 15:42
Labels

  • docs-not-required: Docs not required for this change
  • release-note-not-required: Change has no user-facing impact

2 participants