Skip to content

Conversation

@bryan-cox
Copy link
Member

@bryan-cox bryan-cox commented Nov 12, 2025

Summary

Fixes DNS record leaks in Azure DNS zones by adding explicit cleanup before AKS cluster deletion.

Problem

We recently hit the maximum DNS zone record limit of 10,000 records in the aks-e2e.hypershift.azure.devcluster.openshift.com zone. Investigation revealed DNS records were leaking during AKS e2e and conformance test runs.

Root Cause: Race condition in the deprovision workflow

  1. HyperShift cluster deletion triggers DNS cleanup via external-dns
  2. But AKS cluster (hosting external-dns controller) is deleted immediately after
  3. external-dns doesn't have time to process the deletion events before it's killed
  4. DNS records accumulate over time, eventually hitting the zone limit

Reference: /contrib/ci/Azure/Manage Azure Cloud Resources/Deleting-DNS-Zone-Recordsets.md in the hypershift repo documents this known issue.

Solution

Created a new step-registry step that explicitly cleans up DNS records after the HostedCluster is destroyed but before the AKS cluster is deleted.

Implementation

New step: hypershift-azure-cleanup-external-dns

  • Authenticates to Azure using hypershift credentials
  • Lists DNS records matching the cluster name pattern
  • Deletes all non-NS/SOA records (A, CNAME, TXT records created by external-dns)
  • Gracefully handles missing DNS zones or cluster names
  • Provides detailed logging and error handling

Updated Deprovision Flow:

- chain: hypershift-azure-destroy              # Deletes HostedCluster
- ref: hypershift-azure-cleanup-external-dns   # NEW: Explicit DNS cleanup
- chain: cucushift-installer-rehearse-azure-aks-deprovision  # Deletes AKS cluster

Changes

New Files

  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-ref.yaml
  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-commands.sh
  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-ref.metadata.json
  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/OWNERS

Modified Files

Updated 8 AKS HyperShift deprovision chains to include the cleanup step:

  • cucushift-installer-rehearse-azure-aks-hypershift-base-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-cilium-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-byo-vnet-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-ephemeral-creds-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-etcd-disk-encryption-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-registry-overrides-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-heterogeneous-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-disaster-recovery-infra-deprovision-chain.yaml

Test Plan

  • CI tests pass with the new cleanup step
  • Verify DNS records are cleaned up after test runs
  • Monitor DNS zone record count over time

Related

Jira: CNTRLPLANE-1857

🤖 Generated with Claude Code

@openshift-ci openshift-ci bot requested review from gpei and jianlinliu November 12, 2025 21:03
@bryan-cox
Copy link
Member Author

Update: Added conformance workflow fix

Good catch on the main AKS workflows! I've added a commit to also update the hypershift-azure-aks-conformance workflow, which was calling the deprovision chain directly and bypassing the DNS cleanup.

Changes in latest commit:

  • Updated hypershift-azure-aks-conformance-workflow.yaml to include hypershift-azure-cleanup-external-dns in post steps

Note on hypershift-azure-aks-e2e workflow:

The e2e workflow is structured differently - it installs the HyperShift operator and then runs test suites that create/destroy their own clusters internally. It doesn't create a single cluster in the pre steps, so there's no specific cluster-name available in the post steps for cleanup.

For the e2e workflow, we may need a different cleanup approach (like cleaning up all records older than X hours), but that's riskier as it could interfere with concurrent tests. Will monitor to see if this workflow is also leaking records or if the test suite handles cleanup properly.

Total workflows now protected from DNS leaks: 9 (8 specialized deprovision chains + conformance workflow)

@bryan-cox
Copy link
Member Author

Update: Added e2e workflow fix + ran make jobs

Added another commit to update the hypershift-azure-aks-e2e workflow as well. Also ran make jobs to ensure Prow configs are regenerated.

All AKS workflows now protected (10 total):

Specialized Deprovision Chains (8):

  1. cucushift-installer-rehearse-azure-aks-hypershift-base-deprovision
  2. cucushift-installer-rehearse-azure-aks-hypershift-cilium-deprovision
  3. cucushift-installer-rehearse-azure-aks-hypershift-byo-vnet-deprovision
  4. cucushift-installer-rehearse-azure-aks-hypershift-ephemeral-creds-deprovision
  5. cucushift-installer-rehearse-azure-aks-hypershift-etcd-disk-encryption-deprovision
  6. cucushift-installer-rehearse-azure-aks-hypershift-registry-overrides-deprovision
  7. cucushift-installer-rehearse-azure-aks-hypershift-heterogeneous-deprovision
  8. cucushift-installer-rehearse-azure-aks-hypershift-disaster-recovery-infra-deprovision

Main Workflows (2):
9. hypershift-azure-aks-conformance (creates single cluster, has cluster-name)
10. hypershift-azure-aks-e2e (runs test suites with multiple clusters)

For the e2e workflow, the cleanup will gracefully skip if there's no cluster-name file (since the e2e tests manage multiple ephemeral clusters internally). But this provides protection if the framework does write cluster names.

@bryan-cox
Copy link
Member Author

/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks

@openshift-ci-robot
Copy link
Contributor

@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@bryan-cox
Copy link
Member Author

/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks

@openshift-ci-robot
Copy link
Contributor

@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@bryan-cox bryan-cox changed the title feat(hypershift): add explicit DNS cleanup for AKS test deprovision CNTRLPLANE-1857: feat(hypershift): add explicit DNS cleanup for AKS test deprovision Nov 13, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 13, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 13, 2025

@bryan-cox: This pull request references CNTRLPLANE-1857 which is a valid jira issue.

In response to this:

Summary

Fixes DNS record leaks in Azure DNS zones by adding explicit cleanup before AKS cluster deletion.

Problem

We recently hit the maximum DNS zone record limit of 10,000 records in the aks-e2e.hypershift.azure.devcluster.openshift.com zone. Investigation revealed DNS records were leaking during AKS e2e and conformance test runs.

Root Cause: Race condition in the deprovision workflow

  1. HyperShift cluster deletion triggers DNS cleanup via external-dns
  2. But AKS cluster (hosting external-dns controller) is deleted immediately after
  3. external-dns doesn't have time to process the deletion events before it's killed
  4. DNS records accumulate over time, eventually hitting the zone limit

Reference: /contrib/ci/Azure/Manage Azure Cloud Resources/Deleting-DNS-Zone-Recordsets.md in the hypershift repo documents this known issue.

Solution

Created a new step-registry step that explicitly cleans up DNS records after the HostedCluster is destroyed but before the AKS cluster is deleted.

Implementation

New step: hypershift-azure-cleanup-external-dns

  • Authenticates to Azure using hypershift credentials
  • Lists DNS records matching the cluster name pattern
  • Deletes all non-NS/SOA records (A, CNAME, TXT records created by external-dns)
  • Gracefully handles missing DNS zones or cluster names
  • Provides detailed logging and error handling

Updated Deprovision Flow:

- chain: hypershift-azure-destroy              # Deletes HostedCluster
- ref: hypershift-azure-cleanup-external-dns   # NEW: Explicit DNS cleanup
- chain: cucushift-installer-rehearse-azure-aks-deprovision  # Deletes AKS cluster

Changes

New Files

  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-ref.yaml
  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-commands.sh
  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-ref.metadata.json
  • ci-operator/step-registry/hypershift/azure/cleanup-external-dns/OWNERS

Modified Files

Updated 8 AKS HyperShift deprovision chains to include the cleanup step:

  • cucushift-installer-rehearse-azure-aks-hypershift-base-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-cilium-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-byo-vnet-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-ephemeral-creds-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-etcd-disk-encryption-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-registry-overrides-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-heterogeneous-deprovision-chain.yaml
  • cucushift-installer-rehearse-azure-aks-hypershift-disaster-recovery-infra-deprovision-chain.yaml

Test Plan

  • CI tests pass with the new cleanup step
  • Verify DNS records are cleaned up after test runs
  • Monitor DNS zone record count over time

Related

Jira: CNTRLPLANE-1857

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox
Copy link
Member Author

Fix Pushed: DNS Cleanup Now Works for E2E Tests

Issue Found

After analyzing the test artifacts from the previous run, I discovered why the cleanup wasn't working:

Cleaning up DNS records for cluster: ci-op-jyhti86f-23b84-aks-cluster
Found 0 DNS record(s) to delete

Root Cause: The script was searching for DNS records matching ci-op-jyhti86f-23b84-aks-cluster (the AKS management cluster name), but external-DNS creates records for the HyperShift cluster names created by the e2e test suite (like api-test-abc123, etc.).

The Fix

Updated the cleanup script to detect when it has an AKS cluster name and switch to "all-records" mode:

Before:

  • Only deleted records matching the cluster name from ${SHARED_DIR}/cluster-name
  • Failed for e2e tests because that file contains the AKS cluster name, not HyperShift cluster names

After:

  • Detects AKS cluster names (contains -aks-cluster)
  • Switches to "all-records" mode which deletes ALL non-NS/SOA records in the zone
  • Still uses cluster-specific mode for single-cluster workflows (like conformance)

Why All-Records Mode is Safe

  1. Each test uses a dedicated DNS zone (aks-e2e.hypershift.azure.devcluster.openshift.com)
  2. Tests don't run concurrently in the same zone
  3. Only non-NS/SOA records are deleted (preserves zone infrastructure)

Testing

The next test run should show:

Detected AKS cluster name (ci-op-xxx-aks-cluster), switching to cleanup all records mode
Listing ALL non-NS/SOA DNS records in zone...
Found X DNS record(s) to delete

And successfully clean up all the leaked records.

@bryan-cox
Copy link
Member Author

/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks

@openshift-ci-robot
Copy link
Contributor

@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@bryan-cox
Copy link
Member Author

/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks

@openshift-ci-robot
Copy link
Contributor

@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Add explicit DNS cleanup step to prevent DNS record leaks in Azure AKS
test workflows. The issue was identified when the DNS zone hit the 10,000
record limit due to leaked records from external-dns.

Root cause:
- HyperShift cluster deletion triggers external-dns cleanup
- AKS management cluster gets deleted immediately after
- external-dns controller is killed before it can process deletion events
- DNS records accumulate over time in shared DNS zones

Solution:
Query the management cluster for all HostedClusters and extract their
infraIDs, then clean up DNS records matching those infraIDs. This approach:

- Works for e2e tests (multiple clusters with different infraIDs)
- Works for conformance tests (single cluster)
- Only cleans up DNS records from THIS test run
- Safe for shared DNS zones (won't delete concurrent test records)

Implementation:
1. Query: kubectl get hostedclusters --all-namespaces -o jsonpath='{.spec.infraID}'
2. Find DNS records containing any of those infraIDs
3. Delete only those matching records

Example:
- HostedCluster has infraID: autoscaling-9hpz5
- DNS records: api-autoscaling-9hpz5, a-api-autoscaling-9hpz5-external-dns
- Cleanup: Deletes records containing 'autoscaling-9hpz5'
- Preserves: Records from other tests (different infraIDs)

Updated workflows:
- hypershift-azure-aks-conformance (single cluster)
- hypershift-azure-aks-e2e (multiple clusters)
- 8 specialized deprovision chains

Fixes: CNTRLPLANE-1857

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@bryan-cox
Copy link
Member Author

/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks

@openshift-ci-robot
Copy link
Contributor

@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Add dedicated e2e-aks-override job to test Control Plane Operator (CPO)
overrides on Azure AKS platform. This complements the existing
e2e-aws-override job.

Key features:
- Triggered only when overrides.yaml is modified (run_if_changed)
- Sets TEST_CPO_OVERRIDE=1 to enable override testing
- Uses same workflow as regular AKS tests (hypershift-azure-aks-e2e)
- Paired with runTests field in overrides.yaml for granular control

This allows testing Azure CPO overrides independently from AWS overrides,
saving CI resources by skipping tests for platforms not being modified.

Related:
- hypershift PR openshift#7206 (runTests field implementation)
- CNTRLPLANE-1893

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bryan-cox
Once this PR has been reviewed and has the lgtm label, please assign liangquanli930 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

[REHEARSALNOTIFIER]
@bryan-cox: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-csi-operator-main-hypershift-e2e-aks openshift/csi-operator presubmit Registry content changed
pull-ci-openshift-csi-operator-release-4.22-hypershift-e2e-aks openshift/csi-operator presubmit Registry content changed
pull-ci-openshift-csi-operator-release-4.21-hypershift-e2e-aks openshift/csi-operator presubmit Registry content changed
pull-ci-openshift-csi-operator-release-4.20-hypershift-e2e-aks openshift/csi-operator presubmit Registry content changed
pull-ci-openshift-csi-operator-release-4.19-hypershift-e2e-aks openshift/csi-operator presubmit Registry content changed
pull-ci-openshift-cloud-network-config-controller-main-hypershift-e2e-aks openshift/cloud-network-config-controller presubmit Registry content changed
pull-ci-openshift-cloud-network-config-controller-release-4.22-hypershift-e2e-aks openshift/cloud-network-config-controller presubmit Registry content changed
pull-ci-openshift-cloud-network-config-controller-release-4.21-hypershift-e2e-aks openshift/cloud-network-config-controller presubmit Registry content changed
pull-ci-openshift-cloud-network-config-controller-release-4.20-hypershift-e2e-aks openshift/cloud-network-config-controller presubmit Registry content changed
pull-ci-openshift-cloud-network-config-controller-release-4.19-hypershift-e2e-aks openshift/cloud-network-config-controller presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-master-hypershift-e2e-aks openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.22-hypershift-e2e-aks openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.21-hypershift-e2e-aks openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.20-hypershift-e2e-aks openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-network-operator-release-4.19-hypershift-e2e-aks openshift/cluster-network-operator presubmit Registry content changed
pull-ci-openshift-cluster-ingress-operator-master-hypershift-e2e-aks openshift/cluster-ingress-operator presubmit Registry content changed
pull-ci-openshift-cluster-ingress-operator-release-4.22-hypershift-e2e-aks openshift/cluster-ingress-operator presubmit Registry content changed
pull-ci-openshift-cluster-ingress-operator-release-4.21-hypershift-e2e-aks openshift/cluster-ingress-operator presubmit Registry content changed
pull-ci-openshift-cluster-ingress-operator-release-4.20-hypershift-e2e-aks openshift/cluster-ingress-operator presubmit Registry content changed
pull-ci-openshift-cluster-ingress-operator-release-4.19-hypershift-e2e-aks openshift/cluster-ingress-operator presubmit Registry content changed
pull-ci-openshift-cluster-storage-operator-main-hypershift-e2e-aks openshift/cluster-storage-operator presubmit Registry content changed
pull-ci-openshift-cluster-storage-operator-release-4.22-hypershift-e2e-aks openshift/cluster-storage-operator presubmit Registry content changed
pull-ci-openshift-cluster-storage-operator-release-4.21-hypershift-e2e-aks openshift/cluster-storage-operator presubmit Registry content changed
pull-ci-openshift-cluster-storage-operator-release-4.20-hypershift-e2e-aks openshift/cluster-storage-operator presubmit Registry content changed
pull-ci-openshift-cluster-storage-operator-release-4.19-hypershift-e2e-aks openshift/cluster-storage-operator presubmit Registry content changed

A total of 120 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@bryan-cox
Copy link
Member Author

/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks

@openshift-ci-robot
Copy link
Contributor

@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 18, 2025

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/generated-config 35b2036 link true /test generated-config

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants