Tolerate Machine ProviderID to support slow Infrastructure Providers #51
Conversation
/cc @elmiko

thanks @Mmduh-483, i am going to need some time to review this.
elmiko
left a comment
i'm still reviewing this, and i would like to test it out locally.
@Mmduh-483 i'm curious, have you run this with providers that take more than a minute to return a provider id?
i'm curious how well karpenter tolerates adding the provider id later in the creation cycle.
pkg/cloudprovider/cloudprovider.go
Outdated
```go
// since we have a Machine, we should be reducing the replicas and annotating the Machine for deletion.
return nil, fmt.Errorf("cannot satisfy create, unable to label Machine %q as a member: %w", machine.Name, err)
if machine.Spec.ProviderID == nil {
	return nil, fmt.Errorf("cannot satisfy create, waiting Machine %q to have ProviderID", machine.Name)
```
minor nit
```diff
- return nil, fmt.Errorf("cannot satisfy create, waiting Machine %q to have ProviderID", machine.Name)
+ return nil, fmt.Errorf("cannot satisfy create, waiting for Machine %q to have ProviderID", machine.Name)
```
The Kubernetes infrastructure is provisioned using KubeKey, which assigns the providerID to each node only after Kubernetes has been successfully installed and is running on that node. Karpenter's NodeClaim lifecycle reconciliation initiates a call to the CloudProvider's Create function. If the provisioned Machine does not yet have a providerID, Karpenter logs a transitional error indicating that it is waiting for this identifier, and since Create returned an error, Karpenter automatically re-attempts the reconciliation of the NodeClaim. To maintain the necessary binding between the control plane object and the physical infrastructure, the associated Machine's name and namespace are applied as annotations to the NodeClaim resource.
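Putting that description together, here is a minimal sketch of the flow under discussion. waitForProviderID is a hypothetical helper name, machineNamespaceAnnotation's value is an assumption, and the annotation keys match the patch as it stands at this point in the review (the prefix is revisited further down):

```go
package cloudprovider

import (
	"context"
	"fmt"

	capiv1beta1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	karpv1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

const (
	machineAnnotation          = "cluster-autoscaler.kubernetes.io/machine"
	machineNamespaceAnnotation = "cluster-autoscaler.kubernetes.io/machine-namespace" // assumed value
)

// waitForProviderID records the Machine's identity on the NodeClaim, then
// returns a transitional error while the InfrastructureProvider has not yet
// set the ProviderID. karpenter-core treats the error as a signal to retry
// the NodeClaim reconcile, so the Machine is not deleted in the meantime.
func waitForProviderID(ctx context.Context, kubeClient client.Client, nodeClaim *karpv1.NodeClaim, machine *capiv1beta1.Machine) error {
	if nodeClaim.Annotations == nil {
		nodeClaim.Annotations = map[string]string{}
	}
	// Bind the NodeClaim to the Machine so a later reconcile can find the
	// same Machine and pick up its ProviderID once it is assigned.
	nodeClaim.Annotations[machineAnnotation] = machine.Name
	nodeClaim.Annotations[machineNamespaceAnnotation] = machine.Namespace
	if err := kubeClient.Update(ctx, nodeClaim); err != nil {
		return fmt.Errorf("unable to update NodeClaim annotations %q: %w", nodeClaim.Name, err)
	}
	if machine.Spec.ProviderID == nil {
		// Transitional: KubeKey has not finished installing Kubernetes on
		// the node yet, so the ProviderID is still unset.
		return fmt.Errorf("waiting for Machine %q to have ProviderID", machine.Name)
	}
	return nil
}
```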
i appreciate the detailed explanation.
have you run this patch with the kubekey provider? how well does it work?
yes
Autoscaling up/down is working, I did a few rounds of testing.
force-pushed cc5d3ff to f6d3c2d
awesome, thanks! i want to run a local test or two, should be able to finish by end of week.
elmiko
left a comment
i think this is making sense to me. using annotations on the nodeclaim is a novel approach to solving the async provider id.
one recommendation i have: could we combine the machine and namespace annotations into a single annotation, something like cluster-api.kubernetes.io/machine-with-ns: <machine namespace>/<machine name>? would there be any issues with this design?
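for what it's worth, a quick sketch of how that combined annotation could be written and parsed (setMachineRef and machineRef are hypothetical helper names, not part of this PR):

```go
package cloudprovider

import (
	"strings"

	"k8s.io/apimachinery/pkg/types"
	capiv1beta1 "sigs.k8s.io/cluster-api/api/v1beta1"
	karpv1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

const machineWithNSAnnotation = "cluster-api.kubernetes.io/machine-with-ns"

// setMachineRef stores the Machine's identity on the NodeClaim as a single
// <machine namespace>/<machine name> annotation.
func setMachineRef(nodeClaim *karpv1.NodeClaim, machine *capiv1beta1.Machine) {
	if nodeClaim.Annotations == nil {
		nodeClaim.Annotations = map[string]string{}
	}
	nodeClaim.Annotations[machineWithNSAnnotation] = machine.Namespace + "/" + machine.Name
}

// machineRef parses the annotation back into a NamespacedName; the boolean
// reports whether a well-formed annotation was present.
func machineRef(nodeClaim *karpv1.NodeClaim) (types.NamespacedName, bool) {
	v, ok := nodeClaim.Annotations[machineWithNSAnnotation]
	if !ok {
		return types.NamespacedName{}, false
	}
	ns, name, found := strings.Cut(v, "/")
	if !found || ns == "" || name == "" {
		return types.NamespacedName{}, false
	}
	return types.NamespacedName{Namespace: ns, Name: name}, true
}
```

since Kubernetes object names cannot contain "/", the separator is unambiguous when parsing.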
maxcao13
left a comment
What happens if the cloud provider never sets a providerID on the machine? I have not tried this, but I'm assuming that the NodeClaim would never become registered and Karpenter just retries again. But I am not sure what happens to the Machine after. Does the karpenter-core code call cloudprovider.Delete() eventually if this happens? Is this what you observed?
pkg/cloudprovider/cloudprovider.go
Outdated
```go
if machine.Spec.ProviderID == nil {
	nodeClaim.Annotations[machineAnnotation] = machine.Name
	nodeClaim.Annotations[machineNamespaceAnnotation] = machine.Namespace
	if err = c.kubeClient.Update(ctx, nodeClaim); err != nil {
		return nil, nil, fmt.Errorf("cannot satisfy create, unable to update NodeClaim annotaitons %q: %w", nodeClaim.Name, err)
	}
}
```
Does this have to be conditional? Would there be any harm from always setting these annotations, even if the providerID has already been set?
No need for it to be conditional; updated to always set the annotations.
pkg/cloudprovider/cloudprovider.go
Outdated
```go
	return []cloudprovider.RepairPolicy{}
}

func (c *CloudProvider) createOrGetMachine(ctx context.Context, nodeClaim *karpv1.NodeClaim) (*capiv1beta1.MachineDeployment, *capiv1beta1.Machine, error) {
```
Nothing to do with the implementation, but I think this might benefit from either a better name or a well-worded comment above the function definition.
I think it's a little confusing to name this createOrGetMachine because it might give the impression that the function is using some sort of caching mechanism where we create a Machine if it doesn't already exist. But actually, the "get" is supposed to act as a status check for the asynchronous provisioning.
Renamed to provisionMachine with comment
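for readers skimming the thread, the doc comment on the renamed function might read roughly like this (the body is elided; only the signature comes from the diff above, the wording is a guess):

```go
// provisionMachine creates a CAPI Machine for the NodeClaim if one does not
// exist yet; on subsequent reconciles it fetches the previously created
// Machine instead. The "get" half is a status check on the asynchronous
// provisioning (has the InfrastructureProvider set the ProviderID yet?),
// not a cache lookup.
func (c *CloudProvider) provisionMachine(ctx context.Context, nodeClaim *karpv1.NodeClaim) (*capiv1beta1.MachineDeployment, *capiv1beta1.Machine, error) {
```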
pkg/cloudprovider/cloudprovider.go
Outdated
```go
nodeClaim.Annotations[machineAnnotation] = machine.Name
nodeClaim.Annotations[machineNamespaceAnnotation] = machine.Namespace
if err = c.kubeClient.Update(ctx, nodeClaim); err != nil {
	return nil, nil, fmt.Errorf("cannot satisfy create, unable to update NodeClaim annotaitons %q: %w", nodeClaim.Name, err)
```
nit
```diff
- return nil, nil, fmt.Errorf("cannot satisfy create, unable to update NodeClaim annotaitons %q: %w", nodeClaim.Name, err)
+ return nil, nil, fmt.Errorf("cannot satisfy create, unable to update NodeClaim annotations %q: %w", nodeClaim.Name, err)
```
updated
Currently, if CAPI Karpenter doesn't manage to get a Machine with a ProviderID, it will keep retrying.
force-pushed f6d3c2d to 09f5d40
elmiko
left a comment
thanks for the updates @Mmduh-483, i think we are close. i just noticed one minor change i would like, then i'm ok to approve.
pkg/cloudprovider/cloudprovider.go
Outdated
```go
	taintsKey  = "capacity.cluster-autoscaler.kubernetes.io/taints"
	maxPodsKey = "capacity.cluster-autoscaler.kubernetes.io/maxPods"

	machineAnnotation = "cluster-autoscaler.kubernetes.io/machine"
```
sorry to be a pain about this, i think we should use the cluster-api prefix for this:
| machineAnnotation = "cluster-autoscaler.kubernetes.io/machine" | |
| machineAnnotation = "cluster.x-k8s.io/machine" |
mostly because this annotation is not associated with the cluster-autoscaler. those other annotations have some history with the autoscaler as they originated there in the scale from zero enhancement for cluster-api. i kept the original prefixes on them so that we could share implementation details between karpenter and cluster-autoscaler. but for this machine annotation, we should make it clear that it is related to cluster-api.
updated
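after the rename, the constants block presumably reads along these lines (the grouping is an assumption; only the machine key's new value is shown in the suggestion above):

```go
const (
	// these keys originated in the cluster-autoscaler's scale-from-zero
	// support for cluster-api, so they keep the cluster-autoscaler prefix
	taintsKey  = "capacity.cluster-autoscaler.kubernetes.io/taints"
	maxPodsKey = "capacity.cluster-autoscaler.kubernetes.io/maxPods"

	// the machine annotation is cluster-api specific, hence the new prefix
	machineAnnotation = "cluster.x-k8s.io/machine"
)
```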
Tolerate Machine ProviderID to support slow Infrastructure Providers

This change updates the CloudProvider reconciliation logic to allow the use of a newly created CAPI Machine even if it does not immediately have the ProviderID set.

Currently, the reconciliation will revert the MachineDeployment's replica count and delete the Machine if ProviderID is not assigned within the timeout period. This is problematic for InfrastructureProviders that require significant time to set the ProviderID.

This change allows the reconciliation to proceed with the Machine's creation, preventing premature deletion and supporting providers that have a slow time-to-providerID. The system will continue to wait for the Machine to eventually obtain a ProviderID. This enhances compatibility with slow-to-provision infrastructures.

Signed-off-by: Mamduh Alassi <[email protected]>
force-pushed 09f5d40 to b64a507
elmiko
left a comment
thanks for all the updates @Mmduh-483, this is looking good to me. if @maxcao13 is also good with it, i'm happy to approve.
/lgtm
/lgtm thanks!

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, Mmduh-483

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.