fix: Ensure Orphan CR Retention in Longhorn When orphan-auto-deletion is Set to False #3961

nzhan126 · 2025-08-04T21:06:42Z

Which issue(s) this PR fixes:

Issue longhorn/longhorn#7795

What this PR does / why we need it:

Delete the orphan CR only when the node is deleted or evicting, or when the actual data of the orphan CR no longer exists.

Special notes for your reviewer:

Additional documentation or context

coderabbitai · 2025-08-04T21:06:47Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


shuo-wu

The suggested solution includes 2 parts:

1. We shouldn't clean up an orphan CR when its node is down, its manager pod is missing, or the Longhorn manager cannot [get disk stat](https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/util/util.go#L434-L435)
2. We should still delete the orphan CR when the node is deleted or evicting, or when the actual data of the orphan CR no longer exists

How does this PR determine that the actual data of the orphan CR no longer exists?

controller/orphan_controller.go

nzhan126 · 2025-08-05T19:17:55Z

@shuo-wu
I suppose it is done here:

longhorn-manager/controller/node_controller.go

Lines 1356 to 1359 in dd2415b

	dataStore := orphan.Spec.Parameters[longhorn.OrphanDataName]
	if _, ok := replicaDataStores[dataStore]; !ok {
		missingOrphanedReplicaDataStores[dataStore] = ""
	}
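
For readers without the surrounding file open, here is a minimal, self-contained sketch of the same membership check. It is not the code from `node_controller.go`: the `orphan` struct, the `orphanDataNameKey` constant, and the `findMissingDataStores` helper are hypothetical stand-ins, and the on-disk set is assumed to be a plain map keyed by data store name.

```go
package main

import "fmt"

// orphanDataNameKey is a hypothetical stand-in for longhorn.OrphanDataName.
const orphanDataNameKey = "DataName"

// orphan is a simplified, hypothetical stand-in for the Longhorn Orphan CR.
type orphan struct {
	Name       string
	Parameters map[string]string
}

// findMissingDataStores returns the data store names that orphan CRs refer to
// but that are absent from the set of data stores found on disk.
func findMissingDataStores(orphans []orphan, onDisk map[string]string) map[string]string {
	missing := map[string]string{}
	for _, o := range orphans {
		dataStore := o.Parameters[orphanDataNameKey]
		if _, ok := onDisk[dataStore]; !ok {
			// The CR points at data that is no longer there.
			missing[dataStore] = ""
		}
	}
	return missing
}

func main() {
	onDisk := map[string]string{"pvc-a-r-1": ""}
	orphans := []orphan{
		{Name: "orphan-1", Parameters: map[string]string{orphanDataNameKey: "pvc-a-r-1"}},
		{Name: "orphan-2", Parameters: map[string]string{orphanDataNameKey: "pvc-b-r-2"}},
	}
	fmt.Println(findMissingDataStores(orphans, onDisk))
}
```

An orphan CR whose data store ends up in the returned map would be a candidate for CR deletion, since there is no data left for it to describe.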

controller/orphan_controller.go

PhanLe1010 · 2025-08-12T02:14:10Z

Hi @derekbit @COLDTURNIP, could I ask a question about the orphan auto-deletion setting?

The current code is

		Description: "This setting allows Longhorn to automatically delete orphan resources and their corresponding orphaned resources. \n\n" +
			"Orphan resources located on nodes that are in down or unknown state will not be cleaned up automatically. \n\n" +
			"List the enabled resource types in a semicolon-separated list. \n\n" +
			"Available items are: \n\n" +
			"- **replica-data**: replica data store \n\n" +
			"- **instance**: engine and replica runtime instance \n\n",
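
As an illustration of the semicolon-separated format that this description refers to, the value could be parsed roughly as follows. This is only a sketch: `parseAutoDeletionTypes` is a hypothetical helper, not a function from longhorn-manager, and rejecting unknown items is an assumption rather than confirmed behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAutoDeletionTypes splits a semicolon-separated setting value into the
// set of enabled resource types. Empty items are ignored; unknown items are
// reported as an error (an assumption made for this sketch).
func parseAutoDeletionTypes(value string) (map[string]bool, error) {
	known := map[string]bool{"replica-data": true, "instance": true}
	enabled := map[string]bool{}
	for _, item := range strings.Split(value, ";") {
		item = strings.TrimSpace(item)
		if item == "" {
			continue
		}
		if !known[item] {
			return nil, fmt.Errorf("unknown orphan resource type %q", item)
		}
		enabled[item] = true
	}
	return enabled, nil
}

func main() {
	enabled, err := parseAutoDeletionTypes("replica-data; instance")
	fmt.Println(enabled, err)
}
```

With this shape, an empty value yields an empty set, which corresponds to auto deletion being off for every resource type.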

From the design perspective, what would be the expectation?

1. When a node is down or its manager pod is missing, we do not delete orphan CRs at all.
2. Or, when a node is down or its manager pod is missing, we do delete orphan CRs but do not actually delete the data.

PhanLe1010 · 2025-08-12T02:14:26Z

This PR is related to longhorn/longhorn#7795

COLDTURNIP · 2025-08-12T02:49:40Z

Leaving aside the impact of this setting, I think the 2nd one is expected. The CR reflects the status of the resource in the cluster. When a resource is no longer reachable from the cluster (e.g. the node leaves), the corresponding CR should be deleted. Once a node is unreachable by Longhorn, the underlying resources, including the data that landed on disk, are no longer controlled by Longhorn, unless an eviction is required. Back to auto deletion: when this setting is enabled, we expect the controller to clear the orphaned resource as soon as it is identified. When auto deletion is not enabled, the CRs are expected to reflect the detectability of the resource, and they are deleted only when the user asks.

When auto deletion is enabled:

1. The resource monitor detects that a resource has become orphaned and creates an orphan CR for it.
2. The controller then clears the orphan CR immediately to clean up the orphaned resource.

Regardless of whether auto deletion is enabled, when a node/disk eviction is requested:

- Clear the orphaned resources on the node/disk by deleting the orphan CR.

Regardless of whether auto deletion is enabled, when a node leaves Longhorn without eviction and there are still resources on the leaving node:

- Since the resource is no longer reachable by Longhorn, delete the orphan CR directly without touching the disk resource.
- Runtime instances live in the instance manager, and the instance manager is terminated along with the node, so the orphaned instance is also cleared. This removal is not caused by the auto deletion setting.
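
The cases above can be read as a small decision function. The sketch below is one interpretation of this comment, not the controller's actual logic: `orphanState`, `action`, and `decide` are hypothetical names, and letting unreachability take precedence over the other two inputs is an assumption.

```go
package main

import "fmt"

// action is what the controller would do with one orphan CR.
type action int

const (
	retainCR        action = iota // keep the CR until the user deletes it
	deleteCROnly                  // drop the CR, leave the on-disk data alone
	deleteCRAndData               // drop the CR and clean up the orphaned resource
)

func (a action) String() string {
	return [...]string{"retain CR", "delete CR only", "delete CR and data"}[a]
}

// orphanState is a hypothetical summary of the inputs discussed in the thread.
type orphanState struct {
	AutoDeletionEnabled bool
	EvictionRequested   bool
	NodeReachable       bool
}

// decide maps the three inputs to an action, following the cases listed above.
func decide(s orphanState) action {
	switch {
	case !s.NodeReachable:
		// The node left Longhorn: the data cannot be touched, only the CR goes.
		return deleteCROnly
	case s.EvictionRequested:
		// Eviction clears orphaned resources regardless of the setting.
		return deleteCRAndData
	case s.AutoDeletionEnabled:
		// Auto deletion clears the orphan as soon as it is identified.
		return deleteCRAndData
	default:
		// Setting disabled and node healthy: wait for the user.
		return retainCR
	}
}

func main() {
	fmt.Println(decide(orphanState{NodeReachable: true}))
	fmt.Println(decide(orphanState{NodeReachable: false, AutoDeletionEnabled: true}))
}
```

Writing the rules as a single `switch` makes the precedence explicit, which is the part of the policy that is easiest to get wrong when the checks are spread across several controllers.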

derekbit · 2025-08-12T17:08:21Z

Second that 2 is expected.

shuo-wu · 2025-08-12T20:47:33Z

> Regardless of whether auto deletion is enabled, when a node leaves Longhorn without eviction and there are still resources on the leaving node:
>
> - Since the resource is no longer reachable by Longhorn, delete the orphan CR directly without touching the disk resource.
> - Runtime instances live in the instance manager, and the instance manager is terminated along with the node, so the orphaned instance is also cleared. This removal is not caused by the auto deletion setting.

This makes sense.

> Regardless of whether auto deletion is enabled, when a node/disk eviction is requested:
>
> - Clear the orphaned resources on the node/disk by deleting the orphan CR.

For the eviction case, I would prefer to retain the related data resource and clean up only the orphan CR. One typical eviction case is a node upgrade, after which Longhorn may be able to continue using the retained data.

Besides, we need to be careful about one corner case in the node down/manager missing scenario: if the node suddenly becomes healthy again while orphan CRs are being deleted, Longhorn must not accidentally clean up the related data.
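
One way to guard against that corner case is to record the "do not touch data" decision at the moment deletion starts and to honor it later, even if the node has recovered in the meantime. The sketch below only illustrates the idea and is not how this PR implements it: `orphanCR`, `SkipDataCleanup`, `beginDeletion`, and `finalize` are all hypothetical.

```go
package main

import "fmt"

// orphanCR is a simplified, hypothetical stand-in for the Orphan CR.
type orphanCR struct {
	Name string
	// SkipDataCleanup is a hypothetical marker recorded when deletion starts.
	SkipDataCleanup bool
}

// beginDeletion records whether the data may be touched, based on the node
// state observed at the moment the CR deletion is initiated.
func beginDeletion(o *orphanCR, nodeReady bool) {
	if !nodeReady {
		o.SkipDataCleanup = true
	}
}

// finalize runs later, possibly after the node has recovered. It honors the
// recorded marker, so a late recovery cannot trigger data cleanup.
func finalize(o *orphanCR, nodeReadyNow bool, deleteData func(name string) error) error {
	if o.SkipDataCleanup || !nodeReadyNow {
		return nil // remove the CR only
	}
	return deleteData(o.Name)
}

func main() {
	o := &orphanCR{Name: "orphan-1"}
	beginDeletion(o, false) // node was down when deletion started
	err := finalize(o, true, func(name string) error {
		fmt.Println("deleting data for", name)
		return nil
	})
	fmt.Println("data cleanup skipped:", o.SkipDataCleanup, "err:", err)
}
```

The design choice here is that the decision is made once and stored with the object, rather than re-derived from the current node state in the finalizer, which is where the race would otherwise appear.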

ref:7795 Signed-off-by: Nina Zhan <[email protected]>

github-actions bot assigned nzhan126 Aug 4, 2025

nzhan126 force-pushed the issue7795 branch 2 times, most recently from 74e9877 to 8ea7fbe Compare August 4, 2025 21:11

nzhan126 requested a review from shuo-wu August 4, 2025 21:12

nzhan126 marked this pull request as ready for review August 4, 2025 21:15

shuo-wu reviewed Aug 5, 2025

controller/orphan_controller.go (outdated)

nzhan126 force-pushed the issue7795 branch from d0a9888 to dd2415b Compare August 5, 2025 19:07

shuo-wu reviewed Aug 5, 2025

controller/orphan_controller.go (outdated)

nzhan126 force-pushed the issue7795 branch 2 times, most recently from 71d5933 to ee881bb Compare August 8, 2025 16:50

derekbit requested review from COLDTURNIP, PhanLe1010 and c3y1huang August 17, 2025 15:50

fix: ensure orphan CR retention when orphan-auto-deletion is false

1e80693

ref:7795 Signed-off-by: Nina Zhan <[email protected]>

derekbit force-pushed the issue7795 branch from ee881bb to 1e80693 Compare August 17, 2025 15:50

COLDTURNIP approved these changes Sep 23, 2025

5 participants
