Skip to content

Conversation

@nzhan126
Copy link
Contributor

@nzhan126 nzhan126 commented Aug 4, 2025

Which issue(s) this PR fixes:

Issue longhorn/longhorn#7795

What this PR does / why we need it:

Delete orphan CR only when node is deleted, evicting, or the actual data of the orphan CR no longer exist

Special notes for your reviewer:

Additional documentation or context

@coderabbitai
Copy link

coderabbitai bot commented Aug 4, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@nzhan126 nzhan126 force-pushed the issue7795 branch 2 times, most recently from 74e9877 to 8ea7fbe Compare August 4, 2025 21:11
@nzhan126 nzhan126 requested a review from shuo-wu August 4, 2025 21:12
@nzhan126 nzhan126 marked this pull request as ready for review August 4, 2025 21:15
Copy link
Contributor

@shuo-wu shuo-wu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested solution includes 2 parts:

1. We shouldn't clean up orphan CR when its node is down, missing manager, or Longhorn manager cannot [get disk stat](https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/util/util.go#L434-L435)
2. We still should delete orphan CR when node is deleted, evicting, or the actual data of the orphan CR no longer exist

How does this PR achieve the actual data of the orphan CR no longer exist ?

@nzhan126
Copy link
Contributor Author

nzhan126 commented Aug 5, 2025

The suggested solution includes 2 parts:

1. We shouldn't clean up orphan CR when its node is down, missing manager, or Longhorn manager cannot [get disk stat](https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/util/util.go#L434-L435)
2. We still should delete orphan CR when node is deleted, evicting, or the actual data of the orphan CR no longer exist

How does this PR achieve the actual data of the orphan CR no longer exist ?
@shuo-wu
I suppose it is done here:

dataStore := orphan.Spec.Parameters[longhorn.OrphanDataName]
if _, ok := replicaDataStores[dataStore]; !ok {
missingOrphanedReplicaDataStores[dataStore] = ""
}

@nzhan126 nzhan126 force-pushed the issue7795 branch 2 times, most recently from 71d5933 to ee881bb Compare August 8, 2025 16:50
@PhanLe1010
Copy link
Contributor

Hi @derekbit @COLDTURNIP Could I have a question about orphan delete setting:

The current code is

		Description: "This setting allows Longhorn to automatically delete orphan resources and their corresponding orphaned resources. \n\n" +
			"Orphan resources located on nodes that are in down or unknown state will not be cleaned up automatically. \n\n" +
			"List the enabled resource types in a semicolon-separated list. \n\n" +
			"Available items are: \n\n" +
			"- **replica-data**: replica data store \n\n" +
			"- **instance**: engine and replica runtime instance \n\n",

From the design perceptive. What would be the expectation?

  1. When node down/missing manager pod, we do not delete orphan CRs at all
  2. OR When node down/missing manager pod, we do delete orphan CRs but does not actually delete data

@PhanLe1010
Copy link
Contributor

This PR is related to longhorn/longhorn#7795

@COLDTURNIP
Copy link
Contributor

COLDTURNIP commented Aug 12, 2025

Hi @derekbit @COLDTURNIP Could I have a question about orphan delete setting:

The current code is

		Description: "This setting allows Longhorn to automatically delete orphan resources and their corresponding orphaned resources. \n\n" +
			"Orphan resources located on nodes that are in down or unknown state will not be cleaned up automatically. \n\n" +
			"List the enabled resource types in a semicolon-separated list. \n\n" +
			"Available items are: \n\n" +
			"- **replica-data**: replica data store \n\n" +
			"- **instance**: engine and replica runtime instance \n\n",

From the design perceptive. What would be the expectation?

1. When node down/missing manager pod, we do not delete orphan CRs at all
2. OR When node down/missing manager pod, we do delete orphan CRs but does not actually delete data

Leaving aside the impact of this setting, I think the 2nd one is expected. The CR reflects the status of the resource in the cluster. When a resource is not reachable from the cluster (e.g. node leave), the corresponding CR should be deleted. Once a node is unreachable by Longhorn, the underlying resource, including the landed data, are not controlled by Longhorn, except the eviction is required. Back to the auto deletion, when this setting is enabled, we expects the controller clear the orphaned resource once it is identified. Without the auto deletion enabled, the CRs are expected to reflect the detectability of the resource, and delete the them only when user asks to.

When auto deletion enabled:

  • Resource monitor detects that a resource becomes orphaned; creates an orphan CR for it.
  • Then clear the orphan CR immediately to clean up the orphaned resource.

No matter the auto deletion is enabled or not, when a node/disk eviction requested:

  • Clear the orphaned resources on the node/disk by deleting the orphan CR

No matter the auto deletion is enabled or not, when a node leaves Longhorn without eviction, and there are some resources on the leaving node:

  • Since the resource is no longer reachable by Longhorn, delete the orphan CR directly without touching the disk resource
  • Runtime instances live in instance manager, and instance manager would be terminated along with the node, hence the orphaned instance will also be cleared. This removal is not caused by the auto deletion setting.

@derekbit
Copy link
Member

Hi @derekbit @COLDTURNIP Could I have a question about orphan delete setting:
The current code is

		Description: "This setting allows Longhorn to automatically delete orphan resources and their corresponding orphaned resources. \n\n" +
			"Orphan resources located on nodes that are in down or unknown state will not be cleaned up automatically. \n\n" +
			"List the enabled resource types in a semicolon-separated list. \n\n" +
			"Available items are: \n\n" +
			"- **replica-data**: replica data store \n\n" +
			"- **instance**: engine and replica runtime instance \n\n",

From the design perceptive. What would be the expectation?

1. When node down/missing manager pod, we do not delete orphan CRs at all
2. OR When node down/missing manager pod, we do delete orphan CRs but does not actually delete data

Leaving aside the impact of this setting, I think the 2nd one is expected. The CR reflects the status of the resource in the cluster. When a resource is not reachable from the cluster (e.g. node leave), the corresponding CR should be deleted. Once a node is unreachable by Longhorn, the underlying resource, including the landed data, are not controlled by Longhorn, except the eviction is required. Back to the auto deletion, when this setting is enabled, we expects the controller clear the orphaned resource once it is identified. Without the auto deletion enabled, the CRs are expected to reflect the detectability of the resource, and delete the them only when user asks to.

When auto deletion enabled:

  • Resource monitor detects that a resource becomes orphaned; creates an orphan CR for it.
  • Then clear the orphan CR immediately to clean up the orphaned resource.

No matter the auto deletion is enabled or not, when a node/disk eviction requested:

  • Clear the orphaned resources on the node/disk by deleting the orphan CR

No matter the auto deletion is enabled or not, when a node leaves Longhorn without eviction, and there are some resources on the leaving node:

  • Since the resource is no longer reachable by Longhorn, delete the orphan CR directly without touching the disk resource
  • Runtime instances live in instance manager, and instance manager would be terminated along with the node, hence the orphaned instance will also be cleared. This removal is not caused by the auto deletion setting.

Second that 2 is expected.

@shuo-wu
Copy link
Contributor

shuo-wu commented Aug 12, 2025

No matter the auto deletion is enabled or not, when a node leaves Longhorn without eviction, and there are some resources on the leaving node:

  • Since the resource is no longer reachable by Longhorn, delete the orphan CR directly without touching the disk resource
  • Runtime instances live in instance manager, and instance manager would be terminated along with the node, hence the orphaned instance will also be cleared. This removal is not caused by the auto deletion setting.

This makes sense.

No matter the auto deletion is enabled or not, when a node/disk eviction requested:

Clear the orphaned resources on the node/disk by deleting the orphan CR

For the eviction case, I prefer to retain the related data resource and clean up the orphan CR only. One typical eviction case is node upgrade. After that, Longhorn may be able to continue using it.

Besides, we need to be careful about one corner case for the node down/manager missing scenario. If the node is suddenly back to healthy when deleting orphan CRs, Longhorn cannot accidently clean up the related data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants