@elezar
Member

@elezar elezar commented Nov 13, 2025

No description provided.

Comment on lines +21 to +22
# We difine nvidia-perisistenced.service as an (After) Requisite to ensure that this
# serivce only starts if that is already started.

Suggested change
# We difine nvidia-perisistenced.service as an (After) Requisite to ensure that this
# serivce only starts if that is already started.
# Specify nvidia-persistenced.service as a dependency to ensure that this service
# only starts if that has started. This is required for Confidential Compute.
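
As a rough illustration of what the comment describes, the dependency could look like the unit fragment below. The diff itself is not shown in this thread, so the exact directives and their placement are assumptions; only the nvidia-persistenced.service unit name comes from the discussion.

```ini
# Sketch of the [Unit] section of the CDI refresh service (assumed layout).
[Unit]
# Requisite= makes startup of this unit fail immediately if
# nvidia-persistenced.service has not been started already; it does not
# start that unit itself.
Requisite=nvidia-persistenced.service
# Requisite= implies no ordering, so After= is added to make sure this
# unit is not started before nvidia-persistenced.service.
After=nvidia-persistenced.service
```

In this form, a host where nvidia-persistenced is not managed by systemd would see the CDI refresh service fail to start.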

@steven-bellock

@yf23 this solution requires nvidia-persistenced to be started by systemd as outlined in the deployment guide https://docs.nvidia.com/cc-deployment-guide-tdx.pdf. Does this seem reasonable?

@elezar
Member Author

elezar commented Nov 17, 2025

@yf23 this solution requires nvidia-persistenced to be started by systemd as outlined in the deployment guide https://docs.nvidia.com/cc-deployment-guide-tdx.pdf. Does this seem reasonable?

It is often recommended on our end, but I don't know whether we can call this a requirement. Is there a way to ONLY add this requirement IF it is actually present?
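
One possible direction, sketched here under assumptions and not something proposed in this thread: in systemd, After= on its own only orders two units when both are being started and has no effect otherwise, while Wants= requests another unit without failing the transaction if that unit cannot be started. Whether this weaker form is sufficient for Confidential Compute is an open question.

```ini
# Sketch of a softer dependency for the CDI refresh service (assumed layout).
[Unit]
# Ordering only: applies when nvidia-persistenced.service is also being
# started, and has no effect if it is not.
After=nvidia-persistenced.service
# Optional weak dependency: asks systemd to start
# nvidia-persistenced.service as well, but this unit still starts if that
# one fails or cannot be found.
Wants=nvidia-persistenced.service
```

Note that Wants= would also start nvidia-persistenced on hosts where the unit exists but was not otherwise enabled, which may or may not be desirable; dropping the Wants= line leaves a pure ordering constraint.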

@elezar elezar added this to the v1.18.1 milestone Nov 17, 2025
@ArangoGutierrez
Collaborator

@elezar Thanks for this PR. Could you provide a PR description detailing the motivation behind this change? Is the goal to address a specific user need, e.g. CoCo/CC, or is there a broader motivation of making the cdi-refresh service more stable?

@elezar elezar removed this from the v1.18.1 milestone Nov 19, 2025
@yf23

yf23 commented Nov 21, 2025

@yf23 this solution requires nvidia-persistenced to be started by systemd as outlined in the deployment guide https://docs.nvidia.com/cc-deployment-guide-tdx.pdf. Does this seem reasonable?

Hi @steven-bellock,

Thanks for putting up the solution. Overall it looks good to me, with one small question:
Does this dependency make sure the CDI refresh happens after the Nvidia persistence daemon starts OR after Nvidia persistence mode enablement finishes? One example execution order (especially on HGX with 8 GPUs, where persistence mode enablement takes longer) would be:

Nvidia persistence daemon starts
Nvidia persistence daemon starts enabling persistence mode
Nvidia CDI refresher daemon starts
Nvidia CDI refresher triggers CDI refresh (which calls nvidia-smi)
Nvidia persistence daemon finishes enabling persistence mode

In this case the problem could still occur.

@steven-bellock

Thanks for putting up the solution.

All credit goes to @elezar.

Execution of nvidia-persistenced will not return until all GPUs have been initialized. https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html says (emphasis mine)

After= ensures the opposite, that the listed unit is fully started up before the configured unit is started.

so it sounds like that should be OK.
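
What "fully started up" means depends on the service type of the listed unit. The fragment below is a hypothetical sketch, assuming nvidia-persistenced runs as a forking daemon whose parent process exits once initialization is done; the Type= value and the ExecStart= path are assumptions and should be checked against the unit file actually installed.

```ini
# Hypothetical nvidia-persistenced.service fragment (assumed, not copied
# from an installed unit file).
[Service]
# With Type=forking, systemd considers the unit started once the parent
# process launched by ExecStart= exits, so units ordered After= it wait
# until then.
Type=forking
ExecStart=/usr/bin/nvidia-persistenced
```

With Type=simple, by contrast, systemd treats the unit as started as soon as the main process has been forked off, so After= alone would not wait for GPU initialization.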

@yf23

yf23 commented Nov 22, 2025

Execution of nvidia-persistenced will not return until all GPUs have been initialized.
so it sounds like that should be OK.

Thanks @steven-bellock @elezar for the solution & explanation! Looks good to me and appreciate your help.
