
Conversation

@cmeesters
Collaborator

@cmeesters cmeesters commented Aug 14, 2025

This PR sets the MinJobAge parameter in the SLURM configuration. A feature request for the Snakemake SLURM executor requires this parameter to be set, as the new executor code will check it. If it is too low, there is no alternative to sacct as the query command, because squeue's job information would be too volatile.

It is set to a value of 12 h because otherwise the squeue option will not become available in the executor. This is insanely high, but such a requirement needs to be imposed so that longer workflows do not fail.
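For reference, the change amounts to a single line in slurm.conf (12 h = 43200 s); the comment is a sketch of what the in-file explanation discussed below could look like:

```
# Keep finished jobs visible to squeue long enough for the Snakemake SLURM
# executor to query their final state. The default of 300 s is too volatile
# for long-running workflows without slurmdbd.
MinJobAge=43200
```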

@cmeesters cmeesters requested a review from dlaehnemann August 14, 2025 13:57
@dlaehnemann

This generally sounds plausible, but I would like to understand the underlying issue better.

Could you link the respective feature request in the slurm executor plugin for context?

And otherwise, some questions:

Will the plugin automatically check for alternative commands if sacct fails? Or is it just that the user can provide an alternative job status query command?

From what I understand from the slurm config docs, MinJobAge ensures that the record of a completed job is not cleared from slurmctld's list of jobs for this period of time (in seconds). So do you expect jobs where you will only query the job status more than 12 h after they have finished? I would expect this check to happen within minutes?

Also, does only squeue use the slurmctld? And thus sacct isn't affected by this configuration parameter at all?

And finally, if we discuss the details here already, maybe it makes sense to add a sentence or two as a comment in the config file, so that anybody reading the config file can directly see why this parameter is set in this way...

@cmeesters
Collaborator Author

It is not related to a specific issue, but rather to this PR: snakemake/snakemake-executor-plugin-slurm#336

Will the plugin automatically check for alternative commands ...

My current outline is: check for sacct availability and the MinJobAge value. If sacct is available and the value is sufficient, offer a choice. Otherwise, if sacct is not available but the value is sufficient, switch to squeue with an appropriate warning, without a CLI option. My PR is not ready, though.
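That outline could be sketched roughly as follows. This is a hypothetical illustration, not the actual plugin code: the function name, the string return values, and the idea of passing `sacct_available` and the cluster's `min_job_age` in as plain arguments are all assumptions for the sake of the sketch.

```python
def choose_status_command(sacct_available: bool,
                          min_job_age: int,
                          required_age: int) -> str:
    """Pick the job-status query command along the lines of the outline above.

    min_job_age:  the cluster's MinJobAge setting, in seconds
    required_age: the minimum MinJobAge the executor needs for squeue
                  to see finished jobs reliably
    """
    age_ok = min_job_age >= required_age
    if sacct_available and age_ok:
        return "choice"   # both work: expose a CLI option to the user
    if sacct_available:
        return "sacct"    # squeue records would expire too quickly
    if age_ok:
        return "squeue"   # fall back, with a warning logged by the caller
    raise RuntimeError(
        "neither sacct nor a sufficiently high MinJobAge is available"
    )
```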

From what I understand ...

Me too. My understanding is that you can query the job status with squeue even for completed jobs (thereby querying slurmctld), for up to MinJobAge after completion. My concern is that for a long-running workflow, any low value of MinJobAge can cause a job to go out of scope. You could argue that 12 h is a lot, as we ensure that we query more frequently. Yes, perhaps I need to consider a lower value. Ideas?

And thus sacct isn't affected by this configuration parameter at all?

From what I have seen, sacct talks to slurmdbd.

... a comment ...

Yep. Will do.

@dlaehnemann

Thanks for the linkout to that pull request, this helps a lot.

The very high value suggested by the original poster there makes some sense. But I think I would nevertheless set MinJobAge a lot lower here (and in the pull request), because otherwise you will never be able to switch over to squeue. The default value for MinJobAge is 300 seconds, and most cluster admins probably won't touch this. From the pull request I understand that this number would and should be increased in the use case where you don't set up slurmdbd, but I'm not sure everybody would set it to such a high value. And for a snakemake application, anything that is sure to be longer than the maximum timeframe for re-querying the job status should be fine. In the plugin you could even set it to something like 1.5 * or 2.0 * wait_time (not sure what the variable is called over there), so that this check automatically adapts to any parameter changes (be they in the plugin code or even on the command line).
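The dynamic check suggested here could look something like the sketch below. `wait_time` is a stand-in for whatever the executor plugin actually calls its status re-query interval, and the safety factor of 2.0 is just the margin proposed in this thread, both assumptions for illustration:

```python
def squeue_is_safe(min_job_age: float,
                   wait_time: float,
                   safety_factor: float = 2.0) -> bool:
    """True if the cluster's MinJobAge (seconds) comfortably exceeds the
    plugin's interval between job-status checks, so a finished job is
    still visible to squeue at the next check."""
    return min_job_age >= safety_factor * wait_time
```

Because the threshold is derived from `wait_time` rather than hard-coded, it adapts automatically if the interval is changed in the plugin code or on the command line.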

@cmeesters
Collaborator Author

Right, a dynamic setting of sorts is preferable. I will adjust this PR accordingly.

I did not think this through. Whilst coding I was reminded that some of our users tend to demand forever-jobs.

@dlaehnemann

I think we are on the same page, but just to be sure: how long a job takes should not affect the job status query via squeue, as the MinJobAge only starts counting once the job is actually finished. So we basically just need to ensure that MinJobAge > wait_time_between_status_checks, with some good extra margin (because who knows how slow the system can get in edge cases, I guess ;).

@cmeesters
Collaborator Author

I am withdrawing this PR: it is better to set the time window dynamically in the executor plugin, and to enable the CLI option dynamically, too.

@cmeesters cmeesters closed this Aug 18, 2025
@dlaehnemann

But don't you need an at least slightly increased MinJobAge setting here, in the testing setup for slurm?

@cmeesters
Collaborator Author

I think the default of 5 min will be fine within the CI.
