Commit 95821f9
feat!: proposal for dynamic partition selection (#321)
Based on some of the work in
https://github.com/harvardinformatics/snakemake-executor-plugin-cannon,
I've put together this PR to add dynamic partition selection based on
job resource requirements. See #106 for the relevant discussion.
This works by having the user provide a YAML file that specifies their
cluster's partitions and the resource limits of each, passed via the
`--slurm-partition-config` option. The expected structure of the file is
simple:
```yaml
partitions:
  some_partition:
    max_runtime: 100
  another_partition:
    ...
```
I realize adding another config file isn't ideal given that
workflow-specific configs, workflow profiles, and global profiles already exist.
But this approach avoids the complexity of having to determine cluster
configurations through SLURM commands, which can vary wildly (as
discussed in #106). It also sidesteps the need for complex expressions
in set-resources to determine partitions, which can get unwieldy for big
workflows with lots of rules. Ideally, users can craft a
`partitions.yml` once and set it in their global profile.
I'm not very familiar with SLURM or partition resource limits, so I
came up with a list based on the Snakemake standard resources and SLURM
resources.
The following limits can be defined for each partition:
| Parameter | Type | Description | Default |
| ----------------------- | --------- | ---------------------------------- | --------- |
| `max_runtime` | int | Maximum walltime in minutes | unlimited |
| `max_mem_mb` | int | Maximum total memory in MB | unlimited |
| `max_mem_mb_per_cpu` | int | Maximum memory per CPU in MB | unlimited |
| `max_cpus_per_task` | int | Maximum CPUs per task | unlimited |
| `max_nodes` | int | Maximum number of nodes | unlimited |
| `max_tasks` | int | Maximum number of tasks | unlimited |
| `max_tasks_per_node` | int | Maximum tasks per node | unlimited |
| `max_gpu` | int | Maximum number of GPUs | 0 |
| `available_gpu_models` | list[str] | List of available GPU models | none |
| `max_cpus_per_gpu` | int | Maximum CPUs per GPU | unlimited |
| `supports_mpi` | bool | Whether MPI jobs are supported | true |
| `max_mpi_tasks` | int | Maximum MPI tasks | unlimited |
| `available_constraints` | list[str] | List of available node constraints | none |
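To make the defaults in the table concrete, here is a minimal sketch of how the limits could be modelled in Python. The names `PartitionLimits` and `parse_partitions` are hypothetical and not taken from this PR. The sketch assumes "unlimited" maps to infinity, a default of "none" maps to an empty list, and that the YAML file has already been loaded into a plain dict.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PartitionLimits:
    """Hypothetical container for one partition's limits (field names follow the table)."""

    max_runtime: float = math.inf  # minutes; inf stands in for "unlimited"
    max_mem_mb: float = math.inf
    max_mem_mb_per_cpu: float = math.inf
    max_cpus_per_task: float = math.inf
    max_nodes: float = math.inf
    max_tasks: float = math.inf
    max_tasks_per_node: float = math.inf
    max_gpu: int = 0  # partitions have no GPUs unless stated
    available_gpu_models: list[str] = field(default_factory=list)
    max_cpus_per_gpu: float = math.inf
    supports_mpi: bool = True
    max_mpi_tasks: float = math.inf
    available_constraints: list[str] = field(default_factory=list)


def parse_partitions(config: dict) -> dict[str, PartitionLimits]:
    """Turn an already-parsed YAML mapping into PartitionLimits objects.

    Unknown keys raise TypeError in this sketch; a real parser might
    prefer a friendlier error message.
    """
    partitions = config.get("partitions")
    if not isinstance(partitions, dict) or not partitions:
        raise ValueError("config needs a non-empty 'partitions' mapping")
    return {name: PartitionLimits(**(spec or {})) for name, spec in partitions.items()}


# Example: the dict a YAML loader could produce for a two-partition file.
config = {"partitions": {"some_partition": {"max_runtime": 100}, "another_partition": None}}
parsed = parse_partitions(config)
```

Any limit left out of the file simply keeps its default, so a partition entry only needs to list the limits that actually apply.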
It would be possible to support arbitrary resources by pattern matching
on `max_{resource}`, though I've not implemented that in this PR.
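As a rough illustration of the `max_{resource}` idea (which, again, is not part of this PR), the sketch below collects every numeric `max_*` key as a limit on the resource named after the prefix. The helper names `generic_limits` and `fits`, and the `max_scratch_gb` key, are made up for the example.

```python
import re

# Matches keys such as "max_runtime" and captures "runtime".
_LIMIT_KEY = re.compile(r"^max_(?P<resource>\w+)$")


def generic_limits(partition_spec: dict) -> dict[str, float]:
    """Collect every numeric `max_<resource>` key as a limit on `<resource>`."""
    limits = {}
    for key, value in partition_spec.items():
        match = _LIMIT_KEY.match(key)
        # bool is a subclass of int, so exclude it explicitly.
        if match and isinstance(value, (int, float)) and not isinstance(value, bool):
            limits[match.group("resource")] = value
    return limits


def fits(job_resources: dict, limits: dict) -> bool:
    """A job fits if no requested resource exceeds its matching limit."""
    return all(job_resources.get(name, 0) <= limit for name, limit in limits.items())


spec = {"max_runtime": 100, "max_scratch_gb": 500, "supports_mpi": True}
limits = generic_limits(spec)
```

Non-numeric settings such as `supports_mpi` are ignored here and would still need dedicated handling.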
To pick the "best" partition for a job, I went with a naive scoring
approach that sums the ratios of requested resources to partition
limits. Higher scores should indicate better resource utilization. For
example, a job requesting 8 CPUs would prefer a 16-CPU partition
(score 8/16 = 0.5) over a 64-CPU partition (score 8/64 = 0.125).
Partitions that cannot satisfy a job's requirements are not considered.
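The scoring rule described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the code from this PR: `score_partition` is a hypothetical name, resources and limits are plain dicts keyed by the same resource name, and a missing limit is treated as unlimited and contributes nothing to the score.

```python
import math
from typing import Optional


def score_partition(requested: dict, limits: dict) -> Optional[float]:
    """Sum of requested/limit ratios, or None if the partition cannot run the job."""
    score = 0.0
    for resource, amount in requested.items():
        limit = limits.get(resource, math.inf)  # missing limit = unlimited
        if amount > limit:
            return None  # partition cannot satisfy this request
        if limit > 0 and math.isfinite(limit):
            score += amount / limit
    return score


job = {"cpus_per_task": 8}
partitions = {
    "small": {"cpus_per_task": 16},
    "large": {"cpus_per_task": 64},
    "tiny": {"cpus_per_task": 4},  # too small for the job
}
scores = {name: score_partition(job, limits) for name, limits in partitions.items()}
compatible = {name: s for name, s in scores.items() if s is not None}
best = max(compatible, key=compatible.get)
```

With these inputs the 16-CPU partition scores 0.5, the 64-CPU partition scores 0.125, and the 4-CPU partition is excluded, so `best` is `"small"`.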
This feature is opt-in and respects the `slurm_partition` job resource,
as well as the existing fallback partition logic.
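The resulting precedence (explicit `slurm_partition` first, then the best-scoring compatible partition, then the fallback) might look roughly like the sketch below. `choose_partition` and its arguments are hypothetical. It assumes scores have already been computed per partition, with `None` marking partitions that cannot run the job.

```python
from typing import Optional


def choose_partition(
    job_resources: dict,
    partition_scores: dict[str, Optional[float]],
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Pick a partition: explicit request, then best score, then fallback."""
    explicit = job_resources.get("slurm_partition")
    if explicit:
        return explicit  # the user's choice always wins
    compatible = {n: s for n, s in partition_scores.items() if s is not None}
    if compatible:
        return max(compatible, key=compatible.get)  # highest score wins
    return fallback  # nothing matched: defer to the existing fallback logic


scores = {"small": 0.5, "large": 0.125, "tiny": None}
picked_auto = choose_partition({"cpus_per_task": 8}, scores, fallback="default")
picked_explicit = choose_partition({"slurm_partition": "large"}, scores, fallback="default")
picked_fallback = choose_partition({"cpus_per_task": 128}, {"tiny": None}, fallback="default")
```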
@cmeesters, @johanneskoester let me know what you think of this
approach! I'm not particularly experienced with SLURM, so I've made
decisions here (limits, partition specs, etc.) based on my limited
experience and the available docs - so feedback is much appreciated.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Automatic per-job SLURM partition selection that respects explicit job
    partitions, supports YAML-driven multi-partition configs (via env var or
    CLI/file), scores compatible partitions, and falls back to the default
    when none match.
* **Documentation**
  * Added docs detailing the partition YAML schema, example configs
    (standard/highmem/gpu), scoring rules, selection precedence, and
    fallback behavior.
* **Tests**
  * Expanded test coverage for parsing, scoring/selection across
    CPU/GPU/MPI, constraints, multi-cluster scenarios, and error cases.
* **CI**
  * Test run expanded to execute the full test suite.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: meesters <[email protected]>

1 parent c980d1f commit 95821f9
File tree

6 files changed, +1323 -4 lines changed

- .github/workflows
- docs
- snakemake_executor_plugin_slurm
- tests