
Commit 95821f9

cademirch and cmeesters authored
feat!: proposal for dynamic partition selection (#321)
Based on some of the work in https://github.com/harvardinformatics/snakemake-executor-plugin-cannon, I've put together this PR to add dynamic partition selection based on job resource requirements. Relevant discussion in #106.

This works by having the user provide a YAML file that specifies their cluster's partitions and the resource limits of each via the option `--slurm-partition-config`. The expected structure of the file is simple:

```yaml
partitions:
  some_partition:
    max_runtime: 100
  another_partition:
    ...
```

I realize adding another config file isn't ideal given that workflow-specific configs, workflow profiles, and global profiles already exist. But this approach avoids the complexity of having to determine cluster configurations through SLURM commands, which can vary wildly (as discussed in #106). It also sidesteps the need for complex expressions in `set-resources` to determine partitions, which can get unwieldy for big workflows with lots of rules. Ideally, users craft a `partitions.yml` once and set it in their global profile.

I'm not super familiar with SLURM and partition resource limits, so I came up with a list based on the Snakemake standard resources and SLURM resources. The following limits can be defined for each partition:

| Parameter               | Type      | Description                        | Default   |
| ----------------------- | --------- | ---------------------------------- | --------- |
| `max_runtime`           | int       | Maximum walltime in minutes        | unlimited |
| `max_mem_mb`            | int       | Maximum total memory in MB         | unlimited |
| `max_mem_mb_per_cpu`    | int       | Maximum memory per CPU in MB       | unlimited |
| `max_cpus_per_task`     | int       | Maximum CPUs per task              | unlimited |
| `max_nodes`             | int       | Maximum number of nodes            | unlimited |
| `max_tasks`             | int       | Maximum number of tasks            | unlimited |
| `max_tasks_per_node`    | int       | Maximum tasks per node             | unlimited |
| `max_gpu`               | int       | Maximum number of GPUs             | 0         |
| `available_gpu_models`  | list[str] | List of available GPU models       | none      |
| `max_cpus_per_gpu`      | int       | Maximum CPUs per GPU               | unlimited |
| `supports_mpi`          | bool      | Whether MPI jobs are supported     | true      |
| `max_mpi_tasks`         | int       | Maximum MPI tasks                  | unlimited |
| `available_constraints` | list[str] | List of available node constraints | none      |

It would also be possible to support arbitrary resources by pattern matching on `max_{resource}`, though I've not implemented that in this PR.

To pick the "best" partition for a job, I went with a naive scoring approach that calculates a score by summing the ratios of requested resources to partition limits. Higher scores should indicate better resource utilization; for example, a job requesting 8 CPUs would prefer a 16-CPU partition (score 0.5) over a 64-CPU partition (score 0.125). Partitions that cannot satisfy a job's requirements are not considered.

This feature is opt-in and respects the `slurm_partition` job resource, as well as the existing fallback partition logic.

@cmeesters, @johanneskoester let me know what you think of this approach! I'm not particularly experienced with SLURM, so I've made decisions here (limits, partition specs, etc.) based on my limited experience and the available docs; feedback is much appreciated.

## Summary by CodeRabbit

* **New Features**
  * Automatic per-job SLURM partition selection that respects explicit job partitions, supports YAML-driven multi-partition configs (via env var or CLI/file), scores compatible partitions, and falls back to the default when none match.
* **Documentation**
  * Added docs detailing the partition YAML schema, example configs (standard/highmem/gpu), scoring rules, selection precedence, and fallback behavior.
* **Tests**
  * Expanded test coverage for parsing, scoring/selection across CPU/GPU/MPI, constraints, multi-cluster scenarios, and error cases.
* **CI**
  * Test run expanded to execute the full test suite.

---------

Co-authored-by: meesters <[email protected]>
1 parent c980d1f commit 95821f9

6 files changed: +1323 −4 lines changed

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -98,7 +98,7 @@ jobs:
           poetry install
 
       - name: Run pytest
-        run: poetry run coverage run -m pytest tests/tests.py -sv --tb=short --disable-warnings
+        run: poetry run coverage run -m pytest tests/ -sv --tb=short --disable-warnings
 
       - name: Run Coverage
         run: poetry run coverage report -m
```

docs/further.md

Lines changed: 102 additions & 0 deletions
`@@ -64,6 +64,108 @@`

How and where you set configurations on factors like file size or increasing the runtime with every `attempt` of running a job (if [`--retries` is greater than `0`](https://snakemake.readthedocs.io/en/stable/executing/cli.html#snakemake.cli-get_argument_parser-behavior)). [There are detailed examples for these in the snakemake documentation.](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-resources)

#### Automatic Partition Selection

The SLURM executor plugin supports automatic partition selection based on job resource requirements, via the command line option `--slurm-partition-config`. This feature allows the plugin to choose the most appropriate partition for each job, without the need to manually specify partitions for different job types. This also enables variable partition selection as a job's resource requirements change based on [dynamic resources](#dynamic-resource-specification), ensuring that jobs are always scheduled to an appropriate partition.

*Jobs that explicitly specify a `slurm_partition` resource will bypass automatic selection and use the specified partition directly.*
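For illustration only (this rule is hypothetical and not part of the plugin), consider a rule whose resources grow with each retry attempt; with `--slurm-partition-config` set, each attempt is re-matched against the partition limits and may land on a different partition:

```python
# Hypothetical Snakefile rule: as `attempt` increases under --retries, the
# requested runtime/memory may exceed a small partition's limits, and the job
# is then matched to a larger partition automatically.
rule heavy_step:
    input:
        "data/{sample}.txt",
    output:
        "results/{sample}.out",
    resources:
        mem_mb=lambda wildcards, attempt: 16000 * attempt,
        runtime=lambda wildcards, attempt: 120 * attempt,  # minutes
    shell:
        "some_tool {input} > {output}"  # placeholder command
```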
##### Partition Limits Specification

To enable automatic partition selection, create a YAML configuration file that defines the available partitions and their resource limits. This file should be structured as follows:

```yaml
partitions:
  some_partition:
    max_runtime: 100
  another_partition:
    ...
```

Where `some_partition` and `another_partition` are the names of the partitions on your cluster, according to `sinfo`.

The following limits can be defined for each partition:

| Parameter               | Type      | Description                         | Default   |
| ----------------------- | --------- | ----------------------------------- | --------- |
| `max_runtime`           | int       | Maximum walltime in minutes         | unlimited |
| `max_mem_mb`            | int       | Maximum total memory in MB          | unlimited |
| `max_mem_mb_per_cpu`    | int       | Maximum memory per CPU in MB        | unlimited |
| `max_cpus_per_task`     | int       | Maximum CPUs per task               | unlimited |
| `max_nodes`             | int       | Maximum number of nodes             | unlimited |
| `max_tasks`             | int       | Maximum number of tasks             | unlimited |
| `max_tasks_per_node`    | int       | Maximum tasks per node              | unlimited |
| `max_threads`           | int       | Maximum threads per node            | unlimited |
| `max_gpu`               | int       | Maximum number of GPUs              | 0         |
| `available_gpu_models`  | list[str] | List of available GPU models        | none      |
| `max_cpus_per_gpu`      | int       | Maximum CPUs per GPU                | unlimited |
| `supports_mpi`          | bool      | Whether MPI jobs are supported      | true      |
| `max_mpi_tasks`         | int       | Maximum MPI tasks                   | unlimited |
| `available_constraints` | list[str] | List of available node constraints  | none      |
| `cluster`               | str       | Cluster name in multi-cluster setup | none      |
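The plugin reads this file with its own `read_partition_file`. Purely as an illustration of the data shape such a file produces, here is a minimal sketch assuming PyYAML; the function below is hypothetical and not the plugin's implementation:

```python
# Illustrative loader for the layout shown above. The plugin's own
# read_partition_file (in snakemake_executor_plugin_slurm/partitions.py) is the
# authoritative implementation and may validate limits more strictly.
from pathlib import Path

import yaml  # PyYAML


def load_partitions(path: Path) -> dict[str, dict]:
    with open(path) as fh:
        config = yaml.safe_load(fh)
    partitions = config.get("partitions") if isinstance(config, dict) else None
    if not partitions:
        raise ValueError(f"no 'partitions' mapping found in {path}")
    # e.g. {"some_partition": {"max_runtime": 100}, "another_partition": {...}}
    return partitions
```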
##### Example Partition Configuration

```yaml
partitions:
  standard:
    max_runtime: 720       # 12 hours
    max_mem_mb: 64000      # 64 GB
    max_cpus_per_task: 24
    max_nodes: 1

  highmem:
    max_runtime: 1440      # 24 hours
    max_mem_mb: 512000     # 512 GB
    max_mem_mb_per_cpu: 16000
    max_cpus_per_task: 48
    max_nodes: 1

  gpu:
    max_runtime: 2880      # 48 hours
    max_mem_mb: 128000     # 128 GB
    max_cpus_per_task: 32
    max_gpu: 8
    available_gpu_models: ["a100", "v100", "rtx3090"]
    max_cpus_per_gpu: 8
```

The plugin supports automatic partition selection on clusters with a SLURM multi-cluster setup. You can specify which cluster each partition belongs to in your partition configuration file:

```yaml
partitions:
  d-standard:
    cluster: "deviating"
    max_runtime: "6d"
    max_nodes: 1
    max_threads: 127
  d-parallel:
    cluster: "deviating"
    supports_mpi: true
    max_threads: 128
    max_runtime: "6d"
  standard:
    cluster: "other"
    max_runtime: "6d"
    max_nodes: 1
    max_threads: 127
  parallel:
    cluster: "other"
    supports_mpi: true
    max_threads: 128
    max_runtime: "6d"
```
##### How Partition Selection Works

When automatic partition selection is enabled, the plugin evaluates each job's resource requirements against the defined partition limits to ensure the job is placed on a partition that can accommodate all of its requirements. When multiple partitions are compatible, the plugin uses a scoring algorithm that favors partitions with limits closer to the job's needs, preventing jobs from being assigned to partitions with excessively high resource limits.

The scoring algorithm calculates a score by summing the ratios of requested resources to partition limits (e.g., if a job requests 8 CPUs and a partition allows 16, this contributes 0.5 to the score). Higher scores indicate better resource utilization, so a job requesting 8 CPUs would prefer a 16-CPU partition (score 0.5) over a 64-CPU partition (score 0.125).
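For intuition, here is a minimal sketch of that scoring rule. It is illustrative only; the plugin's actual selection lives in its `partitions.py` module and also accounts for GPUs, MPI, and constraints:

```python
# Illustrative sketch of the scoring rule described above, not the plugin's
# exact implementation (see snakemake_executor_plugin_slurm/partitions.py).
def score_partition(requested: dict[str, float], limits: dict[str, float]) -> float | None:
    """Sum the ratios of requested resources to the partition's limits.

    Returns None when any request exceeds a limit, i.e. the partition is
    incompatible with the job and must not be considered.
    """
    score = 0.0
    for resource, amount in requested.items():
        limit = limits.get(resource)
        if limit is None:
            continue  # no limit configured: treated as unlimited
        if amount > limit:
            return None  # partition cannot satisfy this request
        score += amount / limit
    return score


# A job requesting 8 CPUs prefers the tighter fit:
print(score_partition({"cpus_per_task": 8}, {"cpus_per_task": 16}))  # 0.5
print(score_partition({"cpus_per_task": 8}, {"cpus_per_task": 64}))  # 0.125
```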
##### Fallback Behavior

If no suitable partition is found based on the job's resource requirements, the plugin falls back to the default SLURM behavior, which typically uses the cluster's default partition or any partition specified explicitly in the job's resources.

#### Standard Resources

snakemake_executor_plugin_slurm/__init__.py

Lines changed: 43 additions & 1 deletion
```diff
@@ -3,6 +3,7 @@
 __email__ = "[email protected]"
 __license__ = "MIT"
 
+import atexit
 import csv
 from io import StringIO
 import os
@@ -35,6 +36,7 @@
 )
 from .efficiency_report import create_efficiency_report
 from .submit_string import get_submit_command
+from .partitions import read_partition_file, get_best_partition
 from .validation import validate_slurm_extra
 
 
@@ -113,6 +115,20 @@ class ExecutorSettings(ExecutorSettingsBase):
             "required": False,
         },
     )
+    partition_config: Optional[Path] = field(
+        default=None,
+        metadata={
+            "help": "Path to YAML file defining partition limits for dynamic "
+            "partition selection. When provided, jobs will be dynamically "
+            "assigned to the best-fitting partition based on their resource "
+            "requirements. See documentation for complete list of available limits. "
+            "Alternatively, the environment variable SNAKEMAKE_SLURM_PARTITIONS "
+            "can be set to point to such a file. "
+            "If both are set, this flag takes precedence.",
+            "env_var": False,
+            "required": False,
+        },
+    )
     efficiency_report: bool = field(
         default=False,
         metadata={
@@ -201,6 +217,26 @@ def __post_init__(self, test_mode: bool = False):
             if self.workflow.executor_settings.logdir
             else Path(".snakemake/slurm_logs").resolve()
         )
+        # Check the environment variable "SNAKEMAKE_SLURM_PARTITIONS",
+        # if set, read the partitions from the given file. Let the CLI
+        # option override this behavior.
+        if (
+            os.getenv("SNAKEMAKE_SLURM_PARTITIONS")
+            and not self.workflow.executor_settings.partition_config
+        ):
+            partition_file = Path(os.getenv("SNAKEMAKE_SLURM_PARTITIONS"))
+            self.logger.info(
+                f"Reading SLURM partition configuration from "
+                f"environment variable file: {partition_file}"
+            )
+            self._partitions = read_partition_file(partition_file)
+        else:
+            self._partitions = (
+                read_partition_file(self.workflow.executor_settings.partition_config)
+                if self.workflow.executor_settings.partition_config
+                else None
+            )
+        atexit.register(self.clean_old_logs)
 
     def shutdown(self) -> None:
         """
@@ -305,6 +341,8 @@ def run_job(self, job: JobExecutorInterface):
         if job.resources.get("slurm_extra"):
             self.check_slurm_extra(job)
 
+        # NOTE removed partition from below, such that partition
+        # selection can benefit from resource checking as the call is built up.
         job_params = {
             "run_uuid": self.run_uuid,
             "slurm_logfile": slurm_logfile,
@@ -698,9 +736,13 @@ def get_partition_arg(self, job: JobExecutorInterface):
         returns a default partition, if applicable
         else raises an error - implicetly.
         """
+        partition = None
         if job.resources.get("slurm_partition"):
             partition = job.resources.slurm_partition
-        else:
+        elif self._partitions:
+            partition = get_best_partition(self._partitions, job, self.logger)
+        # we didnt get a partition yet so try fallback.
+        if not partition:
             if self._fallback_partition is None:
                 self._fallback_partition = self.get_default_partition(job)
             partition = self._fallback_partition
```
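The `get_best_partition` helper itself lives in the new `partitions.py` module, which is not shown in this excerpt. As a hedged sketch of the decision order that `get_partition_arg` now follows (all names below except `get_best_partition` are illustrative, not the plugin's API):

```python
# Condensed, illustrative restatement of the control flow added above: an
# explicit slurm_partition always wins, then the best-scoring partition from
# the --slurm-partition-config file, then the existing fallback/default path.
def choose_partition(job_resources: dict, partitions: dict | None,
                     best_of, fallback: str) -> str:
    explicit = job_resources.get("slurm_partition")
    if explicit:
        return explicit                            # user override is respected
    if partitions:
        best = best_of(partitions, job_resources)  # e.g. get_best_partition(...)
        if best:
            return best                            # best-fitting configured partition
    return fallback                                # default SLURM partition
```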
