-
Notifications
You must be signed in to change notification settings - Fork 108
add object store cleanup script #1734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
add object store cleanup script #1734
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this (and putting it on the right host), we have left it dead for so long 😅.
In addition to the very minor stuff I mentioned, also the "good practice" would be to rename the role from usegalaxy-eu.galaxy-cleanup to usegalaxy_eu.galaxy_cleanup. It's not a hard requirement but Ansible does it internally and it's the rule for Ansible Galaxy (even if we are not publishing this one).
roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh
Outdated
Show resolved
Hide resolved
roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh
Outdated
Show resolved
Hide resolved
Co-authored-by: José Manuel Domínguez <[email protected]>
Co-authored-by: José Manuel Domínguez <[email protected]>
…ectstores.sh Co-authored-by: José Manuel Domínguez <[email protected]>
Co-authored-by: José Manuel Domínguez <[email protected]>
Co-authored-by: José Manuel Domínguez <[email protected]>
Co-authored-by: José Manuel Domínguez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds an automated object store cleanup script to handle purging old datasets from temporary (scratch) storage. The implementation migrates the cleanup functionality to the maintenance VM and sets up a scheduled cron job to run the cleanup operations daily.
Key Changes:
- Introduces a new Ansible role
usegalaxy-eu.galaxy-cleanupwith a shell script template that executes Galaxy's pgcleanup.py script - Configures automated daily cleanup via cron job (default: 1:00 AM)
- Adds configuration for the s3_scratch_netapp01 object store with a 60-day retention policy
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh | Shell script template that executes pgcleanup.py for each configured object store |
| roles/usegalaxy-eu.galaxy-cleanup/tasks/main.yml | Ansible tasks to install the cleanup script and configure the cron job |
| roles/usegalaxy-eu.galaxy-cleanup/defaults/main.yml | Default configuration variables including example structure and cron schedule |
| maintenance.yml | Integrates the galaxy-cleanup role into the maintenance playbook |
| group_vars/maintenance.yml | Configures cleanup for the s3_scratch_netapp01 object store with 60-day retention |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh
Outdated
Show resolved
Hide resolved
roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh
Outdated
Show resolved
Hide resolved
roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh
Outdated
Show resolved
Hide resolved
roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh
Outdated
Show resolved
Hide resolved
…ectstores.sh Co-authored-by: José Manuel Domínguez <[email protected]>
Co-authored-by: José Manuel Domínguez <[email protected]>
Co-authored-by: José Manuel Domínguez <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
This should now be ready for merge @kysrpex |
| set -e | ||
|
|
||
| {% for item in galaxy_cleanup_objectstores | default([]) %} | ||
| {{ galaxy_venv_dir }}/bin/python {{ galaxy_server_dir }}/scripts/cleanup_datasets/pgcleanup.py -c {{ galaxy_config_file }} -o {{ item.days }} -l {{ galaxy_log_dir }} -w 128MB --object-store-id {{ item.objectstore_id }} purge_old_hdas 2>&1 | tee -a {{ galaxy_log_dir }}/cleanup-$(date --rfc-3339=seconds)-purge_old_hdas.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if item.objectstore_id is empty or wrong/does-not-exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the jinja will try to render nothing (empty string, thus becoming --object-store-id purge_old_hdas). Looking at pgcleanup.py script it would fail because there is no args.actions for parsing.
We can add and extra tasks that checks if galaxy_cleanup_objectstores is empty and galaxy_cleanup_objectstores.objectstore_id does not include "scracth". And if we still don't feel "safe" we can add this condition inside the template
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This template task should not be executed if galaxy_cleanup_scratchstorage does not match the expression in the assertion task above:
galaxy_cleanup_scratchstorage | selectattr('objectstore_id', 'search', 'scratch') | list | length > 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the following failures are expected in these kind of cases (so we should be fine).
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Task failed: Finalization of task args for 'ansible.builtin.copy' failed: Error while resolving value for 'content': object of type 'dict' has no attribute 'key_missing'"}
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Task failed: Finalization of task args for 'ansible.builtin.copy' failed: Error while resolving value for 'content': 'missing_var' is undefined"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a complete playbook failure is the only way we can reliably see broken roles / variables in Jenkins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we pass a faulty --object-store-id to the pgcleanup.py, do we know how it will behave?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It failed pretty quickly and just roolbacks the changes
(venv) galaxy@sn09:~$ /opt/galaxy/venv/bin/python /opt/galaxy/server/scripts/cleanup_datasets/pgcleanup.py --dry-run -c /opt/galaxy/config/galaxy.yml -o 60 -l /var/log/galaxy -w 128MB --object-store-id im-wrong purge_old_hdas 2>&1 | tee -a /var/log/galaxy/cleanup-$(date --rfc-3339=seconds)-purge_old_hdas.log
tee: '15:00:42+01:00-purge_old_hdas.log': Permission denied
2025-12-10 15:00:46,801 WARNING _update_raw_config_from_kwargs(): Option openai_api_key has been deprecated in favor of ai_api_key
2025-12-10 15:00:46,802 WARNING resolve(): Trying to resolve path for the 'email_ban_file' option but it's empty/None
2025-12-10 15:00:48,109 INFO run(): Running action 'purge_old_hdas':
2025-12-10 15:00:48,439 INFO _init(): Initializing object store for action purge_old_hdas
2025-12-10 15:00:48,440 INFO conn(): Connecting to database with URL: postgresql://galaxy:***@sn11.galaxyproject.eu/galaxy
2025-12-10 15:00:48,535 INFO conn(): Setting work_mem to 128MB
2025-12-10 15:00:48,538 INFO _dry_run_event(): Not executing event creation (increments sequence even when rolling back), using an old event ID (46003) for dry run
2025-12-10 15:00:48,539 INFO _execute(): Executing SQL
2025-12-10 15:00:48,552 INFO _execute(): Database status: SELECT 0
2025-12-10 15:00:48,552 INFO _update(): Update resulted in no changes, rolling back transaction
2025-12-10 15:00:48,552 INFO recalculate_disk_usage(): Recalculating disk usage for users whose data were purged
2025-12-10 15:00:48,553 INFO run(): Finished purge_old_hdas
|
Maybe we should tweak the wording a bit. Imho an object_store_cleanup is purging already deleted datasets, or purging delted histories, purging delted users etc ... What is done in this PR is more dangerous. It actually deletes data that is older than 60 days, even if the data is NOT deleted and purged. So maybe this PR but also the script should be named something like "clean_scratch_storage.sh" |
Co-authored-by: Mira <[email protected]>
Co-authored-by: Mira <[email protected]>
Co-authored-by: Mira <[email protected]>
|
That seems to work: |
|
@mira-miracoli really cool way to test lol. I did some more testing and: I will add the commented condition too. |
roles/usegalaxy_eu.galaxy_cleanup/templates/clean_scratch_storage.sh
Outdated
Show resolved
Hide resolved
…age.sh Co-authored-by: Mira <[email protected]>
PR is a copy of this. Should be closed after closing the PR.
Migrated script to maintenance VM.
Added Object Store cleanup for temporary (scratch) storage on crontab.