Skip to content

Conversation

@gsaudade99
Copy link
Contributor

Xref: https://github.com/usegalaxy-eu/issues/issues/521#issuecomment-2400184860

PR is a copy of this. Should be closed after closing the PR.

Migrated script to maintenance VM.
Added Object Store cleanup for temporary (scratch) storage on crontab.

@gsaudade99 gsaudade99 requested a review from kysrpex November 27, 2025 11:31
@gsaudade99 gsaudade99 self-assigned this Nov 27, 2025
@gsaudade99 gsaudade99 changed the title Feature/object store cleanup add object store cleanup script Nov 27, 2025
@kysrpex kysrpex marked this pull request as ready for review November 27, 2025 12:29
Copy link
Contributor

@kysrpex kysrpex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this (and putting it on the right host), we have left it dead for so long 😅.

In addition to the very minor stuff I mentioned, also the "good practice" would be to rename the role from usegalaxy-eu.galaxy-cleanup to usegalaxy_eu.galaxy_cleanup. It's not a hard requirement but Ansible does it internally and it's the rule for Ansible Galaxy (even if we are not publishing this one).

Copilot AI review requested due to automatic review settings December 3, 2025 15:43
gsaudade99 and others added 2 commits December 3, 2025 16:43
Co-authored-by: José Manuel Domínguez <[email protected]>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an automated object store cleanup script to handle purging old datasets from temporary (scratch) storage. The implementation migrates the cleanup functionality to the maintenance VM and sets up a scheduled cron job to run the cleanup operations daily.

Key Changes:

  • Introduces a new Ansible role usegalaxy-eu.galaxy-cleanup with a shell script template that executes Galaxy's pgcleanup.py script
  • Configures automated daily cleanup via cron job (default: 1:00 AM)
  • Adds configuration for the s3_scratch_netapp01 object store with a 60-day retention policy

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
roles/usegalaxy-eu.galaxy-cleanup/templates/galaxy_cleanup_objectstores.sh Shell script template that executes pgcleanup.py for each configured object store
roles/usegalaxy-eu.galaxy-cleanup/tasks/main.yml Ansible tasks to install the cleanup script and configure the cron job
roles/usegalaxy-eu.galaxy-cleanup/defaults/main.yml Default configuration variables including example structure and cron schedule
maintenance.yml Integrates the galaxy-cleanup role into the maintenance playbook
group_vars/maintenance.yml Configures cleanup for the s3_scratch_netapp01 object store with 60-day retention

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@gsaudade99
Copy link
Contributor Author

This should now be ready for merge @kysrpex

set -e

{% for item in galaxy_cleanup_objectstores | default([]) %}
{{ galaxy_venv_dir }}/bin/python {{ galaxy_server_dir }}/scripts/cleanup_datasets/pgcleanup.py -c {{ galaxy_config_file }} -o {{ item.days }} -l {{ galaxy_log_dir }} -w 128MB --object-store-id {{ item.objectstore_id }} purge_old_hdas 2>&1 | tee -a {{ galaxy_log_dir }}/cleanup-$(date --rfc-3339=seconds)-purge_old_hdas.log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if item.objectstore_id is empty or wrong/does-not-exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the jinja will try to render nothing (empty string, thus becoming --object-store-id purge_old_hdas). Looking at pgcleanup.py script it would fail because there is no args.actions for parsing.

We can add and extra tasks that checks if galaxy_cleanup_objectstores is empty and galaxy_cleanup_objectstores.objectstore_id does not include "scracth". And if we still don't feel "safe" we can add this condition inside the template

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This template task should not be executed if galaxy_cleanup_scratchstorage does not match the expression in the assertion task above:
galaxy_cleanup_scratchstorage | selectattr('objectstore_id', 'search', 'scratch') | list | length > 0

Copy link
Contributor

@kysrpex kysrpex Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the following failures are expected in these kind of cases (so we should be fine).

fatal: [localhost]: FAILED! => {"changed": false, "msg": "Task failed: Finalization of task args for 'ansible.builtin.copy' failed: Error while resolving value for 'content': object of type 'dict' has no attribute 'key_missing'"}

fatal: [localhost]: FAILED! => {"changed": false, "msg": "Task failed: Finalization of task args for 'ansible.builtin.copy' failed: Error while resolving value for 'content': 'missing_var' is undefined"}

Copy link
Contributor

@mira-miracoli mira-miracoli Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a complete playbook failure is the only way we can reliably see broken roles / variables in Jenkins.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass a faulty --object-store-id to the pgcleanup.py, do we know how it will behave?

Copy link
Contributor Author

@gsaudade99 gsaudade99 Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It failed pretty quickly and just roolbacks the changes

(venv) galaxy@sn09:~$ /opt/galaxy/venv/bin/python /opt/galaxy/server/scripts/cleanup_datasets/pgcleanup.py --dry-run -c /opt/galaxy/config/galaxy.yml -o 60 -l /var/log/galaxy -w 128MB --object-store-id im-wrong purge_old_hdas 2>&1 | tee -a /var/log/galaxy/cleanup-$(date --rfc-3339=seconds)-purge_old_hdas.log
tee: '15:00:42+01:00-purge_old_hdas.log': Permission denied
2025-12-10 15:00:46,801 WARNING _update_raw_config_from_kwargs(): Option openai_api_key has been deprecated in favor of ai_api_key
2025-12-10 15:00:46,802 WARNING resolve(): Trying to resolve path for the 'email_ban_file' option but it's empty/None
2025-12-10 15:00:48,109 INFO  run(): Running action 'purge_old_hdas':
2025-12-10 15:00:48,439 INFO  _init(): Initializing object store for action purge_old_hdas
2025-12-10 15:00:48,440 INFO  conn(): Connecting to database with URL: postgresql://galaxy:***@sn11.galaxyproject.eu/galaxy
2025-12-10 15:00:48,535 INFO  conn(): Setting work_mem to 128MB
2025-12-10 15:00:48,538 INFO  _dry_run_event(): Not executing event creation (increments sequence even when rolling back), using an old event ID (46003) for dry run
2025-12-10 15:00:48,539 INFO  _execute(): Executing SQL
2025-12-10 15:00:48,552 INFO  _execute(): Database status: SELECT 0
2025-12-10 15:00:48,552 INFO  _update(): Update resulted in no changes, rolling back transaction
2025-12-10 15:00:48,552 INFO  recalculate_disk_usage(): Recalculating disk usage for users whose data were purged
2025-12-10 15:00:48,553 INFO  run(): Finished purge_old_hdas

@bgruening
Copy link
Member

Maybe we should tweak the wording a bit.

Imho an object_store_cleanup is purging already deleted datasets, or purging delted histories, purging delted users etc ...
this is imho done with https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/group_vars/sn09/sn09.yml#L73-L82

What is done in this PR is more dangerous. It actually deletes data that is older than 60 days, even if the data is NOT deleted and purged. So maybe this PR but also the script should be named something like "clean_scratch_storage.sh"

@mira-miracoli
Copy link
Contributor

That seems to work:

cat test.yml
- name: test
  hosts: localhost
  vars:
    galaxy_cleanup_scratchstorage:
      - objectstore_id: "scratch"
      - objectstore_id: "test"
  tasks:
    - name: Check that all objectstore_id include 'scratch'
      assert:
        that:
          - galaxy_cleanup_scratchstorage
            | rejectattr('objectstore_id', 'search', 'scratch')
            | list
            | length == 0
        fail_msg: "Found objectstore_id entries that do NOT contain 'scratch'"
        success_msg: "All objectstore_id values contain 'scratch'"

ansible-playbook -c local -i localhost test.yml 

PLAY [test] **********************************************************************************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************************************************************
ok: [localhost]

TASK [Check that all objectstore_id include 'scratch'] ***************************************************************************************************************************************
fatal: [localhost]: FAILED! => {
    "assertion": "galaxy_cleanup_scratchstorage | rejectattr('objectstore_id', 'search', 'scratch') | list | length == 0",
    "changed": false,
    "evaluated_to": false,
    "msg": "Found objectstore_id entries that do NOT contain 'scratch'"
}

PLAY RECAP ***********************************************************************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

cat test.yml
- name: test
  hosts: localhost
  vars:
    galaxy_cleanup_scratchstorage:
      - objectstore_id: "scratch"
      - objectstore_id: "scratch-2"
  tasks:
    - name: Check that all objectstore_id include 'scratch'
      assert:
        that:
          - galaxy_cleanup_scratchstorage
            | rejectattr('objectstore_id', 'search', 'scratch')
            | list
            | length == 0
        fail_msg: "Found objectstore_id entries that do NOT contain 'scratch'"
        success_msg: "All objectstore_id values contain 'scratch'"

ansible-playbook -c local -i localhost test.yml 

PLAY [test] **********************************************************************************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************************************************************
ok: [localhost]

TASK [Check that all objectstore_id include 'scratch'] ***************************************************************************************************************************************
ok: [localhost] => {
    "changed": false,
    "msg": "All objectstore_id values contain 'scratch'"
}

PLAY RECAP ***********************************************************************************************************************************************************************************
localhost                  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

@gsaudade99
Copy link
Contributor Author

gsaudade99 commented Dec 10, 2025

@mira-miracoli really cool way to test lol. I did some more testing and:

➜  /tmp cat test.yml                                    
- name: test
  hosts: localhost
  vars:
    galaxy_cleanup_scratchstorage: []
#    galaxy_cleanup_scratchstorage:
#      - objectstore_id: "scratch"
#      - objectstore_id: "scratch-2"
  tasks:
    - name: Check that all objectstore_id include 'scratch'
      assert:
        that:
#          - galaxy_cleanup_scratchstorage | length > 0
          - galaxy_cleanup_scratchstorage
            | rejectattr('objectstore_id', 'search', 'scratch')
            | list
            | length == 0
        fail_msg: "Found objectstore_id entries that do NOT contain 'scratch'"
        success_msg: "All objectstore_id values contain 'scratch'"
➜  /tmp ansible-playbook -c local -i localhost test.yml 

[WARNING]: Unable to parse /tmp/localhost as an inventory source
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'

PLAY [test] ************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [Check that all objectstore_id include 'scratch'] *****************************************************************************************************************************************************************************
ok: [localhost] => {
    "changed": false,
    "msg": "All objectstore_id values contain 'scratch'"
}

PLAY RECAP *************************************************************************************************************************************************************************************************************************
localhost                  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

I will add the commented condition too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants