Galaxy needs the job working directory of a job to exist in order to delete it #21310

@kysrpex

Describe the bug

At usegalaxy.eu, we set the weight of a backend of a distributed object store to zero on October 10. One month later, on November 11, we removed its mountpoint. That caused this Sentry issue to appear:

FileNotFoundError
[Errno 2] No such file or directory: '/data/jwd05e/main'

Examining the stack trace from the Sentry issue, and querying the Galaxy DB with the job id retrieved from it, shows that this issue affects jobs that were created before the weight of the backend was set to zero.

galaxy=> SELECT id, tool_id, create_time, update_time, state, user_id, object_store_id FROM job WHERE id='********';
    id    |                                  tool_id                                  |        create_time         |        update_time         |  state  | user_id | object_store_id 
----------+---------------------------------------------------------------------------+----------------------------+----------------------------+---------+---------+-----------------
 ******** | toolshed.g2.bx.psu.edu/repos/iuc/samtools_view/samtools_view/1.21+galaxy0 | 2025-10-05 **:**:**.****** | 2025-11-18 **:**:**.****** | deleted |  ****** | files30
(1 row)

The stack trace also shows that Galaxy is attempting to stop the job, but in order to do so it creates a JobWrapper object, which in turn attempts to create the job working directory and fails because the mountpoint no longer exists.
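
The failure mode can be illustrated with a minimal sketch. The classes `EagerWrapper` and `LazyWrapper` and the helper `try_stop` are hypothetical names made up for this example; they are not Galaxy's actual `JobWrapper` code. They only show why creating the working directory at construction time makes stopping a job depend on the backend's mountpoint.

```python
import os
import tempfile


class EagerWrapper:
    """Hypothetical wrapper that creates its working directory on construction."""

    def __init__(self, working_directory):
        self.working_directory = working_directory
        # os.mkdir raises FileNotFoundError when the parent directory is
        # missing, for example because the backing file system was unmounted.
        os.mkdir(self.working_directory)

    def stop(self):
        return "stopped"


class LazyWrapper:
    """Hypothetical wrapper that only touches the file system when asked to."""

    def __init__(self, working_directory):
        self.working_directory = working_directory

    def ensure_working_directory(self):
        # Called only by code paths that really need the directory.
        os.makedirs(self.working_directory, exist_ok=True)

    def stop(self):
        # Stopping needs no working directory, so a missing mountpoint is harmless.
        return "stopped"


def try_stop(wrapper_cls, working_directory):
    """Return the result of stop(), or the name of the exception raised."""
    try:
        return wrapper_cls(working_directory).stop()
    except FileNotFoundError as exc:
        return type(exc).__name__


# Simulate a removed mountpoint with a path whose parent does not exist.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "unmounted", "main")
    eager_result = try_stop(EagerWrapper, missing)
    lazy_result = try_stop(LazyWrapper, missing)
```

In this sketch, the eager variant cannot even be constructed once the parent directory is gone, while the lazy variant can still stop the job.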

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 25.0 (minor 4.dev0)
Commit: c939a15134860c19ee5e2a826060bf6df37a6bc2 (from the usegalaxy-eu/galaxy fork)

Browser and Operating System
Operating System: Linux
Browser: Firefox

To Reproduce
Steps to reproduce the behavior:

  1. Launch a job and somehow keep it in the queue for a very long time (I guess, for example, by launching a few thousand jobs so that it hits the user concurrency limit, or maybe via dependencies).
  2. Set the weight of the object store backend (distributed object store) to zero.
  3. Unmount the file system that the object store backend uses as job working directory.
  4. Delete the job.

Expected behavior
A job can be deleted even if its job working directory does not exist.
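
As a rough illustration of the expected behavior, cleanup code could treat a missing working directory as "nothing to do". The function `remove_working_directory` below is a hypothetical helper written for this example, not an existing Galaxy function.

```python
import os
import shutil
import tempfile


def remove_working_directory(path):
    """Delete the directory at path; return True if something was removed."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # The directory (or its whole mountpoint) is already gone,
        # so there is nothing left to clean up.
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "job_working_directory")
    os.mkdir(existing)
    removed_existing = remove_working_directory(existing)

    # Parent directory never existed, similar to an unmounted file system.
    missing = os.path.join(tmp, "unmounted", "main")
    removed_missing = remove_working_directory(missing)
```

Other errors, such as permission problems, are deliberately not swallowed in this sketch, so they would still surface.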

Additional context
The job has a job_runner_name (condor), but no job_runner_external_id.
