Skip to content

Conversation

@aitap
Copy link
Contributor

@aitap aitap commented Nov 13, 2025

There were two problems with the way frollapply handled interruptions:

  1. invokeRestart("abort") in the interrupt handler prevented any further error or calling handlers from running (but not exit handlers like tryCatch(finally=...) or on.exit(); those kept working). I think that it's better to return from the calling handler, letting R continue the dispatch to other possible handlers:
tryCatch(
  frollapply(1:1e6, 1, \(.) { Sys.sleep(ret <- sum(.)); ret}),
  interrupt = \(e) 'interrupted'
)
^C[1] "interrupted"
  1. While testing on an older laptop with an i5-2520M CPU, I noticed that interrupting parallel frollapply sometimes left zombie processes. The only way to reap them was to unload the parallel parallel (or quit R). I think that the parent process was too quick to call mccollect() after sending the SIGINT, so the child process did not have enough time to process the SIGINT, quit, and be ready to be collected by waitpid(). I think that the zombie processes can be avoided by forcing mccollect() to wait, but that opens opportunities for further problems. mccollect()ing a terminated process results in a warning: do we need suppressWarnings()? If the child processes are really hung (e.g. due to the rolling function causing a deadlock), a second interrupt seems to interrupt mccollect despite the suspendInterrupts() (???) and unblock the parent process:
writeLines('void foo(void) { for(;;); }', 'foo.c')
tools::Rcmd('SHLIB foo.c')
foo <- mcparallel(.C('foo')) # child process is now hung
tryCatch(
 withCallingHandlers(
  Sys.sleep(10), # some interruptible process
  interrupt = \(e) suspendInterrupts( mccollect(foo) ) # handling the interrupt will hang
 ),
 interrupt = \(e) message('interrupted') # try to handle interrupting as well
)
^C^Cinterrupted # <-- unblocks after a second interrupt
tools::pskill(foo$pid) # stop the hung process
mccollect(foo) # clean up the zombie

Instead of calling invokeRestart("abort") in the interrupt handler,
return from it. This continues the dispatch of the interrupt and lets an
outer handler catch it:

tryCatch(
  frollapply(1:1e6, 1, \(.) { Sys.sleep(ret <- sum(.)); ret}),
  interrupt = \(e) 'interrupted'
)
^C[1] "interrupted"

With invokeRestart("abort"), the interrupt cannot be handled further.
While handling an interrupt, ask mccollect() to wait for the child
process to exit (with a warning) in order to avoid producing zombies.
Otherwise a process that is too slow to react to SIGTERM will remain a
zombie until the parent process exits.
@codecov
Copy link

codecov bot commented Nov 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.13%. Comparing base (df7fa80) to head (d1f99b8).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7428      +/-   ##
==========================================
- Coverage   99.13%   99.13%   -0.01%     
==========================================
  Files          85       85              
  Lines       16618    16617       -1     
==========================================
- Hits        16474    16473       -1     
  Misses        144      144              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ben-schwen
Copy link
Member

Could the zombie processes also be the reason that the codecov github job gets stuck in #7288?

@github-actions
Copy link

  • HEAD=frollapply-interrupt stopped early for DT[by,verbose=TRUE] improved in #6296
  • HEAD=frollapply-interrupt slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit d1f99b8

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 4 minutes and 57 seconds
Installing different package versions 10 minutes and 12 seconds
Running and plotting the test cases 2 minutes and 41 seconds

@ben-schwen
Copy link
Member

Apparently there are some other packages having the same problem

https://github.com/cran/bettermc/blob/34da1f6180067ae6970d42c3297442b5e0621784/R/mclapply.R#L723-L739

@aitap
Copy link
Contributor Author

aitap commented Nov 13, 2025

Can't say no with 100% certainty, but I don't see how R CMD check could be producing them: the current issue is only for when the user interrupts R. So far, the difference between hang and no hang is covr:::save_trace running as part of parallel::mcexit, which looks quite harmless.

@jangorecki jangorecki added this to the 1.18.0 milestone Nov 15, 2025
Copy link
Member

@ben-schwen ben-schwen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The previous code with wait=FALSE looks kind of wrong since timeout=0.

Still would like a NEWS item, maybe as addendum to the whole froll* NEWS

@jangorecki
Copy link
Member

News should not be needed as this is change to code added in dev. It could eventually explain current interrupt behavior but I think better to have it in manual than news.

@aitap
Copy link
Contributor Author

aitap commented Nov 20, 2025

Need to double check whether interrupting the first mccollect call may leave some PIDs already waited for and reused for a different process. We don't want to terminate an unrelated process because of that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants