fix: exit non-zero when all slides fail in aggregate_slide_features_batch by raylim · Pull Request #131 · pathology-data-mining/Mussel

raylim · 2026-06-16T18:20:47Z

Summary

Fixes a silent failure mode where extract_features (and TITAN in particular) would exit with code 0 even when all slides failed during batch aggregation.

Changes

mussel/utils/feature_extract.py: Raise RuntimeError in aggregate_slide_features_batch when all slides fail (both the method-level and model-level aggregation paths).
mussel/cli/extract_features.py: Add a defense-in-depth check after aggregate_slide_features_batch returns — raise RuntimeError if no output .features.pt files were created.
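As a rough illustration of the defense-in-depth check described above, the sketch below verifies that at least one expected output file exists after aggregation. The helper name `check_outputs_exist`, its signature, and the error message are hypothetical and not taken from Mussel. The sketch assumes an empty list of expected outputs should not count as a failure.

```python
from pathlib import Path
from typing import Sequence


def check_outputs_exist(output_pt_paths: Sequence[Path]) -> list[Path]:
    """Return the expected outputs that are missing on disk.

    Hypothetical helper (not Mussel's API): raises RuntimeError only when
    there is at least one expected output and none of them were written.
    """
    missing = [p for p in output_pt_paths if not p.exists()]
    # The truthiness test keeps an empty run (0 expected outputs) from raising.
    if output_pt_paths and len(missing) == len(output_pt_paths):
        raise RuntimeError(
            f"No output files were created ({len(missing)} expected)"
        )
    return missing
```

A partial failure returns the list of missing paths so the caller can log a warning without aborting the run.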

Motivation

When TITAN (or other models) hits a CUDA OOM or similar error on every slide, the batch loop catches per-slide exceptions, logs warnings, and continues — returning without writing any outputs but still exiting 0. This silently breaks downstream pipelines that depend on the exit code.
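The failure mode can be sketched as a batch loop that swallows per-slide exceptions. This is a minimal sketch under stated assumptions, not the code in `feature_extract.py`: `aggregate_batch` and `aggregate_one` are hypothetical names, and the error message is illustrative. Without the final check, the function would return normally even when every slide failed.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def aggregate_batch(
    slides: Sequence[str], aggregate_one: Callable[[str], None]
) -> list[str]:
    """Run aggregate_one on each slide and return the slides that failed."""
    failed_slides: list[str] = []
    for slide in slides:
        try:
            aggregate_one(slide)
        except Exception as exc:
            # Per-slide errors (e.g. CUDA OOM) are logged and the loop continues.
            logger.warning("Slide %s failed: %s", slide, exc)
            failed_slides.append(slide)

    num_slides = len(slides)
    # Escalate only when there were failures and they cover the whole batch.
    if failed_slides and len(failed_slides) == num_slides:
        raise RuntimeError(f"All {num_slides} slide(s) failed during aggregation")
    return failed_slides
```

Partial failures still return normally, so the caller keeps the existing behavior of warning without raising.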

Testing

Verified the new RuntimeError is raised when the failed-slides count equals total slides.
Existing partial-failure behavior (some slides succeed, some fail) is unchanged — a warning is logged but no exception is raised.

…atch

Previously, when TITAN slide aggregation failed for every slide in a batch (e.g. CUDA OOM), extract_features would exit 0 with no output files. Nextflow then showed confusing 'exit:0 Failed Tasks' and spuriously retried the tasks using the cluster-profile retry policy.

Fix (two locations):
1. feature_extract.py aggregate_slide_features_batch: raise RuntimeError when len(failed_slides) == num_slides, for both model and non-model paths.
2. extract_features.py _main_batch: defense-in-depth check after aggregate_slide_features_batch; raise if all output .pt files are missing.

With errorStrategy='ignore' in nextflow.config, exit:1 is properly ignored; WDS coverage check resets the slides to PENDING for retry in the next batch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

raylim

The core intent is right and the fix is needed. Two instances of the same 0 == 0 false-positive exist — one in each of the two new guards.

The problem — both guards can trigger on empty input

The all-slides-fail condition len(failed_slides) == num_slides is mathematically true when both are zero. The two locations handle this inconsistently:

| Location | Placement | Empty-list safe? |
| --- | --- | --- |
| `feature_extract.py` non-model path (~line 1457) | inside `if failed_slides:` | ✅ yes |
| `feature_extract.py` model path (~line 1688) | outside `if failed_slides:` | ❌ no |
| `extract_features.py` defense check (~line 322) | bare `len(missing) == len(output_pt_paths)` | ❌ no |

For the model path, an empty slide list causes failed_slides = [], num_slides = 0, the if failed_slides: block is skipped (correctly logs "All 0 slides processed successfully"), and then 0 == 0 fires a RuntimeError("All 0 slide(s) failed…").

For the defense-in-depth check, the same thing: output_pt_paths = [] → missing = [] → 0 == 0 → spurious raise. This path is reachable in the non-model case because aggregate_slide_features_batch returns normally for an empty non-model input.

Both are triggered by an empty --slide-paths list reaching _main_batch. That's an operator mistake, but the error message "All 0 slides failed" is actively misleading and hides the real problem.
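The false positive is easy to reproduce in isolation. The two predicates below are hypothetical stand-ins for the checks discussed here, written only to contrast the bare length comparison with one that also requires a non-empty failure list.

```python
def all_failed_unguarded(failed_slides: list[str], num_slides: int) -> bool:
    """Bare comparison: also true when both counts are zero."""
    return len(failed_slides) == num_slides


def all_failed_guarded(failed_slides: list[str], num_slides: int) -> bool:
    """Requires at least one failure before comparing counts."""
    return bool(failed_slides) and len(failed_slides) == num_slides


print(all_failed_unguarded([], 0))          # True  (spurious: nothing was run)
print(all_failed_guarded([], 0))            # False
print(all_failed_guarded(["a.svs"], 1))     # True  (genuine all-fail)
print(all_failed_guarded(["a.svs"], 2))     # False (partial failure)
```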

Fixes

# feature_extract.py model path — add the same guard the non-model path already has
if failed_slides and len(failed_slides) == num_slides:
    raise RuntimeError(…)

# extract_features.py defense check
if output_pt_paths and len(missing) == len(output_pt_paths):
    raise RuntimeError(…)

No other issues — the RuntimeError placement for the all-slides-fail case is correct for both non-empty paths; the model cleanup before the check is fine (happens in the same function scope); and the CLI exception propagation to a non-zero exit is handled by the existing top-level error handler.
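For readers unfamiliar with how a raised exception becomes a non-zero exit code, here is a generic sketch of a top-level handler. It is an assumption about the general pattern, not a copy of Mussel's handler; `run_cli` and the log message are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def run_cli(main: Callable[[], None]) -> int:
    """Run main and translate any exception into exit code 1."""
    try:
        main()
    except Exception:
        # Log the traceback, then report failure through the return value.
        logger.exception("extract_features failed")
        return 1
    return 0


# Typical wiring at the entry point (shown as a comment so importing this
# sketch has no side effects):
#     sys.exit(run_cli(main))
```

Python would also exit with status 1 on an uncaught exception without any handler; an explicit handler mainly gives control over logging and the exact code returned.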

fix: guard against 0==0 false positive in all-slides-failed checks

cacdd8e

Add 'failed_slides and' / 'output_pt_paths and' guards so that an empty run (0 slides) does not trigger the all-slides-failed RuntimeError. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

raylim merged commit 53dbf15 into main Jun 16, 2026
3 checks passed

raylim deleted the fix/titan-exit-code branch June 16, 2026 18:31

raylim mentioned this pull request Jun 17, 2026

Fix: exit non-zero when all slides fail in aggregate_slide_features_batch #132

Closed

raylim merged 2 commits into main from fix/titan-exit-code
