fix: exit non-zero when all slides fail in aggregate_slide_features_batch#131
Conversation
…atch Previously, when TITAN slide aggregation failed for every slide in a batch (e.g. CUDA OOM), extract_features would exit 0 with no output files. Nextflow then showed confusing 'exit:0 Failed Tasks' and spuriously retried the tasks using the cluster-profile retry policy. Fix (two locations): 1. feature_extract.py aggregate_slide_features_batch: raise RuntimeError when len(failed_slides) == num_slides, for both model and non-model paths. 2. extract_features.py _main_batch: defense-in-depth check after aggregate_slide_features_batch — raise if all output .pt files are missing. With errorStrategy='ignore' in nextflow.config, exit:1 is properly ignored; WDS coverage check resets the slides to PENDING for retry in the next batch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
raylim
left a comment
There was a problem hiding this comment.
The core intent is right and the fix is needed. Two instances of the same 0 == 0 false-positive exist — one in each of the two new guards.
The problem — both guards can trigger on empty input
The all-slides-fail condition len(failed_slides) == num_slides is mathematically true when both are zero. The two locations handle this inconsistently:
| Location | Placement | Empty-list safe? |
|---|---|---|
feature_extract.py non-model path (~line 1457) |
inside if failed_slides: |
✅ yes |
feature_extract.py model path (~line 1688) |
outside if failed_slides: |
❌ no |
extract_features.py defense check (~line 322) |
bare len(missing) == len(output_pt_paths) |
❌ no |
For the model path, an empty slide list causes failed_slides = [], num_slides = 0, the if failed_slides: block is skipped (correctly logs "All 0 slides processed successfully"), and then 0 == 0 fires a RuntimeError("All 0 slide(s) failed…").
For the defense-in-depth check, the same thing: output_pt_paths = [] → missing = [] → 0 == 0 → spurious raise. This path is reachable in the non-model case because aggregate_slide_features_batch returns normally for an empty non-model input.
Both are triggered by an empty --slide-paths list reaching _main_batch. That's an operator mistake, but the error message "All 0 slides failed" is actively misleading and hides the real problem.
Fixes
# feature_extract.py model path — add the same guard the non-model path already has
if failed_slides and len(failed_slides) == num_slides:
raise RuntimeError(…)
# extract_features.py defense check
if output_pt_paths and len(missing) == len(output_pt_paths):
raise RuntimeError(…)No other issues — the RuntimeError placement for the all-slides-fail case is correct for both non-empty paths; the model cleanup before the check is fine (happens in the same function scope); and the CLI exception propagation to a non-zero exit is handled by the existing top-level error handler.
Add 'failed_slides and' / 'output_pt_paths and' guards so that an empty run (0 slides) does not trigger the all-slides-failed RuntimeError. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Fixes a silent failure mode where
extract_features(and TITAN in particular) would exit with code 0 even when all slides failed during batch aggregation.Changes
mussel/utils/feature_extract.py: RaiseRuntimeErrorinaggregate_slide_features_batchwhen all slides fail (both the method-level and model-level aggregation paths).mussel/cli/extract_features.py: Add a defense-in-depth check afteraggregate_slide_features_batchreturns — raiseRuntimeErrorif no output.features.ptfiles were created.Motivation
When TITAN (or other models) hits a CUDA OOM or similar error on every slide, the batch loop catches per-slide exceptions, logs warnings, and continues — returning without writing any outputs but still exiting 0. This silently breaks downstream pipelines that depend on the exit code.
Testing
RuntimeErroris raised when the failed-slides count equals total slides.