Skip to content

fix: exit non-zero when all slides fail in aggregate_slide_features_batch#131

Merged
raylim merged 2 commits into
mainfrom
fix/titan-exit-code
Jun 16, 2026
Merged

fix: exit non-zero when all slides fail in aggregate_slide_features_batch#131
raylim merged 2 commits into
mainfrom
fix/titan-exit-code

Conversation

@raylim

@raylim raylim commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes a silent failure mode where extract_features (and TITAN in particular) would exit with code 0 even when all slides failed during batch aggregation.

Changes

  • mussel/utils/feature_extract.py: Raise RuntimeError in aggregate_slide_features_batch when all slides fail (both the method-level and model-level aggregation paths).
  • mussel/cli/extract_features.py: Add a defense-in-depth check after aggregate_slide_features_batch returns — raise RuntimeError if no output .features.pt files were created.

Motivation

When TITAN (or other models) hits a CUDA OOM or similar error on every slide, the batch loop catches per-slide exceptions, logs warnings, and continues — returning without writing any outputs but still exiting 0. This silently breaks downstream pipelines that depend on the exit code.

Testing

  • Verified the new RuntimeError is raised when the failed-slides count equals total slides.
  • Existing partial-failure behavior (some slides succeed, some fail) is unchanged — a warning is logged but no exception is raised.

…atch

Previously, when TITAN slide aggregation failed for every slide in a batch
(e.g. CUDA OOM), extract_features would exit 0 with no output files.
Nextflow then showed confusing 'exit:0 Failed Tasks' and spuriously retried
the tasks using the cluster-profile retry policy.

Fix (two locations):
1. feature_extract.py aggregate_slide_features_batch: raise RuntimeError when
   len(failed_slides) == num_slides, for both model and non-model paths.
2. extract_features.py _main_batch: defense-in-depth check after
   aggregate_slide_features_batch — raise if all output .pt files are missing.

With errorStrategy='ignore' in nextflow.config, exit:1 is properly ignored;
WDS coverage check resets the slides to PENDING for retry in the next batch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@raylim raylim left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core intent is right and the fix is needed. Two instances of the same 0 == 0 false-positive exist — one in each of the two new guards.


The problem — both guards can trigger on empty input

The all-slides-fail condition len(failed_slides) == num_slides is mathematically true when both are zero. The two locations handle this inconsistently:

Location Placement Empty-list safe?
feature_extract.py non-model path (~line 1457) inside if failed_slides: ✅ yes
feature_extract.py model path (~line 1688) outside if failed_slides: ❌ no
extract_features.py defense check (~line 322) bare len(missing) == len(output_pt_paths) ❌ no

For the model path, an empty slide list causes failed_slides = [], num_slides = 0, the if failed_slides: block is skipped (correctly logs "All 0 slides processed successfully"), and then 0 == 0 fires a RuntimeError("All 0 slide(s) failed…").

For the defense-in-depth check, the same thing: output_pt_paths = []missing = []0 == 0 → spurious raise. This path is reachable in the non-model case because aggregate_slide_features_batch returns normally for an empty non-model input.

Both are triggered by an empty --slide-paths list reaching _main_batch. That's an operator mistake, but the error message "All 0 slides failed" is actively misleading and hides the real problem.

Fixes

# feature_extract.py model path — add the same guard the non-model path already has
if failed_slides and len(failed_slides) == num_slides:
    raise RuntimeError(…)

# extract_features.py defense check
if output_pt_paths and len(missing) == len(output_pt_paths):
    raise RuntimeError(…)

No other issues — the RuntimeError placement for the all-slides-fail case is correct for both non-empty paths; the model cleanup before the check is fine (happens in the same function scope); and the CLI exception propagation to a non-zero exit is handled by the existing top-level error handler.

Comment thread mussel/utils/feature_extract.py Outdated
Comment thread mussel/cli/extract_features.py
Add 'failed_slides and' / 'output_pt_paths and' guards so that an
empty run (0 slides) does not trigger the all-slides-failed RuntimeError.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@raylim raylim merged commit 53dbf15 into main Jun 16, 2026
3 checks passed
@raylim raylim deleted the fix/titan-exit-code branch June 16, 2026 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant