feat: add threshold statistics preprocessing step by vojtech-cifka · Pull Request #9 · RationAI/tissue-classification

vojtech-cifka · 2026-05-06T19:01:07Z

Summary

Add preprocessing/threshold_stats.py — a QC analysis script that runs on filtered tiles and produces per-split (train/test) coverage statistics and plots logged to MLflow
Compute scalar stats (mean, std, quantiles) and threshold sweep plots for per-class ROI annotation coverage, and scalar stats + histograms for tissue coverage broken down by dominant class
Add configs/preprocessing/threshold_stats.yaml, configs/experiment/preprocessing/threshold_stats_standard.yaml, and scripts/submit_threshold_stats.py
Update configs/data/dataset.yaml with filter_tiles_run_id
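
For readers who have not opened the script, the two kinds of output named above (scalar stats and a threshold sweep) can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in preprocessing/threshold_stats.py; the function names, the quantile points, and the metric key format are assumptions.

```python
import statistics


def scalar_stats(values, quantile_points=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Mean, population std, min, max and selected quantiles of coverage values.

    Hypothetical helper: the quantile points and key names are assumptions.
    """
    ordered = sorted(values)
    stats = {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }
    for q in quantile_points:
        # Linear interpolation between the two closest ranks.
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        stats[f"q{round(q * 100):02d}"] = ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    return stats


def threshold_sweep(values, thresholds):
    """Fraction of tiles whose coverage is >= each threshold (a survival curve)."""
    n = len(values)
    return [sum(v >= t for v in values) / n for t in thresholds]


coverage = [0.0, 0.25, 0.5, 0.75, 1.0]
stats = scalar_stats(coverage)
surviving = threshold_sweep(coverage, [0.0, 0.5, 1.0])
```

Plotting `surviving` against the thresholds gives the sweep curve: it shows directly what fraction of tiles a candidate cutoff would keep.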

Notes

Runs on filter_tiles output rather than raw tiling — tiles are already guaranteed non-empty by upstream filter_tiles guards
Tissue coverage sweeps were considered but dropped — after filter_tiles the distribution is trivially flat (all tiles have non-zero tissue coverage)
Timing/debug prints removed before merge
Example plots can be seen in this MLflow run

…coverage Joins per-tile ROI coverages from the tiling, tissue_stats, and qc_stats runs on (slide_id, x, y), then logs scalar stats (mean/std/min/max + quantiles) and survival curves stratified by dominant annotation class to MLflow, so coverage thresholds for downstream filtering can be picked from the train distribution.
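
The join on (slide_id, x, y) and the stratification by dominant class described in this commit message could look roughly like the sketch below. It uses plain Python dicts so that it runs without extra dependencies; the column names (tumor, stroma, tissue_coverage) and the helper names are hypothetical and not taken from the repository.

```python
def tile_key(row):
    """Composite key identifying one tile."""
    return (row["slide_id"], row["x"], row["y"])


def join_on_tile_key(first, *others):
    """Inner-join lists of row dicts on (slide_id, x, y)."""
    joined = {tile_key(row): dict(row) for row in first}
    for table in others:
        index = {tile_key(row): row for row in table}
        # Keep only tiles present in every table, merging their columns.
        joined = {
            key: {**row, **index[key]}
            for key, row in joined.items()
            if key in index
        }
    return list(joined.values())


def dominant_class(row, class_columns):
    """Name of the class column with the highest coverage for this tile."""
    return max(class_columns, key=lambda column: row[column])


# Hypothetical per-tile tables; real column names may differ.
tiling = [
    {"slide_id": "s1", "x": 0, "y": 0, "tumor": 0.7, "stroma": 0.2},
    {"slide_id": "s1", "x": 512, "y": 0, "tumor": 0.1, "stroma": 0.6},
]
tissue = [
    {"slide_id": "s1", "x": 0, "y": 0, "tissue_coverage": 0.9},
    {"slide_id": "s1", "x": 512, "y": 0, "tissue_coverage": 0.8},
    {"slide_id": "s2", "x": 0, "y": 0, "tissue_coverage": 0.5},
]
rows = join_on_tile_key(tiling, tissue)
labels = [dominant_class(row, ["tumor", "stroma"]) for row in rows]
```

Grouping `rows` by `labels` then yields one coverage distribution per dominant class, which is what a stratified survival curve is computed from.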

…splits Loops over train/test, joins each split's tiling/tissue/qc artifacts via templated paths, and namespaces both metrics and artifact directories by split so train and test distributions can be compared in MLflow.
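
A minimal sketch of the per-split namespacing described here, assuming a path template with a {split} placeholder and slash-separated metric keys. The template string, column name, and helper names are illustrative assumptions, not the repository's actual configuration.

```python
def artifact_path(template, split):
    """Fill the split name into a templated artifact path."""
    return template.format(split=split)


def namespaced_metrics(split, stats_by_column):
    """Flatten {column: {stat: value}} into keys prefixed with the split name."""
    return {
        f"{split}/{column}/{name}": value
        for column, stats in stats_by_column.items()
        for name, value in stats.items()
    }


# Hypothetical template and stats, for illustration only.
template = "tiles/{split}/tiles.parquet"
all_metrics = {}
paths = []
for split in ("train", "test"):
    paths.append(artifact_path(template, split))
    stats_by_column = {"roi_coverage": {"mean": 0.5, "std": 0.1}}
    all_metrics.update(namespaced_metrics(split, stats_by_column))
```

Because every key carries the split prefix, the train and test values of the same statistic never overwrite each other and sit next to each other when metrics are listed alphabetically.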


coderabbitai · 2026-05-06T19:01:15Z

Warning

Rate limit exceeded

@vojtech-cifka has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 20 minutes and 38 seconds before requesting another review.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66da522b-c7af-45ca-bf06-1c69e778117b

📥 Commits

Reviewing files that changed from the base of the PR and between 7293555 and 170ed49.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (6)

configs/data/dataset.yaml
configs/experiment/preprocessing/threshold_stats_standard.yaml
configs/preprocessing/threshold_stats.yaml
preprocessing/threshold_stats.py
pyproject.toml
scripts/submit_threshold_stats.py


gemini-code-assist

Code Review

This pull request introduces a new preprocessing script, threshold_stats.py, designed to analyze and visualize tile coverage statistics for dataset splits. It includes new configuration files, a job submission script, and updates to project dependencies. Review feedback highlights the need for a guard against empty arrays in statistical computations, an adjustment to the baseline for log-scale histograms to prevent Matplotlib rendering issues, and the removal of the unused duckdb dependency.
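
One way the empty-array guard mentioned in the review might look is sketched below. It is an assumption about the shape of the fix, not the change actually made in this PR; the function name and the choice to return only a count for empty input are illustrative.

```python
import statistics


def safe_scalar_stats(values):
    """Scalar stats that tolerate an empty input.

    Hypothetical helper: for an empty sequence only the count is returned,
    so callers can skip logging instead of hitting an exception or NaN.
    """
    values = list(values)
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


empty_result = safe_scalar_stats([])
normal_result = safe_scalar_stats([0.2, 0.4])
```

An empty group can occur when, for example, no tile in a split has a given dominant class, so the guard belongs wherever stats are computed per class.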


vojtech-cifka and others added 30 commits April 27, 2026 19:15

feat: add dependencies on pyarrow and matplotlib

2941d5f

feat: add prints

6f9f22c

fix: correct directory names

cde8ac7

feat: add timings

d52b99d

feat: add schema print

6cee260

feat: add unique keys check

59be79f

refactor: use samples

42cc0b5

feat: add more check

d02d709

feat: use duckdb

de7b5ea

refactor: use duckdb

d342c49

fix: expect streaming reader on the input

c144b3a

refactor: use different graphs for visualization

65766ed

feat: log the y axis

4d15653

feat: add graphs which are not distinguished per class

4949b78

fix: cosmetic and plotting fixes

c135d28

fix: change paths

77cbf5a

feat: change graphs

cfcd969

refactor: raise bin count

df6e0ba

fix: remove name and branch

9bcedcf

fix: remove tissue coverage

24dd02c

feat: refresh stats run ids

06920e9

fix: bump up memory

86c509c

refactor: use filter_tiles output in threshold_stats instead of raw t…

bb070f2

…iling + tissue Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: merge master into feature/embeddings

3a5b750

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor: drop QC plots, split class coverage sweeps into per-class l…

5959198

…inear plots Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: add filter_tiles_run_id to dataset config

218ff24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor: crop tissue sweep to varying region and switch to linear scale

84e3b51

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: crop tissue sweep using 1% drop threshold to skip flat top region

1baf775

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vojtech-cifka and others added 6 commits May 6, 2026 14:34

refactor: split tissue coverage sweep into per-dominant-class plots

cee7e3b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: lower memory demand

83cc2b5

Merge remote-tracking branch 'origin/master' into feature/threshold-s…

96563d3

…tats

feat: merge filter_tiles from master and update filter_tiles_run_id

38dd09e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: format

07f8647

refactor: remove timing prints from threshold_stats

382866c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vojtech-cifka requested a review from vejtek May 6, 2026 19:01

vojtech-cifka self-assigned this May 6, 2026

vojtech-cifka requested review from a team and JakubPekar May 6, 2026 19:01

vojtech-cifka requested review from Adames4 and removed request for JakubPekar May 6, 2026 19:01

gemini-code-assist Bot reviewed May 6, 2026

Comment thread preprocessing/threshold_stats.py

Comment thread preprocessing/threshold_stats.py

Comment thread pyproject.toml Outdated

vojtech-cifka and others added 2 commits May 6, 2026 21:12

chore: remove unused duckdb dependency

54f6ba2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: reove duckdb from uv lock

170ed49

Adames4 approved these changes May 7, 2026

vojtech-cifka wants to merge 38 commits into master from feature/threshold-stats
