feat: add threshold statistics preprocessing step #9

Open
vojtech-cifka wants to merge 38 commits into master from feature/threshold-stats
@vojtech-cifka
Collaborator

Summary

  • Add preprocessing/threshold_stats.py — a QC analysis script that runs on filtered tiles and produces per-split (train/test) coverage statistics and plots logged to MLflow
  • Compute scalar stats (mean, std, quantiles) and threshold sweep plots for per-class ROI annotation coverage, and scalar stats + histograms for tissue coverage broken down by dominant class
  • Add configs/preprocessing/threshold_stats.yaml, configs/experiment/preprocessing/threshold_stats_standard.yaml, and scripts/submit_threshold_stats.py
  • Update configs/data/dataset.yaml with filter_tiles_run_id
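
As a rough illustration of the scalar stats and threshold sweep described above, here is a minimal sketch. It is not the code in preprocessing/threshold_stats.py: the function names, the choice of quartiles, and the sample coverage values are all assumptions made for the example, and it uses only the Python standard library to stay dependency-free.

```python
import statistics


def scalar_stats(values):
    """Mean, population std, min/max and quartiles of coverage fractions.

    Hypothetical helper; needs at least two values.
    """
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "q25": q25,
        "q50": q50,
        "q75": q75,
    }


def threshold_sweep(values, thresholds):
    """For each threshold, the fraction of tiles with coverage >= threshold."""
    n = len(values)
    return [sum(v >= t for v in values) / n for t in thresholds]


# Made-up per-tile ROI coverage fractions for a single class.
coverage = [0.0, 0.1, 0.4, 0.5, 0.9]
stats = scalar_stats(coverage)
sweep = threshold_sweep(coverage, [0.0, 0.25, 0.5, 0.75])
```

Plotting `sweep` against the thresholds gives the kind of curve from which a coverage cutoff can be read off: each point shows how many tiles would survive a given threshold.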

Notes

  • Runs on filter_tiles output rather than raw tiling — tiles are already guaranteed non-empty by upstream filter_tiles guards
  • Tissue coverage sweeps were considered but dropped — after filter_tiles the distribution is trivially flat (all tiles have non-zero tissue coverage)
  • Timing/debug prints removed before merge
  • Example plots can be seen in this MLflow run

vojtech-cifka and others added 30 commits April 27, 2026 19:15
…coverage

Joins per-tile ROI coverages from the tiling, tissue_stats, and qc_stats runs
on (slide_id, x, y), then logs scalar stats (mean/std/min/max + quantiles) and
survival curves stratified by dominant annotation class to MLflow, so coverage
thresholds for downstream filtering can be picked from the train distribution.
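
The join and the stratification by dominant class described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: the row layout, the `roi_coverage`, `tissue_coverage` and `blur` field names, and the helper name `join_on_tile_key` are hypothetical, and plain dicts stand in for whatever table format the real runs produce.

```python
def join_on_tile_key(*tables):
    """Inner-join lists of row dicts on the (slide_id, x, y) tile key."""
    def key(row):
        return (row["slide_id"], row["x"], row["y"])

    indexed = [{key(row): row for row in table} for table in tables]
    # Keep only tiles that appear in every table.
    shared = set(indexed[0]).intersection(*indexed[1:])
    joined = []
    for k in sorted(shared):
        merged = {}
        for index in indexed:
            merged.update(index[k])
        joined.append(merged)
    return joined


# Made-up rows standing in for the tiling, tissue_stats and qc_stats outputs.
tiling = [
    {"slide_id": "s1", "x": 0, "y": 0, "roi_coverage": {"tumor": 0.7, "stroma": 0.1}},
    {"slide_id": "s1", "x": 256, "y": 0, "roi_coverage": {"tumor": 0.0, "stroma": 0.6}},
]
tissue = [
    {"slide_id": "s1", "x": 0, "y": 0, "tissue_coverage": 0.95},
    {"slide_id": "s1", "x": 256, "y": 0, "tissue_coverage": 0.8},
    {"slide_id": "s2", "x": 0, "y": 0, "tissue_coverage": 0.5},  # no match, dropped
]
qc = [
    {"slide_id": "s1", "x": 0, "y": 0, "blur": 0.02},
    {"slide_id": "s1", "x": 256, "y": 0, "blur": 0.3},
]

joined = join_on_tile_key(tiling, tissue, qc)

# Stratify tissue coverage by the class with the largest ROI coverage per tile.
by_class = {}
for row in joined:
    dominant = max(row["roi_coverage"], key=row["roi_coverage"].get)
    by_class.setdefault(dominant, []).append(row["tissue_coverage"])
```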
…splits

Loops over train/test, joins each split's tiling/tissue/qc artifacts via
templated paths, and namespaces both metrics and artifact directories by
split so train and test distributions can be compared in MLflow.
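
A minimal sketch of the per-split loop with templated paths and split-namespaced names follows. The path template, run id, metric name and helper names are hypothetical; the sketch only builds the names and does not call MLflow, though a dict shaped like `all_metrics` is the kind of input `mlflow.log_metrics` accepts.

```python
# Hypothetical artifact layout; the real templates live in the PR's configs.
PATH_TEMPLATE = "runs/{run_id}/{split}/tiles.parquet"


def namespaced(split, metrics):
    """Prefix every metric name with its split, e.g. 'train/roi_coverage_mean'."""
    return {f"{split}/{name}": value for name, value in metrics.items()}


def artifact_dir(split, kind):
    """Artifact sub-directory for one split, e.g. 'test/plots'."""
    return f"{split}/{kind}"


all_metrics = {}
paths = {}
for split in ("train", "test"):
    paths[split] = PATH_TEMPLATE.format(run_id="abc123", split=split)
    # Placeholder values; real stats would be computed from the joined tiles.
    split_metrics = {"roi_coverage_mean": 0.4 if split == "train" else 0.35}
    all_metrics.update(namespaced(split, split_metrics))
```

Keeping the split as a name prefix means train and test values never overwrite each other in a single run and sort next to each other in the tracking UI.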
…iling + tissue

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…inear plots

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vojtech-cifka vojtech-cifka requested a review from vejtek May 6, 2026 19:01
@vojtech-cifka vojtech-cifka self-assigned this May 6, 2026
@vojtech-cifka vojtech-cifka requested review from a team and JakubPekar May 6, 2026 19:01
@coderabbitai

coderabbitai Bot commented May 6, 2026

Warning

Rate limit exceeded

@vojtech-cifka has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 20 minutes and 38 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66da522b-c7af-45ca-bf06-1c69e778117b

📥 Commits

Reviewing files that changed from the base of the PR and between 7293555 and 170ed49.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • configs/data/dataset.yaml
  • configs/experiment/preprocessing/threshold_stats_standard.yaml
  • configs/preprocessing/threshold_stats.yaml
  • preprocessing/threshold_stats.py
  • pyproject.toml
  • scripts/submit_threshold_stats.py
@vojtech-cifka vojtech-cifka requested review from Adames4 and removed request for JakubPekar May 6, 2026 19:01

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new preprocessing script, threshold_stats.py, designed to analyze and visualize tile coverage statistics for dataset splits. It includes new configuration files, a job submission script, and updates to project dependencies. Review feedback highlights the need for a guard against empty arrays in statistical computations, an adjustment to the baseline for log-scale histograms to prevent Matplotlib rendering issues, and the removal of the unused duckdb dependency.
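
The first review point, guarding statistical computations against empty input, can be sketched as below. This is one possible shape for such a guard, not the fix that was applied in the PR: the function name and the choice to return NaN values with a zero count (rather than skipping the class or raising) are assumptions.

```python
import math
import statistics


def safe_scalar_stats(values):
    """Scalar stats that tolerate an empty input instead of raising.

    Hypothetical helper: an empty class or split yields NaNs and a zero
    count, so the caller can decide whether to skip logging it.
    """
    values = list(values)
    if not values:
        return {"count": 0, "mean": math.nan, "std": math.nan, "median": math.nan}
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "median": statistics.median(values),
    }


empty = safe_scalar_stats([])
some = safe_scalar_stats([0.2, 0.4])
```

Returning an explicit count alongside the NaNs lets downstream code tell an empty group apart from one whose statistics are genuinely undefined.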

Comment thread preprocessing/threshold_stats.py
Comment thread preprocessing/threshold_stats.py
Comment thread pyproject.toml Outdated
vojtech-cifka and others added 2 commits May 6, 2026 21:12
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 participants