Feature/implement k fold train split #3
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds a k-fold stratified split: new dataset MLflow artifact reference.

Changes: K-fold split feature
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as Hydra Config
    participant MLflow as MLflow Artifacts
    participant Data as Parquet Loader
    participant Split as K-fold Processor
    participant Storage as Artifact Upload
    Config->>MLflow: resolve `filter_tiles_run_id` & artifact path
    MLflow->>Data: download train tiles parquet
    Data-->>Split: provide rows with `roi_coverage_*` and `slide_id`
    Split->>Split: derive `label`, `tissue_prop`, `slide_id`
    Split->>Split: collapse rare classes and build stratification labels
    Split->>Split: assign folds via StratifiedKFold
    Split->>MLflow: log per-fold metrics and label distribution tables
    Split->>Storage: write augmented parquet and upload to MLflow artifact path
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Code Review
This pull request introduces a stratified k-fold splitting utility for tissue classification datasets, including new configuration files, dependency updates, and a Kubernetes job submission script. The core logic in split/kfold_split.py handles label derivation, rare-label management, and MLflow logging. Review feedback identifies potential crashes due to missing ROI columns, flags placeholder values in the submission script, and suggests a more efficient approach for adding multiple columns to the dataset to avoid unnecessary data copying.
Actionable comments posted: 3
🧹 Nitpick comments (3)
split/kfold_split.py (3)
48-53: Prefer `warnings.warn` (or the module logger) over `print` for the rare-class warning.

`warnings.warn(..., stacklevel=2)` integrates with Python's warning machinery and will still appear in stdout/stderr by default.

🧹 Suggested refactor

```diff
+import warnings
 @@
-    if len(rare) > 0:
-        print(
-            f"WARNING: {len(rare)} label(s) have fewer than {n_folds} tiles and will "
-            f"be collapsed into 'background' for stratification: "
-            + ", ".join(f"{cls}({counts[unique == cls][0]})" for cls in rare),
-            flush=True,
-        )
+    if len(rare) > 0:
+        details = ", ".join(f"{cls}({counts[unique == cls][0]})" for cls in rare)
+        warnings.warn(
+            f"{len(rare)} label(s) have fewer than {n_folds} tiles and will "
+            f"be collapsed into 'background' for stratification: {details}",
+            stacklevel=2,
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@split/kfold_split.py` around lines 48 - 53, Replace the print warning in the k-fold stratification code with the Python warnings system: import warnings at top of split/kfold_split.py and change the print(...) call that references rare, n_folds, counts, and unique into warnings.warn(<same formatted message>, stacklevel=2) (optionally specifying category=UserWarning) so the message is emitted via the warnings machinery and can be filtered/suppressed by consumers.
68-110: Minor: the stats block has some small ergonomic/robustness gaps.

- Lines 99–110 recompute `mask`, `n_val`, and `tp_mean`/`tp_std` already derived in the metric loop above; consider iterating once and both logging and printing in the same pass to avoid drift.
- `tissue_props[mask].mean()`/`.std()` will emit a numpy `RuntimeWarning` and yield `nan` if any fold ends up empty. Stratified k-fold on the collapsed labels should normally prevent this, but a small guard (`if mask.any():`) would keep the output clean under degenerate inputs.
- `print(...)` here bypasses the MLflow logger; using `logger` (already passed into `main`) or the standard `logging` module would keep server logs consistent.

None of these are blockers — happy to leave as-is if you prefer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@split/kfold_split.py` around lines 68 - 110, The stats block in log_fold_statistics recomputes masks and may produce nan/warnings for empty folds and uses print instead of the app logger; modify log_fold_statistics(labels, tissue_props, slide_ids, folds, n_folds) to perform a single loop over fold in range(n_folds) where you compute mask once, guard with if mask.any(): before calling tissue_props[mask].mean()/std() to avoid RuntimeWarning/nan for empty folds, call mlflow.log_metric and accumulate rows for the label distribution table in that same loop (so you don't recompute), and replace print(...) with logger.info(...) (either accept a logger parameter on log_fold_statistics or call logging.getLogger(__name__) at top) to keep outputs in the centralized logs while preserving the mlflow.log_table call for label_distribution.
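A single-pass version of the loop described above might look roughly like this (a sketch only: the function name and arguments follow the review's description of `log_fold_statistics`, but the real signature and the project's MLflow logger interface are assumptions, so the MLflow call is left as a comment):

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def fold_statistics(labels, tissue_props, folds, n_folds):
    """Compute per-fold stats in one pass, guarding against empty folds."""
    rows = []
    for fold in range(n_folds):
        mask = folds == fold  # computed once, shared by all stats below
        n_val = int(mask.sum())
        if mask.any():
            tp_mean = float(tissue_props[mask].mean())
            tp_std = float(tissue_props[mask].std())
        else:
            # avoid numpy RuntimeWarning / nan on empty folds
            tp_mean = tp_std = 0.0
        # mlflow.log_metric(f"fold_{fold}_n_val", n_val)  # log in the same pass
        logger.info("fold %d: n_val=%d tissue_prop=%.3f±%.3f", fold, n_val, tp_mean, tp_std)
        dist = {cls: int((labels[mask] == cls).sum()) for cls in np.unique(labels)}
        rows.append({"fold": fold, "n_val": n_val, "tp_mean": tp_mean, "tp_std": tp_std, "dist": dist})
    return rows
```

The accumulated `rows` can then feed a single `mlflow.log_table` call for the label distribution, so metrics and the printed/logged summary can never drift apart.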
117-143: Clarify artifact upload strategy for `TemporaryDirectory` contents.

`mlflow_artifact_path: kfold_split` is the same for every run, but `logger.log_artifacts(local_dir=output_dir, ...)` uploads all files in `output_dir`. Currently only `kfold_tiles.parquet` exists, so there is no issue. However, any future side files dropped in `output_dir` (e.g., debug dumps) will silently leak into the artifact store. Consider using `log_artifact()` (single-file upload) on `out_path` instead of `log_artifacts()` on the directory, or keep the `TemporaryDirectory` minimal as it is now to prevent unintended artifact pollution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@split/kfold_split.py` around lines 117 - 143, The current code uses TemporaryDirectory and calls logger.log_artifacts(local_dir=output_dir, artifact_path=config.mlflow_artifact_path) which will upload every file in output_dir and may inadvertently include future debug files; change this to upload only the single parquet file by using logger.log_artifact(out_path, artifact_path=config.mlflow_artifact_path) (or otherwise pass the explicit file path) instead of logger.log_artifacts, keeping the TemporaryDirectory usage and the out_path variable as the single source of truth for what gets uploaded.
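The single-file upload suggested above could be sketched as follows (hedged: `logger` stands in for whatever MLflow wrapper the project passes around, and `write_parquet` for any callable that writes the augmented table, e.g. `df.to_parquet` — both interfaces are assumptions, not the project's actual API):

```python
import os
import tempfile


def log_fold_parquet(write_parquet, logger, artifact_path="kfold_split"):
    """Write the augmented parquet to a temp dir and upload exactly one file.

    `write_parquet(path)` writes the table to the given path; `logger` is a
    hypothetical MLflow wrapper exposing log_artifact.
    """
    with tempfile.TemporaryDirectory() as output_dir:
        out_path = os.path.join(output_dir, "kfold_tiles.parquet")
        write_parquet(out_path)
        # log_artifact uploads only out_path; log_artifacts(output_dir)
        # would sweep up any stray side files dropped into the directory.
        logger.log_artifact(out_path, artifact_path=artifact_path)
```

Keeping `out_path` as the single source of truth for what gets uploaded makes later debug dumps in the temp dir harmless.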
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/submit_kfold_split.py`:
- Around line 6-15: The submitted job spec contains unfilled placeholders:
change username=... in the submit_job call to a real string (e.g., your k8s
username) instead of the Ellipsis object, and replace the "+experiment=..."
token inside the script list (the uv run command) with a valid Hydra experiment
name such as "+experiment=split/kfold_split_5_folds"; alternatively, if this
file is meant as a template, add a top-of-file comment clearly stating the
fields that must be filled and provide sensible default example values for the
username and experiment placeholders so the script is runnable.
In `@split/kfold_split.py`:
- Around line 113-116: Hydra will fail because `@hydra.main`(config_name="split")
expects a primary config file; create a minimal configs/split.yaml (referenced
by the decorator on main in kfold_split.py) containing either "defaults: []" to
allow group-only composition or "defaults: [split: kfold_split]" to set the
kfold_split file as the default; add that file to the repo and then run the
entrypoint (main) with --cfg job to verify startup.
- Around line 15-56: The label array in derive_labels can get a fixed-width
Unicode dtype that silently truncates longer strings like "background"; change
the return so the labels array is created with an object dtype (e.g., use
np.array(label_ds["label"], dtype=object) or .astype(object)) to avoid
truncation, leaving tissue_prop and slide_id as before; update the np.array(...)
call for labels in derive_labels to force object dtype so collapse_rare_labels
can safely assign "background".
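The truncation pitfall called out above is easy to reproduce with NumPy alone (a minimal illustration, independent of the actual `derive_labels` code):

```python
import numpy as np

# Fixed-width Unicode dtype: the width is frozen at creation time
# ("<U6" here, from the longest string "stroma").
labels = np.array(["stroma", "tumor", "tumor"])
labels[labels == "tumor"] = "background"
print(labels)  # "background" is silently clipped to "backgr"

# Object dtype keeps full Python strings, so later assignments are safe.
labels = np.array(["stroma", "tumor", "tumor"], dtype=object)
labels[labels == "tumor"] = "background"
print(labels)  # ['stroma' 'background' 'background']
```

This is why forcing `dtype=object` on the labels array lets `collapse_rare_labels` assign `"background"` without corruption.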
---
Nitpick comments:
In `@split/kfold_split.py`:
- Around line 48-53: Replace the print warning in the k-fold stratification code
with the Python warnings system: import warnings at top of split/kfold_split.py
and change the print(...) call that references rare, n_folds, counts, and unique
into warnings.warn(<same formatted message>, stacklevel=2) (optionally
specifying category=UserWarning) so the message is emitted via the warnings
machinery and can be filtered/suppressed by consumers.
- Around line 68-110: The stats block in log_fold_statistics recomputes masks
and may produce nan/warnings for empty folds and uses print instead of the app
logger; modify log_fold_statistics(labels, tissue_props, slide_ids, folds,
n_folds) to perform a single loop over fold in range(n_folds) where you compute
mask once, guard with if mask.any(): before calling
tissue_props[mask].mean()/std() to avoid RuntimeWarning/nan for empty folds,
call mlflow.log_metric and accumulate rows for the label distribution table in
that same loop (so you don't recompute), and replace print(...) with
logger.info(...) (either accept a logger parameter on log_fold_statistics or
call logging.getLogger(__name__) at top) to keep outputs in the centralized logs
while preserving the mlflow.log_table call for label_distribution.
- Around line 117-143: The current code uses TemporaryDirectory and calls
logger.log_artifacts(local_dir=output_dir,
artifact_path=config.mlflow_artifact_path) which will upload every file in
output_dir and may inadvertently include future debug files; change this to
upload only the single parquet file by using logger.log_artifact(out_path,
artifact_path=config.mlflow_artifact_path) (or otherwise pass the explicit file
path) instead of logger.log_artifacts, keeping the TemporaryDirectory usage and
the out_path variable as the single source of truth for what gets uploaded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2f7cc549-2102-44be-8a44-b5187aeb419f
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- configs/data/dataset.yaml
- configs/experiment/split/kfold_split_5_folds.yaml
- configs/split/kfold_split.yaml
- pyproject.toml
- scripts/submit_kfold_split.py
- split/kfold_split.py
Previously, rare classes (fewer than n_folds tiles) were collapsed into 'background' in-place, and the mutated labels flowed all the way into the stored 'label' column — losing the original semantic class for those samples. Split the array in two: original 'labels' stays untouched and is what gets persisted and reported, while 'stratification_labels' (a copy with rare classes merged) is used solely as the StratifiedKFold target. Log both distributions to MLflow for visibility.
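The separation described in this commit message can be sketched roughly as follows (the function name mirrors the review discussion; the exact collapse rule and array dtypes in the real code are assumptions):

```python
import numpy as np


def build_stratification_labels(labels, n_folds):
    """Return a copy of `labels` with rare classes merged into 'background'.

    The original `labels` array is left untouched, so the semantic class of
    each tile survives into the persisted 'label' column; only the copy is
    fed to StratifiedKFold as the stratification target.
    """
    strat = labels.copy()
    unique, counts = np.unique(strat, return_counts=True)
    rare = unique[counts < n_folds]
    if len(rare) > 0:
        strat[np.isin(strat, rare)] = "background"
    return strat
```

Both `labels` and the returned `strat` distributions can then be logged to MLflow for visibility, as the commit describes.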
Actionable comments posted: 1
♻️ Duplicate comments (1)
split/kfold_split.py (1)
125-127: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`configs/split.yaml` primary config is still missing — runtime will fail with `MissingConfigException`.

`@hydra.main(config_name="split")` requires a file at `configs/split.yaml` to exist. The `+split=kfold_split` override injected by `@with_cli_args` is a config-group append that Hydra processes after locating and parsing the primary config; it cannot substitute for the missing primary file.

Create a minimal `configs/split.yaml`:

```yaml
# @package _global_
defaults:
  - _self_
```

Then select the group via CLI (`+split=kfold_split`) or set a default:

```yaml
defaults:
  - split: kfold_split
  - _self_
```

Verify startup with `python -m split.kfold_split --cfg job`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@split/kfold_split.py` around lines 125 - 127, The Hydra runtime fails because `@hydra.main`(config_name="split") requires a primary configs/split.yaml which is missing; add a minimal configs/split.yaml (with an empty defaults or a defaults list selecting split: kfold_split) so Hydra can load the primary config before group overrides applied by with_cli_args(defaults=["+split=kfold_split"]); update the defaults block to either just include _self_ or to set split: kfold_split as the default, then verify startup by running the module (e.g., python -m split.kfold_split --cfg job).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@split/kfold_split.py`:
- Around line 47-58: build_stratification_labels can return a strat array where
the collapsed "background" group still has fewer than n_folds members, causing a
cryptic StratifiedKFold error in assign_folds; after computing rare and updating
strat (and before returning), compute the merged background count (sum of counts
for classes in rare plus any existing "background" count), and if that merged
count < n_folds raise a clear ValueError (or custom exception) that lists
n_folds and the affected classes/counts so callers (and assign_folds) get a
meaningful diagnostic instead of the sklearn message; reference the variables
unique, counts, rare, strat and the function build_stratification_labels when
implementing this validation.
---
Duplicate comments:
In `@split/kfold_split.py`:
- Around line 125-127: The Hydra runtime fails because
`@hydra.main`(config_name="split") requires a primary configs/split.yaml which is
missing; add a minimal configs/split.yaml (with an empty defaults or a defaults
list selecting split: kfold_split) so Hydra can load the primary config before
group overrides applied by with_cli_args(defaults=["+split=kfold_split"]);
update the defaults block to either just include _self_ or to set split:
kfold_split as the default, then verify startup by running the module (e.g.,
python -m split.kfold_split --cfg job).
After collapsing rare classes into 'background', the merged group can still have fewer than n_folds tiles, which would surface as a cryptic sklearn error from StratifiedKFold. Raise a ValueError with the n_folds value and the list of collapsed classes so the failure is actionable.
…ists

`Dataset[col]` materialized 80M strings as a Python list before `np.array` walked it — multi-GB allocation, no progress, easy to mistake for a hang. Pull the columns straight from the underlying `pyarrow.Table`, which is near-zero-copy for fixed-width types and chunked-fast for strings.
Actionable comments posted: 1
♻️ Duplicate comments (1)
split/kfold_split.py (1)
137-140: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`configs/split.yaml` primary config is still missing — Hydra will fail at startup.

`@hydra.main(config_path="../configs", config_name="split", version_base=None)` requires a `configs/split.yaml` file to exist as the composition entry point. This was flagged in a prior review and remains unresolved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@split/kfold_split.py` around lines 137 - 140, Hydra is configured in the kfold_split module via the `@hydra.main` decorator on main (decorator: `@hydra.main`(config_path="../configs", config_name="split", version_base=None)) but the required composition entry file configs/split.yaml is missing; add a top-level configs/split.yaml (the Hydra primary config) that composes or includes the split config values expected by main (or alternatively change the decorator to point to an existing config_name/path), ensuring the config keys referenced by main and downstream code match the new split.yaml entries so Hydra can start successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@split/kfold_split.py`:
- Around line 58-70: The guard that checks background_count < n_folds is firing
even when no classes were collapsed; move the computation and check for
background_count inside the existing if len(rare) > 0 block so it only runs when
collapse happened (i.e., after strat[np.isin(strat, rare)] = "background").
Specifically, in the function that contains strat and rare (e.g.,
build_stratification_labels), stop computing background_count and raising the
ValueError unconditionally—compute background_count = int((strat ==
"background").sum()) and run the if background_count < n_folds: raise
ValueError(...) only inside the if len(rare) > 0: branch, and remove the
unconditional check after that branch.
---
Duplicate comments:
In `@split/kfold_split.py`:
- Around line 137-140: Hydra is configured in the kfold_split module via the
`@hydra.main` decorator on main (decorator: `@hydra.main`(config_path="../configs",
config_name="split", version_base=None)) but the required composition entry file
configs/split.yaml is missing; add a top-level configs/split.yaml (the Hydra
primary config) that composes or includes the split config values expected by
main (or alternatively change the decorator to point to an existing
config_name/path), ensuring the config keys referenced by main and downstream
code match the new split.yaml entries so Hydra can start successfully.
The background-count guard was running unconditionally, so an input with no rare classes and no pre-existing 'background' class would raise a spurious ValueError despite being valid for StratifiedKFold. Move the check inside the 'len(rare) > 0' branch — it only matters after a collapse actually merged classes into 'background'.
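The corrected placement can be sketched like this (a sketch following the commit message; the exact error wording and helper names are assumptions):

```python
import numpy as np


def build_stratification_labels(labels, n_folds):
    """Copy `labels` and merge rare classes into 'background', validating the merge."""
    strat = labels.copy()
    unique, counts = np.unique(strat, return_counts=True)
    rare = unique[counts < n_folds]
    if len(rare) > 0:
        strat[np.isin(strat, rare)] = "background"
        # Only meaningful after a collapse actually happened: the merged
        # group itself may still be too small for StratifiedKFold.
        background_count = int((strat == "background").sum())
        if background_count < n_folds:
            raise ValueError(
                f"Merged 'background' group has {background_count} tiles, "
                f"fewer than n_folds={n_folds}; collapsed classes: {list(rare)}"
            )
    return strat
```

With the guard inside the `len(rare) > 0` branch, an input with no rare classes and no pre-existing background class passes straight through, while a genuinely undersized merge fails with an actionable message instead of a cryptic sklearn error.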
Resolved conflict in configs/data/dataset.yaml by keeping both additions: train_tiles_filename from this branch and tissue_masks_run_id from master. Regenerated uv.lock.
…-fold-train-split
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (1)
split/kfold_split.py (1)
137-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`configs/split.yaml` primary config appears to be missing — Hydra will fail at startup.

`@hydra.main(config_name="split")` requires a primary config file at `configs/split.yaml`. Hydra does not support group-only composition without a primary config file; the `+split=kfold_split` CLI override alone cannot satisfy this. Only `configs/split/kfold_split.yaml` is visible in this PR — no `configs/split.yaml`.

```bash
#!/bin/bash
# Description: Check if configs/split.yaml exists and look at how other
# Hydra entrypoints in the repo pair their config_name with a primary config file.

echo "=== Check for configs/split.yaml ==="
fd -t f 'split.ya?ml$' configs/

echo ""
echo "=== All YAML files under configs/ top level (primary config candidates) ==="
fd -t f -d 1 '\.ya?ml$' configs/

echo ""
echo "=== @hydra.main declarations and their config_name values ==="
rg -nP --type=py '@hydra\.main\s*\(' -A2
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@split/kfold_split.py` around lines 137 - 138, Hydra is failing because `@hydra.main`(config_name="split", ...) in split/kfold_split.py expects a primary config file configs/split.yaml which is missing; fix by either (A) adding a primary config file configs/split.yaml that composes the group entry (e.g., defaults: ["split: kfold_split"] or defaults: ["+split=kfold_split"]) so the group-only override works, or (B) change the decorator in split/kfold_split.py to point directly at the existing config (e.g., set config_name to the existing file such as "split/kfold_split" or the exact primary config name you have), ensuring the config_name used by `@hydra.main` matches an actual YAML file under configs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@split/kfold_split.py`:
- Around line 137-138: Hydra is failing because `@hydra.main`(config_name="split",
...) in split/kfold_split.py expects a primary config file configs/split.yaml
which is missing; fix by either (A) adding a primary config file
configs/split.yaml that composes the group entry (e.g., defaults: ["split:
kfold_split"] or defaults: ["+split=kfold_split"]) so the group-only override
works, or (B) change the decorator in split/kfold_split.py to point directly at
the existing config (e.g., set config_name to the existing file such as
"split/kfold_split" or the exact primary config name you have), ensuring the
config_name used by `@hydra.main` matches an actual YAML file under configs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6e20c773-ecac-4c5b-9c3c-8416bc2173b6
📒 Files selected for processing (2)
- configs/data/dataset.yaml
- split/kfold_split.py
Summary
- Splits the train tiles into `n_folds` validation folds using `StratifiedKFold` on the dominant tissue class label, derived dynamically from the `roi_coverage_*` columns, so no code changes are needed when tissue classes change
- Classes with fewer than `n_folds` tiles are collapsed into background before splitting to prevent `StratifiedKFold` from crashing, with a printed warning listing affected classes and their counts
- Per-fold metrics and the label distribution table are all logged to MLflow
- The augmented parquet, with `label`, `tissue_prop`, and `fold` columns added, is then logged as an MLflow artifact
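The core fold assignment summarized above can be sketched with scikit-learn's `StratifiedKFold` (a minimal illustration of the mechanism only; the real script adds label derivation, MLflow logging, and parquet I/O, and the `assign_folds` name and signature are assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def assign_folds(strat_labels, n_folds, seed=0):
    """Return one integer fold index per sample, stratified by label."""
    folds = np.full(len(strat_labels), -1, dtype=np.int64)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # StratifiedKFold only uses X for its length, so a zeros array suffices;
    # stratification is driven entirely by strat_labels.
    for fold, (_, val_idx) in enumerate(
        skf.split(np.zeros(len(strat_labels)), strat_labels)
    ):
        folds[val_idx] = fold
    return folds
```

Each sample lands in exactly one validation fold, and per-class proportions are preserved across folds, which is what the per-fold label distribution table logged to MLflow makes visible.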