Feature/implement k fold train split #3
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds a k-fold stratified split: new dataset MLflow artifact reference.

Changes: K-fold split feature
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as Hydra Config
    participant MLflow as MLflow Artifacts
    participant Data as Parquet Loader
    participant Split as K-fold Processor
    participant Storage as Artifact Upload
    Config->>MLflow: resolve `filter_tiles_run_id` & artifact path
    MLflow->>Data: download train tiles parquet
    Data-->>Split: provide rows with `roi_coverage_*` and `slide_id`
    Split->>Split: derive `label`, `tissue_prop`, `slide_id`
    Split->>Split: collapse rare classes and build stratification labels
    Split->>Split: assign folds via StratifiedKFold
    Split->>MLflow: log per-fold metrics and label distribution tables
    Split->>Storage: write augmented parquet and upload to MLflow artifact path
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Code Review
This pull request introduces a stratified k-fold splitting utility for tissue classification datasets, including new configuration files, dependency updates, and a Kubernetes job submission script. The core logic in split/kfold_split.py handles label derivation, rare-label management, and MLflow logging. Review feedback identifies potential crashes due to missing ROI columns, flags placeholder values in the submission script, and suggests a more efficient approach for adding multiple columns to the dataset to avoid unnecessary data copying.
Actionable comments posted: 3
🧹 Nitpick comments (3)
split/kfold_split.py (3)
48-53: Prefer `warnings.warn` (or the module logger) over `print` for the rare-class warning.

`warnings.warn(..., stacklevel=2)` integrates with Python's warning machinery and will still appear in stdout/stderr by default.

🧹 Suggested refactor

```diff
+import warnings
 @@
-    if len(rare) > 0:
-        print(
-            f"WARNING: {len(rare)} label(s) have fewer than {n_folds} tiles and will "
-            f"be collapsed into 'background' for stratification: "
-            + ", ".join(f"{cls}({counts[unique == cls][0]})" for cls in rare),
-            flush=True,
-        )
+    if len(rare) > 0:
+        details = ", ".join(f"{cls}({counts[unique == cls][0]})" for cls in rare)
+        warnings.warn(
+            f"{len(rare)} label(s) have fewer than {n_folds} tiles and will "
+            f"be collapsed into 'background' for stratification: {details}",
+            stacklevel=2,
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@split/kfold_split.py` around lines 48 - 53, Replace the print warning in the k-fold stratification code with the Python warnings system: import warnings at top of split/kfold_split.py and change the print(...) call that references rare, n_folds, counts, and unique into warnings.warn(<same formatted message>, stacklevel=2) (optionally specifying category=UserWarning) so the message is emitted via the warnings machinery and can be filtered/suppressed by consumers.
68-110: Minor: the stats block has some small ergonomic/robustness gaps.

- Lines 99–110 recompute `mask`, `n_val`, and `tp_mean`/`tp_std` already derived in the metric loop above; consider iterating once and both logging and printing in the same pass to avoid drift.
- `tissue_props[mask].mean()`/`.std()` will emit a numpy `RuntimeWarning` and yield `nan` if any fold ends up empty. Stratified k-fold on the collapsed labels should normally prevent this, but a small guard (`if mask.any():`) would keep the output clean under degenerate inputs.
- `print(...)` here bypasses the MLflow logger; using `logger` (already passed into `main`) or the standard `logging` module would keep server logs consistent.

None of these are blockers — happy to leave as-is if you prefer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@split/kfold_split.py` around lines 68 - 110, The stats block in log_fold_statistics recomputes masks and may produce nan/warnings for empty folds and uses print instead of the app logger; modify log_fold_statistics(labels, tissue_props, slide_ids, folds, n_folds) to perform a single loop over fold in range(n_folds) where you compute mask once, guard with if mask.any(): before calling tissue_props[mask].mean()/std() to avoid RuntimeWarning/nan for empty folds, call mlflow.log_metric and accumulate rows for the label distribution table in that same loop (so you don't recompute), and replace print(...) with logger.info(...) (either accept a logger parameter on log_fold_statistics or call logging.getLogger(__name__) at top) to keep outputs in the centralized logs while preserving the mlflow.log_table call for label_distribution.
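A single-pass version of the loop described above might look roughly like this (a sketch only: the function name and arguments follow the review's description of `log_fold_statistics`, but the real signature and the project's MLflow logger interface are assumptions, so the MLflow call is left as a comment):

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def fold_statistics(labels, tissue_props, folds, n_folds):
    """Compute per-fold stats in one pass, guarding against empty folds."""
    rows = []
    for fold in range(n_folds):
        mask = folds == fold  # computed once, shared by all stats below
        n_val = int(mask.sum())
        if mask.any():
            tp_mean = float(tissue_props[mask].mean())
            tp_std = float(tissue_props[mask].std())
        else:
            # avoid numpy RuntimeWarning / nan on empty folds
            tp_mean = tp_std = 0.0
        # mlflow.log_metric(f"fold_{fold}_n_val", n_val)  # log in the same pass
        logger.info("fold %d: n_val=%d tissue_prop=%.3f±%.3f", fold, n_val, tp_mean, tp_std)
        dist = {cls: int((labels[mask] == cls).sum()) for cls in np.unique(labels)}
        rows.append({"fold": fold, "n_val": n_val, "tp_mean": tp_mean, "tp_std": tp_std, "dist": dist})
    return rows
```

The accumulated `rows` can then feed a single `mlflow.log_table` call for the label distribution, so metrics and the printed/logged summary can never drift apart.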
117-143: Clarify artifact upload strategy for `TemporaryDirectory` contents.

`mlflow_artifact_path: kfold_split` is the same for every run, but `logger.log_artifacts(local_dir=output_dir, ...)` uploads all files in `output_dir`. Currently only `kfold_tiles.parquet` exists, so there is no issue. However, any future side files dropped in `output_dir` (e.g., debug dumps) will silently leak into the artifact store. Consider using `log_artifact()` (single-file upload) on `out_path` instead of `log_artifacts()` on the directory, or keep the `TemporaryDirectory` minimal as it is now to prevent unintended artifact pollution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@split/kfold_split.py` around lines 117 - 143, The current code uses TemporaryDirectory and calls logger.log_artifacts(local_dir=output_dir, artifact_path=config.mlflow_artifact_path) which will upload every file in output_dir and may inadvertently include future debug files; change this to upload only the single parquet file by using logger.log_artifact(out_path, artifact_path=config.mlflow_artifact_path) (or otherwise pass the explicit file path) instead of logger.log_artifacts, keeping the TemporaryDirectory usage and the out_path variable as the single source of truth for what gets uploaded.
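The single-file upload suggested above could be sketched as follows (hedged: `logger` stands in for whatever MLflow wrapper the project passes around, and `write_parquet` for any callable that writes the augmented table, e.g. `df.to_parquet` — both interfaces are assumptions, not the project's actual API):

```python
import os
import tempfile


def log_fold_parquet(write_parquet, logger, artifact_path="kfold_split"):
    """Write the augmented parquet to a temp dir and upload exactly one file.

    `write_parquet(path)` writes the table to the given path; `logger` is a
    hypothetical MLflow wrapper exposing log_artifact.
    """
    with tempfile.TemporaryDirectory() as output_dir:
        out_path = os.path.join(output_dir, "kfold_tiles.parquet")
        write_parquet(out_path)
        # log_artifact uploads only out_path; log_artifacts(output_dir)
        # would sweep up any stray side files dropped into the directory.
        logger.log_artifact(out_path, artifact_path=artifact_path)
```

Keeping `out_path` as the single source of truth for what gets uploaded makes later debug dumps in the temp dir harmless.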
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/submit_kfold_split.py`:
- Around line 6-15: The submitted job spec contains unfilled placeholders:
change username=... in the submit_job call to a real string (e.g., your k8s
username) instead of the Ellipsis object, and replace the "+experiment=..."
token inside the script list (the uv run command) with a valid Hydra experiment
name such as "+experiment=split/kfold_split_5_folds"; alternatively, if this
file is meant as a template, add a top-of-file comment clearly stating the
fields that must be filled and provide sensible default example values for the
username and experiment placeholders so the script is runnable.
In `@split/kfold_split.py`:
- Around line 113-116: Hydra will fail because `@hydra.main`(config_name="split")
expects a primary config file; create a minimal configs/split.yaml (referenced
by the decorator on main in kfold_split.py) containing either "defaults: []" to
allow group-only composition or "defaults: [split: kfold_split]" to set the
kfold_split file as the default; add that file to the repo and then run the
entrypoint (main) with --cfg job to verify startup.
- Around line 15-56: The label array in derive_labels can get a fixed-width
Unicode dtype that silently truncates longer strings like "background"; change
the return so the labels array is created with an object dtype (e.g., use
np.array(label_ds["label"], dtype=object) or .astype(object)) to avoid
truncation, leaving tissue_prop and slide_id as before; update the np.array(...)
call for labels in derive_labels to force object dtype so collapse_rare_labels
can safely assign "background".
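The truncation pitfall called out above is easy to reproduce with NumPy alone (a minimal illustration, independent of the actual `derive_labels` code):

```python
import numpy as np

# Fixed-width Unicode dtype: the width is frozen at creation time
# ("<U6" here, from the longest string "stroma").
labels = np.array(["stroma", "tumor", "tumor"])
labels[labels == "tumor"] = "background"
print(labels)  # "background" is silently clipped to "backgr"

# Object dtype keeps full Python strings, so later assignments are safe.
labels = np.array(["stroma", "tumor", "tumor"], dtype=object)
labels[labels == "tumor"] = "background"
print(labels)  # ['stroma' 'background' 'background']
```

This is why forcing `dtype=object` on the labels array lets `collapse_rare_labels` assign `"background"` without corruption.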
---
Nitpick comments:
In `@split/kfold_split.py`:
- Around line 48-53: Replace the print warning in the k-fold stratification code
with the Python warnings system: import warnings at top of split/kfold_split.py
and change the print(...) call that references rare, n_folds, counts, and unique
into warnings.warn(<same formatted message>, stacklevel=2) (optionally
specifying category=UserWarning) so the message is emitted via the warnings
machinery and can be filtered/suppressed by consumers.
- Around line 68-110: The stats block in log_fold_statistics recomputes masks
and may produce nan/warnings for empty folds and uses print instead of the app
logger; modify log_fold_statistics(labels, tissue_props, slide_ids, folds,
n_folds) to perform a single loop over fold in range(n_folds) where you compute
mask once, guard with if mask.any(): before calling
tissue_props[mask].mean()/std() to avoid RuntimeWarning/nan for empty folds,
call mlflow.log_metric and accumulate rows for the label distribution table in
that same loop (so you don't recompute), and replace print(...) with
logger.info(...) (either accept a logger parameter on log_fold_statistics or
call logging.getLogger(__name__) at top) to keep outputs in the centralized logs
while preserving the mlflow.log_table call for label_distribution.
- Around line 117-143: The current code uses TemporaryDirectory and calls
logger.log_artifacts(local_dir=output_dir,
artifact_path=config.mlflow_artifact_path) which will upload every file in
output_dir and may inadvertently include future debug files; change this to
upload only the single parquet file by using logger.log_artifact(out_path,
artifact_path=config.mlflow_artifact_path) (or otherwise pass the explicit file
path) instead of logger.log_artifacts, keeping the TemporaryDirectory usage and
the out_path variable as the single source of truth for what gets uploaded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2f7cc549-2102-44be-8a44-b5187aeb419f
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- configs/data/dataset.yaml
- configs/experiment/split/kfold_split_5_folds.yaml
- configs/split/kfold_split.yaml
- pyproject.toml
- scripts/submit_kfold_split.py
- split/kfold_split.py
Previously, rare classes (fewer than n_folds tiles) were collapsed into 'background' in-place, and the mutated labels flowed all the way into the stored 'label' column — losing the original semantic class for those samples. Split the array in two: original 'labels' stays untouched and is what gets persisted and reported, while 'stratification_labels' (a copy with rare classes merged) is used solely as the StratifiedKFold target. Log both distributions to MLflow for visibility.
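The separation described in this commit message can be sketched roughly as follows (the function name mirrors the review discussion; the exact collapse rule and array dtypes in the real code are assumptions):

```python
import numpy as np


def build_stratification_labels(labels, n_folds):
    """Return a copy of `labels` with rare classes merged into 'background'.

    The original `labels` array is left untouched, so the semantic class of
    each tile survives into the persisted 'label' column; only the copy is
    fed to StratifiedKFold as the stratification target.
    """
    strat = labels.copy()
    unique, counts = np.unique(strat, return_counts=True)
    rare = unique[counts < n_folds]
    if len(rare) > 0:
        strat[np.isin(strat, rare)] = "background"
    return strat
```

Both `labels` and the returned `strat` distributions can then be logged to MLflow for visibility, as the commit describes.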
Actionable comments posted: 1
♻️ Duplicate comments (1)
split/kfold_split.py (1)
125-127: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`configs/split.yaml` primary config is still missing — runtime will fail with `MissingConfigException`.

`@hydra.main(config_name="split")` requires a file at `configs/split.yaml` to exist. The `+split=kfold_split` override injected by `@with_cli_args` is a config-group append that Hydra processes after locating and parsing the primary config; it cannot substitute for the missing primary file.

Create a minimal `configs/split.yaml`:

```yaml
# @package _global_
defaults:
  - _self_
```

Then select the group via CLI (`+split=kfold_split`) or set a default:

```yaml
defaults:
  - split: kfold_split
  - _self_
```

Verify startup with `python -m split.kfold_split --cfg job`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@split/kfold_split.py` around lines 125 - 127, The Hydra runtime fails because `@hydra.main`(config_name="split") requires a primary configs/split.yaml which is missing; add a minimal configs/split.yaml (with an empty defaults or a defaults list selecting split: kfold_split) so Hydra can load the primary config before group overrides applied by with_cli_args(defaults=["+split=kfold_split"]); update the defaults block to either just include _self_ or to set split: kfold_split as the default, then verify startup by running the module (e.g., python -m split.kfold_split --cfg job).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@split/kfold_split.py`:
- Around line 47-58: build_stratification_labels can return a strat array where
the collapsed "background" group still has fewer than n_folds members, causing a
cryptic StratifiedKFold error in assign_folds; after computing rare and updating
strat (and before returning), compute the merged background count (sum of counts
for classes in rare plus any existing "background" count), and if that merged
count < n_folds raise a clear ValueError (or custom exception) that lists
n_folds and the affected classes/counts so callers (and assign_folds) get a
meaningful diagnostic instead of the sklearn message; reference the variables
unique, counts, rare, strat and the function build_stratification_labels when
implementing this validation.
---
Duplicate comments:
In `@split/kfold_split.py`:
- Around line 125-127: The Hydra runtime fails because
`@hydra.main`(config_name="split") requires a primary configs/split.yaml which is
missing; add a minimal configs/split.yaml (with an empty defaults or a defaults
list selecting split: kfold_split) so Hydra can load the primary config before
group overrides applied by with_cli_args(defaults=["+split=kfold_split"]);
update the defaults block to either just include _self_ or to set split:
kfold_split as the default, then verify startup by running the module (e.g.,
python -m split.kfold_split --cfg job).
After collapsing rare classes into 'background', the merged group can still have fewer than n_folds tiles, which would surface as a cryptic sklearn error from StratifiedKFold. Raise a ValueError with the n_folds value and the list of collapsed classes so the failure is actionable.
…ists

`Dataset[col]` materialized 80M strings as a Python list before `np.array` walked it — multi-GB allocation, no progress, easy to mistake for a hang. Pull the columns straight from the underlying `pyarrow.Table`, which is near-zero-copy for fixed-width types and chunked-fast for strings.
Actionable comments posted: 1
♻️ Duplicate comments (1)
split/kfold_split.py (1)
137-140: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`configs/split.yaml` primary config is still missing — Hydra will fail at startup.

`@hydra.main(config_path="../configs", config_name="split", version_base=None)` requires a `configs/split.yaml` file to exist as the composition entry point. This was flagged in a prior review and remains unresolved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@split/kfold_split.py` around lines 137 - 140, Hydra is configured in the kfold_split module via the `@hydra.main` decorator on main (decorator: `@hydra.main`(config_path="../configs", config_name="split", version_base=None)) but the required composition entry file configs/split.yaml is missing; add a top-level configs/split.yaml (the Hydra primary config) that composes or includes the split config values expected by main (or alternatively change the decorator to point to an existing config_name/path), ensuring the config keys referenced by main and downstream code match the new split.yaml entries so Hydra can start successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@split/kfold_split.py`:
- Around line 58-70: The guard that checks background_count < n_folds is firing
even when no classes were collapsed; move the computation and check for
background_count inside the existing if len(rare) > 0 block so it only runs when
collapse happened (i.e., after strat[np.isin(strat, rare)] = "background").
Specifically, in the function that contains strat and rare (e.g.,
build_stratification_labels), stop computing background_count and raising the
ValueError unconditionally—compute background_count = int((strat ==
"background").sum()) and run the if background_count < n_folds: raise
ValueError(...) only inside the if len(rare) > 0: branch, and remove the
unconditional check after that branch.
---
Duplicate comments:
In `@split/kfold_split.py`:
- Around line 137-140: Hydra is configured in the kfold_split module via the
`@hydra.main` decorator on main (decorator: `@hydra.main`(config_path="../configs",
config_name="split", version_base=None)) but the required composition entry file
configs/split.yaml is missing; add a top-level configs/split.yaml (the Hydra
primary config) that composes or includes the split config values expected by
main (or alternatively change the decorator to point to an existing
config_name/path), ensuring the config keys referenced by main and downstream
code match the new split.yaml entries so Hydra can start successfully.
The background-count guard was running unconditionally, so an input with no rare classes and no pre-existing 'background' class would raise a spurious ValueError despite being valid for StratifiedKFold. Move the check inside the 'len(rare) > 0' branch — it only matters after a collapse actually merged classes into 'background'.
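The corrected placement can be sketched like this (a sketch following the commit message; the exact error wording and helper names are assumptions):

```python
import numpy as np


def build_stratification_labels(labels, n_folds):
    """Copy `labels` and merge rare classes into 'background', validating the merge."""
    strat = labels.copy()
    unique, counts = np.unique(strat, return_counts=True)
    rare = unique[counts < n_folds]
    if len(rare) > 0:
        strat[np.isin(strat, rare)] = "background"
        # Only meaningful after a collapse actually happened: the merged
        # group itself may still be too small for StratifiedKFold.
        background_count = int((strat == "background").sum())
        if background_count < n_folds:
            raise ValueError(
                f"Merged 'background' group has {background_count} tiles, "
                f"fewer than n_folds={n_folds}; collapsed classes: {list(rare)}"
            )
    return strat
```

With the guard inside the `len(rare) > 0` branch, an input with no rare classes and no pre-existing background class passes straight through, while a genuinely undersized merge fails with an actionable message instead of a cryptic sklearn error.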
Resolved conflict in configs/data/dataset.yaml by keeping both additions: train_tiles_filename from this branch and tissue_masks_run_id from master. Regenerated uv.lock.
…-fold-train-split
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (1)
split/kfold_split.py (1)
137-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`configs/split.yaml` primary config appears to be missing — Hydra will fail at startup.

`@hydra.main(config_name="split")` requires a primary config file at `configs/split.yaml`. Hydra does not support group-only composition without a primary config file; the `+split=kfold_split` CLI override alone cannot satisfy this. Only `configs/split/kfold_split.yaml` is visible in this PR — no `configs/split.yaml`.

```bash
#!/bin/bash
# Description: Check if configs/split.yaml exists and look at how other
# Hydra entrypoints in the repo pair their config_name with a primary config file.

echo "=== Check for configs/split.yaml ==="
fd -t f 'split.ya?ml$' configs/

echo ""
echo "=== All YAML files under configs/ top level (primary config candidates) ==="
fd -t f -d 1 '\.ya?ml$' configs/

echo ""
echo "=== @hydra.main declarations and their config_name values ==="
rg -nP --type=py '@hydra\.main\s*\(' -A2
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@split/kfold_split.py` around lines 137 - 138, Hydra is failing because `@hydra.main`(config_name="split", ...) in split/kfold_split.py expects a primary config file configs/split.yaml which is missing; fix by either (A) adding a primary config file configs/split.yaml that composes the group entry (e.g., defaults: ["split: kfold_split"] or defaults: ["+split=kfold_split"]) so the group-only override works, or (B) change the decorator in split/kfold_split.py to point directly at the existing config (e.g., set config_name to the existing file such as "split/kfold_split" or the exact primary config name you have), ensuring the config_name used by `@hydra.main` matches an actual YAML file under configs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@split/kfold_split.py`:
- Around line 137-138: Hydra is failing because `@hydra.main`(config_name="split",
...) in split/kfold_split.py expects a primary config file configs/split.yaml
which is missing; fix by either (A) adding a primary config file
configs/split.yaml that composes the group entry (e.g., defaults: ["split:
kfold_split"] or defaults: ["+split=kfold_split"]) so the group-only override
works, or (B) change the decorator in split/kfold_split.py to point directly at
the existing config (e.g., set config_name to the existing file such as
"split/kfold_split" or the exact primary config name you have), ensuring the
config_name used by `@hydra.main` matches an actual YAML file under configs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6e20c773-ecac-4c5b-9c3c-8416bc2173b6
📒 Files selected for processing (2)
- configs/data/dataset.yaml
- split/kfold_split.py
Summary
- Splits the train tiles into `n_folds` validation folds using `StratifiedKFold` on the dominant tissue class label, derived dynamically from the `roi_coverage_*` columns, so no code changes are needed when tissue classes change
- Classes with fewer than `n_folds` tiles are collapsed into background before splitting to prevent `StratifiedKFold` from crashing, with a printed warning listing affected classes and their counts
- Per-fold metrics and the label distribution table are all logged to MLflow
- The augmented parquet, with `label`, `tissue_prop`, and `fold` columns added, is then logged as an MLflow artifact
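The core fold assignment summarized above can be sketched with scikit-learn's `StratifiedKFold` (a minimal illustration of the mechanism only; the real script adds label derivation, MLflow logging, and parquet I/O, and the `assign_folds` name and signature are assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def assign_folds(strat_labels, n_folds, seed=0):
    """Return one integer fold index per sample, stratified by label."""
    folds = np.full(len(strat_labels), -1, dtype=np.int64)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # StratifiedKFold only uses X for its length, so a zeros array suffices;
    # stratification is driven entirely by strat_labels.
    for fold, (_, val_idx) in enumerate(
        skf.split(np.zeros(len(strat_labels)), strat_labels)
    ):
        folds[val_idx] = fold
    return folds
```

Each sample lands in exactly one validation fold, and per-class proportions are preserved across folds, which is what the per-fold label distribution table logged to MLflow makes visible.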