Skip to content

Feature/implement k fold train split#3

Open
vojtech-cifka wants to merge 41 commits intomasterfrom
feature/implement-k-fold-train-split
Open

Feature/implement k fold train split#3
vojtech-cifka wants to merge 41 commits intomasterfrom
feature/implement-k-fold-train-split

Conversation

@vojtech-cifka
Copy link
Copy Markdown
Collaborator

@vojtech-cifka vojtech-cifka commented Apr 24, 2026

Summary

  • Loads the tiling parquet artifact from MLflow and assigns each of the filtered tiles to one of
    n_folds validation folds using StratifiedKFold on the dominant tissue class label
  • Labels and ROI coverage columns are derived dynamically from the parquet schema
    (roi_coverage_* columns), so no code changes are needed when tissue classes change
  • Rare classes with fewer than n_folds tiles are collapsed into background before
    splitting to prevent StratifiedKFold from crashing, with a printed warning listing
    affected classes and their counts
  • Fold assignments, per-fold tile/slide counts, tissue proportion stats, and label
    distribution table are all logged to MLflow
  • Output is written back to parquet with label, tissue_prop, and fold columns
    added, then logged as an MLflow artifact
  • Hydra config + experiment overrides; cluster submission script included

Summary by CodeRabbit

  • New Features

    • K-fold cross-validation split with stratified handling (collapses rare labels), per-fold summaries, and export of fold assignments and augmented training tiles.
    • Automatic derivation of per-tile label and tissue proportion, MLflow logging, and artifact upload.
    • Job submission support to run k-fold splits.
  • Configuration

    • New k-fold split configurations (generic and 5-fold preset).
    • Added dataset config field for filtering tiles by run ID.
  • Chores

    • Added runtime dependencies: pyarrow, datasets.

@vojtech-cifka vojtech-cifka requested review from a team, JakubPekar and ejdam87 April 24, 2026 21:05
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds k‑fold stratified split: new dataset MLflow artifact reference (filter_tiles_run_id), k‑fold configs and experiment preset, duplicate pyproject dependency lines, a Hydra entrypoint that derives labels/tissue proportions from roi_coverage_*, collapses rare classes, assigns StratifiedKFold folds, logs metrics/tables to MLflow, and uploads augmented parquet.

Changes

K-fold split feature

Layer / File(s) Summary
Data Shape / Config Fields
configs/data/dataset.yaml
Adds dataset.mlflow_artifacts.filter_tiles_run_id pointing to a specific MLflow run ID for tile artifacts.
Experiment Preset
configs/experiment/split/kfold_split_5_folds.yaml
New preset that composes dataset defaults and sets n_folds: 5.
Hydra / Split Parameters
configs/split/kfold_split.yaml
New split config declaring n_folds (unresolved), random_state: 42, MLflow artifact path, run metadata, and templated hyperparams.
Runtime Dependencies
pyproject.toml
Adds pyarrow>=19.0.0 and datasets>=3.0.0 entries to project dependencies (coexisting with existing pyarrow>=19.0.1 and datasets>=4.0.0).
Core Implementation
split/kfold_split.py
New Hydra entrypoint and functions: derive_labels (computes label, tissue_prop, slide_id from roi_coverage_*), build_stratification_labels (collapses classes with count < n_folds into background and validates), assign_folds (StratifiedKFold assignment), log_fold_statistics (per-fold metrics and two MLflow label-distribution tables), and main (download parquet from MLflow, compute and append label/tissue_prop/fold, write/upload augmented parquet).
Execution / Submission
scripts/submit_kfold_split.py
New submission script registering a Kubernetes job that clones the repo and runs the split entrypoint with composed Hydra config.

Sequence Diagram(s)

sequenceDiagram
    participant Config as Hydra Config
    participant MLflow as MLflow Artifacts
    participant Data as Parquet Loader
    participant Split as K-fold Processor
    participant Storage as Artifact Upload

    Config->>MLflow: resolve `filter_tiles_run_id` & artifact path
    MLflow->>Data: download train tiles parquet
    Data-->>Split: provide rows with `roi_coverage_*` and `slide_id`
    Split->>Split: derive `label`, `tissue_prop`, `slide_id`
    Split->>Split: collapse rare classes and build stratification labels
    Split->>Split: assign folds via StratifiedKFold
    Split->>MLflow: log per-fold metrics and label distribution tables
    Split->>Storage: write augmented parquet and upload to MLflow artifact path
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • vejtek
  • JakubPekar
  • ejdam87

Poem

🐰 I hopped through tiles both near and far,
Peered at ROI covers, named each tiny star,
Folded the rare into background's warm veil,
Split into folds and logged every tale,
A rabbit's small hop — MLflow saved the trail.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: implementing k-fold cross-validation splitting functionality for training data.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/implement-k-fold-train-split

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a stratified k-fold splitting utility for tissue classification datasets, including new configuration files, dependency updates, and a Kubernetes job submission script. The core logic in split/kfold_split.py handles label derivation, rare label management, and MLFlow logging. Review feedback identifies potential crashes due to missing ROI columns or placeholder values in the submission script and suggests a more efficient approach for adding multiple columns to the dataset to avoid unnecessary data copying.

Comment thread split/kfold_split.py
Comment thread scripts/submit_kfold_split.py
Comment thread scripts/submit_kfold_split.py
Comment thread split/kfold_split.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
split/kfold_split.py (3)

48-53: Prefer warnings.warn (or the module logger) over print for this warning.

print bypasses MLflow’s captured output levels and can’t be filtered/suppressed by consumers. warnings.warn(..., stacklevel=2) integrates with Python’s warning machinery and will still appear in stdout/stderr by default.

🧹 Suggested refactor
+import warnings
@@
-    if len(rare) > 0:
-        print(
-            f"WARNING: {len(rare)} label(s) have fewer than {n_folds} tiles and will "
-            f"be collapsed into 'background' for stratification: "
-            + ", ".join(f"{cls}({counts[unique == cls][0]})" for cls in rare),
-            flush=True,
-        )
+    if len(rare) > 0:
+        details = ", ".join(f"{cls}({counts[unique == cls][0]})" for cls in rare)
+        warnings.warn(
+            f"{len(rare)} label(s) have fewer than {n_folds} tiles and will "
+            f"be collapsed into 'background' for stratification: {details}",
+            stacklevel=2,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@split/kfold_split.py` around lines 48 - 53, Replace the print warning in the
k-fold stratification code with the Python warnings system: import warnings at
top of split/kfold_split.py and change the print(...) call that references rare,
n_folds, counts, and unique into warnings.warn(<same formatted message>,
stacklevel=2) (optionally specifying category=UserWarning) so the message is
emitted via the warnings machinery and can be filtered/suppressed by consumers.

68-110: Minor: stats block has some small ergonomic/robustness gaps.

  • Lines 99–110 recompute mask, n_val, tp_mean/std already derived in the metric loop above; consider iterating once and both logging + printing in the same pass to avoid drift.
  • tissue_props[mask].mean()/std() will emit a numpy RuntimeWarning and yield nan if any fold ends up empty. Stratified k-fold on the collapsed labels should normally prevent this, but a small guard (if mask.any():) would keep the output clean under degenerate inputs.
  • print(...) here bypasses the MLflow logger; using logger (already passed into main) or the standard logging module would keep server logs consistent.

None of these are blockers — happy to leave as-is if you prefer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@split/kfold_split.py` around lines 68 - 110, The stats block in
log_fold_statistics recomputes masks and may produce nan/warnings for empty
folds and uses print instead of the app logger; modify
log_fold_statistics(labels, tissue_props, slide_ids, folds, n_folds) to perform
a single loop over fold in range(n_folds) where you compute mask once, guard
with if mask.any(): before calling tissue_props[mask].mean()/std() to avoid
RuntimeWarning/nan for empty folds, call mlflow.log_metric and accumulate rows
for the label distribution table in that same loop (so you don't recompute), and
replace print(...) with logger.info(...) (either accept a logger parameter on
log_fold_statistics or call logging.getLogger(__name__) at top) to keep outputs
in the centralized logs while preserving the mlflow.log_table call for
label_distribution.

117-143: Clarify artifact upload strategy for TemporaryDirectory contents.

mlflow_artifact_path: kfold_split is the same for every run, but logger.log_artifacts(local_dir=output_dir, ...) uploads all files in output_dir. Currently only kfold_tiles.parquet exists, so no issue. However, any future side files dropped in output_dir (e.g., debug dumps) will silently leak into the artifact store. Consider using log_artifact() (single-file upload) on out_path instead of log_artifacts() on the directory, or keep TemporaryDirectory minimal as it is now to prevent unintended artifact pollution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@split/kfold_split.py` around lines 117 - 143, The current code uses
TemporaryDirectory and calls logger.log_artifacts(local_dir=output_dir,
artifact_path=config.mlflow_artifact_path) which will upload every file in
output_dir and may inadvertently include future debug files; change this to
upload only the single parquet file by using logger.log_artifact(out_path,
artifact_path=config.mlflow_artifact_path) (or otherwise pass the explicit file
path) instead of logger.log_artifacts, keeping the TemporaryDirectory usage and
the out_path variable as the single source of truth for what gets uploaded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/submit_kfold_split.py`:
- Around line 6-15: The submitted job spec contains unfilled placeholders:
change username=... in the submit_job call to a real string (e.g., your k8s
username) instead of the Ellipsis object, and replace the "+experiment=..."
token inside the script list (the uv run command) with a valid Hydra experiment
name such as "+experiment=split/kfold_split_5_folds"; alternatively, if this
file is meant as a template, add a top-of-file comment clearly stating the
fields that must be filled and provide sensible default example values for the
username and experiment placeholders so the script is runnable.

In `@split/kfold_split.py`:
- Around line 113-116: Hydra will fail because `@hydra.main`(config_name="split")
expects a primary config file; create a minimal configs/split.yaml (referenced
by the decorator on main in kfold_split.py) containing either "defaults: []" to
allow group-only composition or "defaults: [split: kfold_split]" to set the
kfold_split file as the default; add that file to the repo and then run the
entrypoint (main) with --cfg job to verify startup.
- Around line 15-56: The label array in derive_labels can get a fixed-width
Unicode dtype that silently truncates longer strings like "background"; change
the return so the labels array is created with an object dtype (e.g., use
np.array(label_ds["label"], dtype=object) or .astype(object)) to avoid
truncation, leaving tissue_prop and slide_id as before; update the np.array(...)
call for labels in derive_labels to force object dtype so collapse_rare_labels
can safely assign "background".

---

Nitpick comments:
In `@split/kfold_split.py`:
- Around line 48-53: Replace the print warning in the k-fold stratification code
with the Python warnings system: import warnings at top of split/kfold_split.py
and change the print(...) call that references rare, n_folds, counts, and unique
into warnings.warn(<same formatted message>, stacklevel=2) (optionally
specifying category=UserWarning) so the message is emitted via the warnings
machinery and can be filtered/suppressed by consumers.
- Around line 68-110: The stats block in log_fold_statistics recomputes masks
and may produce nan/warnings for empty folds and uses print instead of the app
logger; modify log_fold_statistics(labels, tissue_props, slide_ids, folds,
n_folds) to perform a single loop over fold in range(n_folds) where you compute
mask once, guard with if mask.any(): before calling
tissue_props[mask].mean()/std() to avoid RuntimeWarning/nan for empty folds,
call mlflow.log_metric and accumulate rows for the label distribution table in
that same loop (so you don't recompute), and replace print(...) with
logger.info(...) (either accept a logger parameter on log_fold_statistics or
call logging.getLogger(__name__) at top) to keep outputs in the centralized logs
while preserving the mlflow.log_table call for label_distribution.
- Around line 117-143: The current code uses TemporaryDirectory and calls
logger.log_artifacts(local_dir=output_dir,
artifact_path=config.mlflow_artifact_path) which will upload every file in
output_dir and may inadvertently include future debug files; change this to
upload only the single parquet file by using logger.log_artifact(out_path,
artifact_path=config.mlflow_artifact_path) (or otherwise pass the explicit file
path) instead of logger.log_artifacts, keeping the TemporaryDirectory usage and
the out_path variable as the single source of truth for what gets uploaded.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f7cc549-2102-44be-8a44-b5187aeb419f

📥 Commits

Reviewing files that changed from the base of the PR and between f624f23 and 67771d9.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • configs/data/dataset.yaml
  • configs/experiment/split/kfold_split_5_folds.yaml
  • configs/split/kfold_split.yaml
  • pyproject.toml
  • scripts/submit_kfold_split.py
  • split/kfold_split.py

Comment thread scripts/submit_kfold_split.py
Comment thread split/kfold_split.py Outdated
Comment thread split/kfold_split.py
@vojtech-cifka vojtech-cifka self-assigned this Apr 25, 2026
@vojtech-cifka vojtech-cifka requested review from vejtek and removed request for JakubPekar April 29, 2026 12:05
vejtek
vejtek previously approved these changes Apr 29, 2026
Comment thread split/kfold_split.py Outdated
Previously, rare classes (fewer than n_folds tiles) were collapsed into
'background' in-place, and the mutated labels flowed all the way into
the stored 'label' column — losing the original semantic class for
those samples.

Split the array in two: original 'labels' stays untouched and is what
gets persisted and reported, while 'stratification_labels' (a copy with
rare classes merged) is used solely as the StratifiedKFold target. Log
both distributions to MLflow for visibility.
@vojtech-cifka vojtech-cifka requested review from ejdam87 and vejtek May 4, 2026 18:51
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
split/kfold_split.py (1)

125-127: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

configs/split.yaml primary config is still missing — runtime will fail with MissingConfigException.

@hydra.main(config_name="split") requires a file at configs/split.yaml to exist. The +split=kfold_split override injected by @with_cli_args is a config-group append that Hydra processes after locating and parsing the primary config; it cannot substitute for the missing primary file.

Create a minimal configs/split.yaml:

# `@package` _global_

defaults:
  - _self_

Then select the group via CLI (+split=kfold_split) or set a default:

defaults:
  - split: kfold_split
  - _self_

Verify startup with python -m split.kfold_split --cfg job.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@split/kfold_split.py` around lines 125 - 127, The Hydra runtime fails because
`@hydra.main`(config_name="split") requires a primary configs/split.yaml which is
missing; add a minimal configs/split.yaml (with an empty defaults or a defaults
list selecting split: kfold_split) so Hydra can load the primary config before
group overrides applied by with_cli_args(defaults=["+split=kfold_split"]);
update the defaults block to either just include _self_ or to set split:
kfold_split as the default, then verify startup by running the module (e.g.,
python -m split.kfold_split --cfg job).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@split/kfold_split.py`:
- Around line 47-58: build_stratification_labels can return a strat array where
the collapsed "background" group still has fewer than n_folds members, causing a
cryptic StratifiedKFold error in assign_folds; after computing rare and updating
strat (and before returning), compute the merged background count (sum of counts
for classes in rare plus any existing "background" count), and if that merged
count < n_folds raise a clear ValueError (or custom exception) that lists
n_folds and the affected classes/counts so callers (and assign_folds) get a
meaningful diagnostic instead of the sklearn message; reference the variables
unique, counts, rare, strat and the function build_stratification_labels when
implementing this validation.

---

Duplicate comments:
In `@split/kfold_split.py`:
- Around line 125-127: The Hydra runtime fails because
`@hydra.main`(config_name="split") requires a primary configs/split.yaml which is
missing; add a minimal configs/split.yaml (with an empty defaults or a defaults
list selecting split: kfold_split) so Hydra can load the primary config before
group overrides applied by with_cli_args(defaults=["+split=kfold_split"]);
update the defaults block to either just include _self_ or to set split:
kfold_split as the default, then verify startup by running the module (e.g.,
python -m split.kfold_split --cfg job).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9cec7524-55af-4cbd-8d90-f6d30163b193

📥 Commits

Reviewing files that changed from the base of the PR and between 67771d9 and f60295c.

📒 Files selected for processing (1)
  • split/kfold_split.py

Comment thread split/kfold_split.py
After collapsing rare classes into 'background', the merged group can
still have fewer than n_folds tiles, which would surface as a cryptic
sklearn error from StratifiedKFold. Raise a ValueError with the
n_folds value and the list of collapsed classes so the failure is
actionable.
…ists

Dataset[col] materialized 80M strings as a Python list before np.array
walked it — multi-GB allocation, no progress, easy to mistake for a
hang. Pull the columns straight from the underlying pyarrow.Table
which is near-zero-copy for fixed-width and chunked-fast for strings.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
split/kfold_split.py (1)

137-140: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

configs/split.yaml primary config is still missing — Hydra will fail at startup.

@hydra.main(config_path="../configs", config_name="split", version_base=None) requires a configs/split.yaml file to exist as the composition entry point. This was flagged in a prior review and remains unresolved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@split/kfold_split.py` around lines 137 - 140, Hydra is configured in the
kfold_split module via the `@hydra.main` decorator on main (decorator:
`@hydra.main`(config_path="../configs", config_name="split", version_base=None))
but the required composition entry file configs/split.yaml is missing; add a
top-level configs/split.yaml (the Hydra primary config) that composes or
includes the split config values expected by main (or alternatively change the
decorator to point to an existing config_name/path), ensuring the config keys
referenced by main and downstream code match the new split.yaml entries so Hydra
can start successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@split/kfold_split.py`:
- Around line 58-70: The guard that checks background_count < n_folds is firing
even when no classes were collapsed; move the computation and check for
background_count inside the existing if len(rare) > 0 block so it only runs when
collapse happened (i.e., after strat[np.isin(strat, rare)] = "background").
Specifically, in the function that contains strat and rare (e.g.,
build_stratification_labels), stop computing background_count and raising the
ValueError unconditionally—compute background_count = int((strat ==
"background").sum()) and run the if background_count < n_folds: raise
ValueError(...) only inside the if len(rare) > 0: branch, and remove the
unconditional check after that branch.

---

Duplicate comments:
In `@split/kfold_split.py`:
- Around line 137-140: Hydra is configured in the kfold_split module via the
`@hydra.main` decorator on main (decorator: `@hydra.main`(config_path="../configs",
config_name="split", version_base=None)) but the required composition entry file
configs/split.yaml is missing; add a top-level configs/split.yaml (the Hydra
primary config) that composes or includes the split config values expected by
main (or alternatively change the decorator to point to an existing
config_name/path), ensuring the config keys referenced by main and downstream
code match the new split.yaml entries so Hydra can start successfully.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8fe5f47b-66be-4c5b-a465-07d8c946f542

📥 Commits

Reviewing files that changed from the base of the PR and between f60295c and 72ad0d7.

📒 Files selected for processing (1)
  • split/kfold_split.py

Comment thread split/kfold_split.py
vojtech-cifka and others added 4 commits May 4, 2026 22:01
The background-count guard was running unconditionally, so an input
with no rare classes and no pre-existing 'background' class would
raise a spurious ValueError despite being valid for StratifiedKFold.
Move the check inside the 'len(rare) > 0' branch — it only matters
after a collapse actually merged classes into 'background'.
Resolved conflict in configs/data/dataset.yaml by keeping both additions:
train_tiles_filename from this branch and tissue_masks_run_id from master.
Regenerated uv.lock.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
split/kfold_split.py (1)

137-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

configs/split.yaml primary config appears to be missing — Hydra will fail at startup.

@hydra.main(config_name="split") requires a primary config file at configs/split.yaml. Hydra does not support group-only composition without a primary config file; the +split=kfold_split CLI override alone cannot satisfy this. Only configs/split/kfold_split.yaml is visible in this PR — no configs/split.yaml.

#!/bin/bash
# Description: Check if configs/split.yaml exists and look at how other
# Hydra entrypoints in the repo pair their config_name with a primary config file.

echo "=== Check for configs/split.yaml ==="
fd -t f 'split.ya?ml$' configs/

echo ""
echo "=== All YAML files under configs/ top level (primary config candidates) ==="
fd -t f -d 1 '\.ya?ml$' configs/

echo ""
echo "=== `@hydra.main` declarations and their config_name values ==="
rg -nP --type=py '@hydra\.main\s*\(' -A2
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@split/kfold_split.py` around lines 137 - 138, Hydra is failing because
`@hydra.main`(config_name="split", ...) in split/kfold_split.py expects a primary
config file configs/split.yaml which is missing; fix by either (A) adding a
primary config file configs/split.yaml that composes the group entry (e.g.,
defaults: ["split: kfold_split"] or defaults: ["+split=kfold_split"]) so the
group-only override works, or (B) change the decorator in split/kfold_split.py
to point directly at the existing config (e.g., set config_name to the existing
file such as "split/kfold_split" or the exact primary config name you have),
ensuring the config_name used by `@hydra.main` matches an actual YAML file under
configs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@split/kfold_split.py`:
- Around line 137-138: Hydra is failing because `@hydra.main`(config_name="split",
...) in split/kfold_split.py expects a primary config file configs/split.yaml
which is missing; fix by either (A) adding a primary config file
configs/split.yaml that composes the group entry (e.g., defaults: ["split:
kfold_split"] or defaults: ["+split=kfold_split"]) so the group-only override
works, or (B) change the decorator in split/kfold_split.py to point directly at
the existing config (e.g., set config_name to the existing file such as
"split/kfold_split" or the exact primary config name you have), ensuring the
config_name used by `@hydra.main` matches an actual YAML file under configs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6e20c773-ecac-4c5b-9c3c-8416bc2173b6

📥 Commits

Reviewing files that changed from the base of the PR and between 506b33e and 64b2000.

📒 Files selected for processing (2)
  • configs/data/dataset.yaml
  • split/kfold_split.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants