
feat: tile embeddings pipeline with lazy parquet loading#7

Open
vojtech-cifka wants to merge 49 commits into master from feature/embeddings

Conversation

@vojtech-cifka
Collaborator

@vojtech-cifka vojtech-cifka commented Apr 30, 2026

Summary

  • Implement preprocessing/embeddings.py to compute tile embeddings via the RationAI embed service using Ray Data, reading from filter_tiles output
    (annotation- and tissue-filtered tiles) rather than raw tiling output
  • Stream tiles lazily from parquet with override_num_blocks derived from parquet metadata; join slide metadata per row, read pixels via
    read_slide_tiles, and dispatch async embed calls through an actor pool
  • Add inline zero-coverage filtering before embedding and retry logic on embed failure with guaranteed tile data release
  • Update preprocessing/tile_masks.py to generate masks from filtered tiles for both train and test splits (previously only train, from raw tiling
    output)
  • Add scripts/submit_embeddings.py for launching the job on the cluster (CPU-only, with shm for Ray's object store)
  • Add configs/preprocessing/embeddings.yaml and configs/experiment/preprocessing/embeddings_virchow2_05mpp.yaml
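The block-count derivation mentioned above (override_num_blocks from parquet metadata) reduces to a ceiling division; a minimal sketch with hypothetical numbers (in the pipeline the row count would come from the parquet footer, e.g. via pyarrow.parquet.ParquetFile, not be hardcoded):

```python
import math

# Hypothetical values; the real row count is read from parquet metadata.
num_rows = 80_000_000  # total filtered tiles (scale per PR discussion)
block_size = 4096      # target rows per Ray Data block (illustrative)

# One block per block_size rows, rounded up so the tail rows get a block.
override_num_blocks = math.ceil(num_rows / block_size)
```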

Summary by CodeRabbit

  • New Features

    • Embeddings preprocessing pipeline with Virchow2 model support, configurable runtime parameters, and automated output/artifact handling
    • Command-line job submission script to launch embedding jobs
  • Chores

    • Added MLflow artifact tracking field for dataset tiling runs
    • New preprocessing and experiment configuration files
    • Updated project dependencies to include PyTorch, torchvision, timm, and einops

vojtech-cifka and others added 22 commits April 22, 2026 23:17
Adds preprocessing/embeddings.py with Virchow2 model wrapper and
per-slide TileDataset that reads tiles from WSIs via OpenSlide. Tiles
and slide metadata are fetched from the tiling MLflow run; embeddings
are saved as per-slide parquet files and logged to MLflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add base image, HF_TOKEN export, --frozen sync, and PROJECTS storage mount.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consistent with how all other preprocessing run IDs are stored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vojtech-cifka vojtech-cifka requested a review from vejtek April 30, 2026 20:40
@vojtech-cifka vojtech-cifka self-assigned this Apr 30, 2026
@vojtech-cifka vojtech-cifka requested review from a team and ejdam87 April 30, 2026 20:40
@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds an embeddings preprocessing pipeline: new configs, a distributed embedding implementation using Ray and an async model client, MLflow artifact wiring, a job submission script, and PyTorch-related dependencies. (50 words)

Changes

Embeddings Preprocessing Pipeline

  • Data Shape / Config (configs/data/dataset.yaml):
    Adds dataset.mlflow_artifacts.tiling_run_id to reference tiling run artifacts.
  • Config Surface (configs/preprocessing/embeddings.yaml):
    New embeddings preprocessing config: model (required), output_dir, concurrency, block_size, rows_per_file, and metadata.hyperparams referencing model, concurrency, block_size.
  • Experiment Composition (configs/experiment/preprocessing/embeddings_virchow2_05mpp.yaml):
    New Hydra experiment config composing /data: dataset, declaring package _global_, and setting model: virchow2.
  • Core Implementation (preprocessing/embeddings.py):
    Adds an EmbedTiles async callable (initializes AsyncClient, embeds row["tile"] via models.embed_image, and replaces the tile with its flattened embedding) and a Hydra main that, for each split (train, test), downloads MLflow artifacts, reads slides.parquet, builds and joins a Ray dataset from tiles.parquet, computes the per-row tile via read_slide_tiles, maps EmbedTiles with an ActorPoolStrategy using config.concurrency, writes <output_dir>/<split>/slides.parquet and <output_dir>/<split>/tiles/, and logs artifacts to MLflow. Also adds a script entry that initializes Ray and calls main.
  • Runtime / Submission (scripts/submit_embeddings.py):
    New script that submits an embeddings preprocessing job via submit_job, cloning the repository, syncing dependencies with uv, and running preprocessing.embeddings with the experiment override and storage bucket settings.
  • Dependencies / Packaging (pyproject.toml):
    Adds torch>=2.0.0, torchvision>=0.15.0, timm>=1.0.0, einops>=0.8.0 to project dependencies and updates the rationai-sdk source URL to the GitHub repository.
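The per-row slide-metadata join described in the walkthrough amounts to a dict merge keyed on slide_id; a minimal stand-in with hypothetical field names (the real columns come from slides.parquet):

```python
# Hypothetical slide-level metadata, keyed by slide_id.
slide_info = {"slide_001": {"mpp": 0.5, "slide_path": "/data/slide_001.tiff"}}

def join_slide_meta(row: dict, slide_info: dict) -> dict:
    # Merge slide-level metadata into a tile row; assumes every
    # slide_id in tiles.parquet also appears in slides.parquet.
    return {**row, **slide_info[row["slide_id"]]}

row = join_slide_meta({"slide_id": "slide_001", "x": 128, "y": 256}, slide_info)
```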

Sequence Diagram

sequenceDiagram
    participant Runner as Runner
    participant MLflow as MLflow
    participant Storage as Storage
    participant Ray as Ray
    participant Model as AsyncClient

    Runner->>MLflow: download "<split>_split" artifacts
    Runner->>Storage: read slides.parquet and tiles.parquet
    Runner->>Ray: build dataset from tiles.parquet and join slide metadata
    Runner->>Ray: repartition by block_size and map EmbedTiles (ActorPoolStrategy)
    Ray->>Model: embed_image(tile) [concurrent actors]
    Model-->>Ray: return flattened embedding
    Ray->>Storage: write slides.parquet and tiles/<embeddings>
    Runner->>MLflow: log split directory as artifact

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nibble code and hop through lines,

tiles turn to vectors in parallel signs.
Ray hums softly, MLflow keeps score,
embeddings sprout— I bound, encore! 🎈

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check ✅ Passed: The title 'feat: tile embeddings pipeline with lazy parquet loading' accurately captures the main change: a new embeddings preprocessing pipeline that uses lazy parquet loading via Ray Data.
  • Linked Issues check ✅ Passed: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Skipped because no linked issues were found for this pull request.
  • Description check ✅ Passed: Skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a tile embedding preprocessing pipeline, featuring Hydra configurations for model selection and a Ray-based script for distributed tile processing. It also includes a job submission script and updates the project's deep learning dependencies. Review feedback identifies critical placeholder values in the submission script that must be replaced to avoid runtime errors and suggests relocating Ray initialization into the main function to improve modularity and testability.

Comment thread scripts/submit_embeddings.py
Comment thread scripts/submit_embeddings.py
Comment thread preprocessing/embeddings.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
preprocessing/embeddings.py (2)

29-37: 🏗️ Heavy lift

Consider adding error handling for API failures.

The __call__ method has no error handling. Given the scale (~80M tiles per PR notes) and the 200s timeout, transient network failures or service errors are likely. A single unhandled exception will cause the Ray task to fail and potentially retry the entire block.

Consider adding retry logic with exponential backoff for transient errors:

♻️ Suggested retry wrapper
+import asyncio
+from httpx import HTTPStatusError, TimeoutException
+
 class EmbedTiles:
     def __init__(self, model: str, concurrency: int) -> None:
         self.model = model
         self.client = AsyncClient(
             limits=httpx.Limits(
                 max_connections=concurrency, max_keepalive_connections=concurrency
             ),
             timeout=200,
         )
+        self.max_retries = 3

     async def __call__(self, row: dict[str, Any]) -> dict[str, Any]:
-        embedding = (
-            (await self.client.models.embed_image(self.model, row["tile"]))
-            .reshape(-1)
-            .tolist()
-        )
+        for attempt in range(self.max_retries):
+            try:
+                embedding = (
+                    (await self.client.models.embed_image(self.model, row["tile"]))
+                    .reshape(-1)
+                    .tolist()
+                )
+                break
+            except (TimeoutException, HTTPStatusError) as e:
+                if attempt == self.max_retries - 1:
+                    raise
+                await asyncio.sleep(2 ** attempt)
         del row["tile"]
         row["embedding"] = embedding
         return row
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/embeddings.py` around lines 29 - 37, The __call__ method needs
retry/error handling around the call to
self.client.models.embed_image(self.model, row["tile"]) to avoid failing the Ray
task on transient API/network errors; wrap that call in a retry loop with
exponential backoff (e.g., max_attempts, base_delay doubling each retry), catch
transient exceptions (network errors, timeouts, 5xx responses), log each retry
attempt, and re-raise only after max attempts; also only delete row["tile"] and
assign row["embedding"] after a successful embed to avoid losing data if all
retries fail.

64-67: 💤 Low value

Potential KeyError if tile references unknown slide.

The lambda assumes every row["slide_id"] exists in slide_info. If a tile record references a slide not present in slides.parquet, this will raise a KeyError and fail the pipeline.

If data integrity is guaranteed upstream (tiling pipeline), this is fine. Otherwise, consider defensive handling:

lambda row, si: {**row, **si.get(row["slide_id"], {})}

Or validate at the start of processing that all slide IDs in tiles exist in slides.
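The early-validation alternative can be sketched as a set-difference check before mapping (function name and error message are illustrative, not from the PR):

```python
def validate_slide_ids(tile_slide_ids, slide_info) -> None:
    # Fail fast with a clear error instead of a KeyError mid-pipeline.
    missing = set(tile_slide_ids) - set(slide_info)
    if missing:
        raise ValueError(f"tiles reference unknown slides: {sorted(missing)}")

validate_slide_ids(["s1", "s2"], {"s1": {}, "s2": {}})  # passes silently
```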

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/embeddings.py` around lines 64 - 67, The map lambda that merges
tile rows with slide info will raise a KeyError if row["slide_id"] is missing
from slide_info; update the lambda used in the map call (the inline lambda that
references slide_info/si and row["slide_id"]) to defensively handle missing keys
(e.g., look up slide_info with a safe-get and merge an empty dict when absent)
or add a validation step before mapping that ensures all tile slide IDs exist in
slide_info and fail early with a clear error; modify the lambda in the map
invocation or add a precheck function to perform this validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/submit_embeddings.py`:
- Around line 4-18: The submit_job call uses placeholder values that break
execution: replace the Ellipsis passed to username in submit_job(...) with a
real username string or a clear placeholder like "<YOUR_USERNAME>", and update
the script list element that currently contains "+experiment=..." to either a
valid Hydra experiment path (e.g.,
"+experiment=preprocessing/embeddings_virchow2_05mpp") or a clear placeholder
like "+experiment=<EXPERIMENT_NAME>"; ensure these changes are applied to the
submit_job invocation and/or add a brief inline comment next to username and the
+experiment entry to indicate they must be filled before running.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec9a215b-78b2-4847-bb29-2e83db6712d0

📥 Commits

Reviewing files that changed from the base of the PR and between b4ed482 and 0bf7ff5.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • configs/data/dataset.yaml
  • configs/experiment/preprocessing/embeddings_05mpp.yaml
  • configs/experiment/preprocessing/embeddings_virchow2_05mpp.yaml
  • configs/preprocessing/embeddings.yaml
  • preprocessing/embeddings.py
  • pyproject.toml
  • scripts/submit_embeddings.py

Comment thread scripts/submit_embeddings.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
preprocessing/embeddings.py (1)

86-91: 💤 Low value

Consider making actor pool parameters configurable.

max_size=4 and max_tasks_in_flight_per_actor=8 are hardcoded while config.concurrency (512) controls the HTTP connection pool. This creates a mismatch that could be confusing for tuning. The effective parallelism is: 4 actors × 8 tasks × async concurrency.

Consider exposing max_size and max_tasks_in_flight_per_actor in the config for easier tuning without code changes.
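Exposing those knobs could look like the following addition to configs/preprocessing/embeddings.yaml (field names follow the review suggestion, not the merged config):

```yaml
# Hypothetical additions; names follow the review suggestion.
actor_pool_max_size: 4                        # Ray actors in the pool
actor_pool_max_tasks_in_flight_per_actor: 8   # queued tasks per actor
concurrency: 512                              # async HTTP connections per actor
```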

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/embeddings.py` around lines 86 - 91, The ActorPoolStrategy
hardcodes max_size=4 and max_tasks_in_flight_per_actor=8 which mismatches
config.concurrency; update the config object to add two new fields (e.g.,
actor_pool_max_size with default 4 and actor_pool_max_tasks_in_flight_per_actor
with default 8) and replace the literals in the call to
ray.data.ActorPoolStrategy inside embeddings.py (the
compute=ray.data.ActorPoolStrategy(...) invocation) to use
config.actor_pool_max_size and config.actor_pool_max_tasks_in_flight_per_actor
so pool sizing is configurable and consistent with config.concurrency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@preprocessing/embeddings.py`:
- Around line 19-37: EmbedTiles currently leaks httpx connections and lacks
error handling: add explicit async cleanup and handle embed_image failures.
Implement an async close/shutdown method on the EmbedTiles class that calls
self.client.aclose() (and call it from actor teardown), and add a fallback
__del__ that schedules closing to avoid lingering pools if shutdown isn’t
invoked; reference the AsyncClient instance self.client and the class
EmbedTiles. Wrap the embed_image call inside async __call__ in a try/except (or
a small retry loop) to catch network/service errors, log the exception and
either retry a few times or return a deterministic error marker in the row
(e.g., set row["error"] or row["embedding"]=None) instead of letting the
exception crash the actor; reference the __call__ method and the embed_image
invocation. Ensure you remove row["tile"] only after successful embedding to
avoid losing data on failure.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c5d9d565-abb6-4546-8a58-a94ad917ef50

📥 Commits

Reviewing files that changed from the base of the PR and between 0bf7ff5 and ec40499.

📒 Files selected for processing (1)
  • preprocessing/embeddings.py

Comment thread preprocessing/embeddings.py
vojtech-cifka and others added 11 commits May 4, 2026 18:08
Resolved conflicts in configs/data/dataset.yaml and pyproject.toml by
keeping both sets of additions: tissue_masks_run_id from master and
torch/torchvision/timm/einops deps from this branch. Regenerated uv.lock.
Wraps the embed_image call in a retry loop (up to 3 attempts with
exponential backoff) so transient network errors don't cause Ray to
retry the entire block. Re-raises after all attempts are exhausted so
failures stay visible. Moves del row["tile"] into a finally block to
free tile pixel data promptly even when an exception occurs.
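The schedule this commit describes (up to 3 attempts, exponentially growing delay) follows the usual base * 2**attempt pattern; a stand-alone sketch with illustrative constants:

```python
def backoff_delays(max_retries: int = 3, base: float = 1.0) -> list[float]:
    # Delay before each retry; no sleep follows the final attempt,
    # whose exception is re-raised to keep failures visible.
    return [base * 2**attempt for attempt in range(max_retries - 1)]
```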
Skip tiles with no annotation coverage and no tissue coverage before
feeding them into the Ray pipeline, using PyArrow predicate pushdown
to avoid materialising the full 80M-row dataset in memory.

Tissue stats run ID stored in dataset.yaml; referenced from the
embeddings config via tile_filters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pc.or_ is an array kernel; expression combination uses the | operator.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tering

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… splits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread configs/preprocessing/embeddings.yaml Outdated
Comment thread preprocessing/embeddings.py Outdated
Comment thread preprocessing/embeddings.py
Comment thread preprocessing/embeddings.py Outdated
Comment thread preprocessing/embeddings.py
Comment thread preprocessing/embeddings.py Outdated
vojtech-cifka and others added 16 commits May 6, 2026 22:27
Ray actor logs and progress bars provide cold-start visibility;
the bespoke first_row_logged flag was a tuning aid no longer needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces hand-rolled retry loop in EmbedTiles with a tenacity-decorated
helper. Retries are now scoped to httpx.HTTPError (network/timeout/status)
so programming bugs surface immediately instead of being retried 3x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ray Data progress bars surface rows/sec and in-flight counts, so the
periodic counter log and the latency/in_flight bookkeeping it required
are redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reading three projected columns doesn't need 8 GB; default scheduling
lets Ray pack more readers per node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The @Retry decorator at class scope captures an AsyncRetrying instance
whose internal threading.local() makes EmbedTiles unpicklable, which
breaks Ray Data's actor serialization. Constructing the retryer in
__init__ moves that state onto the worker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
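The serialization failure described here is easy to reproduce with the standard library alone: any object graph that captures a threading.local (as tenacity's AsyncRetrying does internally) cannot be pickled for shipping to a Ray worker.

```python
import pickle
import threading

# _thread._local objects cannot be pickled, so an actor class that
# captures one at class scope cannot be serialized to a worker.
try:
    pickle.dumps(threading.local())
    picklable = True
except TypeError:
    picklable = False
```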
Without the 8 GB memory reservation on read_parquet, downstream stages
stall and no embeddings are produced. Restoring until we understand the
scheduling interaction.

This reverts commit 287408f.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ors"

Tenacity-based retries broke embedding production under high actor
concurrency (likely shared AsyncRetrying state and narrowed exception
filter dropping retryable non-httpx errors). Restoring the manual
loop, which was last known to work.

This reverts commit 82163ac.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restore the last-known-working version of EmbedTiles after several
review-driven refactors broke embedding production.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three-column projection doesn't justify 8 GB per task; let Ray's
default scheduling pack readers per node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Build AsyncRetrying in __init__ so the actor stays picklable
(threading.local inside tenacity can't cross the wire if captured at
class scope).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@vojtech-cifka vojtech-cifka requested a review from matejpekar May 7, 2026 10:10