feat(learning): teleop + dataset collection + dataprep pipeline by ruthwikdasyam · Pull Request #2446 · dimensionalOS/dimos

ruthwikdasyam · 2026-06-10T02:02:40Z

Problem

We had no path from a teleop session to a trainable dataset — robot streams were recorded ad hoc and there was no standard, format-agnostic way to segment episodes and export them for policy learning.

Closes DIM-XXX

Solution

Adds an end-to-end data-collection → dataset-prep pipeline:

Collection (dimos/learning/collection/): EpisodeMonitorModule turns Quest buttons/keyboard into start/save/discard episode events; CollectionRecorder captures the obs/action/status streams to a SQLite session DB. Two ready blueprints: learning-collect-quest-xarm7 and learning-collect-quest-piper.
DataPrep (dimos/learning/dataprep/): reads the session DB, extracts episodes (episode_status or explicit ranges), time-syncs obs/action streams onto a common timeline, and writes LeRobot v2 or HDF5 datasets with streaming per-feature stats + a dimos_meta.json sidecar. Pure core / impure build split for testability.
CLI: dimos dataprep build and dimos dataprep inspect.
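
The time-sync step described above can be illustrated with a small nearest-timestamp resampler. This is only a sketch under stated assumptions, not the code in dimos/learning/dataprep/: the function names `nearest_index` and `sync_to_timeline`, the `(timestamps, values)` stream layout, and the nearest-sample policy are all hypothetical choices made for illustration.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the sample in sorted `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def sync_to_timeline(streams, start, end, fps):
    """Resample every stream onto one fixed-rate timeline.

    `streams` maps a name to (sorted timestamps, values); both the layout
    and the nearest-sample policy are assumptions of this sketch.
    """
    n = round((end - start) * fps) + 1
    timeline = [start + k / fps for k in range(n)]
    rows = []
    for t in timeline:
        row = {"ts": t}
        for name, (ts, values) in streams.items():
            row[name] = values[nearest_index(ts, t)]
        rows.append(row)
    return rows


# Two streams sampled at slightly different times, aligned at 10 Hz.
streams = {
    "obs": ([0.00, 0.11, 0.19, 0.32], ["o0", "o1", "o2", "o3"]),
    "action": ([0.02, 0.12, 0.22, 0.31], ["a0", "a1", "a2", "a3"]),
}
rows = sync_to_timeline(streams, start=0.0, end=0.3, fps=10)
```

Each row then carries one observation and one action for the same tick, which is the shape a frame-based dataset writer typically expects.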

How to Test

# 1. Collect (drive the arm; A=start, B=save, X=discard)
dimos run learning-collect-quest-xarm7

# 2. Build a LeRobot dataset from the session
dimos dataprep build -c dimos/learning/dataprep/example_config.json

# 3. Inspect it
dimos dataprep inspect data/datasets/session

Contributor License Agreement

I have read and approved the CLA.

codecov · 2026-06-10T02:11:49Z

Codecov Report

❌ Patch coverage is 85.82677% with 162 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
dimos/learning/dataprep/build.py	32.30%	43 Missing and 1 partial ⚠️
dimos/learning/dataprep/formats/test_hdf5.py	15.38%	33 Missing ⚠️
dimos/learning/dataprep/core.py	86.66%	18 Missing and 6 partials ⚠️
dimos/learning/collection/episode_monitor.py	79.64%	18 Missing and 5 partials ⚠️
dimos/learning/dataprep/formats/lerobot.py	91.37%	12 Missing and 8 partials ⚠️
dimos/learning/dataprep/formats/_stats.py	91.30%	4 Missing and 2 partials ⚠️
dimos/learning/dataprep/formats/test_lerobot.py	96.33%	4 Missing ⚠️
dimos/robot/cli/dimos.py	60.00%	4 Missing ⚠️
dimos/learning/collection/blueprint.py	88.23%	1 Missing and 1 partial ⚠️
dimos/learning/collection/test_episode_monitor.py	98.78%	0 Missing and 1 partial ⚠️
... and 1 more

@@            Coverage Diff             @@
##             main    #2446      +/-   ##
==========================================
+ Coverage   70.81%   71.03%   +0.22%     
==========================================
  Files         862      874      +12     
  Lines       77475    78617    +1142     
  Branches     6882     7011     +129     
==========================================
+ Hits        54862    55849     +987     
- Misses      20818    20947     +129     
- Partials     1795     1821      +26

Flag	Coverage Δ
OS-ubuntu-24.04-arm	`63.06% <73.03%> (+0.13%)`	⬆️
OS-ubuntu-latest	`66.05% <85.59%> (+0.30%)`	⬆️
Py-3.10	`66.04% <85.59%> (+0.29%)`	⬆️
Py-3.11	`66.04% <85.59%> (+0.29%)`	⬆️
Py-3.12	`66.04% <85.59%> (+0.29%)`	⬆️
Py-3.13	`66.04% <85.59%> (+0.29%)`	⬆️
Py-3.14	`66.06% <85.59%> (+0.30%)`	⬆️
Py-3.14t	`66.04% <85.59%> (+0.29%)`	⬆️
SelfHosted-Large	`30.09% <24.28%> (-0.10%)`	⬇️
SelfHosted-Linux	`37.66% <26.34%> (-0.23%)`	⬇️
SelfHosted-macOS	`36.45% <26.34%> (-0.16%)`	⬇️


Files with missing lines	Coverage Δ
dimos/learning/collection/recorder.py	`100.00% <100.00%> (ø)`
dimos/learning/dataprep/formats/test_stats.py	`100.00% <100.00%> (ø)`
dimos/robot/all_blueprints.py	`100.00% <ø> (ø)`
dimos/robot/test_all_blueprints.py	`87.50% <ø> (ø)`
dimos/teleop/quest/quest_types.py	`56.41% <100.00%> (+5.76%)`	⬆️
dimos/learning/collection/test_episode_monitor.py	`98.78% <98.78%> (ø)`
dimos/learning/dataprep/test_core.py	`99.39% <99.39%> (ø)`
dimos/learning/collection/blueprint.py	`88.23% <88.23%> (ø)`
dimos/learning/dataprep/formats/test_lerobot.py	`96.33% <96.33%> (ø)`
dimos/robot/cli/dimos.py	`63.14% <60.00%> (-0.08%)`	⬇️
... and 6 more

... and 6 files with indirect coverage changes


mustafab0 · 2026-06-11T21:48:39Z

+    def _on_buttons(self, msg: Buttons) -> None:
+        """Rising-edge detect against `config.button_map`; advance state machine."""
+        ts = time.time()
+        for event_name, alias_or_attr in self.config.button_map.items():
+            attr = BUTTON_ALIASES.get(alias_or_attr, alias_or_attr)
+            try:
+                pressed = bool(getattr(msg, attr))
+            except AttributeError:
+                continue
+            prev = self._prev_bits.get(attr, False)
+            self._prev_bits[attr] = pressed
+            if pressed and not prev:  # rising edge
+                self._transition(event_name, ts)


Data collection has a concept of pre-roll and post-roll: we send a stream of 0s before activation (the rising edge, in this case) and again after deactivation.

This helps mark the exact start and stop points of an episode from the data alone.

0s work for Twist commands; for joint-position commands we would probably need to stream the current joint positions instead.
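
To make the pre-roll/post-roll idea concrete, here is a minimal offline sketch of what a padded action stream might look like. The helper `pad_with_rolls`, its parameters, and the `kind` strings are hypothetical; holding the first and last pose stands in for "stream the current joint positions" and is an assumption of this example, not something the PR implements.

```python
def pad_with_rolls(actions, kind, pre_roll=2, post_roll=2):
    """Pad an action sequence with neutral commands on both sides.

    kind="twist": the neutral command is all zeros (no motion).
    kind="joint_position": the neutral command holds the first/last pose,
    because zeros would command a move to the zero configuration.
    """
    if not actions:
        return []
    if kind == "twist":
        head = tail = [0.0] * len(actions[0])
    elif kind == "joint_position":
        head, tail = actions[0], actions[-1]
    else:
        raise ValueError(f"unknown action kind: {kind!r}")
    # Copy every row so the padded sequence shares no lists with the input.
    return (
        [list(head) for _ in range(pre_roll)]
        + [list(a) for a in actions]
        + [list(tail) for _ in range(post_roll)]
    )


twist = pad_with_rolls([[0.1, 0.0], [0.2, 0.0]], "twist", pre_roll=1, post_roll=2)
joints = pad_with_rolls([[1.0, 2.0], [1.5, 2.5]], "joint_position", pre_roll=1, post_roll=1)
```

With padding like this, the transition from neutral to non-neutral commands is visible in the data itself, independent of any button event.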

ruthwikdasyam · 2026-06-15

The _on_buttons method is a callback that reads button presses.

Regardless of what data we are collecting, it marks start and stop checkpoints in the data stream, which lets us trim the entire session into episodes.
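
As a rough illustration of trimming a session into episodes from start/stop checkpoints, the sketch below pairs each "start" with the next "save" and drops anything that was discarded. The `(timestamp, name)` event tuples and the functions `extract_episodes` and `trim` are assumptions for this example and do not reflect the actual session DB schema.

```python
def extract_episodes(events):
    """Turn time-ordered (ts, name) events into (start_ts, end_ts) ranges.

    Only episodes that end in "save" are kept. A repeated "start" restarts
    the episode; that policy is an assumption of this sketch.
    """
    episodes = []
    start_ts = None
    for ts, name in events:
        if name == "start":
            start_ts = ts
        elif name == "save" and start_ts is not None:
            episodes.append((start_ts, ts))
            start_ts = None
        elif name == "discard":
            start_ts = None
    return episodes


def trim(samples, episode):
    """Keep the (ts, value) samples that fall inside one episode range."""
    start_ts, end_ts = episode
    return [(ts, v) for ts, v in samples if start_ts <= ts <= end_ts]


events = [
    (1.0, "start"), (4.0, "save"),
    (5.0, "start"), (6.0, "discard"),
    (7.0, "start"), (9.0, "save"),
]
episodes = extract_episodes(events)
samples = [(t / 2, f"s{t}") for t in range(20)]  # one sample every 0.5 s
first = trim(samples, episodes[0])
```

The same ranges can be applied to every recorded stream, so the checkpoints do not depend on which data is being collected.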

mustafab0 · 2026-06-11T21:50:10Z

+    episodes_saved: int
+    episodes_discarded: int
+    current_episode_start_ts: float | None
+    last_event: Literal["start", "save", "discard", "init"] = "init"


A little confused here. Why is EpisodeStatus counting episodes_saved/episodes_discarded?

A single session would have multiple episodes inside it.

ruthwikdasyam · 2026-06-15

Yes. Once we start a blueprint, we collect multiple episodes. As we keep collecting, this message gives a live indication of how many episodes have been saved and how many discarded.
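
A small sketch of how session-level counters on a status message could evolve across several episodes follows. The field names mirror the diff above, but the default values, the frozen dataclass, and the `apply_event` helper are hypothetical and only meant to show the counting behavior being discussed.

```python
from dataclasses import dataclass, replace
from typing import Literal, Optional


@dataclass(frozen=True)
class EpisodeStatus:
    # Defaults are added here so the sketch can start from an empty status.
    episodes_saved: int = 0
    episodes_discarded: int = 0
    current_episode_start_ts: Optional[float] = None
    last_event: Literal["start", "save", "discard", "init"] = "init"


def apply_event(status, event, ts):
    """Return the status that follows `status` after one episode event."""
    if event not in ("start", "save", "discard"):
        raise ValueError(f"unknown event: {event!r}")
    if event == "start":
        return replace(status, current_episode_start_ts=ts, last_event="start")
    if status.current_episode_start_ts is None:
        return status  # save/discard while idle is ignored (assumed policy)
    if event == "save":
        return replace(
            status,
            episodes_saved=status.episodes_saved + 1,
            current_episode_start_ts=None,
            last_event="save",
        )
    return replace(
        status,
        episodes_discarded=status.episodes_discarded + 1,
        current_episode_start_ts=None,
        last_event="discard",
    )


status = EpisodeStatus()
for event, ts in [("start", 1.0), ("save", 2.0), ("start", 3.0),
                  ("discard", 4.0), ("save", 4.5), ("start", 5.0)]:
    status = apply_event(status, event, ts)
```

Because the counters live on the message, a subscriber that joins mid-session can read the running totals from the latest status alone.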


ruthwikdasyam added 18 commits April 27, 2026 14:22

initial commit: dataprep step spec

ab7f7fe

temp: learning spec files

7c9f535

temp spec update

a10dce1

learning pipeline spec

779d100

Merge branch 'dev' into ruthwik/learning/1

20fcb4b

tested commit slop

0cdbe55

Merge branch 'main' into ruthwik/learning/1

c28d90e

feat: xarm7 inference

353d0b4

Merge remote-tracking branch 'origin/main' into ruthwik/datacollection

3382826

remove training and inference codes

fd0c05a

Merge branch 'main' into ruthwik/datacollection

1630d6b

docs: remove readme

95fa239

feat: dataprep folder

475565b

feat: add recorder

38d434c

fix: ore-commit

b39ceaf

fix: episodeextractor default

6a48475

refactor: dimos dataprep subcommand with build and inspect

6fc0cff

fix: pre-commit fixes

45bebf6

mustafab0 reviewed Jun 11, 2026


ruthwikdasyam added 8 commits June 15, 2026 16:24

feat: live logs of episode status

6505837

fix: dataprep status_stream default, rgb→bgr, drop button recording

76f6719

fix: pre-commit checks

06b1c8a

feat: dataprep action-shift + collection status log, fixes

feb93c6

fix: db path + cam sim support

8b0da13

session_db file name with datetime

d422708

fix: episode toggle button

d10b955

fix: dataprep float32 + lerobot timestamp/stats fixes

66a31d6

ruthwikdasyam changed the title ~~Ruthwik/datacollection~~ feat(learning): teleop + dataset collection + dataprep pipeline Jun 18, 2026

ruthwikdasyam added 11 commits June 18, 2026 13:07

fix: greptile comments

b4e84ab

fix: dataprep fps sync + episode index/leak/lock fixes, monitor tests

621289a

fix: add blueprints to self hosted list

0b30ef1

fix: redundant transport descriptions

2e61377

feat: questaliases

da17bea

refactor: source-stamp EpisodeStatus.ts, drop redundant start_ts

d006d62

misc: todo for later

1a71102

writer and inspector format validate

8eca873

misc: simplification nearest check

97c9726

misc: comments instructions

28f85ab

fix: None retun for tests

c69d4dc

github-actions Bot added the ready-to-merge label (Required CI checks have passed on this PR) Jun 19, 2026

feat: lerobot v3.0

a1497dc

github-actions Bot removed the ready-to-merge label (Required CI checks have passed on this PR) Jun 19, 2026

[autofix.ci] apply automated fixes

6cf9c7e

greptile-apps Bot reviewed Jun 19, 2026


Comment thread dimos/learning/dataprep/formats/lerobot.py Outdated

github-actions Bot added the ready-to-merge label (Required CI checks have passed on this PR) Jun 19, 2026

fix: greptile issues

01c1a57

github-actions Bot added and removed the ready-to-merge label (Required CI checks have passed on this PR) Jun 19, 2026

fix: address greptile review — writer resource guard + per-episode task labels

d1f8916

github-actions Bot added and removed the ready-to-merge label (Required CI checks have passed on this PR) Jun 20, 2026

greptile-apps Bot reviewed Jun 20, 2026


Comment thread dimos/learning/dataprep/build.py Outdated

fix(dataprep): reject shared obs/action feature keys instead of silently dropping the obs

d542767

github-actions Bot removed the ready-to-merge label (Required CI checks have passed on this PR) Jun 20, 2026

Merge branch 'main' into ruthwik/datacollection

c9a8c05

github-actions Bot added the ready-to-merge label (Required CI checks have passed on this PR) Jun 20, 2026

test(learning): drop __new__ shell for mocker-patched construction; hoist inner imports

132ac71

github-actions Bot removed the ready-to-merge label (Required CI checks have passed on this PR) Jun 21, 2026

ruthwikdasyam wants to merge 51 commits into main from ruthwik/datacollection


2 participants
