feat(learning): teleop + dataset collection + dataprep pipeline #2446

Open
ruthwikdasyam wants to merge 51 commits into main from ruthwik/datacollection
@ruthwikdasyam ruthwikdasyam commented Jun 10, 2026

Contributor

Problem

We had no path from a teleop session to a trainable dataset — robot streams were recorded ad hoc and there was no standard, format-agnostic way to segment episodes and export them for policy learning.

Closes DIM-XXX

Solution

Adds an end-to-end data-collection → dataset-prep pipeline:

  • Collection (dimos/learning/collection/): EpisodeMonitorModule turns Quest buttons/keyboard into start/save/discard episode events; CollectionRecorder captures the obs/action/status streams to
    a SQLite session DB. Two ready blueprints: learning-collect-quest-xarm7 and learning-collect-quest-piper.
  • DataPrep (dimos/learning/dataprep/): reads the session DB, extracts episodes (episode_status or explicit ranges), time-syncs obs/action streams onto a common timeline, and writes LeRobot v2 or
    HDF5 datasets with streaming per-feature stats + a dimos_meta.json sidecar. Pure core / impure build split for testability.
  • CLI: dimos dataprep build and dimos dataprep inspect.
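
The time-sync and streaming-stats steps described above can be pictured with a minimal sketch. This is not the code in this PR: `sync_to_timeline` and `RunningStats` are hypothetical names, and the sketch assumes a zero-order hold (the latest sample at or before each tick) and Welford's online algorithm, neither of which the description confirms.

```python
from bisect import bisect_right
import math


def sync_to_timeline(timeline, stream_ts, stream_vals):
    """Resample one stream onto a common timeline (hypothetical helper).

    For each timeline tick, take the latest sample at or before it
    (zero-order hold). Ticks earlier than the first sample fall back
    to the first sample. `stream_ts` must be sorted ascending.
    """
    out = []
    for t in timeline:
        i = bisect_right(stream_ts, t) - 1
        out.append(stream_vals[max(i, 0)])
    return out


class RunningStats:
    """Welford's online mean/variance for a single scalar feature."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; 0.0 before any sample arrives.
        return math.sqrt(self._m2 / self.n) if self.n else 0.0


# Align an observation stream to a 10 Hz timeline, then accumulate stats
# one value at a time without holding the whole dataset in memory.
timeline = [0.0, 0.1, 0.2]
synced = sync_to_timeline(timeline, [0.0, 0.05, 0.19], [1.0, 2.0, 3.0])
stats = RunningStats()
for value in synced:
    stats.update(value)
```

A zero-order hold never looks ahead of the tick it is filling, which is one reason it is a common default when actions must not leak into earlier observations; interpolation or nearest-neighbour matching would be equally plausible choices here.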

How to Test

# 1. Collect (drive the arm; A=start, B=save, X=discard)
dimos run learning-collect-quest-xarm7

# 2. Build a LeRobot dataset from the session
dimos dataprep build -c dimos/learning/dataprep/example_config.json

# 3. Inspect it
dimos dataprep inspect data/datasets/session

Contributor License Agreement

  • I have read and approved the CLA.

codecov Bot commented Jun 10, 2026

Codecov Report

❌ Patch coverage is 85.82677% with 162 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dimos/learning/dataprep/build.py | 32.30% | 43 Missing and 1 partial ⚠️ |
| dimos/learning/dataprep/formats/test_hdf5.py | 15.38% | 33 Missing ⚠️ |
| dimos/learning/dataprep/core.py | 86.66% | 18 Missing and 6 partials ⚠️ |
| dimos/learning/collection/episode_monitor.py | 79.64% | 18 Missing and 5 partials ⚠️ |
| dimos/learning/dataprep/formats/lerobot.py | 91.37% | 12 Missing and 8 partials ⚠️ |
| dimos/learning/dataprep/formats/_stats.py | 91.30% | 4 Missing and 2 partials ⚠️ |
| dimos/learning/dataprep/formats/test_lerobot.py | 96.33% | 4 Missing ⚠️ |
| dimos/robot/cli/dimos.py | 60.00% | 4 Missing ⚠️ |
| dimos/learning/collection/blueprint.py | 88.23% | 1 Missing and 1 partial ⚠️ |
| dimos/learning/collection/test_episode_monitor.py | 98.78% | 0 Missing and 1 partial ⚠️ |

... and 1 more
@@            Coverage Diff             @@
##             main    #2446      +/-   ##
==========================================
+ Coverage   70.81%   71.03%   +0.22%     
==========================================
  Files         862      874      +12     
  Lines       77475    78617    +1142     
  Branches     6882     7011     +129     
==========================================
+ Hits        54862    55849     +987     
- Misses      20818    20947     +129     
- Partials     1795     1821      +26     
| Flag | Coverage Δ |
|---|---|
| OS-ubuntu-24.04-arm | 63.06% <73.03%> (+0.13%) ⬆️ |
| OS-ubuntu-latest | 66.05% <85.59%> (+0.30%) ⬆️ |
| Py-3.10 | 66.04% <85.59%> (+0.29%) ⬆️ |
| Py-3.11 | 66.04% <85.59%> (+0.29%) ⬆️ |
| Py-3.12 | 66.04% <85.59%> (+0.29%) ⬆️ |
| Py-3.13 | 66.04% <85.59%> (+0.29%) ⬆️ |
| Py-3.14 | 66.06% <85.59%> (+0.30%) ⬆️ |
| Py-3.14t | 66.04% <85.59%> (+0.29%) ⬆️ |
| SelfHosted-Large | 30.09% <24.28%> (-0.10%) ⬇️ |
| SelfHosted-Linux | 37.66% <26.34%> (-0.23%) ⬇️ |
| SelfHosted-macOS | 36.45% <26.34%> (-0.16%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| dimos/learning/collection/recorder.py | 100.00% <100.00%> (ø) |
| dimos/learning/dataprep/formats/test_stats.py | 100.00% <100.00%> (ø) |
| dimos/robot/all_blueprints.py | 100.00% <ø> (ø) |
| dimos/robot/test_all_blueprints.py | 87.50% <ø> (ø) |
| dimos/teleop/quest/quest_types.py | 56.41% <100.00%> (+5.76%) ⬆️ |
| dimos/learning/collection/test_episode_monitor.py | 98.78% <98.78%> (ø) |
| dimos/learning/dataprep/test_core.py | 99.39% <99.39%> (ø) |
| dimos/learning/collection/blueprint.py | 88.23% <88.23%> (ø) |
| dimos/learning/dataprep/formats/test_lerobot.py | 96.33% <96.33%> (ø) |
| dimos/robot/cli/dimos.py | 63.14% <60.00%> (-0.08%) ⬇️ |

... and 6 more

... and 6 files with indirect coverage changes

Comment on lines +132 to +144
def _on_buttons(self, msg: Buttons) -> None:
    """Rising-edge detect against `config.button_map`; advance state machine."""
    ts = time.time()
    for event_name, alias_or_attr in self.config.button_map.items():
        attr = BUTTON_ALIASES.get(alias_or_attr, alias_or_attr)
        try:
            pressed = bool(getattr(msg, attr))
        except AttributeError:
            continue
        prev = self._prev_bits.get(attr, False)
        self._prev_bits[attr] = pressed
        if pressed and not prev:  # rising edge
            self._transition(event_name, ts)

Contributor

Data collection usually includes pre-roll and post-roll: we send a stream of zeros before activation (the rising edge, in this case) and again after deactivation.

This helps mark the exact start and stop points of an episode from the data alone.

Zeros work for Twist commands; for joint positions we probably need to stream the current joint positions instead.
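
A rough sketch of the pre-roll/post-roll idea, under assumptions: `pad_episode` and its `hold_pose` flag are hypothetical names, not part of this PR, and actions are treated as plain lists of floats.

```python
def pad_episode(actions, n_pre, n_post, hold_pose=False):
    """Add pre-roll and post-roll samples around an episode (hypothetical helper).

    Velocity-like commands (e.g. Twist) are padded with zeros. With
    hold_pose=True the padding repeats the first and last action instead,
    which suits joint-position commands where zeros would be a real pose.
    """
    if not actions:
        return []
    width = len(actions[0])
    if hold_pose:
        pre, post = actions[0], actions[-1]
    else:
        pre = post = [0.0] * width
    # Copy every row so the result shares no lists with the input.
    return (
        [list(pre) for _ in range(n_pre)]
        + [list(a) for a in actions]
        + [list(post) for _ in range(n_post)]
    )


twist_padded = pad_episode([[0.5, 0.0], [0.2, 0.1]], n_pre=2, n_post=1)
joints_padded = pad_episode([[1.0, 2.0], [3.0, 4.0]], n_pre=1, n_post=2, hold_pose=True)
```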

Contributor Author

The _on_buttons method is a callback that reads button presses.

Regardless of what data we are collecting, it marks start and stop checkpoints in the data stream, which lets us trim the whole session into episodes.
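
To illustrate how such checkpoints could be used to trim a session, here is a minimal sketch. The `(timestamp, name)` event tuples and the helpers `episode_ranges` and `trim` are assumptions for illustration; the actual dataprep code may represent events differently.

```python
def episode_ranges(events):
    """Turn start/save/discard events into (start_ts, end_ts) ranges.

    Only episodes that end with "save" are kept; a "discard" drops the
    open episode, and a "save" with no open episode is ignored.
    """
    ranges = []
    start = None
    for ts, name in sorted(events):
        if name == "start":
            start = ts
        elif name == "save" and start is not None:
            ranges.append((start, ts))
            start = None
        elif name == "discard":
            start = None
    return ranges


def trim(samples, ranges):
    """Split (timestamp, value) samples into one list per saved episode."""
    return [[s for s in samples if lo <= s[0] <= hi] for lo, hi in ranges]


events = [
    (1.0, "start"), (2.0, "save"),
    (3.0, "start"), (3.5, "discard"),
    (4.0, "start"), (6.0, "save"),
]
ranges = episode_ranges(events)
episodes = trim([(0.5, "x"), (1.5, "y"), (3.2, "w"), (5.0, "z")], ranges)
```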

Comment on lines +55 to +58
episodes_saved: int
episodes_discarded: int
current_episode_start_ts: float | None
last_event: Literal["start", "save", "discard", "init"] = "init"

@mustafab0 mustafab0 Jun 11, 2026

Contributor

A little confused here. Why is EpisodeStatus counting episodes_saved and episodes_discarded?

A single session would have multiple episodes inside it.

@ruthwikdasyam ruthwikdasyam Jun 15, 2026

Contributor Author

Yes. Once we start a blueprint, we collect multiple episodes.
As we keep collecting, this message gives a live indication of how many episodes we have saved and how many we have discarded.
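
Roughly, the counters behave like the sketch below. The field names come from the diff under review; the default values, the `apply_event` helper, and the rule that save/discard are ignored when no episode is open are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class EpisodeStatus:
    # Field names follow the diff; the defaults here are assumed.
    episodes_saved: int = 0
    episodes_discarded: int = 0
    current_episode_start_ts: Optional[float] = None
    last_event: Literal["start", "save", "discard", "init"] = "init"


def apply_event(status, event, ts):
    """Advance the session-level status for one button event (hypothetical)."""
    if event == "start":
        status.current_episode_start_ts = ts
        status.last_event = "start"
        return status
    if status.current_episode_start_ts is None:
        return status  # assumed: save/discard with no open episode is a no-op
    if event == "save":
        status.episodes_saved += 1
    elif event == "discard":
        status.episodes_discarded += 1
    else:
        return status  # unknown event names leave the status unchanged
    status.current_episode_start_ts = None
    status.last_event = event
    return status


status = EpisodeStatus()
for name, ts in [("start", 1.0), ("save", 2.0), ("start", 3.0),
                 ("discard", 4.0), ("save", 5.0)]:
    apply_event(status, name, ts)
```

After the five events, the status reports one saved and one discarded episode, which is the running tally an operator would see during a session.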

@ruthwikdasyam ruthwikdasyam changed the title Ruthwik/datacollection feat(learning): teleop + dataset collection + dataprep pipeline Jun 18, 2026
@github-actions github-actions Bot added the ready-to-merge Required CI checks have passed on this PR label Jun 19, 2026
@github-actions github-actions Bot removed the ready-to-merge Required CI checks have passed on this PR label Jun 19, 2026
Comment thread dimos/learning/dataprep/formats/lerobot.py Outdated
@github-actions github-actions Bot added the ready-to-merge Required CI checks have passed on this PR label Jun 19, 2026
@github-actions github-actions Bot added ready-to-merge Required CI checks have passed on this PR and removed ready-to-merge Required CI checks have passed on this PR labels Jun 19, 2026
@github-actions github-actions Bot added ready-to-merge Required CI checks have passed on this PR and removed ready-to-merge Required CI checks have passed on this PR labels Jun 20, 2026
Comment thread dimos/learning/dataprep/build.py Outdated
@github-actions github-actions Bot removed the ready-to-merge Required CI checks have passed on this PR label Jun 20, 2026
@github-actions github-actions Bot added the ready-to-merge Required CI checks have passed on this PR label Jun 20, 2026
@github-actions github-actions Bot removed the ready-to-merge Required CI checks have passed on this PR label Jun 21, 2026
2 participants