feat(audio): add AudioModule for issue #1932 #2507
Adds mic audio capture and chunked publishing as AudioStamped on an Out stream, mirroring CameraModule. Validated on macOS Apple Silicon at 50 Hz / 20 ms frames with both synthetic (sine tone) and real mic sources.

- dimos/msgs/audio_msgs/AudioStamped.py: Python overlay wrapping foxglove_msgs.RawAudio for LCM encode/decode, with from_pcm() and to_numpy() helpers. Flags that builtin_interfaces.Time (not std_msgs.Header) is the wire type, so frame_id is not preserved.
- dimos/hardware/sensors/audio/module.py: AudioModule(Module) with AudioConfig(ModuleConfig), async def main() lifecycle, @rpc start/stop, @skill record_clip.
- examples/audio/validate_audio_module.py: LCM round-trip assert + live stream rate/timestamp validation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
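For readers checking the 50 Hz / 20 ms figure, the frame arithmetic can be sketched as follows. This is a minimal standard-library illustration, not code from the PR: the 16 kHz sample rate, the mono channel count, and the helper names `frame_geometry` and `sine_frame` are assumptions made for the example.

```python
import math
import struct

SAMPLE_RATE = 16_000   # Hz; assumed for illustration, not stated in the description
FRAME_MS = 20          # frame duration quoted in the description
CHANNELS = 1           # assumed mono
BYTES_PER_SAMPLE = 2   # signed 16-bit little-endian (S16LE)


def frame_geometry(sample_rate: int, frame_ms: int, channels: int) -> tuple[int, int, float]:
    """Return (samples per channel, bytes per frame, frames per second)."""
    samples = sample_rate * frame_ms // 1000
    n_bytes = samples * channels * BYTES_PER_SAMPLE
    return samples, n_bytes, 1000 / frame_ms


def sine_frame(freq_hz: float, start_sample: int, n_samples: int, sample_rate: int) -> bytes:
    """Generate one mono S16LE frame of a half-amplitude sine tone."""
    values = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * (start_sample + i) / sample_rate))
        for i in range(n_samples)
    ]
    # "<" = little-endian, "h" = signed 16-bit integer
    return struct.pack(f"<{n_samples}h", *values)


samples, n_bytes, rate_hz = frame_geometry(SAMPLE_RATE, FRAME_MS, CHANNELS)
frame = sine_frame(440.0, 0, samples, SAMPLE_RATE)
```

Under these assumptions a 20 ms frame holds 320 samples in 640 bytes, and frames are published 50 times per second.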
Greptile Summary
This PR adds a full audio pipeline to dimos, centered on
Confidence Score: 5/5
Safe to merge with the caveat that all new audio modules share some class-level mutable defaults that would corrupt multi-instance deployments. All new findings are non-blocking quality issues. The two most notable are: dimos/hardware/sensors/audio/module.py deserves a second pass on the class-level attribute declarations for SpeakerModule and FunVoiceEffectsModule.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant MIC as Microphone / Synthetic
participant AM as AudioModule
participant STT as SpeechToTextModule
participant TTS as TextToSpeechModule
participant FVE as FunVoiceEffectsModule
participant SPK as SpeakerModule
MIC->>AM: PCM frames (PortAudio callback / asyncio loop)
AM->>AM: "wrap → AudioStamped (ts=time.time())"
AM-->>STT: mic_audio Out stream
STT->>STT: VAD + AEC filter
STT->>STT: whisper transcription
STT-->>TTS: speech_text Out stream
TTS->>TTS: pyttsx3 / macos-say / OpenAI TTS
TTS-->>FVE: tts_audio_raw Out stream
TTS-->>STT: tts_reference_audio (AEC ref)
FVE->>FVE: pitch / ringmod / bitcrush / echo
FVE-->>SPK: tts_audio Out stream
SPK->>SPK: sd.OutputStream.write()
Reviews (8): Last reviewed commit: "fix: use wall-clock audio timestamps and..."
    @skill
    def record_clip(self, seconds: float = 1.0) -> bytes:
        """Record and return a clip of raw PCM audio.

        Collects frames from the live audio stream for `seconds` seconds and
        returns them concatenated as raw S16LE PCM bytes.
        """
        import threading

        buf: list[bytes] = []
        done = threading.Event()
        collected = [0.0]

        def on_frame(msg: AudioStamped) -> None:
            buf.append(msg.data)
            collected[0] += self.config.frame_ms / 1000.0
            if collected[0] >= seconds:
                done.set()

        unsub = self.audio.subscribe(on_frame)
        done.wait(timeout=seconds + 2.0)
        unsub()
        return b"".join(buf)
record_clip silently returns empty bytes if the module is not running
If record_clip is called before start() or after stop(), no frames will ever arrive, done.wait will time out after seconds + 2.0 seconds, and the method returns b"" with no error or log message. Callers have no way to distinguish a successful empty recording from a misconfigured call. At minimum, a log warning on timeout (or a raised exception) would surface the problem.
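One way to surface the timeout is to check the return value of `Event.wait()` and raise. The sketch below is self-contained and only illustrative: `FakeStream` is a hypothetical stand-in for the module's Out stream (it is not the dimos API), and the free function `record_clip` with its `grace_s` parameter is an assumed shape, not the PR's method.

```python
import threading
from typing import Callable


class FakeStream:
    """Hypothetical stand-in for an Out stream; not the dimos API."""

    def __init__(self) -> None:
        self._subs: list[Callable[[bytes], None]] = []

    def subscribe(self, cb: Callable[[bytes], None]) -> Callable[[], None]:
        self._subs.append(cb)
        return lambda: self._subs.remove(cb)  # unsubscribe handle

    def publish(self, data: bytes) -> None:
        for cb in list(self._subs):  # copy so unsubscribing mid-publish is safe
            cb(data)


def record_clip(stream: FakeStream, seconds: float, frame_ms: float, grace_s: float = 2.0) -> bytes:
    """Collect roughly `seconds` of audio, raising if the stream never delivers it."""
    buf: list[bytes] = []
    done = threading.Event()
    collected = 0.0

    def on_frame(data: bytes) -> None:
        nonlocal collected
        if done.is_set():
            return  # ignore frames that arrive after the clip is complete
        buf.append(data)
        collected += frame_ms / 1000.0
        if collected >= seconds:
            done.set()

    unsub = stream.subscribe(on_frame)
    try:
        # Event.wait returns False when the timeout expires first.
        finished = done.wait(timeout=seconds + grace_s)
    finally:
        unsub()
    if not finished:
        raise TimeoutError(
            f"record_clip collected {collected:.3f}s of {seconds:.3f}s; is the module running?"
        )
    return b"".join(buf)
```

Raising rather than returning `b""` lets callers tell a stopped module apart from a short recording; logging a warning and returning the partial buffer would be the softer alternative.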
    def __repr__(self) -> str:
        n_samples = len(self.data) // (2 if "16" in self.sample_format else 4)
        return (
            f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
            f"fmt={self.sample_format}, samples={n_samples}, ts={self.ts:.6f})"
        )
The n_samples heuristic does not divide by self.channels, so for multi-channel audio the repr reports total interleaved samples (e.g. 640 for 20 ms of stereo 16 kHz) rather than samples per channel (320). The existing byte-width check ("16" in self.sample_format) also silently falls back to 4 bytes/sample for any unknown format string, which could produce a nonsensical count.
-    def __repr__(self) -> str:
-        n_samples = len(self.data) // (2 if "16" in self.sample_format else 4)
-        return (
-            f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
-            f"fmt={self.sample_format}, samples={n_samples}, ts={self.ts:.6f})"
-        )
+    def __repr__(self) -> str:
+        bytes_per_sample = 2 if "16" in self.sample_format else 4
+        n_frames = len(self.data) // (bytes_per_sample * self.channels)
+        return (
+            f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
+            f"fmt={self.sample_format}, frames={n_frames}, ts={self.ts:.6f})"
+        )
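The suggestion above still falls back to 4 bytes/sample for unrecognised formats. A stricter variant could use an explicit lookup table, sketched below as a standalone function. The format labels in `BYTES_PER_SAMPLE` and the name `frames_per_channel` are assumptions for illustration; the set of format strings AudioStamped actually accepts is not shown in this thread.

```python
# Illustrative format labels only; the real set of accepted strings is an assumption.
BYTES_PER_SAMPLE = {
    "S16LE": 2,
    "S32LE": 4,
    "F32LE": 4,
}


def frames_per_channel(data: bytes, sample_format: str, channels: int) -> int:
    """Return the number of sample frames (samples per channel) in an interleaved buffer."""
    try:
        width = BYTES_PER_SAMPLE[sample_format]
    except KeyError:
        # Fail loudly instead of guessing a byte width.
        raise ValueError(f"unknown sample format: {sample_format!r}") from None
    if channels < 1:
        raise ValueError("channels must be >= 1")
    return len(data) // (width * channels)
```

For 20 ms of 16 kHz stereo S16LE audio (1280 bytes) this yields 320 frames, matching the per-channel count the comment asks for.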
    audio metadata. Serialises to/from foxglove_msgs.RawAudio on the wire.
    """

    msg_name = "foxglove_msgs.RawAudio"  # wire type used for LCM
we don't use foxglove, where does this come from?
Foxglove is not a new dependency; it's already mirrored into dimos_lcm, and RawAudio is the only audio type in there, so I reused it. Left a note that it's a stand-in pending a native Header-bearing type.
    def lcm_encode(self) -> bytes:
        """Encode to foxglove_msgs.RawAudio wire bytes.

        NOTE: frame_id and seq from self.header are NOT preserved (the wire
ros2 header has no seq
why not preserve frame_id?
My comment is wrong; I will fix it. frame_id does exist on the header, but RawAudio carries only a timestamp, so there is no frame_id field on the wire type to put it in. Preserving frame_id would mean adding a header-bearing audio type to dimos-lcm; we can discuss that today.
- Remove all mentions of `seq` (ROS2 std_msgs/Header has no seq field)
- Reword frame_id note: dropped because RawAudio has no frame_id field on the wire, not by design choice
- Mark foxglove_msgs.RawAudio as a temporary stand-in pending team decision on a native Header-bearing LCM type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
        ts: float | None = None,
    ) -> AudioStamped:
        """Construct from raw PCM bytes."""
        t = ts if ts is not None else time.monotonic()
The from_pcm factory's fallback timestamp uses time.monotonic(), which returns a counter with an undefined reference point (commonly seconds since boot) rather than Unix wall-clock time. Any caller that omits the ts argument, including external consumers of this public API, will create an AudioStamped whose ts field is a small uptime-like value rather than seconds since the Unix epoch (~1.7 × 10⁹). This makes Timestamped.dt() return a date in 1970 and breaks cross-stream alignment with any module that uses time.time().
-        t = ts if ts is not None else time.monotonic()
+        t = ts if ts is not None else time.time()
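The difference between the two clocks is easy to see by interpreting each as a Unix timestamp. The snippet below is a small standalone sketch; `default_ts` and `as_utc` are hypothetical helpers that only mirror the fallback expression in the suggestion, not functions from the PR.

```python
from __future__ import annotations

import time
from datetime import datetime, timezone


def default_ts(ts: float | None = None) -> float:
    """Mirror of the suggested fallback: use the caller's ts, else wall-clock time."""
    return ts if ts is not None else time.time()


def as_utc(ts: float) -> datetime:
    """Interpret a float as seconds since the Unix epoch."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


wall = default_ts()        # seconds since 1970-01-01 UTC
mono = time.monotonic()    # reference point is undefined; only differences are meaningful

print("wall clock :", as_utc(wall).isoformat())
print("monotonic  :", as_utc(mono).isoformat())  # not a meaningful calendar date
```

time.monotonic() remains the right tool for measuring elapsed intervals; it is only unsuitable as an absolute timestamp shared across modules.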
Add demo_audio blueprint to module.py and regenerate all_blueprints.py so AudioModule is accessible via:

  dimos run demo-audio (blueprint)
  dimos run audio-module (standalone module)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
        (TextToSpeechModule, "text", "speech_text"),
        (TextToSpeechModule, "audio", "tts_audio"),
        (SpeakerModule, "audio", "tts_audio"),
    ]
)


demo_audio = autoconnect(
    AudioModule.blueprint(),
TextToSpeechModule publishes frames with time.monotonic() timestamps
Every chunk published from _worker_loop uses time.monotonic() as its timestamp. time.monotonic() returns a system-relative counter (seconds since boot), not a Unix wall-clock time. Downstream consumers calling Timestamped.dt() will get dates in 1970, and cross-stream alignment with any module that uses time.time() (e.g., CameraModule) will fail. Replace with time.time() to match the rest of the stack.
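To illustrate why mixed clocks break alignment, here is a minimal nearest-timestamp matcher. It is a generic sketch, not how dimos aligns streams: the function `nearest_match`, the tolerance value, and the sample timestamps are all assumptions for the example.

```python
from __future__ import annotations

from bisect import bisect_left


def nearest_match(ts: float, candidates: list[float], tolerance_s: float) -> int | None:
    """Return the index of the candidate closest to ts, or None if none is within tolerance.

    `candidates` must be sorted in ascending order.
    """
    if not candidates:
        return None
    i = bisect_left(candidates, ts)
    best = None
    for j in (i - 1, i):  # only the two neighbours of the insertion point can be nearest
        if 0 <= j < len(candidates):
            if best is None or abs(candidates[j] - ts) < abs(candidates[best] - ts):
                best = j
    return best if abs(candidates[best] - ts) <= tolerance_s else None


# Hypothetical camera frame timestamps from time.time() (about 30 fps).
camera_ts = [1.7e9 + 0.000, 1.7e9 + 0.033, 1.7e9 + 0.066]

wall_clock_audio = 1.7e9 + 0.040   # audio chunk stamped with time.time()
monotonic_audio = 5432.040         # audio chunk stamped with time.monotonic()

matched = nearest_match(wall_clock_audio, camera_ts, tolerance_s=0.02)
unmatched = nearest_match(monotonic_audio, camera_ts, tolerance_s=0.02)
```

With these values the wall-clock chunk pairs with the second camera frame, while the monotonic-stamped chunk finds no frame within tolerance.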