
[MLE-5159] docs(audio-ws): correct response format from WAV to Raw PCM + fix sample voice#252

Merged
rishabh-bhargava merged 2 commits into main from fix/MLE-5159-ws-audio-format-docs on Apr 27, 2026

Conversation

@rishabh-bhargava
Contributor

Summary

The WS docs at https://docs.together.ai/reference/audio-speech-websocket are misleading on three independent points; this PR fixes all three.

  1. Audio format claim — the docs advertise Format: WAV (PCM s16le), but the WS streams raw PCM s16le bytes with no RIFF/WAVE header. A developer who saves the bytes with a .wav extension gets a file that no standard player will open (afplay returns Error: AudioFileOpen failed ('typ?')). Updated to Format: Raw PCM (s16le, mono).
  2. Sample save extension — both Python and JavaScript samples wrote to output.wav. Updated to output.pcm (and the print/console messages match).
  3. Sample voice — both samples used voice=tara, which belongs to Orpheus, not Kokoro. Running the docs sample as written fails immediately with Voice 'tara' is not available for model 'hexgrad/Kokoro-82M'. Available voices: af_heart, .... Updated to voice=af_heart. Also added a session.created guard in the Python sample so a future failure on the first event doesn't crash the script with KeyError: 'session' before the user can see what went wrong.
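The session.created guard described in point 3 can be sketched like this (a minimal illustration; the event-type names are the ones quoted above, and the websocket receive loop from the docs sample is omitted):

```python
import json

def handle_first_event(raw: str) -> str:
    """Return the session id from the first server event, or raise with
    the server's own error payload instead of a bare KeyError when the
    first event is a failure such as conversation.item.tts.failed."""
    event = json.loads(raw)
    if event.get("type") != "session.created":
        # e.g. Voice 'tara' is not available for model 'hexgrad/Kokoro-82M'
        raise RuntimeError(f"TTS session failed before start: {event}")
    return event["session"]["id"]
```

The point of the guard is simply that a failure event surfaces the server's message instead of a KeyError traceback.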

Linear: MLE-5159

Test Plan

  • Reproduced the original failure: copy-paste docs Python sample → crashes with KeyError: 'session' because the first server event is tts.failed. No output file produced.
  • After fixing voice + adding the session guard: sample runs end-to-end. Writes 257012 bytes to output.pcm for the three example sentences (≈ 5.35 s of audio at 24 kHz s16le mono).
  • Wrapped via ffmpeg -f s16le -ar 24000 -ac 1 -i output.pcm output.wav → plays cleanly via afplay (exit 0). Confirms the fix matches reality.
  • Empirically confirmed (Cartesia + attempted Kokoro) that the WS adapter never produces RIFF/WAVE-framed bytes — the format claim was wrong for every model on this endpoint, not just Minimax.
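If ffmpeg isn't handy, the same wrapping step can be done with Python's stdlib wave module (a sketch assuming the 24 kHz mono s16le output described above):

```python
import wave

def pcm_to_wav(pcm_path: str, wav_path: str,
               rate: int = 24000, channels: int = 1) -> None:
    """Wrap headerless s16le PCM in a RIFF/WAVE container so players
    like afplay or QuickTime will accept it."""
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)        # s16le: 2 bytes per sample
        w.setframerate(rate)
        w.writeframes(pcm)
```

As a sanity check on the numbers above: 257012 bytes ÷ (2 bytes/sample × 24000 samples/s) ≈ 5.35 s, matching the reported duration.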

Deploy chain

After this PR merges, the existing sync-openapi-spec-to-docs.yml workflow auto-opens a sync PR against togethercomputer/mintlify-docs. Once that secondary PR is approved and merged, Mintlify rebuilds and the changes go live at docs.together.ai.

🤖 Generated with Claude Code

rishabh-bhargava and others added 2 commits April 27, 2026 00:00
… PCM"

The Together WS endpoint streams raw PCM s16le samples with no RIFF/WAVE
header, base64-wrapped per audio_output.delta event. The previous
"WAV (PCM s16le)" claim led developers to write the bytes to a .wav
file and find that no player accepts them (afplay, QuickTime, VLC all
reject the file because there is no WAV magic).

Updates the audio format description and the two code samples
(Python, Node.js) to save to .pcm rather than .wav, matching the
actual on-the-wire format.
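The per-event decode-and-append step that commit describes amounts to something like this (illustrative only; the base64-wrapped-delta framing follows the audio_output.delta events mentioned above, and the helper name is hypothetical):

```python
import base64

def append_audio_delta(delta_b64: str, out_path: str = "output.pcm") -> int:
    """Decode one base64-wrapped chunk of raw s16le PCM from a delta
    event and append it to the output file; returns bytes written."""
    chunk = base64.b64decode(delta_b64)
    with open(out_path, "ab") as f:
        f.write(chunk)
    return len(chunk)
```

Note the .pcm extension: the concatenated bytes have no RIFF/WAVE header, which is exactly why the old .wav filename misled players.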

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d event

The voice 'tara' belongs to Orpheus, not Kokoro. Kokoro's default
voice 'af_heart' is the natural replacement and exists in the catalog.
Running the sample as written produced an immediate
conversation.item.tts.failed (Voice 'tara' is not available for
model 'hexgrad/Kokoro-82M').

The Python sample compounded that with an unconditional
session_data['session']['id'] access on the first message — when
the first message is tts.failed instead of session.created, that
crashes with KeyError before any code can react. Added a guard so
the sample fails gracefully with the actual error message.

JS sample already gated on message.type === 'session.created' so
no event-handling change is needed there.

Verified end-to-end: with the fixes applied, the sample now writes
257012 bytes (≈ 5.35 s of raw PCM s16le @ 24 kHz mono) to output.pcm.
ffmpeg wraps it cleanly and afplay plays it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Apr 27, 2026

✱ Stainless preview builds

This PR will update the togetherai SDKs with the following commit messages.

  • go: chore(internal): regenerate SDK with no functional changes
  • openapi: docs(api): update audio speech websocket format and code samples
  • python: chore(internal): regenerate SDK with no functional changes
  • terraform: chore(internal): add together-go SDK dependency, update dependencies
  • typescript: chore(internal): regenerate SDK with no functional changes

togetherai-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ togetherai-python studio · code

Your SDK build had at least one "warning" diagnostic.
generate ⚠️ · build ⏭️ · lint ⏭️ · test ⏭️

⚠️ togetherai-typescript studio · conflict

Your SDK build had at least one "warning" diagnostic.

⚠️ togetherai-go studio · code

Your SDK build had a failure in the test CI job, which is a regression from the base state.
generate ✅ · build ⏭️ · lint ✅ · test ❗

go get github.com/stainless-sdks/togetherai-go@6d5cbb9adc3b6198a6fb196707b740feb8019c78
togetherai-terraform studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ · lint ✅ · test ✅


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-04-27 16:27:35 UTC

@rishabh-bhargava rishabh-bhargava merged commit 235c9d0 into main Apr 27, 2026
6 checks passed
@rishabh-bhargava rishabh-bhargava deleted the fix/MLE-5159-ws-audio-format-docs branch April 27, 2026 16:25