Merged
7 changes: 7 additions & 0 deletions .changeset/openai-transcription-diarization.md
@@ -0,0 +1,7 @@
---
'@tanstack/ai': minor
'@tanstack/ai-client': minor
'@tanstack/ai-openai': minor
---

Add OpenAI transcription diarization support with `diarized_json` output, speaker-labeled segments, diarization model validation, chunking strategy options, and docs.
33 changes: 32 additions & 1 deletion docs/adapters/openai.md
@@ -313,15 +313,46 @@ import { audioFile } from "./audio";
const result = await generateTranscription({
adapter: openaiTranscription("whisper-1"),
audio: audioFile,
responseFormat: "verbose_json",
prompt: "Technical terms: API, SDK",
modelOptions: {
temperature: 0,
timestamp_granularities: ["word", "segment"],
},
});

// Access the transcribed text
console.log(result.text);
```

### Speaker Diarization

Use `gpt-4o-transcribe-diarize` for speaker-labeled transcripts:

```typescript
import { generateTranscription } from "@tanstack/ai";
import { openaiTranscription } from "@tanstack/ai-openai";
import { meetingAudioFile } from "./audio";

const result = await generateTranscription({
adapter: openaiTranscription("gpt-4o-transcribe-diarize"),
audio: meetingAudioFile,
modelOptions: {
known_speaker_names: ["agent", "customer"],
known_speaker_references: [
"data:audio/wav;base64,...",
"data:audio/wav;base64,...",
],
},
});

for (const segment of result.segments ?? []) {
console.log(segment.speaker, segment.start, segment.end, segment.text);
}
```

When no response format is specified, `gpt-4o-transcribe-diarize` requests default to `response_format: "diarized_json"` and `chunking_strategy: "auto"`; passing a top-level `responseFormat` of `"json"` or `"text"` opts out of speaker segments. `known_speaker_names` and `known_speaker_references` must be provided together, with matching lengths and at most four entries each. OpenAI does not support `prompt`, `include`, or `timestamp_granularities` with diarized transcription.
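
The defaulting rule described above can be sketched as a small pure function. This is a minimal illustration, not the adapter's actual implementation: `resolveDiarizeDefaults`, the type names, and the precedence of `modelOptions.response_format` over the top-level `responseFormat` are all assumptions made for the example. It also assumes `chunking_strategy` still defaults to `"auto"` when the caller opts out of speaker segments.

```typescript
// Hypothetical types for illustration; the real adapter types may differ.
type CommonFormat = "json" | "text" | "srt" | "verbose_json" | "vtt";
type RawFormat = CommonFormat | "diarized_json";
type ChunkingStrategy = "auto" | { type: "server_vad" } | null;

interface TranscriptionRequestOptions {
  responseFormat?: CommonFormat; // top-level option
  modelOptions?: {
    response_format?: RawFormat; // raw provider option
    chunking_strategy?: ChunkingStrategy;
  };
}

interface ResolvedRequest {
  response_format: RawFormat | undefined;
  chunking_strategy: ChunkingStrategy | undefined;
}

const DIARIZE_MODEL = "gpt-4o-transcribe-diarize";

// Hypothetical helper: applies the diarization defaults only when the caller
// has not chosen a value. Precedence between the two format options is assumed.
function resolveDiarizeDefaults(
  model: string,
  options: TranscriptionRequestOptions = {},
): ResolvedRequest {
  const isDiarize = model === DIARIZE_MODEL;
  const explicitFormat: RawFormat | undefined =
    options.modelOptions?.response_format ?? options.responseFormat;
  const explicitChunking = options.modelOptions?.chunking_strategy;

  return {
    // An explicit "json" or "text" wins, which opts out of speaker segments.
    response_format:
      explicitFormat ?? (isDiarize ? "diarized_json" : undefined),
    // Only `undefined` counts as "not provided"; an explicit null is kept.
    chunking_strategy:
      explicitChunking !== undefined
        ? explicitChunking
        : isDiarize
          ? "auto"
          : undefined,
  };
}
```

Calling `resolveDiarizeDefaults("gpt-4o-transcribe-diarize", {})` in this sketch yields the diarized defaults, while passing `{ responseFormat: "text" }` keeps `"text"` and leaves other models untouched.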

## Environment Variables

Set your API key in environment variables:
@@ -370,7 +401,7 @@ Creates an OpenAI text-to-speech adapter.

### `openaiTranscription(model, config?)` / `createOpenaiTranscription(model, apiKey, config?)`

Creates an OpenAI transcription adapter (Whisper).
Creates an OpenAI transcription adapter for Whisper, GPT-4o transcription, and GPT-4o diarized transcription models.

### `openaiVideo(model, config?)` / `createOpenaiVideo(model, apiKey, config?)`

2 changes: 1 addition & 1 deletion docs/comparison/vercel-ai-sdk.md
@@ -541,7 +541,7 @@ const result = await generateSpeech({
})
```

**Transcription** - `generateTranscription()` supports 5 output formats (json, text, srt, verbose_json, vtt), word-level timestamps with confidence scores, and four providers (OpenAI, Grok, ElevenLabs, fal.ai), with speaker diarization via OpenAI's `gpt-4o-transcribe-diarize` model.
**Transcription** - `generateTranscription()` supports common output formats (json, text, srt, verbose_json, vtt), word-level timestamps with confidence scores, and four providers (OpenAI, Grok, ElevenLabs, fal.ai), with speaker diarization via OpenAI's `gpt-4o-transcribe-diarize` model.

```ts
import { generateTranscription } from '@tanstack/ai'
5 changes: 3 additions & 2 deletions docs/config.json
@@ -259,7 +259,7 @@
"label": "Transcription",
"to": "media/transcription",
"addedAt": "2026-04-15",
"updatedAt": "2026-07-01"
"updatedAt": "2026-07-03"
},
{
"label": "Audio Recording",
@@ -503,7 +503,8 @@
{
"label": "OpenAI",
"to": "adapters/openai",
"addedAt": "2026-04-15"
"addedAt": "2026-04-15",
"updatedAt": "2026-07-03"
},
{
"label": "Anthropic",
2 changes: 1 addition & 1 deletion docs/media/generation-hooks.md
@@ -214,7 +214,7 @@ The `generate` function accepts a `TranscriptionGenerateInput`:
| `audio` | `string \| File \| Blob \| ArrayBuffer` | Audio data -- base64 string, File, Blob, or ArrayBuffer (required) |
| `language` | `string` | Language in ISO-639-1 format (e.g., `"en"`) |
| `prompt` | `string` | Optional prompt to guide the transcription |
| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt'` | Output format |
| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt'` | Common output format |
| `modelOptions` | `Record<string, any>` | Model-specific options |

## useSummarize
57 changes: 50 additions & 7 deletions docs/media/transcription.md
@@ -2,7 +2,7 @@
title: Transcription
id: transcription
order: 4
description: "Transcribe audio to text with OpenAI Whisper, GPT-4o-transcribe, Groq Whisper, and fal.ai STT models via TanStack AI's generateTranscription() API."
description: "Transcribe audio to text with OpenAI Whisper and GPT-4o transcription models (including speaker diarization), Groq Whisper, and fal.ai STT models via TanStack AI's generateTranscription() API."
keywords:
- tanstack ai
- transcription
@@ -24,7 +24,7 @@ TanStack AI provides support for audio transcription (speech-to-text) through de
Audio transcription is handled by transcription adapters that follow the same tree-shakeable architecture as other adapters in TanStack AI.

Currently supported:
- **OpenAI**: Whisper-1, GPT-4o-transcribe, GPT-4o-mini-transcribe
- **OpenAI**: Whisper-1, GPT-4o-transcribe, GPT-4o-mini-transcribe, GPT-4o-transcribe-diarize
- **Groq**: whisper-large-v3-turbo, whisper-large-v3
- **fal.ai**: Whisper, Wizper, speech-to-text turbo, ElevenLabs speech-to-text

@@ -139,6 +139,8 @@ for (const segment of result.segments ?? []) {
|--------|------|-------------|
| `audio` | `File \| string` | Audio data (File object or base64 string) - required |
| `language` | `string` | Language code (e.g., "en", "es", "fr") |
| `prompt` | `string` | Optional prompt to guide transcription style or terms. Not supported with `gpt-4o-transcribe-diarize`. |
| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt'` | Common output format |

### Supported Languages

@@ -175,15 +177,20 @@ const result = await generateTranscription({
prompt: 'Technical terms: API, SDK, CLI', // Top-level: guide transcription
modelOptions: {
temperature: 0, // Lower = more deterministic (provider option)
timestamp_granularities: ['word', 'segment'],
},
})
```

| Option | Type | Description |
|--------|------|-------------|
| `temperature` | `number` | Sampling temperature (0 to 1) |
| `timestamp_granularities` | `Array<'word' \| 'segment'>` | Timestamp granularity to populate (requires top-level `responseFormat: 'verbose_json'`) |
| `timestamp_granularities` | `Array<'word' \| 'segment'>` | Timestamp granularity to populate (`whisper-1` only; requires top-level `responseFormat: 'verbose_json'`) |
| `include` | `string[]` | Additional values to include in the response (e.g., `logprobs`) |
| `response_format` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt' \| 'diarized_json'` | Raw OpenAI response format. Use `diarized_json` here for speaker-labeled diarization output. |
| `chunking_strategy` | `'auto' \| { type: 'server_vad', ... } \| null` | Audio chunking strategy (any model; unset transcribes the audio as a single block). Required by OpenAI for `gpt-4o-transcribe-diarize` inputs longer than 30 seconds — the adapter defaults it to `'auto'` for that model |
| `known_speaker_names` | `string[]` | Up to four speaker labels for diarization |
| `known_speaker_references` | `string[]` | 2-10 second data URL audio samples matching `known_speaker_names` |

> `responseFormat` and `prompt` are **top-level** options on `generateTranscription`, not `modelOptions` keys.
@@ -197,6 +204,36 @@ const result = await generateTranscription({
| `verbose_json` | Detailed JSON with timestamps and segments |
| `vtt` | WebVTT subtitle format |

OpenAI's `gpt-4o-transcribe-diarize` also supports `modelOptions.response_format: 'diarized_json'` for speaker-labeled segments.

### Speaker Diarization

Use `gpt-4o-transcribe-diarize` when you need speaker labels. When no response format is specified, TanStack AI defaults the request to `response_format: 'diarized_json'` and sends `chunking_strategy: 'auto'` unless you provide a chunking strategy yourself. Passing a top-level `responseFormat: 'json'` or `'text'` opts out of speaker segments.

```typescript
import { generateTranscription } from '@tanstack/ai'
import { openaiTranscription } from '@tanstack/ai-openai'
import { meetingAudioFile } from './audio'

const result = await generateTranscription({
adapter: openaiTranscription('gpt-4o-transcribe-diarize'),
audio: meetingAudioFile,
modelOptions: {
known_speaker_names: ['agent', 'customer'],
known_speaker_references: [
'data:audio/wav;base64,...',
'data:audio/wav;base64,...',
],
},
})

for (const segment of result.segments ?? []) {
console.log(segment.speaker, segment.start, segment.end, segment.text)
}
```

OpenAI accepts up to four known speaker references; `known_speaker_names` and `known_speaker_references` must be provided together with matching lengths. The diarization model does not support `prompt`, `include`, or `timestamp_granularities`; the adapter rejects those combinations before making the API request.
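
The constraints above can be expressed as a pre-flight check. The sketch below is a hypothetical helper, not the adapter's own validation code: `validateDiarizeOptions`, `DiarizeOptions`, and the error wording are assumptions, and `prompt` is flattened into the same object for brevity even though it is a top-level option in `generateTranscription`. It collects messages instead of throwing so that every problem is reported at once.

```typescript
// Hypothetical option shape for illustration only.
interface DiarizeOptions {
  prompt?: string;
  include?: string[];
  timestamp_granularities?: Array<"word" | "segment">;
  known_speaker_names?: string[];
  known_speaker_references?: string[];
}

// Limit stated in the docs above: up to four known speakers.
const MAX_KNOWN_SPEAKERS = 4;

// Returns a list of human-readable problems; an empty list means the
// options are acceptable for a diarized transcription request.
function validateDiarizeOptions(options: DiarizeOptions): string[] {
  const errors: string[] = [];

  // Options the diarization model does not support.
  const unsupported = ["prompt", "include", "timestamp_granularities"] as const;
  for (const key of unsupported) {
    if (options[key] !== undefined) {
      errors.push(`${key} is not supported with diarized transcription`);
    }
  }

  const names = options.known_speaker_names;
  const references = options.known_speaker_references;

  if ((names === undefined) !== (references === undefined)) {
    errors.push(
      "known_speaker_names and known_speaker_references must be provided together",
    );
  } else if (names !== undefined && references !== undefined) {
    if (names.length !== references.length) {
      errors.push(
        "known_speaker_names and known_speaker_references must have matching lengths",
      );
    }
    if (names.length > MAX_KNOWN_SPEAKERS) {
      errors.push(`at most ${MAX_KNOWN_SPEAKERS} known speakers are allowed`);
    }
  }

  return errors;
}
```

A caller could run this before building the request and throw if the returned array is non-empty, which mirrors the "reject before making the API request" behavior described above.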

## Response Format

The transcription result includes:
@@ -499,9 +536,14 @@ import { transcribeStreamFn } from '../lib/server-functions'

function AudioTranscriber() {
const { generate, result, isLoading } = useTranscription({
fetcher: (input) => transcribeStreamFn({
data: { ...input, audio: input.audio as string },
}),
fetcher: (input) => {
if (typeof input.audio !== 'string') {
throw new Error('Expected base64 or data URL audio')
}
return transcribeStreamFn({
data: { ...input, audio: input.audio },
})
},
})
// ... same UI as above
}
@@ -586,5 +628,6 @@ const adapter = createOpenaiTranscription('whisper-1', 'your-openai-api-key')

5. **Prompting**: Use the `prompt` option to provide context or expected vocabulary (e.g., technical terms, names).

6. **Timestamps**: Request `verbose_json` format and enable `timestamp_granularities: ['word', 'segment']` when you need timing information for captions or synchronization.
6. **Timestamps**: Request `responseFormat: 'verbose_json'` and set `modelOptions.timestamp_granularities` when you need timing information for captions or synchronization.

7. **Diarization**: Use `gpt-4o-transcribe-diarize` with `modelOptions.response_format: 'diarized_json'` output for multi-speaker audio. Keep `chunking_strategy: 'auto'` unless you need custom VAD tuning.
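
For multi-speaker audio, speaker-labeled segments are often easier to read once consecutive segments from the same speaker are merged into turns. The sketch below assumes segments carry the `speaker`, `start`, `end`, and `text` fields used in the diarization example; the `SpeakerSegment` and `SpeakerTurn` types, the `mergeSpeakerTurns` helper, and the `"unknown"` fallback label are hypothetical and not part of TanStack AI.

```typescript
// Hypothetical shapes mirroring the fields logged in the diarization example.
interface SpeakerSegment {
  speaker?: string;
  start: number;
  end: number;
  text: string;
}

interface SpeakerTurn {
  speaker: string;
  start: number;
  end: number;
  text: string;
}

// Merges adjacent segments that share a speaker into a single turn.
// Segments are assumed to be in chronological order.
function mergeSpeakerTurns(segments: SpeakerSegment[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];

  for (const segment of segments) {
    const speaker = segment.speaker ?? "unknown";
    const last = turns[turns.length - 1];

    if (last !== undefined && last.speaker === speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.end = segment.end;
      last.text = `${last.text} ${segment.text.trim()}`;
    } else {
      // Speaker changed (or first segment): start a new turn.
      turns.push({
        speaker,
        start: segment.start,
        end: segment.end,
        text: segment.text.trim(),
      });
    }
  }

  return turns;
}
```

Passing `result.segments ?? []` from a diarized transcription into this helper would produce one entry per uninterrupted stretch of speech, which suits captions or meeting notes.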
26 changes: 25 additions & 1 deletion examples/ts-react-chat/src/lib/audio-providers.ts
@@ -6,6 +6,8 @@
* and audio generation flows.
*/

import type { TranscriptionGenerateInput } from '@tanstack/ai-client'

export type SpeechProviderId =
| 'openai'
| 'gemini'
@@ -87,13 +89,22 @@ export const SPEECH_PROVIDERS: ReadonlyArray<SpeechProviderConfig> = [
},
]

export type TranscriptionProviderId = 'openai' | 'fal' | 'grok' | 'elevenlabs'
export type TranscriptionProviderId =
| 'openai'
| 'openai-diarize'
| 'fal'
| 'grok'
| 'elevenlabs'

export interface TranscriptionProviderConfig {
id: TranscriptionProviderId
label: string
model: string
description: string
transcriptionOptions?: Pick<
TranscriptionGenerateInput,
'responseFormat' | 'modelOptions'
>
}

export const TRANSCRIPTION_PROVIDERS: ReadonlyArray<TranscriptionProviderConfig> =
@@ -104,6 +115,19 @@ export const TRANSCRIPTION_PROVIDERS: ReadonlyArray<TranscriptionProviderConfig>
model: 'whisper-1',
description: 'OpenAI Whisper transcription with optional streaming.',
},
{
id: 'openai-diarize',
label: 'OpenAI Diarize',
model: 'gpt-4o-transcribe-diarize',
description:
'OpenAI diarized transcription with speaker-labeled segments.',
transcriptionOptions: {
modelOptions: {
response_format: 'diarized_json',
chunking_strategy: 'auto',
},
},
},
{
id: 'fal',
label: 'Fal Whisper',
2 changes: 2 additions & 0 deletions examples/ts-react-chat/src/lib/server-audio-adapters.ts
@@ -65,6 +65,8 @@ export function buildTranscriptionAdapter(
switch (config.id) {
case 'openai':
return openaiTranscription(config.model as 'whisper-1')
case 'openai-diarize':
return openaiTranscription(config.model as 'gpt-4o-transcribe-diarize')
case 'fal':
return falTranscription(config.model)
case 'grok':
14 changes: 13 additions & 1 deletion examples/ts-react-chat/src/lib/server-fns.ts
@@ -78,7 +78,11 @@ const SPEECH_PROVIDER_SCHEMA = z
.optional()

const TRANSCRIPTION_PROVIDER_SCHEMA = z
.enum(['openai', 'fal', 'grok', 'elevenlabs'])
.enum(['openai', 'openai-diarize', 'fal', 'grok', 'elevenlabs'])
.optional()

const TRANSCRIPTION_RESPONSE_FORMAT_SCHEMA = z
.enum(['json', 'text', 'srt', 'verbose_json', 'vtt'])
.optional()

const AUDIO_PROVIDER_SCHEMA = z
@@ -144,6 +148,8 @@ export const transcribeFn = createServerFn({ method: 'POST' })
z.object({
audio: z.string(),
language: z.string().optional(),
responseFormat: TRANSCRIPTION_RESPONSE_FORMAT_SCHEMA,
modelOptions: z.record(z.string(), z.any()).optional(),
provider: TRANSCRIPTION_PROVIDER_SCHEMA,
}),
)
@@ -162,6 +168,8 @@ export const transcribeFn = createServerFn({ method: 'POST' })
adapter,
audio: data.audio,
language: data.language,
responseFormat: data.responseFormat,
modelOptions: data.modelOptions,
})
})

@@ -316,6 +324,8 @@ export const transcribeStreamFn = createServerFn({ method: 'POST' })
z.object({
audio: z.string(),
language: z.string().optional(),
responseFormat: TRANSCRIPTION_RESPONSE_FORMAT_SCHEMA,
modelOptions: z.record(z.string(), z.any()).optional(),
provider: TRANSCRIPTION_PROVIDER_SCHEMA,
}),
)
@@ -335,6 +345,8 @@ export const transcribeStreamFn = createServerFn({ method: 'POST' })
adapter,
audio: data.audio,
language: data.language,
responseFormat: data.responseFormat,
modelOptions: data.modelOptions,
stream: true,
}),
)
13 changes: 11 additions & 2 deletions examples/ts-react-chat/src/routes/api.transcribe.ts
@@ -8,12 +8,18 @@ import {
} from '../lib/server-audio-adapters'

const TRANSCRIPTION_PROVIDER_SCHEMA = z
.enum(['openai', 'fal', 'grok', 'elevenlabs'])
.enum(['openai', 'openai-diarize', 'fal', 'grok', 'elevenlabs'])
.optional()

const TRANSCRIPTION_RESPONSE_FORMAT_SCHEMA = z
.enum(['json', 'text', 'srt', 'verbose_json', 'vtt'])
.optional()

const TRANSCRIBE_BODY_SCHEMA = z.object({
audio: z.string().min(1),
language: z.string().optional(),
responseFormat: TRANSCRIPTION_RESPONSE_FORMAT_SCHEMA,
modelOptions: z.record(z.string(), z.any()).optional(),
provider: TRANSCRIPTION_PROVIDER_SCHEMA,
})

@@ -55,7 +61,8 @@ export const Route = createFileRoute('/api/transcribe')({
})
}

const { audio, language, provider } = parsed.data
const { audio, language, responseFormat, modelOptions, provider } =
parsed.data

try {
const adapter = buildTranscriptionAdapter(provider ?? 'openai')
@@ -64,6 +71,8 @@ export const Route = createFileRoute('/api/transcribe')({
adapter,
audio,
language,
responseFormat,
modelOptions,
stream: true,
})
