
Consecutive Translation WebSocket API (v2)

The Consecutive Translation WebSocket endpoint provides real-time speech-to-speech translation in a single persistent connection. It combines transcription, translation, and speech synthesis into a unified streaming workflow.

Endpoint

WebSocket: wss://api.sanaslt.com/v2/consecutive

Authentication

Authentication is passed via query parameter. Provide exactly one of the two:

| Parameter | Description |
| --- | --- |
| token | JWT Bearer token from Supabase |
| api_key | API key |

Example connection URLs:

wss://api.sanaslt.com/v2/consecutive?token=<jwt_token>
wss://api.sanaslt.com/v2/consecutive?api_key=<api_key>

Protocol Overview


Connection Flow

  1. Connect — Establish WebSocket connection with authentication

  2. Configure — Send a config message with language settings

  3. Ready — Wait for ready message from server

  4. Stream — Send audio chunks and receive real-time results

  5. Complete — Send stop message or wait for VAD-triggered completion (confirmation is received as a speech_stop message).
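A minimal client-side dispatcher for the streaming phase might route incoming messages by their "type" field. The handler is a sketch; only the "type" values come from this document:

```python
import json

# Server message types documented for this endpoint.
KNOWN_TYPES = {"ready", "transcription", "translation", "audio",
               "languages", "speech_delimiter", "speech_stop", "error"}

def classify_server_message(raw: str) -> str:
    """Parse a server frame and return its message type, validating it."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unknown message type: {kind!r}")
    return kind
```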


Client Messages (Client → Server)

config

Initial configuration message. Must be sent immediately after connection.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | string | Yes | | Must be "config" |
| lang_in | string | Yes | | Source language code (e.g., "en-US", "es-ES", "fr-FR") |
| lang_out | string | Yes | | Target language code |
| input_sample_rate | integer | No | 16000 | Input audio sample rate. Allowed: 8000, 16000, 24000 |
| output_sample_rate | integer | No | 16000 | Output audio sample rate. Allowed: 8000, 16000, 24000 |
| glossary | string[] \| null | No | null | List of terms to preserve during translation |
| can_lang_swap | boolean | No | false | Allow automatic language swap based on detected speech. When enabled, both lang_in and lang_out are treated as potential input languages, and speech is translated to the opposite language |
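A config message built from the fields above might look like this. The language pair and glossary term are example values; the optional fields are shown with their documented defaults:

```python
import json

# First message after connecting: configure the translation session.
config = {
    "type": "config",
    "lang_in": "en-US",
    "lang_out": "es-ES",
    "input_sample_rate": 16000,   # must match the PCM you send
    "output_sample_rate": 16000,  # sample rate of synthesized audio
    "glossary": ["Sanas"],        # terms preserved during translation
    "can_lang_swap": False,
}
payload = json.dumps(config)
```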

audio

Audio data chunk. Send continuously after receiving ready.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "audio" |
| data | string | Yes | Base64-encoded 16-bit signed PCM audio (mono) |

Audio Format Requirements:

  • Encoding: 16-bit signed PCM (little-endian)

  • Channels: Mono (1 channel)

  • Sample rate: Must match input_sample_rate from config
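The format requirements above translate into a packing step like the following sketch, which turns raw int16 samples into an audio message (at 16 kHz, a 20 ms chunk is 320 samples):

```python
import base64
import struct

def audio_message(samples: list[int]) -> dict:
    """Pack mono int16 samples (little-endian) into an audio message."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return {"type": "audio", "data": base64.b64encode(pcm).decode("ascii")}
```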

stop

Signal to finalize the session and receive remaining results.


Server Messages (Server → Client)

ready

Indicates the server is ready to receive audio.

transcription

Real-time transcription results from the input audio. Complete words will not change and are sent only once; store them on the client side for display. Partial words are updated and resent until they become complete.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "transcription" |
| complete | Word[] | Finalized words (will not change; sent only once) |
| partial | Word[] | In-progress words (may be updated) |

Word object:

| Field | Type | Description |
| --- | --- | --- |
| word | string | The transcribed word (includes whitespace if applicable) |
| start | float | Start time in seconds (relative to audio start) |
| end | float | End time in seconds (relative to audio start) |
| probability | float \| null | Confidence score (0.0–1.0) |
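The complete/partial contract above suggests a simple client-side accumulator: append complete words once, and replace the partial set on every update. A minimal sketch (the `Transcript` class is illustrative, not part of the API):

```python
class Transcript:
    """Accumulates transcription messages for display."""

    def __init__(self) -> None:
        self.complete: list[dict] = []  # finalized words, appended once
        self.partial: list[dict] = []   # latest in-progress words

    def update(self, msg: dict) -> None:
        assert msg["type"] == "transcription"
        self.complete.extend(msg.get("complete", []))
        self.partial = msg.get("partial", [])

    def text(self) -> str:
        # Words already include whitespace, so plain concatenation works.
        return "".join(w["word"] for w in self.complete + self.partial)
```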

translation

Translated text from the transcription. Currently sent once transcription is complete, but in the future may also be sent as live updates with a partial field added, similar to transcriptions.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "translation" |
| complete | Word[] | Finalized words (will not change; sent only once) |
| partial | Word[] | In-progress words (may be updated) |

audio

Synthesized audio from the translated text.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "audio" |
| data | string | Base64-encoded 16-bit signed PCM audio (mono) at output_sample_rate |
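On the receiving side, the payload decodes back to raw PCM, and its playback duration follows from the configured output_sample_rate (16-bit mono means 2 bytes per sample). A small sketch:

```python
import base64

def pcm_duration_seconds(msg: dict, output_sample_rate: int = 16000) -> float:
    """Decode an output audio message and return its duration in seconds."""
    pcm = base64.b64decode(msg["data"])
    n_samples = len(pcm) // 2  # 16-bit mono: 2 bytes per sample
    return n_samples / output_sample_rate
```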

languages

Language detection notification. Sent when the system detects the spoken language. This is especially useful when can_lang_swap is enabled, so you know which languages have been detected as the input and output languages.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "languages" |
| lang_in | string | Detected/confirmed input language |
| lang_out | string | Target output language |

speech_delimiter

Timing information for synchronization between transcription, translation, and audio.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "speech_delimiter" |
| time | float | Timestamp in seconds since the start of the output audio |
| transcription | Delimiter | Position in the transcription stream corresponding to the first unspoken translation |
| translation | Delimiter | Position in the translation stream of the first unspoken translation (all translations before it have already been spoken in the output audio at timestamp time) |

Delimiter Object:

| Field | Type | Description |
| --- | --- | --- |
| utterance_idx | integer | Index of the utterance (always 0 for this API, as the consecutive API is single-utterance by nature) |
| word_idx | integer | Index of the word within the utterance |
| char_idx | integer | Character index within the word |
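For synchronization, a Delimiter can be resolved into a character offset within the concatenated word stream (utterance_idx is always 0 here, so only word_idx and char_idx matter). An illustrative helper, assuming words are stored in arrival order:

```python
def char_offset(words: list[str], delim: dict) -> int:
    """Resolve a Delimiter to a character offset in the joined word stream."""
    before = sum(len(w) for w in words[: delim["word_idx"]])
    return before + delim["char_idx"]
```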

speech_stop

Indicates the end of an utterance (triggered by Voice Activity Detection or a manual stop). It tells the client that the server will not accept any more input audio and is now processing final results, getting ready to send the translation and output speech.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "speech_stop" |
| utterance_idx | integer | Index of the completed utterance |

error

Error notification.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "error" |
| message | string | Human-readable error description |
| code | integer | Error code (optional) |

Supported Languages

See Languages.

Error Handling

Connection Errors

| Close Code | Reason | Description |
| --- | --- | --- |
| 1008 | Missing authentication | No token or api_key provided |
| 1008 | Unauthorized | Invalid or expired credentials |

Message Errors

Error messages are sent as JSON with type: "error":

Common errors:

  • Invalid config message — Config message failed validation

  • Timeout — Server timed out waiting for audio or response


Best Practices

  1. Audio Chunking — Send audio in small chunks (20 ms) for lower latency

  2. Real-time Pacing — Match send rate to actual audio duration for optimal results

  3. Handle Partial Results — Display partial transcriptions for responsive UX, but only persist complete words

  4. Buffer Output Audio — Accumulate audio chunks slightly before playback to prevent stuttering (usually this is already handled by most client-side playback libraries)

  5. Graceful Shutdown — Send stop message before closing connection for clean finalization

  6. Reconnection Logic — Implement exponential backoff for connection failures
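Practice 6 can be sketched as exponential backoff with a cap and full jitter; the base, cap, and jitter strategy are illustrative choices, not API requirements:

```python
import random

def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before reconnect attempt N: capped exponential with full jitter."""
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay)
```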
