# Consecutive Translation WebSocket API (v2)
The Consecutive Translation WebSocket endpoint provides real-time speech-to-speech translation in a single persistent connection. It combines transcription, translation, and speech synthesis into a unified streaming workflow.
## Endpoint

WebSocket: `wss://api.sanaslt.com/v2/consecutive`

## Authentication
Authentication is provided via query parameters. You only need one of the two:

| Parameter | Description |
| --- | --- |
| `token` | JWT Bearer token from Supabase |
| `api_key` | API key |
Example connection URLs:

```
wss://api.sanaslt.com/v2/consecutive?token=<jwt_token>
wss://api.sanaslt.com/v2/consecutive?api_key=<api_key>
```

## Protocol Overview
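A small helper for building the connection URL might look like the sketch below. The endpoint and parameter names come from this page; `connection_url` and the credential values are illustrative, not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.sanaslt.com/v2/consecutive"

def connection_url(token=None, api_key=None):
    """Build the WebSocket URL with exactly one credential (hypothetical helper)."""
    if (token is None) == (api_key is None):
        raise ValueError("provide exactly one of token or api_key")
    params = {"token": token} if token else {"api_key": api_key}
    return f"{BASE_URL}?{urlencode(params)}"

url = connection_url(api_key="my-secret-key")
```

Using `urlencode` keeps the credential safely percent-encoded if it contains reserved characters.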
### Connection Flow

1. **Connect** — Establish the WebSocket connection with authentication
2. **Configure** — Send a `config` message with language settings
3. **Ready** — Wait for the `ready` message from the server
4. **Stream** — Send audio chunks and receive real-time results
5. **Complete** — Send a `stop` message or wait for VAD-triggered completion (confirmation is received as a `speech_stop` message)
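The client-side message sequence can be sketched as plain JSON payloads, built here without a real connection; the message shapes follow the definitions documented below.

```python
import base64
import json

# 1. Configuration message, sent immediately after connecting.
config_msg = json.dumps({
    "type": "config",
    "lang_in": "en-US",
    "lang_out": "es-ES",
})

# 2. Audio chunk: base64-encoded 16-bit signed PCM, mono.
pcm_chunk = b"\x00\x00" * 320  # 20 ms of silence at 16 kHz
audio_msg = json.dumps({
    "type": "audio",
    "data": base64.b64encode(pcm_chunk).decode("ascii"),
})

# 3. Stop message to finalize the session.
stop_msg = json.dumps({"type": "stop"})
```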
## Client Messages (Client → Server)
### config

Initial configuration message. Must be sent immediately after connection.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `type` | string | ✅ | — | Must be `"config"` |
| `lang_in` | string | ✅ | — | Source language code (e.g., `"en-US"`, `"es-ES"`, `"fr-FR"`) |
| `lang_out` | string | ✅ | — | Target language code |
| `input_sample_rate` | integer | ❌ | 16000 | Input audio sample rate. Allowed: 8000, 16000, 24000 |
| `output_sample_rate` | integer | ❌ | 16000 | Output audio sample rate. Allowed: 8000, 16000, 24000 |
| `glossary` | string[] \| null | ❌ | null | List of terms to preserve during translation |
| `can_lang_swap` | boolean | ❌ | false | Allow automatic language swap based on the detected speech. Enable this if you want translation to work in either direction: both `lang_in` and `lang_out` are treated as potential input languages, and text is translated to the opposite language |
### audio

Audio data chunk. Send continuously after receiving `ready`.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | ✅ | Must be `"audio"` |
| `data` | string | ✅ | Base64-encoded 16-bit signed PCM audio (mono) |
Audio format requirements:

- Encoding: 16-bit signed PCM (little-endian)
- Channels: mono (1 channel)
- Sample rate: must match `input_sample_rate` from the config
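Packing raw samples into an audio message can be sketched as follows; the helper names are illustrative, but the wire format (little-endian 16-bit PCM, base64-encoded, wrapped in an `audio` message) is as documented above.

```python
import base64
import json
import math
import struct

SAMPLE_RATE = 16000  # must match input_sample_rate from the config

def pcm16_chunk(samples):
    """Pack float samples in [-1, 1] as little-endian 16-bit signed PCM."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def audio_message(pcm: bytes) -> str:
    return json.dumps({
        "type": "audio",
        "data": base64.b64encode(pcm).decode("ascii"),
    })

# 20 ms of a 440 Hz test tone as a single chunk (320 samples at 16 kHz).
n = SAMPLE_RATE // 50
tone = [0.5 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]
msg = audio_message(pcm16_chunk(tone))
```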
### stop

Signal to finalize the session and receive the remaining results.
## Server Messages (Server → Client)
### ready

Indicates that the server is ready to receive audio.
### transcription

Real-time transcription results for the input audio. Complete words will not change and are sent only once; store them on the client side for display. Partial words keep being updated and resent until they become complete.

| Field | Type | Description |
| --- | --- | --- |
| `type` | string | `"transcription"` |
| `complete` | Word[] | Finalized words (will not change; sent once only) |
| `partial` | Word[] | In-progress words (may be updated) |

Word object:

| Field | Type | Description |
| --- | --- | --- |
| `word` | string | The transcribed word (includes whitespace if applicable) |
| `start` | float | Start time in seconds (relative to audio start) |
| `end` | float | End time in seconds (relative to audio start) |
| `probability` | float \| null | Confidence score (0.0–1.0) |
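A minimal client-side accumulator for these messages might look like the sketch below: complete words are appended exactly once, while partial words are replaced wholesale on every update. The class name is illustrative.

```python
class TranscriptView:
    """Accumulates complete/partial Word lists from transcription messages."""

    def __init__(self):
        self.complete = []   # finalized Word dicts, append-only
        self.partial = []    # latest in-progress words, replaced each message

    def on_transcription(self, msg: dict) -> None:
        self.complete.extend(msg.get("complete", []))
        self.partial = msg.get("partial", [])

    def text(self) -> str:
        # Words already include whitespace where applicable, so join directly.
        return "".join(w["word"] for w in self.complete + self.partial)

view = TranscriptView()
view.on_transcription({"complete": [], "partial": [{"word": "Hel"}]})
view.on_transcription({"complete": [{"word": "Hello"}],
                       "partial": [{"word": " wor"}]})
```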
### translation

Translated text for the transcription. Currently sent once the transcription is complete, but in the future it may also be sent as live updates with a `partial` field added, similar to transcriptions.

| Field | Type | Description |
| --- | --- | --- |
| `type` | string | `"translation"` |
| `complete` | Word[] | Finalized words (will not change; sent once only) |
| `partial` | Word[] | In-progress words (may be updated) |
### audio

Synthesized audio for the translated text.

| Field | Type | Description |
| --- | --- | --- |
| `type` | string | `"audio"` |
| `data` | string | Base64-encoded 16-bit signed PCM audio (mono) at `output_sample_rate` |
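Decoding the payload back into 16-bit samples is the mirror image of the input encoding; a sketch:

```python
import base64
import struct

def decode_audio(msg: dict) -> list:
    """Decode a server audio message into signed 16-bit samples."""
    pcm = base64.b64decode(msg["data"])
    return list(struct.unpack(f"<{len(pcm) // 2}h", pcm))

# Two samples: 0x0000 and 0x7FFF (little-endian).
msg = {"type": "audio",
       "data": base64.b64encode(b"\x00\x00\xff\x7f").decode("ascii")}
samples = decode_audio(msg)  # [0, 32767]
```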
### languages

Language detection notification, sent when the system detects the spoken language. This is especially useful when `can_lang_swap` is enabled, so you know which languages have been detected as the input and output languages.

| Field | Type | Description |
| --- | --- | --- |
| `type` | string | `"languages"` |
| `lang_in` | string | Detected/confirmed input language |
| `lang_out` | string | Target output language |
### speech_delimiter

Timing information for synchronization between transcription, translation, and audio.

| Field | Type | Description |
| --- | --- | --- |
| `type` | string | `"speech_delimiter"` |
| `time` | float | Timestamp in seconds since the start of the output audio |
| `transcription` | Delimiter | Position in the transcription stream of the transcription corresponding to the first unspoken translation |
| `translation` | Delimiter | Position in the translation stream of the first unspoken translation (all translations before it have already been spoken in the output audio at time `time`) |

Delimiter object:

| Field | Type | Description |
| --- | --- | --- |
| `utterance_idx` | integer | Index of the utterance (always 0 for this API, as the consecutive API is single-utterance by nature) |
| `word_idx` | integer | Index of the word within the utterance |
| `char_idx` | integer | Character index within the word |
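One way to use a delimiter is to split the translation words into spoken and unspoken parts, e.g. to highlight already-spoken text in sync with playback. This sketch assumes the single-utterance case (so `utterance_idx` is ignored) and that `char_idx` marks how many characters of the word at `word_idx` have been spoken; that interpretation is an assumption, not spelled out above.

```python
def split_at_delimiter(words: list, delim: dict):
    """Split a word list at a Delimiter into (spoken, unspoken) parts.

    Assumes char_idx counts already-spoken characters of the word at
    word_idx (hypothetical interpretation for illustration).
    """
    i, c = delim["word_idx"], delim["char_idx"]
    spoken = words[:i] + ([words[i][:c]] if i < len(words) and c else [])
    unspoken = ([words[i][c:]] if i < len(words) else []) + words[i + 1:]
    return spoken, unspoken

spoken, unspoken = split_at_delimiter(
    ["Hola ", "mundo"],
    {"utterance_idx": 0, "word_idx": 1, "char_idx": 3},
)
```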
### speech_stop

Indicates the end of an utterance on the server (triggered by Voice Activity Detection or by a manual stop). It tells the client that the server will not accept any more input audio and is now processing the final results, getting ready to send the translation and output speech.

| Field | Type | Description |
| --- | --- | --- |
| `type` | string | `"speech_stop"` |
| `utterance_idx` | integer | Index of the completed utterance |
### error

Error notification.

| Field | Type | Description |
| --- | --- | --- |
| `type` | string | `"error"` |
| `message` | string | Human-readable error description |
| `code` | integer | Error code (optional) |
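All server messages carry a `type` field, so a client can dispatch on it; a minimal sketch, with placeholder handler bodies standing in for real UI and playback logic:

```python
import json

def handle_message(raw: str) -> str:
    """Dispatch one server message by its type; return values are placeholders."""
    msg = json.loads(raw)
    handlers = {
        "ready": lambda m: "start streaming audio",
        "transcription": lambda m: f"{len(m['complete'])} words finalized",
        "translation": lambda m: f"{len(m['complete'])} words translated",
        "audio": lambda m: "queue audio for playback",
        "languages": lambda m: f"{m['lang_in']} -> {m['lang_out']}",
        "speech_delimiter": lambda m: f"output at {m['time']}s",
        "speech_stop": lambda m: "stop sending audio",
        "error": lambda m: f"error: {m['message']}",
    }
    handler = handlers.get(msg["type"])
    return handler(msg) if handler else f"unknown type: {msg['type']}"

result = handle_message('{"type": "error", "message": "Timeout"}')
```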
## Supported Languages
See Languages.
## Error Handling

### Connection Errors

| Close code | Reason | Description |
| --- | --- | --- |
| 1008 | Missing authentication | No `token` or `api_key` provided |
| 1008 | Unauthorized | Invalid or expired credentials |
### Message Errors

Error messages are sent as JSON with `type: "error"`.

Common errors:

- Invalid config message — the config message failed validation
- Timeout — the server timed out waiting for audio or a response
## Best Practices

- **Audio chunking** — Send audio in small chunks (20 ms) for lower latency
- **Real-time pacing** — Match the send rate to the actual audio duration for optimal results
- **Handle partial results** — Display partial transcriptions for a responsive UX, but only persist complete words
- **Buffer output audio** — Accumulate audio chunks slightly before playback to prevent stuttering (most client-side playback libraries already handle this)
- **Graceful shutdown** — Send a `stop` message before closing the connection for clean finalization
- **Reconnection logic** — Implement exponential backoff for connection failures
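The chunking and pacing advice can be sketched as below: split the PCM buffer into 20 ms chunks and sleep between sends so the send rate tracks real time. `send` stands in for the actual WebSocket send call and is a placeholder.

```python
import time

SAMPLE_RATE = 16000          # must match input_sample_rate from the config
CHUNK_MS = 20                # recommended chunk size from Best Practices
CHUNK_BYTES = SAMPLE_RATE * CHUNK_MS // 1000 * 2  # 16-bit mono -> 640 bytes

def stream_pcm(pcm: bytes, send, pace: bool = True) -> None:
    """Send PCM in 20 ms chunks, pacing to real time when pace=True."""
    for off in range(0, len(pcm), CHUNK_BYTES):
        send(pcm[off:off + CHUNK_BYTES])
        if pace:
            time.sleep(CHUNK_MS / 1000)

sent = []
stream_pcm(b"\x00" * (CHUNK_BYTES * 3), sent.append, pace=False)
```

Pacing with a simple `sleep` drifts slightly over long sessions; a production client would schedule sends against a wall-clock deadline instead.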