
Consecutive Translation WebSocket API (v2)

The Consecutive Translation WebSocket endpoint provides real-time speech-to-speech translation in a single persistent connection. It combines transcription, translation, and speech synthesis into a unified streaming workflow.

Endpoint

WebSocket: wss://api.sanaslt.com/v2/consecutive

Authentication

Authentication is passed via query parameter. Provide exactly one of the two:

| Parameter | Description |
| --- | --- |
| token | JWT Bearer token from Supabase |
| api_key | API key |

Example connection URLs:

wss://api.sanaslt.com/v2/consecutive?token=<jwt_token>
wss://api.sanaslt.com/v2/consecutive?api_key=<api_key>

Protocol Overview


Connection Flow

  1. Connect — Establish WebSocket connection with authentication

  2. Configure — Send a config message with language settings

  3. Ready — Wait for ready message from server

  4. Stream — Send audio chunks and receive real-time results

  5. Complete — Send stop message or wait for VAD-triggered completion (confirmation is received as a speech_stop message).
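A minimal client-side dispatcher for the streaming phase might route incoming messages by their "type" field. The handler is a sketch; only the "type" values come from this document:

```python
import json

# Server message types documented for this endpoint.
KNOWN_TYPES = {"ready", "transcription", "translation", "audio",
               "languages", "speech_delimiter", "speech_stop", "error"}

def classify_server_message(raw: str) -> str:
    """Parse a server frame and return its message type, validating it."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unknown message type: {kind!r}")
    return kind
```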


Client Messages (Client → Server)

config

Initial configuration message. Must be sent immediately after connection.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | string | Yes | | Must be "config" |
| lang_in | string | Yes | | Source language code (e.g., "en-US", "es-ES", "fr-FR") |
| lang_out | string | Yes | | Target language code |
| input_sample_rate | integer | No | 16000 | Input audio sample rate. Allowed: 8000, 16000, 24000 |
| output_sample_rate | integer | No | 16000 | Output audio sample rate. Allowed: 8000, 16000, 24000 |
| glossary | string[] \| null | No | null | List of terms to preserve during translation |
| can_lang_swap | boolean | No | false | Allow automatic language swap based on detected speech. When enabled, both lang_in and lang_out are treated as potential input languages, and speech is translated to the opposite language |
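A config message built from the fields above might look like this. The language pair and glossary term are example values; the optional fields are shown with their documented defaults:

```python
import json

# First message after connecting: configure the translation session.
config = {
    "type": "config",
    "lang_in": "en-US",
    "lang_out": "es-ES",
    "input_sample_rate": 16000,   # must match the PCM you send
    "output_sample_rate": 16000,  # sample rate of synthesized audio
    "glossary": ["Sanas"],        # terms preserved during translation
    "can_lang_swap": False,
}
payload = json.dumps(config)
```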

audio

Audio data chunk. Send continuously after receiving ready.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "audio" |
| data | string | Yes | Base64-encoded 16-bit signed PCM audio (mono) |

Audio Format Requirements:

  • Encoding: 16-bit signed PCM (little-endian)

  • Channels: Mono (1 channel)

  • Sample rate: Must match input_sample_rate from config
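The format requirements above translate into a packing step like the following sketch, which turns raw int16 samples into an audio message (at 16 kHz, a 20 ms chunk is 320 samples):

```python
import base64
import struct

def audio_message(samples: list[int]) -> dict:
    """Pack mono int16 samples (little-endian) into an audio message."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return {"type": "audio", "data": base64.b64encode(pcm).decode("ascii")}
```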

stop

Signal to finalize the session and receive remaining results.


Server Messages (Server → Client)

ready

Indicates the server is ready to receive audio.

transcription

Real-time transcription results from the input audio. Complete words will not change and are sent only once; store them on the client side for display. Partial words are updated and resent until they become complete.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "transcription" |
| complete | Word[] | Finalized words (will not change; sent only once) |
| partial | Word[] | In-progress words (may be updated) |

Word object:

| Field | Type | Description |
| --- | --- | --- |
| word | string | The transcribed word (includes whitespace if applicable) |
| start | float | Start time in seconds (relative to audio start) |
| end | float | End time in seconds (relative to audio start) |
| probability | float \| null | Confidence score (0.0–1.0) |
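The complete/partial contract above suggests a simple client-side accumulator: append complete words once, and replace the partial set on every update. A minimal sketch (the `Transcript` class is illustrative, not part of the API):

```python
class Transcript:
    """Accumulates transcription messages for display."""

    def __init__(self) -> None:
        self.complete: list[dict] = []  # finalized words, appended once
        self.partial: list[dict] = []   # latest in-progress words

    def update(self, msg: dict) -> None:
        assert msg["type"] == "transcription"
        self.complete.extend(msg.get("complete", []))
        self.partial = msg.get("partial", [])

    def text(self) -> str:
        # Words already include whitespace, so plain concatenation works.
        return "".join(w["word"] for w in self.complete + self.partial)
```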

translation

Translated text from the transcription. Currently sent once transcription is complete, but in the future may also be sent as live updates with a partial field added, similar to transcriptions.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "translation" |
| complete | Word[] | Finalized words (will not change; sent only once) |
| partial | Word[] | In-progress words (may be updated) |

audio

Synthesized audio from the translated text.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "audio" |
| data | string | Base64-encoded 16-bit signed PCM audio (mono) at output_sample_rate |
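On the receiving side, the payload decodes back to raw PCM, and its playback duration follows from the configured output_sample_rate (16-bit mono means 2 bytes per sample). A small sketch:

```python
import base64

def pcm_duration_seconds(msg: dict, output_sample_rate: int = 16000) -> float:
    """Decode an output audio message and return its duration in seconds."""
    pcm = base64.b64decode(msg["data"])
    n_samples = len(pcm) // 2  # 16-bit mono: 2 bytes per sample
    return n_samples / output_sample_rate
```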

languages

Language detection notification. Sent when the system detects the spoken language. This is especially useful when can_lang_swap is enabled, so you know which languages have been detected as the input and output languages.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "languages" |
| lang_in | string | Detected/confirmed input language |
| lang_out | string | Target output language |

speech_delimiter

Timing information for synchronization between transcription, translation, and audio.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "speech_delimiter" |
| time | float | Timestamp in seconds since the start of the output audio |
| transcription | Delimiter | Position in the transcription stream corresponding to the first unspoken translation |
| translation | Delimiter | Position in the translation stream of the first unspoken translation (all translations before it have already been spoken in the output audio at timestamp time) |

Delimiter Object:

| Field | Type | Description |
| --- | --- | --- |
| utterance_idx | integer | Index of the utterance (always 0 for this API, as the consecutive API is single-utterance by nature) |
| word_idx | integer | Index of the word within the utterance |
| char_idx | integer | Character index within the word |
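For synchronization, a Delimiter can be resolved into a character offset within the concatenated word stream (utterance_idx is always 0 here, so only word_idx and char_idx matter). An illustrative helper, assuming words are stored in arrival order:

```python
def char_offset(words: list[str], delim: dict) -> int:
    """Resolve a Delimiter to a character offset in the joined word stream."""
    before = sum(len(w) for w in words[: delim["word_idx"]])
    return before + delim["char_idx"]
```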

speech_stop

Indicates the end of an utterance (triggered by Voice Activity Detection or a manual stop). It tells the client that the server will not accept any more input audio and is now processing final results, getting ready to send the translation and output speech.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "speech_stop" |
| utterance_idx | integer | Index of the completed utterance |

error

Error notification.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "error" |
| message | string | Human-readable error description |
| code | integer | Error code (optional) |

Supported Languages

See Languages.

Error Handling

Connection Errors

| Close Code | Reason | Description |
| --- | --- | --- |
| 1008 | Missing authentication | No token or api_key provided |
| 1008 | Unauthorized | Invalid or expired credentials |

Message Errors

Error messages are sent as JSON with type: "error":

Common errors:

  • Invalid config message — Config message failed validation

  • Timeout — Server timed out waiting for audio or response


Best Practices

  1. Audio Chunking — Send audio in small chunks (20 ms) for lower latency

  2. Real-time Pacing — Match send rate to actual audio duration for optimal results

  3. Handle Partial Results — Display partial transcriptions for responsive UX, but only persist complete words

  4. Buffer Output Audio — Accumulate audio chunks slightly before playback to prevent stuttering (usually this is already handled by most client-side playback libraries)

  5. Graceful Shutdown — Send stop message before closing connection for clean finalization

  6. Reconnection Logic — Implement exponential backoff for connection failures
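Practice 6 can be sketched as exponential backoff with a cap and full jitter; the base, cap, and jitter strategy are illustrative choices, not API requirements:

```python
import random

def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before reconnect attempt N: capped exponential with full jitter."""
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay)
```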
