Overview

ElevenLabs is a specialized audio provider for text-to-speech and speech-to-text operations. Bifrost performs conversions including:
  • Model ID mapping - Uses provider model identifier directly
  • Voice configuration - Maps voice settings (stability, similarity boost, speaker boost, speed, style)
  • Response format conversion - Speech format handling (MP3, Opus, PCM/WAV)
  • Timestamp support - Character-level timing alignment for TTS
  • Transcription with alignment - Word and character-level timing, diarization, and additional formats
  • Pronunciation dictionaries - Support for custom pronunciation rules
  • Voice quality parameters - Stability, similarity boost, and speaker boost controls

Supported Operations

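| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Speech (TTS) | ✅ | ✅ | /v1/text-to-speech/{voice_id} |
| Transcriptions (STT) | ✅ | - | /v1/speech-to-text |
| List Models | ✅ | - | /v1/models |
| Chat Completions | ❌ | ❌ | - |
| Responses API | ❌ | ❌ | - |
| Text Completions | ❌ | ❌ | - |
| Embeddings | ❌ | ❌ | - |
| Image Generation | ❌ | ❌ | - |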
Unsupported Operations (❌): Chat Completions, Responses API, Text Completions, and Embeddings are not supported by ElevenLabs (audio-focused provider). These return UnsupportedOperationError.
Note: ElevenLabs also supports a "Speech with Timestamps" endpoint at /v1/text-to-speech/{voice_id}/with-timestamps (non-streaming only) for enhanced timestamp information.

Setup & Configuration

Configure ElevenLabs as a provider.
  1. Navigate to Models > Model Providers. Look for ElevenLabs under Configured Providers. If it is missing, click on Add New Provider and select ElevenLabs.
  2. Click Add Key or edit an existing key.
  3. Set a name for your key.
  4. Paste your API key directly or use an environment variable (for example, env.ELEVENLABS_API_KEY).
  5. Set Allowed Models to All Models (default) or the specific model allowlist you want this key to serve.
  6. Save the provider configuration.
For text-to-speech calls, the Bifrost model is the ElevenLabs voice ID unless you pass a provider-specific voice override in the request.

1. Speech (Text-to-Speech)

Request Parameters

Core Parameters

| Parameter | Mapping | Notes |
|---|---|---|
| input.input | text | The text to convert to speech (required) |
| model | model_id | Model identifier (e.g., "eleven_multilingual_v2") |
| response_format | Query param output_format | Speech format (see Response Format) |

Voice Configuration

Voice settings are optional and controlled via params:
| Parameter | ElevenLabs Mapping | Default | Range |
|---|---|---|---|
| speed | voice_settings.speed | 1.0 | 0.5-2.0 |
| extra_params.stability | voice_settings.stability | 0.5 | 0-1.0 |
| extra_params.similarity_boost | voice_settings.similarity_boost | 0.75 | 0-1.0 |
| extra_params.use_speaker_boost | voice_settings.use_speaker_boost | true | boolean |
| extra_params.style | voice_settings.style | 0 | 0-1.0 |
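The defaults and ranges in the table above can be checked on the client before a request is sent. The following is a minimal sketch in Python; `build_voice_settings` is a hypothetical helper (not part of Bifrost or any ElevenLabs SDK), and only the field names, defaults, and ranges come from the table.

```python
from typing import Any, Dict

# Defaults and (min, max) ranges copied from the voice configuration table.
_DEFAULTS: Dict[str, Any] = {
    "speed": 1.0,
    "stability": 0.5,
    "similarity_boost": 0.75,
    "use_speaker_boost": True,
    "style": 0.0,
}
_RANGES = {
    "speed": (0.5, 2.0),
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "style": (0.0, 1.0),
}


def build_voice_settings(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and range-check them."""
    unknown = set(overrides) - set(_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown voice settings: {sorted(unknown)}")
    settings = {**_DEFAULTS, **overrides}
    for name, (low, high) in _RANGES.items():
        if not low <= settings[name] <= high:
            raise ValueError(f"{name} must be in [{low}, {high}]")
    if not isinstance(settings["use_speaker_boost"], bool):
        raise ValueError("use_speaker_boost must be a boolean")
    return settings
```

Calling `build_voice_settings(stability=0.3)` returns the full settings dictionary with the other fields left at their defaults, which keeps range errors local instead of surfacing as a provider-side failure.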

Advanced Parameters

Use extra_params for ElevenLabs-specific TTS features:
curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "eleven_multilingual_v2",
    "input": {"input": "Hello, how are you?"},
    "voice": "21m00Tcm4TlvDq8ikWAM",
    "response_format": "mp3",
    "stability": 0.5,
    "similarity_boost": 0.75,
    "use_speaker_boost": true,
    "style": 0,
    "speed": 1.0,
    "language_code": "en",
    "seed": 42,
    "previous_text": "Context text",
    "next_text": "Future context",
    "apply_text_normalization": "auto"
  }'

Advanced TTS Parameters

| Parameter | Type | Description |
|---|---|---|
| language_code | string | Language code (e.g., "en", "es") |
| seed | integer | Reproducible output (0-4294967295) |
| previous_text | string | Previous text context for consistency |
| next_text | string | Next text context for consistency |
| previous_request_ids | string[] | Previous request IDs for continuity |
| next_request_ids | string[] | Next request IDs for continuity |
| apply_text_normalization | string | Text normalization mode: "auto", "on", "off" |
| apply_language_text_normalization | boolean | Apply language-specific text normalization |

Response Format

| Format | Output | Quality | Bitrate / Sample Rate |
|---|---|---|---|
| mp3 | MP3 | High | 128 kbps @ 44100 Hz |
| opus | Opus | High | 128 kbps @ 48000 Hz |
| wav / pcm | PCM WAV | Lossless | 16-bit @ 44100 Hz |
Defaults to MP3 format if not specified. Format is passed via query parameter output_format.
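As an illustration of the query-parameter mapping, the sketch below turns a short format name into an `output_format` value and appends it to the TTS path. The target strings (`mp3_44100_128`, `opus_48000_128`, `pcm_44100`) are ElevenLabs-style values inferred from the bitrates and sample rates in the table; the exact strings Bifrost emits are an assumption, and `speech_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed mapping from Bifrost format names to ElevenLabs output_format values.
_OUTPUT_FORMATS = {
    "mp3": "mp3_44100_128",
    "opus": "opus_48000_128",
    "wav": "pcm_44100",
    "pcm": "pcm_44100",
}


def speech_url(voice_id: str, response_format: str = "mp3") -> str:
    """Build the TTS path with the format passed as a query parameter."""
    try:
        output_format = _OUTPUT_FORMATS[response_format.lower()]
    except KeyError:
        raise ValueError(
            f"unsupported response_format: {response_format!r}"
        ) from None
    query = urlencode({"output_format": output_format})
    return f"/v1/text-to-speech/{voice_id}?{query}"
```

The default argument mirrors the documented fallback to MP3 when no format is specified.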

Timestamps Support

To get character-level timing alignment, enable with_timestamps:
{
  "with_timestamps": true
}
When enabled, the endpoint /v1/text-to-speech/{voice_id}/with-timestamps is used and the response includes:
  • audio_base64 - Audio data as base64-encoded string
  • alignment.char_start_times_ms - Character start times in milliseconds
  • alignment.char_end_times_ms - Character end times in milliseconds
  • alignment.characters - Array of characters
  • normalized_alignment - Same as alignment but for normalized text

Response Conversion

Non-Timestamp Response

{
  "audio": "<binary audio data>"
}

Timestamp Response

{
  "audio_base64": "<base64 encoded audio>",
  "alignment": {
    "char_start_times_ms": [0, 150, 280, ...],
    "char_end_times_ms": [150, 280, 420, ...],
    "characters": ["H", "e", "l", "l", "o", ...]
  },
  "normalized_alignment": {
    "char_start_times_ms": [...],
    "char_end_times_ms": [...],
    "characters": [...]
  }
}
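A client would typically decode `audio_base64` and pair each character with its start and end time. The sketch below assumes the response has already been parsed into a Python dictionary with the fields shown above; `decode_timestamped_speech` is a hypothetical helper name.

```python
import base64
from typing import Any, Dict, List, Tuple

CharTiming = Tuple[str, int, int]  # (character, start_ms, end_ms)


def decode_timestamped_speech(
    response: Dict[str, Any],
) -> Tuple[bytes, List[CharTiming]]:
    """Return raw audio bytes plus one (char, start_ms, end_ms) tuple per character."""
    audio = base64.b64decode(response["audio_base64"])
    alignment = response["alignment"]
    timings = list(
        zip(
            alignment["characters"],
            alignment["char_start_times_ms"],
            alignment["char_end_times_ms"],
        )
    )
    return audio, timings
```

The same pairing can be applied to `normalized_alignment` when timings for the normalized text are needed.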

Streaming

Streaming speech returns audio in chunks as they are generated:
{
  "type": "audio.delta",
  "audio": "<binary audio chunk>"
}
Final chunk:
{
  "type": "audio.done"
}
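On the client side, the chunks can be concatenated until the final event arrives. This sketch assumes the stream has already been parsed into dictionaries shaped like the events above and that `audio` holds raw bytes; if the transport delivers base64 text instead, it would need decoding first. `collect_audio` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable


def collect_audio(events: Iterable[Dict[str, Any]]) -> bytes:
    """Concatenate audio.delta chunks and stop at audio.done."""
    buffer = bytearray()
    for event in events:
        if event["type"] == "audio.delta":
            buffer.extend(event["audio"])
        elif event["type"] == "audio.done":
            return bytes(buffer)
    # The iterator ran out without a terminating event.
    raise RuntimeError("stream ended before audio.done")
```

Raising when `audio.done` never arrives makes a truncated stream visible instead of silently returning partial audio.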

2. Transcription (Speech-to-Text)

Request Parameters

Input Source

Choose one of the following (mutually exclusive):
| Parameter | Type | Description |
|---|---|---|
| input.file | bytes | Audio file content (WAV, MP3, etc.) |
| extra_params.cloud_storage_url | string | URL to cloud-hosted audio file |
Error: Providing both sources, or neither, results in an error.
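The exactly-one-source rule can be enforced before the request is built. A minimal sketch, with `pick_audio_source` as a hypothetical helper:

```python
from typing import Optional


def pick_audio_source(
    file: Optional[bytes] = None,
    cloud_storage_url: Optional[str] = None,
) -> str:
    """Return which source is set, rejecting both-set and neither-set."""
    has_file = file is not None
    has_url = cloud_storage_url is not None
    if has_file == has_url:  # both True or both False
        raise ValueError("provide exactly one of file or cloud_storage_url")
    return "file" if has_file else "cloud_storage_url"
```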

Core Parameters

| Parameter | Mapping | Description |
|---|---|---|
| model | model_id | Model identifier (required) |
| params.language | language_code | Language code (ISO 639-1, e.g., "en") |

Advanced Parameters

Use extra_params for transcription-specific features:
curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=eleven_latest" \
  -F "language_code=en" \
  -F "tag_audio_events=true" \
  -F "num_speakers=2" \
  -F "timestamps_granularity=word" \
  -F "diarize=true" \
  -F "diarization_threshold=0.5" \
  -F "temperature=0.1" \
  -F "seed=42" \
  -F "use_multi_channel=true" \
  -F "webhook=true" \
  -F "webhook_id=webhook-123"

Transcription Options

| Parameter | Type | Description |
|---|---|---|
| tag_audio_events | boolean | Tag audio events (background noise, music, etc.) |
| num_speakers | integer | Expected number of speakers (for diarization) |
| timestamps_granularity | string | Timestamp level: "none", "word", "character" |
| diarize | boolean | Identify different speakers |
| diarization_threshold | float | Speaker diarization sensitivity (0.0-1.0) |
| file_format | string | Input format: "pcm_s16le_16", "other" |
| temperature | float | Transcription temperature (0.0-1.0) |
| seed | integer | Reproducible transcription |
| use_multi_channel | boolean | Process multi-channel audio separately |
| webhook | boolean | Enable webhook for async processing |
| webhook_id | string | Webhook endpoint ID |
| webhook_metadata | object/string | Additional webhook metadata |
| cloud_storage_url | string | URL to cloud-hosted audio (alternative to file) |

Additional Formats

Request multiple output formats simultaneously:
{
  "additional_formats": [
    {
      "format": "segmented_json",
      "include_speakers": true,
      "include_timestamps": true,
      "segment_on_silence_longer_than_s": 1.0,
      "max_segment_duration_s": 30.0
    },
    {
      "format": "srt",
      "max_segment_duration_s": 30.0
    }
  ]
}
Supported formats: segmented_json, docx, pdf, txt, html, srt
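When the `additional_formats` list is built programmatically, the format name can be checked against the supported set first. A small sketch; `additional_format` is a hypothetical helper, and option names are passed through unchanged rather than validated.

```python
from typing import Any, Dict

# Format names listed as supported above.
SUPPORTED_FORMATS = {"segmented_json", "docx", "pdf", "txt", "html", "srt"}


def additional_format(fmt: str, **options: Any) -> Dict[str, Any]:
    """Build one additional_formats entry, rejecting unknown format names."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported additional format: {fmt!r}")
    return {"format": fmt, **options}


# Rebuilds the two-entry request shown above.
request_fragment = {
    "additional_formats": [
        additional_format(
            "segmented_json",
            include_speakers=True,
            include_timestamps=True,
            segment_on_silence_longer_than_s=1.0,
            max_segment_duration_s=30.0,
        ),
        additional_format("srt", max_segment_duration_s=30.0),
    ]
}
```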

Response Conversion

Basic Transcription

{
  "transcript": {
    "language_code": "en",
    "language_probability": 0.95,
    "text": "Full transcribed text...",
    "words": [
      {
        "text": "Hello",
        "start": 0.0,
        "end": 0.5,
        "type": "word",
        "speaker_id": "speaker_1",
        "logprob": -0.05
      }
    ]
  }
}

With Diarization

When diarize: true, the response includes speaker identification:
{
  "transcript": {
    "text": "Hello how are you?",
    "words": [
      {
        "text": "Hello",
        "speaker_id": "speaker_1"
      },
      {
        "text": "how",
        "speaker_id": "speaker_2"
      }
    ]
  }
}
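A diarized word list is often easier to read once consecutive words from the same speaker are merged into turns. The sketch below assumes the `words` array has the shape shown above; skipping entries whose `type` is not "word" and labelling missing speakers as "unknown" are assumptions of this example, and `speaker_turns` is a hypothetical helper.

```python
from itertools import groupby
from typing import Any, Dict, List, Tuple


def speaker_turns(words: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Merge consecutive words with the same speaker_id into (speaker, text) turns."""
    # Entries without a "type" field are treated as words.
    spoken = [w for w in words if w.get("type", "word") == "word"]
    turns = []
    for speaker, group in groupby(
        spoken, key=lambda w: w.get("speaker_id", "unknown")
    ):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns
```

Because `groupby` only merges adjacent items, a speaker who talks, is interrupted, and talks again produces two separate turns, which preserves conversation order.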

With Timestamps

Character-level timing when timestamps_granularity: "character":
{
  "words": [
    {
      "text": "Hello",
      "characters": [
        {"text": "H", "start": 0.0, "end": 0.1},
        {"text": "e", "start": 0.1, "end": 0.2}
      ]
    }
  ]
}

With Additional Formats

{
  "transcript": { ... },
  "additional_formats": [
    {
      "requested_format": "srt",
      "file_extension": "srt",
      "content_type": "text/plain",
      "is_base64_encoded": false,
      "content": "1\n00:00:00,000 --> 00:00:01,000\nHello\n\n2\n..."
    }
  ]
}
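Each entry carries enough metadata to be written to disk: `file_extension` gives the suffix and `is_base64_encoded` says whether `content` must be decoded first. A minimal sketch, assuming text content is UTF-8; `export_payload` and the default file stem are hypothetical.

```python
import base64
from typing import Any, Dict, Tuple


def export_payload(
    entry: Dict[str, Any], stem: str = "transcript"
) -> Tuple[str, bytes]:
    """Return (filename, bytes) for one additional_formats entry."""
    content = entry["content"]
    if entry.get("is_base64_encoded"):
        data = base64.b64decode(content)  # binary formats such as docx or pdf
    else:
        data = content.encode("utf-8")  # text formats such as srt or txt
    return f"{stem}.{entry['file_extension']}", data
```

The returned bytes can then be written with `open(filename, "wb")`.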

Caveats

| Severity | Behavior | Impact | Code |
|---|---|---|---|
| High | Voice ID must be provided for TTS requests | Request fails without voice configuration | elevenlabs.go:198-208 |
| High | Either file or cloud_storage_url must be provided (not both) | Request fails with ambiguous input | elevenlabs.go:471-478 |
| Low | Response formats (MP3, Opus, WAV) mapped via format string | Format parameter passed as query string to endpoint | elevenlabs.go:712-715, utils.go:5-35 |
| Low | Timestamp requests use /with-timestamps endpoint variant | Switches endpoint based on with_timestamps flag | elevenlabs.go:195-205 |
| Low | Transcription uses multipart/form-data, not JSON | File and parameters sent as form fields | elevenlabs.go:480-690 |

3. List Models

Request Parameters

| Parameter | Type | Description |
|---|---|---|
| (none) | - | No parameters required |
Returns available models with their capabilities and language support.

Response Conversion

{
  "models": [
    {
      "model_id": "eleven_multilingual_v2",
      "name": "Eleven Multilingual v2",
      "description": "Multilingual speech synthesis",
      "serves_pro_voices": true,
      "token_cost_factor": 1.0,
      "can_do_text_to_speech": true,
      "can_do_voice_conversion": true,
      "can_use_style": true,
      "can_use_speaker_boost": true,
      "languages": [
        {"language_id": "en", "name": "English"},
        {"language_id": "es", "name": "Spanish"}
      ],
      "requires_alpha_access": false,
      "max_characters_request_free_user": 1000,
      "max_characters_request_subscribed_user": 100000,
      "maximum_text_length_per_request": 5000,
      "model_rates": {
        "character_cost_multiplier": 1.0
      }
    }
  ]
}
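The capability flags and language list make it straightforward to pick a model on the client. The sketch below assumes the response has been parsed into a dictionary shaped like the example above; `tts_models_for_language` is a hypothetical helper.

```python
from typing import Any, Dict, List


def tts_models_for_language(
    response: Dict[str, Any], language_id: str
) -> List[str]:
    """Return model_ids that support text-to-speech in the given language."""
    matches = []
    for model in response.get("models", []):
        if not model.get("can_do_text_to_speech"):
            continue
        languages = {lang.get("language_id") for lang in model.get("languages", [])}
        if language_id in languages:
            matches.append(model["model_id"])
    return matches
```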