Text-to-Speech

TTS converts LLM output into audio that drives the talking-head backend. Start with Edge TTS for the lightest local evaluation, then switch providers when you need production voices, cloning, or provider-specific voice quality.

Provider options

| Provider | Best for | Required configuration |
| --- | --- | --- |
| edge | First run, CPU evaluation, no API key | OPENTALKING_TTS_PROVIDER=edge |
| dashscope | Chinese realtime TTS and voice cloning | DASHSCOPE_API_KEY plus the DashScope TTS settings |
| cosyvoice | Custom voice service or CosyVoice deployment | CosyVoice WebSocket URL and settings |
| elevenlabs | Hosted multilingual voices | ElevenLabs API key and voice ID |

Edge TTS default

.env
OPENTALKING_TTS_PROVIDER=edge
OPENTALKING_TTS_VOICE=zh-CN-XiaoxiaoNeural
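
To audition Edge voices before wiring them into OpenTalking, the standalone edge-tts command-line tool can list and synthesize voices directly. This assumes the edge-tts Python package is installed separately (pip install edge-tts); it is a preview aid, not an OpenTalking command.

terminal
# List the available Chinese voices, then synthesize a short preview clip.
edge-tts --list-voices | grep zh-CN
edge-tts --voice zh-CN-XiaoxiaoNeural --text "Hello from Edge TTS." --write-media preview.mp3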

Even Edge TTS needs ffmpeg, because OpenTalking decodes every provider's audio into PCM for the synthesis backend.
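
A quick way to confirm ffmpeg is actually on the PATH before starting the server (the install commands are the usual package-manager ones, shown as examples):

terminal
# OpenTalking relies on ffmpeg to decode provider audio into PCM.
ffmpeg -version | head -n 1
# If the command is missing:
#   sudo apt-get install ffmpeg   # Debian/Ubuntu
#   brew install ffmpeg           # macOS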

DashScope Qwen realtime TTS

.env
OPENTALKING_TTS_PROVIDER=dashscope
DASHSCOPE_API_KEY=<dashscope-api-key>
OPENTALKING_QWEN_TTS_MODEL=qwen3-tts-flash-realtime
OPENTALKING_QWEN_TTS_REUSE_WS=1
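
To confirm the key works before starting OpenTalking, one option is to hit DashScope's OpenAI-compatible REST endpoint and list models. This is a generic DashScope sanity check, not an OpenTalking feature, and it assumes a mainland-China account; international keys use dashscope-intl.aliyuncs.com instead.

terminal
# A cheap key check: listing models should return JSON, not an auth error.
curl -s https://dashscope.aliyuncs.com/compatible-mode/v1/models \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" | head -c 300; echo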

ElevenLabs

.env
OPENTALKING_TTS_PROVIDER=elevenlabs
OPENTALKING_TTS_ELEVENLABS_API_KEY=<elevenlabs-api-key>
OPENTALKING_TTS_ELEVENLABS_VOICE_ID=<voice-id>
OPENTALKING_TTS_ELEVENLABS_MODEL_ID=eleven_flash_v2_5
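
If you do not have a voice ID yet, the ElevenLabs voices endpoint lists the voices available to your key. This calls the ElevenLabs API directly and is independent of OpenTalking; the jq filter assumes the documented response shape with a top-level voices array.

terminal
# Each entry includes the voice_id to copy into OPENTALKING_TTS_ELEVENLABS_VOICE_ID.
curl -s https://api.elevenlabs.io/v1/voices \
  -H "xi-api-key: <elevenlabs-api-key>" | jq -r '.voices[] | "\(.voice_id)  \(.name)"'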

Verification

Create a mock session first, then call /speak with a fixed piece of text. This exercises the full TTS path without depending on a real talking-head model.

terminal
SID=<session-id>
curl -s -X POST "http://127.0.0.1:8000/sessions/$SID/speak" \
  -H 'content-type: application/json' \
  -d '{"text":"Hello, this is an OpenTalking TTS test."}'