
Speech Generation Models

Speech generation models are usually integrated as TTS providers: they convert LLM output into the audio that drives the talking-head backend. This page covers only model selection and navigation; weight preparation, startup, verification, and troubleshooting are documented on the individual model pages.

Provider Options

| Provider | Type | Best for | Entry |
| --- | --- | --- | --- |
| edge | Hosted / online | First run, CPU evaluation, no API key | .env provider config |
| dashscope | Hosted API | Chinese realtime TTS, voice cloning, DashScope deployments | .env provider config |
| cosyvoice | Self-hosted service | Existing CosyVoice WebSocket / HTTP service | Service-specific docs |
| elevenlabs | Hosted API | Hosted multilingual voices | .env provider config |
| local_cosyvoice | Local deployment | Local Chinese TTS, built-in voices, and cloned voices | CosyVoice |
| indextts | Local deployment / OmniRT | Controllable dubbing, emotion control, and voice cloning | IndexTTS |
| local_f5_tts | Local deployment | Local F5-TTS Base voice cloning | F5-TTS |
| local_qwen3_tts | Local deployment | Local Qwen3-TTS Base voice cloning | Qwen3-TTS |
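
To narrow the choice programmatically, the provider table above can be encoded as plain data and filtered by deployment type. The following is a minimal sketch only: the `Provider` class, the `PROVIDERS` list, and the `providers_by_type` helper are hypothetical names introduced for illustration and are not part of the project's API. The field values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """One row of the provider table (hypothetical structure, for illustration)."""
    name: str
    type: str
    best_for: str
    entry: str


# Values copied from the provider table; the list itself is an assumption,
# not something the project ships.
PROVIDERS = [
    Provider("edge", "Hosted / online", "First run, CPU evaluation, no API key", ".env provider config"),
    Provider("dashscope", "Hosted API", "Chinese realtime TTS, voice cloning, DashScope deployments", ".env provider config"),
    Provider("cosyvoice", "Self-hosted service", "Existing CosyVoice WebSocket / HTTP service", "Service-specific docs"),
    Provider("elevenlabs", "Hosted API", "Hosted multilingual voices", ".env provider config"),
    Provider("local_cosyvoice", "Local deployment", "Local Chinese TTS, built-in voices, and cloned voices", "CosyVoice"),
    Provider("indextts", "Local deployment / OmniRT", "Controllable dubbing, emotion control, and voice cloning", "IndexTTS"),
    Provider("local_f5_tts", "Local deployment", "Local F5-TTS Base voice cloning", "F5-TTS"),
    Provider("local_qwen3_tts", "Local deployment", "Local Qwen3-TTS Base voice cloning", "Qwen3-TTS"),
]


def providers_by_type(keyword: str) -> list[str]:
    """Return provider names whose Type column contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [p.name for p in PROVIDERS if needle in p.type.lower()]


if __name__ == "__main__":
    # Providers that run on your own machine, versus hosted APIs.
    print(providers_by_type("local"))
    print(providers_by_type("hosted api"))
```

Matching is a simple substring test on the Type column, so a broad keyword such as "hosted" also matches "Self-hosted service"; use a more specific keyword when that distinction matters.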

Local Model Entries

Each local model page contains use cases, weight preparation, startup commands, verification commands, and common errors.