# Models
This module explains how to make the full OpenTalking model chain runnable, not only the talking-head backend. A usable digital-human session depends on five model layers, delivered to the browser over WebRTC:
```mermaid
flowchart LR
    STT[Speech recognition<br/>optional voice input]
    LLM[LLM<br/>decides what to say]
    TTS[TTS<br/>text to audio]
    Avatar[Avatar assets<br/>image / frames / template]
    Head[Talking-head backend<br/>audio to video]
    WebRTC[WebRTC<br/>browser delivery]
    STT --> LLM --> TTS --> Head --> WebRTC
    Avatar --> Head
```
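If it helps to see the hand-offs as data flow, here is a minimal Python sketch of one user utterance moving through the chain. Every function in it is a placeholder, not an OpenTalking API; the real components are selected and configured in the sections below.

```python
# Illustrative skeleton only: each stub stands in for one layer of the chain
# above, and none of these names are real OpenTalking APIs.

def transcribe(audio_in: bytes) -> str:
    """STT: optional voice input -> text."""
    return "hello"

def chat(prompt: str) -> str:
    """LLM: decides what to say."""
    return f"You said: {prompt}"

def synthesize(text: str) -> bytes:
    """TTS: text -> speech audio."""
    return text.encode("utf-8")

def render_talking_head(audio: bytes, avatar: dict) -> list:
    """Talking-head backend: audio + avatar assets -> video frames."""
    return [audio]

def publish(frames: list, audio: bytes) -> None:
    """WebRTC: deliver audio and video to the browser."""
    print(f"streaming {len(frames)} frame(s), {len(audio)} audio bytes")

def handle_utterance(audio_in: bytes, avatar: dict) -> None:
    text = transcribe(audio_in)          # STT
    reply = chat(text)                   # LLM
    speech = synthesize(reply)           # TTS
    frames = render_talking_head(speech, avatar)  # talking-head backend
    publish(frames, speech)              # WebRTC delivery

handle_utterance(b"\x00\x01", {"image": "examples/avatar.png"})
```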
## Recommended defaults
| Layer | Default for first run | When to change it |
|---|---|---|
| LLM | DashScope OpenAI-compatible endpoint | Use OpenAI, vLLM, Ollama, or DeepSeek when those are already standard in your environment. |
| STT | DashScope Paraformer realtime | Keep it unless you need a different realtime ASR provider. |
| TTS | Edge TTS | Use DashScope, CosyVoice, or ElevenLabs for production voices and voice cloning. |
| Avatar assets | Built-in examples | Prepare model-specific assets before selecting Wav2Lip, QuickTalk, FlashHead, or FlashTalk. |
| Talking-head backend | `mock` first, then the Wav2Lip compatibility path | Use QuickTalk `local`, FlashHead `direct_ws`, or OmniRT for heavier models. |
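If you want to sanity-check the default LLM and TTS layers on their own before wiring them into OpenTalking, a standalone check can look like the sketch below. It assumes the `openai` and `edge-tts` Python packages are installed, a `DASHSCOPE_API_KEY` environment variable, the `qwen-plus` model name, and the `en-US-AriaNeural` voice; swap in your own provider, model, and voice as needed.

```python
import asyncio
import os

import edge_tts            # pip install edge-tts
from openai import OpenAI  # pip install openai

# LLM layer: DashScope exposes an OpenAI-compatible endpoint, so the standard
# OpenAI client works once it is pointed at the compatible-mode base URL.
llm = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
reply = llm.chat.completions.create(
    model="qwen-plus",  # assumption: use any chat model available to your account
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)
text = reply.choices[0].message.content
print(text)

# TTS layer: Edge TTS needs no API key, which is why it is the first-run default.
async def speak(text: str, path: str = "reply.mp3") -> None:
    await edge_tts.Communicate(text, voice="en-US-AriaNeural").save(path)

asyncio.run(speak(text))
```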
## Setup order
1. Run Quickstart with `mock`.
2. Check the Support Matrix to choose the right path.
3. Configure LLM and STT.
4. Choose and verify TTS.
5. Prepare Avatar assets.
6. Start a talking-head model.
7. Verify `/models`, create a session, and test through the browser (see the sketch after this list).
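The verification step can be scripted. The sketch below assumes an OpenTalking server on `http://localhost:8000`; the `/models` path comes from this page, but the `/sessions` path and its payload are placeholders, so check your deployment's API reference for the real route and fields.

```python
import requests  # pip install requests

BASE = "http://localhost:8000"  # assumption: local OpenTalking server

# 1. /models should list the talking-head backends that are currently up.
models = requests.get(f"{BASE}/models", timeout=10)
models.raise_for_status()
print("available models:", models.json())

# 2. Hypothetical session-creation call; replace the path and payload with the
#    ones your deployment actually documents.
session = requests.post(
    f"{BASE}/sessions",
    json={"model": "mock"},  # start with the mock backend from the Quickstart
    timeout=10,
)
session.raise_for_status()
print("session:", session.json())

# 3. Open the web UI in a browser and talk to the avatar to confirm the
#    WebRTC leg end to end.
```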
Keep model execution decoupled from OpenTalking itself: lightweight models should use `local` or `direct_ws` where possible, while OmniRT remains the recommended backend for heavyweight, multi-card, remote, or NPU deployments.
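To make that rule concrete, the helper below encodes the same decision as a function. The mode strings `local`, `direct_ws`, and `omnirt` mirror the backends named above, but the function itself is only an illustration, not part of OpenTalking's configuration.

```python
def pick_backend_mode(
    heavyweight: bool = False,
    remote: bool = False,
    multi_card: bool = False,
    npu: bool = False,
    has_own_ws_server: bool = False,
) -> str:
    """Illustrative only: mirrors the decoupling rule described above."""
    if heavyweight or remote or multi_card or npu:
        return "omnirt"     # OmniRT for heavy, multi-card, remote, or NPU runs
    if has_own_ws_server:
        return "direct_ws"  # e.g. FlashHead exposing its own WebSocket endpoint
    return "local"          # lightweight models in the same process, e.g. QuickTalk


print(pick_backend_mode(heavyweight=True))         # -> omnirt
print(pick_backend_mode(has_own_ws_server=True))   # -> direct_ws
print(pick_backend_mode())                         # -> local
```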