Quickstart¶
This guide walks through a complete end-to-end conversation with a digital human using the mock synthesis path. The mock path requires no GPU and no pre-downloaded model weights, making it suitable for first-time installation and CI environments.
The resulting environment exposes a web interface at http://localhost:5173 where audio
input is streamed through speech recognition, a language model, and text-to-speech, with
synthesized video frames delivered over WebRTC.
Prerequisites¶
| Component | Minimum version | Purpose |
|---|---|---|
| Python | 3.10+ (3.11 recommended) | Server runtime |
| Node.js | 18 | React frontend toolchain |
| ffmpeg | Recent stable release | Audio decoding for the TTS pipeline |
| DashScope API key | — | Required for the default language model (qwen-flash) and speech recognition (paraformer-realtime-v2). Apply at bailian.console.aliyun.com. |
GPU and NPU resources are not required for the quickstart. CUDA or Ascend hardware is only necessary when switching to a real talking-head model in Step 5.
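Before installing, you can confirm that the tools from the table above are on your `PATH`. This loop only reports what it finds; it does not fail:

```shell
# Report which quickstart prerequisites are available on PATH.
for tool in python3 node ffmpeg; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "MISSING: $tool"
  fi
done
```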
1. Install from source¶
```shell
git clone https://github.com/datascale-ai/opentalking.git
cd opentalking
uv sync --extra dev --python 3.11
source .venv/bin/activate
cp .env.example .env
```
If you need the compatibility fallback instead:
```shell
python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple -e ".[dev]"
cp .env.example .env
```
Notes:
- The lockfile is validated with Python 3.11.
- When PyAV resolves to a wheel, only a runtime `ffmpeg` is required.
- If you move to an unvalidated Python or PyAV combination and trigger a source build, you will also need `ffmpeg` 7, `pkg-config`, and a C compiler.
2. Configure required credentials¶
Configure the following two variables in `.env`. All remaining settings have working
defaults and may be left unchanged.
Both variables must be set to the same DashScope API key. `OPENTALKING_LLM_API_KEY` is
consumed by the language model client; `DASHSCOPE_API_KEY` is read directly by the
DashScope SDK used for speech recognition.
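A minimal `.env` fragment for this step looks like the following; the key value is a placeholder, and both variables carry the same DashScope key:

```shell
# Both variables must hold the same DashScope API key.
OPENTALKING_LLM_API_KEY=sk-your-dashscope-key
DASHSCOPE_API_KEY=sk-your-dashscope-key
```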
Alternative language model providers
Any OpenAI-compatible endpoint may be used in place of DashScope. When switching
providers, also update OPENTALKING_LLM_BASE_URL and OPENTALKING_LLM_MODEL. See
Configuration.
3. Start the services¶
The script starts two processes:
- OpenTalking unified server at `http://127.0.0.1:8000`, providing the FastAPI endpoints for sessions, avatars, server-sent events, and WebRTC signaling.
- Frontend development server at `http://localhost:5173`, serving the Vite-built React client.
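The mock path is typically launched with the quickstart script (the same script whose port-override flags appear in the Troubleshooting table; the exact invocation may differ in your checkout):

```shell
# Launch the unified server and the frontend dev server on the mock path.
bash scripts/quickstart/start_mock.sh
```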
The mock synthesis backend runs in-process and does not require OmniRT or any external inference service.
4. Initiate a conversation¶
Open http://localhost:5173 in a Chromium-based browser. WebRTC support is required.
- Select `demo-avatar` from the avatar list.
- Select `mock` from the model selector.
- Click the microphone icon and begin speaking. The user interface streams transcripts, model output, synthesized audio, and rendered video frames in real time.
The mock backend returns a placeholder image for each audio chunk, allowing end-to-end validation of the pipeline before a real model is integrated.
5. Enable a talking-head model¶
Once the mock path has been verified, the system may be reconfigured to use a real talking-head model. The complete per-model weight download and startup procedures are documented in Models. The shortest paths are:
Lightweight lip-synchronization model suitable for a single NVIDIA 3090-class GPU. The preferred long-term deployment is a local, single-model backend; until the local Wav2Lip adapter is bundled, this quickstart uses OmniRT as the runnable compatibility path.
```shell
# Run from a separate terminal. OmniRT must be checked out next to opentalking/.
bash scripts/quickstart/start_omnirt_wav2lip.sh --device cuda
```
Add the following entry to .env:
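The exact variable name is documented in Models → Wav2Lip; a hypothetical sketch of the entry (the key name and port here are assumptions, not the documented values):

```shell
# Hypothetical: point OpenTalking at the OmniRT Wav2Lip endpoint.
# Check Models → Wav2Lip for the real variable name and port.
OPENTALKING_OMNIRT_ENDPOINT=http://127.0.0.1:8001
```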
Restart `start_all.sh` and select `wav2lip` in the model selector. For
China-friendly download alternatives, see
Models → Wav2Lip.
SoulX FlashTalk-14B end-to-end talking-head model, requiring an NVIDIA 4090 or A100-class GPU.
Set the OmniRT endpoint:
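As with Wav2Lip, the concrete variable name is listed in Models → FlashTalk; a hypothetical `.env` sketch (key name and port are assumptions):

```shell
# Hypothetical key name and port -- see Models → FlashTalk for the documented values.
OPENTALKING_OMNIRT_ENDPOINT=http://127.0.0.1:8001
```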
Select `flashtalk` in the model selector. For FlashTalk weight directories,
CUDA/Ascend startup, and domestic mirror links, see
Models → FlashTalk.
NPU evaluation is best done from source on the host CANN environment. Source the CANN environment first, then run `bash scripts/deploy_ascend_910b.sh`. See
From source → Ascend 910B.
6. Verify and shut down¶
Verify the running services:
The output reports the state of the unified server, frontend, and OmniRT. To stop all processes started by the quickstart:
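Based on the start script's location, the status and stop helpers likely live under `scripts/quickstart/` as well. A hypothetical sketch (the script names are assumptions; check the directory for the actual files):

```shell
# Hypothetical script names -- check scripts/quickstart/ for the actual files.
bash scripts/quickstart/status.sh    # report unified server / frontend / OmniRT state
bash scripts/quickstart/stop_all.sh  # stop everything the quickstart started
```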
Troubleshooting¶
The following table lists common installation issues and their resolutions.
| Symptom | Resolution |
|---|---|
| `ffmpeg: not found` during TTS decoding | Install ffmpeg. On macOS: `brew install ffmpeg`. On Debian/Ubuntu: `apt install ffmpeg`. |
| Language model returns HTTP 401 | Ensure `OPENTALKING_LLM_API_KEY` and `DASHSCOPE_API_KEY` are both set to the same DashScope key. |
| Browser reports WebRTC is unavailable | Use a Chromium-based browser. Safari requires `OPENTALKING_API_HOST=127.0.0.1` and a matching CORS origin. |
| Port 8000 is already in use | Override the bound ports: `bash scripts/quickstart/start_mock.sh --api-port 8010 --web-port 5180`. |
| OmniRT exits during startup | Inspect the log file referenced in the OmniRT script output (typically `~/logs/omnirt-wav2lip.log`). |
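For the port-conflict row, this snippet (Python standard library only) reports whether port 8000 is currently bound before you choose an override:

```shell
# Try to bind port 8000; a bind failure means another process holds it.
python3 - <<'EOF'
import socket
with socket.socket() as s:
    try:
        s.bind(("127.0.0.1", 8000))
        print("port 8000 is free")
    except OSError:
        print("port 8000 is in use")
EOF
```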
Next steps¶
- Configuration — reference for all environment variables and YAML fields.
- Models — end-to-end setup for each supported model backend.
- Deployment — multi-process deployment, Docker Compose, and production guidance.
- Architecture — system internals and event bus schema.
- API Reference — complete HTTP and WebSocket endpoint documentation.