OpenTalking

The open-source orchestration layer for real-time digital humans.

OpenTalking is not a talking-head model. It is the layer that integrates a talking-head model with everything else a production conversational digital human requires: streaming speech recognition, large language models, text-to-speech synthesis, WebRTC delivery, and per-session control. Plug in the model and provider combination that fits the deployment; the orchestration contract stays the same.

The documentation site defaults to Chinese; the English version is available at https://datascale-ai.github.io/opentalking/en/.


What OpenTalking is for

Building a digital human application that talks and listens in real time involves roughly a dozen moving parts: speech recognition with end-pointing, a streaming language model client, sentence-level text-to-speech synthesis, audio decoding, talking-head rendering, WebRTC track management, barge-in handling, and session state. OpenTalking implements all of these as a single FastAPI service, exposes a small REST and WebSocket interface, and delegates synthesis to the configured model backend for each session.

If the question is "I have a wav2lip checkpoint, how do I serve a real-time chat experience on top?" — OpenTalking is the answer. If the question is "how should the model itself run?", choose a backend: local adapter, direct WebSocket service, OmniRT, or a mock path for tests.
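
The sketch below illustrates the interaction shape only; the endpoint paths, port, and payload fields are hypothetical placeholders, and the actual routes are documented in the API Reference. A client typically creates a session over REST, then attaches to it over WebSocket or WebRTC for events and media.

terminal
# Hypothetical routes and port, for illustration only; see the API Reference for the real ones.
curl -X POST http://localhost:8000/api/sessions \
  -H "Content-Type: application/json" \
  -d '{"avatar": "demo-avatar", "model": "mock"}'
# The returned session id is then used for the WebSocket attach and WebRTC signaling steps.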

Key capabilities

  • Realtime conversation pipeline

    ASR, LLM, TTS, talking-head rendering, and WebRTC delivery in one interruptible session loop.

  • Pluggable model backends

    Resolve synthesis by model + backend: mock, local, direct_ws, or omnirt.

  • Unified API

    REST, SSE, WebSocket, and WebRTC signaling are exposed through one FastAPI service.

  • Replaceable providers

    Use OpenAI-compatible LLM endpoints and switch TTS across Edge, DashScope, CosyVoice, or ElevenLabs.

  • Avatar and voice assets

    Manage avatar bundles, custom portraits, voice catalog entries, and cloned voices.

  • Flexible deployment

    Run as a single unified process, as a split API/Worker pair, via Docker Compose, or against remote GPU/NPU model services.

Pick your starting point

  • Quickstart


    Five-minute walkthrough from source checkout to a working end-to-end session using the mock synthesis path.

    Quickstart →

  • Configuration


    Reference for every environment variable and YAML field, with default values and precedence rules.

    Configuration →

  • Deployment


    Deployment topologies covering single-process, API/Worker split, Docker Compose, and Ascend 910B.

    Deployment →

  • API Reference


    Complete REST, Server-Sent Events, and WebSocket reference for all endpoints.

    API Reference →

  • Model Adapter


    Integration guide for adding a new talking-head model to OpenTalking.

    Model Adapter →

  • Architecture


    System architecture, session lifecycle, and event bus reference.

    Architecture →

Minimal example

terminal
git clone https://github.com/datascale-ai/opentalking.git
cd opentalking

uv sync --extra dev --python 3.11
source .venv/bin/activate
cp .env.example .env

# Configure OPENTALKING_LLM_API_KEY and DASHSCOPE_API_KEY in .env, then:
bash scripts/quickstart/start_mock.sh
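
For reference, a minimal .env for the quickstart might contain just the two keys named above; the values are placeholders, and any other variables keep the defaults from .env.example.

.env
# Placeholder values; use real keys from your LLM provider and DashScope account.
OPENTALKING_LLM_API_KEY=sk-your-llm-key
DASHSCOPE_API_KEY=sk-your-dashscope-key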

After startup, open http://localhost:5173, select the demo-avatar and mock model, and initiate a conversation. To enable real talking-head synthesis, configure the selected model's backend, for example:

configs/default.yaml
models:
  wav2lip:
    backend: omnirt
  quicktalk:
    backend: local
  flashhead:
    backend: direct_ws

OMNIRT_ENDPOINT is required only for models using backend: omnirt; the client-side workflow remains unchanged.
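
As an illustrative sketch, enabling the omnirt backend adds only that one variable; the value below is a placeholder, and the actual scheme, host, and path depend on where the OmniRT model service is deployed.

.env
# Placeholder endpoint; point this at your OmniRT model service.
OMNIRT_ENDPOINT=ws://gpu-host:9000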

System architecture

flowchart LR
    Browser([Browser])
    API[FastAPI<br/>HTTP / WS / WebRTC]
    Worker[Pipeline driver<br/>LLM &rarr; TTS &rarr; synthesis]
    Backend[(Synthesis backend<br/>local / direct_ws / OmniRT)]
    LLM[(LLM endpoint<br/>OpenAI-compatible)]
    TTS[(TTS<br/>Edge / DashScope / ElevenLabs)]

    Browser -->|HTTP / SSE / WebRTC| API
    API <-->|Redis or in-memory bus| Worker
    Worker --> LLM
    Worker --> TTS
    Worker --> Backend

The complete system view — components, deployment topologies, session lifecycle, and event bus schema — is documented in Architecture.

Where OpenTalking fits

Concern                                    Handled by
Session lifecycle and state                OpenTalking
HTTP and WebSocket API                     OpenTalking
WebRTC signaling and tracks                OpenTalking
Speech recognition (via DashScope)         OpenTalking
Sentence-level streaming pipeline          OpenTalking
Barge-in propagation                       OpenTalking
Voice catalog and cloning                  OpenTalking
Avatar bundle management                   OpenTalking
Talking-head model weights and inference   Synthesis backend
GPU and NPU scheduling and batching        Synthesis backend (when the backend supports it)
Chat completion inference                  Hosted LLM
Speech synthesis                           TTS provider

Use cases

  • Conversational assistants with a visual avatar for customer support, sales, or onboarding.
  • Live broadcast applications where the avatar responds to viewer interactions in real time.
  • Educational software that requires both speech recognition and synthesized speech feedback.
  • Internal productivity tools that pair a corporate language model with a digital human interface.
  • Research and prototyping for talking-head generation, where the orchestration layer is needed but not the focus of the work.

License

OpenTalking is released under the Apache License, Version 2.0. Talking-head model weights and external model services are governed by their individual licenses; consult the respective model repositories or backend deployments for details.