Architecture

OpenTalking is an orchestration framework for real-time digital humans. It coordinates speech recognition, language models, text-to-speech synthesis, talking-head rendering, and WebRTC delivery within a unified session model, exposing a small HTTP and WebSocket surface on top.

This page describes the system-level architecture: components, deployment topologies, session lifecycle, the event bus, and the pluggable synthesis backend boundary.

Architecture Overview

Figure: OpenTalking current code architecture.

Components

```mermaid
flowchart TB
    subgraph Frontend
      Web[apps/web<br/>React + Vite]
    end
    subgraph "OpenTalking server"
      API[apps/api<br/>FastAPI routes<br/>Session state]
      Worker[apps/worker<br/>Pipeline driver]
      Bus[(Redis<br/>or in-memory)]
    end
    subgraph External
      LLM[Language model endpoint<br/>OpenAI-compatible]
      TTSExt[Text-to-speech<br/>Edge / DashScope / ElevenLabs]
      STT[Speech recognition<br/>DashScope Paraformer]
      Synth[(Synthesis backend<br/>mock / local / direct_ws / omnirt)]
    end

    Web -->|HTTP / SSE / WebRTC| API
    Web -.->|Audio stream WS| API
    API <-->|pub/sub| Bus
    Worker <-->|pub/sub| Bus
    Worker --> LLM
    Worker --> TTSExt
    Worker --> STT
    Worker --> Synth
```

| Component | Responsibility | Source |
| --- | --- | --- |
| apps/api | HTTP and WebSocket entry, session CRUD, WebRTC signaling, SSE event stream | apps/api/routes/ |
| apps/worker | Drives one or more sessions through the render pipeline | apps/worker/, opentalking/worker/ |
| apps/unified | Combines API, Worker, and an in-memory bus in one process; intended for development | apps/unified/ |
| opentalking/ library | Providers, adapters, interfaces, types, configuration, bus, and pipeline code | opentalking/ (flat layout) |
| Synthesis backend | Per-model runtime selected by the backend resolver: mock, local, direct_ws, or omnirt | opentalking/providers/synthesis/ |
| OmniRT (external) | Heavyweight multi-card GPU/NPU and remote inference when a model uses backend: omnirt | datascale-ai/omnirt |
| Redis (external) | Pub/sub bus between the API and one or more Workers; optional in unified mode | |
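
Everything between the API and a Worker travels over the bus, which is what lets the deployment topologies below swap the in-memory implementation for Redis without changing either side. The sketch below shows what that boundary could look like; the Bus protocol and method names are assumptions, and the real abstraction lives in opentalking/core/bus.py:

```python
import asyncio
from typing import AsyncIterator, Protocol


class Bus(Protocol):
    """Hypothetical pub/sub boundary between apps/api and apps/worker."""

    async def publish(self, channel: str, payload: dict) -> None: ...

    def subscribe(self, channel: str) -> AsyncIterator[dict]: ...


class InMemoryBus:
    """Single-process variant, as apps/unified would use (sketch only)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[asyncio.Queue]] = {}

    async def publish(self, channel: str, payload: dict) -> None:
        # Fan the payload out to every subscriber of this channel.
        for queue in self._subscribers.get(channel, []):
            await queue.put(payload)

    async def subscribe(self, channel: str) -> AsyncIterator[dict]:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers.setdefault(channel, []).append(queue)
        while True:
            yield await queue.get()
```

A Redis-backed class with the same two methods is all that split mode needs to substitute.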

Deployment topologies

```mermaid
flowchart LR
    subgraph "Topology A: Unified"
      direction TB
      U[apps/unified]
    end
    subgraph "Topology B: API + Worker"
      direction TB
      Aa[apps/api] --- Rr[(Redis)] --- Ww[apps/worker]
    end
    subgraph "Topology C: API + multiple Workers"
      direction TB
      Ab[apps/api] --- Rb[(Redis)]
      Rb --- Wb1[Worker 1]
      Rb --- Wb2[Worker 2]
      Rb --- Wb3[Worker N]
    end
```

Command-line specifics are documented in Deployment.

Session lifecycle

```mermaid
sequenceDiagram
    participant Browser
    participant API
    participant Worker
    participant LLM
    participant TTS
    participant Model as Synthesis backend
    participant Redis

    Browser->>API: POST /sessions (avatar_id, model)
    API->>Redis: publish session.created
    Worker->>Redis: subscribe session events
    Browser->>API: POST /sessions/{id}/webrtc/offer
    API->>Worker: SDP exchange
    Worker-->>Browser: WebRTC track established

    Browser->>API: POST /sessions/{id}/chat
    API->>Redis: publish chat.request
    Worker->>LLM: stream tokens
    LLM-->>Worker: token deltas
    Worker->>TTS: synthesize each sentence
    TTS-->>Worker: audio bytes
    Worker->>Model: drive with audio
    Model-->>Worker: video frames
    Worker-->>Browser: AV frames over WebRTC
    Worker->>Redis: publish events (transcript, llm, tts, frame)
    API->>Browser: forward via SSE

    Browser->>API: POST /sessions/{id}/interrupt
    API->>Redis: publish cancel
    Worker->>Worker: drain and return to idle
```
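
From the browser's side, the lifecycle above reduces to a handful of HTTP calls. A minimal sketch using the routes named in the diagram; the JSON payload and response fields are assumptions beyond the avatar_id and model parameters the diagram mentions:

```python
"""Walk one session through the lifecycle (sketch; requires httpx)."""
import httpx

BASE = "http://localhost:8000"  # hypothetical API address

with httpx.Client(base_url=BASE) as client:
    # 1. Create a session. avatar_id and model come from the diagram;
    #    the exact request/response schema is an assumption.
    session = client.post(
        "/sessions", json={"avatar_id": "demo", "model": "quicktalk"}
    ).json()
    sid = session["id"]  # assumed response field

    # 2. WebRTC signaling happens via POST /sessions/{id}/webrtc/offer,
    #    with an SDP offer from the browser's RTCPeerConnection (omitted).

    # 3. Ask the avatar to respond; frames arrive over WebRTC and
    #    transcript/llm/tts events over the SSE stream.
    client.post(f"/sessions/{sid}/chat", json={"text": "Hello there"})

    # 4. Barge-in: cancel whatever the worker is currently rendering.
    client.post(f"/sessions/{sid}/interrupt")
```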

Event bus

The bus carries one logical channel per session and a global control channel. Events are JSON-encoded; schemas are defined in opentalking/core/types/events.py and opentalking/events/schemas.py.

| Event | Producer | Consumer |
| --- | --- | --- |
| session.created / session.terminated | API | Worker |
| chat.request / speak.request | API | Worker |
| cancel | API | Worker |
| transcript | Worker | API → SSE |
| llm (token deltas) | Worker | API → SSE |
| tts (lifecycle markers) | Worker | API → SSE |
| frame (video frame timing) | Worker | API → SSE |
| error | Worker | API → SSE |

In unified mode the bus is in-memory; in split mode the bus uses Redis pub/sub.
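
In split mode this means any Redis client can tap a session's events for debugging. A sketch with redis-py; the channel name and payload fields shown are assumptions, and the authoritative schemas are the two modules above:

```python
"""Print every event a session emits (sketch; requires redis-py)."""
import json

import redis

r = redis.Redis()
pubsub = r.pubsub()
pubsub.subscribe("session:1234")  # hypothetical per-session channel name

for message in pubsub.listen():
    if message["type"] != "message":
        continue  # skip subscribe confirmations
    event = json.loads(message["data"])
    # e.g. {"type": "llm", "delta": "Hel"} or {"type": "tts", "state": "start"}
    print(event.get("type"), event)
```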

Boundaries

The following concerns are handled outside the OpenTalking core:

| Concern | Location |
| --- | --- |
| Model execution and weight loading | Selected synthesis backend (local, direct_ws, or omnirt) |
| GPU/NPU scheduling, batching, multi-card orchestration | OmniRT when backend: omnirt; otherwise the selected backend's runtime |
| Language model hosting | User-selected endpoint (DashScope, OpenAI, vLLM, Ollama, etc.) |
| Text-to-speech hosting | User-selected provider (or local Edge TTS) |
| User authentication and account management | Out of scope; integrate via an upstream gateway |
| WebRTC TURN | Out of scope; deploy coturn or an equivalent |

The apps/ entry points depend on the backend resolver rather than hard-coded model sets. Lightweight models can remain in-process through backend: local, single-model services can use backend: direct_ws, and production-grade heavy models can use OmniRT without changing the session API.
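
Concretely, that boundary can be pictured as one dispatch on a model's backend field. Everything in this sketch is a hypothetical stand-in (only the four backend names come from this page); the real selection logic lives in opentalking/core/model_config.py and opentalking/providers/synthesis/:

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Stand-in for the per-model config in opentalking/core/model_config.py."""

    name: str
    backend: str  # "mock" | "local" | "direct_ws" | "omnirt"
    endpoint: str | None = None


class MockSynthesizer:
    """Test double; produces no real frames."""


class LocalAdapter:
    """In-process model found via the local registry (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name


class DirectWSClient:
    """Client for a single-model WebSocket service (hypothetical)."""

    def __init__(self, endpoint: str | None) -> None:
        self.endpoint = endpoint


class OmniRTClient:
    """Hands heavy models to an external OmniRT deployment (hypothetical)."""

    def __init__(self, endpoint: str | None) -> None:
        self.endpoint = endpoint


def resolve_backend(cfg: ModelConfig):
    """Single dispatch point; the session API never branches on the model."""
    if cfg.backend == "mock":
        return MockSynthesizer()
    if cfg.backend == "local":
        return LocalAdapter(cfg.name)
    if cfg.backend == "direct_ws":
        return DirectWSClient(cfg.endpoint)
    if cfg.backend == "omnirt":
        return OmniRTClient(cfg.endpoint)
    raise ValueError(f"unknown backend: {cfg.backend!r}")
```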

Library layout (opentalking/)

```text
opentalking/
├── core/
│   ├── interfaces/        # Protocols (ModelAdapter, TTSAdapter, ...)
│   ├── types/             # AudioChunk, VideoFrameData, SessionState
│   ├── config.py          # Settings model (environment to typed configuration)
│   ├── model_config.py    # Per-model runtime config and backend selection
│   └── bus.py             # Pub/sub abstraction (Redis or in-memory)
├── providers/
│   ├── llm/               # OpenAI-compatible streaming clients
│   ├── stt/               # DashScope Paraformer realtime adapter
│   ├── tts/               # Edge, DashScope, CosyVoice, ElevenLabs
│   ├── rtc/               # aiortc WebRTC adapter
│   └── synthesis/         # Backend resolver, availability, OmniRT/direct WS clients
├── models/
│   ├── registry.py        # @register_model and local adapter lookup
│   └── quicktalk/         # Local talking-head adapter (reference implementation)
├── pipeline/
│   ├── session/           # Session runner
│   ├── speak/             # LLM -> TTS -> synthesis pipeline
│   └── recording/         # Recording and offline export
├── runtime/               # Worker process glue, bus, timing
├── avatar/                # Avatar bundle loader and validator
├── voice/                 # Voice cloning catalog
├── media/                 # Frame/avatar media utilities
└── events/                # Event emitter and API event schemas
```
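
To make the registry and interface layers concrete: registering a new local adapter could look roughly like the sketch below. The names register_model and ModelAdapter appear in the layout above, but the decorator implementation and method signatures here are assumptions:

```python
"""Sketch of local-adapter registration (all signatures assumed)."""
from typing import AsyncIterator

_REGISTRY: dict[str, type] = {}


def register_model(name: str):
    """Hypothetical stand-in for @register_model in models/registry.py."""

    def wrap(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls

    return wrap


@register_model("mytalker")
class MyTalkerAdapter:
    """Consumes TTS audio chunks and yields rendered video frames."""

    async def warmup(self) -> None:
        """Load weights once per worker (assumed lifecycle hook)."""

    async def drive(self, audio: bytes) -> AsyncIterator[bytes]:
        # Real adapters would yield VideoFrameData (see core/types/);
        # plain bytes keeps this sketch self-contained.
        yield b"\x00" * 16  # placeholder frame
```

Adapters registered this way are what the backend: local path looks up, which is how the quicktalk reference adapter runs in-process.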

Further reading

- Render pipeline — stage-by-stage description of the audio-to-video chain.
- Model adapter — integration interface for new synthesis backends.
- Developing — repository layout, test workflow, and debugging techniques.