Skip to content

Benchmark

This page explains how OpenTalking records end-to-end experience metrics and how it references inference baselines from external model backends. OpenTalking is the orchestration layer, so benchmark data is separated into two categories:

Type Directly owned by OpenTalking Examples
End-to-end experience metrics Yes First-frame latency, TTS first packet, event stream, WebRTC playback, audio/video sync.
Model inference baseline No, provided by the selected backend OmniRT FlashTalk, Wav2Lip, QuickTalk local adapter render throughput.

The current content follows the benchmark conventions from docs/zh/benchmark on the main branch.

Running the Full E2E Benchmark

Use the following script for the full end-to-end benchmark:

scripts/run_opentalking_e2e_benchmark.sh

This script reads input assets according to the benchmark configuration, starts the relevant services, and collects results.

Enter OpenTalking:

cd /root/test/opentalking
source .venv/bin/activate

Prepare script permissions:

chmod +x scripts/run_opentalking_e2e_benchmark.sh
chmod +x scripts/start_unified.sh
chmod +x scripts/quickstart/start_omnirt_quicktalk.sh

Confirm default benchmark inputs exist:

ls -lh configs/benchmark/input/reference.png
ls -lh configs/benchmark/input/ttsmaker-file.mp3

To replace the test avatar or audio, simply replace the two files above, or modify the input paths in configs/benchmark/opentalking-e2e.yaml. For general deployment verification, the repository's built-in benchmark inputs are sufficient.

Set low-VRAM environment variables:

export CUDA_VISIBLE_DEVICES=0
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
export OPENTALKING_BENCHMARK_PYTHON="$PWD/.venv/bin/python"

export OPENTALKING_QUICKTALK_HUBERT_DEVICE=cpu
export OPENTALKING_QUICKTALK_RESOLUTION=160
export OPENTALKING_PREWARM_AVATARS=0

export OMNIRT_QUICKTALK_RUNTIME=1
export OMNIRT_QUICKTALK_DEVICE=cuda:0
export OMNIRT_QUICKTALK_HUBERT_DEVICE=cpu
export OMNIRT_QUICKTALK_BATCH_SIZE=1
export OMNIRT_QUICKTALK_WORKER_CACHE_MAX=1

Run the benchmark:

bash scripts/run_opentalking_e2e_benchmark.sh \
  --tester xxx \
  --model quicktalk \
  --backend omnirt \
  --gpu-index 0 \
  --timeout 300

Find results:

find /root/test/opentalking -name "result.json" -o -name "result.csv" -o -name "report.md" -o -name "*.tar.gz"

Notes

run_opentalking_e2e_benchmark.sh is the full end-to-end entry point. It is more suitable for final deployment verification than running model benchmarks individually, as it covers OpenTalking, OmniRT, QuickTalk runtime, input processing, service startup, request pipeline, and result statistics.


WSL2 VRAM Statistics Fix

On WSL2, the following command may not return per-process VRAM usage:

nvidia-smi --query-compute-apps=pid,used_memory

As a result, the benchmark may show:

idle VRAM: 0.0
peak inference VRAM: 0.0

Recommended approach: when PID-level queries return empty, fall back to full-GPU VRAM:

nvidia-smi --id=0 --query-gpu=memory.used --format=csv,noheader,nounits

Calculation:

peak inference VRAM = max(current memory.used - baseline memory.used)

Notes:

  • This is not per-process VRAM;
  • This is the delta of full-GPU VRAM relative to baseline during the benchmark run;
  • Do not run other CUDA programs during the benchmark.

Metrics

Metric Meaning Owner
session_create_ms Time from session creation request to API response. OpenTalking
asr_partial_latency_ms Latency from user speech to the first partial transcript. OpenTalking + STT provider
llm_first_token_ms Latency from text request to first LLM token. OpenTalking + LLM endpoint
tts_first_pcm_ms Latency from sentence submission to first PCM/audio bytes. OpenTalking + TTS provider
avatar_first_frame_ms Latency from audio submission to first available avatar frame. OpenTalking + synthesis backend
render_fps Video-frame generation throughput of the synthesis backend. synthesis backend
webrtc_first_frame_ms Time until the browser receives the first playable video frame. OpenTalking + WebRTC
av_drift_ms Audio/video timeline offset during playback. OpenTalking
queue_depth Worker or external model-service queue depth. OpenTalking / backend
steady_chunk_ms Steady-state chunk inference time. synthesis backend

Tested Combinations

Path Hardware / state Data Notes
Wav2Lip quickstart NVIDIA 3090 path singer example around 28 frames / 0.83-0.85s, about 33 FPS From README quickstart notes; useful as a lightweight model reference.
QuickTalk local adapter RTX 3090 720x900 / 25fps, about 35 FPS, about 3.8 GiB GPU memory From README consumer-GPU reference.
FlashTalk via OmniRT Ascend 910B2 x8, warm full-audio 937 frames / 37.377s, about 25 FPS External OmniRT/model-service baseline, not direct OpenTalking inference.
FlashTalk steady chunk Ascend 910B2 x8, warm chunk 29-frame chunk around 30 FPS equivalent External inference baseline; should be separated from end-to-end first-response latency.

FPS

FPS should be split into:

  • render_fps: frame-generation throughput of the model or synthesis backend.
  • Playback FPS: actual browser or WebRTC playback frame rate.

High model FPS does not guarantee a good end-to-end experience. TTS, queueing, network, WebRTC, and browser decoding also matter.

First-frame Latency

Record first-frame latency in stages:

  • session_create_ms
  • tts_first_pcm_ms
  • avatar_first_frame_ms
  • webrtc_first_frame_ms

One single “first-frame latency” number is not enough to locate bottlenecks.

Startup Time

Always label startup state:

  • Cold start: process startup, model load, weight load, avatar preprocessing, and cache build.
  • Warm state: model and cache are ready.
  • Steady chunk: continuous generation throughput after initialization.

End-to-end Latency

End-to-end latency should be measured from user input to visible browser output. Text, speech, and uploaded-audio tests have different starting points, so record the exact boundary.

Resource Usage

Record GPU/NPU model, driver version, peak and steady memory usage, CPU limits, model version, quantization, caching, and warmup state.

Test Method

QuickTalk Local Adapter

Terminal
source .venv/bin/activate
python apps/cli/quicktalk_bench.py \
  --asset-root /path/to/quicktalk/assets \
  --template-video /path/to/template.mp4 \
  --audio /path/to/input.wav \
  --output outputs/benchmarks/quicktalk-output.mp4 \
  --device cuda:0

The output JSON includes:

  • init_seconds
  • audio_feature_seconds
  • first_frame_seconds
  • render_seconds
  • render_fps
  • mux_seconds

OpenTalking End-to-end Flow

Terminal
curl -fsS http://127.0.0.1:8000/health
curl -fsS http://127.0.0.1:8000/models | jq

Record the OpenTalking commit, non-secret config, hardware, selected avatar_id, model, backend, input audio, first token, TTS first packet, avatar first frame, browser first frame, and audio/video sync result.

External Model Services

OmniRT, FlashHead direct WebSocket, or other model-service data should be generated by their own benchmark tools. OpenTalking documentation only references those results and records OpenTalking-side orchestration, queueing, and playback behavior.

Result Template

### <model> / <backend> / <hardware> / <date>

- OpenTalking commit:
- backend commit or service version:
- hardware:
- model and weights:
- avatar:
- input audio:
- cold start or warm state:
- `session_create_ms`:
- `llm_first_token_ms`:
- `tts_first_pcm_ms`:
- `avatar_first_frame_ms`:
- `webrtc_first_frame_ms`:
- `render_fps`:
- `av_drift_ms`:
- notes:

How to Interpret Results

  • For user experience, prioritize first response and audio/video sync, not only model FPS.
  • For model-service throughput, prioritize steady chunks and queue depth, not only one cold run.
  • External backend benchmarks must be clearly labeled as external.
  • A Mock run only proves orchestration works; it does not prove talking-head performance.

Benchmark Results Reference

Key metrics to focus on:

Metric RTX 3050 Laptop Reference Meaning
Output resolution 540×900 / 25fps Final output spec
Cold start 6.0 s Service and model initialization time
Warmup 20.8 s First load and inference preparation time
TTFA 1661 ms Time to first audio
TTFV 2833 ms Time to first video frame
First-turn total latency 4109 ms User input to first-turn response completion
Steady FPS 19.1 Steady-state video generation frame rate
RTF 1.17 Greater than 1 means slightly slower than real-time
VRAM usage 1.4 GiB VRAM after WSL2 fallback

Conclusion: RTX 3050 Laptop can run the QuickTalk pipeline, but real-time performance is limited. It is suitable for deployment verification and feature demos; for stable 25fps+, use RTX 3060 / 4060 or higher.


Test Results

Date Model Technique Backend Hardware OS Driver commit (opentalking + omnirt) Input Resolution FPS Chunk size Cold start/s Warmup/s TTFA/ms TTFV/ms First-turn total/ms Steady FPS Idle VRAM/GB Peak VRAM/GB
2026/5/20 wav2lip mouth inpainting omnirt RTX 3090 Linux x86_64 glibc2.31 driver 570.133.07 a3047eab + 64c92ed1 audio+image 498×832 30 933ms 4.096 12.043 1374.507 1625.962 3002.526 37.269 7.928 7.928
2026/5/20 quicktalk mouth inpainting omnirt RTX 3090 Linux x86_64 glibc2.31 driver 570.133.07 a3047eab + 64c92ed1 audio+image 540×900 25 1120ms 5.702 17.856 1551.773 1800.524 3356.019 29.23 1.662 1.662
2026/5/20 musetalk mouth inpainting omnirt RTX 3090 Linux x86_64 glibc2.31 driver 570.133.07 a3047eab + 64c92ed1 audio+image 512×512 25 1000ms 21.927 10.233 1464.464 1769.484 3235.518 28.868 5.078 5.078
2026/5/22 wav2lip mouth inpainting omnirt RTX 4090 Linux x86_64 glibc2.39 driver 570.211.01 f16f7868 + 9a35e675 audio+image 498×832 30 933ms 4.23 27.321 1730.871 1955.629 3689.764 31.542 8.133 8.133
2026/5/22 quicktalk mouth inpainting omnirt RTX 4090 Linux x86_64 glibc2.39 driver 570.211.01 f16f7868 + 9a35e675 audio+image 540×900 25 1120ms 4.319 15.871 1493.164 1064.825 2561.146 46.921 1.838 1.838
2026/5/22 musetalk mouth inpainting omnirt RTX 4090 Linux x86_64 glibc2.39 driver 570.211.01 f16f7868 + 9a35e675 audio+image 512×512 25 1000ms 18.309 13.866 1506.636 2095.522 3605.564 24.767 5.203 5.203
2026/5/22 wav2lip mouth inpainting omnirt NPU 910B2 Linux aarch64 glibc2.35 cann driver f3532c19 + 5f24f56f audio+image 498×832 30 933ms 9.478 35.931 1401.98 2615.322 4019.564 23.945 9.113 9.113
2026/5/22 quicktalk mouth inpainting omnirt NPU 910B2 Linux aarch64 glibc2.35 cann driver f3532c19 + 5f24f56f audio+image 540×900 25 1120ms 9.471 39.142 1427.894 1782.861 3212.053 29.66 2.473 2.473
2026/5/22 musetalk mouth inpainting omnirt NPU 910B2 Linux aarch64 glibc2.35 cann driver f3532c19 + 5f24f56f audio+image 512×512 25 1000ms 27.177 65.282 1566.821 4211.721 5781.453 12.276 8.754 8.754
2026/5/27 quicktalk mouth inpainting omnirt RTX 3050 Laptop WSL2 glibc2.35 driver 581.57 3c893c52 + 5f24f56f audio+image 540×900 25 1120ms 5.98 20.77 1661 2833 4109 19.06 1.41 1.41
2026/5/27 quicktalk mouth inpainting omnirt RTX 3050 Laptop WSL2 glibc2.35 driver 581.57 3c893c52 + 5f24f56f audio+image 306×512 25 1120ms 6.282 20.78 1580.28 2661 4243.26 20.695 1.385 1.396