Local STT/TTS + QuickTalk¶
This page describes a single-machine path for private validation:
- STT: local
SenseVoiceSmall, CPU by default. - TTS: local
Fun-CosyVoice3-0.5B-2512, served through thelocal_cosyvoiceservice. - Video: local
QuickTalk, CUDA by default. - LLM: still configured through an OpenAI-compatible endpoint. If you already operate a local LLM server, point
OPENTALKING_LLM_BASE_URLto that service.
This path keeps the existing OpenTalking /sessions/*, /tts/preview, and session runner protocols unchanged. The frontend only chooses the local or API provider before a session starts. If an API provider is selected and its key is missing, startup fails before entering the digital-human session; OpenTalking does not silently fall back to local or cloud providers.
Hardware Guidance¶
| Component | Default placement | Guidance |
|---|---|---|
| SenseVoiceSmall | CPU | Usually enough for short utterances and saves VRAM. |
| Fun-CosyVoice3-0.5B-2512 | cuda:0 |
12GB VRAM is recommended; use API TTS first on 8GB machines. |
| QuickTalk | cuda:0 |
Watch peak VRAM and first-turn warmup when sharing the GPU with local TTS. |
Install Dependencies¶
cd "$DIGITAL_HUMAN_HOME/opentalking"
uv sync --extra dev --extra models --extra local-audio --extra quicktalk-cuda --python 3.11
source .venv/bin/activate
The OpenTalking main venv runs the API, SenseVoice, and QuickTalk and keeps the
project transformers>=4.57,<6 dependency. Do not install the CosyVoice runtime
into this venv.
Download Local Audio Models¶
Do not commit model weights. The download helper uses ModelScope for these models by default; configure a Hugging Face mirror only for HF-backed models.
python scripts/download_local_audio_models.py \
--root ./avatar_models/local-audio \
--model sensevoice-small \
--model fun-cosyvoice3-0.5b-2512
Expected layout:
Prepare QuickTalk Weights¶
Place QuickTalk weights, HuBERT files, and InsightFace dependencies as described in QuickTalk Local Deployment:
The key setting is OPENTALKING_QUICKTALK_ASSET_ROOT, which must point to the directory containing checkpoints/.
Prepare the CosyVoice Runtime¶
The recommended local_cosyvoice shape is a standalone Python service. Runtime source should stay outside git-tracked files; placing it under the model directory is fine:
mkdir -p ./avatar_models/local-audio/runtime
git clone https://github.com/FunAudioLLM/CosyVoice.git ./avatar_models/local-audio/runtime/CosyVoice
cd ./avatar_models/local-audio/runtime/CosyVoice
git submodule update --init --recursive
Create the dedicated CosyVoice sidecar venv after the runtime checkout is ready:
cd "$DIGITAL_HUMAN_HOME/opentalking"
OPENTALKING_COSYVOICE_VENV_DIR=.venv-cosyvoice \
bash scripts/prepare_cosyvoice_venv.sh
The sidecar venv is only for scripts/local_cosyvoice_service.py and the
CosyVoice runtime. It pins transformers==4.51.3 and stays separate from the
OpenTalking main .venv.
.env Example¶
# LLM: separate module. Use DashScope, OpenAI, vLLM, Ollama, or a local OpenAI-compatible service.
OPENTALKING_LLM_PROVIDER=openai_compatible
OPENTALKING_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENTALKING_LLM_API_KEY=<llm-key>
OPENTALKING_LLM_MODEL=qwen-flash
# STT: local SenseVoiceSmall
OPENTALKING_STT_DEFAULT_PROVIDER=sensevoice
OPENTALKING_STT_ENABLED_PROVIDERS=sensevoice,dashscope
OPENTALKING_STT_SENSEVOICE_MODEL=iic/SenseVoiceSmall
OPENTALKING_STT_SENSEVOICE_MODEL_DIR=./avatar_models/local-audio/iic__SenseVoiceSmall
OPENTALKING_STT_SENSEVOICE_DEVICE=cpu
# TTS: local CosyVoice3
OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL=FunAudioLLM/Fun-CosyVoice3-0.5B-2512
OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL_DIR=./avatar_models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512
OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR=./avatar_models/local-audio/runtime/CosyVoice
OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL=http://127.0.0.1:19090/synthesize
OPENTALKING_TTS_LOCAL_COSYVOICE_DEVICE=cuda:0
OPENTALKING_COSYVOICE_VENV_DIR=./.venv-cosyvoice
# Video: QuickTalk local
OPENTALKING_DEFAULT_MODEL=quicktalk
OPENTALKING_QUICKTALK_BACKEND=local
OPENTALKING_QUICKTALK_ASSET_ROOT=./avatar_models/quicktalk
OPENTALKING_QUICKTALK_WORKER_CACHE=1
OPENTALKING_TORCH_DEVICE=cuda:0
If users can switch the frontend to API STT/TTS, configure the provider-specific keys explicitly:
OPENTALKING_STT_DASHSCOPE_API_KEY=<dashscope-stt-key>
OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-tts-key>
Startup Order¶
Start the local TTS service first:
Start OpenTalking:
To set ports explicitly:
bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8210 --web-port 5280
Verification¶
curl -fsS http://127.0.0.1:19090/health
curl -fsS http://127.0.0.1:8000/health
curl -fsS http://127.0.0.1:8000/api/runtime/status
curl -s http://127.0.0.1:8000/models | jq '.statuses[] | select(.id=="quicktalk")'
Expected state:
stt_providerissensevoice.tts_providerislocal_cosyvoice.quicktalk_backendislocal./modelsreportsquicktalk.connected=trueandreason=local_runtime.
In the frontend, select Local SenseVoiceSmall, Local CosyVoice3-0.5B-2512, a
shared avatar, and the quicktalk model, then test:
- Text input:
LLM -> local_cosyvoice -> QuickTalk -> WebRTC. - Microphone input:
SenseVoiceSmall -> LLM -> local_cosyvoice -> QuickTalk -> WebRTC. /tts/preview: confirm that local system voices and cloned voices play.
Notes¶
*_DEFAULT_PROVIDERonly controls the default selection. It is not a failure fallback chain.- LLM, STT, and TTS keys are independent.
DASHSCOPE_API_KEYdoes not automatically configure any module. - The CosyVoice3 service returns audio as a stream, but first-chunk latency still depends on model inference and warmup.
- On 8GB VRAM machines, keep
SenseVoiceSmall CPU + QuickTalk localand use DashScope or Edge TTS first if local TTS is slow or OOMs. - Weights, runtime checkouts, avatar caches, and logs are deployment artifacts and should not be committed.