跳转至

语音合成

TTS 将 LLM 输出转为音频,并驱动 talking-head backend。首次评估建议使用 Edge TTS;需要生产音色、复刻或更高质量时,再切换 provider。

Provider 选项

Provider 适用场景 必要配置
edge 首次运行、CPU 评估、无需 API key OPENTALKING_TTS_DEFAULT_PROVIDER=edge
dashscope 中文实时 TTS 与声音复刻 OPENTALKING_TTS_DASHSCOPE_API_KEY 及 DashScope TTS 配置
local_cosyvoice 本地中文 TTS、内置音色和复刻音色 CosyVoice3 权重和本地 service URL
indextts IndexTTS2 可控配音、情绪控制和复刻音色 OPENTALKING_TTS_INDEXTTS_BACKEND=localomnirt
cosyvoice 自托管 CosyVoice 服务 CosyVoice WebSocket URL/settings
elevenlabs 托管多语言音色 ElevenLabs API key 和 voice id

indextts 是 OpenTalking 对外暴露的唯一 provider 名称。部署时可以让它走同机 HTTP sidecar 的 local backend,也可以走 OmniRT 常驻 backend。这和视频驱动模型的 backend 选择类似:OpenTalking 侧始终选择 IndexTTS,部署者只在环境变量中切换运行 backend。

Edge TTS 默认配置

.env
OPENTALKING_TTS_DEFAULT_PROVIDER=edge
OPENTALKING_TTS_EDGE_VOICE=zh-CN-XiaoxiaoNeural

Edge TTS 仍需要 ffmpeg,因为 OpenTalking 会将 provider 音频解码为 PCM 再送入合成 backend。

DashScope Qwen realtime TTS

.env
OPENTALKING_TTS_DEFAULT_PROVIDER=dashscope
OPENTALKING_TTS_DASHSCOPE_API_KEY=<dashscope-api-key>
OPENTALKING_TTS_DASHSCOPE_MODEL=qwen3-tts-flash-realtime
OPENTALKING_QWEN_TTS_REUSE_WS=1

DashScope TTS 不读取 OPENTALKING_LLM_API_KEYDASHSCOPE_API_KEY;即使和 LLM 使用同一把实际 key,也要显式写入 OPENTALKING_TTS_DASHSCOPE_API_KEY

本地 CosyVoice3 0.5B

推荐把 CosyVoice 作为独立本地服务启动,再由 OpenTalking 的 local_cosyvoice provider 通过 HTTP 读取 PCM 音频流。

.env
OPENTALKING_TTS_DEFAULT_PROVIDER=local_cosyvoice
OPENTALKING_TTS_ENABLED_PROVIDERS=local_cosyvoice,dashscope,edge
OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL=FunAudioLLM/Fun-CosyVoice3-0.5B-2512
OPENTALKING_TTS_LOCAL_COSYVOICE_MODEL_DIR=./models/local-audio/FunAudioLLM__Fun-CosyVoice3-0.5B-2512
OPENTALKING_TTS_LOCAL_COSYVOICE_RUNTIME_DIR=./models/local-audio/runtime/CosyVoice
OPENTALKING_TTS_LOCAL_COSYVOICE_SERVICE_URL=http://127.0.0.1:19090/synthesize
OPENTALKING_TTS_LOCAL_COSYVOICE_DEVICE=cuda:0

下载本地音频权重:

终端
export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_LINK_MODE=copy
uv sync --extra dev --extra models --extra local-audio --extra local-cosyvoice-service --python 3.11
.venv/bin/python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model fun-cosyvoice3-0.5b-2512

准备 CosyVoice runtime:

终端
mkdir -p ./models/local-audio/runtime
git clone https://github.com/FunAudioLLM/CosyVoice.git ./models/local-audio/runtime/CosyVoice
cd ./models/local-audio/runtime/CosyVoice
git submodule update --init --recursive

启动本地 TTS service:

终端
OPENTALKING_TTS_LOCAL_COSYVOICE_PRELOAD=1 \
python scripts/local_cosyvoice_service.py --host 127.0.0.1 --port 19090

完整本地语音输入、语音合成和 QuickTalk 视频链路见 本地 STT/TTS + QuickTalk

IndexTTS 部署(provider = indextts)

IndexTTS 在 OpenTalking 侧始终使用 provider=indexttsOPENTALKING_TTS_INDEXTTS_BACKEND 只决定运行拓扑:local 表示同机 HTTP sidecar,omnirt 表示 OmniRT 常驻服务。前端、API payload 和复刻音色元数据都不需要拆成两个 provider。

方式 A:同机 HTTP sidecar(backend = local)

local backend 采用独立的同机 HTTP sidecar:IndexTTS2 在单独 venv 里加载,OpenTalking API 进程只通过 OPENTALKING_TTS_LOCAL_INDEXTTS_SERVICE_URL 消费 audio/L16 PCM。不要把 index-tts 直接安装到 OpenTalking 主 .venv;IndexTTS 官方依赖会固定 torchtransformersprotobuf 等版本,容易破坏 QuickTalk / STT 依赖。

先安装 OpenTalking 主环境和本地音频下载依赖。这个环境不要安装 index-tts

终端
cd "$OPENTALKING_HOME"
export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_LINK_MODE=copy
export HF_ENDPOINT="${HF_ENDPOINT:-https://hf-mirror.com}"

uv sync --extra dev --extra models --extra local-audio --python 3.11
.venv/bin/python scripts/download_local_audio_models.py \
  --root ./models/local-audio \
  --model indextts2 \
  --model indextts2-w2v-bert \
  --model indextts2-maskgct \
  --model indextts2-campplus \
  --model indextts2-bigvgan
Hugging Face/Xet 大文件下载偶尔会出现 Read timed outSSL record layer failure 等日志;脚本会尝试断点续传。只要命令最终退出码为 0 即可继续。如果命令失败,保持已下载目录不变,重新执行同一条命令即可复用缓存并继续下载。

如果机器上已经有同样结构的权重目录,也可以不重复下载,只把 OPENTALKING_LOCAL_AUDIO_MODEL_ROOT 指向已有目录。该目录至少需要包含 IndexTeam__IndexTTS-2/config.yamlfacebook__w2v-bert-2.0funasr__campplusnvidia__bigvgan_v2_22khz_80band_256x。MaskGCT 目录优先使用 amphion__MaskGCT;如果已有资产目录叫 amphion__MaskGCT-ms,只要其中存在 semantic_codec/model.safetensors,sidecar 也会自动兼容。

再创建 IndexTTS sidecar 运行时。官方仓包含 LFS 示例音频,部分环境会遇到 LFS quota,因此建议跳过 smudge;实际 prompt 音频使用你自己的 3-15 秒清晰人声。

终端
cd "$OPENTALKING_HOME"
mkdir -p ./models/local-audio/runtime
INDEXTTS_RUNTIME_REPO="${INDEXTTS_RUNTIME_REPO:-https://github.com/index-tts/index-tts.git}"
if [ ! -d ./models/local-audio/runtime/index-tts/.git ]; then
  for i in 1 2 3; do
    GIT_LFS_SKIP_SMUDGE=1 git clone "$INDEXTTS_RUNTIME_REPO" ./models/local-audio/runtime/index-tts && break
    rm -rf ./models/local-audio/runtime/index-tts
    sleep 3
  done
fi
test -d ./models/local-audio/runtime/index-tts/.git
cd ./models/local-audio/runtime/index-tts
export UV_DEFAULT_INDEX="${UV_DEFAULT_INDEX:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_INDEX_URL="${UV_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export PIP_INDEX_URL="${PIP_INDEX_URL:-https://pypi.tuna.tsinghua.edu.cn/simple}"
export UV_LINK_MODE=copy
uv sync --python 3.11
uv pip install fastapi "uvicorn[standard]" soundfile

IndexTTS 需要一段 3-15 秒清晰人声作为 prompt。可以先准备系统默认音色,也可以在 WebUI 的“音色复刻”里上传参考音频。

终端
cd "$OPENTALKING_HOME"
mkdir -p ./models/local-audio/voices/system/indextts-default
cp /path/to/reference.wav ./models/local-audio/voices/system/indextts-default/prompt.wav
cat > ./models/local-audio/voices/system/indextts-default/meta.json <<'JSON'
{"voice_id":"indextts-default","display_label":"IndexTTS 默认音色","provider":"indextts","target_model":"IndexTeam/IndexTTS-2","source":"system"}
JSON

启动 sidecar 服务:

终端
cd "$OPENTALKING_HOME"
OPENTALKING_LOCAL_AUDIO_MODEL_ROOT=./models/local-audio \
OPENTALKING_TTS_LOCAL_INDEXTTS_MODEL_DIR=./models/local-audio/IndexTeam__IndexTTS-2 \
OPENTALKING_TTS_LOCAL_INDEXTTS_CFG_PATH=./models/local-audio/IndexTeam__IndexTTS-2/config.yaml \
OPENTALKING_TTS_LOCAL_INDEXTTS_PROMPT_AUDIO=./models/local-audio/voices/system/indextts-default/prompt.wav \
OPENTALKING_TTS_LOCAL_INDEXTTS_W2V_BERT_DIR=./models/local-audio/facebook__w2v-bert-2.0 \
OPENTALKING_TTS_LOCAL_INDEXTTS_MASKGCT_DIR=./models/local-audio/amphion__MaskGCT \
OPENTALKING_TTS_LOCAL_INDEXTTS_CAMPPLUS_DIR=./models/local-audio/funasr__campplus \
OPENTALKING_TTS_LOCAL_INDEXTTS_BIGVGAN_DIR=./models/local-audio/nvidia__bigvgan_v2_22khz_80band_256x \
OPENTALKING_TTS_LOCAL_INDEXTTS_DEVICE=cuda:0 \
./models/local-audio/runtime/index-tts/.venv/bin/python scripts/local_indextts_service.py --host 127.0.0.1 --port 19092

OpenTalking .env 中选择 indextts,并把 backend 设置为 local

.env
OPENTALKING_TTS_DEFAULT_PROVIDER=indextts
OPENTALKING_TTS_ENABLED_PROVIDERS=edge,dashscope,local_cosyvoice,indextts
OPENTALKING_TTS_INDEXTTS_BACKEND=local
OPENTALKING_LOCAL_AUDIO_MODEL_ROOT=./models/local-audio
OPENTALKING_TTS_LOCAL_INDEXTTS_MODEL=IndexTeam/IndexTTS-2
OPENTALKING_TTS_LOCAL_INDEXTTS_MODEL_DIR=./models/local-audio/IndexTeam__IndexTTS-2
OPENTALKING_TTS_LOCAL_INDEXTTS_CFG_PATH=./models/local-audio/IndexTeam__IndexTTS-2/config.yaml
OPENTALKING_TTS_LOCAL_INDEXTTS_SERVICE_URL=http://127.0.0.1:19092/synthesize
OPENTALKING_TTS_LOCAL_INDEXTTS_PROMPT_AUDIO=./models/local-audio/voices/system/indextts-default/prompt.wav
OPENTALKING_TTS_LOCAL_INDEXTTS_W2V_BERT_DIR=./models/local-audio/facebook__w2v-bert-2.0
OPENTALKING_TTS_LOCAL_INDEXTTS_MASKGCT_DIR=./models/local-audio/amphion__MaskGCT
OPENTALKING_TTS_LOCAL_INDEXTTS_CAMPPLUS_DIR=./models/local-audio/funasr__campplus
OPENTALKING_TTS_LOCAL_INDEXTTS_BIGVGAN_DIR=./models/local-audio/nvidia__bigvgan_v2_22khz_80band_256x
OPENTALKING_TTS_LOCAL_INDEXTTS_DEVICE=auto

这些 LOCAL_INDEXTTS_*_DIR 变量既可以写在 OpenTalking .env 中,也可以在启动 sidecar 时作为环境变量传入。OpenTalking 主进程需要 SERVICE_URL 和音色 prompt 路径;sidecar 需要模型目录、prompt、w2v / MaskGCT / campplus / BigVGAN 等本地资产目录,避免运行时再次访问 Hugging Face。

启动 OpenTalking API 和 WebUI。这里用 QuickTalk local 作为示例 video backend;如果你只想验证 TTS preview,也可以先用 --mock 启动。

终端
cd "$OPENTALKING_HOME"
bash scripts/start_unified.sh --backend local --model quicktalk --api-port 8000 --web-port 5173

如果 80005173 已被其他服务占用,可以把 --api-port / --web-port 换成空闲端口;下面的 curl 地址也要同步替换对应 API 端口。

先验证 sidecar,再验证 OpenTalking API:

终端
curl -fsS http://127.0.0.1:19092/health
curl -fsS -X POST http://127.0.0.1:19092/synthesize \
  -H 'content-type: application/json' \
  -d '{"text":"你好,这是一次 IndexTTS 本地服务测试。","sample_rate":16000}' \
  --output /tmp/indextts-local.pcm

.venv/bin/python - <<'PY'
from pathlib import Path
pcm = Path("/tmp/indextts-local.pcm").read_bytes()
assert len(pcm) > 0 and len(pcm) % 2 == 0, len(pcm)
print("pcm_bytes", len(pcm), "sample_rate", 16000, "channels", 1)
PY

curl -fsS -X POST http://127.0.0.1:8000/tts/preview \
  --max-time 300 \
  -H 'content-type: application/json' \
  -d '{"text":"你好,这是一次 OpenTalking IndexTTS 测试。","tts_provider":"indextts","tts_voice":"indextts-default","tts_model":"IndexTeam/IndexTTS-2"}' \
  --output /tmp/opentalking-indextts-preview.wav

.venv/bin/python - <<'PY'
import wave
with wave.open("/tmp/opentalking-indextts-preview.wav", "rb") as wf:
    print("wav", wf.getframerate(), wf.getnchannels(), wf.getsampwidth(), wf.getnframes())
PY

也可以检查运行状态,确认 OpenTalking 对外仍是 indextts,同时 backend 解析为 local

终端
curl -fsS http://127.0.0.1:8000/runtime/status | python3 -m json.tool

期望 tts_providerindexttstts_providers.indextts.backendlocaltts_providers.indextts.resolved_providerlocal_indextts,并且 service_url_set=true

方式 B:OmniRT 常驻服务(backend = omnirt)

OmniRT backend 适合把 IndexTTS 常驻在独立服务里,OpenTalking 只通过 HTTP stream 消费 audio/L16 PCM。OpenTalking 侧 provider 仍然是 indextts,只把 backend 切到 omnirt

先在 OmniRT 仓启动 text2audio 服务:

终端
cd "$OMNIRT_HOME"
OMNIRT_INDEXTTS_RUNTIME=1 \
OMNIRT_INDEXTTS_MODEL_DIR=/data2/zhongyi/model/local-audio/IndexTeam__IndexTTS-2 \
OMNIRT_INDEXTTS_CFG_PATH=/data2/zhongyi/model/local-audio/IndexTeam__IndexTTS-2/config.yaml \
OMNIRT_INDEXTTS_PROMPT_AUDIO=/data2/zhongyi/model/local-audio/voices/system/indextts-default/prompt.wav \
OMNIRT_INDEXTTS_DEVICE=cuda:0 \
uv run omnirt serve-text2audio --host 127.0.0.1 --port 9012

再在 OpenTalking .env 中选择 provider 和 backend:

.env
OPENTALKING_TTS_DEFAULT_PROVIDER=indextts
OPENTALKING_TTS_ENABLED_PROVIDERS=edge,dashscope,local_cosyvoice,indextts
OPENTALKING_TTS_INDEXTTS_BACKEND=omnirt
OPENTALKING_TTS_OMNIRT_INDEXTTS_SERVICE_URL=http://127.0.0.1:9012/v1/text2audio/indextts
OPENTALKING_TTS_OMNIRT_INDEXTTS_MODEL=IndexTeam/IndexTTS-2
OPENTALKING_TTS_OMNIRT_INDEXTTS_STREAMING=1
OPENTALKING_TTS_OMNIRT_INDEXTTS_STREAMING_MODE=token_window
OPENTALKING_TTS_OMNIRT_INDEXTTS_MAX_TEXT_TOKENS_PER_SEGMENT=80
OPENTALKING_TTS_OMNIRT_INDEXTTS_QUICK_STREAMING_TOKENS=4
OPENTALKING_TTS_OMNIRT_INDEXTTS_INTERVAL_SILENCE_MS=0
OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_SIZE=40
OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_HOP=96
OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_CONTEXT=8
OPENTALKING_TTS_OMNIRT_INDEXTTS_TOKEN_WINDOW_OVERLAP_MS=60

token_window 是模型 token 级的分窗流式:OmniRT 会在语音 token 累积到窗口后先解码并返回 PCM,不必等待整段文本完成;它不是 20ms 波形级真流式,短句首包延迟仍取决于 IndexTTS GPT 首批语音 token 和 vocoder 解码耗时。

ElevenLabs

.env
OPENTALKING_TTS_DEFAULT_PROVIDER=elevenlabs
OPENTALKING_TTS_ELEVENLABS_API_KEY=<elevenlabs-api-key>
OPENTALKING_TTS_ELEVENLABS_VOICE_ID=<voice-id>
OPENTALKING_TTS_ELEVENLABS_MODEL_ID=eleven_flash_v2_5

验证

先创建 mock 会话,再调用 /speak。这样可以验证 TTS,不依赖真实 talking-head 模型。

终端
SID=<session-id>
curl -s -X POST "http://127.0.0.1:8000/sessions/$SID/speak" \
  -H 'content-type: application/json' \
  -d '{"text":"你好,这是一次 OpenTalking 语音合成测试。","tts_provider":"indextts","tts_voice":"indextts-default","tts_model":"IndexTeam/IndexTTS-2"}'

前端入口

模型或后端服务启动后,统一用 OpenTalking WebUI 访问:

终端
cd "$OPENTALKING_HOME"
bash scripts/quickstart/start_frontend.sh --api-port 8000 --web-port 5173 --host 0.0.0.0

远程服务器部署时,把本地浏览器端口映射到服务器 5173,再打开 http://127.0.0.1:5173