Text to Audio¶
Given target text and a reference audio clip, generate a .wav speech artifact. OmniRT currently exposes two external-service routes and one resident IndexTTS service entrypoint:
- cosyvoice3-triton-trtllm: CosyVoice3 through a Triton-compatible service endpoint; CUDA/TensorRT-LLM remains the reference deployment, while Ascend can be targeted through an externally hosted compatible endpoint.
- soulx-podcast-1.7b: SoulX-Podcast through a FastAPI service endpoint for long-form, podcast, and multi-speaker speech generation; the Ascend path likewise requires the service process to be deployed on NPU first.
- indextts: exposes an OpenTalking-ready PCM stream through serve-text2audio and supports cuda, npu/ascend, and CPU service runtimes.
Minimal Example¶
from omnirt import generate
from omnirt.requests import text2audio

result = generate(text2audio(
    model="cosyvoice3-triton-trtllm",
    prompt="Hello from OmniRT.",
    audio="inputs/reference.wav",
    reference_text="This is the reference voice text.",
    backend="cuda",
    server_addr="localhost",
    server_port=18001,
    seed=42,
))
print(result.outputs[0].path)
SoulX-Podcast¶
soulx-podcast-1.7b does not load model weights inside the OmniRT process. It calls an already-running SoulX-Podcast API. The single-speaker path reuses the standard text2audio fields:
| OmniRT field | SoulX-Podcast field | Notes |
|---|---|---|
| inputs.prompt | dialogue_text | Target dialogue text |
| inputs.audio | prompt_audio | Reference audio |
| inputs.reference_text | prompt_texts | Transcript for the reference audio |
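The mapping in the table can be sketched as a small pure function. This is an illustrative helper only: the name to_soulx_fields is hypothetical and not part of OmniRT, and the wire encoding of each value (for example, whether prompt_texts is sent as a string or a list) is not stated here, so the values are passed through unchanged.

```python
def to_soulx_fields(prompt: str, audio: str, reference_text: str = "") -> dict:
    """Rename single-speaker OmniRT inputs to SoulX-Podcast field names.

    Hypothetical helper: only the field names come from the mapping table.
    Values are passed through as-is because the request encoding is not
    documented in this section.
    """
    return {
        "dialogue_text": prompt,         # inputs.prompt
        "prompt_audio": audio,           # inputs.audio
        "prompt_texts": reference_text,  # inputs.reference_text
    }
```

Calling to_soulx_fields("Hello.", "inputs/reference.wav", "Reference text.") yields a dictionary keyed by the three SoulX-Podcast field names.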
task: text2audio
model: soulx-podcast-1.7b
backend: cuda
inputs:
  prompt: Welcome to the OmniRT podcast. This is a SoulX-Podcast adapter test.
  audio: inputs/reference.wav
  reference_text: This is the reference voice text.
config:
  server_url: http://127.0.0.1:18080
  seed: 42
  temperature: 0.7
  top_k: 40
  top_p: 0.9
  repetition_penalty: 1.1
For multi-speaker podcast generation, prefer YAML and provide matching prompt_audios and prompt_texts lists:
task: text2audio
model: soulx-podcast-1.7b
backend: cuda
inputs:
  prompt: |
    [S1] Welcome to the OmniRT podcast.
    [S2] Today we are discussing voice generation for realtime digital humans.
  audio: inputs/speaker_a.wav
config:
  server_url: http://127.0.0.1:18080
  prompt_audios:
    - inputs/speaker_a.wav
    - inputs/speaker_b.wav
  prompt_texts:
    - Reference text for speaker one.
    - Reference text for speaker two.
  seed: 42
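Because prompt_audios and prompt_texts must pair up one-to-one, it can help to check a request before sending it. The following is a minimal sketch of such a check, not OmniRT's own validation: both function names are hypothetical, and counting [S1]/[S2] tags in the prompt is an assumption based on the tag style used in the YAML example.

```python
import re


def resolve_speaker_prompts(audio, reference_text="", prompt_audios=None, prompt_texts=None):
    """Return (audios, texts) with one entry per speaker.

    Hypothetical pre-flight check. If neither list is given, fall back to the
    single-speaker audio / reference_text pair.
    """
    if prompt_audios is None and prompt_texts is None:
        return [audio], [reference_text]
    audios = list(prompt_audios or [])
    texts = list(prompt_texts or [])
    if len(audios) != len(texts):
        raise ValueError(
            f"prompt_audios has {len(audios)} entries but prompt_texts has {len(texts)}"
        )
    return audios, texts


def count_speakers(prompt):
    """Count distinct [S<n>] tags in the prompt (tag format assumed from the example)."""
    return len(set(re.findall(r"\[S(\d+)\]", prompt)))
```

A caller could additionally compare count_speakers(prompt) with the number of reference clips, although whether the service enforces that relationship is not stated here.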
Key Parameters¶
| Parameter | Type | Default | Notes |
|---|---|---|---|
| prompt | str | required | Target text to synthesize |
| audio | str | required | Reference audio path, resampled to 16 kHz before the Triton request |
| reference_text | str | "" | Transcript for the reference audio; recommended for zero-shot voice reuse |
| server_addr | str | 127.0.0.1 | Triton gRPC server address |
| server_port | int | 8001 | Triton gRPC port; the current 146 validation container uses 18001 |
| model_name | str | cosyvoice3 | Triton model-repository name |
| sample_rate | int | 24000 | Output wav sample rate |
| seed | int | unset | Forwarded as a Triton request parameter; the server-side BLS must consume it for deterministic sampling |
| server_url | str | http://127.0.0.1:18080 | SoulX-Podcast HTTP API URL; can also be set with OMNIRT_SOULX_PODCAST_API_URL |
| service_accelerator | str | inferred from backend | Records the external TTS service accelerator; defaults to ascend when --backend ascend is selected |
| timeout | float | 300 | SoulX-Podcast HTTP request timeout in seconds |
| temperature / top_k / top_p / repetition_penalty | number | server default | SoulX-Podcast sampling parameters |
| prompt_audios / prompt_texts | list[str] | single-speaker fallback | Multi-speaker SoulX-Podcast reference audio and transcript lists |
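The server_url row lists two sources for the SoulX-Podcast endpoint plus a default. The sketch below shows one plausible way to resolve them. The precedence (explicit config first, then the environment variable, then the default) is an assumption, and resolve_server_url is a hypothetical name rather than an OmniRT function.

```python
import os

DEFAULT_SOULX_URL = "http://127.0.0.1:18080"  # default from the parameter table


def resolve_server_url(config_value=None, env=None):
    """Pick the SoulX-Podcast API URL.

    Assumed precedence: explicit config value, then
    OMNIRT_SOULX_PODCAST_API_URL, then the documented default.
    """
    env = os.environ if env is None else env
    url = config_value or env.get("OMNIRT_SOULX_PODCAST_API_URL") or DEFAULT_SOULX_URL
    # Strip a trailing slash so paths can be appended consistently.
    return url.rstrip("/")
```

Passing env as a plain dictionary keeps the function easy to exercise without touching the real process environment.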
Ascend Service Endpoints¶
cosyvoice3-triton-trtllm and soulx-podcast-1.7b are OmniRT wrappers; they do not load TTS weights in the current process. Selecting --backend ascend records backend=ascend in the run report and defaults service_accelerator to ascend, but the actual inference still happens inside the configured Triton / FastAPI service endpoint.
omnirt generate \
  --task text2audio \
  --model cosyvoice3-triton-trtllm \
  --prompt "Hello from OmniRT." \
  --audio inputs/reference.wav \
  --reference-text "This is the reference voice text." \
  --backend ascend \
  --server-addr 8.92.7.195 \
  --server-port 18001 \
  --service-accelerator ascend

omnirt generate \
  --task text2audio \
  --model soulx-podcast-1.7b \
  --prompt "Welcome to the OmniRT podcast." \
  --audio inputs/reference.wav \
  --reference-text "This is the reference voice text." \
  --backend ascend \
  --server-url http://8.92.7.195:18080 \
  --service-accelerator ascend
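The relationship between --backend and --service-accelerator can be illustrated with a short sketch. It assumes that an explicit service_accelerator always wins and that the value otherwise mirrors the backend; only the ascend case is confirmed above. The function name and the returned dictionary shape are hypothetical and do not reflect the actual run-report schema.

```python
def report_labels(backend, service_accelerator=None):
    """Return the accelerator labels recorded for a wrapper run.

    Assumption: an explicit service_accelerator overrides the value inferred
    from backend. These are labels only; inference still runs inside the
    external Triton / FastAPI service.
    """
    return {
        "backend": backend,
        "service_accelerator": service_accelerator or backend,
    }
```

For example, report_labels("ascend") labels both fields as ascend without changing where the model executes.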
IndexTTS-2 Resident Service¶
indextts uses the dedicated serve-text2audio entry point to expose a PCM stream that OpenTalking can consume directly:
- GET /v1/text2audio/models returns IndexTTS runtime status, including streaming_mode, streaming_granularity, model_internal_streaming, token_window_size, token_window_hop, token_window_context, and token_window_overlap_ms.
- POST /v1/text2audio/indextts accepts text, voice, max_text_tokens_per_segment, quick_streaming_tokens, interval_silence_ms, streaming_mode, token_window_size, token_window_hop, token_window_context, token_window_overlap_ms, and optional generation knobs such as num_beams, top_p, top_k, temperature, repetition_penalty, and max_mel_tokens.
- The recommended default is streaming_mode=token_window. This mode consumes the IndexTTS2 GPT code-token stream directly, decodes the first completed token window through s2mel/CFM/BigVGAN, and starts writing PCM before the full text segment finishes; the status endpoint reports streaming_granularity=token_window, model_internal_streaming=true, and streaming_experimental=true.
- This is still not 20 ms waveform-level streaming: GPT tokens are produced incrementally, but playable PCM is decoded per token window and each window still runs s2mel, CFM, and BigVGAN. Smaller windows reduce first-packet latency but can increase total latency and seam risk; larger windows behave closer to full-segment generation.
serve-text2audio does not import the OmniRT gRPC engine, so it can run inside the official IndexTTS Python 3.11 environment. The omnirt[indextts] extra carries only light HTTP server dependencies such as FastAPI/Uvicorn and does not pull grpcio or protobuf. Set OMNIRT_INDEXTTS_PRELOAD=1 to load the model at service startup; set OMNIRT_INDEXTTS_WARMUP_TEXT as well to run one short synthesis before the first user request.
OMNIRT_HOME=/path/to/omnirt
INDEXTTS_HOME=/path/to/index-tts
MODEL_ROOT=/path/to/models/local-audio

cd "$INDEXTTS_HOME"
uv sync --all-extras --python 3.11 --default-index "https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
uv pip install --python .venv/bin/python -e "$OMNIRT_HOME[indextts]" \
  --default-index "https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"

PYTHONPATH="$OMNIRT_HOME/src:$INDEXTTS_HOME" \
OMNIRT_INDEXTTS_RUNTIME=1 \
OMNIRT_LOCAL_AUDIO_MODEL_ROOT="$MODEL_ROOT" \
OMNIRT_INDEXTTS_MODEL=IndexTeam/IndexTTS-2 \
OMNIRT_INDEXTTS_MODEL_DIR="$MODEL_ROOT/IndexTeam__IndexTTS-2" \
OMNIRT_INDEXTTS_CFG_PATH="$MODEL_ROOT/IndexTeam__IndexTTS-2/config.yaml" \
OMNIRT_INDEXTTS_PROMPT_AUDIO="$MODEL_ROOT/voices/system/indextts-default/prompt.wav" \
OMNIRT_INDEXTTS_MAX_TEXT_TOKENS_PER_SEGMENT=80 \
OMNIRT_INDEXTTS_QUICK_STREAMING_TOKENS=4 \
OMNIRT_INDEXTTS_STREAMING_MODE=token_window \
OMNIRT_INDEXTTS_TOKEN_WINDOW_SIZE=40 \
OMNIRT_INDEXTTS_TOKEN_WINDOW_HOP=96 \
OMNIRT_INDEXTTS_TOKEN_WINDOW_CONTEXT=8 \
OMNIRT_INDEXTTS_TOKEN_WINDOW_OVERLAP_MS=60 \
OMNIRT_INDEXTTS_NUM_BEAMS=1 \
OMNIRT_INDEXTTS_TOP_P=0.8 \
OMNIRT_INDEXTTS_TOP_K=30 \
OMNIRT_INDEXTTS_TEMPERATURE=0.8 \
OMNIRT_INDEXTTS_REPETITION_PENALTY=10.0 \
OMNIRT_INDEXTTS_MAX_MEL_TOKENS=1500 \
OMNIRT_INDEXTTS_PRELOAD=1 \
OMNIRT_INDEXTTS_WARMUP_TEXT="Hello." \
OMNIRT_INDEXTTS_DEVICE=cuda:0 \
  .venv/bin/python -m omnirt.cli.main serve-text2audio --host 127.0.0.1 --port 9012
On Ascend hosts, source CANN and install the matching torch_npu first, then switch the device to ascend, npu, or npu:0:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export OMNIRT_INDEXTTS_DEVICE=ascend
export OMNIRT_INDEXTTS_NPU_INDEX=0
export OMNIRT_INDEXTTS_USE_CUDA_KERNEL=0
The IndexTTS runtime resolves ascend and npu to npu:0, enables fp16 by default, and checks torch_npu before loading the engine on NPU. CUDA-kernel mode is disabled on NPU so CUDA-only kernels are not forwarded into the Ascend environment.
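The device rules above can be sketched as follows. This is an illustration rather than the runtime's actual code: the function names are hypothetical, and treating OMNIRT_INDEXTTS_NPU_INDEX as the index substituted into npu:N is an assumption (only npu:0 is confirmed above).

```python
import importlib.util


def resolve_device(device, npu_index=0, use_cuda_kernel=True):
    """Return (device, use_cuda_kernel) after applying the NPU rules.

    "ascend" and "npu" become "npu:<npu_index>" (index handling assumed), and
    CUDA-kernel mode is forced off on any NPU device.
    """
    name = device.strip().lower()
    if name in ("ascend", "npu"):
        name = f"npu:{npu_index}"
    if name.startswith("npu"):
        use_cuda_kernel = False
    return name, use_cuda_kernel


def require_torch_npu():
    """Fail early if torch_npu is not installed (check done before engine load)."""
    if importlib.util.find_spec("torch_npu") is None:
        raise RuntimeError("torch_npu is required for NPU devices; source CANN and install it first")
```

A service entry point might call require_torch_npu() only when the resolved device starts with npu, so CUDA and CPU hosts are unaffected.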
curl -fsS http://127.0.0.1:9012/v1/text2audio/models

curl -sS -X POST http://127.0.0.1:9012/v1/text2audio/indextts \
  -H 'content-type: application/json' \
  -d '{"text":"Hello, this is an OmniRT IndexTTS streaming test."}' \
  -o /tmp/omnirt-indextts.pcm
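The saved file is headerless PCM, so most players cannot open it directly. A minimal sketch for wrapping it in a WAV container with Python's standard wave module is shown below. The sample rate, channel count, and sample width of the stream are not stated in this section, so sample_rate is a required argument and the mono 16-bit defaults are assumptions to confirm against the running service.

```python
import wave


def pcm_to_wav(pcm_path, wav_path, sample_rate, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV header and return the duration in seconds.

    Assumptions: the stream is little-endian integer PCM; mono and 16-bit
    (sample_width=2 bytes) are defaults only, not documented facts.
    """
    with open(pcm_path, "rb") as src:
        pcm = src.read()
    with wave.open(wav_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(sample_width)
        dst.setframerate(sample_rate)
        dst.writeframes(pcm)
    return len(pcm) / (sample_rate * channels * sample_width)
```

If the resulting audio plays too fast, too slow, or as noise, the assumed format does not match the stream and the parameters need adjusting.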
Together, OMNIRT_INDEXTTS_MAX_TEXT_TOKENS_PER_SEGMENT=80 and OMNIRT_INDEXTTS_QUICK_STREAMING_TOKENS=4 control text segment size after OpenTalking early-submit. For realtime conversations, use OMNIRT_INDEXTTS_STREAMING_MODE=token_window, OMNIRT_INDEXTTS_TOKEN_WINDOW_SIZE=40, OMNIRT_INDEXTTS_TOKEN_WINDOW_HOP=96, OMNIRT_INDEXTTS_TOKEN_WINDOW_CONTEXT=8, and OMNIRT_INDEXTTS_TOKEN_WINDOW_OVERLAP_MS=60. With these values the first window stays at 40 speech tokens to balance first-audio latency against total generation time, while later windows decode every 96 tokens to reduce repeated vocoder work on longer replies. Set OMNIRT_INDEXTTS_NUM_BEAMS=1 for the low-latency sampling path; beam search (num_beams>1) blocks GPT token streaming and increases first-segment latency.
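To make the window sizes concrete, the sketch below computes which token ranges would be decoded for a reply of a given length. It is a simplified model, not the IndexTTS runtime's scheduler: the function name is hypothetical, and interpreting token_window_context as a number of preceding tokens re-fed into each later decode is an assumption.

```python
def token_window_schedule(total_tokens, first_window=40, hop=96, context=8):
    """Return (context_start, start, end) token ranges, one per decoded window.

    Simplified model: the first window covers first_window tokens, each later
    window covers up to hop new tokens, and context_start marks how far back
    the decoder is assumed to look for continuity.
    """
    windows = []
    start = 0
    end = min(first_window, total_tokens)
    while start < total_tokens:
        windows.append((max(0, start - context), start, end))
        start = end
        end = min(start + hop, total_tokens)
    return windows


# A 200-token reply is decoded in three windows:
print(token_window_schedule(200))
# → [(0, 0, 40), (32, 40, 136), (128, 136, 200)]
```

Under this model, shrinking first_window brings the first PCM forward, while a larger hop means fewer s2mel/CFM/BigVGAN passes for the rest of the reply.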
Deployment Notes¶
The stable service profile on machine 146 is GPU1, token2wav=2, vocoder=2, and kv_cache_free_gpu_memory_fraction=0.2; Triton gRPC is exposed on 18001 inside the validation container. On 2026-04-28, the OmniRT text2audio wrapper generated a 2.92 s / 24 kHz wav with denoise_loop_ms=1969.611; the official 26-sample streaming benchmark measured RTF=0.1303 and 699.13 ms average first-chunk latency.
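For reference, real-time factor (RTF) is conventionally processing time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below applies that definition to the single wrapper run above. It assumes denoise_loop_ms covers the whole request, which is not confirmed here, so the result is not directly comparable to the streaming benchmark figure.

```python
def real_time_factor(processing_ms, audio_seconds):
    """Conventional RTF: processing time over generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (processing_ms / 1000.0) / audio_seconds


# Single wrapper run quoted above: 1969.611 ms for 2.92 s of audio.
print(round(real_time_factor(1969.611, 2.92), 4))
# → 0.6745
```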
Full record: CosyVoice Benchmark.
SoulX-Podcast API¶
On machine 220, the validated base-model checkout is /home/video/SoulX-Podcast, the weights live at pretrained_models/SoulX-Podcast-1.7B, and the API listens on port 18080. Startup example:
cd /home/video/SoulX-Podcast
source .venv/bin/activate
python run_api.py \
  --model pretrained_models/SoulX-Podcast-1.7B \
  --host 0.0.0.0 \
  --port 18080 \
  --engine hf \
  --max-tasks 1
The health endpoint should report model_loaded=true and gpu_available=true. If GPUs are occupied on 220, stop the animator-worker-* Docker containers first instead of killing arbitrary GPU processes.
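A readiness check along these lines can be scripted before submitting jobs. The sketch below assumes the /health endpoint returns a JSON object containing the model_loaded and gpu_available fields named above; the response format and both function names are assumptions, not documented SoulX-Podcast API.

```python
import json
import urllib.request


def is_ready(health):
    """True only when the health payload reports both flags as truthy."""
    return bool(health.get("model_loaded")) and bool(health.get("gpu_available"))


def fetch_health(server_url, timeout=5.0):
    """GET <server_url>/health and parse the body as JSON (format assumed)."""
    url = f"{server_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, is_ready(fetch_health("http://127.0.0.1:18080")) returns False while the model is still loading, provided the service reports the fields as described.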
Troubleshooting¶
- No local Triton service: this wrapper calls an external official service. Start CosyVoice3 runtime/triton_trtllm before running OmniRT.
- Missing tritonclient or soundfile: install the CosyVoice/Triton client dependencies first.
- seed still does not stabilize results: verify that the Triton BLS reads and forwards seed to the OpenAI/TensorRT-LLM request; client-side parameters alone cannot change sampling.
- SoulX-Podcast API is unreachable: check /health, then verify that server_url or OMNIRT_SOULX_PODCAST_API_URL points to the running API.
- Multi-speaker length error: prompt_audios and prompt_texts must match one-to-one. For single-speaker generation, leave both lists unset and use audio plus reference_text.