CosyVoice Benchmark

This document records the first real-hardware validation of cosyvoice3-triton-trtllm through OmniRT text2audio, together with a rerun of the official streaming benchmark against the same service.

Test Environment

  • Date: 2026-04-28
  • Machine: internal CUDA validation host
  • Accelerator: NVIDIA GeForce RTX 3090
  • Docker container: cosyvoice-trt2504
  • Official directory: /workspace/CosyVoice/runtime/triton_trtllm
  • Model: Fun-CosyVoice3-0.5B-2512
  • Triton model repository: model_repo_cosyvoice3_copy
  • LLM endpoint: trtllm-serve on localhost:8000
  • Triton endpoint: HTTP 18000, gRPC 18001, metrics 18002

Service Profile

Current stable profile:

  • GPU: GPU1
  • token2wav instances: 2
  • vocoder instances: 2
  • kv_cache_free_gpu_memory_fraction=0.2
  • Benchmark dataset: /tmp/wenetspeech4tts_cached26.parquet

Health checks during validation:

http://127.0.0.1:18000/v2/health/live -> 200
http://127.0.0.1:18000/v2/health/ready -> 200
http://127.0.0.1:18000/v2/models/cosyvoice3/ready -> 200
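
For reproducibility, here is a minimal sketch of how these probes can be scripted, assuming only the Python requests package; the endpoints are exactly the three listed above.

import requests

# Probe the Triton HTTP endpoint; each check should return HTTP 200 when healthy.
base = "http://127.0.0.1:18000"
for path in ("/v2/health/live", "/v2/health/ready", "/v2/models/cosyvoice3/ready"):
    status = requests.get(base + path, timeout=5).status_code
    print(f"{path} -> {status}")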

OmniRT Smoke

The smoke test used the current worktree's CosyVoiceTritonPipeline, called the Triton streaming gRPC endpoint directly, and produced a real wav artifact; a minimal sketch of the call pattern follows the output below.

output=/tmp/omnirt_text2audio_smoke_20260428/cosyvoice3-triton-trtllm-omnirt-smoke-1777375798.wav
sample_rate=24000
samples=70080
duration=2.92s
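
For reference, a minimal sketch of the decoupled streaming call pattern, assuming tritonclient[grpc] is installed; the tensor names target_text and waveform and the end-of-stream handling are illustrative assumptions, not the verified cosyvoice3 model signature.

import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # Each callback delivers one audio chunk (or an error) from the decoupled BLS.
    responses.put(error if error else result)

client = grpcclient.InferenceServerClient("localhost:18001")
client.start_stream(callback=on_response)

# "target_text" / "waveform" are illustrative tensor names (assumption).
text = np.array([["hello world".encode("utf-8")]], dtype=np.object_)
text_in = grpcclient.InferInput("target_text", text.shape, "BYTES")
text_in.set_data_from_numpy(text)

client.async_stream_infer(
    model_name="cosyvoice3",
    inputs=[text_in],
    request_id="smoke-1",
    enable_empty_final_response=True,  # ask Triton to flag the final response
)

chunks = []
while True:
    item = responses.get(timeout=120)
    if isinstance(item, Exception):
        raise item
    # With enable_empty_final_response, the last response carries triton_final_response=true.
    if item.get_response().parameters["triton_final_response"].bool_param:
        break
    chunks.append(item.as_numpy("waveform").reshape(-1))

client.stop_stream()
wav = np.concatenate(chunks)  # 24 kHz samples, matching the smoke output above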

RunReport timings:

Stage                   Time
prepare_conditions_ms   0.129 ms
prepare_latents_ms      0.081 ms
denoise_loop_ms         1969.611 ms
decode_ms               0.052 ms
export_ms               6.401 ms

Resolved config:

server_addr=localhost
server_port=18001
model_name=cosyvoice3
sample_rate=24000
seed=42

Official Streaming Benchmark

Command shape:

cd /workspace/CosyVoice/runtime/triton_trtllm
PYTHONPATH=/workspace/CosyVoice/runtime/triton_trtllm/pydeps \
python3 client_grpc.py \
  --server-addr localhost \
  --server-port 18001 \
  --model-name cosyvoice3 \
  --num-tasks 4 \
  --huggingface-dataset /tmp/wenetspeech4tts_cached26.parquet \
  --split-name wenetspeech4tts \
  --log-dir /tmp/omnirt_verify_20260428_185737 \
  --mode streaming \
  --log-interval 100

Results:

Metric                          Value
RTF                             0.1303
synthesized duration            167.360 s
processing time                 21.815 s
average total request latency   3029.77 ms
p50 total request latency       3003.75 ms
p95 total request latency       5438.06 ms
average first chunk latency     699.13 ms
p50 first chunk latency         710.21 ms
p95 first chunk latency         949.71 ms
average second chunk latency    463.37 ms
p50 second chunk latency        446.07 ms
p95 second chunk latency        697.93 ms
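
As a sanity check, RTF here is processing time divided by synthesized audio duration; the small gap below comes from the rounding of the two reported figures.

# RTF = processing time / synthesized audio duration
rtf = 21.815 / 167.360
print(f"{rtf:.4f}")  # 0.1304, matching the reported 0.1303 up to rounding of the inputs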

Conclusions

  • OmniRT text2audio has completed real-generation validation against the Triton gRPC service.
  • The CosyVoice3 Triton BLS uses a decoupled streaming transaction policy, so clients must use streaming gRPC and collect waveform chunks; unary infer() fails with "ModelInfer RPC doesn't support models with decoupled transaction policy".
  • The current 26-sample streaming benchmark sits in the same band as the recent stable rerun: average first chunk latency is about 0.70 s and RTF is about 0.13.
  • seed is forwarded from OmniRT as a Triton request parameter (a client-side sketch follows this list); fully deterministic benchmark runs still require the server-side BLS to read that value and forward it to the OpenAI/TensorRT-LLM request.
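
A minimal client-side sketch of attaching seed as a request parameter, assuming the installed tritonclient supports the parameters keyword and reusing the illustrative target_text input name from the smoke sketch above; the server-side BLS change described in the last bullet is still required for determinism.

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:18001")
client.start_stream(callback=lambda result, error: None)  # handle chunks as in the smoke sketch

text = np.array([["hello world".encode("utf-8")]], dtype=np.object_)
text_in = grpcclient.InferInput("target_text", text.shape, "BYTES")  # illustrative name
text_in.set_data_from_numpy(text)

client.async_stream_infer(
    model_name="cosyvoice3",
    inputs=[text_in],
    request_id="seeded-run-1",
    parameters={"seed": 42},  # the BLS must read this and forward it to the LLM request
)
# ... collect the streaming chunks as in the smoke sketch above, then client.stop_stream()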