FlashTalk Resident Benchmark
This document captures the first real-hardware benchmark for soulx-flashtalk-14b under the persistent_worker resident serving path, so later commits can compare against the same baseline.
Test environment

- Date: 2026-04-21
- Machine: internal Ascend validation host
- Accelerator: Ascend 910B2 x8
- Serving path: `omnirt serve` + `persistent_worker` + resident `torchrun`
- Model stack:
  - `soulx-flashtalk-14b`
  - `SoulX-FlashTalk-14B`
  - `chinese-wav2vec2-base`
Benchmark configuration

This run matches the “speed-first” realtime profile documented in the paired LiveAct benchmark summary:

- `FLASHTALK_HEIGHT=704`
- `FLASHTALK_WIDTH=416`
- `FLASHTALK_FRAME_NUM=29`
- `FLASHTALK_MOTION_FRAMES_NUM=1`
- `FLASHTALK_SAMPLE_STEPS=2`
- `FLASHTALK_COLOR_CORRECTION_STRENGTH=0`
- `audio_encode_mode=stream`
- Inputs:
  - `examples/woman2.jpg`
  - `examples/cantonese_16k.wav`

The resident server was also configured with:

- `max_concurrency=1`
- `pipeline_cache_size=1`
Metric definitions

Three timing layers matter here:

- cold request: First request, including model load, distributed initialization, and resident worker warmup.
- `steady_chunk_core_ms_avg`: Hot-path average for the core `run_pipeline(...)` chunk work. This is the closest metric to the “steady chunk” numbers in `SESSION_SUMMARY`.
- `steady_chunk_total_ms_avg`: Hot-path average chunk time including audio embedding and `video.cpu()` handling.

Do not compare `denoise_loop_ms / chunk_count` directly with the old `Generate video chunk-x done` logs; the scopes differ.
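One way to read the averaged metrics is as aggregations over per-chunk timings. The sketch below is illustrative only: the per-chunk field names, the `summarize_chunks` helper, the additive definition of the chunk total, and the rule that "steady" averages skip the first chunk of a request are all assumptions rather than OmniRT's confirmed implementation, and the sample numbers are made up.

```python
from statistics import mean


def summarize_chunks(chunks, skip_first=True):
    """Aggregate per-chunk timings (milliseconds) into average metrics.

    `chunks` is a list of dicts with hypothetical keys
    'audio_embedding_ms', 'core_ms', and 'copy_ms'.
    ASSUMPTION: "steady" averages drop the first chunk of the request.
    ASSUMPTION: the chunk total is the sum of the three listed parts.
    """
    cores = [c["core_ms"] for c in chunks]
    totals = [
        c["audio_embedding_ms"] + c["core_ms"] + c["copy_ms"] for c in chunks
    ]
    # Only skip the first chunk when there is more than one chunk to average.
    start = 1 if skip_first and len(chunks) > 1 else 0
    return {
        "chunk_core_ms_avg": mean(cores),
        "steady_chunk_core_ms_avg": mean(cores[start:]),
        "chunk_total_ms_avg": mean(totals),
        "steady_chunk_total_ms_avg": mean(totals[start:]),
    }


# Made-up timings for three chunks; these are not measured values.
example_chunks = [
    {"audio_embedding_ms": 2, "core_ms": 100, "copy_ms": 3},
    {"audio_embedding_ms": 2, "core_ms": 80, "copy_ms": 3},
    {"audio_embedding_ms": 2, "core_ms": 90, "copy_ms": 3},
]
summary = summarize_chunks(example_chunks)
print(summary)
```

Keeping the core and total layers separate in this way makes it easier to see whether a regression comes from the generation loop itself or from the work around it.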
Summary

Cold start

| Scenario | End-to-end |
|---|---|
| `max_chunks=1` | 88.409 s |
| `max_chunks=3` | 91.196 s |
| Full audio (937 frames) | 121.029 s |
Hot path

| Scenario | End-to-end | Notes |
|---|---|---|
| `max_chunks=1` | 2.672 s | single hot chunk request |
| `max_chunks=3` | 5.514 s | used to estimate steady chunk cost |
| Full audio (937 frames) | 37.377 s | full resident hot-path video |
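The cold and hot end-to-end numbers can be combined to approximate the one-time startup cost. The sketch below subtracts the hot time from the cold time for each scenario and derives a hot-path throughput for the full-audio run. The variable names are illustrative, and the throughput figure assumes that the 937 frames are output video frames.

```python
# End-to-end times in seconds, copied from the cold-start and hot-path tables.
cold_s = {"max_chunks=1": 88.409, "max_chunks=3": 91.196, "full_audio": 121.029}
hot_s = {"max_chunks=1": 2.672, "max_chunks=3": 5.514, "full_audio": 37.377}

# Approximate one-time cost (model load, distributed init, warmup)
# as cold minus hot for the same scenario.
startup_overhead_s = {name: round(cold_s[name] - hot_s[name], 3) for name in cold_s}

# Hot-path throughput for the full-audio run.
# ASSUMPTION: "937 frames" counts output video frames.
full_audio_frames = 937
hot_frames_per_s = round(full_audio_frames / hot_s["full_audio"], 2)

print(startup_overhead_s)  # about 85.737, 85.682, and 83.652 seconds
print(hot_frames_per_s)    # about 25.07 frames per second
```

Across the three scenarios the derived startup cost stays in a narrow band of roughly 84 to 86 seconds, which is consistent with it being a one-time cost that does not scale with request length.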
Hot chunk metrics

Hot resident telemetry for `max_chunks=3`:

| Metric | Value |
|---|---|
| `audio_embedding_ms_avg` | 21.259 ms |
| `chunk_core_ms_avg` | 894.051 ms |
| `steady_chunk_core_ms_avg` | 891.002 ms |
| `chunk_copy_ms_avg` | 33.339 ms |
| `chunk_total_ms_avg` | 957.137 ms |
| `steady_chunk_total_ms_avg` | 953.662 ms |
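The table values can be cross-checked with simple arithmetic. The following sketch sums the three named components and compares the result with `chunk_total_ms_avg`; the remainder is per-chunk work not covered by the listed metrics. The variable names are illustrative, and the calculation assumes the averaged components are additive.

```python
# Averages in milliseconds, copied from the hot chunk metrics table.
audio_embedding_ms = 21.259
chunk_core_ms = 894.051
chunk_copy_ms = 33.339
chunk_total_ms = 957.137

# Time explained by the three named components.
accounted_ms = audio_embedding_ms + chunk_core_ms + chunk_copy_ms

# Remainder not attributed to any listed metric (per-chunk tail work).
residual_ms = round(chunk_total_ms - accounted_ms, 3)

# Everything outside the core generation loop, as absolute time and as a share.
non_core_ms = round(chunk_total_ms - chunk_core_ms, 3)
non_core_share_pct = round(100 * non_core_ms / chunk_total_ms, 2)

print(residual_ms)         # about 8.488 ms
print(non_core_ms)         # about 63.086 ms
print(non_core_share_pct)  # about 6.59 percent
```

Under that additivity assumption, roughly 93% of the per-chunk time is spent in the core generation work, and the copy step is the largest of the non-core parts.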
Comparison with the standalone script

On the same internal validation host, with the same 704x416 + 29/1 + 2 step + stream configuration, the direct `generate_video.py` run reported:

- chunk-0: 6.59 s
- chunk-1: 0.89 s
- chunk-2: 0.90 s

That means:

- `steady_chunk_core_ms_avg` ≈ 891 ms is effectively aligned with the standalone script’s 0.89 s to 0.90 s.
- The remaining resident overhead sits mostly in:
  - audio embedding
  - `video.cpu()` and per-chunk tail work

In other words, the `persistent_worker` path is not slowing down the core FlashTalk DiT generation loop.
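The alignment claim can be quantified with a short calculation. This sketch compares the resident `steady_chunk_core_ms_avg` with the mean of the standalone script's chunk-1 and chunk-2 times; chunk-0 is left out because it is the first, much slower chunk. The variable names are illustrative, and the standalone values carry only two decimal places, so the result is a rough indication rather than a precise measurement.

```python
from statistics import mean

# Standalone generate_video.py chunk times in seconds (chunk-0 excluded).
standalone_steady_s = [0.89, 0.90]

# Resident steady core average in milliseconds, from the telemetry table.
resident_steady_core_ms = 891.002

standalone_steady_ms = mean(standalone_steady_s) * 1000.0

# Signed relative difference: negative means the resident path is faster.
relative_diff_pct = round(
    100 * (resident_steady_core_ms - standalone_steady_ms) / standalone_steady_ms, 2
)

print(round(standalone_steady_ms, 1))  # about 895.0 ms
print(relative_diff_pct)               # about -0.45 percent
```

A difference of well under one percent is within the rounding of the standalone log values, which supports treating the two core timings as equivalent.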
Conclusion

- `soulx-flashtalk-14b` in OmniRT’s `persistent_worker` resident path is now in a usable state.
- Under the realtime profile, the core generation speed matches the previous standalone best case.
- Further optimization work should focus on:
  - `video.cpu()` / result collection
  - chunk-level non-core overhead
  - longer-audio and more-diverse sample stability runs