FlashHead Benchmark¶
This document records the first real-hardware benchmark for soulx-flashhead-1.3b through OmniRT's subprocess wrapper. It is different from the external SoulX-FlashHead resident benchmark: this page measures the cold-start end-to-end path where OmniRT launches generate_video.py.
Environment¶
- Date:
2026-04-28 - Machine:
internal Ascend validation host - Accelerator:
Ascend 910B2 - Entry point:
omnirt generate+subprocess+torchrun - Models:
soulx-flashhead-1.3bSoulX-FlashHead-1_3Bwav2vec2-base-960h - External checkout:
/path/to/SoulX-FlashHead - OmniRT test checkout:
/path/to/omnirt
Baseline Config¶
This run uses the quality-oriented 910B profile:
model_type=proaudio_encode_mode=streamFLASHHEAD_SAMPLE_STEPS=2FLASHHEAD_VAE_2D_SPLIT=1FLASHHEAD_LATENT_CARRY=0FLASHHEAD_NPU_FUSION_ATTENTION=1- Inputs:
examples/girl.pngbench_results/bench_10s.wav - Output:
512x51225 FPS10.0s250 frames
Device visibility
Set ASCEND_RT_VISIBLE_DEVICES before starting the OmniRT parent process. Passing only --visible-devices is not enough, because the resource-budget check runs before the external torchrun process starts.
Command Template¶
2 NPU:
set +u
source /usr/local/Ascend/ascend-toolkit/set_env.sh
set -u
ASCEND_RT_VISIBLE_DEVICES=2,3 \
PYTHONPATH=/path/to/omnirt/src \
/path/to/flashhead-venv/bin/python -m omnirt generate \
--task audio2video \
--model soulx-flashhead-1.3b \
--backend ascend \
--image /path/to/SoulX-FlashHead/examples/girl.png \
--audio /path/to/SoulX-FlashHead/bench_results/bench_10s.wav \
--repo-path /path/to/SoulX-FlashHead \
--ckpt-dir models/SoulX-FlashHead-1_3B \
--wav2vec-dir models/wav2vec2-base-960h \
--python-executable /path/to/flashhead-venv/bin/python \
--ascend-env-script /usr/local/Ascend/ascend-toolkit/set_env.sh \
--launcher torchrun \
--nproc-per-node 2 \
--visible-devices 2,3 \
--sample-steps 2 \
--vae-2d-split \
--npu-fusion-attention \
--output-dir outputs/flashhead_bench_2npu \
--json
For 4 NPU, set both ASCEND_RT_VISIBLE_DEVICES and --visible-devices to 2,3,4,5, then set --nproc-per-node 4.
Results¶
| Config | Wall time | denoise_loop_ms |
export_ms |
Output |
|---|---|---|---|---|
| 2 NPU cold start | 82.96s |
69,501.215 ms |
264.686 ms |
512x512 / 10s / 250 frames |
| 4 NPU cold start | 84.08s |
69,963.908 ms |
237.519 ms |
512x512 / 10s / 250 frames |
Interpretation:
- The OmniRT
subprocesswrapper can complete realaudio2videogeneration on 910B. - In cold-start mode, 4 NPU does not beat 2 NPU; distributed initialization, model loading, data preparation, and first-operator warmup offset the extra parallelism.
- To measure steady-state multi-NPU gains, benchmark a future
persistent_worker/ resident path instead of a single cold-start script launch.
Artifact Checks¶
| Config | RunReport run_id |
Remote output directory | Checks |
|---|---|---|---|
| 2 NPU | 8361015e-f1d6-4ad7-8c3c-6f3680354fa1 |
outputs/flashhead_bench_20260428_180909 |
ffprobe: 512x512 / 25 FPS / 10.0s / 250 frames; no blackdetect / freezedetect warnings |
| 4 NPU | 79ebe868-609a-4f5f-a571-6366d984aeb2 |
outputs/flashhead_bench_20260428_181056 |
ffprobe: 512x512 / 25 FPS / 10.0s / 250 frames; no blackdetect / freezedetect warnings |
Notes¶
- Running the OmniRT CLI in the remote venv requires
protobufandgrpcio; this run installed them into/path/to/flashhead-venvfrom the Tsinghua PyPI mirror. - This page tracks the OmniRT
subprocesscold-start path, not service-mode hot latency. latent_carry=falseis the default quality profile.latent_carry=truecan reduce part of VAE encode overhead, but prior adaptation notes observed style drift, so it is not the default display profile.