FlashHead Benchmark¶

This document records the first real-hardware benchmark for soulx-flashhead-1.3b through OmniRT's subprocess wrapper. It is different from the external SoulX-FlashHead resident benchmark: this page measures the cold-start end-to-end path where OmniRT launches generate_video.py.

Environment¶

Date: 2026-04-28
Machine: internal Ascend validation host
Accelerator: Ascend 910B2
Entry point: omnirt generate + subprocess + torchrun
Models: soulx-flashhead-1.3b SoulX-FlashHead-1_3B wav2vec2-base-960h
External checkout: /path/to/SoulX-FlashHead
OmniRT test checkout: /path/to/omnirt

Baseline Config¶

This run uses the quality-oriented 910B profile:

model_type=pro
audio_encode_mode=stream
FLASHHEAD_SAMPLE_STEPS=2
FLASHHEAD_VAE_2D_SPLIT=1
FLASHHEAD_LATENT_CARRY=0
FLASHHEAD_NPU_FUSION_ATTENTION=1
Inputs: examples/girl.png bench_results/bench_10s.wav
Output: 512x512 25 FPS 10.0s 250 frames

Device visibility

Set ASCEND_RT_VISIBLE_DEVICES before starting the OmniRT parent process. Passing only --visible-devices is not enough, because the resource-budget check runs before the external torchrun process starts.

Command Template¶

2 NPU:

set +u
source /usr/local/Ascend/ascend-toolkit/set_env.sh
set -u
ASCEND_RT_VISIBLE_DEVICES=2,3 \
PYTHONPATH=/path/to/omnirt/src \
/path/to/flashhead-venv/bin/python -m omnirt generate \
  --task audio2video \
  --model soulx-flashhead-1.3b \
  --backend ascend \
  --image /path/to/SoulX-FlashHead/examples/girl.png \
  --audio /path/to/SoulX-FlashHead/bench_results/bench_10s.wav \
  --repo-path /path/to/SoulX-FlashHead \
  --ckpt-dir models/SoulX-FlashHead-1_3B \
  --wav2vec-dir models/wav2vec2-base-960h \
  --python-executable /path/to/flashhead-venv/bin/python \
  --ascend-env-script /usr/local/Ascend/ascend-toolkit/set_env.sh \
  --launcher torchrun \
  --nproc-per-node 2 \
  --visible-devices 2,3 \
  --sample-steps 2 \
  --vae-2d-split \
  --npu-fusion-attention \
  --output-dir outputs/flashhead_bench_2npu \
  --json

For 4 NPU, set both ASCEND_RT_VISIBLE_DEVICES and --visible-devices to 2,3,4,5, then set --nproc-per-node 4.

Results¶

Config	Wall time	`denoise_loop_ms`	`export_ms`	Output
2 NPU cold start	`82.96s`	`69,501.215 ms`	`264.686 ms`	`512x512 / 10s / 250 frames`
4 NPU cold start	`84.08s`	`69,963.908 ms`	`237.519 ms`	`512x512 / 10s / 250 frames`

Interpretation:

The OmniRT subprocess wrapper can complete real audio2video generation on 910B.
In cold-start mode, 4 NPU does not beat 2 NPU; distributed initialization, model loading, data preparation, and first-operator warmup offset the extra parallelism.
To measure steady-state multi-NPU gains, benchmark a future persistent_worker / resident path instead of a single cold-start script launch.

Artifact Checks¶

Config	RunReport `run_id`	Remote output directory	Checks
2 NPU	`8361015e-f1d6-4ad7-8c3c-6f3680354fa1`	`outputs/flashhead_bench_20260428_180909`	`ffprobe`: `512x512 / 25 FPS / 10.0s / 250 frames`; no `blackdetect` / `freezedetect` warnings
4 NPU	`79ebe868-609a-4f5f-a571-6366d984aeb2`	`outputs/flashhead_bench_20260428_181056`	`ffprobe`: `512x512 / 25 FPS / 10.0s / 250 frames`; no `blackdetect` / `freezedetect` warnings

Notes¶

Running the OmniRT CLI in the remote venv requires protobuf and grpcio; this run installed them into /path/to/flashhead-venv from the Tsinghua PyPI mirror.
This page tracks the OmniRT subprocess cold-start path, not service-mode hot latency.
latent_carry=false is the default quality profile. latent_carry=true can reduce part of VAE encode overhead, but prior adaptation notes observed style drift, so it is not the default display profile.