# Benchmark Baseline
This document captures the recommended benchmark methodology for OmniRT and what should be archived before a release. The goal is not one absolute set of numbers for every machine; it is reproducible comparisons across phases and commits.
## Standard output shape
`omnirt bench` currently emits these core fields:
| Field | Meaning |
|---|---|
| `throughput_rps` | throughput in requests per second |
| `latency_ms.p50 / p95 / p99` | end-to-end latency percentiles |
| `ttft_ms.p50 / p95 / p99` | time-to-first-event percentiles |
| `peak_vram` | peak memory / VRAM observed in this run |
| `cache_hit_ratio` | fraction of requests that hit the result cache |
| `batch_size_mean` | average batch size |
| `batched_request_ratio` | fraction of requests that were merged into a batch |
| `execution_mode_breakdown` | distribution across `modular` / `legacy_call` / `subprocess` / `persistent_worker` |
Always persist the JSON report with `--output`; every comparison below reads fields back out of the archived file.
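For a quick look at an archived report, a `jq` one-liner like the sketch below works, assuming `jq` is installed and the fields sit at the top level of the JSON as listed in the table above:

```sh
# Pull the headline fields out of a saved report (field layout assumed
# from the table above; adjust the paths if your schema nests differently).
jq '{throughput_rps, p95: .latency_ms.p95, cache_hit_ratio,
     batch_size_mean, batched_request_ratio}' bench-sdxl-c4.json
```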
## Built-in scenario
Current built-in scenario: `text2image_sdxl_concurrent4`.
Example:
```sh
omnirt bench \
  --scenario text2image_sdxl_concurrent4 \
  --total 100 \
  --warmup 2 \
  --output bench-sdxl-c4.json
```
## SoulX-LiveAct Ascend Baseline
The `soulx-liveact-14b` script-backed wrapper has completed an Ascend real-hardware validation run. Test scope:
- Inputs: `examples/image/1.png` + `examples/audio/1.wav`
- Resolution / FPS: 416x720, fps=20
- Inference: `--sample-steps 1 --rank0-t5-only --use-lightvae --vae-path models/vae/lightvaew2_1.pth --use-cache-vae --stage-profile`
- Placement: `--text-cache-visible-devices 2 --visible-devices 2,3,4,5`, meaning one NPU prepares the T5 text cache before the 4-NPU inference job
- Output video: 416x720, 20 fps, 755 frames, 37.75 s
| run | cache state | wall_s | stage total avg (s) | key stages |
|---|---|---|---|---|
| cold | text cache rebuilt; condition cache miss | 190 | 112.9584 | `prepare_text_cache` total=11.10s, `sample_model_forward` avg=9.8508s, `vae_decode` avg=21.3054s |
| warm | text cache hit; condition cache hit; still initialized T5 | 207 | 132.6558 | `prepare_text_cache` total=10.09s, `sample_model_forward` avg=16.9334s, `vae_decode` avg=22.7612s |
| warm2 | text cache skipped; condition cache hit | 169 | 121.2429 | `sample_model_forward` avg=15.9815s, `vae_decode` avg=23.3559s, `export` avg=13.5125s |
Use warm2 as the current warm baseline. The first warm run measured 207 s because the wrapper still called the upstream `prepare_text_cache.py` on a cache hit; that script initializes T5 before checking for the hit, adding about 10 s. The wrapper now checks `/tmp/liveact_text_ctx_*.pt` before launching the script and skips it when all expected cache files exist. This path does not use CPU T5.
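For illustration, a minimal bash sketch of that skip logic; the arguments passed to the upstream script are placeholders, not the wrapper's real invocation:

```sh
# Treat existing /tmp/liveact_text_ctx_*.pt files as a text-cache hit
# and skip launching prepare_text_cache.py entirely (which would
# initialize T5 before checking the cache). compgen -G exits 0 iff
# the glob matches at least one file.
if compgen -G "/tmp/liveact_text_ctx_*.pt" > /dev/null; then
  echo "text cache hit: skipping prepare_text_cache.py"
else
  python prepare_text_cache.py "$@"  # placeholder args, illustration only
fi
```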
## P2 close-out baseline: result cache
The goal is to verify that prompt-embedding reuse actually hits on repeated same-prompt requests.
Suggested command:
```sh
omnirt bench \
  --task text2image \
  --model sdxl-base-1.0 \
  --prompt "a cinematic portrait of a traveler under neon rain" \
  --width 1024 \
  --height 1024 \
  --num-inference-steps 30 \
  --concurrency 1 \
  --total 50 \
  --warmup 1 \
  --batch-window-ms 0 \
  --max-batch-size 1 \
  --output bench-cache.json
```
What to inspect:
- `cache_hit_ratio` should be clearly above 0
- inspect `RunReport.cache_hits` and confirm `text_embedding` appears
- for stage-level analysis, pair the run with `/v1/jobs/{id}/trace` or structured logs and inspect `encode_prompt`
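A quick post-run check of the first two signals; this assumes `RunReport` serializes `cache_hits` as a top-level JSON field next to `cache_hit_ratio`, so adjust if your schema differs:

```sh
# cache_hit_ratio should be well above 0, and text_embedding should
# appear among the recorded cache hits.
jq '{cache_hit_ratio, cache_hits}' bench-cache.json
```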
## P3 close-out baseline: dynamic batching
The goal is to verify that the modular text2image path actually batches under concurrency.
Control:
```sh
omnirt bench \
  --scenario text2image_sdxl_concurrent4 \
  --total 100 \
  --batch-window-ms 0 \
  --max-batch-size 1 \
  --output bench-nobatch.json
```
Experiment:
```sh
omnirt bench \
  --scenario text2image_sdxl_concurrent4 \
  --total 100 \
  --batch-window-ms 50 \
  --max-batch-size 4 \
  --output bench-batch.json
```
What to inspect:
- whether `throughput_rps` improves
- whether `batch_size_mean` rises above 1
- whether `batched_request_ratio` is materially above 0
- whether `latency_ms.p95` still fits your service target
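One way to put the two reports side by side, assuming `jq` and the top-level field layout from the table at the top of this page:

```sh
# Print the batching signals from the control and experiment runs.
for f in bench-nobatch.json bench-batch.json; do
  echo "== $f"
  jq '{throughput_rps, batch_size_mean, batched_request_ratio,
       p95: .latency_ms.p95}' "$f"
done
```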
## What to archive before release
Each benchmark artifact should be stored with:
- commit SHA
- hardware model and count
- backend (CUDA / Ascend / ...)
- driver, Torch, and Diffusers versions
- model source and weight precision
- full CLI command
- the JSON report itself
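A minimal sketch of capturing that context next to the report; the `nvidia-smi` line applies to CUDA hosts only, and the filenames are illustrative:

```sh
# Record run context alongside bench-sdxl-c4.json.
{
  echo "commit: $(git rev-parse HEAD)"
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader  # CUDA hosts
  python -c 'import torch, diffusers; print("torch", torch.__version__, "diffusers", diffusers.__version__)'
} > bench-sdxl-c4.meta.txt
```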
## CI baseline vs real-hardware baseline
- CI / local fake runtime: verify report structure, non-zero metrics, and schema stability
- real hardware benchmark: verify throughput, latency, VRAM, and multi-device gains
Do not treat CPU stub or fake-runtime numbers as release performance claims. They are best used for contract and regression checks.
## How to compare results
- compare two commits on the same machine first
- then compare one commit across "cache off vs cache on" or "batching off vs batching on"
- avoid cross-machine, cross-driver, or cross-weight-source absolute comparisons
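For the same-machine, two-commit comparison, a simple delta check over two archived reports; the filenames are illustrative and the field layout is assumed as above:

```sh
# Positive rps_delta with a flat or negative p95_delta is a clean win.
jq -n --slurpfile a bench-commitA.json --slurpfile b bench-commitB.json \
  '{rps_delta: ($b[0].throughput_rps - $a[0].throughput_rps),
    p95_delta: ($b[0].latency_ms.p95 - $a[0].latency_ms.p95)}'
```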