Legacy Optimization Guide
This guide targets families still running with execution_mode="legacy_call", such as sd15, sd3, svd, animatediff_sdxl, chronoedit, generalist_image, and video_family.
The goal is not to guarantee that every knob works for every model. The goal is to expose the optimizations already understood by upstream Diffusers and the runtime through one consistent OmniRT request surface.
Design principles
- make the model run first, optimize second: confirm the baseline path before stacking offload, layout, cache, or quantization
- turn on one class of optimization at a time: this keeps regressions explainable
- everything is best-effort: if an upstream pipeline or component does not expose a compatible hook, OmniRT skips it instead of forcing a hard failure
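The best-effort principle can be illustrated with a small sketch. It is not OmniRT's actual dispatch code: the `HOOKS` mapping, `apply_best_effort`, and `FakePipeline` are hypothetical names introduced here, and only the method names follow upstream Diffusers.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical mapping from config keys to pipeline method names.
# The method names follow upstream Diffusers; the mapping itself is an
# illustration, not OmniRT's real dispatch table.
HOOKS: Dict[str, str] = {
    "enable_model_cpu_offload": "enable_model_cpu_offload",
    "enable_vae_slicing": "enable_vae_slicing",
    "fuse_qkv": "fuse_qkv_projections",
}


def apply_best_effort(pipeline: Any, config: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Call each requested hook if the pipeline exposes it; otherwise skip it."""
    applied: List[str] = []
    skipped: List[str] = []
    for key, method_name in HOOKS.items():
        if not config.get(key):
            continue  # key not requested
        method = getattr(pipeline, method_name, None)
        if callable(method):
            method()
            applied.append(key)
        else:
            skipped.append(key)  # no compatible hook: skip instead of raising
    return applied, skipped


class FakePipeline:
    """Stand-in pipeline that only exposes VAE slicing."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def enable_vae_slicing(self) -> None:
        self.calls.append("enable_vae_slicing")


pipe = FakePipeline()
applied, skipped = apply_best_effort(pipe, {"enable_vae_slicing": True, "fuse_qkv": True})
print("applied:", applied, "skipped:", skipped)
```

The point of the pattern is that a missing hook ends up in a "skipped" list that can be reported, rather than aborting the whole request.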
Available config keys
| Config key | Typical value | Purpose | Notes |
|---|---|---|---|
| enable_model_cpu_offload | true | model-level CPU offload | good first choice when VRAM is tight and extra latency is acceptable |
| enable_sequential_cpu_offload | true | more aggressive sequential offload | saves more VRAM and is usually slower |
| enable_group_offload | true | grouped offload | requires the pipeline to expose enable_group_offload() |
| group_offload_type | block_level | grouped-offload granularity | follows upstream semantics |
| group_offload_use_stream | true | use stream for offload transfer | mainly affects throughput and jitter |
| group_offload_disk_path | /path/to/cache | allow disk-backed offload | only works when the upstream implementation supports it |
| enable_vae_slicing | true | VAE slicing | common for large images or video decode pressure |
| enable_vae_tiling | true | VAE tiling | often paired with slicing |
| channels_last | true | torch.channels_last memory layout | common for convolution-heavy models |
| fuse_qkv | true | QKV fusion | only effective when the component exposes fuse_qkv_projections() |
| quantization | int8 / fp8 / nf4 | best-effort quantization entry point | pair with quantization_backend |
| quantization_backend | torchao | select the quantization backend | currently torchao is tried first |
| enable_layerwise_casting | true | layerwise casting | useful when VRAM is constrained but you want a higher compute dtype |
| layerwise_casting_storage_dtype | fp8_e4m3fn | layerwise storage dtype | real effect depends on backend and component support |
| layerwise_casting_compute_dtype | bf16 | layerwise compute dtype | commonly paired with enable_layerwise_casting |
| cache / enable_tea_cache | tea_cache / true | TeaCache entry point | actual reuse depends on compatible component hooks |
| tea_cache_ratio | 0.2 | TeaCache reuse hint | start small |
| tea_cache_interval | 2 | TeaCache interval hint | video workloads often benefit from larger values |
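Because a misspelled key is easy to miss, a client-side check against the key names in the table above may be useful before sending a request. The sketch below is an assumption-laden helper: `KNOWN_KEYS` and `unknown_keys` are hypothetical names, and nothing here describes how OmniRT itself treats unrecognized keys.

```python
from typing import Any, Dict, List

# Key names copied from the table above.
KNOWN_KEYS = frozenset({
    "enable_model_cpu_offload",
    "enable_sequential_cpu_offload",
    "enable_group_offload",
    "group_offload_type",
    "group_offload_use_stream",
    "group_offload_disk_path",
    "enable_vae_slicing",
    "enable_vae_tiling",
    "channels_last",
    "fuse_qkv",
    "quantization",
    "quantization_backend",
    "enable_layerwise_casting",
    "layerwise_casting_storage_dtype",
    "layerwise_casting_compute_dtype",
    "cache",
    "enable_tea_cache",
    "tea_cache_ratio",
    "tea_cache_interval",
})


def unknown_keys(config: Dict[str, Any]) -> List[str]:
    """Return config keys that do not appear in the table, sorted by name."""
    return sorted(key for key in config if key not in KNOWN_KEYS)


# "enable_vae_tilling" is a deliberate typo to show the check firing.
typos = unknown_keys({"enable_vae_slicing": True, "enable_vae_tilling": True})
print(typos)
```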
Recommended starting points
VRAM first
Suited to machines with 12 GB to 24 GB of VRAM, where the model only barely fits. Aim for "it runs" before tuning latency.
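One possible VRAM-first starting point, written as a Python dict that mirrors the `config:` block of a request. The keys come from the table in this guide; the particular combination and the `vram_first_request` name are assumptions for illustration, not a documented OmniRT preset.

```python
import json

# Assumed starting point: VAE slicing and tiling plus model-level CPU offload.
# The combination is illustrative; adjust it to the model family in use.
vram_first_request = {
    "config": {
        "enable_vae_slicing": True,
        "enable_vae_tiling": True,
        "enable_model_cpu_offload": True,
    }
}

# Print the request body for inspection.
print(json.dumps(vram_first_request, indent=2))
```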
Throughput first
Use this once the model is already stable and you want more throughput. Measure the results together with Benchmark Baseline.
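A throughput-first configuration might look like the sketch below. The keys come from the table in this guide, but the selection is an assumption, as are the names `throughput_first_config`, `OFFLOAD_KEYS`, and `enabled_offload_keys`. The helper flags offload keys, since the table notes that model-level and sequential offload add latency. Treating grouped offload the same way is an assumption.

```python
from typing import Any, Dict, List

# Offload keys from the table. Model-level and sequential offload are
# described there as costing latency; group offload is included by assumption.
OFFLOAD_KEYS = (
    "enable_model_cpu_offload",
    "enable_sequential_cpu_offload",
    "enable_group_offload",
)

# Assumed throughput-oriented selection: memory layout and QKV fusion only.
throughput_first_config: Dict[str, Any] = {
    "channels_last": True,
    "fuse_qkv": True,
}


def enabled_offload_keys(config: Dict[str, Any]) -> List[str]:
    """List the offload keys that are switched on in a config."""
    return [key for key in OFFLOAD_KEYS if config.get(key)]


print(enabled_offload_keys(throughput_first_config))
```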
Experimental memory reduction
config:
  quantization: int8
  quantization_backend: torchao
  enable_layerwise_casting: true
  layerwise_casting_compute_dtype: bf16
These hooks are available, but you should still validate quality and performance on target hardware.
Tuning order
- start with enable_vae_slicing and enable_vae_tiling
- if VRAM is still insufficient, try enable_model_cpu_offload
- if you need a more aggressive reduction, move to enable_sequential_cpu_offload or enable_group_offload
- leave quantization, layerwise casting, and TeaCache for last because their gains and tradeoffs are more model-specific
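The first three steps of this order can be sketched as an escalation ladder that tries each configuration in turn and stops at the first one that fits. Everything here is hypothetical scaffolding: `TUNING_STAGES`, `tuning_ladder`, and `first_config_that_fits` are illustrative names, the `fits` callback stands in for an actual trial run, and switching model-level offload off when moving to sequential offload is an assumption about how the two should be combined.

```python
from typing import Callable, Dict, Iterator, List, Optional

Config = Dict[str, object]

# Stages in the order listed above; each stage is merged into the previous config.
TUNING_STAGES: List[Config] = [
    {"enable_vae_slicing": True, "enable_vae_tiling": True},
    {"enable_model_cpu_offload": True},
    # Assumption: sequential offload replaces model-level offload.
    {"enable_model_cpu_offload": False, "enable_sequential_cpu_offload": True},
]


def tuning_ladder() -> Iterator[Config]:
    """Yield cumulative configs, from least to most aggressive."""
    config: Config = {}
    for stage in TUNING_STAGES:
        config = {**config, **stage}
        yield dict(config)


def first_config_that_fits(fits: Callable[[Config], bool]) -> Optional[Config]:
    """Return the least aggressive config for which the trial run succeeds."""
    for config in tuning_ladder():
        if fits(config):
            return config
    return None


# Stand-in for a real trial run: pretend the model only fits with model offload.
chosen = first_config_that_fits(lambda c: c.get("enable_model_cpu_offload") is True)
print(chosen)
```

Quantization, layerwise casting, and TeaCache are left out of the ladder on purpose, since the list above defers them to the end.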
How to tell whether it worked
- inspect RunReport.config_resolved for the final applied config
- inspect RunReport.cache_hits for entries such as text_embedding
- compare RunReport.device_placement, memory, and timings
- in server mode, combine /metrics and /v1/jobs/{id}/trace to see stage-level changes
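As a rough illustration of the first check, the requested config can be compared with the resolved one. This sketch assumes RunReport.config_resolved can be read as a plain mapping and that a key which was not applied is either absent or carries a different value; both are assumptions, and `diff_config` is a hypothetical helper.

```python
from typing import Any, Dict, Tuple

_MISSING = object()  # sentinel for keys absent from the resolved config


def diff_config(requested: Dict[str, Any], resolved: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each requested key whose resolved value differs to (requested, resolved)."""
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key, wanted in requested.items():
        actual = resolved.get(key, _MISSING)
        if actual is _MISSING:
            changes[key] = (wanted, None)  # key absent from the resolved config
        elif actual != wanted:
            changes[key] = (wanted, actual)
    return changes


# Example data standing in for a request and for RunReport.config_resolved.
requested = {"enable_vae_slicing": True, "fuse_qkv": True, "channels_last": True}
resolved = {"enable_vae_slicing": True, "fuse_qkv": False}
changes = diff_config(requested, resolved)
print(changes)
```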
Known boundaries
- legacy_call families do not guarantee full support for managed device_map placement
- torchao, TeaCache, and layerwise casting all depend on optional runtime dependencies and compatible upstream methods
- a config key being accepted does not mean every submodule applied the same optimization