
CUDA Deployment (NVIDIA GPU)

OmniRT's public baseline on NVIDIA GPUs: single-card inference on Ampere and newer.

Hardware requirements

| Item | Requirement | Notes |
| --- | --- | --- |
| GPU | Ampere or newer | A100 / L40S / RTX 3090 / 4090; torch.compile is only stable on Ampere+ |
| VRAM | per-model resource_hint.min_vram_gb | check exact values with omnirt models <id>: SD1.5 ≥ 8 GB, SDXL ≥ 12 GB, SVD ≥ 14 GB, Flux2 / Wan2.2 ≥ 24 GB |
| Driver | ≥ 535 (pairs with CUDA 12.1) | verify with nvidia-smi |
| CUDA Toolkit | 12.1 or 12.4 | must match the PyTorch wheel |
| PyTorch | 2.1+ official CUDA wheel | e.g. torch==2.5.1+cu121 |
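
As a pre-flight check, the snippet below verifies compute capability and total VRAM with standard torch.cuda calls. The thresholds are copied from the table; every model id except sd15 is an assumed spelling, so confirm them with omnirt models <id>.

import torch

# Minimum VRAM per model, in GB, from the table above.
# Only "sd15" is a documented model id; the others are assumed spellings.
MIN_VRAM_GB = {"sd15": 8, "sdxl": 12, "svd": 14, "flux2": 24, "wan22": 24}

assert torch.cuda.is_available(), "no CUDA device visible"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: compute capability {props.major}.{props.minor}, {vram_gb:.1f} GB")

# Ampere and newer report compute capability 8.0 or higher.
assert props.major >= 8, "torch.compile is only stable on Ampere+"

model = "sdxl"
if vram_gb < MIN_VRAM_GB[model]:
    print(f"{model} wants >= {MIN_VRAM_GB[model]} GB; consider --preset low-vram")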

Install

# 1. Clone the repository
git clone https://github.com/datascale-ai/omnirt.git
cd omnirt

# 2. Install the matching CUDA PyTorch wheel (pick the index on pytorch.org)
python -m pip install torch==2.5.1 torchvision==0.20.1 \
  --index-url https://download.pytorch.org/whl/cu121

# 3. Install OmniRT plus runtime extras
python -m pip install -e '.[runtime,dev]'

# Alternatively, do steps 2-3 with uv
uv pip install torch==2.5.1 torchvision==0.20.1 \
  --index https://download.pytorch.org/whl/cu121
uv pip install -e '.[runtime,dev]'
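
Before moving on, confirm that the wheel you installed actually carries the CUDA build you intended:

import torch

# The +cuXXX suffix and torch.version.cuda must agree with the toolkit row
# in the table above; a mismatch means the wrong wheel index was used.
print(torch.__version__)   # expect something like 2.5.1+cu121
print(torch.version.cuda)  # expect 12.1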

Smoke test

# Confirm CUDA is available
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Dry-run validate the request contract against cpu-stub
omnirt validate --task text2image --model sd15 --prompt "a lighthouse" --backend cpu-stub

# Run a real generation
omnirt generate --task text2image --model sd15 \
  --prompt "a lighthouse in fog" --backend cuda --preset fast

Production tuning

  • torch.compile: on by default, stable on Ampere+. If compilation fails, set OMNIRT_DISABLE_COMPILE=1 to skip — failures are recorded in RunReport.backend_timeline.
  • Device visibility: CUDA_VISIBLE_DEVICES=0 (single card) or 0,1 (multi-card — but note: multi-GPU parallelism, USP, and CFG sharding are not yet public features, see PLAN.md).
  • VRAM peak: inspect RunReport.memory (see the sketch after this list); on OOM, switch to --preset low-vram or reduce width/height / num_frames.
  • Telemetry: omnirt.middleware.telemetry emits stage timings, peak memory, and fallback events — see Telemetry.
  • Serving: for FastAPI deployment see HTTP Server.
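
The sketch below puts those knobs together. Only the CLI is documented in this guide, so the omnirt.generate entry point and the exact shape of the RunReport fields are assumptions; adapt the import and attribute access to what the package actually exports.

import os

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # pin to a single card
# os.environ["OMNIRT_DISABLE_COMPILE"] = "1"        # uncomment if torch.compile keeps failing

from omnirt import generate  # hypothetical Python entry point; the CLI is the documented interface

report = generate(task="text2image", model="sd15",
                  prompt="a lighthouse in fog", backend="cuda", preset="fast")

# RunReport fields named in this guide: memory (peak VRAM) and backend_timeline
# (stage timings plus compile / kernel_override fallback events).
print("peak memory:", report.memory)
for event in report.backend_timeline:
    if "fallback" in str(event).lower():
        print("fallback recorded:", event)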

Known issues

Warning

  • torch.compile crashes on older cards: OMNIRT_DISABLE_COMPILE=1 falls back to eager; each fallback is tracked in RunReport.backend_timeline
  • flashinfer / custom attention kernel miss — failed kernel overrides automatically fall back to eager attention; check kernel_override entries in RunReport.backend_timeline
  • Slow Triton compile with older versions — upgrade to the Triton version PyTorch recommends (see the version probe below)
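
To see which Triton the interpreter actually imports (an older standalone install can shadow the copy that ships with the PyTorch wheel), run a quick version probe:

import torch
import triton  # the official CUDA wheels of PyTorch pull in a pinned Triton

# Each PyTorch release pins a recommended Triton; if the pair below looks stale,
# reinstall the torch wheel to pick the matching version back up.
print("torch: ", torch.__version__)
print("triton:", triton.__version__)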