CUDA Deployment (NVIDIA GPU)¶
OmniRT's public baseline on NVIDIA GPUs: single-card inference on Ampere and newer.
Hardware requirements¶
| Item | Requirement | Notes |
|---|---|---|
| GPU | Ampere+ | A100 / L40S / RTX 3090 / 4090; torch.compile is only stable on Ampere+ |
| VRAM | per-model resource_hint.min_vram_gb |
check exact values with omnirt models <id>: SDXL ≥ 12 GB, SVD ≥ 14 GB, Flux2 / Wan2.2 ≥ 24 GB; SD1.5 and similar experimental models are compatibility references |
| Driver | ≥ 535 (pairs with CUDA 12.1) | verify with nvidia-smi |
| CUDA Toolkit | 12.1 or 12.4 | must match the PyTorch wheel |
| PyTorch | 2.1+ official CUDA wheel | e.g. torch==2.5.1+cu121 |
Install¶
Smoke test¶
# Confirm CUDA is available
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Dry-run validate the request contract against cpu-stub
omnirt validate --task text2image --model sdxl-base-1.0 --prompt "a lighthouse" --backend cpu-stub
# Run a real generation
omnirt generate --task text2image --model sdxl-base-1.0 \
--prompt "a lighthouse in fog" --backend cuda --preset fast
FlashTalk CUDA¶
soulx-flashtalk-14b uses an isolated SoulX-FlashTalk runtime. The GPU path follows the official setup: CUDA 12.8 PyTorch wheels, flash_attn==2.8.0.post2 --no-build-isolation, and no Ascend code patch.
python -m omnirt.cli.main runtime install flashtalk --device cuda
OMNIRT_FLASHTALK_DEVICE=cuda OMNIRT_FLASHTALK_VISIBLE_DEVICES=0 bash scripts/start_flashtalk_ws.sh
See FlashTalk-compatible WebSocket for WebSocket serving, OpenTalking integration, and checkpoint path configuration.
Production tuning¶
torch.compile: on by default, stable on Ampere+. If compilation fails, setOMNIRT_DISABLE_COMPILE=1to skip — failures are recorded inRunReport.backend_timeline.- Device visibility:
CUDA_VISIBLE_DEVICES=0(single card) or0,1(multi-card — but note: multi-GPU parallelism, USP, and CFG sharding are not yet public features, see PLAN.md). - VRAM peak: inspect
RunReport.memory; on OOM, switch to--preset low-vramor dropwidth/height/num_frames. - Telemetry:
omnirt.middleware.telemetryemits stage timings, peak memory, and fallback events — see Telemetry. - Serving: for FastAPI deployment see HTTP Server; production defaults should use
--model-tier core --model-tier adjacent.
Known issues¶
Warning
torch.compilecrashes on older cards —OMNIRT_DISABLE_COMPILE=1falls back to eager; each fallback is trackedflashinfer/ custom attention kernel miss — failed kernel overrides automatically fall back to eager attention; checkkernel_overrideentries inRunReport.backend_timeline- Slow Triton compile with older versions — upgrade to the
tritonversion PyTorch recommends