# OmniRT Architecture
OmniRT has evolved from a single-process pipeline wrapper into a generation runtime with queues, executors, observability, and remote-worker extension points.
## Main layers
### 1. Interface layer

- Python API: `omnirt.generate(...)`, `omnirt.validate(...)`
- CLI: `generate` / `validate` / `models` / `serve` / `bench` / `worker`
- HTTP: native `/v1/generate` plus OpenAI-compatible routes

This layer normalizes all external inputs into `GenerateRequest`.
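To make the normalization step concrete, here is a minimal sketch. The field names on `GenerateRequest` and the `normalize_http_payload` helper are assumptions for illustration, not OmniRT's actual schema:

```python
from dataclasses import dataclass, field
from typing import Any


# Hypothetical request type: field names are illustrative only.
@dataclass
class GenerateRequest:
    model: str
    task: str
    params: dict[str, Any] = field(default_factory=dict)


def normalize_http_payload(payload: dict) -> GenerateRequest:
    """Fold an HTTP request body into the single internal request type."""
    return GenerateRequest(
        model=payload["model"],
        task=payload.get("task", "text-to-image"),
        # Everything that is not a routing field becomes an execution param.
        params={k: v for k, v in payload.items() if k not in ("model", "task")},
    )
```

The point of this funnel is that the CLI and Python API would feed the same internal type, so every layer below sees one shape regardless of entry point.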
### 2. Contract and registry layer

- `GenerateRequest` / `GenerateResult` / `RunReport`
- registry / presets / validation
- model alias resolution

This layer answers "is the request valid?" and "what final execution config should be used?"
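A toy sketch of how alias resolution and preset merging could answer both questions; the `ALIASES` / `PRESETS` tables and the `resolve` function are illustrative stand-ins, not OmniRT's registry API:

```python
# Hypothetical registry tables; names and contents are assumptions.
ALIASES = {"sdxl": "stabilityai/stable-diffusion-xl-base-1.0"}
PRESETS = {"stabilityai/stable-diffusion-xl-base-1.0": {"steps": 30, "guidance": 5.0}}


def resolve(model: str, overrides: dict) -> dict:
    """Validate the model and produce the final execution config."""
    model_id = ALIASES.get(model, model)        # alias resolution
    if model_id not in PRESETS:                 # "is the request valid?"
        raise ValueError(f"unknown model: {model}")
    # Preset defaults first, caller overrides last.
    return {"model": model_id, **PRESETS[model_id], **overrides}
```

The merge order matters: presets supply defaults, and explicit request fields win.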
### 3. Engine and dispatch layer

- `OmniEngine`
- `JobQueue`
- `RequestBatcher`
- `InMemoryJobStore` / `RedisJobStore`
- `Controller`

This layer owns:

- unified sync and async entry points
- local queueing and job lifecycle
- batching
- deciding whether a request stays local or is forwarded to a remote worker
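The local-vs-remote decision can be sketched as a small routing function. This is an assumption about the policy, not the actual `Controller` logic:

```python
from dataclasses import dataclass


# Illustrative worker descriptor; the real Controller's state may differ.
@dataclass
class Worker:
    name: str
    families: set


def route(family: str, local_families: set, workers: list) -> str:
    """Keep the job local when the model family is supported locally,
    otherwise forward it to the first remote worker that can serve it."""
    if family in local_families:
        return "local"
    for w in workers:
        if family in w.families:
            return w.name
    raise RuntimeError(f"no executor available for family {family!r}")
```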
### 4. Executor layer

There are currently three execution paths:

| `execution_mode` | Meaning |
|---|---|
| `modular` | component-oriented path for migrated families |
| `legacy_call` | wrapper path around existing Diffusers pipelines |
| `subprocess` | external script / repository-driven execution, such as FlashTalk |

`ModelSpec.execution_mode` decides which path the engine takes.
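The dispatch on `execution_mode` amounts to a table lookup. A minimal sketch, with string stand-ins in place of real executor classes:

```python
# Stand-in executor names; the actual executor classes are internal to OmniRT.
_EXECUTORS = {
    "modular": "ModularExecutor",
    "legacy_call": "LegacyCallExecutor",
    "subprocess": "SubprocessExecutor",
}


def select_executor(execution_mode: str) -> str:
    """Map a ModelSpec.execution_mode value to an execution path."""
    try:
        return _EXECUTORS[execution_mode]
    except KeyError:
        raise ValueError(f"unknown execution_mode: {execution_mode}") from None
```

An unknown mode fails fast at dispatch time rather than deep inside an executor.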
### 5. Model / launcher / backend layer

- model-family implementations live under `src/omnirt/models/`
- launchers handle `python` / `torchrun` / `accelerate`
- backends wrap device and compile behavior

This is also where OmniRT applies:

- `device_map` / `devices`
- legacy official optimization switches
- quantization / layerwise casting / TeaCache
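One way to picture how these switches fold into a final runtime config. The flag names and config keys below are assumptions for illustration, not OmniRT's real options:

```python
def apply_optimizations(base: dict, *, quantize: bool = False,
                        layerwise_cast: bool = False,
                        teacache: bool = False) -> dict:
    """Fold backend-level optimization switches into a runtime config.
    All keys written here are hypothetical examples."""
    cfg = dict(base)                      # never mutate the caller's config
    if quantize:
        cfg["weight_dtype"] = "int8"
    if layerwise_cast:
        cfg["layerwise_casting"] = True
    if teacache:
        cfg["cache"] = "teacache"
    return cfg
```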
### 6. Observability layer

- `RunReport`
- Prometheus metrics
- trace recorder plus OTLP exporter
- SSE / WebSocket / Realtime event streams

This layer makes a single execution visible both inside the response and outside the process.
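A toy trace recorder conveys the idea: per-step events are captured once and could then be exported (OTLP) or streamed (SSE/WebSocket). The event shape here is an assumption:

```python
import time


class TraceRecorder:
    """Minimal sketch of a trace recorder; not OmniRT's real recorder."""

    def __init__(self):
        self.events = []

    def record(self, name: str, **attrs):
        # Each event carries a name, a timestamp, and arbitrary attributes.
        self.events.append({"name": name, "ts": time.time(), **attrs})


recorder = TraceRecorder()
recorder.record("executor.start", model="sdxl")
recorder.record("executor.finish", status="ok")
```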
## Synchronous execution path

1. the interface layer builds `GenerateRequest`
2. validation resolves the model, task, and config
3. `OmniEngine.run_sync()` selects either a local executor or a remote worker
4. the executor runs the model and returns `GenerateResult`
5. telemetry fills `RunReport`
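The steps above can be sketched end to end. Every value in this function is a plain-dict stand-in for the corresponding OmniRT component, not the real implementation:

```python
def run_sync(raw: dict) -> dict:
    """Illustrative walk through the five synchronous steps."""
    # 1-2: interface layer builds the request, validation resolves config
    request = {"model": raw["model"], "prompt": raw["prompt"]}
    config = {**request, "steps": raw.get("steps", 30)}
    # 3-4: the selected executor runs the model and produces a result
    result = {"artifact": f"generated:{config['model']}"}
    # 5: telemetry fills the run report alongside the result
    report = {"schema_version": 1, "steps": config["steps"]}
    return {"result": result, "run_report": report}
```

Note that the report travels with the response, which is the "visible inside the response" half of the observability story.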
## Asynchronous execution path

1. `POST /v1/generate` with `async_run=true`
2. the engine creates a job and writes it to the `JobStore`
3. the queue, batcher, and controller process the job
4. events are continuously appended to the job stream
5. clients consume the job through `GET /v1/jobs/{id}`, SSE, WebSocket, or the trace route
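A minimal in-memory sketch of that job lifecycle. The states, field names, and methods are illustrative, not the actual `JobStore` interface:

```python
import uuid


class JobStore:
    """Toy in-memory job store mirroring the create/append/finish/get flow."""

    def __init__(self):
        self.jobs = {}

    def create(self, request: dict) -> str:
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"status": "queued", "events": [], "request": request}
        return job_id

    def append_event(self, job_id: str, event: str) -> None:
        # Corresponds to events being appended to the job stream.
        self.jobs[job_id]["events"].append(event)

    def finish(self, job_id: str, result: dict) -> None:
        job = self.jobs[job_id]
        job["status"] = "done"
        job["result"] = result

    def get(self, job_id: str) -> dict:
        # Roughly what GET /v1/jobs/{id} would serialize.
        return self.jobs[job_id]
```

Swapping this for a Redis-backed store is what lets multiple processes share the same job state.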
## Distributed extension points

Extension points already implemented:

- gRPC worker transport
- `Controller` routing to remote workers
- `RedisJobStore`
- OTLP/HTTP trace export

Still intentionally lightweight:

- the gRPC transport is a minimal unary RPC transport, not a full control plane
- backend support is intentionally focused on CUDA / Ascend / `cpu-stub`; there is no ROCm / XPU support plan
## Stable public contracts

The most important stable public surfaces are:

- `GenerateRequest`
- `GenerateResult`
- `RunReport.schema_version`

Executors, middleware, and launchers can continue to evolve internally, but these three should remain backward compatible whenever possible.
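A typical way to enforce that stability is a version guard on consumption. This is a sketch; the supported version set is an assumption:

```python
# Hypothetical: the set of RunReport schema versions a client understands.
SUPPORTED_SCHEMA_VERSIONS = {1, 2}


def check_report(report: dict) -> dict:
    """Reject reports whose schema_version this client cannot parse."""
    version = report.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported RunReport schema_version: {version}")
    return report
```

Bumping `schema_version` only on breaking changes lets old clients fail loudly instead of silently misreading new reports.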