Distributed Serving

When a single-process omnirt serve is no longer enough, split OmniRT into one HTTP gateway plus one or more gRPC workers, then add Redis and OTLP as needed.

Topologies

Topology	Best for	Key components
Single-process server	development, local validation, light internal services	`omnirt serve`
Gateway + remote workers	separate heavy inference from HTTP ingress	`omnirt serve --remote-worker ...` + `omnirt worker`
Gateway + Redis + OTLP	async jobs, cross-process state, external observability	`--redis-url` + `--otlp-endpoint`

Minimal distributed example

Start one worker:

omnirt worker \
  --host 0.0.0.0 \
  --port 50061 \
  --worker-id sdxl-a \
  --backend cuda

Then start the gateway:

omnirt serve \
  --host 0.0.0.0 \
  --port 8000 \
  --backend auto \
  --remote-worker 'sdxl-a=127.0.0.1:50061@sdxl-base-1.0,sdxl-refiner-1.0'

--remote-worker uses this format:

worker_id=host:port@model1,model2#tag1,tag2

@model1,model2 declares which models the worker should prefer
#tag1,tag2 is an optional routing tag set
omnirt serve probes each worker's health during startup and fails fast if a worker is unreachable
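
To make the spec format concrete, the following Python sketch splits a `--remote-worker` value into its parts and rebuilds it. It is an illustration only, not OmniRT's own parser: the names `RemoteWorkerSpec`, `parse_remote_worker`, `format_remote_worker`, and `tcp_reachable` are hypothetical, and treating the `@models` part as optional is an assumption (the text above only states that `#tags` is optional). The `tcp_reachable` helper is a plain TCP connect that you could run before launching the gateway; it does not reproduce whatever health probe `omnirt serve` performs.

```python
import socket
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RemoteWorkerSpec:
    """Hypothetical container for one --remote-worker value."""
    worker_id: str
    host: str
    port: int
    models: Tuple[str, ...] = ()
    tags: Tuple[str, ...] = ()


def _split_csv(text: str) -> Tuple[str, ...]:
    # "a,b" -> ("a", "b"); an empty string yields an empty tuple.
    return tuple(item.strip() for item in text.split(",") if item.strip())


def parse_remote_worker(spec: str) -> RemoteWorkerSpec:
    """Parse worker_id=host:port@model1,model2#tag1,tag2."""
    worker_id, sep, rest = spec.partition("=")
    if not sep or not worker_id:
        raise ValueError(f"expected worker_id=host:port, got {spec!r}")
    rest, _, tag_part = rest.partition("#")        # optional tag set
    address, _, model_part = rest.partition("@")   # assumed optional here
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return RemoteWorkerSpec(worker_id, host, int(port),
                            _split_csv(model_part), _split_csv(tag_part))


def format_remote_worker(worker: RemoteWorkerSpec) -> str:
    """Inverse of parse_remote_worker."""
    text = f"{worker.worker_id}={worker.host}:{worker.port}"
    if worker.models:
        text += "@" + ",".join(worker.models)
    if worker.tags:
        text += "#" + ",".join(worker.tags)
    return text


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Plain TCP connect check; says nothing about gRPC-level health."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


spec = parse_remote_worker(
    "sdxl-a=127.0.0.1:50061@sdxl-base-1.0,sdxl-refiner-1.0"
)
print(spec.worker_id, spec.host, spec.port, spec.models, spec.tags)
```

Calling `tcp_reachable(spec.host, spec.port)` before starting the gateway gives an early hint about an unreachable worker, but only the gateway's own startup probe is authoritative.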

Adding Redis and OTLP

omnirt serve \
  --host 0.0.0.0 \
  --port 8000 \
  --redis-url redis://127.0.0.1:6379/0 \
  --otlp-endpoint http://127.0.0.1:4318/v1/traces \
  --remote-worker 'sdxl-a=10.0.0.21:50061@sdxl-base-1.0'

omnirt worker \
  --host 0.0.0.0 \
  --port 50061 \
  --worker-id sdxl-a \
  --backend cuda \
  --redis-url redis://127.0.0.1:6379/0 \
  --otlp-endpoint http://127.0.0.1:4318/v1/traces

Recommended conventions:

use the same Redis deployment for the gateway and workers so job state and event streams stay consistent
export both gateway and worker traces to the same OTLP endpoint so worker_id appears in the full trace view
keep gRPC traffic on a private network, inside a service mesh, or behind a reverse proxy; the current transport uses plain grpc.insecure_channel, so the channel itself provides no TLS or authentication
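
One way to keep the first two conventions from drifting is to generate both command lines from a single pair of settings. The sketch below is a hypothetical helper, not part of OmniRT: `gateway_argv`, `worker_argv`, and `flag_value` are invented names, and the only CLI flags used are the ones shown in the commands above. It passes a single `--remote-worker` value because this page does not say whether the flag can be repeated.

```python
import shlex
from typing import List


def gateway_argv(redis_url: str, otlp_endpoint: str, remote_worker: str,
                 host: str = "0.0.0.0", port: int = 8000) -> List[str]:
    """Build the `omnirt serve` command line from shared settings."""
    return [
        "omnirt", "serve",
        "--host", host,
        "--port", str(port),
        "--redis-url", redis_url,
        "--otlp-endpoint", otlp_endpoint,
        "--remote-worker", remote_worker,
    ]


def worker_argv(worker_id: str, redis_url: str, otlp_endpoint: str,
                backend: str = "cuda", host: str = "0.0.0.0",
                port: int = 50061) -> List[str]:
    """Build the matching `omnirt worker` command line."""
    return [
        "omnirt", "worker",
        "--host", host,
        "--port", str(port),
        "--worker-id", worker_id,
        "--backend", backend,
        "--redis-url", redis_url,
        "--otlp-endpoint", otlp_endpoint,
    ]


def flag_value(argv: List[str], flag: str) -> str:
    """Return the value that follows `flag` in an argv list."""
    return argv[argv.index(flag) + 1]


# Single source of truth for the shared services.
REDIS_URL = "redis://127.0.0.1:6379/0"
OTLP_ENDPOINT = "http://127.0.0.1:4318/v1/traces"

gateway = gateway_argv(REDIS_URL, OTLP_ENDPOINT,
                       "sdxl-a=10.0.0.21:50061@sdxl-base-1.0")
worker = worker_argv("sdxl-a", REDIS_URL, OTLP_ENDPOINT)

# Print shell-safe command lines for copy/paste or a process manager.
print(shlex.join(gateway))
print(shlex.join(worker))
```

Because both argv lists are derived from the same two constants, a mismatch between gateway and worker would have to be introduced deliberately rather than by a typo in one of two hand-written commands.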

Validation checklist

After startup, these probes quickly tell you whether the deployment is wired correctly:

curl -sS http://127.0.0.1:8000/readyz
curl -sS http://127.0.0.1:8000/metrics | head

Expected signals:

/readyz returns job_store_backend and remote_worker_count
/metrics includes omnirt_jobs_total, omnirt_stage_duration_seconds, and omnirt_queue_depth
async jobs can be observed through /v1/jobs/{id}, /v1/jobs/{id}/events, and /v1/jobs/{id}/stream
with OTLP enabled, /v1/jobs/{id}/trace returns the trace view for the same job
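
The first two signals can also be checked from a script. The following is a minimal sketch under two assumptions that this page does not confirm: that `/readyz` returns a JSON object, and that `/metrics` uses the Prometheus text exposition format. The function names (`missing_ready_keys`, `missing_metrics`, `fetch`, `validate`) are hypothetical. Histogram-style metrics are matched by prefix, so a sample such as `omnirt_stage_duration_seconds_bucket` counts as `omnirt_stage_duration_seconds`.

```python
import json
import urllib.request
from typing import Dict, List

EXPECTED_READY_KEYS = ("job_store_backend", "remote_worker_count")
EXPECTED_METRICS = (
    "omnirt_jobs_total",
    "omnirt_stage_duration_seconds",
    "omnirt_queue_depth",
)


def missing_ready_keys(body: str) -> List[str]:
    """Return expected /readyz keys absent from a JSON body (assumed JSON)."""
    payload = json.loads(body)
    return [key for key in EXPECTED_READY_KEYS if key not in payload]


def missing_metrics(text: str) -> List[str]:
    """Return expected metric names absent from Prometheus-style text."""
    names = set()
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "#":
            # "# HELP name ..." and "# TYPE name ..." also declare a metric.
            if len(parts) >= 3 and parts[1] in ("HELP", "TYPE"):
                names.add(parts[2])
            continue
        # A sample line starts with name or name{labels}.
        names.add(parts[0].split("{", 1)[0])
    return [
        metric for metric in EXPECTED_METRICS
        if not any(n == metric or n.startswith(metric + "_") for n in names)
    ]


def fetch(url: str, timeout: float = 5.0) -> str:
    """GET a URL and return the decoded body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def validate(base_url: str = "http://127.0.0.1:8000") -> Dict[str, List[str]]:
    """Probe a running gateway; empty lists mean every signal was found."""
    return {
        "readyz_missing": missing_ready_keys(fetch(base_url + "/readyz")),
        "metrics_missing": missing_metrics(fetch(base_url + "/metrics")),
    }
```

Against a running gateway, `validate()` returns two lists; any entry names a signal from the list above that was not found. The job endpoints are left out because their response shapes are not described here.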

Recommended rollout order

1. Start with single-process omnirt serve
2. Add Redis and stabilize async jobs plus event streaming
3. Add one remote worker and verify routing plus /readyz
4. Finally, add Prometheus scraping and OTLP export

Known boundaries

the current worker transport is a minimal unary gRPC transport, not a full multi-tenant RPC framework
backend support is intentionally limited to CUDA, Ascend, and cpu-stub; there are no plans to support ROCm or XPU
real multi-host load testing and GPU/NPU baselines should still be run in your target environment