跳转至

OmniRT Realtime Avatar WebSocket

OmniRT Native Realtime Avatar WebSocket is the long-term protocol for model-agnostic digital-human streaming. It keeps the efficient AUDI / VIDX binary framing from the FlashTalk-compatible path, but uses an OmniRT session control plane with session_id, trace_id, structured errors, and metrics.

Endpoint

WS /v1/avatar/realtime
GET /v1/audio2video/models
WS /v1/audio2video/flashtalk
WS /v1/audio2video/wav2lip

/v1/audio2video/flashtalk and /v1/audio2video/wav2lip are the public FlashTalk-compatible streaming paths for OpenTalking. /v1/avatar/flashtalk and /v1/avatar/wav2lip remain compatibility aliases. /v1/avatar/realtime is the model-agnostic control-plane protocol.

Session create

{
  "type": "session.create",
  "model": "soulx-flashtalk-14b",
  "backend": "auto",
  "inputs": {
    "image_b64": "<base64 png/jpeg>",
    "prompt": "A person is talking naturally."
  },
  "config": {
    "preset": "realtime",
    "seed": 9999,
    "wav2lip_postprocess_mode": false,
    "mouth_metadata": {
      "source_image_hash": "<sha256>",
      "animation": {
        "mouth_center": [0.5, 0.56],
        "mouth_rx": 0.06,
        "mouth_ry": 0.02,
        "outer_lip": [[0.45, 0.55], [0.5, 0.53], [0.55, 0.55]]
      }
    }
  }
}

Response:

{
  "type": "session.created",
  "session_id": "avt_...",
  "trace_id": "trace_...",
  "audio": {
    "format": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1,
    "chunk_samples": 17920
  },
  "video": {
    "encoding": "jpeg-seq",
    "wire_magic": "VIDX",
    "fps": 25,
    "width": 416,
    "height": 704
  }
}

Wav2Lip postprocess mode

Wav2Lip sessions accept wav2lip_postprocess_mode and optional mouth_metadata in session config. When disabled, OmniRT keeps native Wav2Lip output behavior. When enabled, the Wav2Lip runtime can use the supplied mouth polygon to blend the generated mouth region back into the reference frame with lower-lip coverage, feathering, and color matching.

The service default is off. It can be enabled process-wide with:

OMNIRT_WAV2LIP_POSTPROCESS_MODE=1 omnirt serve ...

The enhanced path exposes separate knobs for lower-lip coverage and jaw motion transfer:

OMNIRT_WAV2LIP_LOWER_LIP_DYNAMIC_EXPAND=0.25
OMNIRT_WAV2LIP_ENABLE_JAW_MOTION_BLEND=1
OMNIRT_WAV2LIP_JAW_BLEND_ALPHA=0.22
OMNIRT_WAV2LIP_JAW_MASK_EXPAND_X=0.25
OMNIRT_WAV2LIP_JAW_MASK_EXPAND_Y=0.55

Jaw motion blending is disabled by default so enhanced mouth blending and jaw motion can be A/B tested independently.

OpenTalking-compatible clients may also send the same fields in the init message to /v1/audio2video/wav2lip.

Audio and video chunks

Send audio:

b"AUDI" + pcm_s16le

The server sends a metrics event, then a video binary payload:

{"type": "metrics", "chunk_index": 1, "infer_ms": 0, "encode_ms": 0}
b"VIDX" + uint32(frame_count) + repeated(uint32(jpeg_len) + jpeg_bytes)

Control messages

{"type": "session.cancel"}
{"type": "session.close"}
{"type": "ping"}

Runtime 模式

v1 endpoint 保持 wire contract 稳定,不同部署可以选择不同 runtime:

模式 选择方式 说明
fake 默认,或 OMNIRT_REALTIME_AVATAR_RUNTIME=fake 为协议测试和 CPU-stub demo 输出确定性 JPEG chunk
proxy OMNIRT_REALTIME_AVATAR_RUNTIME=proxy + OMNIRT_AVATAR_FLASHTALK_WS_URL 把 FlashTalk-compatible 路由转发到已有 WebSocket 服务
resident OMNIRT_REALTIME_AVATAR_RUNTIME=resident 通过 OmniRT resident soulx-flashtalk-14b 执行路径渲染 chunk

GET /v1/audio2video/models 会返回 fallback_runtimeproxyresident_runtime,客户端可以据此区分协议测试模式和真实模型后端。