
Avatar Format

An avatar bundle defines the visual identity of a digital human together with the metadata required to align mouth motion with audio. OpenTalking reads avatar bundles when a session is created, pairs them with the configured synthesis model (wav2lip, musetalk, flashtalk, flashhead, or quicktalk), and drives video generation from streaming audio input.

This page documents the directory layout, the manifest.json schema, the scripts that generate avatar bundles, and the validation endpoints.

Directory layout

Each avatar bundle is a single subdirectory containing a manifest.json file:

examples/avatars/
├── demo-avatar/
│   ├── manifest.json
│   └── preview.png
├── singer-wav2lip/
│   ├── manifest.json
│   ├── preview.png
│   └── frames/
│       ├── frame_00000.png
│       ├── frame_00001.png
│       └── ...
└── singer-musetalk/
    ├── manifest.json
    ├── preview.png
    └── full_frames/
        └── ...

Per-model conventions:

model_type   Required subdirectory   Contents
wav2lip      frames/                 Ordered image sequence (PNG or JPG), sorted by filename.
musetalk     full_frames/            Ordered image sequence. Future extensions may include mask/ and latent/.
quicktalk    none                    External assets referenced via metadata.asset_root and metadata.template_video.
flashtalk    none                    Reference identity managed by the model service.
flashhead    none                    Reference identity managed by the model service.

A preview.png file is recommended; the frontend uses it to populate the avatar picker.
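
The layout conventions above are straightforward to verify directly. The snippet below is a minimal sketch of such a check written against the table; REQUIRED_SUBDIR and check_bundle_layout are illustrative names, not part of the OpenTalking API:

import json
from pathlib import Path

# Subdirectory each model_type expects inside its bundle (the remaining types need none).
REQUIRED_SUBDIR = {"wav2lip": "frames", "musetalk": "full_frames"}

def check_bundle_layout(bundle_dir: str) -> None:
    bundle = Path(bundle_dir)
    manifest = json.loads((bundle / "manifest.json").read_text())
    subdir = REQUIRED_SUBDIR.get(manifest["model_type"])
    if subdir is not None and not sorted((bundle / subdir).glob("*")):
        raise ValueError(f"{bundle}: expected a non-empty {subdir}/ directory")
    if not (bundle / "preview.png").exists():
        print(f"note: {bundle} has no preview.png; the avatar picker will show no thumbnail")

check_bundle_layout("examples/avatars/singer-wav2lip")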

manifest.json schema

Field         Type     Required   Description
id            string   Yes        Globally unique identifier referenced by the client.
name          string   No         Display name. Defaults to id.
model_type    enum     Yes        One of wav2lip, musetalk, quicktalk, flashtalk, flashhead.
fps           number   Yes        Target output frame rate. Typical value: 25.
sample_rate   number   Yes        Audio sample rate aligned with the TTS output. Typical value: 16000.
width         number   Yes        Output video width in pixels.
height        number   Yes        Output video height in pixels.
version       string   No         Asset version string.
metadata      object   No         Arbitrary additional fields, including per-model conventions documented below.
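
When authoring manifests by hand, the required fields can be checked without importing OpenTalking at all. The validate_manifest helper below is a sketch derived from the table above, not a library function:

import json
from pathlib import Path

REQUIRED_FIELDS = {
    "id": str,
    "model_type": str,
    "fps": (int, float),
    "sample_rate": (int, float),
    "width": (int, float),
    "height": (int, float),
}
MODEL_TYPES = {"wav2lip", "musetalk", "quicktalk", "flashtalk", "flashhead"}

def validate_manifest(path: str) -> dict:
    manifest = json.loads(Path(path).read_text())
    for field, expected in REQUIRED_FIELDS.items():
        if field not in manifest:
            raise ValueError(f"{path}: missing required field '{field}'")
        if not isinstance(manifest[field], expected):
            raise ValueError(f"{path}: '{field}' has an unexpected type")
    if manifest["model_type"] not in MODEL_TYPES:
        raise ValueError(f"{path}: unknown model_type '{manifest['model_type']}'")
    return manifest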

Wav2Lip metadata

Wav2lip avatars should populate the metadata field with mouth localization data:

{
  "source_image_hash": "<sha256>",
  "animation": {
    "mouth_center": [0.5, 0.56],
    "mouth_rx": 0.06,
    "mouth_ry": 0.02,
    "outer_lip": [[0.45, 0.55], [0.5, 0.53], [0.55, 0.55]],
    "inner_mouth": [[0.47, 0.55], [0.53, 0.55], [0.5, 0.57]]
  }
}

Coordinates are normalized to the image dimensions. When a single-image wav2lip avatar is uploaded through /avatars/custom, OpenTalking runs mouth detection locally with MediaPipe. If detection fails, the upload still succeeds, just without an animation field; in that case OmniRT's wav2lip backend falls back to its built-in alignment. The wav2lip_postprocess_mode flag controls whether server-side post-processing is applied; it is disabled by default.
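
Because the animation coordinates are normalized, a consumer has to scale them by the manifest's width and height before drawing on the frame. A small illustrative conversion (to_pixels is not an OpenTalking function) could look like:

def to_pixels(animation: dict, width: int, height: int) -> dict:
    """Convert normalized mouth metadata into pixel coordinates."""
    cx, cy = animation["mouth_center"]
    return {
        "mouth_center": (cx * width, cy * height),
        "mouth_rx": animation["mouth_rx"] * width,
        "mouth_ry": animation["mouth_ry"] * height,
        "outer_lip": [(x * width, y * height) for x, y in animation["outer_lip"]],
        "inner_mouth": [(x * width, y * height) for x, y in animation["inner_mouth"]],
    }

# For a 512x512 avatar, mouth_center [0.5, 0.56] maps to pixel (256.0, 286.72).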

QuickTalk manifest example

{
  "id": "quicktalk-daytime",
  "name": "QuickTalk Daytime",
  "model_type": "quicktalk",
  "fps": 25,
  "sample_rate": 16000,
  "width": 512,
  "height": 512,
  "version": "1.0",
  "metadata": {
    "asset_root": "/path/to/quicktalk/assets",
    "template_video": "/path/to/template.mp4"
  }
}
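
Because quicktalk bundles carry no frames of their own, a broken asset_root or template_video path may only surface once the model service tries to read it. A small pre-flight check such as the sketch below (not part of the loader) catches this earlier:

import json
from pathlib import Path

def check_quicktalk_assets(manifest_path: str) -> None:
    manifest = json.loads(Path(manifest_path).read_text())
    meta = manifest.get("metadata", {})
    for key in ("asset_root", "template_video"):
        value = meta.get(key)
        if not value or not Path(value).exists():
            raise FileNotFoundError(f"{manifest_path}: metadata.{key} -> {value!r} is missing")

# Path is illustrative; point it at wherever the quicktalk bundle actually lives.
check_quicktalk_assets("examples/avatars/quicktalk-daytime/manifest.json")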

Avatar bundle preparation

From a video file

terminal
python scripts/prepare_wav2lip_video_asset.py \
    --source /path/to/source.mp4 \
    --output-dir examples/avatars/my-avatar \
    --avatar-id my-avatar \
    --name "My Avatar" \
    --fps 25

The script performs the following steps:

  1. Extracts frames using ffmpeg and writes them to examples/avatars/my-avatar/frames/.
  2. Runs MediaPipe mouth detection and records the results in metadata.animation.
  3. Generates manifest.json and preview.png.
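
After the script finishes, a quick sanity check can confirm that the extracted frame sequence and the generated manifest agree. The snippet below is an illustrative check run by hand, not something the script performs itself:

import json
from pathlib import Path

bundle = Path("examples/avatars/my-avatar")
manifest = json.loads((bundle / "manifest.json").read_text())
frames = sorted((bundle / "frames").glob("frame_*.png"))

assert frames, "no frames were extracted"
assert "animation" in manifest.get("metadata", {}), "step 2 recorded no mouth metadata"
print(f"{len(frames)} frames at {manifest['fps']} fps "
      f"= {len(frames) / manifest['fps']:.1f} s of source footage")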

From a single image

terminal
python scripts/prepare_wav2lip_image_asset.py \
    --source /path/to/face.jpg \
    --output-dir examples/avatars/my-avatar-static \
    --avatar-id my-avatar-static

This produces a single frame at frames/frame_00000.png along with a complete manifest.

Interactive preparation

terminal
bash scripts/prepare-avatar.sh

The script prompts for the source file, model type, and avatar identifier.

Validation

REST endpoint

terminal
curl -s http://127.0.0.1:8000/avatars | jq
# [
#   {"id":"demo-avatar","name":"Demo","model_type":"mock","width":512,...},
#   ...
# ]

Programmatic validation

from opentalking.avatar.validator import list_avatar_dirs
from opentalking.avatar.loader import load_avatar_bundle

for path in list_avatar_dirs("./examples/avatars"):
    bundle = load_avatar_bundle(path, strict=True)
    print(bundle.manifest.id, bundle.manifest.model_type)

With strict=True, the loader raises an exception when required fields or subdirectories are missing; this mode is appropriate for continuous integration.
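
In a CI job this loop is typically wrapped in a test. The parametrized pytest example below is a sketch and is not shipped with the repository:

import pytest

from opentalking.avatar.loader import load_avatar_bundle
from opentalking.avatar.validator import list_avatar_dirs

@pytest.mark.parametrize("avatar_dir", list_avatar_dirs("./examples/avatars"))
def test_avatar_bundle_is_valid(avatar_dir):
    # strict=True makes missing fields or subdirectories fail the build.
    bundle = load_avatar_bundle(avatar_dir, strict=True)
    assert bundle.manifest.id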

Custom upload

The frontend's avatar creation flow invokes POST /avatars/custom with a multipart request body:

Field            Description
name             Display name.
base_avatar_id   Identifier of the avatar whose manifest serves as the template.
image            Portrait image uploaded by the user.

The server copies the base avatar's manifest, overrides id and name, sets metadata.custom_avatar=true, writes the uploaded image to frames/frame_00000.png, and runs mouth detection.

Only avatars marked metadata.custom_avatar=true may be removed through DELETE /avatars/{avatar_id}.
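
For testing outside the frontend, the same flow can be exercised with any HTTP client. The sketch below uses the requests library; the response field it reads is an assumption, so check it against the actual API response:

import requests

BASE = "http://127.0.0.1:8000"

# Create a custom avatar from a portrait, using demo-avatar's manifest as the template.
with open("/path/to/face.jpg", "rb") as portrait:
    resp = requests.post(
        f"{BASE}/avatars/custom",
        data={"name": "My Custom Avatar", "base_avatar_id": "demo-avatar"},
        files={"image": portrait},
    )
resp.raise_for_status()
avatar_id = resp.json()["id"]  # assumes the endpoint returns the new avatar's id

# Custom avatars (and only those) can be removed again.
requests.delete(f"{BASE}/avatars/{avatar_id}").raise_for_status()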

Source files

File                                   Responsibility
opentalking/avatar/loader.py           Manifest parsing and bundle loading.
opentalking/avatar/validator.py        Directory traversal and strict validation.
opentalking/avatar/mouth_metadata.py   MediaPipe mouth detection.
apps/api/routes/avatars.py             REST endpoint implementations.