# Avatar Format
An avatar bundle defines the visual identity of a digital human together with the
metadata required to align mouth motion with audio. OpenTalking reads avatar bundles
when a session is created, pairs them with the configured synthesis model (`wav2lip`,
`musetalk`, `flashtalk`, `flashhead`, or `quicktalk`), and drives video generation
from streaming audio input.

This page documents the directory layout, the `manifest.json` schema, the scripts that
generate avatar bundles, and the validation endpoints.
## Directory layout
Each avatar bundle is a single subdirectory containing a `manifest.json` file:

```text
examples/avatars/
├── demo-avatar/
│   ├── manifest.json
│   └── preview.png
├── singer-wav2lip/
│   ├── manifest.json
│   ├── preview.png
│   └── frames/
│       ├── frame_00000.png
│       ├── frame_00001.png
│       └── ...
└── singer-musetalk/
    ├── manifest.json
    ├── preview.png
    └── full_frames/
        └── ...
```
Per-model conventions:
| model_type | Required subdirectory | Contents |
|---|---|---|
| `wav2lip` | `frames/` | Ordered image sequence (PNG or JPG), sorted by filename. |
| `musetalk` | `full_frames/` | Ordered image sequence. Future extensions may include `mask/` and `latent/`. |
| `quicktalk` | none | External assets referenced via `metadata.asset_root` and `metadata.template_video`. |
| `flashtalk` | none | Reference identity managed by the model service. |
| `flashhead` | none | Reference identity managed by the model service. |
A `preview.png` file is recommended; the frontend uses it to populate the avatar picker.
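Frame ordering for image-sequence bundles is purely lexicographic, so zero-padded
filenames matter. A minimal sketch of how a consumer might enumerate a wav2lip
sequence (illustrative only; OpenTalking's loader handles this internally):

```python
from pathlib import Path

# Enumerate the frames in the same order the renderer will play them:
# a lexicographic sort over zero-padded filenames.
frames = sorted(Path("examples/avatars/singer-wav2lip/frames").glob("*.png"))
for frame in frames[:3]:
    print(frame.name)  # frame_00000.png, frame_00001.png, ...
```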
## manifest.json schema
| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Globally unique identifier referenced by the client. |
| `name` | string | No | Display name. Defaults to `id`. |
| `model_type` | enum | Yes | One of `wav2lip`, `musetalk`, `quicktalk`, `flashtalk`, `flashhead`. |
| `fps` | number | Yes | Target output frame rate. Typical value: 25. |
| `sample_rate` | number | Yes | Audio sample rate aligned with the TTS output. Typical value: 16000. |
| `width` | number | Yes | Output video width in pixels. |
| `height` | number | Yes | Output video height in pixels. |
| `version` | string | No | Asset version string. |
| `metadata` | object | No | Arbitrary additional fields, including the per-model conventions documented below. |
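For reference, a minimal wav2lip manifest using only the fields above might look like
this (values are illustrative):

```json
{
  "id": "my-avatar",
  "name": "My Avatar",
  "model_type": "wav2lip",
  "fps": 25,
  "sample_rate": 16000,
  "width": 512,
  "height": 512
}
```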
### Wav2Lip metadata

Wav2lip avatars should populate the `metadata` field with mouth localization data:
```json
{
  "source_image_hash": "<sha256>",
  "animation": {
    "mouth_center": [0.5, 0.56],
    "mouth_rx": 0.06,
    "mouth_ry": 0.02,
    "outer_lip": [[0.45, 0.55], [0.5, 0.53], [0.55, 0.55]],
    "inner_mouth": [[0.47, 0.55], [0.53, 0.55], [0.5, 0.57]]
  }
}
```
Coordinates are normalized to the image dimensions. When a single-image wav2lip avatar
is uploaded through `/avatars/custom`, OpenTalking runs MediaPipe mouth detection
locally. If detection fails, the upload still succeeds, just without an `animation`
field; in that case, OmniRT's wav2lip backend falls back to its built-in alignment.
The `wav2lip_postprocess_mode` flag controls whether server-side post-processing is
applied; it is disabled by default.
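To make the normalized convention concrete, here is a sketch of how the mouth ellipse
maps back to pixel space. The helper is hypothetical (not part of the OpenTalking
codebase), and it assumes `mouth_rx` and `mouth_ry` are fractions of the image width
and height respectively:

```python
def mouth_ellipse_px(animation: dict, width: int, height: int):
    """Convert normalized mouth metadata to pixel coordinates (hypothetical helper)."""
    cx, cy = animation["mouth_center"]
    # Normalized coordinates are fractions of the image dimensions.
    center = (cx * width, cy * height)
    radii = (animation["mouth_rx"] * width, animation["mouth_ry"] * height)
    return center, radii

# For a 512x512 avatar with the metadata shown above:
center, radii = mouth_ellipse_px(
    {"mouth_center": [0.5, 0.56], "mouth_rx": 0.06, "mouth_ry": 0.02}, 512, 512
)
print(center, radii)  # (256.0, 286.72) (30.72, 10.24)
```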
### QuickTalk manifest example
```json
{
  "id": "quicktalk-daytime",
  "name": "QuickTalk Daytime",
  "model_type": "quicktalk",
  "fps": 25,
  "sample_rate": 16000,
  "width": 512,
  "height": 512,
  "version": "1.0",
  "metadata": {
    "asset_root": "/path/to/quicktalk/assets",
    "template_video": "/path/to/template.mp4"
  }
}
```
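Because quicktalk bundles reference assets outside the bundle directory, it is worth
sanity-checking those paths before deployment. A minimal sketch (illustrative only;
OpenTalking's loader may perform equivalent checks):

```python
import json
from pathlib import Path

manifest = json.loads(Path("examples/avatars/quicktalk-daytime/manifest.json").read_text())
meta = manifest["metadata"]
# Both paths live outside the bundle, so a stale manifest can point at nothing.
assert Path(meta["asset_root"]).is_dir(), "asset_root must be an existing directory"
assert Path(meta["template_video"]).is_file(), "template_video must be an existing file"
```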
## Avatar bundle preparation
### From a video file
```bash
python scripts/prepare_wav2lip_video_asset.py \
  --source /path/to/source.mp4 \
  --output-dir examples/avatars/my-avatar \
  --avatar-id my-avatar \
  --name "My Avatar" \
  --fps 25
```
The script performs the following steps:

- Extracts frames using ffmpeg and writes them to `examples/avatars/my-avatar/frames/`
  (a conceptually equivalent ffmpeg invocation is sketched after this list).
- Runs MediaPipe mouth detection and records the results in `metadata.animation`.
- Generates `manifest.json` and `preview.png`.
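The extraction step corresponds roughly to an ffmpeg invocation like the following
(a sketch; the script's actual filters and flags may differ):

```bash
# Resample to 25 fps and write zero-padded frames starting at frame_00000.png.
ffmpeg -i /path/to/source.mp4 -vf fps=25 -start_number 0 \
  examples/avatars/my-avatar/frames/frame_%05d.png
```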
### From a single image
```bash
python scripts/prepare_wav2lip_image_asset.py \
  --source /path/to/face.jpg \
  --output-dir examples/avatars/my-avatar-static \
  --avatar-id my-avatar-static
```
This produces a single frame at `frames/frame_00000.png` along with a complete manifest.
### Interactive preparation
The script prompts for the source file, model type, and avatar identifier.
## Validation
### REST endpoint
```bash
curl -s http://127.0.0.1:8000/avatars | jq
# [
#   {"id":"demo-avatar","name":"Demo","model_type":"mock","width":512,...},
#   ...
# ]
```
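The same listing can be consumed programmatically; here is a short sketch using
`requests` (assuming the local server from the curl example above):

```python
import requests

# Fetch the avatar list and group identifiers by model type.
avatars = requests.get("http://127.0.0.1:8000/avatars", timeout=5).json()
by_model = {}
for avatar in avatars:
    by_model.setdefault(avatar["model_type"], []).append(avatar["id"])
print(by_model)
```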
### Programmatic validation
```python
from opentalking.avatar.validator import list_avatar_dirs
from opentalking.avatar.loader import load_avatar_bundle

for path in list_avatar_dirs("./examples/avatars"):
    bundle = load_avatar_bundle(path, strict=True)
    print(bundle.manifest.id, bundle.manifest.model_type)
```
The `strict=True` parameter raises an exception when required fields or subdirectories
are missing; this mode is appropriate for continuous integration, as sketched below.
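One way to wire strict loading into CI is a parametrized test. The pytest wiring here
is our own suggestion; only the two OpenTalking imports come from the example above:

```python
import pytest

from opentalking.avatar.loader import load_avatar_bundle
from opentalking.avatar.validator import list_avatar_dirs

# One test case per bundle directory; strict mode raises on any schema violation.
@pytest.mark.parametrize("path", list_avatar_dirs("./examples/avatars"))
def test_avatar_bundle_is_valid(path):
    load_avatar_bundle(path, strict=True)
```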
## Custom upload
The frontend's avatar creation flow invokes `POST /avatars/custom` with a multipart
request body:
| Field | Description |
|---|---|
| `name` | Display name. |
| `base_avatar_id` | Identifier of the avatar whose manifest serves as the template. |
| `image` | Portrait image uploaded by the user. |
The server copies the base avatar's manifest, overrides `id` and `name`, sets
`metadata.custom_avatar=true`, writes the uploaded image to `frames/frame_00000.png`,
and runs mouth detection.

Only avatars marked `custom_avatar=true` may be removed through
`DELETE /avatars/{avatar_id}`.
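For illustration, a client-side sketch of the upload and delete calls using `requests`.
The server URL and the `id` field in the response are assumptions; the multipart field
names match the table above:

```python
import requests

BASE = "http://127.0.0.1:8000"  # assumed local server

# Multipart upload: the three fields documented in the table above.
with open("/path/to/face.jpg", "rb") as portrait:
    resp = requests.post(
        f"{BASE}/avatars/custom",
        data={"name": "My Custom Avatar", "base_avatar_id": "demo-avatar"},
        files={"image": ("face.jpg", portrait, "image/jpeg")},
    )
resp.raise_for_status()
avatar_id = resp.json()["id"]  # assumed response field

# Only custom avatars (custom_avatar=true) can be deleted.
requests.delete(f"{BASE}/avatars/{avatar_id}").raise_for_status()
```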
## Source files
| File | Responsibility |
|---|---|
| `opentalking/avatar/loader.py` | Manifest parsing and bundle loading. |
| `opentalking/avatar/validator.py` | Directory traversal and strict validation. |
| `opentalking/avatar/mouth_metadata.py` | MediaPipe mouth detection. |
| `apps/api/routes/avatars.py` | REST endpoint implementations. |