Project Three: Building a LLaVA Multimodal Instruction Dataset
Scope: multimodal LLM (LMM) development, data engineering, visual instruction tuning
1. Project Background
- Task Definition: Build a high-quality visual instruction fine-tuning dataset supporting single-image QA (Visual QA), object localization (Grounding), and multi-image contextual reasoning (Interleaved Image-Text), for training multimodal models such as LLaVA or Qwen-VL.
- Input and Output:
  - Input:
    - Raw image library (.jpg/.png)
    - Structured annotation data (e.g., COCO-format instances.json with Bbox coordinates)
  - Output:
    - JSON files conforming to the LLaVA training standard (with image and conversations fields)
    - Grounding data with coordinate normalization and format alignment
- Challenge Analysis:
  - Coordinate alignment: Raw detection coordinates are typically absolute pixel values (x, y, w, h), while LLaVA requires normalization to the [0, 1000] range in the order [ymin, xmin, ymax, xmax]. Getting this wrong makes the model "hallucinate" severely.
  - Multi-image logic construction: Traditional image-caption data pairs one image with one text; building "multi-image interleaved" dialogue requires constructing reasonable comparative prompts to induce the model to understand inter-image relationships.
2. Architecture Design
- Technology Stack:
  - OpenAI-compatible API (SiliconFlow/Qwen): Generates high-quality image-text descriptions and multi-image comparison logic; leverages LLM reasoning for dialogue construction.
  - Python & OpenCV: Python is the core glue language; OpenCV is essential for reading image dimensions (H, W) for coordinate normalization and for "draw-box" visual verification.
  - JSON: The LLaVA-standard data exchange format.
3. Step-by-Step Implementation
Phase 1: Multi-Image Interleaved Data Generation
To teach the model to "compare" two images, we use the API to feed in multiple images dynamically and request a comparison.
Key Logic: Use the VLM API to construct a multi-image input prompt.
# From interleaved.py
def generate_comparison(img1_path, img2_path):
    # Construct the prompt: require a multi-image comparison
    prompt = "Here are two images. Please briefly compare them..."
    # Build the multi-image payload
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"...{img1_path}..."}},  # Image 1
                {"type": "image_url", "image_url": {"url": f"...{img2_path}..."}}   # Image 2
            ]
        }
    ]
    # ... send request and parse result ...
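The elided parts above (image encoding and the request itself) could be filled in as follows. This is a minimal sketch assuming the OpenAI Python SDK against an OpenAI-compatible endpoint; the base URL, model name, and helper names are illustrative assumptions, not the project's actual values.
# Hedged sketch: encode local images as base64 data URLs and send the payload above.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.siliconflow.cn/v1", api_key="YOUR_API_KEY")  # illustrative endpoint/key

def to_data_url(path):
    # image_url parts accept "data:image/<type>;base64,<payload>" strings for local files
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

def send_comparison(messages):
    # The model name is only an example; use whichever VLM the endpoint exposes
    resp = client.chat.completions.create(model="Qwen/Qwen2-VL-72B-Instruct", messages=messages)
    return resp.choices[0].message.content
Inside generate_comparison, the image_url values would then be to_data_url(img1_path) and to_data_url(img2_path).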
Phase 2: Core Processing - Bounding Box Alignment
This is the project's core math: COCO uses [x_topleft, y_topleft, width, height], while LLaVA needs [ymin, xmin, ymax, xmax] with values normalized to integers in the 0-1000 range.
Key Function: Coordinate normalization conversion
# From alignment.py
def convert_bbox(bbox, width, height):
    # COCO raw input: x, y, w, h (absolute pixels)
    x, y, w, h = bbox
    # Convert to LLaVA format: [ymin, xmin, ymax, xmax] normalized to 0-1000
    # Clip with max/min so floating-point rounding cannot push values out of range
    xmin = int((x / width) * 1000)
    ymin = int((y / height) * 1000)
    xmax = int((x + w) / width * 1000)
    ymax = int((y + h) / height * 1000)
    return [
        max(0, min(1000, ymin)),
        max(0, min(1000, xmin)),
        max(0, min(1000, ymax)),
        max(0, min(1000, xmax))
    ]
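As a quick sanity check (a hypothetical example, not taken from the project's data), a COCO box [320, 160, 128, 96] on a 640x480 image converts like this:
# x=320, y=160, w=128, h=96 on a 640x480 image:
#   xmin = 320/640*1000 = 500,  ymin = 160/480*1000 ≈ 333
#   xmax = 448/640*1000 = 700,  ymax = 256/480*1000 ≈ 533
print(convert_bbox([320, 160, 128, 96], 640, 480))  # -> [333, 500, 533, 700]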
Phase 3: Formatting and Verification
Generated data must not go straight into training; it must first pass reverse verification through visualization. If the drawn boxes are wrong, the trained model will be useless.
Verification Logic: Parse the generated JSON, restore the [0-1000] coordinates to pixel coordinates, and draw them.
# From visualize_bbox.py
def draw_bbox(image, bbox, label, color):
    h, w, _ = image.shape
    ymin, xmin, ymax, xmax = bbox  # Read the LLaVA-format box
    # Restore to pixel coordinates for drawing
    x1 = int(xmin / 1000 * w)
    y1 = int(ymin / 1000 * h)
    x2 = int(xmax / 1000 * w)
    y2 = int(ymax / 1000 * h)
    # Draw the box with OpenCV
    cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)
    # ...
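A possible driver around draw_bbox is sketched below. It assumes the output JSON is a list of records shaped like the example in the next section, that the answer string ends with the four box integers, and that source images live in an images/ directory; these are assumptions for illustration, not the project's exact visualize_bbox.py.
# Hedged sketch: load the generated JSON, recover each [ymin, xmin, ymax, xmax]
# box from the answer text, draw it, and save the result into viz_debug/.
import json, os, re
import cv2

os.makedirs("viz_debug", exist_ok=True)
with open("llava_instruct.json") as f:
    samples = json.load(f)

for sample in samples:
    image = cv2.imread(os.path.join("images", sample["image"]))
    answer = sample["conversations"][1]["value"]               # e.g. "... [350, 201, 680, 505]."
    bbox = [int(v) for v in re.findall(r"\d+", answer)[-4:]]   # last four integers = the box
    draw_bbox(image, bbox, sample["id"], (0, 255, 0))
    cv2.imwrite(os.path.join("viz_debug", sample["image"]), image)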
4. Results Showcase
1. Data Structure Example:
The final llava_instruct.json has the following standard structure, directly readable by the training pipeline:
{
"id": "1296_laptop",
"image": "000000001296.jpg",
"conversations": [
{
"from": "human",
"value": "Where is the laptop in the image? <image>"
},
{
"from": "qwen",
"value": "The laptop is located at [350, 201, 680, 505]."
}
]
}
2. Visualization Verification Report:
After running visualize_bbox.py, verification images are generated in the viz_debug directory. If the boxes precisely frame the objects (as shown below), the data pipeline logic is correct.
Generated effect image:
5. Cost and Optimization
- Resource consumption:
  - API cost: interleaved.py depends on an external LLM API. Generating 10,000 multi-image comparison samples at $0.5/1M tokens costs roughly $20-$30.
  - Compute time: alignment.py is pure CPU; processing the COCO validation set (5k images) takes seconds.
- Scaling considerations:
  - Concurrent processing: When processing millions of images (e.g., Objects365), single-threaded image reading for (h, w) becomes the bottleneck. Introduce multiprocessing with 16 worker processes to read and convert in parallel (see the sketch after this list).
  - Negative sample mining: The current code only generates "where is the object" positive samples. For model robustness, extend it to generate negative samples such as "Is there an elephant in the image? -> No."
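A minimal sketch of that parallelization is shown below, assuming convert_bbox from alignment.py above; the record layout and the import path are illustrative assumptions.
# Hedged sketch: read image sizes and convert boxes across 16 worker processes.
import cv2
from multiprocessing import Pool

from alignment import convert_bbox  # assumes alignment.py exposes convert_bbox

def process_record(record):
    # record = (image_path, coco_bbox); returns (image_path, llava_bbox)
    image_path, bbox = record
    h, w = cv2.imread(image_path).shape[:2]  # the per-image I/O being parallelized
    return image_path, convert_bbox(bbox, w, h)

if __name__ == "__main__":
    records = [("images/000000001296.jpg", [350.0, 96.5, 211.0, 145.9])]  # illustrative input
    with Pool(processes=16) as pool:
        results = pool.map(process_record, records, chunksize=256)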

