Project 13: Qwen-VL Multimodal Instruction Factory¶
Abstract¶
This project builds a reproducible data-engineering case around a "multimodal instruction factory." It focuses on business goals, data boundaries, architecture decisions, core implementation, acceptance metrics, and risk control. Installation commands and script details are condensed into an engineering-review perspective. The emphasis is on the relationship among sample schema, data flow, failure modes, and deliverables, so that readers can turn methods from earlier chapters into auditable and extensible project assets.
Keywords¶
multimodal instruction factory; practical project; reproducible data engineering; data pipeline; acceptance metrics
Project Goals and Reader Outcomes¶
This project uses the multimodal instruction factory as its core case. The goal is to build a multimodal instruction production chain covering images, text, OCR, charts, and dialogue tasks. After completing this chapter, readers should be able to identify the key data objects in this scenario, decompose the engineering pipeline, set acceptance metrics, and transfer the same approach to adjacent data-engineering tasks.
Scenario Constraints and Data Boundaries¶
This project targets controlled assets and sample factories. It does not cover unauthorized media collection or fully automated safety review. These boundaries make the case reproducible and auditable. When data scale, data sources, permission scope, or deployment environment changes, sampling strategy, quality thresholds, runtime cost, and compliance requirements must be reassessed.
The boundary between this project and P03 must be explicit. P03 focuses on the classic LLaVA flow: image assets, OCR, bounding boxes, conversation templates, visual spot checks, and training packaging as the baseline chain. P13 focuses on modern multimodal-instruction factory capabilities: Qwen-VL-style generation, self-consistency quality calibration, LLM-as-Judge filtering, multilingual expansion, and unified packaging for multi-image and video references. P13 therefore does not repeat the proof of the LLaVA baseline pipeline; it shows how newer factory capabilities extend the data-factory skeleton already established in P03.
Architecture Decisions¶
This project uses an architecture path of asset selection, task templates, caption/OCR signals, dialogue generation, quality scoring, and data packaging. The decision prioritizes input-output contracts, traceable versions, localizable exceptions, and reviewable results rather than compressing all logic into a one-off script.
Sample Schema and Data Flow¶
The core data flow can be summarized as:
Listing P13-1 provides a process flow example.
visual assets -> metadata/OCR/caption -> instruction tasks -> multi-turn samples -> quality filtering -> multimodal training set
Listing P13-1: Process flow example.
At minimum, the sample schema should retain fields such as id, source, content_or_payload, metadata, quality_signals, split_or_stage, and audit_trace. The exact fields are further refined by the data type, downstream task, and acceptance method used in this project.
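The minimum field set above can be checked mechanically before a record moves to the next stage. The following is a minimal sketch, assuming records are plain dictionaries. `REQUIRED_FIELDS` mirrors the field names listed above; `missing_fields` and the demo record are hypothetical and not part of the companion repository.

```python
# Field names come from the schema description; the checking logic is an assumption.
REQUIRED_FIELDS = (
    "id", "source", "content_or_payload", "metadata",
    "quality_signals", "split_or_stage", "audit_trace",
)


def missing_fields(record: dict) -> list:
    """Return required fields that are absent or None, in schema order."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


# Hypothetical record that forgot its audit trace.
record = {
    "id": "p13_demo_000001",
    "source": "internal",
    "content_or_payload": {"image": "assets/demo.jpg"},
    "metadata": {},
    "quality_signals": {},
    "split_or_stage": "raw",
}
print(missing_fields(record))
```

Running a check like this at every stage boundary turns "missing schema fields" from a late training failure into an early, countable rejection reason.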
Core Implementation Fragments¶
The chapter keeps only implementation fragments that explain design trade-offs. Full scripts, long configurations, runtime logs, and large files should live in the companion repository or appendix notes. Code examples focus on input-output contracts, quality thresholds, exception handling, and acceptance interfaces.
Experiment or Acceptance Metrics¶
Acceptance metrics include task coverage, image-text consistency, OCR usability, format pass rate, safety-filtering rate, and manual spot-check quality. If the project enters production, a course environment, or a public reproduction environment, it should also record version numbers, dependency environment, random seeds, sample spot-check results, and failure-sample review records.
Table P13-1 lists the acceptance dimensions, the metric or evidence for each, and the review rule applied before publication.
Table P13-1: Publication acceptance table for the multimodal instruction factory.
| Acceptance dimension | Metric / evidence | Publication review rule |
|---|---|---|
| Task coverage | Ratio of description, OCR, chart, grounding, and multi-turn QA tasks | Task types must correspond to data sources, model capability, and downstream training goals |
| Quality filtering | Image-text consistency, format pass rate, safety-filtering rate, self-consistency result, manual review quality | LLM-as-Judge results must retain scoring rules, spot-check calibration examples, and multi-sample consistency records |
| Multilingual expansion | Ratio of Chinese, English, and translated samples; cross-language terminology consistency; format-preservation rate | Multilingual samples must not be judged by quantity only; semantic consistency, visual reference, and proper-name translation require sampling review |
| Copyright safety | Image authorization, sensitive-content interception, redistribution boundary | Public examples should prefer authorized or owned assets; external images require separate registration |
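Two of the metrics in the table, task coverage and format pass rate, can be computed directly from generated records. The sketch below assumes each record carries the `task`, `instruction`, and `response` keys used by the generation listing. The function names and the pass rule (both text fields non-empty) are illustrative assumptions, not the project's metric scripts.

```python
from collections import Counter


def task_coverage(records):
    """Share of records per task type; empty input yields an empty dict."""
    counts = Counter(r["task"] for r in records)
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()} if total else {}


def format_pass_rate(records):
    """Fraction of records whose instruction and response are both non-empty."""
    if not records:
        return 0.0
    passed = sum(
        1 for r in records
        if str(r.get("instruction", "")).strip() and str(r.get("response", "")).strip()
    )
    return passed / len(records)


# Hypothetical demo records.
demo = [
    {"task": "detailed_description", "instruction": "Describe the image.", "response": "A street sign."},
    {"task": "detailed_description", "instruction": "Describe the image.", "response": ""},
    {"task": "ocr_reading", "instruction": "Extract the text.", "response": "EXIT"},
    {"task": "complex_reasoning", "instruction": "Explain the scene.", "response": "The road is wet."},
]
print(task_coverage(demo))
print(format_pass_rate(demo))
```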
Cost, Risk, and Compliance Boundaries¶
Cost mainly comes from vision-language models, OCR, and manual review. Risk concentrates in image authorization, sensitive content, hallucinated descriptions, and task homogenization. When external data, personal information, copyrighted material, or third-party services are involved, retain source notes, permission status, masking strategy, call records, and manual-review records.
Common Failure Modes¶
Common failures include input-distribution drift, missing schema fields, quality thresholds that are too loose or too strict, insufficient evaluation-sample coverage, unstable model calls, and results that cannot be traced. Troubleshooting should locate data boundaries and intermediate artifacts first, then inspect models, toolchains, and deployment environment.
Reproducible Resource Notes¶
Reproduction materials should include data-source notes, a minimal sample set, configuration files, run commands, metric scripts, inspection reports, and output directories. The chapter keeps necessary fragments; full notebooks, long scripts, and large files should be maintained as companion resources.
Background and Objectives¶
In VLM data engineering, the bottleneck is often not only the number of image-text pairs but also the construction of high-quality, diverse instruction data. Project 3 already introduced the entry-level process for generating simple descriptions and QA instructions from single images. Under modern multimodal architectures represented by Qwen2.5-VL (Bai et al. 2025) and InternVL3 (Zhu et al. 2025), however, datasets also need to cover complex reasoning, OCR reading, fine-grained grounding, interleaved multi-image input, and video understanding.
Therefore, this chapter is not a repeated version of P03. P03 answers "how to organize the basic pipeline of a LLaVA-style data factory clearly"; this chapter answers "after the foundation model, sampling strategy, and quality-filtering capability are upgraded, how should multimodal instruction production be extended into a pipeline closer to an industrial factory." The former emphasizes reproducibility of the classic process, while the latter emphasizes newer capabilities such as Qwen-VL generation, self-consistency, LLM-as-Judge, multilingual expansion, and unified packaging.
Industrial multimodal instruction synthesis must handle several challenges:
- Instruction diversity: Beyond description, datasets need reasoning, fine-grained grounding, chart reading, and OCR tasks.
- Multi-source and multi-form input: Data should support not only single images, but also interleaved images and video.
- Quality control: Pure generation creates severe hallucinations, so multi-sample verification and LLM-as-Judge filtering (Zheng et al. 2023) are needed.
This project builds a complete multimodal instruction data factory. Starting from a raw image pool with only weak alt-text captions, such as a LAION subset, it uses strong foundation models, including Qwen2.5-VL-7B and Qwen2.5-72B-Instruct, to produce high-quality complex instructions in an automated and scalable way. After completing the project, readers can adapt the same production line to private image collections in domains such as medicine, law, and e-commerce.
Architecture¶
The factory is divided into five components, shown in Figure P13-1.
Figure P13-1: Qwen-VL-style multimodal instruction synthesis pipeline.
- Seed selector: Retrieves seed images from massive image pools, emphasizing OCR-rich images, charts, and realistic complex scenes.
- Instruction generator: Defines six categories of complex instruction templates (Listing P13-3 excerpts three of them) and calls Qwen2.5-VL through vLLM (Kwon et al. 2023) for high-throughput generation.
- Quality scorer and self-consistency: Uses self-consistency (Wang et al. 2023) to validate reasoning tasks through repeated sampling.
- LLM-as-Judge filter: Uses a strong text-only model such as Qwen2.5-72B-Instruct to score logic and detail, discarding samples below 4.0.
- Multilingual expander and packer: Extends data through Chinese-English translation where needed and exports a unified format that supports image, multi-image, and video references.
Table P13-2 maps architecture components to code entry points and key artifacts. Unlike P03, P13 does not walk through LLaVA image-text preparation again. Its focus is how a modern multimodal instruction factory organizes seed selection, templates, generation, filtering, expansion, packaging, and acceptance into a reviewable chain.
Table P13-2: Stage artifacts and code entry points for the multimodal instruction factory.
| Stage | Code entry | Main input | Main output | Key review point |
|---|---|---|---|---|
| Seed selection | seed_selector.py | LAION metadata or private visual-asset manifest | Seed list | Resolution, aspect ratio, original caption length, authorization status |
| Template management | instruction_templates.py | Task type | Prompt template | Task coverage, template repetition, prompt boundary |
| VLM generation | generate_with_qwen_vl.py | Seed list, templates, Qwen2.5-VL | Raw instruction records | Model version, sampling parameters, failed samples |
| LLM-as-Judge | llm_judge.py | Instruction and response | Scored records | Scoring rule, threshold, human calibration examples |
| Self-consistency | self_consistency.py | Multi-sample generations | Consistency score | Multi-sample agreement, reasoning-task stability |
| Multilingual expansion | multilingual_expand.py | High-quality English samples | Bilingual records | Terminology consistency, visual-reference preservation |
| Unified packaging | pack_multi_image_video.py | Scored records | mm_sft_final.jsonl | Qwen format, image/video paths, conversation fields |
| Unit tests | tests/test_factory.py | Template, judge, expansion, packaging functions | Test report | Basic contracts and example-output completeness |
The key function of Table P13-2 is to split "generation" out of a single model call. In real projects, VLM generation is only the middle of the pipeline. Before it, controlled seeds and task templates are required; after it, consistency checks, score filtering, multilingual review, and format packaging are required. If only the generation script is kept, the chapter becomes a demo. If stage artifacts and review fields are kept, the chapter has the engineering depth expected of a project chapter.
Step-by-Step Implementation¶
Step 1: Seed Selector¶
From an open LAION subset (Schuhmann et al. 2022), use metadata such as image width, height, original caption length, and tags to select promising seeds.
Listing P13-2 provides a Python implementation excerpt.
# code/zh/project_13_mm_instruction_factory/seed_selector.py
from datasets import load_dataset


def select_seeds(dataset_name="laion/laion2B-en", num_samples=5000):
    print("Loading LAION metadata...")
    # In production, stream metadata first instead of downloading all images.
    ds = load_dataset(dataset_name, split="train", streaming=True)
    seeds = []
    for item in ds:
        try:
            w, h = item.get("WIDTH", 0), item.get("HEIGHT", 0)
            if w > 512 and h > 512 and 0.5 < (w / h) < 2.0:
                # Text longer than 10 words suggests richer visual context.
                if len(str(item.get("TEXT", "")).split()) > 10:
                    seeds.append({
                        "url": item["URL"],
                        "original_caption": item["TEXT"],
                    })
        except Exception:
            # Rows with missing or non-numeric size fields raise here and are skipped.
            continue
        if len(seeds) >= num_samples:
            break
    print(f"Selected {len(seeds)} high-quality seed images.")
    return seeds


if __name__ == "__main__":
    select_seeds(num_samples=100)
Listing P13-2: Python implementation excerpt.
Step 2: Instruction Template Design¶
Unlike fixed-question LLaVA data, this pipeline needs diverse roles and task templates.
Listing P13-3 provides a Python implementation excerpt.
# code/zh/project_13_mm_instruction_factory/instruction_templates.py
import random

TEMPLATES = {
    "detailed_description": [
        "Please provide a highly detailed, comprehensive description of this image, capturing every visible element, spatial relationship, and background context.",
        "Describe this image as if you are explaining it to someone who cannot see it, ensuring no detail is left out.",
    ],
    "complex_reasoning": [
        "Based on the visual evidence in the image, infer the sequence of events that likely led to this scene. Explain your reasoning step-by-step.",
        "What are the implicit relationships between the objects shown? Provide a logical deduction.",
    ],
    "ocr_reading": [
        "Extract all visible text in this image and format it into a structured markdown table or list.",
    ],
}


def get_random_prompt(task_type):
    # Unknown task types fall back to the detailed-description templates.
    return random.choice(TEMPLATES.get(task_type, TEMPLATES["detailed_description"]))
Listing P13-3: Python implementation excerpt.
Step 3: High-throughput Generation with vLLM¶
With vLLM's high concurrency, selected images and instruction templates can be sent to a base multimodal model at scale.
Listing P13-4 provides a Python implementation excerpt.
# code/zh/project_13_mm_instruction_factory/generate_with_qwen_vl.py
import io
import urllib.request

from PIL import Image
from vllm import LLM, SamplingParams

from instruction_templates import get_random_prompt


def load_image(url, timeout=10):
    # vLLM expects a decoded PIL image in multi_modal_data, not a URL string.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return Image.open(io.BytesIO(resp.read())).convert("RGB")


def generate_instructions(seeds, model_path="Qwen/Qwen2.5-VL-7B-Instruct"):
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        max_num_seqs=16,
        gpu_memory_utilization=0.9,
    )
    sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)

    inputs, metadata = [], []
    for seed in seeds:
        task = "detailed_description"  # Fixed in this excerpt; a full run rotates task types.
        prompt = get_random_prompt(task)
        try:
            image = load_image(seed["url"])
        except Exception as exc:
            # Dead links are common in web-scale pools; skip them but keep the reason visible.
            print(f"Skipping {seed['url']}: {exc}")
            continue
        # Hand-written Qwen2.5-VL chat prompt. In production, build it with the
        # model processor's chat template instead of string formatting.
        prompt_text = (
            "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
            f"{prompt}<|im_end|>\n<|im_start|>assistant\n"
        )
        inputs.append({"prompt": prompt_text, "multi_modal_data": {"image": image}})
        # Keep bookkeeping fields outside the request dict passed to vLLM.
        metadata.append({"task": task, "url": seed["url"], "prompt": prompt})

    print(f"Generating answers for {len(inputs)} seeds...")
    outputs = llm.generate(inputs, sampling_params=sampling_params)

    results = []
    for output, meta in zip(outputs, metadata):
        results.append({
            "url": meta["url"],
            "task": meta["task"],
            "instruction": meta["prompt"],
            "response": output.outputs[0].text,
        })
    return results
Listing P13-4: Python implementation excerpt.
Step 4: LLM-as-Judge Quality Filtering¶
Generated responses often hallucinate, so we add a strong judge model such as Qwen2.5-72B-Instruct. Because a text-only 72B model cannot inspect images, the evaluation is text-only: the judge scores the internal logic, completeness, and structure of the generated long response. It cannot confirm that the response matches the image, so image-text consistency still depends on self-consistency checks and manual spot checks.
Listing P13-5 provides a Python implementation excerpt.
# code/zh/project_13_mm_instruction_factory/llm_judge.py
def score_with_llm_judge(generated_data):
    """
    Demonstration logic. In a real pipeline this function calls a 72B judge model
    served by vLLM. Input is an instruction and response; output is a 1-5 score.
    """
    scored_data = []
    for item in generated_data:
        # Example production prompt:
        # Rate the quality of this response to the instruction. Score 1 to 5.
        word_count = len(item["response"].split())
        score = 4.5 if word_count > 50 else 3.0
        if score >= 4.0:
            item["judge_score"] = score
            scored_data.append(item)
    print(f"Filtered {len(generated_data)} down to {len(scored_data)} high-quality samples.")
    return scored_data
Listing P13-5: Python implementation excerpt.
Step 5: Unified Downstream Packaging¶
Whether the source is a single image, multiple images, or a video clip, the final output is written as JSONL in a community format such as ShareGPT or a model-specific format such as Qwen2.5-VL fine-tuning format.
Listing P13-6 provides a Python implementation excerpt.
# code/zh/project_13_mm_instruction_factory/pack_multi_image_video.py
import json
import os


def pack_to_qwen_format(scored_data, output_path="./data/mm_sft_final.jsonl"):
    formatted_dataset = []
    for item in scored_data:
        record = {
            "type": "image",
            "image": item["url"],
            "conversations": [
                {
                    "from": "user",
                    "value": f"<image>\n{item['instruction']}",
                },
                {
                    "from": "assistant",
                    "value": item["response"],
                },
            ],
            "quality": {"judge_score": item["judge_score"]},
        }
        formatted_dataset.append(record)

    # Create the output directory if needed so the first run does not fail.
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for record in formatted_dataset:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    print(f"Saved {len(formatted_dataset)} samples to {output_path}")


if __name__ == "__main__":
    dummy_data = [{
        "url": "http://example.jpg",
        "instruction": "Describe",
        "response": "A cat.",
        "judge_score": 4.5,
    }]
    pack_to_qwen_format(dummy_data)
Listing P13-6: Python implementation excerpt.
Engineering Run Path and Minimal Reproduction¶
The P13 code directory is code/zh/project_13_mm_instruction_factory. Compared with P11 and P14, this project is more of a generative data factory. The minimal reproduction path is therefore not one fixed shell script, but a staged function chain: select seeds, generate instructions from templates, then run self-consistency filtering, judge scoring, multilingual expansion, and format packaging. In teaching environments, a small seed set and mock judge can first validate artifact contracts before replacing them with real Qwen2.5-VL and Qwen2.5-72B-Instruct services.
Listing P13-7 shows the minimal run order. A production implementation can wrap it in shell, Makefile, Airflow, or Ray, but the project chapter should make stage boundaries and artifact transfer explicit.
from seed_selector import select_seeds
from generate_with_qwen_vl import generate_instructions
from llm_judge import score_with_llm_judge
from self_consistency import self_consistency_filter
from multilingual_expand import expand_multilingual
from pack_multi_image_video import pack_to_qwen_format
seeds = select_seeds(num_samples=100)
raw = generate_instructions(seeds)
consistent = self_consistency_filter(raw)
scored = score_with_llm_judge(consistent)
expanded = expand_multilingual(scored)
pack_to_qwen_format(expanded, "./data/mm_sft_final.jsonl")
Listing P13-7: Python implementation excerpt.
This code describes the factory's minimal closed loop, but it is not yet a production script. Production runs need four additional controls. First, model calls must record model path, temperature, top-p, max tokens, and concurrency. Second, seeds must record source, authorization, and download status. Third, judge output must retain the scoring prompt, threshold, and human calibration set. Fourth, before packaging, the pipeline must check image links, conversation format, and sample deduplication.
Table P13-3 lists the runtime records that a production run should keep and the purpose of each.
Table P13-3: Runtime records for the multimodal instruction factory.
| Category | Record item | Purpose |
|---|---|---|
| Asset version | Image source, URL, authorization, download time | Proves sample traceability |
| Generation model | Qwen2.5-VL path, inference framework, sampling parameters | Explains output differences |
| Template version | Task type, template text, template hash | Controls task distribution |
| Judge version | Scoring model, scoring rubric, threshold | Reviews filtering results |
| Multilingual version | Translation model, terminology table, language ratio | Reviews cross-language consistency |
| Packaging version | Output format, field schema, target training framework | Ensures training scripts can read the data |
| Spot-check record | Human samples, failure samples, revision notes | Supports release gates |
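The records in this table can be captured as one manifest per run and stored next to the outputs. The sketch below is an assumption about shape, not the project's format: `RunManifest`, its field names, the batch id, and the placeholder template hash are hypothetical. The model names, sampling values, and threshold are the ones used in this chapter's listings.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunManifest:
    """One record per run; field names are illustrative."""
    batch_id: str
    generation_model: str
    inference_framework: str
    sampling: dict
    template_hash: str
    judge_model: str
    judge_threshold: float
    output_format: str


def manifest_json(manifest):
    """Serialize with sorted keys so two manifests can be compared line by line."""
    return json.dumps(asdict(manifest), ensure_ascii=False, indent=2, sort_keys=True)


manifest = RunManifest(
    batch_id="demo-batch-001",          # hypothetical batch id
    generation_model="Qwen/Qwen2.5-VL-7B-Instruct",
    inference_framework="vllm",
    sampling={"temperature": 0.7, "top_p": 0.95, "max_tokens": 1024},
    template_hash="<template-hash>",    # placeholder, filled in by the template stage
    judge_model="Qwen2.5-72B-Instruct",
    judge_threshold=4.0,
    output_format="qwen_jsonl",
)
print(manifest_json(manifest))
```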
Data Schema and Sample Contract¶
A minimal multimodal instruction record cannot contain only image, instruction, and response. The project chapter should emphasize that training formats can be narrow, but engineering intermediate states must be wider. Otherwise, once hallucination, format errors, or copyright problems appear, the data team cannot trace a sample back to its image, template, model call, or filtering step.
Table P13-4 lists the fields of the intermediate-state sample schema, with an example value and the meaning of each.
Table P13-4: Intermediate-state sample schema for the multimodal instruction factory.
| Field | Example | Meaning |
|---|---|---|
| sample_id | p13_laion_000001 | Stable primary key across logs |
| asset.url | https://...jpg | Original image or video address |
| asset.license | cc-by, internal | Basis for release and redistribution judgment |
| seed.original_caption | Original alt text | Used to judge seed quality |
| task.type | detailed_description, ocr_reading | Controls task distribution |
| prompt.template_id | ocr_v1_002 | Tracks template version |
| generation.model | Qwen2.5-VL-7B-Instruct | Tracks generation model |
| generation.response | Long response text | Candidate training sample |
| quality.judge_score | 4.5 | Basis for LLM-as-Judge filtering |
| quality.consistency_score | 1.0 | Multi-sample consistency |
| language | en, zh | Distinguishes multilingual samples |
| audit_trace | Batch, timestamp, script version | Supports review and takedown |
The final mm_sft_final.jsonl file in Qwen format can be narrower than the intermediate state, but the intermediate state should not be discarded. Training files serve the training framework; audit files serve quality and release. They can be joined by sample_id.
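Joining the two files by sample_id can be sketched as follows. This assumes the packer is extended to write a `sample_id` into every training row, which the packaging listing in this chapter does not yet do. `join_by_sample_id` and the demo rows are hypothetical.

```python
def join_by_sample_id(training_rows, audit_rows):
    """Attach audit fields to training rows; return (joined, orphan_ids)."""
    audit_index = {row["sample_id"]: row for row in audit_rows}
    joined, orphans = [], []
    for row in training_rows:
        sample_id = row.get("sample_id")
        audit = audit_index.get(sample_id)
        if audit is None:
            # A training row without an audit record cannot be traced or taken down.
            orphans.append(sample_id)
        else:
            joined.append({**audit, **row})
    return joined, orphans


# Hypothetical demo rows.
training_rows = [
    {"sample_id": "p13_laion_000001", "image": "a.jpg"},
    {"sample_id": "p13_laion_000002", "image": "b.jpg"},
]
audit_rows = [
    {"sample_id": "p13_laion_000001", "license": "cc-by", "template_id": "ocr_v1_002"},
]
joined, orphans = join_by_sample_id(training_rows, audit_rows)
print(len(joined), orphans)
```

A non-empty orphan list is a useful release blocker: it means some training rows have lost their link to source, template, or filtering evidence.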
Quality Filtering: From Length Thresholds to Calibrated Rubrics¶
The demonstration llm_judge.py uses response length as a proxy: answers longer than 50 words receive 4.5, while shorter answers receive 3.0 and fall below the 4.0 threshold. This is acceptable for teaching, but not for a real release gate. A real LLM-as-Judge setup should include at least four scoring dimensions: image-text consistency, answer completeness, task following, and safety/compliance.
Table P13-5 gives a scoring rubric: what a 5-point answer looks like in each dimension and what a low score usually signals.
Table P13-5: LLM-as-Judge scoring rubric for multimodal instruction samples.
| Scoring dimension | 5-point behavior | Low-score risk |
|---|---|---|
| Image-text consistency | Describes only content supported by visual evidence | Hallucinates subjects, actions, or text |
| Task following | Strictly follows the template requirement, such as OCR table output | Off-task answer or invalid format |
| Detail completeness | Covers subjects, spatial relations, text, and background | Too short, generic, missing trainable information |
| Reasoning reliability | Reasoning steps are supported by visual evidence | Over-infers causality or intention |
| Safety and compliance | Avoids sensitive identity inference and improper content | Privacy, bias, or dangerous guidance |
| Language quality | Clear expression without severe repetition | Mechanical repetition, garbling, or abnormal language mixing |
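One way to turn per-dimension rubric scores into a keep or drop decision is to average them and add hard gates on the dimensions that should never be traded off. The sketch below assumes the judge already returns a 1-5 score per dimension. The dimension keys follow the table, and the 4.0 threshold is the one used in this chapter. The choice of hard-gate dimensions, the 3.0 floor, and the function name are illustrative assumptions that need human calibration.

```python
RUBRIC_DIMENSIONS = (
    "image_text_consistency", "task_following", "detail_completeness",
    "reasoning_reliability", "safety_compliance", "language_quality",
)
# Assumption: a fluent answer must not compensate for failure on these dimensions.
HARD_GATES = ("image_text_consistency", "safety_compliance")


def rubric_decision(scores, threshold=4.0, gate_floor=3.0):
    """Average all dimensions, then veto if any hard-gate dimension is below the floor."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    mean_score = sum(scores[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)
    gate_failed = [d for d in HARD_GATES if scores[d] < gate_floor]
    return {
        "mean_score": mean_score,
        "gate_failed": gate_failed,
        "keep": mean_score >= threshold and not gate_failed,
    }


# Hypothetical scores: fluent and complete, but unfaithful to the image.
fluent_but_wrong = {d: 5.0 for d in RUBRIC_DIMENSIONS}
fluent_but_wrong["image_text_consistency"] = 2.0
print(rubric_decision(fluent_but_wrong))
```

In the demo, the mean is 4.5 and would pass a plain threshold, yet the record is dropped because the consistency gate fails. That is the behavior a length-based proxy cannot provide.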
Self-consistency complements judge blind spots. For complex reasoning questions, the model can generate multiple answers, then compare whether conclusions and key evidence agree. If different samples conflict on subjects, text, or spatial relationships, the record should not enter the training set even if one answer is long and fluent. The current self_consistency.py is a simplified interface and teaching implementation; real projects should plug in multi-sample generation and consistency metrics.
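A minimal agreement metric for repeated sampling can be sketched as below. It assumes the final conclusion of each sampled response has already been extracted into a short string, and it measures only the share of samples that match the most common normalized conclusion. It does not compare key evidence, and it is not the content of self_consistency.py; the function names are hypothetical.

```python
import re
from collections import Counter


def normalize_answer(text):
    """Lowercase and collapse punctuation so trivially different answers match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def consistency_score(final_answers):
    """Share of samples that agree with the most common normalized answer."""
    if not final_answers:
        return 0.0
    counts = Counter(normalize_answer(a) for a in final_answers)
    return counts.most_common(1)[0][1] / len(final_answers)


# Hypothetical conclusions extracted from three sampled responses.
samples = ["Three cars.", "three cars", "Two cars"]
print(consistency_score(samples))
```

A record whose score falls below a chosen agreement level (a value to calibrate per task type) would be held back for review instead of entering the training set.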
Multilingual Expansion and Cross-Language Acceptance¶
Multilingual expansion is not simply copying an English instruction and adding an instruction_zh field. In multimodal tasks, cross-language errors often occur in visual references and proper-name translation. For example, "the sign on the left" may be translated as "the sign on the right," or brands, place names, and units may be localized incorrectly. P13 should treat Chinese and English as two sample sets that both require spot checks, not as a cheap way to double the count.
Table P13-6 lists the acceptance items for multilingual expansion, with a check method and the common issue each one catches.
Table P13-6: Multilingual expansion acceptance items.
| Acceptance item | Check method | Common issue |
|---|---|---|
| Reference consistency | Compare image against left/right, top/bottom, foreground/background | Direction words mistranslated |
| Terminology consistency | Check OCR, chart, bbox, caption against the glossary | Terminology changes across samples |
| Proper-name preservation | Check brands, places, people, and units | Over-translation or mistranslation |
| Format preservation | Check tables, lists, JSON, Markdown | Translation breaks structure |
| Safety boundary | Check whether sensitive content bypasses filtering in another language | English filtering works but Chinese filtering fails |
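The reference-consistency item can be partly pre-screened in code before human review. The sketch below is a rough heuristic under stated assumptions: it only covers left/right, it uses the characters 左 and 右 as the Chinese counterparts, and it will raise false alarms (for example, "right" meaning "correct"). It routes samples to reviewers and does not replace them. `direction_mismatches` is a hypothetical helper.

```python
import re

# Assumption: left/right are the highest-risk direction words; extend with care.
DIRECTION_PAIRS = {"left": "左", "right": "右"}


def direction_mismatches(text_en, text_zh):
    """Return English direction words whose presence differs between the two texts."""
    issues = []
    lowered = text_en.lower()
    for en_word, zh_char in DIRECTION_PAIRS.items():
        in_en = re.search(rf"\b{en_word}\b", lowered) is not None
        in_zh = zh_char in text_zh
        if in_en != in_zh:
            issues.append(en_word)
    return issues


# Hypothetical pair where the translation flipped the direction.
print(direction_mismatches("Read the sign on the left.", "请读出右边的标志。"))
print(direction_mismatches("Read the sign on the left.", "请读出左边的标志。"))
```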
If the project targets Chinese-model training, do not only translate English samples into Chinese. Keep a portion of native Chinese templates and native Chinese judge prompts. Translated samples are useful for scale, but native Chinese samples better reflect real Chinese user questions.
Test Coverage and Code Notes¶
tests/test_factory.py covers template existence, random prompt return type, judge filtering, Chinese expansion, and JSONL packaging. These tests prevent basic interface breakage, but they do not prove the factory is releasable. In particular, generate_with_qwen_vl.py is a teaching example. Before real vLLM or Qwen-VL integration, it needs input variables, exception handling, model-call result parsing, and failed-sample records. The chapter presents it to explain the generation-stage interface, not to claim production completeness.
Table P13-7 compares what the current tests cover with what is still needed before release.
Table P13-7: Test coverage and acceptance gaps for the multimodal instruction factory.
| Test item | Covered | Still needed |
|---|---|---|
| Template test | Three template types exist; prompt returns a string | Template repetition rate, task ratio |
| Judge test | Short responses filtered; long responses retained | Real judge rubric and human agreement |
| Multilingual test | Chinese expansion field generated | Semantic consistency and format preservation |
| Packaging test | JSONL file can be written | Conversation-field spot check |
| End-to-end mock | Test entry exists | Small-sample real model run |
Common Faults and Troubleshooting Paths¶
Seed-stage issues usually involve dead image links, missing image-size fields, or low-quality original captions. First count filtering reasons rather than only final seed count. If many images are removed by aspect-ratio filters, confirm field units and source schema.
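Counting filtering reasons can be sketched as follows. The thresholds mirror the seed selector in this chapter (width and height above 512, aspect ratio between 0.5 and 2.0, caption longer than 10 words). The reason labels, the function name, and the demo rows are hypothetical.

```python
from collections import Counter


def seed_filter_reason(item):
    """Return None if the row passes, otherwise the first reason it fails."""
    if not item.get("URL"):
        return "missing_url"
    w, h = item.get("WIDTH") or 0, item.get("HEIGHT") or 0
    if w <= 0 or h <= 0:
        return "missing_size"
    if w <= 512 or h <= 512:
        return "too_small"
    if not 0.5 < w / h < 2.0:
        return "aspect_ratio"
    if len(str(item.get("TEXT") or "").split()) <= 10:
        return "short_caption"
    return None


# Hypothetical metadata rows using LAION-style field names.
long_caption = "A red delivery truck parked beside a bakery with a handwritten menu board"
rows = [
    {"URL": "u1", "WIDTH": 1024, "HEIGHT": 768, "TEXT": long_caption},
    {"URL": "u2", "WIDTH": None, "HEIGHT": 600, "TEXT": long_caption},
    {"URL": "u3", "WIDTH": 2000, "HEIGHT": 600, "TEXT": long_caption},
    {"URL": "u4", "WIDTH": 800, "HEIGHT": 800, "TEXT": "short caption"},
]
reasons = Counter(seed_filter_reason(r) or "kept" for r in rows)
print(dict(reasons))
```

If one reason dominates the counts, inspect the source schema for that field before tuning the threshold.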
Generation-stage issues include incorrect model input format, image-download failure, empty VLM output, and OOM or timeout from excessive concurrency. With vLLM, record concurrency, GPU memory, and failed requests. With APIs, record retry count, error codes, and billing units.
Filtering-stage issues often come from a judge that over-rewards long answers. Long answers are not necessarily high quality; in multimodal settings they may contain more hallucinations. Review high-score, low-score, and threshold-near samples separately, and periodically calibrate the judge with human labels.
Packaging-stage issues involve mismatch among image URLs, <image> markers, and the target training framework's conversation format. Randomly read JSONL lines and confirm that each line is valid JSON, conversations[0].value contains an image placeholder, assistant output is non-empty, and quality fields link back to intermediate records.
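The line-level checks described above can be sketched as a small validator. It assumes the record layout written by the packaging listing in this chapter (`conversations` with a user turn followed by an assistant turn, and a `quality.judge_score` field). The problem labels and function name are hypothetical, and passing this check does not prove that the target training framework can load the file.

```python
import json


def validate_packed_line(line):
    """Return a list of problems for one JSONL line; an empty list means it passed."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(record, dict):
        return ["not_an_object"]
    turns = record.get("conversations") or []
    if len(turns) < 2:
        return ["missing_turns"]
    problems = []
    if "<image>" not in turns[0].get("value", ""):
        problems.append("missing_image_placeholder")
    if not turns[1].get("value", "").strip():
        problems.append("empty_assistant")
    if "judge_score" not in (record.get("quality") or {}):
        problems.append("missing_quality")
    return problems


# Hypothetical lines: one valid, one with three defects.
good_line = json.dumps({
    "type": "image", "image": "assets/demo.jpg",
    "conversations": [
        {"from": "user", "value": "<image>\nDescribe"},
        {"from": "assistant", "value": "A cat."},
    ],
    "quality": {"judge_score": 4.5},
})
bad_line = json.dumps({
    "type": "image", "image": "assets/demo.jpg",
    "conversations": [
        {"from": "user", "value": "Describe"},
        {"from": "assistant", "value": "  "},
    ],
    "quality": {},
})
print(validate_packed_line(good_line), validate_packed_line(bad_line))
```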
Manual Spot Checks and Release Gates¶
Multimodal instruction factories often look good on automated metrics while failing under human reading. High-scoring samples may be fluent but unfaithful to the image; multilingual samples may be grammatical but wrong on direction, count, or OCR text. Manual spot checks are therefore mandatory before release.
Table P13-8 lists the review strata, where each stratum's samples come from, and what reviewers should focus on.
Table P13-8: Manual review strata for the multimodal instruction factory.
| Review layer | Sample source | Review focus |
|---|---|---|
| High-score samples | Highest judge-score batch | Whether the judge over-rewards long text |
| Boundary samples | Samples close to threshold | Whether the threshold is too strict or too loose |
| Low-score samples | Filtered samples | Whether valuable samples were wrongly removed |
| OCR samples | ocr_reading tasks | Text accuracy and format preservation |
| Reasoning samples | complex_reasoning tasks | Whether reasoning has visual evidence |
| Chinese samples | Multilingual expansion results | Terminology, direction, proper names |
| Multi-image / video samples | Packer extensions | Reference order and placeholders |
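The three score-based strata (high, boundary, low) can be drawn with a seeded sampler so that a review batch is reproducible. The sketch assumes every record keeps its `judge_score`, including records that were filtered out; the demonstration judge in this chapter discards those, so low-score review needs the pre-filter records. The 0.25 margin, the per-layer size, and the function name are illustrative assumptions.

```python
import random


def stratify_for_review(records, threshold=4.0, margin=0.25, per_layer=2, seed=13):
    """Split records into high / boundary / low by judge score and sample each layer."""
    layers = {"high": [], "boundary": [], "low": []}
    for record in records:
        score = record["judge_score"]
        if abs(score - threshold) <= margin:
            layers["boundary"].append(record)
        elif score > threshold:
            layers["high"].append(record)
        else:
            layers["low"].append(record)
    rng = random.Random(seed)  # fixed seed so the same batch can be re-drawn
    return {
        name: rng.sample(items, min(per_layer, len(items)))
        for name, items in layers.items()
    }


# Hypothetical scored records, including ones below the threshold.
scored = [{"sample_id": f"s{i}", "judge_score": s}
          for i, s in enumerate([4.9, 4.8, 4.7, 4.1, 3.9, 3.0, 2.5])]
review_batch = stratify_for_review(scored)
print({name: [r["sample_id"] for r in items] for name, items in review_batch.items()})
```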
Manual review should use dual review plus arbitration. The first reviewer checks image-text consistency and task following. The second checks language quality and safety boundaries. Conflicts enter an arbitration pool, which is then used to revise judge prompts, templates, and thresholds. Manual review is not a one-time quality check; it is part of factory iteration.
Release gates should include at least four checks. First, sample sources must be traceable, and external images must not be represented only by naked URLs. Second, training files must be readable by the target framework, not merely valid JSON. Third, there must be an agreement report between judge and human review. Fourth, if multilingual samples are released, Chinese and English quality must be reported separately.
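For the agreement report between judge and human review, one common choice is Cohen's kappa over keep/drop decisions, which corrects raw agreement for agreement expected by chance. The source does not prescribe a specific statistic, so treat this as one option. The sketch assumes both label lists cover the same samples in the same order; the demo labels are made up for illustration.

```python
def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    categories = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(c) / n) * (human_labels.count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical keep (1) / drop (0) decisions on eight spot-checked samples.
judge = [1, 1, 1, 0, 0, 1, 0, 1]
human = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohen_kappa(judge, human), 3))
```

Here raw agreement is 0.75, but kappa is noticeably lower because both raters keep most samples. Reporting both numbers avoids overstating judge reliability.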
Table P13-9 lists each release gate, the evidence it requires, and the action to take on failure.
Table P13-9: Release-gate checklist for the multimodal instruction factory.
| Gate | Required evidence | Action on failure |
|---|---|---|
| Source gate | URL, license, download status, deletion-request handling | Remove unauthorized or untraceable samples |
| Format gate | JSONL validation, small-sample training loader read | Fix packer or field schema |
| Quality gate | Judge distribution, human-review pass rate, failure types | Adjust templates, thresholds, or generation parameters |
| Multilingual gate | Separate Chinese and English review reports | Roll back low-quality translation batches |
| Safety gate | Sensitive content, privacy, identity-inference checks | Delete samples and update filtering rules |
| Version gate | Model version, template version, run batch | Freeze versions before release |
Multi-Image and Video Extension Path¶
The presence of pack_multi_image_video.py indicates that this project targets more than single-image SFT. Modern VLM training increasingly depends on interleaved images, multi-image comparison, and short video clips. The core issue is not concatenating several <image> tags, but making the instruction clearly point to each visual input and making the answer explicitly express comparison, ordering, temporal change, or cross-image relation.
Table P13-10 compares instruction types by input organization, instruction focus, and common error.
Table P13-10: Comparison of multimodal instruction types.
| Type | Input organization | Instruction focus | Common error |
|---|---|---|---|
| Single image | One <image> |
Description, OCR, local reasoning | Hallucinated object or text |
| Multi-image comparison | <image_1>, <image_2> |
Difference, similarity, ordering, change | Image order confused |
| Interleaved text-image | Text paragraphs with multiple images | Refer to images through context | Wrong image reference or missing context |
| Short video | Multiple frames or <video> | Action, temporal order, camera movement | Describes video as static image |
| Chart screenshot | Image plus OCR/table structure | Numeric reading, trend explanation | Fabricated value or axis |
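The "wrong image reference" and "image order confused" errors in the table can be partly caught before packing. The sketch below assumes the numbered placeholder convention `<image_1>`, `<image_2>` shown in the table; the function name `check_image_references` is hypothetical. It only checks that every attached image is referenced and that no reference points past the attached set. It cannot tell whether the answer actually compares the right images.

```python
import re

# Numbered placeholders such as <image_1>; the convention is assumed from the table.
IMAGE_TAG = re.compile(r"<image_(\d+)>")


def check_image_references(text, num_images):
    """Return a list of problems with image placeholders in an instruction."""
    referenced = {int(idx) for idx in IMAGE_TAG.findall(text)}
    expected = set(range(1, num_images + 1))
    problems = []
    for idx in sorted(expected - referenced):
        problems.append(f"image {idx} is attached but never referenced")
    for idx in sorted(referenced - expected):
        problems.append(f"<image_{idx}> is referenced but not attached")
    return problems


if __name__ == "__main__":
    print(check_image_references("Compare <image_1> and <image_2>.", 2))
    print(check_image_references("Describe <image_1> and <image_3>.", 2))
```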
For video, reuse P14's shot-level structure: frame_paths, caption_en, shot_language, and camera_motion can become video-instruction material for P13. A video QA sample can ask the model to explain how the subject moves or infer camera movement. For video data, P14 is therefore upstream of P13: P14 is the video data pipeline that supplies shot-level fields, while P13 is the instruction factory that turns them into instruction samples.
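One way to picture this handoff is a small adapter that turns a P14 shot record into a P13 instruction seed. This is a sketch under assumptions: only the four field names frame_paths, caption_en, shot_language, and camera_motion come from the text. The output layout, the task type name `video_camera_motion`, and the function `shot_to_video_qa` are hypothetical.

```python
def shot_to_video_qa(shot, sample_id):
    """Build a camera-motion QA seed from one P14 shot record.

    The output schema is illustrative, not the project's actual packed format.
    """
    missing = [k for k in ("frame_paths", "camera_motion") if not shot.get(k)]
    if missing:
        raise ValueError(f"shot is missing required fields: {missing}")
    return {
        "sample_id": sample_id,
        "source": {"project": "P14", "shot_language": shot.get("shot_language")},
        "frames": list(shot["frame_paths"]),
        "task": {"type": "video_camera_motion"},
        "question": "<video>\nHow does the camera move in this shot?",
        # Reference material for generation and judging, not a final answer.
        "answer_reference": shot["camera_motion"],
        "context_caption": shot.get("caption_en"),
    }


if __name__ == "__main__":
    shot = {
        "frame_paths": ["shot_0001/f000.jpg", "shot_0001/f012.jpg"],
        "caption_en": "A cyclist rides along a coastal road.",
        "shot_language": "wide shot",
        "camera_motion": "pan left",
    }
    print(shot_to_video_qa(shot, "p13-video-000001"))
```

Keeping camera_motion as a reference field rather than writing it straight into the answer leaves room for the generation and judge stages to produce and check a full-sentence response.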
Deliverable Directory and Version Management¶
P13 deliverables should be separated into raw, scored, expanded, packed, and reports. This avoids mixing training files with audit files and makes stage-level rollback possible.
Table P13-11 summarizes the corresponding comparison and engineering considerations.
Table P13-11: Deliverable directory for the multimodal instruction factory.
| Path | Content | Note |
|---|---|---|
| data/seeds.jsonl | Seed asset list | URL, authorization, original caption, filtering reason |
| data/generated_raw.jsonl | Raw VLM generations | Not used directly for training; used for review |
| data/scored.jsonl | Judge-filtered records | Score, rubric, model version |
| data/consistent.jsonl | Self-consistency-filtered records | Multi-sample consistency evidence |
| data/multilingual.jsonl | Multilingual expansion samples | Language, glossary version, translation model |
| data/mm_sft_final.jsonl | Training input file | Targeting Qwen-VL or another training framework |
| reports/task_distribution.json | Task-distribution report | Checks task imbalance |
| reports/human_review.md | Manual review report | Core release-gate evidence |
| reports/license_audit.md | Copyright and source audit | Required for public release |
For version management, hash templates and judge prompts. A model version can remain fixed while template text changes enough to shift sample distribution. A small judge-prompt change can also move pass rates. Release reports should include model version, template version, judge-prompt version, and data batch, not just "generated with Qwen2.5-VL."
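Hashing the template and judge-prompt text can be sketched with the standard library alone. The helper names `text_hash` and `build_release_manifest`, the 12-character truncation, and the manifest keys are assumptions made for illustration.

```python
import hashlib
import json


def text_hash(text):
    """Short SHA-256 digest of prompt text, stable across line-ending styles."""
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def build_release_manifest(model_version, template_text, judge_prompt_text, batch_id):
    """Collect the four identifiers a release report should carry."""
    return {
        "model_version": model_version,
        "template_hash": text_hash(template_text),
        "judge_prompt_hash": text_hash(judge_prompt_text),
        "batch_id": batch_id,
    }


if __name__ == "__main__":
    manifest = build_release_manifest(
        model_version="Qwen2.5-VL-7B",
        template_text="Describe the image in detail.",
        judge_prompt_text="Score the answer from 1 to 5 against the rubric.",
        batch_id="batch-001",  # hypothetical batch label
    )
    print(json.dumps(manifest, indent=2))
```

Normalizing line endings before hashing avoids spurious version changes when the same prompt file is edited on different operating systems, while any real wording change still produces a new hash.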
Data Dashboard and Continuous Iteration¶
After launch, the factory must continue observing sample distribution instead of generating once and sending data directly to training. The dashboard can start as JSONL statistics scripts; it does not need to be a complex platform. Each batch should answer: whether task types are balanced, whether judge pass rate is abnormal, whether multilingual ratio is stable, and which asset types dominate failure samples.
Table P13-12 summarizes the corresponding comparison and engineering considerations.
Table P13-12: Dashboard metrics for the multimodal instruction factory.
| Dashboard metric | Object | Purpose |
|---|---|---|
| Seed pass rate | seeds.jsonl | Judge whether asset-selection thresholds are too strict |
| Task-type distribution | task.type | Prevent over-production of detailed-description samples |
| Average response length | Raw/scored samples | Detect templated short answers or verbose hallucination |
| Judge-score distribution | quality.judge_score | Observe model and rubric drift |
| Consistency distribution | quality.consistency_score | Detect unstable reasoning tasks |
| Chinese-English ratio | language | Control multilingual expansion scale |
| Format error rate | JSONL validation result | Detect packer or template problems |
| Manual-review pass rate | Review report | Judge release-gate readiness |
| Safety-interception rate | Safety filter | Monitor sensitive content and privacy risk |
Dashboards must be stored by batch. A sudden judge-pass-rate increase does not necessarily mean quality improved; the judge prompt may have become looser, templates may have become longer, or the model may have learned to produce verbose answers. If OCR-task pass rate is much lower than description-task pass rate, inspect OCR image quality and the scoring rubric separately instead of raising the global threshold.
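A statistics script of this kind can be quite short. The sketch below assumes each record exposes task.type, quality.judge_score, and language as nested JSON fields, matching the names in Table P13-12. The pass threshold of 4.0 and the function name `batch_stats` are illustrative assumptions. It reports the judge pass rate per task type, which is what makes an OCR-versus-description gap visible.

```python
from collections import Counter, defaultdict


def batch_stats(records, pass_threshold=4.0):
    """Summarize one batch: task share, language share, pass rate per task."""
    task_counts = Counter()
    lang_counts = Counter()
    passed = defaultdict(int)
    for rec in records:
        task = rec["task"]["type"]
        task_counts[task] += 1
        lang_counts[rec.get("language", "unknown")] += 1
        if rec["quality"]["judge_score"] >= pass_threshold:
            passed[task] += 1
    total = sum(task_counts.values())
    return {
        "total": total,
        "task_share": {t: c / total for t, c in task_counts.items()},
        "language_share": {l: c / total for l, c in lang_counts.items()},
        "judge_pass_rate_by_task": {
            t: passed[t] / c for t, c in task_counts.items()
        },
    }


if __name__ == "__main__":
    demo = [
        {"task": {"type": "description"}, "quality": {"judge_score": 4.5}, "language": "en"},
        {"task": {"type": "ocr"}, "quality": {"judge_score": 2.0}, "language": "zh"},
    ]
    print(batch_stats(demo))
```

Writing one such summary per batch, keyed by batch ID, is enough to compare batches over time before investing in a dashboard platform.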
Sample Takedown and Copyright Response¶
Multimodal data triggers copyright, portrait-right, and privacy risks more easily than pure text. A public URL does not imply unlimited redistribution of images or generated results. P13 must keep a takedown path: when an image, author, or source collection must be deleted, the system should locate related instructions, translated samples, and final training files.
Table P13-13 summarizes the corresponding comparison and engineering considerations.
Table P13-13: Takedown path for multimodal instruction samples.
| Step | Operation | Affected artifact |
|---|---|---|
| Register request | Record URL, author, source, request time, evidence | Ticket |
| Locate asset | Search seed by URL, hash, source, or sample ID | seeds.jsonl |
| Locate derived samples | Search raw, scored, multilingual, and packed records | All intermediate states |
| Delete training samples | Remove corresponding lines from mm_sft_final.jsonl | Training file |
| Recompute statistics | Update task distribution, language ratio, quality report | Reports |
| Publish note | Record deletion reason and new version | Release note |
The takedown mechanism requires a stable sample_id in every intermediate state. If only the final Qwen conversation format is saved, it is hard to trace a training sample back to the original image and generation batch. P13 must therefore distinguish the narrow training table from the wide audit table.
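The locate-and-delete steps can be sketched as a filter over in-memory records. The sketch assumes that every seed carries sample_id and url, and that every derived record carries a seed_id pointing back to its seed. The seed_id field and the function name `apply_takedown` are hypothetical; the text only requires that some stable ID links the stages.

```python
def apply_takedown(seeds, derived_stages, request_urls):
    """Remove seeds matching the requested URLs and everything derived from them.

    derived_stages maps a stage name (e.g. "scored") to its list of records.
    Returns the kept seeds, the kept records per stage, and removal counts.
    """
    request_urls = set(request_urls)
    removed_ids = {s["sample_id"] for s in seeds if s["url"] in request_urls}
    kept_seeds = [s for s in seeds if s["sample_id"] not in removed_ids]

    kept_stages = {}
    removed_counts = {"seeds": len(seeds) - len(kept_seeds)}
    for stage, records in derived_stages.items():
        kept = [r for r in records if r["seed_id"] not in removed_ids]
        kept_stages[stage] = kept
        removed_counts[stage] = len(records) - len(kept)
    return kept_seeds, kept_stages, removed_counts


if __name__ == "__main__":
    seeds = [
        {"sample_id": "seed-1", "url": "https://example.com/a.jpg"},
        {"sample_id": "seed-2", "url": "https://example.com/b.jpg"},
    ]
    stages = {
        "scored": [{"seed_id": "seed-1"}, {"seed_id": "seed-2"}],
        "packed": [{"seed_id": "seed-1"}],
    }
    print(apply_takedown(seeds, stages, ["https://example.com/a.jpg"]))
```

The per-stage removal counts feed directly into the "recompute statistics" and "publish note" steps of the table.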
Domain Transfer: From General Images to Industry Assets¶
P13 can transfer to medical imaging, industrial inspection, e-commerce product images, legal-evidence screenshots, and educational charts. But templates and gates must be redesigned for each domain. Visual evidence and risk boundaries differ too much to reuse generic LAION templates directly.
Table P13-14 summarizes the corresponding comparison and engineering considerations.
Table P13-14: Domain-transfer adjustments for the multimodal instruction factory.
| Domain | Asset type | Template adjustment | Risk control |
|---|---|---|---|
| Medical | Images, report screenshots | Describe abnormal regions; avoid diagnostic conclusions | Expert review, privacy masking |
| Industrial | Defect images, equipment photos | Describe defect location, morphology, severity | Internal confidentiality, misjudgment cost |
| E-commerce | Product images, detail-page screenshots | Attribute extraction, comparison, OCR reading | Brand authorization, exaggerated descriptions |
| Finance | Report screenshots, charts | Table reading, trend explanation, evidence citation | Numeric accuracy, investment-advice boundary |
| Education | Problem figures, board writing, textbook illustrations | Solving hints, chart understanding | Copyright, answer leakage |
For domain transfer, build a small set of high-quality templates and expert-review samples before scaling. In high-risk domains, do not rely entirely on LLM-as-Judge. The judge can pre-filter, but the release decision should also require sign-off from domain experts or rule systems.
Relationship with P03 and P14¶
P13 sits between P03 and P14. P03 establishes the classic LLaVA image-text and conversation baseline. P13 adds Qwen-VL-style generation, judge, self-consistency, and multilingual expansion. P14 extends visual input from static images to video shots. Together they form a progression from single-image baseline to modern multimodal factory and then to video-generation data.
Table P13-15 summarizes the corresponding comparison and engineering considerations.
Table P13-15: Project boundaries among P03, P13, and P14.
| Project | Core object | Key capability | Boundary not to confuse |
|---|---|---|---|
| P03 | LLaVA image-text pairs and conversation | Classic flow, OCR, bbox, visual spot checks | Does not emphasize newer Qwen-VL factory capability |
| P13 | Multimodal instruction samples | Templates, VLM generation, judge, multilingual packaging | Does not handle video cutting and T2V quality filtering |
| P14 | Video shot data | Shot segmentation, motion, aesthetics, caption, shot language | Does not handle large-scale instruction diversification |
With this organization, readers can treat P03 as the baseline data structure, P13 as the instruction-generation factory, and P14 as the video-material and temporal-supervision source. A future Video-QA or Video-Instruct dataset can first use P14 to create video segments and shot fields, then use P13 templates, judge, and packaging to produce instruction samples.
Results and Analysis¶
The example acceptance setting deploys Qwen2.5-VL-7B with vLLM on one node with four 4090 GPUs and calls a 72B model as judge through an API, producing a candidate batch of multimodal instruction samples. In formal reproduction, replace the example scale with actual task configuration, generation logs, and sample manifests.
- Task distribution: Detailed description (40%), complex reasoning (30%), OCR and tables (20%), and fine-grained grounding (10%). No single category exceeded 40%.
- Quality distribution: Samples passing LLM-as-Judge filtering should record mean, quantiles, and rejection reasons. An example acceptance report may show an average score such as 4.3 / 5.0, but formal results must retain scoring details and judge version.
Formal acceptance should check four kinds of evidence. First, the data can be read by downstream training scripts. Second, the task-type distribution matches the planned ratio. Third, image, instruction, and answer are not obviously mismatched. Fourth, source license, model license, and redistribution rules for generated artifacts are registered. Only after these checks can generated data move from the candidate pool into the training set.
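The second check, matching the task-type distribution against the planned ratio, can be expressed as a tolerance test. The planned shares below are the example figures from this section. The task-type labels, the 5-point tolerance, and the function name `distribution_deviations` are assumptions for illustration.

```python
from collections import Counter

# Planned shares taken from the example acceptance setting; labels are hypothetical.
PLANNED = {
    "detailed_description": 0.40,
    "complex_reasoning": 0.30,
    "ocr_table": 0.20,
    "grounding": 0.10,
}


def distribution_deviations(task_types, planned, tolerance=0.05):
    """Return task types whose observed share differs from plan by more than tolerance."""
    if not task_types:
        raise ValueError("no samples to check")
    counts = Counter(task_types)
    total = len(task_types)
    deviations = {}
    for task in sorted(set(planned) | set(counts)):
        observed = counts.get(task, 0) / total
        target = planned.get(task, 0.0)
        if abs(observed - target) > tolerance:
            deviations[task] = {"observed": observed, "planned": target}
    return deviations


if __name__ == "__main__":
    batch = (["detailed_description"] * 5 + ["complex_reasoning"] * 3
             + ["ocr_table"] * 2)
    print(distribution_deviations(batch, PLANNED))
```

Task types that appear in the batch but not in the plan are compared against a planned share of zero, so unplanned categories are flagged rather than silently accepted.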
Cost and Optimization¶
The industrial synthesis factory has the following cost profile:
- Synthesis cost: On private compute, a 7B VLM takes about 1-2 seconds to generate one long image-conditioned response. With commercial APIs, the cost is about USD 5-10 per thousand high-quality samples.
- Scalability: vLLM tensor parallelism handles multimodal generation pressure well. When compute is limited, reduce max_num_seqs and lower the sampling temperature to prevent low-value divergence.
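The figures above translate into a rough budget with simple arithmetic. The sketch below assumes a batch of 50,000 samples (the size suggested by the example dataset name in this chapter) and treats the 1-2 second figure as per-response latency on a single stream. Both are assumptions; batched serving will change the wall-clock time. The function names are hypothetical.

```python
def generation_hours(num_samples, seconds_per_sample, parallel_streams=1):
    """Wall-clock hours if each stream produces one response at a time."""
    return num_samples * seconds_per_sample / parallel_streams / 3600


def api_cost_usd(num_samples, usd_per_thousand):
    """API spend for a given per-thousand-sample price."""
    return num_samples / 1000 * usd_per_thousand


if __name__ == "__main__":
    n = 50_000  # assumed batch size
    for sec in (1, 2):
        print(f"{sec}s/sample, 1 stream: {generation_hours(n, sec):.1f} h")
    for price in (5, 10):
        print(f"USD {price}/1k samples: USD {api_cost_usd(n, price):.0f}")
```

Under these assumptions, a single stream needs roughly 14 to 28 hours for 50,000 samples, and the commercial-API route costs roughly USD 250 to 500 for the same volume.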
Extensions¶
Compared with earlier LLaVA-style data pipelines that relied heavily on manual work or expensive GPT-4V distillation, the Qwen-VL plus LLM-as-Judge self-distillation pipeline can substantially lower the cost of producing fine-tuning data.
Video clips can be inserted into the same pipeline by changing the packer: sampled frames can be represented as multiple <image> tags or one <video> field, enabling data synthesis for T2V or Video-QA models.
Data Compliance and Open-source Licensing¶
When building and publishing instruction datasets, observe these constraints:
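- LAION seed images: LAION distributes image URLs and text metadata rather than the images themselves. The metadata is released under CC-BY 4.0, but each image remains under its original copyright or license, so seed images should be used for research under the terms that apply to each source.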
- Qwen2.5-VL: Model use and redistribution of generated content are governed by the model's open-source or commercial license.
- Generated artifacts: A dataset such as dataforge-mm-instruction-50k can be released under CC-BY-SA when the upstream licenses allow it.
Chapter Summary¶
This chapter used the multimodal instruction factory as a project case to show how to build a multimodal instruction production chain covering image, text, OCR, chart, and dialogue tasks. Its main value is putting task definition, data boundaries, architecture decisions, sample schema, acceptance metrics, and reproduction resources into one chain, so the project is not merely a sequence of operations but a reviewable case study.
The boundary of the case must also remain explicit. It targets controlled assets and sample factories; it does not cover unauthorized media collection or fully automated safety review. In larger-scale, higher-risk, or more strictly regulated settings, data sources, permission status, human-review ratio, runtime cost, and rollback plans must be reassessed.
As part of Part 14, this chapter validates methods from earlier chapters at the project layer. Readers can combine this case with Part 13's data recipes, the platform-governance chapters, and the appendix checklists to form a closed loop from method understanding to engineering delivery.
References¶
Bai S, Chen K, Liu X, Wang J, Ge W, Song S, Dang K, Wang P, Wang S, Tang J, et al. (2025) Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
Zhu J, Wang W, Chen Z, Liu Z, Ye S, Gu L, Duan Y, Tian H, Su W, Shao J, et al. (2025) InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.
Kwon W, Li Z, Zhuang S, Sheng Y, Zheng L, Yu C H, Gonzalez J E, Zhang H, Stoica I (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. In: Proceedings of the 29th ACM Symposium on Operating Systems Principles, pp 611-626. https://doi.org/10.1145/3600006.3613165.
Schuhmann C, Beaumont R, Vencu R, Gordon C, Wightman R, Cherti M, Coombes T, Katta A, Mullis C, Wortsman M, et al. (2022) LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models. In: Advances in Neural Information Processing Systems 35, pp 25278-25294. Available at: https://arxiv.org/abs/2210.08402.
Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S, Chowdhery A, Zhou D (2023) Self-Consistency Improves Chain of Thought Reasoning in Language Models. In: International Conference on Learning Representations. arXiv:2203.11171.
Zheng L, Chiang W L, Sheng Y, Zhuang S, Wu Z, Zhuang Y, Lin Z, Li Z, Li D, Xing E P, Zhang H, Gonzalez J E, Stoica I (2023) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In: Advances in Neural Information Processing Systems 36. arXiv:2306.05685.