Part 5: Synthetic Data Engineering

Positioning of This Part

Part 5 explains how to build a reusable synthetic data factory from seed samples. It covers knowledge distillation, collaboration among teacher models, quality control, and the risk of model collapse.

Terminology

Throughout this part, "synthetic data" refers to training candidates that are generated by models, rules, or hybrid pipelines and then validated through quality inspection; "teacher model" refers to any model that provides generation, evaluation, or distillation signals; "data factory" refers to a reproducible, auditable, and scalable generation pipeline. Every synthetic sample must specify its seed sources, generation strategy, filtering rules, and applicable scope. Generation volume alone is no substitute for evidence of quality.
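As a minimal sketch of what such a per-sample specification might look like, the record below attaches provenance fields to a single synthetic sample. The class name `SyntheticSample`, its field names, and the example values are all hypothetical and chosen only for illustration; they are not a schema defined in this book.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SyntheticSample:
    """Hypothetical provenance record for one synthetic training candidate."""

    text: str
    seed_ids: Tuple[str, ...]        # seed samples this candidate was derived from
    generation_strategy: str         # e.g. "paraphrase" or "rule_template"
    teacher_model: Optional[str]     # None when generation is purely rule-based
    filters_passed: Tuple[str, ...]  # names of the filtering rules it passed
    applicable_scope: str            # where the sample is intended to be used

    def is_traceable(self) -> bool:
        """True only if seeds, filters, and scope are all recorded."""
        return (
            bool(self.seed_ids)
            and bool(self.filters_passed)
            and bool(self.applicable_scope.strip())
        )


sample = SyntheticSample(
    text="What is the boiling point of water at sea level?",
    seed_ids=("seed-0001",),
    generation_strategy="paraphrase",
    teacher_model=None,
    filters_passed=("length_check", "dedup"),
    applicable_scope="SFT, general QA",
)

# Serialize the record so it can be stored next to the sample itself.
print(json.dumps(asdict(sample), indent=2))
```

A check such as `is_traceable()` could reject any sample whose provenance is incomplete before it is counted toward generation volume.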

Learning Objectives

After completing this part, readers should be able to:

Design a synthetic data pipeline that runs from seed samples and generation strategy through filtering and acceptance.
Choose collaboration patterns among teacher models, rule-based validation, and human sampling.
Identify risks such as model collapse, distribution drift, templated output, and false diversity.
Record sources, versions, generation parameters, quality signals, and applicability boundaries for synthetic samples.
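To make the "templated output" risk in the objectives above concrete, the sketch below computes a simple distinct-n ratio: the fraction of unique word n-grams in a batch of samples. This metric is an assumption introduced here for illustration, not a check prescribed by this part, and the function name `distinct_n` and the example batches are hypothetical. It is only a crude lexical signal: a low ratio hints at templating, but a high ratio does not rule out false diversity, where the surface wording varies while the underlying content stays the same.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Fraction of unique word n-grams among all n-grams in the batch."""
    unique = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # An empty batch (or texts shorter than n tokens) yields 0.0 by convention.
    return len(unique) / total if total else 0.0


templated = [
    "Please explain what a neuron is.",
    "Please explain what a gradient is.",
    "Please explain what a tensor is.",
]
varied = [
    "Please explain what a neuron is.",
    "How does gradient descent update weights?",
    "Compare tensors with ordinary matrices.",
]

print(f"templated batch: {distinct_n(templated):.2f}")
print(f"varied batch:    {distinct_n(varied):.2f}")
```

The templated batch reuses the same opening phrase in every sample, so it scores lower than the varied batch.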

Prerequisites

Before reading this part, readers should understand the SFT, preference-data, and QA conventions from Part 4. Readers with synthetic data experience may want to focus on whether their own process is reproducible, auditable, and able to be rolled back, rather than only on generation scale.

Chapter Logic

Chapter 15 establishes the basic process of a synthetic data factory and explains how seed data, generation, filtering, and acceptance are connected. Chapter 16 discusses knowledge distillation and multi-model collaboration, explaining how teacher models provide transferable supervision signals. Chapter 17 closes with quality control and model collapse, emphasizing that synthetic data must pass checks for distribution, difficulty, realism, and risk boundaries before entering training.
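The seed, generation, filtering, and acceptance stages mentioned for Chapter 15 can be sketched as a small pipeline. Everything here is a placeholder under stated assumptions: `generate` stands in for a teacher model or rule engine, `passes_filters` uses two made-up rules, and the `min_accept_rate` gate is a hypothetical acceptance criterion. A real factory would add the distribution, difficulty, realism, and risk checks that the later chapters cover.

```python
from typing import Dict, List, Tuple


def generate(seed: str) -> List[str]:
    """Stand-in for a teacher model or rule engine: returns candidate variants."""
    return [f"{seed} Explain briefly.", f"{seed} Give an example.", seed]


def passes_filters(candidate: str, seed: str) -> bool:
    """Hypothetical filtering rules: drop verbatim seed copies and very short text."""
    return candidate != seed and len(candidate.split()) >= 4


def run_pipeline(
    seeds: List[str], min_accept_rate: float = 0.5
) -> Tuple[List[Dict[str, str]], float]:
    """Run seed -> generate -> filter -> accept and return (accepted, pass_rate)."""
    survivors: List[Dict[str, str]] = []
    total = 0
    for seed in seeds:
        for candidate in generate(seed):
            total += 1
            if passes_filters(candidate, seed):
                # Keep the seed with each sample so its origin stays auditable.
                survivors.append({"seed": seed, "text": candidate})
    pass_rate = len(survivors) / total if total else 0.0
    # Acceptance gate: release the batch only if enough candidates survived.
    accepted = survivors if pass_rate >= min_accept_rate else []
    return accepted, pass_rate


seeds = ["What is overfitting?", "Define learning rate."]
accepted, pass_rate = run_pipeline(seeds)
print(f"accepted {len(accepted)} samples, pass rate {pass_rate:.2f}")
```

Gating on the batch-level pass rate rather than accepting survivors one by one is a design choice made for this sketch: an unusually low pass rate may point to a problem in the generation step itself, which is worth investigating before any of the batch enters training.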