Data Engineering for Large Models: Architecture, Algorithms & Projects
Full Table of Contents Overview
The Chinese 2026 edition is the primary edition of this book, comprising 33 chapters and 14 end-to-end project chapters. The English and Japanese editions are being translated incrementally. Chapters that have not yet been translated display a notice; use the language switcher at the top of the page to read them in the Chinese edition.
- Preface
- Chinese 2026 Edition Translation Status
- Part 1: Overview and Infrastructure
- Part 2: Text Pre-training Data Engineering
- Part 3: Multimodal Data Engineering
- Part 4: Instruction Fine-tuning and Preference Data
- Part 5: Synthetic Data Engineering
- Part 6: Reasoning and Agent Data Engineering
- Part 7: Application-Level Data Engineering
- Part 8: Data Operations and Platform Development
- Part 9: Privacy Compliance and Data Security
- Part 10: Practical Projects
- Part 11: Open-Source LLM Data Engineering in Practice
Part 1: Overview and Infrastructure
Establishes the core conceptual framework for LLM data engineering, covering the data lifecycle, quality evaluation, platform stack, and cost governance.
- Chapter 1: The Data Revolution in the Era of Large Models
- Chapter 2: LLM Data Lifecycle and Quality Evaluation Framework
- Chapter 3: AI-Native Data Stack and Cost Governance
Part 2: Text Pre-training Data Engineering
Targets large-scale text corpora, covering data sources, acquisition and copyright, cleaning and deduplication, tokenization and serialization, efficient loading, and the quality closed loop.
- Chapter 4: Data Sources, Acquisition, and Copyright
- Chapter 5: Cleaning, Deduplication, and Decontamination
- Chapter 6: Tokenization, Serialization, and Efficient Loading
- Chapter 7: Data Evaluation, Quality Closed Loop, and Operational Iteration
Part 3: Multimodal Data Engineering
Handles image-text, document, video, audio, and cross-modal alignment data, focusing on sample structure, quality control, annotation augmentation, and fusion training.
- Chapter 8: Image-Text Pair Data Engineering
- Chapter 9: Re-captioning and Document Understanding
- Chapter 10: Video and Audio Data Engineering
- Chapter 11: Cross-modal Alignment and Fusion
Part 4: Instruction Fine-tuning and Preference Data
Centers on model alignment data, covering the SFT instruction system, preference data, reward signals, annotation platforms, and quality operations.
- Chapter 12: SFT Data Design and Instruction System
- Chapter 13: Preference Data and Reward Signals
- Chapter 14: Annotation Platforms, QA Systems, and Data Operations
Part 5: Synthetic Data Engineering
Walks from seed samples to a synthetic data factory, including knowledge distillation, model collaboration, quality control, and the risk of model collapse.
- Chapter 15: Synthetic Data Factory: From Seed to Verification
- Chapter 16: Knowledge Distillation and Model Collaboration
- Chapter 17: Synthetic Data Quality Control and Model Collapse
Part 6: Reasoning and Agent Data Engineering
Covers the construction and validation of chain-of-thought reasoning traces, tool-use and function-calling data, agent memory, and multi-turn interaction data.
- Chapter 18: Chain-of-Thought and Reasoning Data Engineering
- Chapter 19: Tool-Use and Function Calling Data
- Chapter 20: Agent Memory and Multi-turn Interaction Data
Part 7: Application-Level Data Engineering
Targets RAG and online knowledge systems, including document parsing, visual retrieval, multimodal RAG, online feedback loops, and knowledge updates.
- Chapter 21: RAG Data Pipeline
- Chapter 22: Multimodal RAG and Visual Retrieval
- Chapter 23: Online Feedback Loop and Knowledge Update
Part 8: Data Operations and Platform Development
Builds sustainable data platform capabilities from the perspectives of team organization, version management, experiment tracking, and observability.
- Chapter 24: DataOps Flywheel and Team Organization
- Chapter 25: Data Version Management and Experiment Tracking
- Chapter 26: Data Platform Observability
Part 9: Privacy Compliance and Data Security
Discusses data compliance and governance, privacy protection, federated learning, and security boundaries, emphasizing compliance gates in engineering workflows.
- Chapter 27: Data Compliance Framework and Governance
- Chapter 28: Federated Learning and Privacy-Preserving Technologies
Part 10: Practical Projects
Ten runnable projects that tie together acquisition, cleaning, synthesis, RAG, agents, DataOps, privacy protection, and the data flywheel in end-to-end practice.
- Project 1: Building a Distributed Mini-C4 Data Pipeline with Ray
- Project 2: Vertical-Domain Expert SFT (Legal)
- Project 3: LLaVA Multimodal Instruction Data Factory
- Project 4: Synthetic Math and Code Textbook Factory
- Project 5: Multimodal RAG Enterprise Financial Report Assistant
- Project 6: CoT Reasoning Dataset Construction and PRM Training
- Project 7: Agent Tool-Use Data Factory
- Project 8: Enterprise DataOps Platform: From Data Projects to Organizational Governance
- Project 9: Privacy-Preserving Data Pipeline
- Project 10: End-to-End LLM Data Flywheel
Part 11: Open-Source LLM Data Engineering in Practice
- Chapter 29: LLM Pre-training Data Recipes
- Chapter 30: LLM Post-training Data Engineering: SFT and Preference Alignment
- Chapter 31: Reasoning Models and RL Data Engineering: R1 / QwQ Paradigm
- Chapter 32: Multimodal Understanding (VLM)
- Chapter 33: Multimodal Generative Model Data Engineering: T2I and T2V Data Pipelines
- Project 11: Mini-DeepSeek Pre-training Reproduction
- Project 12: R1 Reasoning Flywheel
- Project 13: Multimodal Instruction Factory
- Project 14: Video Generation Dataset: From Video Source to T2V Training Pipeline