Data Engineering for Large Models: Architecture, Algorithms & Projects

Full Table of Contents Overview

The Chinese 2026 edition is the primary edition of this book, covering 33 chapters and 14 end-to-end project chapters. The English and Japanese editions are being translated incrementally. Chapters not yet translated display a notice; use the language switcher at the top of the page to read them in the Chinese edition.

  • Preface
  • Chinese 2026 Edition Translation Status
  • Part 1: Overview and Infrastructure
  • Part 2: Text Pre-training Data Engineering
  • Part 3: Multimodal Data Engineering
  • Part 4: Instruction Fine-tuning and Preference Data
  • Part 5: Synthetic Data Engineering
  • Part 6: Reasoning and Agent Data Engineering
  • Part 7: Application-Level Data Engineering
  • Part 8: Data Operations and Platform Development
  • Part 9: Privacy Compliance and Data Security
  • Part 10: Practical Projects
  • Part 11: Open-Source LLM Data Engineering in Practice

Part 1: Overview and Infrastructure

Establishes the core conceptual framework for LLM data engineering, covering the data lifecycle, quality evaluation, the platform stack, and cost governance.

Part 2: Text Pre-training Data Engineering

Targets large-scale text corpora, covering data sources, acquisition and copyright, cleaning and deduplication, tokenization and serialization, efficient loading, and closed-loop quality control.

Part 3: Multimodal Data Engineering

Handles image-text, document, video, audio, and cross-modal alignment data, focusing on sample structure, quality control, annotation augmentation, and fusion training.

Part 4: Instruction Fine-tuning and Preference Data

Centers on model alignment data, covering the SFT instruction system, preference data, reward signals, annotation platforms, and quality operations.

Part 5: Synthetic Data Engineering

Walks from seed samples to a synthetic data factory, including knowledge distillation, model collaboration, quality control, and the risk of model collapse.

Part 6: Reasoning and Agent Data Engineering

Covers the construction and validation of chain-of-thought reasoning traces, tool use, function calling, agent memory, and multi-turn interaction data.

Part 7: Application-Level Data Engineering

Targets RAG and online knowledge systems, including document parsing, visual retrieval, multimodal RAG, online feedback loops, and knowledge updates.

Part 8: Data Operations and Platform Development

Builds sustainable data platform capabilities from the perspectives of team organization, version management, experiment tracking, and observability.

Part 9: Privacy Compliance and Data Security

Discusses data compliance and governance, privacy protection, federated learning, and security boundaries, emphasizing compliance gates in engineering workflows.

Part 10: Practical Projects

Ten runnable projects that tie acquisition, cleaning, synthesis, RAG, agents, DataOps, privacy protection, and the data flywheel together into end-to-end practice.

Part 11: Open-Source LLM Data Engineering in Practice