Part 1: Overview and Infrastructure
Positioning
Part 1 establishes the conceptual framework shared by the rest of the book. It covers the objects, boundaries, quality goals, main cost drivers, and infrastructure layers of large-model data engineering. Rather than centering on a single tool or model, it starts from the data lifecycle and explains why data has become the common constraint behind model capability, cost, and risk control.
In the book's published structure, this part serves three functions, one per chapter. Chapter 1 introduces the background and paradigm shift, explaining why large-model development must move from a model-centered view to joint governance of data and systems. Chapter 2 establishes the quality language used throughout the book, turning noise, duplication, contamination, bias, missingness, and freshness into measurable, reviewable, and enforceable engineering indicators. Chapter 3 maps the quality framework onto infrastructure, explaining how ingestion, processing orchestration, storage, indexing, evaluation operations, governance, and security jointly support training and applications.
Learning Objectives
After reading this part, readers should be able to:
- Explain how large-model data engineering differs from traditional data warehousing and from traditional machine-learning data processing.
- Identify the different quality goals of pre-training, instruction tuning, preference alignment, and RAG applications.
- Map common data issues to detectable quality indicators, governance actions, and rollback strategies.
- Draft an initial AI-native data stack and cost-governance plan based on team scale and training goals.
Terminology Conventions
This part uses "large-model data engineering" to refer to lifecycle activities around data for training, alignment, application, and governance. It uses "data lifecycle" to describe the continuous process of acquisition, cleaning, annotation, evaluation, launch, feedback, and retirement. It uses "AI-native data stack" to describe data infrastructure that serves large-model training and applications. The pre-training corpora, SFT data, preference data, RAG corpora, agent trajectories, and data products discussed in later parts should all be understood through this part's frameworks for data quality, cost governance, and risk boundaries.
Chapter Relationships
Chapter 1 introduces the book's core proposition: data quality, data scale, and data diversity jointly define the boundary of model capability. Chapter 2 answers the question "How do we judge whether data is usable?" and provides quality scorecards and governance gates. Chapter 3 answers "What infrastructure carries these governance actions?" and places the later chapters on cleaning, alignment, RAG, DataOps, and compliance into one architecture.
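As a rough illustration of what a "quality scorecard" feeding a "governance gate" could look like, the following minimal Python sketch checks a few indicators against thresholds and blocks a dataset release if any indicator fails. The names `Indicator` and `gate`, the indicator names, and all threshold values are hypothetical and chosen only for illustration; they are not the scorecard format defined in Chapter 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    """One measurable quality indicator (hypothetical structure)."""
    name: str
    value: float                    # measured on the candidate dataset
    threshold: float                # illustrative limit, not from the book
    higher_is_better: bool = False  # False: value must stay at or below threshold

    def passes(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def gate(indicators):
    """Return (approved, failed_names); approve only if every indicator passes."""
    failed = [i.name for i in indicators if not i.passes()]
    return (not failed, failed)


# Assumed example measurements for one dataset version.
scorecard = [
    Indicator("duplication_rate", value=0.03, threshold=0.05),
    Indicator("contamination_rate", value=0.02, threshold=0.01),
    Indicator("field_completeness", value=0.97, threshold=0.95,
              higher_is_better=True),
]

approved, failed = gate(scorecard)
print(approved, failed)  # False ['contamination_rate']
```

In this sketch a failed gate simply reports which indicators failed; in practice that result would trigger a governance action such as re-cleaning the data or rolling back to the previous dataset version.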
Full Book Contents
- Part 1: Overview and Infrastructure
- Part 2: Text Pre-training Data Engineering
- Part 3: Multimodal Data Engineering
- Part 4: Instruction Fine-tuning and Preference Data
- Part 5: Synthetic Data Engineering
- Part 6: Reasoning and Agent Data Engineering
- Part 7: Application-Level Data Engineering
- Part 8: Data Operations and Platform Development
- Part 9: Data Assets, Data Products, and Data Contracts
- Part 10: Intelligent Data Engineering and Data Engineering Agents
- Part 11: Privacy Compliance and Data Security
- Part 12: Specialized Datasets and Multimodal Data Engineering Practice
- Part 13: Open-source LLM Data Engineering Recipes and Paradigms
- Part 14: Practical Projects
Part Contents
- Chapter 1: The Data Revolution in the Era of Large Language Models
- Chapter 2: LLM Data Lifecycle and Quality Assessment Framework
- Chapter 3: AI-Native Data Stack and Cost Governance
Suggested Reading Order
- Start with Chapter 1 to understand the central problem: data sets the upper bound on model capability.
- Then read Chapter 2 to master the data lifecycle, quality layers, and evaluation framework.
- Finally read Chapter 3 to ground the framework in platforms, compute, storage, and cost governance.