Part 1: Overview and Infrastructure

Positioning

Part 1 establishes the shared conceptual framework for the whole book. It explains the objects, boundaries, quality goals, main cost drivers, and infrastructure layers of large-model data engineering. Rather than centering on a single tool or model, it starts from the data lifecycle and explains why data has become the common constraint behind model capability, cost, and risk control.

In the published structure, this part has three functions. Chapter 1 introduces the background and paradigm shift, explaining why large-model development must move from a model-centered view to joint governance of data and systems. Chapter 2 establishes the quality language used throughout the book, turning noise, duplication, contamination, bias, missingness, and freshness into measurable, reviewable, and enforceable engineering indicators. Chapter 3 maps the quality framework to infrastructure, explaining how ingestion, processing orchestration, storage, indexing, evaluation operations, governance, and security jointly support training and applications.
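
To make "measurable indicators" concrete, the following is a minimal sketch of two of the quality dimensions named above, duplication and missingness, computed over a small batch of records. The function names, the `text` and `source` fields, and the normalization rule (lowercasing and collapsing whitespace) are illustrative assumptions, not definitions taken from Chapter 2.

```python
from collections import Counter


def duplicate_rate(texts):
    """Fraction of records that repeat an earlier record.

    Assumption: two texts count as duplicates if they match after
    lowercasing and collapsing whitespace (exact-match dedup only).
    """
    if not texts:
        return 0.0
    normalized = [" ".join(t.lower().split()) for t in texts]
    counts = Counter(normalized)
    # Every occurrence beyond the first of a given text is a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(texts)


def missing_rate(records, field):
    """Fraction of records where `field` is absent, None, or a blank string."""
    if not records:
        return 0.0

    def is_missing(value):
        return value is None or (isinstance(value, str) and not value.strip())

    missing = sum(1 for r in records if is_missing(r.get(field)))
    return missing / len(records)


# Hypothetical sample batch, for illustration only.
records = [
    {"text": "Hello world", "source": "web"},
    {"text": "hello   world", "source": "web"},
    {"text": "Another sample", "source": ""},
    {"text": "Third sample"},
]

dup = duplicate_rate([r["text"] for r in records])
miss = missing_rate(records, "source")
print(f"duplicate_rate={dup:.2f} missing_rate={miss:.2f}")
```

Real pipelines would typically replace the exact-match rule with near-duplicate detection and define missingness per field according to the task; the point here is only that each quality dimension reduces to a number that can be tracked and reviewed.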

Learning Objectives

After reading this part, readers should be able to:

Explain how large-model data engineering differs from traditional data warehousing and from traditional machine-learning data processing.
Identify the different quality goals of pre-training, instruction tuning, preference alignment, and RAG applications.
Map common data issues to detectable quality indicators, governance actions, and rollback strategies.
Draft an initial AI-native data stack and cost-governance plan based on team scale and training goals.

Terminology Conventions

This part uses "large-model data engineering" to refer to lifecycle activities around data for training, alignment, application, and governance. It uses "data lifecycle" to describe the continuous process of acquisition, cleaning, annotation, evaluation, launch, feedback, and retirement. It uses "AI-native data stack" to describe data infrastructure that serves large-model training and applications. The pre-training corpora, SFT data, preference data, RAG corpora, agent trajectories, and data products discussed in later parts should all be understood through this part's frameworks for data quality, cost governance, and risk boundaries.

Chapter Relationships

Chapter 1 introduces the book's core proposition: data quality, data scale, and data diversity jointly define the boundary of model capability. Chapter 2 answers the question "How do we judge whether data is usable?" and provides quality scorecards and governance gates. Chapter 3 answers "What infrastructure carries these governance actions?" and places the later chapters on cleaning, alignment, RAG, DataOps, and compliance into one architecture.
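
As a rough illustration of how a quality scorecard and a governance gate might fit together, the sketch below checks a set of measured indicators against upper bounds and reports violations. The `Threshold` class, the `evaluate_gate` function, the indicator names, and the threshold values are all hypothetical; the book's own scorecard format is defined in Chapter 2, not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Upper bound for one indicator (hypothetical structure)."""
    indicator: str
    max_value: float


def evaluate_gate(scorecard, thresholds):
    """Return (passed, violations) for a scorecard of indicator values.

    An indicator with no measurement fails the gate, on the assumption
    that unmeasured quality should block a release rather than pass it.
    """
    violations = []
    for t in thresholds:
        value = scorecard.get(t.indicator)
        if value is None:
            violations.append(f"{t.indicator}: not measured")
        elif value > t.max_value:
            violations.append(
                f"{t.indicator}: {value:.3f} exceeds {t.max_value:.3f}"
            )
    return (not violations, violations)


# Illustrative thresholds and measurements only.
thresholds = [
    Threshold("duplicate_rate", 0.05),
    Threshold("missing_rate", 0.10),
    Threshold("contamination_rate", 0.0),
]
scorecard = {"duplicate_rate": 0.02, "missing_rate": 0.25}

passed, violations = evaluate_gate(scorecard, thresholds)
print("gate passed" if passed else "gate blocked")
for v in violations:
    print(" -", v)
```

In this sketch a blocked gate simply lists the reasons; in practice the same result could trigger the governance actions and rollback strategies that the learning objectives refer to.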
