Chinese 2026 Edition Translation Status
The Chinese edition is the primary 2026 version of this book. It contains 28 chapters and 10 hands-on projects covering the full large-model data-engineering lifecycle.
The English edition is being updated from the earlier public edition. Until the full translation is ready, browse the Chinese site for the latest chapter structure and project scope.
Current Policy
- Chinese: latest complete edition (28 chapters + 10 projects).
- English: legacy edition plus this status page; new chapters land here as they are translated.
- Japanese: same policy as English (legacy edition plus this status page).
Translation Progress
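Each row shows whether the chapter is available in the legacy edition and whether the 2026 edition has been merged. "Legacy only" means the chapter exists in the legacy edition but the 2026 version has not been merged yet. "Pending" means no translation is available and readers should consult the Chinese site.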
Part 1 — Overview and Infrastructure
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch01 | The data shift in the large-model era | Legacy only | Legacy only |
| ch02 | LLM data lifecycle and quality framework | Pending | Pending |
| ch03 | AI-native data stack and cost governance | Pending | Pending |
Part 2 — Text Pre-training Data Engineering
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch04 | Sources, acquisition, copyright | Legacy only | Legacy only |
| ch05 | Cleaning, deduplication, decontamination | Legacy only | Legacy only |
| ch06 | Tokenization, serialization, efficient loading | Legacy only | Legacy only |
| ch07 | Evaluation, feedback loop, ops iteration | Pending | Pending |
Part 3 — Multimodal Data Engineering
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch08 | Image-text pair data engineering | Legacy only | Legacy only |
| ch09 | Recaptioning and document understanding | Legacy only | Legacy only |
| ch10 | Video and audio data engineering | Legacy only | Legacy only |
| ch11 | Cross-modal alignment and fusion | Pending | Pending |
Part 4 — Instruction Fine-tuning and Preference Data
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch12 | SFT data design and instruction system | Legacy only | Legacy only |
| ch13 | Preference data and reward signals | Legacy only | Legacy only |
| ch14 | Annotation platform, QA, data ops | Pending | Pending |
Part 5 — Synthetic Data Engineering
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch15 | Synthetic data factory: from seed to verification | Pending | Pending |
| ch16 | Knowledge distillation and model collaboration | Pending | Pending |
| ch17 | Synthetic-data quality control and model collapse | Pending | Pending |
Part 6 — Reasoning and Agent Data Engineering
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch18 | Chain-of-thought and reasoning data engineering | Pending | Pending |
| ch19 | Tool-use and function-calling data | Pending | Pending |
| ch20 | Agent memory and multi-turn interaction data | Pending | Pending |
Part 7 — Application-level Data Engineering
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch21 | RAG data pipeline | Legacy only | Legacy only |
| ch22 | Multimodal RAG and visual retrieval | Legacy only | Legacy only |
| ch23 | Online feedback loop and knowledge updates | Pending | Pending |
Part 8
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch24 | DataOps flywheel and team organization | Pending | Pending |
| ch25 | Data versioning and experiment tracking | Pending | Pending |
| ch26 | Data platform observability | Pending | Pending |
Part 9 — Privacy, Compliance, and Data Security
| 2026 ID | Chapter | English | Japanese |
| --- | --- | --- | --- |
| ch27 | Compliance framework and governance | Pending | Pending |
| ch28 | Federated learning and privacy-preserving tech | Pending | Pending |
Part 10 — Hands-on Projects
| Project | Title | English | Japanese |
| --- | --- | --- | --- |
| P01 | Distributed Mini-C4 with Ray | Legacy only | Legacy only |
| P02 | Vertical-domain SFT (legal) | Legacy only | Legacy only |
| P03 | LLaVA multimodal instruction factory | Legacy only | Legacy only |
| P04 | Synthetic math + code textbook factory | Legacy only | Legacy only |
| P05 | Multimodal RAG (financial reports) | Legacy only | Legacy only |
| P06 | CoT dataset + PRM training | Pending | Pending |
| P07 | Agent tool-use data factory | Pending | Pending |
| P08 | Enterprise DataOps platform | Pending | Pending |
| P09 | Privacy-preserving data pipeline | Pending | Pending |
| P10 | End-to-end LLM data flywheel | Pending | Pending |
Reading Order
Until the English chapters catch up, read the Chinese 2026 edition for the canonical structure. Code examples under code/zh/ cover all 10 projects; code/en/ currently mirrors only P01–P05.
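The fallback described above (use code/en/ where a project is mirrored, otherwise code/zh/) can be sketched as a small helper. This is a minimal illustration, not part of the repository: the function name `code_dir` and the per-project subdirectory layout (`code/<lang>/<project ID>`) are assumptions made for the example.

```python
from pathlib import Path

# Project IDs listed on this page; code/en/ currently mirrors only P01-P05.
ALL_PROJECTS = {f"P{n:02d}" for n in range(1, 11)}
EN_MIRRORED = {f"P{n:02d}" for n in range(1, 6)}


def code_dir(project: str, lang: str = "en", root: Path = Path("code")) -> Path:
    """Return the directory to read for a project.

    Falls back to code/zh/ whenever the requested language has no mirror.
    The <root>/<lang>/<project> layout is a hypothetical convention.
    """
    if project not in ALL_PROJECTS:
        raise ValueError(f"unknown project: {project}")
    if lang == "en" and project in EN_MIRRORED:
        return root / "en" / project
    # Any other case (unmirrored project, or another language) reads from zh.
    return root / "zh" / project
```

For example, `code_dir("P03")` resolves under code/en/, while `code_dir("P07")` falls back to code/zh/ because P07 is not mirrored yet.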