Chinese 2026 Edition Translation Status

The Chinese edition is the primary 2026 version of this book. It contains 28 chapters and 10 hands-on projects covering the full large-model data-engineering lifecycle.

The English edition is being updated from the earlier public edition. Until the full translation is ready, browse the Chinese site for the latest chapter structure and project scope.

Current Policy

  • Chinese: latest complete edition (28 chapters + 10 projects).
  • English: legacy edition plus this status page; new chapters are added to the English site as they are translated.
  • Japanese: legacy edition plus a status page, following the same policy as English.

Translation Progress

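Each row shows, per language, whether the chapter is available in the legacy edition and whether the 2026 edition has been merged. "Legacy only" means the chapter exists in the legacy edition but the 2026 revision has not been merged yet. "Pending" means the chapter is not yet available in that language, so readers should consult the Chinese site.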

Part 1 — Overview and Infrastructure

2026 ID | Chapter | English | Japanese
ch01 | The data shift in the large-model era | Legacy only | Legacy only
ch02 | LLM data lifecycle and quality framework | Pending | Pending
ch03 | AI-native data stack and cost governance | Pending | Pending

Part 2 — Text Pre-training Data Engineering

2026 ID | Chapter | English | Japanese
ch04 | Sources, acquisition, copyright | Legacy only | Legacy only
ch05 | Cleaning, deduplication, decontamination | Legacy only | Legacy only
ch06 | Tokenization, serialization, efficient loading | Legacy only | Legacy only
ch07 | Evaluation, feedback loop, ops iteration | Pending | Pending

Part 3 — Multimodal Data Engineering

2026 ID | Chapter | English | Japanese
ch08 | Image-text pair data engineering | Legacy only | Legacy only
ch09 | Recaptioning and document understanding | Legacy only | Legacy only
ch10 | Video and audio data engineering | Legacy only | Legacy only
ch11 | Cross-modal alignment and fusion | Pending | Pending

Part 4 — Instruction Fine-tuning and Preference Data

2026 ID | Chapter | English | Japanese
ch12 | SFT data design and instruction system | Legacy only | Legacy only
ch13 | Preference data and reward signals | Legacy only | Legacy only
ch14 | Annotation platform, QA, data ops | Pending | Pending

Part 5 — Synthetic Data Engineering

2026 ID | Chapter | English | Japanese
ch15 | Synthetic data factory: from seed to verification | Pending | Pending
ch16 | Knowledge distillation and model collaboration | Pending | Pending
ch17 | Synthetic-data quality control and model collapse | Pending | Pending

Part 6 — Reasoning and Agent Data Engineering

2026 ID | Chapter | English | Japanese
ch18 | Chain-of-thought and reasoning data engineering | Pending | Pending
ch19 | Tool-use and function-calling data | Pending | Pending
ch20 | Agent memory and multi-turn interaction data | Pending | Pending

Part 7 — Application-level Data Engineering

2026 ID | Chapter | English | Japanese
ch21 | RAG data pipeline | Legacy only | Legacy only
ch22 | Multimodal RAG and visual retrieval | Legacy only | Legacy only
ch23 | Online feedback loop and knowledge updates | Pending | Pending

Part 8 — DataOps and Platform Building

2026 ID | Chapter | English | Japanese
ch24 | DataOps flywheel and team organization | Pending | Pending
ch25 | Data versioning and experiment tracking | Pending | Pending
ch26 | Data platform observability | Pending | Pending

Part 9 — Privacy, Compliance, and Data Security

2026 ID | Chapter | English | Japanese
ch27 | Compliance framework and governance | Pending | Pending
ch28 | Federated learning and privacy-preserving tech | Pending | Pending

Part 10 — Hands-on Projects

Project | Title | English | Japanese
P01 | Distributed Mini-C4 with Ray | Legacy only | Legacy only
P02 | Vertical-domain SFT (legal) | Legacy only | Legacy only
P03 | LLaVA multimodal instruction factory | Legacy only | Legacy only
P04 | Synthetic math + code textbook factory | Legacy only | Legacy only
P05 | Multimodal RAG (financial reports) | Legacy only | Legacy only
P06 | CoT dataset + PRM training | Pending | Pending
P07 | Agent tool-use data factory | Pending | Pending
P08 | Enterprise DataOps platform | Pending | Pending
P09 | Privacy-preserving data pipeline | Pending | Pending
P10 | End-to-end LLM data flywheel | Pending | Pending

Reading Order

Until the English chapters catch up, read the Chinese 2026 edition for the canonical structure. Code examples under code/zh/ cover all 10 projects; code/en/ currently mirrors only P01–P05.
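
To check locally which projects are not yet mirrored, a small script can compare the two code directories. This is a minimal sketch, not part of the book's tooling: it assumes each project lives in its own immediate subdirectory under code/zh/ and code/en/ with matching names, which this page does not confirm, and the function name missing_projects is hypothetical.

```python
from pathlib import Path


def _subdirs(root: str) -> set[str]:
    """Names of immediate subdirectories of root (empty set if root is missing)."""
    path = Path(root)
    if not path.is_dir():
        return set()
    return {p.name for p in path.iterdir() if p.is_dir()}


def missing_projects(zh_root: str, en_root: str) -> list[str]:
    """Return entries present under zh_root that have no counterpart under en_root.

    Assumption: one subdirectory per project, with the same name in both trees.
    """
    return sorted(_subdirs(zh_root) - _subdirs(en_root))


if __name__ == "__main__":
    # Paths are relative to the repository root; adjust if the layout differs.
    for name in missing_projects("code/zh", "code/en"):
        print(f"not yet mirrored in code/en/: {name}")
```

Run from the repository root, the script prints one line per directory that exists only on the Chinese side. If the layout assumption holds, that list should correspond to P06–P10 for now.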