Preface

In the current era of rapidly developing large-scale models, data has become the key factor determining model performance. Whether the data is text, images, video, or audio, how it is acquired, cleaned, and processed directly affects the resulting model. Faced with massive web corpora, public datasets, and industry documents, we must design efficient pipelines that convert raw, messy data into high-quality training samples suitable for model learning. Every step in data processing is challenging and directly impacts training outcomes: web crawling and HTML parsing; text denoising, deduplication, tokenization, and serialization; and video frame extraction, audio-video synchronization, multimodal alignment, and annotation.

This book focuses on data engineering and systematically presents the complete workflow from data acquisition to data generation for large-scale model training. It provides in-depth discussions of high-performance distributed data pipelines, task scheduling on heterogeneous hardware, storage optimization, and performance diagnostics. By constructing instruction fine-tuning datasets, preference alignment datasets, synthetic textbook-level datasets, and multimodal instruction datasets, the book demonstrates how to generate and manage large-scale, high-complexity training samples while ensuring data quality and logical consistency. It also covers the design of retrieval-augmented generation systems and enterprise-level multimodal question-answering systems, forming a complete loop from theory to engineering practice. All content in this book is available in Chinese, English, and Japanese, so that researchers from different language backgrounds can read it.

In each chapter, the authors illustrate practical engineering methods through case-driven approaches, analyze key technical details of data pipeline design, and summarize strategies and experiences for handling massive datasets. Readers will gain a deep understanding of the complexity of data required for large-scale model training and learn how to construct scalable, efficient, and reproducible data engineering systems. By studying and practicing these methods, researchers and engineers can more effectively support model training and application, improving capabilities in text understanding, image generation, video analysis, and cross-modal reasoning.

This book aims to provide a comprehensive, practical guide in the field of large-scale model data engineering, enabling readers to master high-quality and scalable training data pipelines while understanding state-of-the-art techniques and theories. We hope this work serves as a reference and guide for academic researchers, industrial engineers, and data science practitioners, helping them efficiently develop and deploy large-scale models in complex data environments.

Author

Jun Yu
Jun Yu, Ph.D., is an Associate Professor and Ph.D. supervisor at the Department of Automation, University of Science and Technology of China. He is a Huawei Most Valuable Instructor (MVI) and a Huawei/MindSpore Certified Developer Evangelist. His research focuses on multimedia computing and intelligent robotics. He has led 40 research projects, including 5 National Natural Science Foundation projects, 1 National Aviation Science Fund project, 3 Chinese Association for Artificial Intelligence-Huawei Academic Fund projects, and 3 Huawei flagship research programs. He has spearheaded the development of multiple model suites now integrated into Huawei computing products and has published over 200 academic papers, including 100+ as first or corresponding author in top IEEE/ACM journals, CCF-A international conferences, and SCI Q1 journals.

As first author or principal investigator, he has received the Wu Wenjun Science and Technology Award, the highest Chinese award for intelligent technology; 6 best paper awards at top international conferences (CVPR_PBVS/ICCV_MFR/ICME/FG); over 100 championships in international AI challenges (CVPR/ICCV/IJCAI/AAAI/MM/ECCV, etc.); and the Anhui Provincial First Prize in AI Technology Progress, among other honors. As second author, he has received the Anhui Provincial Natural Science Second Prize, five first-class teaching achievement awards, one second-class award, and a nomination from the Chinese Higher Education Society “School-Enterprise Cooperation Double Hundred Plan”. He holds 20+ patents as first inventor.

He has served on the senior program committee (SPC) of multiple top international conferences (IJCAI/AAAI/CVPR/ICCV/ICML/NeurIPS/MM/ICLR) and is a member of the Huawei MindSpore Technical Committee. With him as sole supervising instructor, his students have won the World Robot Contest twice, first prize in the national-level “Challenge Cup” special competition twice, the Huawei Ascend AI Innovation Competition silver medal, and the MindSpore Outstanding Developer award (2 students).

He teaches undergraduate courses including Data Structures and Algorithms, Introduction to Pattern Recognition, Introduction to Artificial Intelligence, and Digital Logic Circuits, as well as graduate courses in Computer Vision, with an average of 350 teaching hours per year. Four core AI courses have been selected for the Huawei Intelligent Base program. He has led nine Ministry of Education–Huawei collaborative education and provincial quality projects, and has edited eight textbooks, including Computer Vision and Pattern Recognition, Embedded Efficient Visual Perception: From Theory to Practice, and Multi-modal Human Modeling, Analysis and Synthesis, one of which received the Huawei ICT Excellent Textbook Award. He has led the development of the Huawei MindSpore Face Suite (MindFace, GitHub) and contributed to the Huawei MindSpore OCR Suite (MindOCR, GitHub).