Project 11: Mini-DeepSeek Pre-Training Reproduction¶

Jun Yu; Ke Wang; Yang Luo

Abstract¶

This project builds a reproducible data engineering case study around "Mini-DeepSeek Pre-Training Reproduction," with an emphasis on business objectives, data boundaries, architectural decisions, core implementation, acceptance criteria, and risk controls. Rather than listing installation commands and script details, the chapter presents them as an engineering retrospective that highlights the relationships among sample schemas, data flows, failure modes, and deliverables, helping readers turn the methods presented earlier into auditable and extensible project assets.

Keywords¶

Mini-DeepSeek; project practice; reproducible data engineering; data pipeline; acceptance criteria

Project Objectives and Reader Outcomes¶

This project uses "Mini-DeepSeek Pre-Training Reproduction" as its core case study, with the goal of reproducing the key engineering stages of an open-source LLM pre-training data recipe using small-scale resources. Upon completing this chapter, readers should be able to identify the critical data objects in this scenario, decompose the engineering pipeline, set acceptance criteria, and transfer the case methodology to comparable data engineering tasks.

Scenario Constraints and Data Boundaries¶

This project is positioned as a reduced-scale recipe validation exercise; it does not aim for full large-model scale or publicly reported SOTA metrics. These boundaries make the case reproducible and auditable. When data scale, data sources, access permissions, or deployment environments change, sampling strategies, quality thresholds, runtime costs, and compliance requirements must be re-evaluated.

Architectural Decisions¶

This project follows an architectural path of "corpus mixing, tokenization, training-sample packing, training smoke test, metric logging, and cost analysis." This decision prioritizes input/output contracts, version traceability, anomaly localizability, and result verifiability over compressing all logic into a single one-shot script execution.

Sample Schema / Data Flow¶

The core data flow is summarized in Listing P11-1.

Candidate corpus -> Recipe sampling -> Tokenizer processing -> Packed dataset -> Training smoke test -> Loss and sample quality report

Listing P11-1: Process flow example.

The sample schema should retain at minimum the fields id, source, content_or_payload, metadata, quality_signals, split_or_stage, and audit_trace; specific fields are further refined by the data types, downstream tasks, and acceptance methods of this project.
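
The minimal schema above can be sketched as a Python dataclass. The field names come from the list above; the field types, default values, the `missing_fields` helper, and the example row are illustrative assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class SampleRecord:
    """One corpus sample; types and defaults are assumptions for illustration."""
    id: str
    source: str
    content_or_payload: str
    metadata: dict[str, Any] = field(default_factory=dict)
    quality_signals: dict[str, float] = field(default_factory=dict)
    split_or_stage: str = "raw"
    audit_trace: list[str] = field(default_factory=list)


REQUIRED_FIELDS = tuple(f.name for f in fields(SampleRecord))


def missing_fields(row: dict[str, Any]) -> list[str]:
    """Return the schema fields that are absent from a raw row."""
    return [name for name in REQUIRED_FIELDS if name not in row]


# Hypothetical row as it might arrive from an upstream stage.
row = {"id": "doc-0001", "source": "example-source", "content_or_payload": "hello"}
print(missing_fields(row))
```

A check of this kind is cheap to run at every stage boundary, which is where missing schema fields are easiest to localize.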

Core Implementation Fragments¶

The body of the chapter retains only the key implementation fragments that illustrate design trade-offs. Complete scripts, lengthy configurations, execution logs, and large files should be placed in the companion repository or appendix; code presentation focuses on input/output contracts, quality thresholds, exception handling, and acceptance interfaces.

Experiment and Acceptance Metrics¶

Acceptance metrics include token distribution, corpus-mix deviation, packing efficiency, training loss trend, throughput, GPU memory/cost, and failed-sample review. If the project is used in production, in a course, or in a public reproduction experiment, the version number, dependency environment, random seed, sample spot-check results, and failed-sample retrospective records should also be logged.

Table P11-1 summarizes the publication-facing acceptance dimensions for this reproduction project.

Table P11-1: Mini-DeepSeek Pre-Training Reproduction Publication Acceptance Table.

Acceptance Dimension	Metric / Evidence	Publication Review Criterion
Recipe reproduction	Corpus-mix deviation, cross-source deduplication records, and tokenizer training logs	A reduced-scale experiment must state the scale difference from the original recipe and the boundaries of non-comparability
Training smoke test	Packing efficiency, loss trend, throughput, and GPU memory/cost records	Report retains random seed, environment, sample scale, and failed-sample review conclusions
Data compliance	Data-source licenses, contamination checks, and sample deletion mechanisms	External corpora must have their origin and redistribution rights confirmed before entering public deliverables

Cost, Risk, and Compliance Boundaries¶

Costs arise primarily from training compute and data processing. Risks center on recipe misinterpretation, sample contamination, tokenizer inconsistency, and extrapolating small-scale conclusions. When external data, personal information, copyrighted content, or third-party services are involved, source documentation, permission status, anonymization strategies, call records, and manual review records should be retained.

Common Failure Modes¶

Common failures include input distribution drift, missing schema fields, quality thresholds that are too loose or too tight, insufficient evaluation-sample coverage, unstable model calls, and non-traceable results. When diagnosing, prioritize locating data boundaries and intermediate artifacts before examining the model, toolchain, and deployment environment.

Reproducible Resource Description¶

Reproduction materials should include data source descriptions, minimal samples, configuration files, run commands, metric scripts, inspection reports, and an artifact directory. The body of the chapter retains necessary fragments; complete notebooks, long scripts, and large files are maintained separately as companion resources.

Background and Objectives¶

In pre-training data engineering, the idea behind "Scaling Laws" (Kaplan et al. 2020), that small-scale experiments can inform decisions at larger scale, applies not only to model parameters but also to the experimentation and validation of data recipes. In the earlier Project 1 (Mini-C4), we completed an end-to-end cleaning pipeline for a single-source corpus. However, real industrial-scale large models, such as DeepSeek-V3 (DeepSeek-AI et al. 2024), are not trained on a single corpus; they are trained on a deliberate mixture of web pages, code, mathematics, academic papers, and other data sources.

Why do we need a Mini pre-training pipeline?

Low-cost validation: Experimenting on the full 14.8T tokens of real data is prohibitively expensive. Through proportional scaling, we can rapidly validate multi-source mixing strategies at the 1B-token scale.
Exposing inter-source interactions: Engineering problems such as cross-source deduplication and the effect of data-mix adjustments on the tokenizer vocabulary distribution only surface in a multi-source mixing environment.
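Smooth scaling curve: A validated 1B-token data pipeline fixes the stage contracts and acceptance checks, so scaling out to 7B, 14B, or even 70B tokens mainly means replacing the underlying data-source cluster and compute nodes. The production-scale refactoring that this still requires is covered later in this chapter.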

This project aims to reproduce the structure of the DeepSeek-V3 data recipe at reduced scale, using approximately 1B tokens, an amount that a single node with 8× 4090/A100 GPUs can process in tens of hours. Upon completing this project, readers will have a multi-source mixing sampler, a cross-source deduplication engine, and tokenizer training code targeting a 150K super-vocabulary. These are teaching implementations with explicit contracts and tests, intended as a foundation for larger-scale pre-training work.

Architecture Design¶

To achieve the objectives above, we designed a data pipeline consisting of four core components. The overall architecture is shown in Figure P11-1.

Figure P11-1: Mini-DeepSeek Multi-Source Pre-Training Data Pipeline Architecture.

The four core components of the pipeline are:

Multi-source Sampler: Fetches multiple open-source datasets from Hugging Face (e.g., FineWeb-Edu, The Stack v2) and samples them according to the per-domain proportions configured in RECIPE. The DeepSeek-V3 report describes its corpus composition only qualitatively, so these proportions are this project's approximation rather than published figures.
Cross-source MinHash Deduplication Engine: When data sources include not only ordinary web pages but also GitHub code and arXiv papers, implicit overlap may exist between sources. This component implements efficient deduplication across different data sources using the MinHash LSH algorithm (Broder 1997).
Tokenizer Trainer: Using the BPE algorithm (Sennrich et al. 2016), this component trains and constructs a super-vocabulary of 150K entries on the mixed multilingual and multi-code-domain corpus, ensuring efficient compression of both Chinese and English text as well as specialized code.
Pack & Shuffle: After tokenization, variable-length sequences are efficiently "packed" into fixed-length training sequences, globally shuffled, and output as .arrow format files suitable for large-scale distributed training.

Table P11-2 maps each architectural component to its code entry point, stage artifact, and review fields. Project chapters need to retain tables of this kind because they connect "the engineering narrative visible to the reader" with the scripts in code/zh/project_11_mini_deepseek, preventing the chapter from remaining at the level of conceptual introduction alone.

Table P11-2: Mini-DeepSeek Data Pipeline Stage Artifacts and Code Entry Points.

Stage	Code Entry Point	Primary Input	Primary Output	Review Fields
Multi-source sampling	`mix_sampler.py`	`RECIPE`, target document count, Hugging Face data sources	`./data/mixed_1b_raw`	`source`, sample count, recipe weight deviation
Cross-source deduplication	`cross_source_dedup.py`	`mixed_1b_raw`	`./data/mixed_1b_dedup`	MinHash parameters, duplicate sample count, retention ratio
Tokenizer training	`train_tokenizer.py`	`mixed_1b_dedup`	`mini_deepseek_tokenizer.json`	vocab size, special tokens, training sample ratio
Pack & Shuffle	`pack_shuffle.py`	Deduplicated corpus, tokenizer	`./data/mixed_1b_final_packed`	`SEQ_LEN`, packing efficiency, shuffle seed
End-to-end run	`run_pipeline.sh`	Stage scripts and local environment	Complete data directory	Logs, failed stages, artifact integrity
Unit tests	`tests/test_pipeline.py`	Recipe, MinHash, packing constants	Test report	Weights sum to 1, MinHash similarity, `SEQ_LEN`

The most easily overlooked column in Table P11-2 is "Review Fields." For example, mixed_1b_raw is simply a Hugging Face Dataset directory on its own and says nothing about whether the recipe is correct; one must additionally verify that the sample count for each source is consistent with the target weights. Similarly, mixed_1b_dedup cannot be validated merely by checking whether the directory exists—the duplicate sample ratio and threshold must also be recorded. For the tokenizer, the existence of the file does not indicate training success; special tokens, vocabulary size, Chinese/code compression ratio, and rare-character coverage must also be checked.
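
The weight-consistency review described above can be sketched as follows. The function name, the 0.02 tolerance, and the three-source example are hypothetical; in the real pipeline the labels would come from the `source` column of mixed_1b_raw and the weights from RECIPE.

```python
from collections import Counter


def recipe_deviation(sources, weights):
    """Compare observed per-source shares with target recipe weights.

    Returns {source: observed_share - target_weight}.
    """
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples to audit")
    return {src: counts.get(src, 0) / total - w for src, w in weights.items()}


# Hypothetical weights and observed source labels.
weights = {"web": 0.5, "code": 0.3, "math": 0.2}
observed = ["web"] * 55 + ["code"] * 30 + ["math"] * 15

deviation = recipe_deviation(observed, weights)
# Flag sources whose share drifts more than 2 percentage points (assumed tolerance).
flagged = {s: round(d, 3) for s, d in deviation.items() if abs(d) > 0.02}
print(flagged)
```

Because the sampler works in document counts, the same check is worth repeating on token counts after tokenization, since document length differs widely between sources.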

Step-by-Step Implementation¶

Step 1: Multi-Source Mixed Extraction and Proportioning¶

According to the DeepSeek-V3 report, we need to fuse multiple data sources. In this implementation, we select open-source alternative datasets:

English web pages: FineWeb-Edu (Penedo et al. 2024)
Chinese web pages: Wudao or open-source Chinese–English mixed data
Code: The Stack v2 (Lozhkov et al. 2024)
Mathematics: OpenWebMath (Paster et al. 2023)
Academic: arXiv

We write the mix_sampler.py script to sample at the configured proportions.

Listing P11-2 provides a Python implementation excerpt.

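from itertools import islice

from datasets import Dataset, concatenate_datasets, load_dataset

# Target share of sampled documents per source; the weights must sum to 1.
# Weights apply to document counts, so token shares may differ.
RECIPE = {
    "HuggingFaceFW/fineweb-edu": {"weight": 0.40},
    "bigcode/the-stack-v2": {"weight": 0.25},
    "open-web-math/open-web-math": {"weight": 0.15},
    "togethercomputer/RedPajama-Data-1T": {"name": "arxiv", "weight": 0.10},
    "m-a-p/WanJuan-1.0-Text": {"weight": 0.10},
}

def normalize_text(item, source):
    # Assumed helper: field names differ across datasets ("text" vs. "content").
    text = item.get("text") or item.get("content") or ""
    return {"text": text, "source": source}

def sample_multi_source(recipe, target_docs):
    shards = []
    for repo_id, cfg in recipe.items():
        n = int(target_docs * cfg["weight"])
        # Streaming avoids downloading a full dataset just to take n documents.
        stream = load_dataset(repo_id, cfg.get("name"), split="train", streaming=True)
        rows = [normalize_text(item, source=repo_id) for item in islice(stream, n)]
        shards.append(Dataset.from_list(rows))
    return concatenate_datasets(shards)

mixed = sample_multi_source(RECIPE, target_docs=500_000)
mixed.save_to_disk("./data/mixed_1b_raw")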

Listing P11-2: Python implementation excerpt.

Step 2: Cross-Source MinHash LSH Deduplication¶

After multi-source mixing, the greatest hidden risk is duplicates between different sources (for example, code snippets in The Stack v2 duplicating code segments in arXiv papers). In Project 1 (Mini-C4), we performed MinHash deduplication only within a single source; here we need global deduplication.

Listing P11-3 provides a Python implementation excerpt.

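from datasets import load_from_disk
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # must match between the signatures and the LSH index

def char_ngrams(text, n=5):
    # Character n-grams; a text shorter than n yields no shingles.
    return (text[i:i + n] for i in range(len(text) - n + 1))

def get_minhash(text, num_perm=NUM_PERM):
    sig = MinHash(num_perm=num_perm)
    for token in char_ngrams(text, n=5):
        sig.update(token.encode("utf-8"))
    return sig

def cross_source_dedup(dataset, threshold=0.8):
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    keep, duplicates = [], 0
    for idx, row in enumerate(dataset):
        sig = get_minhash(row["text"])
        # Keep the first occurrence; drop later rows that collide with a kept one.
        if lsh.query(sig):
            duplicates += 1
            continue
        # Insert directly rather than through a buffered insertion session,
        # so each query sees every signature inserted so far.
        lsh.insert(str(idx), sig)
        keep.append(idx)
    return dataset.select(keep), duplicates

unique, dup_count = cross_source_dedup(load_from_disk("./data/mixed_1b_raw"))
print(f"removed {dup_count} near-duplicate documents")
unique.save_to_disk("./data/mixed_1b_dedup")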

Listing P11-3: Python implementation excerpt.
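
To reason about what the LSH threshold means, it helps to look at the standard banding formula: with b bands of r rows each, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve. The band layout (9 bands of 13 rows) is a hypothetical choice for illustration only; datasketch derives its own layout from the threshold and num_perm.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two items with Jaccard similarity s share at least one LSH band."""
    return 1.0 - (1.0 - s ** rows) ** bands


# Hypothetical band layout; bands * rows must not exceed num_perm (128 here).
bands, rows = 9, 13

for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, bands, rows)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

The curve is steep but not a hard cutoff, so pairs slightly below the threshold can still be returned as candidates and pairs slightly above it can be missed. This is one reason the spot-checks of deleted pairs recommended in this chapter matter.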

Step 3: Training a 150K Super-Vocabulary Tokenizer¶

The DeepSeek-V3 report (DeepSeek-AI et al. 2024) describes a byte-level BPE tokenizer with an extended vocabulary of 128K tokens (a substantial increase over Llama-2's 32K), which helps it process Chinese text and code efficiently. This project sets its own target of a 150K super-vocabulary. In this step, we train a BPE tokenizer on the mixed and deduplicated data.

Listing P11-4 provides a Python implementation excerpt.

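from datasets import load_from_disk
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

def batch_text(dataset, batch_size=1000):
    # Yield lists of raw strings so the trainer never holds the whole corpus.
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]["text"]

def train_large_tokenizer(dataset, vocab_size=150_000):
    tokenizer = Tokenizer(models.BPE())
    tokenizer.normalizer = normalizers.Sequence([normalizers.NFKC()])
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    # A ByteLevel decoder turns byte-level tokens back into readable text.
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>", "<|pad|>", "<|unk|>"],
        # Seed the vocabulary with every byte symbol so no input is out of vocabulary.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    # Train on every tenth document to keep the run short (a teaching compromise).
    sample = dataset.select(range(0, len(dataset), 10))
    tokenizer.train_from_iterator(batch_text(sample), trainer=trainer)
    tokenizer.save("./data/mini_deepseek_tokenizer.json")
    return tokenizer

train_large_tokenizer(load_from_disk("./data/mixed_1b_dedup"))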

Listing P11-4: Python implementation excerpt.

Step 4: Pack & Shuffle and `.arrow` Shard Output¶

To avoid having the GPU handle large amounts of padding during training, we concatenate variable-length token sequences into contiguous segments of length 4096 or 8192 (packing), inserting special separator tokens.

Listing P11-5 provides a Python implementation excerpt.

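from datasets import load_from_disk
from tokenizers import Tokenizer

SEQ_LEN = 4096

def pack_and_shuffle(dataset, tokenizer_path, seed=42):
    tokenizer = Tokenizer.from_file(tokenizer_path)
    eot = tokenizer.token_to_id("<|endoftext|>")
    if eot is None:
        raise ValueError("tokenizer has no <|endoftext|> token")

    def encode_batch(batch):
        # Concatenate all documents in the batch, ending each with <|endoftext|>.
        stream = []
        for text in batch["text"]:
            stream.extend(tokenizer.encode(text).ids + [eot])
        # Keep whole blocks only; the tail of each batch (< SEQ_LEN tokens) is dropped.
        usable = (len(stream) // SEQ_LEN) * SEQ_LEN
        blocks = [stream[i:i + SEQ_LEN] for i in range(0, usable, SEQ_LEN)]
        return {"input_ids": blocks}

    packed = dataset.map(encode_batch, batched=True, remove_columns=dataset.column_names)
    # Shuffle packed blocks with a fixed seed so the order is reproducible.
    return packed.shuffle(seed=seed)

packed = pack_and_shuffle(load_from_disk("./data/mixed_1b_dedup"), "./data/mini_deepseek_tokenizer.json")
packed.save_to_disk("./data/mixed_1b_final_packed")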

Listing P11-5: Python implementation excerpt.

Engineering Execution and Minimal Reproduction Path¶

The minimal entry point for running this project is run_pipeline.sh. The script chains together four stages: multi-source sampling, cross-source deduplication, tokenizer training, and packing. Its value is not merely "saving the manual execution of four commands"; it fixes the stage order, artifact paths, and failure locations. In pre-training data engineering, an incorrect stage order directly alters the data distribution. For example, if the tokenizer is trained before cross-source deduplication, the tokenizer will see duplicate samples that should have been removed. Likewise, the pipeline packs documents first and then shuffles the packed blocks; changing that order changes which documents share a training sequence, which makes subsequent recipe adjustments difficult to trace.

Listing P11-6 gives the minimal entry point for this project. Formal reproduction experiments should record the versions of Python, datasets, tokenizers, and datasketch, along with disk paths and random seeds, before running.

cd code/zh/project_11_mini_deepseek
bash run_pipeline.sh

Listing P11-6: Command-line run example.

This command sequentially generates mixed_1b_raw, mixed_1b_dedup, mini_deepseek_tokenizer.json, and mixed_1b_final_packed. If a stage fails, it is not recommended to delete the entire data/ directory and rerun from scratch; the safer approach is to first confirm whether the failed stage and its upstream artifacts are intact, then clean only the affected stage directory. For teaching reproductions, target_docs can be reduced from 500000 to a smaller scale to validate contracts and tests before scaling up the data volume.
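
The "confirm upstream artifacts before cleaning" step can be sketched as a small check over the four artifact names listed above. The stage labels and the `first_incomplete_stage` helper are hypothetical, and the check only tests that an artifact exists and is non-empty, which is weaker than a real integrity check.

```python
from pathlib import Path

# Artifact names follow the chapter; the stage labels are assumptions.
STAGES = [
    ("sample", "mixed_1b_raw"),
    ("dedup", "mixed_1b_dedup"),
    ("tokenizer", "mini_deepseek_tokenizer.json"),
    ("pack", "mixed_1b_final_packed"),
]


def first_incomplete_stage(data_dir):
    """Return the first stage whose artifact is missing or empty, or None if all exist."""
    root = Path(data_dir)
    for stage, artifact in STAGES:
        path = root / artifact
        if path.is_dir():
            ok = any(path.iterdir())  # a directory artifact must contain something
        else:
            ok = path.is_file() and path.stat().st_size > 0
        if not ok:
            return stage
    return None


print(first_incomplete_stage("./data"))
```

Only the stage returned here and the stages after it need to be cleaned and rerun; everything upstream can be kept.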

Table P11-3 lists the minimal audit information that should be recorded before and after a run.

Table P11-3: Mini-DeepSeek Minimal Reproduction Experiment Record Items.

Category	Record Item	Purpose
Data version	Data source repo id, split, config name, sampling time	Explains sample distribution changes
Recipe parameters	`RECIPE` weights, target document count	Determines whether the reduced-scale recipe is satisfied
Deduplication parameters	n-gram length, `num_perm`, LSH threshold	Reproduces duplicate rate and false-deletion risk
Tokenizer parameters	vocab size, normalizer, pre-tokenizer, special tokens	Reproduces compression ratio and compatibility
Packing parameters	`SEQ_LEN=4096`, shuffle seed, batch size	Reproduces training sample boundaries
Environment information	Python, datasets, tokenizers, datasketch versions	Diagnoses artifact differences
Run results	Sample count, token count, directory size, failure logs	Determines deliverability
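
The record items in Table P11-3 can be captured in a small run manifest written before the pipeline starts. This is a minimal sketch: the manifest layout and key names are assumptions, the parameter values are the ones used in this chapter's listings, and the recipe is truncated to one entry for brevity.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(recipe, params):
    return {
        "python": platform.python_version(),
        "packages": {p: package_version(p) for p in ("datasets", "tokenizers", "datasketch")},
        "recipe": recipe,
        "params": params,
    }


manifest = build_manifest(
    recipe={"HuggingFaceFW/fineweb-edu": 0.40},  # truncated for brevity
    params={
        "target_docs": 500_000,
        "ngram": 5,
        "num_perm": 128,
        "lsh_threshold": 0.8,
        "seq_len": 4096,
        "shuffle_seed": 42,
    },
)
print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the stage artifacts makes it possible to explain later why two runs produced different sample counts or duplicate rates.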

Data Quality and Recipe Acceptance¶

The acceptance focus for Mini-DeepSeek is not the final model score but whether the recipe is interpretable, verifiable, and extensible. A common pitfall is reporting only "1B tokens were ultimately obtained" without stating which domains those tokens came from, how many cross-source duplicates were removed, or whether the tokenizer is biased toward a particular language or code domain. For pre-training data, quantity is merely the outcome; the recipe is the core.

Table P11-4 gives the recipe-level acceptance criteria.

Table P11-4: Mini-DeepSeek Recipe-Level Acceptance Checklist.

Acceptance Item	Recommended Check	Non-conformance Indicator
Weight consistency	Deviation between each data source's share of samples and its `RECIPE` weight	A source is over-sampled or under-sampled due to streaming interruption
Field completeness	Each sample retains `text` and `source`	Downstream cannot compute source statistics or perform sample review
Cross-source duplicates	Record duplicate count and duplication rate	Implicit duplicates among code, papers, and web pages are not cleaned
Text anomalies	Ratio of empty text, extremely short text, garbled content, and binary residues	Tokenizer learns meaningless tokens
Tokenizer coverage	Compression ratio for Chinese, English, code, and math samples	Tokens/char is noticeably abnormal for some text category
Packing integrity	All `input_ids` have length `4096`	Padding or length inconsistency occurs during training
Randomness	Fixed shuffle seed and sampling strategy	Sample order or distribution across multiple runs is inexplicable

In practice, begin by sampling 100 records from each source for manual inspection to verify that the text type matches expectations; then spot-check near-duplicate pairs among the samples removed by MinHash to assess whether the threshold is removing documents that are not true duplicates. If the duplication rate is anomalously high, it may indicate genuine overlap between data sources, or it may indicate that character 5-grams behave poorly on short texts. If the duplication rate is anomalously low, check whether the text field was selected correctly, since some datasets use content rather than text.
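
One concrete way short texts can inflate the duplication rate is sketched below with exact Jaccard similarity over character 5-grams. A text shorter than five characters produces no shingles at all, so a MinHash built as in Listing P11-3 would be left in its initial state and every such text would look identical to every other. The helper names and the `min_shingles` guard value are assumptions for illustration.

```python
def char_ngrams(text, n=5):
    """Set of character n-grams; empty when the text is shorter than n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=5):
    """Exact Jaccard similarity between the n-gram sets of two texts."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # convention: two empty shingle sets are treated as identical
    return len(x & y) / len(x | y)


def too_short_for_dedup(text, n=5, min_shingles=20):
    """Guard: route very short texts to exact-match dedup instead of MinHash."""
    return len(char_ngrams(text, n)) < min_shingles


# Two unrelated short strings collapse to the same (empty) shingle set.
print(jaccard("ok", "no"))
print(too_short_for_dedup("ok"))
```

If spot-checks show that most deleted pairs are very short records, filtering them with a guard like this before MinHash is one option worth testing.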

Fine-Grained Inspection of Tokenizer and Packing¶

The tokenizer training stage most easily produces the problem of "file creation succeeded but quality is unknown." train_tokenizer.py uses BPE, NFKC normalization, a ByteLevel pre-tokenizer, and a 150K vocabulary. This configuration suits mixed Chinese, English, code, and math corpora, but it introduces two risks. First, with a large vocabulary, a small-scale training sample may not provide sufficient coverage of long-tail tokens. Second, code and mathematical symbols may occupy a disproportionate share of the vocabulary space, affecting the compression ratio for ordinary text.

It is recommended to compute the metrics in Table P11-5 after training is complete.

Table P11-5: Tokenizer and Packing Quality Metrics.

Metric	Computation Method	Interpretation
Chinese tokens/char	Token count of Chinese web page samples divided by character count	Assesses Chinese compression efficiency
English tokens/word	Token count of English web page samples divided by word count	Assesses whether ordinary English segmentation is abnormal
Code tokens/char	Token count of The Stack v2 samples divided by character count	Assesses code symbol coverage
Math formula fragmentation rate	Ratio of short tokens in math samples	Assesses whether formulas and LaTeX are over-fragmented
`<|endoftext|>` presence	Whether separator tokens exist in packed data	Assesses whether document boundaries are preserved
OOV behavior	`<|unk|>` usage rate	Assesses whether ByteLevel coverage is normal
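
The compression metrics in Table P11-5 reduce to two ratios, sketched below. To keep the example runnable without a trained tokenizer, `encode` is any callable that maps a string to a list of token ids; the raw UTF-8 byte encoder used here is a stand-in that represents the worst case of a byte-level vocabulary with no merges. With the trained tokenizer, the callable would be `lambda s: tokenizer.encode(s).ids`.

```python
def tokens_per_char(texts, encode):
    """Average tokens per character over a list of texts."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_chars = sum(len(t) for t in texts)
    return n_tokens / n_chars if n_chars else float("nan")


def tokens_per_word(texts, encode):
    """Average tokens per whitespace-delimited word over a list of texts."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words if n_words else float("nan")


def byte_encode(text):
    """Stand-in encoder: one token per UTF-8 byte (no BPE merges)."""
    return list(text.encode("utf-8"))


print(tokens_per_char(["abc"], byte_encode))
print(tokens_per_char(["中文"], byte_encode))
print(tokens_per_word(["pack and shuffle"], byte_encode))
```

Comparing the trained tokenizer against this byte-level baseline per source gives a quick read on whether any category (Chinese, code, math) received too few merges.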

The packing stage also requires inspection. pack_shuffle.py concatenates sample tokens and truncates to an integer multiple of SEQ_LEN, outputting fixed-length input_ids. This improves training throughput but means original document boundaries are no longer directly visible. Therefore, inserting and counting <|endoftext|> tokens are critically important. If the separator is forgotten, the model learns adjacent documents as continuous text; if separators are overly dense, short documents will dominate the context structure.
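
The packing checks described above can be sketched over plain lists of token ids. The function names and report keys are hypothetical, and the chapter does not define packing efficiency precisely, so the definition used here (share of encoded tokens that survive truncation to whole blocks) is one possible choice.

```python
def inspect_packed(blocks, seq_len, eot_id):
    """Count length violations and separator tokens in packed blocks."""
    return {
        "blocks": len(blocks),
        "bad_lengths": sum(1 for b in blocks if len(b) != seq_len),
        "eot_total": sum(b.count(eot_id) for b in blocks),
        "blocks_without_eot": sum(1 for b in blocks if eot_id not in b),
    }


def packing_efficiency(total_tokens, n_blocks, seq_len):
    """Share of encoded tokens kept after truncating to whole blocks (assumed definition)."""
    return n_blocks * seq_len / total_tokens if total_tokens else 0.0


# Toy example: SEQ_LEN of 4 and separator id 0 are illustrative values only.
blocks = [[5, 6, 0, 7], [8, 9, 1, 2], [3, 0, 4]]
report = inspect_packed(blocks, seq_len=4, eot_id=0)
print(report)
print(packing_efficiency(total_tokens=10, n_blocks=2, seq_len=4))
```

A block without a separator is not necessarily an error, since a long document can span several blocks, but a run in which no block contains one suggests the separator was never inserted.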

Test Coverage and Code Isolation¶

tests/test_pipeline.py already covers several minimal contracts: RECIPE weights sum to 1, critical data sources exist, MinHash objects can be created, identical texts have a Jaccard similarity of 1.0, and SEQ_LEN equals 4096. These tests are not intended to replace comprehensive data acceptance; rather, they prevent the example code in the project chapter from losing its basic contracts after refactoring.

Table P11-6 maps tests to gaps.

Table P11-6: Mini-DeepSeek Test Coverage and Acceptance Gaps.

Test Item	Covered Content	Still Requires Manual or Integration Acceptance
`test_recipe_weights`	Recipe weights sum to 1	Actual sample count per source
`test_sampler_keys`	Critical data sources exist	Data source licenses, field names, and accessibility
`test_minhash_creation`	MinHash object is usable	Large-scale LSH memory footprint
`test_minhash_similarity`	Identical text similarity	Near-duplicate false-deletion/missed-deletion boundary
`test_tokenizer_pack`	`SEQ_LEN` constant	Length of each packed sample
`test_end_to_end_mock`	Test entry point exists	Real end-to-end small-sample run

Published manuscripts should clearly distinguish "teaching example code" from "production-ready code." The code in this chapter illustrates the basic organization of multi-source sampling, MinHash, BPE, and packing, but real large-scale pre-training additionally requires distributed execution, data-source failure retries, download caching, contamination detection, sensitive-content filtering, license whitelisting, and training framework integration.

Common Failures and Diagnostic Paths¶

If the sampling stage reports an error, first check the data source name, config name, and streaming split. Field names are not consistent across different Hugging Face datasets—some use text, some use content, and others require additional authentication or configuration. If a data source returns very few samples, training should not proceed immediately; instead, the gap for that source should be recorded in the report and the actual recipe recomputed.

If the deduplication stage has excessive memory usage, the current approach of holding the entire MinHash LSH in memory is approaching single-node limits. First reduce the sample volume to validate the pipeline, then migrate to Spark, Ray, or an external key-value store. If the sample count drops sharply after deduplication, spot-check duplicate pairs to determine whether over-matching is caused by short texts, template texts, or license headers.

If tokenizer training takes too long, first check the training sub-sample ratio. The current implementation samples every tenth record using ds.select(range(0, len(ds), 10)) to train the tokenizer—a pedagogical compromise. For larger data volumes, stratified sampling by source can be used to ensure coverage of code, math, Chinese, and English. If the resulting mini_deepseek_tokenizer.json cannot be loaded by pack_shuffle.py, check whether special tokens were written and whether the file was corrupted by an interrupted write.
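
Stratified sampling by source, mentioned above as the alternative to taking every tenth record, can be sketched as follows. The function name and the per-source quota are assumptions; with a Hugging Face Dataset the result could be passed to `dataset.select(...)`, using `dataset["source"]` as the label list.

```python
import random
from collections import defaultdict


def stratified_indices(sources, per_source, seed=42):
    """Pick up to `per_source` row indices for each source label, reproducibly."""
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)
    rng = random.Random(seed)
    picked = []
    for src in sorted(by_source):  # sorted for a deterministic draw order
        rows = by_source[src]
        picked.extend(rng.sample(rows, min(per_source, len(rows))))
    return sorted(picked)


# Hypothetical label column with a heavily skewed source distribution.
sources = ["web"] * 100 + ["code"] * 10 + ["math"] * 3
indices = stratified_indices(sources, per_source=5)
print(len(indices))
```

Small sources are taken in full when they fall below the quota, which keeps math and code represented in the tokenizer training sample even when web text dominates the corpus.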

If the training smoke test shows an anomalous loss after packing, the diagnostic sequence should be: first confirm that all input_ids have length 4096; next, sample-decode packed records to check document boundaries and anomalous characters; then verify the shuffle seed and sample order; only then suspect training hyperparameters. Pre-training data problems frequently masquerade as training problems; diagnosis should return to the data artifacts first.

Training Smoke Test and Metric Recording¶

The goal of P11 is not to report a complete large-model training result but to demonstrate that data artifacts can enter the training pipeline. Therefore, the training smoke test should remain small-scale, short-duration, and reproducible. A qualified smoke test may run only a few hundred to a few thousand steps, but it must answer three questions: whether the data can be stably read by the training framework, whether the loss exhibits a reasonable downward trend, and whether throughput, GPU memory, and sample decoding meet expectations.

Table P11-7 provides a training smoke test record template.

Table P11-7: Mini-DeepSeek Training Smoke Test Record Template.

Record Item	Example	Review Significance
Data directory	`./data/mixed_1b_final_packed`	Confirms that training reads from the final packed data
Sample length	`4096`	Confirms that the training context is consistent with the packing configuration
Batch configuration	Global batch, micro batch, gradient accumulation	Explains throughput and GPU memory differences
Model scale	e.g., 100M, 300M, 1B parameter teaching model	Makes explicit that no comparison is made with the full DeepSeek-V3
Tokenizer	`mini_deepseek_tokenizer.json`	Confirms that training and packing use the same tokenizer
Step range	e.g., steps 0–1000	States smoke test duration
Loss curve	Initial and final loss, anomalous spikes	Determines whether the data flow is clearly abnormal
Tokens/s	Tokens per second	Estimates cost of subsequent scale-up
Sample decoding	Random decode of 10 packed samples	Checks for garbled content, boundary issues, and repetition

The smoke test report should not simply state "training runs successfully." More useful content includes: read throughput, loss over the first several steps, decoded results from several packed samples, and a record of how failed samples were handled. If the loss fails to decrease over an extended period, this may indicate a model configuration problem or the presence of large amounts of duplicated, garbled, or meaningless tokens in the data. If throughput is below expectation, the .arrow shards may be too small, data loading workers may be insufficient, or disk I/O may be the bottleneck.
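
The Tokens/s entry in Table P11-7 is what turns a smoke test into a cost estimate. The arithmetic can be sketched as below; the step count, batch size, and elapsed time are hypothetical numbers, not measured results.

```python
def tokens_per_second(steps, global_batch, seq_len, elapsed_seconds):
    """Training throughput from a smoke test: tokens consumed per wall-clock second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return steps * global_batch * seq_len / elapsed_seconds


def hours_for_tokens(total_tokens, tps):
    """Wall-clock hours needed to consume total_tokens at a given throughput."""
    if tps <= 0:
        raise ValueError("throughput must be positive")
    return total_tokens / tps / 3600


# Hypothetical smoke test: 1000 steps, 64 sequences per step, 4096 tokens each.
tps = tokens_per_second(steps=1000, global_batch=64, seq_len=4096, elapsed_seconds=2048)
hours = hours_for_tokens(1_000_000_000, tps)
print(f"{tps:.0f} tokens/s, about {hours:.2f} hours per 1B tokens")
```

The estimate assumes throughput stays constant, which rarely holds exactly once data loading, checkpointing, and evaluation are added, so it is best read as a lower bound on wall-clock time.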

Data Contamination and Benchmark Leakage Inspection¶

Pre-training reproduction projects must pay attention to benchmark contamination. Even in a pedagogical reduced-scale context, the contamination inspection methodology should be documented. Multi-source corpora may contain questions, solutions, and answers from evaluation sets such as GSM8K, MATH, HumanEval, MMLU, or others. If such content enters the pre-training data, subsequent benchmark scores will be inflated.

Table P11-8 presents a minimal contamination inspection plan.

Table P11-8: Pre-Training Data Contamination Inspection Plan.

Inspection Target	Method	Handling
Exact duplicates	Hash or normalized string matching against benchmark prompts	Directly remove matched samples
Near-duplicates	MinHash or embedding retrieval on prompts, answers, and solutions	Manual spot-check followed by deletion
Code evaluation	Check for HumanEval function names, docstrings, and canonical solutions	Remove complete problems and answers
Math evaluation	Check for problem stems, answer choices, and final-answer patterns	Remove solutions and answer leakage
Forum reposts	Check for benchmark problem reposts on web pages or blogs	Remove or flag as contamination risk
Training report	Save match counts, deletion strategy, and sample ids	Supports subsequent benchmark claims

Contamination inspection should be completed before packing, because once the data enters a packed dataset, original document boundaries and source fields become harder to trace. If a teaching project does not implement complete contamination filtering, the report should explicitly state "not used for public benchmark claims" to prevent readers from mistaking smoke-test scores for publishable results.
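
The first two rows of Table P11-8 (exact and near-duplicate matching) can be sketched with normalized hashing and word n-gram overlap. This is a minimal illustration, not a complete contamination filter: the helper names and the 8-word window are assumptions, and the benchmark prompt is a made-up example rather than an item from any real evaluation set.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()


def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def word_ngrams(text, n=8):
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def scan(docs, benchmark_prompts, n=8):
    """Return (doc_id, reason) pairs for documents that match a benchmark prompt."""
    exact = {fingerprint(p) for p in benchmark_prompts}
    grams = set().union(*(word_ngrams(p, n) for p in benchmark_prompts))
    hits = []
    for doc_id, text in docs:
        if fingerprint(text) in exact:
            hits.append((doc_id, "exact"))
        elif word_ngrams(text, n) & grams:
            hits.append((doc_id, "ngram_overlap"))
    return hits


prompts = ["A farmer has 12 crates with 8 apples in each crate. How many apples are there in total?"]
docs = [
    ("d1", "a farmer has 12 crates with 8 apples in each crate   How many apples are there in total"),
    ("d2", "Forum repost: a farmer has 12 crates with 8 apples in each crate. how many apples are there in total? Answer: 96"),
    ("d3", "unrelated text about tokenizers and packing"),
]
print(scan(docs, prompts))
```

Saving the returned ids and reasons alongside the deletion strategy provides the match counts and sample ids that the training report row of the table asks for.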

Deliverable Directory and Release Package Structure¶

A deliverable P11 project should contain not only the final data directory but also the recipe, logs, tests, and audit materials. Table P11-9 gives the recommended directory structure.

Table P11-9: Mini-DeepSeek Project Deliverable Directory.

Path	Content	Included in Public Release
`data/mixed_1b_raw/`	Raw mixed data sampled according to the recipe	Depends on data licenses
`data/mixed_1b_dedup/`	Training candidate corpus after cross-source deduplication	Depends on data licenses
`data/mini_deepseek_tokenizer.json`	BPE tokenizer	May be made public
`data/mixed_1b_final_packed/`	Packed `.arrow` data	Depends on data licenses
`reports/source_mix.json`	Sample count and proportion per source	May be made public
`reports/dedup_report.json`	Duplicate count, threshold, spot-checked samples	Anonymized version may be made public
`reports/tokenizer_eval.md`	Compression ratio, anomalous tokens, sample decodes	May be made public
`reports/smoke_train.md`	Training smoke test metrics	May be made public
`tests/`	Unit tests and small-sample tests	May be made public
`LICENSES.md`	Data source license documentation	Must be made public

If data sources do not permit redistribution, the recipe, scripts, tokenizer, and report templates may still be published, but raw or packed samples cannot be released directly. In this case, a reproducibility guide should be provided so that readers with the requisite authorization can rebuild the data locally.

Scaling from 1B Tokens to Larger Scales¶

The reduced-scale pipeline of P11 helps readers understand the recipe, but scaling to 10B, 70B, or larger requires systematic refactoring. Table P11-10 lists the refactoring checklist for moving from the teaching implementation to a production-scale implementation.

Table P11-10: Mini-DeepSeek Refactoring Checklist from Teaching Implementation to Production Scale.

Module	Teaching Implementation	Production-Scale Refactoring
Data ingestion	Hugging Face streaming + local Dataset	Connect to object storage, data lake, or distributed cache
Recipe sampling	Single-pass proportional sampling	Support epoch-level dynamic mixing ratios and curriculum
Deduplication	Single-node MinHash LSH	Distributed MinHash, SimHash, or embedding near-duplicate system
Tokenizer	Single-node sampled training	Stratified sampling, version freezing, and compatibility regression
Packing	Single-node map + shuffle	Distributed packing, fixed shard size, training framework prefetch
Auditing	Manual reports	Metadata service, lineage tracking, deletion-request replay
Testing	Unit tests + mock e2e	Small-sample real e2e, data-diff regression, contamination scan

The most important point when scaling is not to mistake "the script runs" for "the system is scalable." Large-scale pre-training data systems must handle failure retries, checkpoint-based resumption, version freezing, sample deletion, license changes, in-training data mixing strategy adjustments, and multi-team collaboration. The value of P11 lies in presenting the minimal form of these problems rather than claiming to replace a complete industrial system.

Data Dashboard and Continuous Monitoring¶

Once pre-training data engineering enters continuous iteration, one-off run logs are insufficient. Data sources update, licenses change, field schemas are adjusted, and Hugging Face datasets may change their splits or sample content due to maintenance. Therefore, P11 needs a lightweight data dashboard for comparing differences across different run batches.

Table P11-11 gives the recommended dashboard metrics.

Table P11-11: Mini-DeepSeek Data Dashboard Metrics.

Dashboard Metric	Statistics Source	Observation Purpose
Source sample count	`mixed_1b_raw`	Check for recipe proportion drift
Source token count	Post-tokenizer statistics	Check for consistency between sample count and token count
Deduplication rate	`mixed_1b_raw` vs. `mixed_1b_dedup`	Detect duplication anomalies or threshold issues
Text length distribution	Raw text field	Detect overly short, overly long, or templated samples
Language proportion	Language identification script	Check proportions of Chinese, English, and other languages
Code proportion	Source and code feature detection	Check whether code sources are over-sampled
Math sample proportion	Source and LaTeX/formula detection	Check math data coverage
Tokenizer compression ratio	Tokenizer evaluation script	Detect vocabulary quality changes
Packed shard size	`mixed_1b_final_packed`	Check training read balance
Contamination match count	Contamination scan	Record benchmark leakage risk

The core value of the dashboard is batch comparison. For example, if The Stack v2 sample count is unchanged in a given run but the token count rises significantly, it may indicate that code samples have become longer or that filtering conditions have changed. If the deduplication rate for OpenWebMath suddenly increases in a given run, it may indicate that templated content in math web pages has increased. Without a batch dashboard, these changes often go undetected until a training loss or benchmark anomaly is observed.
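Batch comparison of this kind reduces to a relative-change check over two metric snapshots. The sketch below assumes each run exports a flat dictionary of numeric metrics; the metric names, the values, and the 10% threshold are invented for illustration and do not come from an actual P11 run.

```python
def compare_runs(previous, current, rel_threshold=0.10):
    """Flag metrics whose relative change between two runs exceeds rel_threshold.

    Returns a list of (name, old, new, relative_change) tuples, sorted by name.
    Metrics present in only one run are ignored here and should be reported
    separately as schema drift.
    """
    alerts = []
    for name in sorted(set(previous) & set(current)):
        old, new = previous[name], current[name]
        if old == 0:
            changed = new != 0
            rel = float("inf") if changed else 0.0
        else:
            rel = (new - old) / abs(old)
            changed = abs(rel) > rel_threshold
        if changed:
            alerts.append((name, old, new, rel))
    return alerts


# Illustrative snapshots only: sample count unchanged, token count up sharply.
prev = {
    "stack_v2_docs": 100_000,
    "stack_v2_tokens": 210_000_000,
    "openwebmath_dedup_rate": 0.030,
}
curr = {
    "stack_v2_docs": 100_000,
    "stack_v2_tokens": 265_000_000,
    "openwebmath_dedup_rate": 0.031,
}

for name, old, new, rel in compare_runs(prev, curr):
    print(f"{name}: {old} -> {new} ({rel:+.1%})")
```

With these inputs only `stack_v2_tokens` is flagged, which corresponds to the "same sample count, more tokens" case described above.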

Deletion Requests and Sample Withdrawal Mechanism¶

Multi-source pre-training data must have a sample withdrawal capability. Even in a pedagogical reproduction context, the following should be documented: if a data source, author, or content owner requests the deletion of samples, how does the system locate, remove, and rebuild downstream artifacts? The current P11 implementation retains the source field, but original document boundaries are weakened in packed data, so deletion requests are best handled before packing.

Table P11-12 gives the processing path for deletion requests.

Table P11-12: Pre-Training Sample Deletion Request Processing Path.

Step	Action	Artifact
Receive request	Record the requester, URL, sample characteristics, and timestamp	Deletion request ticket
Locate samples	Search `raw`/`dedup` data by URL, hash, text fragment, or source	Affected sample ids
Delete upstream	Remove matched samples from `mixed_1b_raw` or `mixed_1b_dedup`	Revised dataset
Assess tokenizer rebuild	If the deletion proportion is large, evaluate whether to retrain the tokenizer	Tokenizer impact report
Rebuild packed data	Re-run packing and shuffle	Revised packed dataset
Update report	Update recipe proportions, token counts, and deletion notes	Release note

Real production systems typically store document-level hashes, URLs, source, license, and a packed-shard reverse index. A teaching implementation does not necessarily need a complete index, but readers should understand that the earlier lineage is discarded, the higher the subsequent withdrawal cost becomes. Stating this explicitly prevents readers from mistakenly believing that `.arrow` training data can circulate independently of its origin.
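The "locate samples" and "delete upstream" steps of Table P11-12 can be sketched as a filter over pre-packing records. The record schema (`id`, `url`, `source`, `text`) and the function names are assumptions made for this example; the actual P11 fields may differ.

```python
import hashlib


def doc_hash(text):
    """Stable document-level hash that can serve as a lineage key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def apply_deletions(records, urls=(), hashes=(), fragments=()):
    """Split records into (kept, removed_ids).

    A record is removed if its URL or content hash is listed, or if its text
    contains any of the requested fragments. `records` is an iterable of dicts
    with the hypothetical keys "id", "url", and "text".
    """
    urls, hashes = set(urls), set(hashes)
    kept, removed_ids = [], []
    for rec in records:
        hit = (
            rec.get("url") in urls
            or doc_hash(rec["text"]) in hashes
            or any(frag in rec["text"] for frag in fragments)
        )
        if hit:
            removed_ids.append(rec["id"])
        else:
            kept.append(rec)
    return kept, removed_ids


records = [
    {"id": "a", "url": "https://example.org/1", "source": "web", "text": "alpha"},
    {"id": "b", "url": "https://example.org/2", "source": "code", "text": "beta secret"},
    {"id": "c", "url": None, "source": "math", "text": "gamma"},
]
kept, removed = apply_deletions(
    records, urls=["https://example.org/1"], hashes=[doc_hash("gamma")]
)
print("removed ids:", removed)
```

The returned ids are what the deletion ticket should record. Packing is then re-run on `kept`, because matching against already-packed blocks would need the reverse index mentioned above.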

Domain Transfer: From General Recipe to Domain-Specific Models¶

The Mini-DeepSeek recipe is dominated by web pages, code, mathematics, academic papers, and Chinese text. Transferring to legal, medical, financial, or industrial domains requires more than simply adding a domain-specific data source to RECIPE. Domain data typically carries stronger requirements around permissions, privacy, terminology, and temporal relevance, necessitating independently designed recipes and acceptance criteria.

Table P11-13 gives adjustment directions for domain transfer.

Table P11-13: Domain Transfer Considerations for the Mini-DeepSeek Recipe.

Domain	Data to Add	Additional Risks	Acceptance Focus
Legal	Regulations, case law, contracts, compliance Q&A	Regional and temporal differences	Statute version, citation accuracy
Medical	Guidelines, papers, drug package inserts, case templates	Privacy and high-risk recommendations	Anonymization, source grade, expert review
Financial	Research reports, announcements, financial statements, market rules	Temporal and investment-advice risks	Date, market, calibration consistency
Industrial	Equipment manuals, fault records, process documents	Internal confidentiality and terminology ambiguity	Permissions, terminology glossary, fault classification
Education	Textbooks, exercises, solutions, course notes	Copyright and answer leakage	Copyright licenses, question-bank contamination

When transferring to a domain, it is recommended to retain the general corpus as a foundation and gradually increase the domain corpus proportion. If the domain data proportion is raised too rapidly, the model may acquire domain terminology capability while losing general language and coding capability. A safer approach is to design a multi-round curriculum: maintain a dominant share of general corpus in the early stages, then increase domain- and task-relevant data proportions in the mid-to-late stages, monitored by domain validation sets.
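One simple way to express such a curriculum is a schedule that holds the domain share flat early on and then ramps it linearly. The sketch below is only one possible shape; the function name `domain_share` and the default values (5% start, 30% end, ramp beginning at the halfway point) are illustrative assumptions, not recommendations from this project.

```python
def domain_share(step, total_steps, start=0.05, end=0.30, warm_frac=0.5):
    """Domain-corpus share at a training step.

    The share stays at `start` for the first `warm_frac` of training, then
    rises linearly to `end`. The general corpus receives 1 - domain_share.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    if progress <= warm_frac:
        return start
    ramp = (progress - warm_frac) / (1.0 - warm_frac)
    return start + (end - start) * ramp


for step in (0, 50, 75, 100):
    share = domain_share(step, total_steps=100)
    print(f"step {step:3d}: domain {share:.3f}, general {1 - share:.3f}")
```

In practice the ramp would be gated on the domain validation set rather than on step count alone: if general-capability metrics regress, the share is held or reduced instead of following the schedule blindly.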

Relationship with P01 Mini-C4¶

P01 focuses on single-source web page cleaning; P11 focuses on a multi-source pre-training recipe. These are not redundant; they represent an upgrade from "cleaning one type of corpus" to "organizing multiple types of corpora." Table P11-14 summarizes the differences.

Table P11-14: Differences Between P01 Mini-C4 and P11 Mini-DeepSeek.

Dimension	P01 Mini-C4	P11 Mini-DeepSeek
Data sources	Single source or a small number of web sources	Web pages, code, mathematics, academic papers, Chinese text
Core problem	Cleaning quality, denoising, basic filtering	Recipe proportions, cross-source deduplication, tokenizer, and packing
Deduplication scope	Near-duplicates within a single source	Cross-source near-duplicates
Tokenizer	Can reuse an existing tokenizer	Train a 150K super-vocabulary
Training samples	Cleaned documents	Fixed-length packed token blocks
Acceptance focus	Text quality and cleaning rules	Recipe, compression ratio, contamination, smoke test
Scaling direction	Larger web corpus	Larger multi-source pre-training system

Understanding this relationship helps readers connect the projects in Part 14. P01 is the starting point of data cleaning; P11 organizes multiple cleaned sources into a pre-training recipe. Without the quality filtering of P01, the multi-source recipe of P11 would absorb large amounts of noise; without the recipe organization of P11, the single-source cleaning of P01 is insufficient to support modern general-purpose model training.

Results Presentation and Analysis¶

The pedagogical example configuration can run on a single node (e.g., 8× RTX 4090 GPUs). End-to-end elapsed time should be recorded to show how a pipeline acceptance report is organized.

When running at a sampling scale of TARGET_TOTAL_DOCS = 500,000, the report should record the MinHash deduplication rate. As an indicative figure, approximately 4.2% of documents may be filtered as implicit duplicates, concentrated primarily between code and academic sources. Formal delivery requires actual run logs, random seeds, and data manifests.

The shuffled and packed mixed_1b_final_packed dataset should record storage size, number of .arrow shards, and token statistics. An indicative report may use approximately 5 GB and approximately 1.05B tokens to describe the output format, but the formal version must be generated jointly from script output, the sample manifest, and the random seed.
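To make the token statistics concrete, the sketch below shows one common form of fixed-length packing: documents are concatenated with an end-of-sequence token and cut into equal blocks. This illustrates the general technique only. Whether the P11 packing script drops or pads the trailing partial block is not stated here, so dropping it is an assumption, as are the names `pack_sequences` and `eos_id`.

```python
def pack_sequences(tokenized_docs, block_size, eos_id):
    """Concatenate documents (each followed by eos_id) and cut fixed-length blocks.

    Returns (blocks, dropped_tokens). The trailing partial block is dropped,
    and its length is returned so the report can account for every token.
    """
    buffer, blocks = [], []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks, len(buffer)


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks, dropped = pack_sequences(docs, block_size=4, eos_id=0)
print(blocks)   # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
print(dropped)  # 0

# Figures a packing report would record for this toy input.
packed_tokens = len(blocks) * 4
input_tokens = sum(len(d) + 1 for d in docs)  # +1 for each appended eos_id
assert packed_tokens + dropped == input_tokens
```

The final assertion is the accounting identity worth keeping in a real report: packed tokens plus dropped tokens must equal input tokens plus separators.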

Tokenizer Efficiency Validation¶

With the vocabulary expanded to 150K entries, sampling-based validation shows an average compression ratio on Chinese web pages of 0.62 tokens per character (lower is better), compared with 1.1 for Llama-2's tokenizer. Fewer tokens per character means more text fits into each fixed-length training block, which substantially raises downstream pre-training throughput.
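The tokens-per-character metric is a ratio of two corpus-level totals. The sketch below computes it for any `encode` callable that maps a string to a token sequence. The two stand-in encoders (one token per character, one token per UTF-8 byte) are toy baselines introduced here to bracket the metric; they are not the project's tokenizer.

```python
def tokens_per_char(texts, encode):
    """Corpus-level compression ratio: total tokens divided by total characters."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    if total_chars == 0:
        raise ValueError("sample contains no characters")
    return total_tokens / total_chars


def char_encode(text):
    """Toy baseline: one token per Unicode character."""
    return list(text)


def byte_encode(text):
    """Toy baseline: one token per UTF-8 byte (worst case for a byte-level BPE)."""
    return list(text.encode("utf-8"))


sample = ["预训练数据工程", "数据配方"]
print(tokens_per_char(sample, char_encode))  # 1.0
print(tokens_per_char(sample, byte_encode))  # 3.0
```

A trained tokenizer can be plugged in the same way. With the Hugging Face `tokenizers` library, for instance, `encode` could be `lambda t: tok.encode(t).ids` for a `Tokenizer` loaded from `data/mini_deepseek_tokenizer.json`. The ratio is computed over totals rather than as a mean of per-document ratios, so very short documents do not dominate the result.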

Cost and Optimization¶

For the pedagogical example at the 1B-token scale, resource consumption can be recorded under the following headings:

Storage: Record the actual size of raw crawled data and the packed output; indicative figures are approximately 8 GB for raw data and approximately 5 GB after packing.
Compute and memory: Record peak memory usage and elapsed time for streaming extraction, parallel map operations, and cross-source MinHash deduplication; indicative figures are a peak memory of approximately 32 GB and approximately 3 hours for MinHash deduplication. Formal delivery should be based on run logs.

Optimization Notes: If horizontal scaling to 70B tokens is required, single-node Python in-memory processing will become the bottleneck. It is recommended to integrate a distributed engine such as Apache Spark (Zaharia et al. 2016) or Ray (Moritz et al. 2018). For the MinHash deduplication step, memory decoupling can be achieved by storing hash buckets in an external database such as Redis.
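The idea of decoupling hash buckets from process memory can be illustrated with MinHash LSH written against a small bucket-store interface. The sketch below is pure Python and deliberately simplified. `DictBucketStore` is a hypothetical in-memory stand-in, and an external key-value store would only need to provide the same `add` and `members` methods. The shingle size, 64 permutations, and 16 bands are arbitrary example values. LSH candidates are treated as duplicates without a Jaccard verification step, which a real pipeline would normally add.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Character k-shingles of a whitespace-normalized string."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash(shingle_set, num_perm=64):
    """Signature: for each salted hash function, the minimum hash over all shingles."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return sig


def lsh_keys(sig, bands=16):
    """One bucket key per band; documents sharing any key are candidate duplicates."""
    rows = len(sig) // bands
    keys = []
    for b in range(bands):
        band = repr(tuple(sig[b * rows:(b + 1) * rows])).encode("ascii")
        keys.append(f"{b}:{hashlib.sha1(band).hexdigest()[:16]}")
    return keys


class DictBucketStore:
    """In-memory stand-in for an external bucket store (hypothetical interface)."""

    def __init__(self):
        self._buckets = defaultdict(set)

    def add(self, key, doc_id):
        self._buckets[key].add(doc_id)

    def members(self, key):
        return set(self._buckets.get(key, ()))


def dedup(docs, store, num_perm=64, bands=16):
    """Keep the first document of each candidate group; return (kept_ids, dropped_ids)."""
    kept, dropped = [], []
    for doc_id, text in docs:
        keys = lsh_keys(minhash(shingles(text), num_perm), bands)
        if any(store.members(k) for k in keys):
            dropped.append(doc_id)
            continue
        for k in keys:
            store.add(k, doc_id)
        kept.append(doc_id)
    return kept, dropped


docs = [
    ("a", "the quick brown fox jumps over the lazy dog"),
    ("b", "the quick  brown fox jumps over the lazy dog"),  # whitespace variant of "a"
    ("c", "completely unrelated sentence about tokenizer training"),
]
kept, dropped = dedup(docs, DictBucketStore())
print("kept:", kept, "dropped:", dropped)
```

Because `dedup` touches buckets only through `add` and `members`, swapping the store changes where the state lives without changing the deduplication logic.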

Extended Considerations¶

Scaling the Mini-DeepSeek recipe to tens of billions of tokens requires particular attention to two points:

Dynamic adjustment of mixing ratios (curriculum): In the early stages of training, foundational knowledge (web pages and academic papers) should dominate; in the mid-to-late stages, the sampling weights of code and mathematics (e.g., OpenWebMath) should be increased. mix_sampler.py can be refactored into a streaming module that supports epoch-level dynamic loading.
Comparison with the earlier project: Compared to P01 (Mini-C4) in Part 14, this project no longer relies on simple filtering with a single quality threshold; instead, it uses cross-source fusion and a super-vocabulary design to demonstrate how modern industrial-scale models such as DeepSeek-V3 lay their multi-task foundations.
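As a rough illustration of the first point, an epoch-aware streaming sampler can be written as a generator that draws the next source according to the weights of the current epoch. This is a sketch under assumptions and not the actual interface of mix_sampler.py: the names `mixed_stream` and `weights_for_epoch`, the source labels, and the weight values are all hypothetical.

```python
import random


def weights_for_epoch(epoch):
    """Illustrative schedule: general text first, more code and math later."""
    if epoch < 2:
        return {"web": 0.60, "papers": 0.25, "code": 0.10, "math": 0.05}
    return {"web": 0.40, "papers": 0.20, "code": 0.25, "math": 0.15}


def mixed_stream(sources, weights_fn, epoch, seed=0):
    """Yield (source_name, sample) pairs, choosing the source by the epoch's weights.

    `sources` maps a name to any iterable. Exhausted sources are removed, and
    the remaining weights are renormalized implicitly by random.choices.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    iters = {name: iter(it) for name, it in sources.items()}
    weights = dict(weights_fn(epoch))
    while iters:
        names = [n for n in iters if weights.get(n, 0) > 0]
        if not names:
            break
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            del iters[name]


sources = {
    "web": ["w1", "w2", "w3"],
    "papers": [],
    "code": ["c1", "c2"],
    "math": ["m1"],
}
for name, sample in mixed_stream(sources, weights_for_epoch, epoch=0, seed=1):
    print(name, sample)
```

Seeding the generator from both `seed` and `epoch` keeps each epoch's order reproducible from the run manifest while still varying across epochs.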

Data Compliance and Open-Source License Notes¶

When performing multi-source mixing, the open-source licenses of the original data must be strictly observed:

FineWeb-Edu: ODC-By 1.0 license; use is also subject to the Common Crawl Terms of Use.
The Stack v2: SPDX whitelist license system; only code with redistribution-permitting licenses is used.
OpenWebMath: ODC-By license.
arXiv: The specific distribution license chosen by each paper's authors.
Project Gutenberg: Public Domain.

(Note: The complete 1B-token data sample has been processed in compliance with applicable licenses and may be uploaded to the Hugging Face Datasets repository `dataforge-mini-deepseek-1b` for direct use in downstream fine-tuning.)

Chapter Summary¶

This chapter used Mini-DeepSeek Pre-Training Reproduction to walk through the key engineering stages of an open-source LLM pre-training data recipe under small-scale resource constraints. The project keeps task definition, data boundaries, architecture choices, sample schema, metric acceptance, and reproduction resources on one traceable chain.

This is a reduced-scale recipe validation. It does not aim to reproduce full large-model scale or publicly reported SOTA metrics. Larger-scale, higher-risk, or stricter compliance settings require renewed review of data sources, permission status, manual review proportions, runtime costs, and rollback plans.

Within Part 14, this chapter serves as project-level validation of the methods presented earlier. Readers can combine this case with the data recipes of Part 13, the platform governance chapters in earlier sections, and the checklists in the appendices to form a complete loop from methodological understanding to engineering delivery.

References¶

Broder A Z (1997) On the Resemblance and Containment of Documents. In: Proceedings of Compression and Complexity of Sequences 1997, pp 21–29. https://doi.org/10.1109/sequen.1997.666900

Kaplan J, McCandlish S, Henighan T, Brown T B, Chess B, Child R, Gray S, Radford A, Wu J, Amodei D (2020) Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.

DeepSeek-AI, Liu A, Feng B, Xue B, Wang B, Wu B, Lu C, Zhao C, Deng C, Zhang C, Ruan C, et al. (2024) DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.

Lozhkov A, Ben Allal L, von Werra L, Wolf T (2024) StarCoder 2 and The Stack v2: The Next Generation. arXiv preprint arXiv:2402.19173.

Moritz P, Nishihara R, Wang S, Tumanov A, Liaw R, Liang E, Elibol M, Yang Z, Paul W, Jordan M I, Stoica I (2018) Ray: A Distributed Framework for Emerging AI Applications. In: Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, pp 561–577.

Paster K, Santos M D, Azerbayev Z, Ba J (2023) OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text. arXiv preprint arXiv:2310.06786.

Penedo G, Kydlíček H, Ben Allal L, Lozhkov A, Mitchell M, Raffel C, von Werra L, Wolf T (2024) The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv preprint arXiv:2406.17557.

Sennrich R, Haddow B, Birch A (2016) Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp 1715–1725.

Zaharia M, Xin R S, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin M J, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: A Unified Engine for Big Data Processing. Communications of the ACM 59(11):56–65.