Chapter 5: Tokenization and Serialization¶
Chapter Summary¶
Tokenization is the bridge connecting raw text and neural networks. High-quality corpora after cleaning need to be converted into numeric sequences that models can understand before they can be fed into Transformers for training. This chapter delves into how tokenizers work, including the three mainstream algorithms BPE, WordPiece, and Unigram; introduces how to build and extend vocabularies for specific domains; and finally discusses data mixing and curriculum learning strategies, which determine the presentation order and proportions of different data types during training.
Scenario Introduction¶
Your team is training a large model specialized for code. After preliminary experiments with the standard GPT-2 tokenizer, you notice a strange phenomenon: the model often gets indentation wrong in generated code because the tokenizer splits a run of four spaces into several different tokens, producing inconsistent indentation. Worse, common programming keywords such as def and return are split into multiple subwords, forcing the model to rely on extra context to understand their meaning.
After analysis, you find the problem lies in the tokenizer. The GPT-2 tokenizer was trained on web text and does not handle the special structure of code (such as whitespace, camelCase, special symbols) well. Designing a specialized tokenizer for code tasks has become a key step to improving model performance.
This example shows: the tokenizer is by no means a "preprocessing detail" that can be ignored—it has a substantial impact on model capabilities.
5.1 Tokenizer Principles¶
The core task of a tokenizer is to segment continuous text strings into discrete token sequences and map each token to an integer ID. This seemingly simple task actually involves complex algorithm design and engineering trade-offs.
5.1.1 Why Subword Tokenization?¶
In the early days of deep learning, natural language processing typically used word-level or character-level tokenization. Word-level tokenization treats each complete word as a token—the advantage is clear semantics, the disadvantage is a huge vocabulary (needing to cover all possible words) and inability to handle out-of-vocabulary (OOV) words. Character-level tokenization treats each character as a token—the advantage is a tiny vocabulary with no OOV problem, the disadvantage is excessively long sequences making it difficult for models to capture long-range dependencies.
Subword tokenization is a compromise. It segments text into units smaller than words but larger than characters. High-frequency words remain intact; low-frequency words are split into smaller subword units. For example, "unhappiness" might be split into "un" + "happi" + "ness". This approach both controls vocabulary size and retains certain semantic information, while being able to handle unseen vocabulary through subword combination.
Figure 5-1: Tokenization Granularity Comparison — Trade-offs between word-level, character-level, and subword-level
Currently, almost all mainstream large language models use subword tokenization. The GPT series uses BPE, BERT uses WordPiece, and T5 and LLaMA use SentencePiece (supporting BPE and Unigram). Understanding the principles of these algorithms is the foundation for tokenizer customization and optimization.
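To see subword tokenization in action before diving into the algorithms, the following sketch runs the same word through two pre-trained tokenizers. It assumes the transformers library and the public gpt2 and bert-base-uncased checkpoints are available; the exact splits depend on each tokenizer's training corpus.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")                # Byte-level BPE
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece

word = "unhappiness"
print(gpt2_tok.tokenize(word))   # BPE subwords; exact split depends on the learned merges
print(bert_tok.tokenize(word))   # WordPiece subwords with the '##' continuation prefix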
5.1.2 BPE: Byte Pair Encoding¶
BPE (Byte Pair Encoding) was originally a data compression algorithm, later introduced to neural machine translation by Sennrich et al. in 2015, becoming the most widely used subword tokenization algorithm.
The core idea of BPE is very intuitive: start from the character level, repeatedly merge the most frequently occurring adjacent token pairs until reaching the target vocabulary size. Specific steps:
- Split all training text into character sequences, each character as initial token
- Count frequency of all adjacent token pairs
- Merge the highest-frequency token pair into a new token
- Repeat steps 2-3 until vocabulary reaches target size
Below is a simplified BPE training implementation:
import re
from collections import Counter, defaultdict

def train_bpe(corpus: list, vocab_size: int) -> dict:
    """
    Train a BPE tokenizer.

    Args:
        corpus: List of training texts
        vocab_size: Target vocabulary size
    Returns:
        Ordered dictionary of merge rules {(left, right): merged_token}
    """
    # Initialization: split each word into characters and append an end-of-word marker
    word_freqs = Counter()
    for text in corpus:
        for word in text.split():
            # The </w> marker distinguishes a character at the end of a word
            # from the same character in the middle of a word
            word_freqs[' '.join(list(word)) + ' </w>'] += 1

    merges = {}
    vocab = set()
    # The initial vocabulary is the set of all characters
    for word in word_freqs:
        for char in word.split():
            vocab.add(char)

    while len(vocab) < vocab_size:
        # Count the frequency of every adjacent token pair
        pair_freqs = defaultdict(int)
        for word, freq in word_freqs.items():
            tokens = word.split()
            for i in range(len(tokens) - 1):
                pair_freqs[(tokens[i], tokens[i + 1])] += freq
        if not pair_freqs:
            break

        # Pick the highest-frequency pair and merge it into a new token
        best_pair = max(pair_freqs, key=pair_freqs.get)
        new_token = best_pair[0] + best_pair[1]
        merges[best_pair] = new_token
        vocab.add(new_token)

        # Update the word frequency table; the regex only matches whole tokens,
        # so e.g. the pair ('e', 's') never merges the tail of a token like 'he'
        pattern = re.compile(
            r'(?<!\S)' + re.escape(best_pair[0] + ' ' + best_pair[1]) + r'(?!\S)'
        )
        new_word_freqs = {}
        for word, freq in word_freqs.items():
            new_word_freqs[pattern.sub(new_token, word)] = freq
        word_freqs = new_word_freqs

    return merges

def apply_bpe(word: str, merges: dict) -> list:
    """Tokenize a single word with the learned BPE merges."""
    # Merges must be applied in the order they were learned
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word) + ['</w>']
    while True:
        # Among all adjacent pairs, find the one that was learned earliest
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        candidates = [p for p in pairs if p in ranks]
        if not candidates:
            break
        merge_pair = min(candidates, key=ranks.get)
        # Perform the merge
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == merge_pair:
                new_tokens.append(merges[merge_pair])
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens
An important variant of BPE is Byte-level BPE, introduced by GPT-2. Traditional BPE operates at the character level and needs to handle Unicode encoding issues. Byte-level BPE operates directly at the byte level, mapping each byte to a printable character, thus avoiding encoding issues and natively supporting any language. This is why the GPT series can process text in any language.
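The core trick of byte-level BPE is a reversible mapping from all 256 byte values to printable characters, so BPE can run on ordinary strings without ever meeting an unknown symbol. The sketch below illustrates the idea with a simplified mapping (not GPT-2's exact table): printable Latin-1 bytes map to themselves and the rest are shifted into unused code points.
def bytes_to_printable(byte_seq: bytes) -> str:
    """Map each byte to a printable character so BPE never sees an 'unknown' symbol.

    Simplified illustration of the byte-level idea, not GPT-2's exact mapping:
    bytes that are already printable Latin-1 characters map to themselves, and
    the remaining byte values are shifted into an unused Unicode range.
    """
    printable = set(range(ord('!'), ord('~') + 1)) | set(range(0xA1, 0xFF + 1))
    mapping = {}
    offset = 0
    for b in range(256):
        if b in printable:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(256 + offset)  # move non-printable bytes to unused code points
            offset += 1
    return ''.join(mapping[b] for b in byte_seq)

# Text in any language becomes a sequence drawn from only 256 possible "characters"
print(bytes_to_printable("你好, world!".encode("utf-8")))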
5.1.3 WordPiece: BERT's Choice¶
WordPiece is the tokenization algorithm developed by Google for BERT, very similar to BPE, with the main difference in the criterion for selecting merge pairs.
BPE selects the most frequently occurring pair for merging. WordPiece selects the pair that maximizes training data likelihood. Specifically, for candidate pair (A, B), WordPiece computes the language model probability gain of the merged vocabulary on training data, and selects the pair with the greatest gain for merging.
In practice, this means WordPiece tends to merge pairs whose "probability of co-occurrence is far higher than the product of independent occurrence probabilities." This criterion makes WordPiece more sensitive to low-frequency but meaningful patterns.
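In implementation terms this criterion is usually expressed as a per-pair score, count(AB) / (count(A) × count(B)): a pair of individually frequent tokens needs a disproportionately high joint count to win, while rare tokens that almost always appear together score highly. A minimal sketch, reusing the pair-counting style of the BPE example above:
def wordpiece_best_pair(pair_freqs: dict, token_freqs: dict):
    """Pick the merge pair by likelihood gain rather than raw frequency.

    pair_freqs: {(left, right): joint count}; token_freqs: {token: count}.
    The score count(AB) / (count(A) * count(B)) favors pairs that co-occur far
    more often than their individual frequencies would predict.
    """
    def score(pair):
        left, right = pair
        return pair_freqs[pair] / (token_freqs[left] * token_freqs[right])

    return max(pair_freqs, key=score)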
Another characteristic of WordPiece is using the ## prefix to identify non-word-initial subwords. For example, "playing" might be tokenized as ["play", "##ing"]. This representation clearly distinguishes subword position in the original word, helping the model understand vocabulary structure.
# WordPiece tokenization example (using HuggingFace tokenizers)
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a WordPiece tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train
trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Use
output = tokenizer.encode("unhappiness")
print(output.tokens)  # e.g. ['un', '##happi', '##ness']
5.1.4 Unigram: A Probabilistic Perspective on Tokenization¶
Unigram tokenization was proposed by Kudo in 2018, adopting a completely different approach from BPE/WordPiece. BPE and WordPiece are bottom-up methods—starting from small units and gradually merging into larger units. Unigram is top-down—starting from a large vocabulary containing all possible subwords and gradually pruning to target size.
Unigram models tokenization as a probabilistic problem. Given a vocabulary $V$ and a probability $P(t)$ for each token $t \in V$, the score of a segmentation $\mathbf{x} = (t_1, \ldots, t_n)$ of a text $X$ is the product of its token probabilities, and the tokenization result is the segmentation with the highest score:

$$\mathbf{x}^{*} = \arg\max_{\mathbf{x} \in S(X)} \prod_{i=1}^{n} P(t_i)$$

where $S(X)$ denotes the set of candidate segmentations of $X$.
The training process uses the EM algorithm: the E-step computes the expected occurrence count of each token under the current vocabulary, and the M-step updates the token probabilities. Tokens whose removal hurts the total likelihood least are then pruned, and the process repeats until the vocabulary reaches the target size.
A unique advantage of Unigram is its natural support for probability modeling of multiple tokenization results. For a given text, there may be multiple valid segmentation ways; Unigram can assign a probability to each. This is very useful in certain application scenarios (e.g., multi-hypothesis processing in speech recognition).
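Given a fixed vocabulary with token log-probabilities, finding the highest-probability segmentation is a dynamic-programming problem (Viterbi decoding over positions in the string). The sketch below, using a small hand-made probability table, illustrates only the decoding step; the EM loop that produces the probabilities is omitted.
import math

def unigram_segment(text: str, token_logprobs: dict) -> list:
    """Viterbi decoding: best segmentation under a unigram token model.

    token_logprobs maps each vocabulary token to its log-probability.
    best[i] holds the best score for text[:i]; back[i] records where the
    token ending at position i starts on the best path.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [None] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - 20), i):  # cap candidate token length at 20 chars
            piece = text[j:i]
            if piece in token_logprobs and best[j] + token_logprobs[piece] > best[i]:
                best[i] = best[j] + token_logprobs[piece]
                back[i] = j
    # Walk back from the end to recover the tokens
    tokens, i = [], n
    while i > 0:
        j = back[i]
        if j is None:
            raise ValueError("text cannot be segmented with this vocabulary")
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]

# Toy vocabulary: 'unhappiness' -> ['un', 'happi', 'ness'] because that path scores best
vocab = {'un': -2.0, 'happi': -3.0, 'ness': -2.5, 'u': -4.0, 'n': -4.0, 'h': -4.0,
         'a': -4.0, 'p': -4.0, 'i': -4.0, 'e': -4.0, 's': -4.0}
print(unigram_segment('unhappiness', vocab))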
5.1.5 Comparison of the Three Algorithms¶
The three mainstream subword tokenization algorithms each have their characteristics; selection requires trade-offs based on specific scenarios.
| Algorithm | Core Idea | Advantages | Disadvantages | Typical Applications |
|---|---|---|---|---|
| BPE | Bottom-up, frequency-driven merging | Simple and intuitive, fast training | Greedy strategy may not be optimal | GPT series, LLaMA |
| WordPiece | Bottom-up, likelihood-driven merging | Sensitive to low-frequency meaningful patterns | Higher computation complexity | BERT, DistilBERT |
| Unigram | Top-down, probability modeling | Theoretically elegant, supports multiple segmentations | Slower training | T5, mT5, ALBERT |
Figure 5-2: Comparison of BPE, WordPiece, and Unigram Tokenization Algorithms
In actual engineering, SentencePiece is the most commonly used tokenization toolkit. It supports both BPE and Unigram algorithms, provides language-agnostic preprocessing (not dependent on space-based tokenization), and integrates seamlessly with mainstream deep learning frameworks.
import sentencepiece as spm

# Train a SentencePiece model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe',  # or 'unigram'
    character_coverage=0.9995,
    num_threads=16
)

# Load and use
sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')
tokens = sp.encode('Hello, world!', out_type=str)
print(tokens)  # e.g. ['▁Hello', ',', '▁world', '!']
ids = sp.encode('Hello, world!')
print(ids)  # e.g. [1234, 56, 789, 10]
5.2 Vocabulary Design and Extension¶
The vocabulary is the core component of a tokenizer. Vocabulary size, coverage, and structure directly affect model performance and efficiency.
5.2.1 Vocabulary Size Trade-offs¶
Vocabulary size is one of the most important hyperparameters in tokenizer design. A larger vocabulary means more tokens retained as complete units, shorter sequences, but a larger embedding matrix and more parameters; a smaller vocabulary means more words split into subwords, longer sequences, but fewer model parameters.
Mainstream large model vocabulary sizes typically range from 32K to 128K. GPT-2 uses 50,257, LLaMA uses 32,000, GPT-4 is reported to use approximately 100,000. When selecting vocabulary size, consider:
Computation efficiency: The larger the vocabulary, the more parameters in the embedding and output layers. For a d-dimensional model with vocabulary size V, the embedding matrix contains V × d parameters. When V increases from 32K to 128K, this increases 4x.
Sequence length: The larger the vocabulary, the more characters each token covers on average, and the fewer tokens the same text is split into. This is especially important for long documents, as Transformer computation complexity is quadratic in sequence length.
Rare word handling: The larger the vocabulary, the more rare words can be retained as complete tokens, reducing UNK and over-segmentation issues. But this also means rare tokens see fewer training samples, potentially leading to poor embedding quality.
# Analyze the impact of different vocabulary sizes on sequence length
import sentencepiece as spm

def analyze_vocab_size_impact(text: str, vocab_sizes: list) -> dict:
    """Analyze how vocabulary size affects tokenization results"""
    results = {}
    for vocab_size in vocab_sizes:
        # Train a tokenizer for each vocabulary size
        spm.SentencePieceTrainer.train(
            input='corpus.txt',
            model_prefix=f'tokenizer_{vocab_size}',
            vocab_size=vocab_size,
            model_type='bpe'
        )
        sp = spm.SentencePieceProcessor(model_file=f'tokenizer_{vocab_size}.model')
        tokens = sp.encode(text)
        results[vocab_size] = {
            'num_tokens': len(tokens),
            'chars_per_token': len(text) / len(tokens),
            # Compression vs. raw UTF-8 bytes, assuming 2 bytes per stored token ID (uint16)
            'compression_ratio': len(text.encode('utf-8')) / (len(tokens) * 2)
        }
    return results
5.2.2 Multilingual Vocabulary Design¶
When training multilingual models, vocabulary design faces additional challenges: how to balance coverage of different languages within limited vocabulary space?
A common problem is the "vocabulary curse": if a tokenizer is trained directly on a multilingual corpus, high-resource languages (e.g., English) occupy most of the vocabulary space while low-resource languages get far too little coverage. Text in low-resource languages is then over-segmented, sequence lengths inflate, and model performance degrades.
Common strategies to address this include:
Corpus balancing: before training the tokenizer, oversample or undersample the corpora of different languages so that their proportions are more balanced.
Temperature sampling: similar to the multilingual data balancing strategy discussed in Chapter 3, a temperature parameter controls the sampling probability of each language (see the sketch after this list).
Language-specific character coverage: ensure that the basic character set of each target language is included in the vocabulary even if its frequency is low. SentencePiece provides the character_coverage parameter to control this.
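The following is a minimal sketch of temperature sampling over per-language corpus sizes. The function name and the corpus sizes are hypothetical; the smoothing is equivalent to raising the size-proportional probabilities to the power 1/T and renormalizing, which is how multilingual models such as XLM-R flatten the language distribution.
import numpy as np

def temperature_sampling_probs(token_counts: dict, temperature: float = 5.0) -> dict:
    """Smooth per-language sampling probabilities with a temperature parameter.

    With T=1 each language is sampled in proportion to its data size; as T grows,
    the distribution flattens and low-resource languages are sampled more often.
    """
    langs = list(token_counts)
    sizes = np.array([token_counts[l] for l in langs], dtype=np.float64)
    probs = sizes / sizes.sum()
    smoothed = probs ** (1.0 / temperature)
    smoothed /= smoothed.sum()
    return dict(zip(langs, smoothed))

# Hypothetical corpus sizes (in tokens) for three languages
print(temperature_sampling_probs({'en': 1000000000, 'th': 50000000, 'sw': 5000000}))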
# Multilingual tokenizer training example
import sentencepiece as spm

# Use character coverage to ensure multilingual support
spm.SentencePieceTrainer.train(
    input='multilingual_corpus.txt',
    model_prefix='multilingual_tokenizer',
    vocab_size=64000,
    model_type='unigram',
    character_coverage=0.9999,  # High coverage ensures rare characters are included
    input_sentence_size=10000000,
    shuffle_input_sentence=True,
    # Special handling for CJK characters
    byte_fallback=True  # Fall back to the byte level for unknown characters
)
5.2.3 Domain-Specific Vocabulary Extension¶
When applying pre-trained models to specific domains (e.g., medical, legal, code), one often encounters the problem of domain terminology being over-segmented. This not only increases sequence length but may also affect the model's understanding of professional concepts.
Vocabulary extension is an effective solution. The basic idea: add new domain-specific tokens while preserving the original vocabulary.
from transformers import AutoTokenizer

def extend_tokenizer(base_tokenizer_name: str,
                     domain_terms: list,
                     output_dir: str):
    """
    Extend a pre-trained tokenizer's vocabulary.

    Args:
        base_tokenizer_name: Base tokenizer name
        domain_terms: List of domain-specific terms
        output_dir: Output directory
    Returns:
        The extended tokenizer
    """
    # Load the base tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_tokenizer_name)
    print(f"Original vocabulary size: {len(tokenizer)}")

    # Filter out terms that already exist in the vocabulary
    new_tokens = []
    for term in domain_terms:
        if term not in tokenizer.get_vocab():
            new_tokens.append(term)

    # Add the new tokens
    num_added = tokenizer.add_tokens(new_tokens)
    print(f"Added {num_added} new tokens")
    print(f"New vocabulary size: {len(tokenizer)}")

    # Save the extended tokenizer
    tokenizer.save_pretrained(output_dir)
    return tokenizer

# Example: extend the vocabulary for the medical domain
medical_terms = [
    '冠状动脉',        # coronary artery
    '心肌梗死',        # myocardial infarction
    '动脉粥样硬化',    # atherosclerosis
    'COVID-19',
    'mRNA疫苗',        # mRNA vaccine
    '计算机断层扫描',  # computed tomography (CT)
    # ... more terms
]
tokenizer = extend_tokenizer(
    'meta-llama/Llama-2-7b',
    medical_terms,
    './medical_tokenizer'
)
After vocabulary extension, the model's embedding matrix needs to be extended accordingly. Embeddings for new tokens are typically initialized to random values or the average of existing related tokens, then learned through continued pre-training for meaningful representations.
from transformers import AutoModelForCausalLM

def resize_model_embeddings(model_name: str,
                            tokenizer,
                            output_dir: str) -> None:
    """Resize the model's embedding layer to match the extended vocabulary"""
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # Resize the embedding layer (new rows are randomly initialized)
    model.resize_token_embeddings(len(tokenizer))
    # Optional: initialize new embeddings with the mean of related tokens' embeddings
    # (this usually works better than random initialization; see the sketch below)
    model.save_pretrained(output_dir)
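A common heuristic for the optional step above is to initialize each new token's embedding with the mean of the embeddings of the subwords the base tokenizer used to split it into, so continued pre-training starts from a semantically sensible point. The helper below is a hypothetical sketch of that strategy; it needs both the original (unextended) tokenizer and the extended one.
import torch

def init_new_embeddings_from_subwords(model, base_tokenizer, extended_tokenizer,
                                      new_tokens: list):
    """Hypothetical helper: set each added token's embedding to the mean of the
    embeddings of the subword tokens the base tokenizer split it into."""
    embeddings = model.get_input_embeddings().weight
    with torch.no_grad():
        for token in new_tokens:
            new_id = extended_tokenizer.convert_tokens_to_ids(token)
            # How the original tokenizer would have segmented this string
            old_ids = base_tokenizer.encode(token, add_special_tokens=False)
            if old_ids:
                embeddings[new_id] = embeddings[old_ids].mean(dim=0)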
5.2.4 Vocabulary Design Best Practices¶
Based on industry experience, here are some best practices for vocabulary design:
Reserve sufficient special token positions: Reserve some token IDs for future special tokens (e.g., new control symbols, domain markers). Many tokenizers reserve 100-1000 positions.
Ensure reasonable segmentation of numbers and code symbols: Numbers are important in many tasks, but standard tokenizers often handle them poorly. Consider keeping single digits as independent tokens or using special number encoding strategies.
Test edge cases: Before finalizing the vocabulary, test various edge cases: very long words, special characters, mixed-language text, code snippets. Ensure the tokenization results meet expectations (see the sketch after this list).
Document vocabulary decisions: Record vocabulary size, training corpus, special token list, etc., to facilitate subsequent model iteration and troubleshooting.
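For the edge-case testing mentioned above, a small round-trip check catches many problems early. The sketch below assumes `sp` is the SentencePieceProcessor from the earlier example; the test cases are illustrative, not an exhaustive suite.
def sanity_check_tokenizer(sp, cases=None):
    """Round-trip a few tricky inputs and report token counts.

    decode(encode(x)) should reproduce x (up to whitespace normalization),
    and the token counts should look reasonable for each case.
    """
    cases = cases or [
        "def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)",  # code + indentation
        "pneumonoultramicroscopicsilicovolcanoconiosis",                 # very long word
        "价格是 3.14159 元 ($0.99)",                                      # mixed language + numbers
        "emoji 🙂 and symbols ©®™",                                       # special characters
    ]
    for text in cases:
        ids = sp.encode(text)
        roundtrip = sp.decode(ids)
        print(f"{len(ids):4d} tokens | lossless={roundtrip == text} | {text[:40]!r}")

sanity_check_tokenizer(sp)  # `sp` from the SentencePiece example in section 5.1.5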
5.3 Data Mixing and Curriculum Learning¶
After determining the tokenizer, the next key question is: how to organize and present training data? In what proportions should data from different sources and qualities be mixed? Does the order of data during training matter?
5.3.1 Data Mixing Strategies¶
As discussed in Chapter 3, high-quality pre-training datasets typically mix multiple sources: web, books, code, papers, dialogue, etc. Each source has different data volume and quality; simply mixing by original proportions is often not optimal.
Static mixing is the simplest strategy: determine mixing proportions for each source before training begins, shuffle data, then train sequentially. This method is simple to implement but lacks flexibility.
# Static data mixing example
import random

def static_mix(data_sources: dict, target_size: int) -> list:
    """
    Statically mix multiple data sources.

    Args:
        data_sources: {source_name: (data_list, weight)}
        target_size: Target dataset size
    Returns:
        Mixed data list
    """
    mixed_data = []
    # Compute the sample count for each source
    total_weight = sum(w for _, w in data_sources.values())
    for source_name, (data, weight) in data_sources.items():
        num_samples = int(target_size * weight / total_weight)
        # If there is not enough data, sample with replacement
        if len(data) < num_samples:
            sampled = random.choices(data, k=num_samples)
        else:
            sampled = random.sample(data, num_samples)
        mixed_data.extend(sampled)
    random.shuffle(mixed_data)
    return mixed_data

# Usage example (web_data, book_data, etc. are document lists prepared earlier)
data_sources = {
    'web': (web_data, 0.6),
    'books': (book_data, 0.15),
    'code': (code_data, 0.1),
    'papers': (paper_data, 0.1),
    'wikipedia': (wiki_data, 0.05)
}
mixed = static_mix(data_sources, target_size=1000000)
Figure 5-3: Static vs. Dynamic Mixing Strategy Comparison
Dynamic mixing allows adjusting mixing proportions during training. Some research suggests optimal data ratios may differ across training stages. For example, training early with more diverse data helps the model establish broad language understanding; training later with increased high-quality data proportion improves fine-grained capabilities.
class DynamicDataMixer:
    """Dynamic data mixer"""

    def __init__(self, data_sources: dict, schedule: list):
        """
        Initialize the dynamic mixer.

        Args:
            data_sources: Data source dictionary {source_name: data_list}
            schedule: [(step_threshold, weights_dict), ...]
                Different mixing weights are used at different training steps
        """
        self.data_sources = data_sources
        self.schedule = sorted(schedule, key=lambda x: x[0])
        self.current_step = 0

    def get_weights(self) -> dict:
        """Get the weights for the current step"""
        for step_threshold, weights in reversed(self.schedule):
            if self.current_step >= step_threshold:
                return weights
        return self.schedule[0][1]

    def sample_batch(self, batch_size: int) -> list:
        """Sample one batch"""
        weights = self.get_weights()
        batch = []
        for source_name, weight in weights.items():
            num_samples = int(batch_size * weight)
            data = self.data_sources[source_name]
            batch.extend(random.choices(data, k=num_samples))
        random.shuffle(batch)
        self.current_step += 1
        return batch[:batch_size]

# Usage example: emphasize diversity early in training, quality later
schedule = [
    (0, {'web': 0.5, 'books': 0.2, 'code': 0.15, 'papers': 0.1, 'wiki': 0.05}),
    (100000, {'web': 0.4, 'books': 0.25, 'code': 0.15, 'papers': 0.15, 'wiki': 0.05}),
    (500000, {'web': 0.3, 'books': 0.3, 'code': 0.2, 'papers': 0.15, 'wiki': 0.05}),
]
# Here data_sources maps each source name directly to its document list
mixer = DynamicDataMixer(data_sources, schedule)
5.3.2 Curriculum Learning¶
Curriculum learning is a training strategy inspired by human learning. The core idea: have the model learn "easy" samples first, then gradually transition to "hard" samples. This strategy has been proven in multiple studies to accelerate convergence and improve final performance.
In pre-training scenarios, "easy" and "hard" can be defined in multiple ways:
Length-based: Short text is usually easier to learn than long text. The curriculum can start with short sequences and gradually increase length.
Perplexity-based: Text with low perplexity (text a language model is more "familiar" with) can be treated as an "easy" sample. A small pre-trained model can score each sample's difficulty, and the samples are then ordered by difficulty for the main model (see the sketch after this list).
Noise-level-based: High-quality, low-noise text first, then gradually introduce lower-quality but potentially unique-information text.
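For the perplexity-based definition, a small pre-trained model can score every sample once, offline, and the resulting scores feed directly into the CurriculumScheduler shown next. A sketch, assuming GPT-2 as the scoring model and the hypothetical helper name below:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def score_difficulty_by_loss(texts: list, model_name: str = "gpt2") -> list:
    """Score each text by the average per-token loss of a small LM (higher = harder)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    scores = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
            loss = model(ids, labels=ids).loss  # average cross-entropy over tokens
            scores.append(loss.item())
    return scores

# difficulty_scores = score_difficulty_by_loss(data)  # then pass to CurriculumScheduler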
import numpy as np

class CurriculumScheduler:
    """Curriculum learning scheduler"""

    def __init__(self,
                 data: list,
                 difficulty_scores: list,
                 total_steps: int,
                 strategy: str = 'linear'):
        """
        Initialize the curriculum scheduler.

        Args:
            data: Data list
            difficulty_scores: Difficulty score for each sample (higher = harder)
            total_steps: Total training steps
            strategy: Curriculum strategy ('linear', 'sqrt', 'exp')
        """
        self.data = np.array(data)
        self.difficulty_scores = np.array(difficulty_scores)
        self.total_steps = total_steps
        self.strategy = strategy
        # Sort by difficulty
        sorted_indices = np.argsort(self.difficulty_scores)
        self.sorted_data = self.data[sorted_indices]
        self.sorted_scores = self.difficulty_scores[sorted_indices]

    def get_curriculum_fraction(self, current_step: int) -> float:
        """
        Compute the data fraction to use at the current step.
        Returns a value in [0, 1]: the proportion of the easiest data to use.
        """
        progress = current_step / self.total_steps
        if self.strategy == 'linear':
            return progress
        elif self.strategy == 'sqrt':
            return np.sqrt(progress)
        elif self.strategy == 'exp':
            return 1 - np.exp(-3 * progress)
        else:
            return progress

    def sample_batch(self, current_step: int, batch_size: int) -> list:
        """Sample a batch according to the current progress"""
        fraction = self.get_curriculum_fraction(current_step)
        # Determine the available data range
        available_size = max(int(len(self.sorted_data) * fraction), batch_size)
        available_data = self.sorted_data[:available_size]
        # Randomly sample from the available range
        indices = np.random.choice(len(available_data), size=batch_size, replace=True)
        return available_data[indices].tolist()
Figure 5-4: Curriculum Learning Principle — Gradual transition from easy to hard samples
5.3.3 Data Sampling and Batch Construction¶
In actual training, how data is organized affects both efficiency and effectiveness. Here are some important engineering considerations:
Pack strategy: To fully utilize compute resources, multiple short sequences are typically packed into a fixed-length sequence. This reduces computation waste from padding. The key question is how to handle attention masks after packing—different documents should not attend to each other.
def pack_sequences(sequences: list, max_length: int, eos_token_id: int) -> list:
    """
    Pack multiple short sequences into fixed-length sequences.

    Args:
        sequences: List of token-id sequences
        max_length: Target sequence length
        eos_token_id: End-of-sequence token ID
    Returns:
        List of packed sequences, each of length max_length
    """
    packed = []
    current_pack = []
    current_length = 0
    for seq in sequences:
        # Truncate sequences that exceed max_length on their own
        seq_with_eos = (seq + [eos_token_id])[:max_length]
        if current_length + len(seq_with_eos) <= max_length:
            current_pack.extend(seq_with_eos)
            current_length += len(seq_with_eos)
        else:
            # Current pack is full: pad it to max_length and start a new one
            if current_pack:
                current_pack.extend([eos_token_id] * (max_length - current_length))
                packed.append(current_pack)
            current_pack = list(seq_with_eos)
            current_length = len(seq_with_eos)
    # Handle the last pack
    if current_pack:
        current_pack.extend([eos_token_id] * (max_length - current_length))
        packed.append(current_pack)
    return packed
Document boundary handling: When packing sequences, a "document boundary mask" is needed so that the model does not attend across document boundaries.
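One way to implement this is to build, for each packed sequence, a block-diagonal attention mask from the EOS positions, so that a position may only attend to earlier positions within the same document. A minimal numpy sketch, assuming EOS marks the document boundary as in pack_sequences above:
import numpy as np

def document_boundary_mask(packed_ids: list, eos_token_id: int) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask: True where attention is allowed.

    Tokens may attend only to earlier positions within the same document;
    each EOS closes the current document.
    """
    seq_len = len(packed_ids)
    # Assign a document index to every position
    doc_ids = np.zeros(seq_len, dtype=np.int64)
    doc = 0
    for i, tok in enumerate(packed_ids):
        doc_ids[i] = doc
        if tok == eos_token_id:
            doc += 1  # the next token starts a new document
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return same_doc & causal

mask = document_boundary_mask([5, 8, 2, 7, 7, 2, 9], eos_token_id=2)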
Data loading efficiency: For TB-scale datasets, data loading itself may become a bottleneck. Common optimization methods include: storing preprocessed data in binary format (e.g., numpy memmap), multi-process parallel loading, prefetching the next batch.
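As a concrete example of these loading-side optimizations, a memmap-backed PyTorch Dataset combined with multi-worker loading and prefetching keeps the GPU fed without holding the whole corpus in RAM. The sketch below assumes a hypothetical file train_tokens.bin laid out in the uint16 memmap format described in the next subsection.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class PackedTokenDataset(Dataset):
    """Reads fixed-length token blocks from a memmapped .bin file of uint16 IDs."""
    def __init__(self, path: str, seq_length: int):
        self.data = np.memmap(path, dtype=np.uint16, mode='r')
        self.seq_length = seq_length

    def __len__(self):
        return len(self.data) // self.seq_length

    def __getitem__(self, idx):
        start = idx * self.seq_length
        block = self.data[start:start + self.seq_length]
        return torch.from_numpy(block.astype(np.int64))

loader = DataLoader(
    PackedTokenDataset('train_tokens.bin', seq_length=2048),
    batch_size=8,
    num_workers=4,        # parallel loading processes
    prefetch_factor=2,    # each worker prefetches upcoming batches
    pin_memory=True,
    shuffle=True,
)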
5.3.4 Serialization and Storage Formats¶
After tokenization, token sequences need to be stored in efficient format for fast reading during training.
Common storage formats include:
NumPy memmap: Store token IDs as numpy array, access via memory mapping. Advantage is simple and direct, supports random access; disadvantage is no compression support, larger storage.
import numpy as np

def save_as_memmap(token_ids: list, output_path: str):
    """Save a token ID list in memmap format"""
    arr = np.array(token_ids, dtype=np.uint16)  # Assumes vocabulary size < 65536
    fp = np.memmap(output_path, dtype='uint16', mode='w+', shape=arr.shape)
    fp[:] = arr[:]
    fp.flush()

def load_memmap(path: str, shape: tuple):
    """Load token IDs stored in memmap format"""
    return np.memmap(path, dtype='uint16', mode='r', shape=shape)
Arrow/Parquet: Use Apache Arrow format, supports compression and efficient columnar access. HuggingFace Datasets uses this format internally.
Custom binary format: Some large projects use custom binary formats optimized for specific access patterns. For example, the binary packing format used by GPT-NeoX.
# Use HuggingFace Datasets for tokenized data
from datasets import Dataset

def tokenize_and_save(raw_data: list, tokenizer, output_dir: str):
    """Tokenize and save in Datasets (Arrow) format"""
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            truncation=True,
            max_length=2048,
            return_attention_mask=False
        )

    # Create the Dataset
    ds = Dataset.from_dict({'text': raw_data})
    # Tokenize
    tokenized_ds = ds.map(
        tokenize_function,
        batched=True,
        num_proc=16,
        remove_columns=['text']
    )
    # Save
    tokenized_ds.save_to_disk(output_dir)
5.4 Complete Data Preparation Pipeline¶
Connect the steps discussed above to build a complete pipeline from raw text to training-ready data.
import random
from dataclasses import dataclass
from typing import Optional

import sentencepiece as spm

@dataclass
class DataPrepConfig:
    """Data preparation configuration"""
    # Tokenizer config
    tokenizer_path: str
    max_seq_length: int = 2048
    # Data mixing config
    mix_weights: Optional[dict] = None  # {source: weight}
    # Curriculum learning config
    use_curriculum: bool = False
    curriculum_strategy: str = 'linear'
    # Output config
    pack_sequences: bool = True
    output_format: str = 'arrow'  # 'arrow', 'memmap', 'jsonl'

class DataPreparationPipeline:
    """Data preparation pipeline.

    The I/O helpers (load_documents, apply_curriculum, save_as_arrow,
    save_as_memmap, save_as_jsonl) are assumed to be implemented elsewhere.
    """

    def __init__(self, config: DataPrepConfig):
        self.config = config
        self.tokenizer = spm.SentencePieceProcessor(model_file=config.tokenizer_path)

    def tokenize_document(self, text: str) -> list:
        """Tokenize a single document"""
        return self.tokenizer.encode(text)

    def process_source(self, source_path: str, source_name: str) -> list:
        """Process a single data source"""
        documents = self.load_documents(source_path)
        tokenized = []
        for doc in documents:
            tokens = self.tokenize_document(doc['text'])
            if len(tokens) > 10:  # Filter documents that are too short
                tokenized.append({
                    'input_ids': tokens,
                    'source': source_name,
                    'length': len(tokens)
                })
        return tokenized

    def mix_sources(self, sources: dict) -> list:
        """Mix multiple data sources"""
        mixed = []
        weights = self.config.mix_weights or {s: 1.0 for s in sources}
        total_weight = sum(weights.values())
        # Determine the sample count per source
        total_samples = sum(len(data) for data in sources.values())
        for source_name, data in sources.items():
            weight = weights.get(source_name, 1.0) / total_weight
            num_samples = int(total_samples * weight)
            if len(data) >= num_samples:
                sampled = random.sample(data, num_samples)
            else:
                sampled = random.choices(data, k=num_samples)
            mixed.extend(sampled)
        random.shuffle(mixed)
        return mixed

    def pack_and_save(self, data: list, output_path: str):
        """Pack and save the data"""
        if self.config.pack_sequences:
            sequences = [d['input_ids'] for d in data]
            packed = pack_sequences(
                sequences,
                self.config.max_seq_length,
                self.tokenizer.eos_id()
            )
        else:
            packed = [d['input_ids'] for d in data]
        # Select the output format based on the config
        if self.config.output_format == 'arrow':
            self.save_as_arrow(packed, output_path)
        elif self.config.output_format == 'memmap':
            self.save_as_memmap(packed, output_path)
        else:
            self.save_as_jsonl(packed, output_path)

    def run(self, source_paths: dict, output_path: str):
        """Run the complete pipeline"""
        # 1. Process each data source
        sources = {}
        for source_name, path in source_paths.items():
            print(f"Processing {source_name}...")
            sources[source_name] = self.process_source(path, source_name)
        # 2. Mix the data
        print("Mixing data sources...")
        mixed = self.mix_sources(sources)
        # 3. Optional: apply curriculum ordering
        if self.config.use_curriculum:
            print("Applying curriculum ordering...")
            mixed = self.apply_curriculum(mixed)
        # 4. Pack and save
        print("Packing and saving...")
        self.pack_and_save(mixed, output_path)
        print(f"Done! Saved {len(mixed)} samples to {output_path}")
Figure 5-5: Complete Pipeline from Raw Text to Training-Ready Data
5.5 Chapter Summary¶
This chapter systematically introduced the core technologies of tokenization and data serialization.
In tokenizer principles: subword tokenization is the mainstream choice for current large models, achieving good balance between vocabulary size and sequence length. BPE uses frequency-driven bottom-up merging strategy, simple and efficient; WordPiece uses likelihood-driven merging criterion, more sensitive to low-frequency meaningful patterns; Unigram uses top-down probability modeling, theoretically more elegant. SentencePiece is the most commonly used toolkit, supporting multiple algorithms and language-agnostic processing.
In vocabulary design: vocabulary size requires trade-offs between computation efficiency, sequence length, and rare word handling; mainstream models typically use 32K-128K. Multilingual vocabulary design needs to balance coverage across languages and avoid the "vocabulary curse." Domain-specific vocabulary extension can improve professional terminology handling but requires extending the model embedding layer accordingly.
In data mixing: static mixing is simple and direct; dynamic mixing allows adjusting proportions during training. Curriculum learning strategy starts with easy samples and gradually transitions to hard ones, which can accelerate convergence and improve performance. Data packing and efficient storage formats are crucial for large-scale training.
Figure 5-6: Chapter 5 Knowledge Structure — Three themes: Tokenization Algorithms, Vocabulary Design, Data Organization
Further Reading¶
For in-depth content on tokenization and data serialization, the following resources are worth referencing:
The SentencePiece paper (Kudo and Richardson, 2018) introduces language-agnostic subword tokenization. The BPE paper (Sennrich et al., 2015) is the pioneering work that brought BPE into NLP. The Unigram paper (Kudo, 2018) provides a probabilistic perspective on subword tokenization. The HuggingFace Tokenizers library documentation (huggingface.co/docs/tokenizers) is the authoritative practical reference. For curriculum learning, Bengio et al.'s original 2009 paper lays out the foundational framework.
Next Chapter Preview¶
With this, we have completed all content on text pre-training data engineering. In the next chapter "Image-Text Pairs Processing," we will enter the field of multimodal data engineering. You will learn how to process LAION-5B style image-text paired data, how to use img2dataset for high-concurrency image downloads, and how to build multimodal data cleaning pipelines.
Consider this question as you enter the next chapter: How should the "quality" of an image be defined? Besides resolution and clarity, what other dimensions need to be considered?