Appendix A: Tools and Frameworks Quick Reference¶

A.1 Purpose of This Appendix¶

This appendix supports the engineering implementation layer of the book. It is not trying to list every tool you may have heard of. It answers a more practical question: when a team has decided to collect data, clean it, evaluate it, track experiments, release assets, and support teaching reproduction, how should it choose tools at different stages and understand the responsibility boundaries between them?

In real projects, tool selection usually fails not because one framework is wrong, but because problems from different layers are mixed together. A team that lacks version governance may try to solve it by adding another data-processing script. A team that lacks reviewable evaluation may keep spending budget on training. A team that lacks teaching images and course environments may assume that publishing a repository is the same as making the project reproducible.

This appendix therefore organizes tools by the data-engineering lifecycle rather than by popularity or vendor name.

Keep three principles in mind. First, tools do not replace process. If the process is unclear, changing tools usually only moves confusion into a new interface. Second, interfaces and boundaries matter most in data engineering: acquisition, cleaning, annotation, evaluation, release, and feedback must each have clear responsibilities. Third, the closer a project is to open benchmarks or university collaboration, the more it should prioritize auditability, handoff, and teachable reproducibility, not only one-off throughput or benchmark scores.

A.2 Tool Overview by Lifecycle¶

Table A-1 gives a coarse but directly usable mapping. It is not the only correct answer, but it helps teams put the problem at the right layer first.

Stage	Main Goal	Common Deliverables	Priority Tool Categories	Typical Risk
Data ingestion	Turn external sources into manageable inputs	Crawl outputs, raw files, ingestion logs	Crawling frameworks, connectors, object storage	Sources cannot be traced; collection criteria drift
Data cleaning	Denoise, deduplicate, standardize, decontaminate	Cleaning rules, abnormal sample pools, quality reports	Batch-processing frameworks, rule engines, quality validators	Cleaning is not reproducible; high-value samples are deleted
Annotation and augmentation	Create trainable supervision signals	Annotation tasks, QA workflows, augmentation artifacts	Annotation platforms, review systems, synthetic generation frameworks	Guideline drift; review chain breaks
Training preparation	Package data into model-consumable formats	JSONL/Parquet/Arrow, splits, indexes	Data-format libraries, tokenization and packing tools	Training and evaluation inputs diverge
Evaluation attribution	Build comparable experiments	Metric scripts, slice reports, attribution ledgers	Evaluation frameworks, experiment tracking, dashboards	Only averages are checked; evidence is not preserved
Open release	Create reusable assets	Cards, licenses, baseline bundles	Data cards, model cards, version-release tools	Versions cannot roll back; baselines become invalid
Teaching and reproduction	Let others reproduce work stably	Teaching images, experiment instructions, fixed versions	Containers, environment managers, course repositories	Environments drift mid-semester; scripts fail

The table's most important role is not to decide the tool immediately. It is to confirm what kind of problem is being solved. Many debates become easier once they move from "which framework do I prefer?" to "what deliverable must this stage produce?"

A.3 Data Ingestion, Storage, and Version Governance Tools¶

A.3.1 Ingestion Tools Standardize Sources First¶

If data sources span webpages, document repositories, enterprise tables, object storage, and third-party open datasets, manual download plus ad hoc scripts will not remain sustainable. The ingestion layer should fix where data comes from, when it comes in, and under what rules.

Category	Representative Tools / Frameworks	Scenarios	Strengths	Extra Concerns
Web crawling	Scrapy, Trafilatura	Public web text, news, knowledge bases	Mature ecosystem, easy customization	robots, copyright, update frequency
API/DB connectors	Airbyte, Fivetran-style connectors	SaaS, databases, internal business sources	Standardized ingestion, easier incrementality	Field-change management, permission minimization
Document import	Unstructured, Apache Tika	PDF, Office, scanned documents	Unified document entry	OCR errors, layout parsing bias
Object-storage access	S3/OSS/MinIO SDKs	Images, audio, video, large files	Suitable for lakehouse and offline processing	Lifecycle policies and cost control

For large-model data engineering, the ingestion layer should retain two records. The raw-entry record explains source, retrieval time, authorization scope, and filters. The engineering-entry record explains format conversion, anonymization, slicing, and sampling before downstream use. Without these records, cleaning, evaluation, and compliance boundaries become hard to explain.

A.3.2 Storage Is About Layering, Not Just Size¶

Many teams initially put everything into one object-storage bucket. This is convenient short term, but soon nobody can tell which layer is raw, which layer is cleaned, or which version downstream systems consumed. Storage must have at least basic layering.

Layer	Recommended Carrier	Description
Raw	Object storage saved as-is	Preserve the original source; avoid overwriting
Staging	Parquet/JSONL/Arrow intermediate formats	For cleaning, sampling, and quality checks
Curated	Trainable/evaluable standard sets	Official versions for training and evaluation
Release	Release package, card, baseline bundle	For external use or course reproduction

Object storage stores files but does not automatically provide version semantics. If the team expects long-term evolution, add versioning tools such as DVC, lakeFS, or a data-lake solution with snapshots. The value is practical: teams can answer which data version an experiment consumed, whether a public leaderboard maps back to a specific split, and whether a course environment is locked to the release-time version.

A.3.3 Version-Governance Tools Are Team Memory Systems¶

Git is excellent for text configuration and scripts, but it is not designed to directly manage large data assets. Data engineering often uses Git as the control plane and a large-file version system as the data plane.

Tool	Best For	Typical Use
Git	Code, configuration, schema, documentation	Process definitions, evaluation scripts, release notes
Git LFS	Medium-sized binaries	Small model files, sample data
DVC	Large-file and data-version references	Dataset versions, experiment-input binding
lakeFS	Branches and commits over object storage	Lakehouse-style data governance and collaboration
Delta Lake / Apache Iceberg	Large tabular data governance	Large-scale structured samples and metadata

For cross-institution dataset construction, public evaluation, and teaching reproduction, a minimal combination is often enough: Git for scripts and specifications, DVC or an equivalent for data versions, object storage for large files, and release pages for external documentation. This combination is easy to hand off, easy to reproduce in courses, and consistent with the governance language used in Part 8 and Part 12. Concrete data-versioning commands, remote configuration, and pipeline syntax should follow the official DVC documentation (DVC Contributors 2026).

A.4 Cleaning, Validation, and Training Preparation Tools¶

A.4.1 Choose Batch-Processing Frameworks by Data Form¶

Cleaning tools often fall into the habit of using one big framework for everything. Text, tables, document images, audio, and agent trajectories have very different processing needs. Split by data form.

Data Form	Recommended Processing Mode	Common Tools
Large-scale line-level text	Batch map/filter/reduce	Spark, Ray Data, Beam
Documents and tables	Streaming extraction plus structural validation	Unstructured, Pandas, Arrow
Multimodal samples	Metadata batch processing plus file references	Ray, PyArrow, object-storage indexes
Audio and video	Offline transcoding and feature extraction	FFmpeg, torchaudio, decord
Agent trajectories	Structured event streams and replay	JSONL, Parquet, custom validators

Spark is mature and stable for heavy batch processing and enterprise platforms. Ray Data is closer to Python, model inference, and multimodal processing. Beam is useful when unified batch/stream semantics matter. For many book projects, labs, and courses, the main bottleneck is not whether the system is distributed enough; it is unclear data contracts, unstable fields, and missing recovery paths for abnormal samples.

A.4.2 Quality Validation Should Explain Failures¶

Quality validation is not about achieving zero errors. It is about classifying errors and creating write-back actions. Frameworks such as Great Expectations are useful for structured rules, while documents, multimodal samples, and reasoning data often require custom validators.

Validation Layer	Question to Answer	Tool Form
Structure	Are JSON/table fields complete?	Schema validators, Pydantic
Statistics	Did distributions drift or outliers spike?	Profiling, dashboards
Semantics	Is the sample self-consistent and on task?	LLM judges, human spot checks
Task	Does it still satisfy training or evaluation protocols?	Special scripts, task validators

For the specialized datasets in Part 12, validators should ideally output three objects: a failed-sample pool, failure-reason categories, and repair suggestions. Cleaning then becomes a process of translating problems into next actions.

A.4.3 Deduplication, Decontamination, and Splitting Should Be Governed Separately¶

Do not combine deduplication, decontamination, and train/validation/test splitting into one opaque step. Use three steps:

Detect exact and near duplicates.
Check evaluation contamination and benchmark isolation.
Create official splits and freeze the version.

Text tasks commonly use MinHash, SimHash, and n-gram overlap. Document and image tasks must consider visual and layout-level near duplicates. Code and reasoning tasks must also watch template contamination, question-bank contamination, and evaluation-prompt leakage. A mature process lets future readers know whether a sample was excluded because of duplication, contamination, or split policy.

A.5 Annotation, Experiment Tracking, and Evaluation Tools¶

A.5.1 Annotation Platforms Should Be Judged by the QA Chain¶

The core of an annotation platform is not a polished UI. It is whether the platform supports task definition, review, arbitration, and write-back.

Scenario	Key Capability	Common Platform Direction
Text classification/extraction	Rule-based annotation and QA sampling	Label Studio, Doccano
Preference/ranking	Pairwise comparison, arbitration, review	Custom platforms, questionnaire-style systems
Document/multimodal	Region annotation, box selection, OCR linkage	Label Studio, CVAT-style tools
Speech	Waveform playback, slicing, speaker and emotion tags	Speech-focused annotation platforms

If a project will become an open benchmark or course experiment, preserve annotation version, guideline version, review conclusion, and disputed-sample list. Keeping only final labels makes it hard to reconstruct why boundary samples were defined in a certain way.

A.5.2 Experiment Tracking Must Bind Data Versions¶

Tools such as MLflow and Weights & Biases are often misused by recording only model parameters and metrics while omitting data versions, slice results, and evaluation-script versions. Logs then look rich but cannot explain where improvement came from. If MLflow is used as the experiment-tracking entry point, run records, artifact management, and model registry details should follow the official MLflow documentation (MLflow Authors 2026).

Track at least:

Field	Description
dataset_version	Training or evaluation data version
split_version	Split-policy version
eval_script_version	Metric-script version
prompt_or_template_version	Prompt or template version
slice_report_uri	Location of the slice report
writeback_decision	Whether the data strategy was changed

With these fields, experiment tracking moves from "what was run" to "why it was run, why it improved, and why it can be trusted."

A.5.3 Evaluation Frameworks Need Slices and Evidence Preservation¶

Large-model evaluation tools are multiplying, but the engineering needs in Part 12 are broader than a single benchmark run. We need reproducible metrics, explainable slices, saved evidence, and traceable versions.

Evaluation frameworks should:

Support multiple metrics in parallel, not only one total score.
Export slice reports and error samples.
Save evaluation inputs and outputs structurally.
Support reruns and historical version comparisons.

Only then can evaluation results enter the release checks in Appendix B and cost budgets in Appendix C.

A.6 Specialized Tools for Multimodal, RAG, and Agent Scenarios¶

A.6.1 In Document and Multimodal Pipelines, Parsing and Judgment Are Different¶

In document understanding, table parsing, chart reasoning, and multimodal RAG, teams often collapse OCR, layout analysis, retrieval, and final QA into one model black box. A better toolchain separates four layers.

Capability Layer	Role	Common Tool Direction
Parsing	Extract text, regions, and structure from raw files	OCR, document parsers, layout models
Storage	Store chunks, bounding boxes, page numbers, evidence metadata	Vector databases, object storage, structured tables
Retrieval	Recall candidate evidence bundles	BM25, vector search, hybrid retrieval
Judgment	Compose answers, refuse when needed, cite evidence	LLMs, rule validation, judges

This separation lets the team identify whether the problem is failure to extract, failure to retrieve, or failure to use retrieved evidence correctly.

A.6.2 Agent Data Toolchains Should Treat Trajectories as Assets¶

Agent tool-use data differs from ordinary QA data because intermediate states are themselves training assets. Function choice, arguments, observations, error recovery, and stopping conditions should not be treated as temporary logs.

Agent tooling should support:

Saving complete event sequences.
Replaying key steps.
Binding observations to final answers.
Extracting failed trajectories into specialized evaluation sets.

Without these capabilities, a team may get good final accuracy but still be unable to explain whether behavior is stable or convert the result into teaching experiments.

A.7 Minimal Combinations That Can Be Implemented Directly¶

A.7.1 Lab or Course Combination¶

Code and specifications: Git
Data versions: DVC
Storage: object storage or shared network storage
Cleaning and processing: Python + Ray/Pandas
Annotation: Label Studio or Doccano
Experiment tracking: MLflow
Release: Hugging Face Hub or a project website

This is suitable for cross-institution specialized datasets, course reproduction, and medium-scale research projects. It is lightweight and relatively easy to hand off.

If a dataset is organized and distributed through the Hugging Face Datasets ecosystem, the loading script, dataset card, and split configuration should follow the official Hugging Face Datasets documentation (Hugging Face 2026).

A.7.2 Enterprise Data Platform Combination¶

Workflow scheduling: Airflow
Distributed processing: Spark
Lakehouse governance: Iceberg/Delta
Quality validation: Great Expectations
Experiment tracking: MLflow or an internal platform
Release: tiered internal/external repositories and audit dashboards

The goal is not fastest one-off development, but stable boundaries under multi-person collaboration.

A.7.3 Multimodal and Agent-Heavy Combination¶

Files and metadata: object storage plus structured tables
Batch processing: Ray Data
Document parsing: OCR / Unstructured / custom pipelines
Retrieval and evidence: vector database plus rule indexes
Trajectory records: JSONL/Parquet plus replay tools
Evaluation attribution: specialized scripts plus experiment tracking

This fits the problem space of Chapters 38-41 because it naturally supports unified governance of documents, tables, charts, multimodal evidence, and agent tool trajectories.

A.8 Ten Questions Worth Asking During Tool Selection¶

Tooling should be selected around problems, not the other way around.

Does the tool solve acquisition, governance, evaluation, or release?
Can it connect stably to the existing versioning system?
Can failed samples be exported and recovered separately?
Does it support fixed versions for teaching reproduction?
Are permission control, audit, and least exposure easy to implement?
Does it support structured metadata rather than only file piles?
Can it support multimodal or multi-turn trajectories, not only single text rows?
Is handoff cost acceptable when team members change?
Is there a risk of deep vendor lock-in or single-engineer lock-in?
If the team creates a public benchmark in a year, can the tool still support it?

If four or five of these cannot be answered, the team should improve process design before deploying the tool broadly.

A.8.1 Common Tool-Selection Mistakes¶

Mistake	Surface Reason	Actual Problem
Use one large platform for everything	A single platform seems simpler	Stage boundaries are hidden behind UI
Look only at throughput and benchmarks	Faster tools feel more advanced	Audit, handoff, teaching reproduction, and failed-sample recovery are ignored
Follow one engineer's past experience	It worked in a previous project	Organizational knowledge and fallback plans are missing
Deploy tools first and add process later	"Let's get it running first"	Interfaces become messy and rework grows

These mistakes are dangerous because each contains a partial truth. A unified platform can reduce early integration cost; one experienced engineer can move quickly; throughput matters. But if these reasons replace lifecycle judgment, governance tends to collapse after a few versions.

A.8.2 Recommended Mapping for the Later Parts and Appendices¶

The later parts of the book connect naturally to tool choices:

Chapters 44-45: pre-training and post-training recipes need batch processing, version governance, experiment tracking, and data cards.
Chapters 46-48: reasoning, multimodal, and generative scenarios need trajectory records, evaluation slices, storage layering, and inference services.
Chapters 38-43: specialized datasets need fact checking, sample schemas, build pipelines, evaluation protocols, compliance audits, and reproducibility boundaries.
Appendices A-H translate those capabilities into operational checklists and templates for project managers, teaching assistants, platform teams, and maintainers.

This reminds readers that appendices are not secondary extras. They translate engineering capabilities from the main text into operational language.

A.9 Quick-Reference Fields to Maintain Long Term¶

To turn this appendix into a team asset, maintain a tool inventory table and update it quarterly.

Field	Description
tool_name	Tool name
stage	Lifecycle stage
owner	Responsible person or group
current_version	Current version in use
replacement_plan	Replacement or upgrade plan
dependency_risk	External dependency risk
teaching_ready	Whether it can enter course images
public_release_ready	Whether it is suitable for open release

This turns tool knowledge into organizational memory. Even after team members change, the team can still understand why a tool was chosen, where it can be replaced, and where it must not be changed casually.

A.10 Summary¶

This appendix organizes common data-engineering tools and frameworks from a lifecycle perspective.

First, tool selection should start from deliverables and failure modes, not popularity.

Second, long-term tool combinations are usually not one large platform, but a bounded combination of version control, data storage, batch processing, quality validation, evaluation tracking, and release governance.

Third, for university collaboration, open benchmarks, and teaching reproduction, handoff, reproducibility, and auditability are often more important than one-off throughput.

References¶

Gebru T, Morgenstern J, Vecchione B, Vaughan J W, Wallach H, Daumé III H, Crawford K (2021) Datasheets for Datasets. Communications of the ACM 64(12): 86-92. https://doi.org/10.1145/3458723.

Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji I D, Gebru T (2019) Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 220-229. https://doi.org/10.1145/3287560.3287596.

Pushkarna M, Zaldivar A, Kjartansson O, Cicconi P, Chen V, Efrat A, Zou Y, Mueller J, Taly A, Ehyaei A, Karkkainen K, Marathe A, Han X, Mittal A, Schuster T, Yarmand M, Sohn H, Dwarakanath N C, McCann B (2022) Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp 1776-1826. https://doi.org/10.1145/3531146.3533231.

DVC Contributors (2026) Data Version Control Documentation. https://dvc.org/doc.

MLflow Authors (2026) MLflow Documentation. https://mlflow.org/docs/latest/.

Hugging Face (2026) Hugging Face Datasets Documentation. https://huggingface.co/docs/datasets.