Appendix C: Cost Estimation and Resource Templates¶

C.1 Purpose of This Appendix¶

The most underestimated part of a data-engineering project is often not technical complexity but cost structure. Many teams ask only how many GPUs are needed for training, while leaving unanswered questions such as: who collects and cleans samples, how many person-days annotation review requires, how much long-term evaluation and leaderboard maintenance costs, how much multimodal storage and egress traffic will cost, and whether teaching images need fixed resources for an entire semester. The resource consumption and carbon emissions of large-scale training have been discussed in dedicated studies, showing that training budgets should not be treated merely as temporary engineering expenses (Patterson et al. 2021).

This appendix does not provide a universal price list. It provides a method, tables, and review templates for cost estimation. If the method is stable, teams can quickly update estimates even when unit prices change. Without a method, knowing one moment's unit price is not enough for reliable budgeting.

C.2 Why Costs Should Be Split by Lifecycle¶

Budget debates often lose focus because different roles look at different cost objects. Researchers may look only at training GPUs. Platform teams care about storage and feedback flows. Course owners care about images and teaching-assistant time. Management cares about quarterly investment and delivery cadence.

A more useful split for data-engineering projects is lifecycle-based:

Cost Category	Core Question	Typical Unit
Data ingestion	What does it cost to obtain samples?	Person-days, API calls, crawl jobs
Cleaning and processing	What does it cost to turn samples into usable assets?	CPU/GPU hours, person-days
Annotation and review	How is supervision created?	Per sample, per hour, per QA round
Training and inference	What does model consumption of data cost?	GPU hours, tokens, throughput
Evaluation and release	What do comparison, leaderboard, and public maintenance cost?	Task batches, human review, person-days
Teaching and operations	What does long-term reproducibility cost?	Images, accounts, teaching-assistant time

This split forces the team to recognize a key reality: the expensive part in the long term is often not the first training run, but governance and maintenance afterward.

C.3 Overall Estimation Formula¶

A practical total-cost formula is:

\[ \text{Total Cost} = C_{\text{ingest}} + C_{\text{clean}} + C_{\text{annotate}} + C_{\text{train/infer}} + C_{\text{evaluate}} + C_{\text{release/teach}} \]

For execution, each item can be split into fixed cost, variable cost, and rework-risk reserve:

\[ C_i = C_i^{fixed} + C_i^{variable} + C_i^{risk} \]

Many budgets lose control not because the variable term is wrong, but because the risk term is missing. Re-annotation, anonymization rework, leaderboard review, sample withdrawal, and mid-semester image repair are not rare edge cases in real projects.

C.4 Data Ingestion and Cleaning Cost Templates¶

C.4.1 Ingestion Cost Is Not Just "How Much Was Collected"¶

The ingestion stage should estimate at least three costs: ingestion-script development, data-pull execution, and source/license verification.

Item	Unit	Notes
Ingestion script development	Person-days	Crawling, API access, parser adaptation
Data pull jobs	Batches	Includes retries
Source and license verification	Person-days	Partners, agreements, log organization
Storage landing	TB/month	Raw-layer storage
Data spot check	Person-days	First quality baseline

If a project has many sources, source verification should be listed separately rather than hidden inside script development, because it often becomes a hidden critical path for launch.

C.4.2 Cleaning Cost Often Underestimates Rework¶

Cleaning cost usually includes rule development, batch-processing resources, abnormal-sample review, and reruns.

Cost Item	Estimation Method
Rule development	Number of rules x average design and test hours
Batch processing	Data volume x unit processing resource
Failed-sample review	Number of abnormal samples x review time per sample
Rerun cost	Main-pipeline resource per rerun x expected rerun count

In document, multimodal, and speech projects, rerun cost is often ignored. Once OCR strategy, deduplication thresholds, or audio-slicing rules change, entire intermediate layers may have to be regenerated.

C.5 Annotation, Review, and Synthetic Data Cost Templates¶

C.5.1 Separate First-Pass Annotation from Review¶

Estimating annotation as one price per sample almost always underestimates the budget. Split annotation into:

First-pass annotation.
Review.
Arbitration.
Guideline iteration.
Platform and management cost.

Item	Unit	Notes
First-pass annotation	Item/page/minute	Depends on task granularity
Review	Item/page/minute	Estimate by sampling rate
Arbitration	Item/page/minute	Budget for high-dispute samples
Guideline iteration	Person-days	Guidelines and QA meetings
Platform maintenance	Month	Accounts, permissions, export

For preference alignment, multimodal region annotation, or controllable speech-emotion tasks, arbitration and guideline iteration can become bigger bottlenecks than first-pass annotation.

C.5.2 Synthetic Data Is Not Zero Labor Cost¶

When teams hear that large models can generate synthetic data, they often assume cost will drop sharply. In practice, the first generation pass may become cheaper, but verification and filtering may not.

Cost Item	Description
Prompt/template design	Who defines the generation protocol
API or inference cost	Tokens or GPU cost for generation
Filtering and scoring	Automatic judges and rule filters
Human spot checks	Verification that samples are actually usable
Regeneration	Reprocessing failed samples

The safest budget is not "the API bill for generating 100,000 samples." It is the full-chain cost of generation, filtering, sampling, and rework.

C.6 Training and Inference Resource Templates¶

C.6.1 Training Estimates Should Not Look Only at GPU Count¶

Training budgets are often simplified to "how many cards for how many days." The real cost also depends on effective throughput, probability of failed reruns, and number of tuning rounds. Large-scale training systems such as Megatron-LM show that model parallelism, data parallelism, and pipeline parallelism can significantly affect throughput, memory footprint, and failure-recovery cost (Narayanan et al. 2021).

Item	Unit	Notes
Training GPUs	GPU hours	Formal training
Tuning GPUs	GPU hours	Small-scale experiments
Evaluation inference	GPU hours / tokens	Includes slice evaluation
Data preparation CPU	Core hours	Tokenization, packing, validation
Failed-rerun reserve	Percentage	Keep as a separate line

If a team does not reserve resources for failed reruns, the budget usually becomes inaccurate after the second experiment. For engineering projects, a 10-30% risk reserve is often more valuable than pretending the first estimate is exact.

C.6.2 Split Inference Cost by Scenario¶

Inference cost should be split into at least three scenarios. For long-context and high-concurrency serving, memory-management mechanisms such as PagedAttention have become important references for serving-cost estimation, and vLLM's engineering documentation provides a practical entry point for deployment and tuning (Kwon et al. 2023; vLLM Project 2026).

Scenario	Characteristics	Estimation Focus
Offline batch evaluation	Batchable and queueable	Throughput, concurrency, result caching
Online service	Latency-sensitive	Peak concurrency, context length, fallback strategy
Teaching experiment	Time-window concentration	Semester images, account limits, retry cost

Mixing these three makes it hard to explain why online budgets are high or why course weeks produce resource spikes.

C.7 Evaluation, Leaderboard, and Public Maintenance Costs¶

C.7.1 Long-Term Benchmark Cost Often Exceeds Initial Release Cost¶

A public benchmark costs more than its first release. Continuing costs include:

Baseline reproduction and upgrades.
Submission validation.
Manual review of suspicious results.
Dispute handling.
Documentation and image maintenance.
Teaching-version locking.

Item	Cycle	Unit	Estimation Method
Baseline reruns	Quarterly / half-yearly	GPU hours	Number of baselines x cost per evaluation
Submission review	Monthly	Person-days	Submission count x review hours
Issue handling	Monthly	Person-days	Historical average issue volume
Documentation update	Per version release	Person-days	Card, FAQ, release notes
Teaching image maintenance	Semester	Person-days / instances	Pre-course freeze and in-semester patches

Without this budget line, a public benchmark can be lively at launch and unreliable half a year later.

C.7.2 Why Evaluation Cost Often Rises Late¶

Early in a project, evaluation may feel cheap because the team runs a few baselines internally. Once a dataset enters public release, courses, or multi-team collaboration, evaluation cost rises because:

More slices are added, so one evaluation is no longer just one total score.
More baselines must preserve historical comparability.
Suspicious results require human review.
Course experiments need a stable evaluation environment that cannot follow every mainline change.

Maintain two ledgers: internal R&D evaluation and public-maintenance evaluation. The first serves fast iteration; the second serves stable comparison.

C.8 Storage, Network, and Archival Costs¶

C.8.1 Multimodal Projects Must List Storage Separately¶

Text projects can sometimes survive rough disk estimates. Document, image, audio, video, and agent-trajectory projects cannot.

Layer	Description
Raw files	Largest layer, but not always frequently accessed
Intermediate artifacts	OCR, features, transcoding, indexes
Curated versions	Official training and evaluation inputs
Release images	External release and course reproduction versions
Archive layer	Cold storage and long-term preservation

Without this layering, teams often discover late that training was not the expensive part; permanently retaining every intermediate artifact was. If Kubernetes is used to host training, evaluation, or teaching environments, resource quotas, storage volumes, namespaces, and job lifecycle should also be included in the budget sheet, with resource objects and scheduling semantics following the official Kubernetes documentation (Kubernetes Authors 2026).

C.8.2 Archival Strategy Determines Maintainability Over the Next Three Years¶

Archiving is not simply compressing old versions. It determines whether future review and reproduction remain possible.

A good archive budget answers:

Which versions must be kept long term.
Which versions keep only metadata and pointers.
Which course images must be frozen by semester.
Which public releases must be rollback-capable.

Projects without an archive strategy often save a little storage cost while losing reviewability.

C.8.3 Network Egress and External Calls Should Not Be Hidden Under "Other"¶

In multimodal projects, RAG projects, and open leaderboards, network and external-call costs are often placed under a vague "other" category. This weakens budget judgment.

Category	Common Source	Why It Must Be Separate
Egress traffic	Object-storage downloads, course image pulls	Peaks often align with course milestones
External APIs	Judges, OCR, translation, anonymization	Cost can scale rapidly with call volume
External synchronization	Public release, image replication, cross-region backup	Strongly tied to compliance and availability strategy

Separating these costs lets the team ask whether growth is caused by real project scale, release style, course usage, or dependency choices.

C.9 Three Reusable Resource Templates¶

C.9.1 Small Research or Course Template¶

For labs, single courses, and prototypes.

Module	Budgeting Logic
Data processing	Person-days first, resources second
Annotation	Small, high-quality samples; increase review ratio
Training	Estimate experiment rounds before formal runs
Evaluation	Emphasize slices and reviews rather than leaderboard expansion
Teaching	List images and teaching-assistant hours separately

C.9.2 Medium Team Project Template¶

For multi-member collaboration, quarterly delivery, and internal or external users.

Module	Budgeting Logic
Data ingestion	Build fixed entry points and version strategy
Cleaning governance	Separate rerun cost and abnormal-review cost
Training and inference	Split formal training, tuning, and evaluation ledgers
Release maintenance	Treat benchmark and issue operations as long-term items
Compliance	Count approval, anonymization, and withdrawal flows as management cost

C.9.3 Open Benchmark or University Collaboration Template¶

For assets that require public release, course reproduction, and long-term maintenance.

Module	Budgeting Logic
Data versioning	Keep at least raw, curated, and release layers
Documentation	Do not omit cards, FAQs, or release notes
Leaderboard	Budget for review, recheck, and removal mechanisms
Teaching	List frozen semester versions and experiment scripts
Handoff	Reserve cleanup time for each semester or version

C.9.4 Example Estimation Order for a University Dataset Release¶

For a university collaboration dataset moving from organization and evaluation to public release, estimate in this order:

Ingestion and license verification, because release depends on it.
Cleaning, anonymization, and version packaging, because stable release depends on them.
Baselines and evaluation scripts, because comparable release depends on them.
Teaching images, leaderboard operations, and issue handling, because long-term survival depends on them.

This order is opposite to many teams' intuition, but it matches the idea that a dataset is not merely pre-training raw material. It is an engineering asset that consumes maintenance resources over time.

C.9.5 Budget Ledger Fields to Keep¶

Do not create a new spreadsheet from scratch every time. Maintain a continuous budget ledger.

Field	Description
budget_cycle	Quarter, semester, or version cycle
asset_scope	Dataset, benchmark, or course involved
planned_cost	Budgeted value
actual_cost	Actual value
variance_reason	Reason for variance
reusable_asset_output	Long-term asset created in this cycle
one_off_cost	One-time cost
maintenance_cost	Future maintenance cost

With these fields, a team can learn which investments become long-term assets and which expensive items are only reactive repairs.

C.10 Quarterly Reviews Should Not Ask Only "Did We Overspend?"¶

The common mistake in budget reviews is checking only whether spending exceeded the plan. More useful questions are:

Did overspending come mainly from training, rework, annotation, or maintenance?
Which previously invisible cost must now enter the budget table?
Which expenses created sustainable assets, and which merely patched holes?
Were teaching, public benchmark, and production versions improperly mixed?
Which layer should be controlled next quarter, rather than which number?

The purpose of a budget is not to turn the project into a finance game. It is to explain which kind of data value the resources are buying.

C.10.1 Three Charts Worth Keeping in Quarterly Reviews¶

For communicable budget reviews, keep three summary charts:

Lifecycle cost distribution: ingestion, cleaning, annotation, training, evaluation, and maintenance.
Budget-variance source: rework, call-volume growth, image maintenance, or human review.
Asset-creation comparison: which reusable versions, templates, or public assets were created by this quarter's spending.

These charts translate financial language back into engineering language.

C.11 Summary¶

This appendix establishes cost-estimation methods and resource templates for data-engineering projects.

First, costs must be split by lifecycle; otherwise teams see only training cost and miss governance cost.

Second, the easiest items to underestimate are not unit prices, but rework, review, maintenance, and teaching reproduction.

Third, mature cost management is not only about saving money. It makes the relationship between resource investment and data-asset value explainable.

References¶

Patterson D, Gonzalez J, Le Q, Liang C, Munguia L, Rothchild D, So D, Texier M, Dean J (2021) Carbon Emissions and Large Neural Network Training. arXiv preprint arXiv:2104.10350.

Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Catanzaro B (2021) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. arXiv:2104.04473.

Kwon W, Li Z, Zhuang S, Sheng Y, Zheng L, Yu C H, Gonzalez J E, Zhang H, Stoica I (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. In: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp 611-626. https://doi.org/10.1145/3600006.3613165.

Kubernetes Authors (2026) Kubernetes Documentation. https://kubernetes.io/docs/.

vLLM Project (2026) vLLM Documentation. https://docs.vllm.ai/.