Chapter 31: Architecture and Task Boundaries for Data Engineering Agents¶

Jun Yu; Zhili Wang; Zhongyi Liu

Abstract¶

Traditional data engineering pipelines depend heavily on engineers writing DAGs, configuring schedulers, monitoring job state, and debugging data quality incidents by hand. As data sources multiply, cleaning rules become more complex, and labeling demand grows, the marginal cost of manual maintenance rises quickly. Data engineering agents attempt to change this pattern by assigning repetitive, rule-heavy, multi-step decisions to agent systems.

But as agents receive more autonomy, a deeper question appears: where are the task boundaries? How autonomous should the agent be? Which decisions must pass a human gate? How can tool calls remain auditable? How should multiple agents divide responsibility without conflict?

This chapter defines a layered architecture for data engineering agents. It centers on six components: Planner, Tool Executor, Verifier, Human Gate, Memory, and Lineage. For each component, we define responsibility boundaries, communication protocols, and failure isolation strategies. The chapter inherits the reasoning, tool-use, and memory mechanisms from Chapters 18-20, and builds on the platform, versioning, and observability foundations from Chapters 24-26.

To keep the discussion concrete, the chapter uses DataAgent as a running reference. Its YAML orchestration, FlexAgent, NL2SQL sub-agent, Semantic Service, workspace, and A2A/SDK/CLI entry points map well to the path from "can call tools" to "configurable, auditable, and serviceable." The point is not to make one agent "smarter." It is to make a group of agents collaborate in a more orderly way.

Keywords¶

Data engineering; Data Engineering Agent; automated data engineering; permission governance; human-AI collaboration

Learning Objectives¶

After reading this chapter, you should be able to:

Explain why agent systems in data engineering need layered architecture.
Define the boundaries of Planner, Tool Executor, Verifier, Human Gate, Memory, and Lineage.
Design cross-layer communication protocols and failure isolation strategies.
Distinguish recommendation, semi-automatic, approval-based, and autonomous automation levels.
Build a minimum viable Data Engineering Agent and evaluate its return on investment.
Understand how DataAgent can act as a reference implementation for semantic layers, main/sub-agent collaboration, tool boundaries, and operational audit.

Scenario: When an Agent Crosses the Boundary¶

A data team spends three months building a "fully automatic data cleaning agent." During the first week after launch, the results look excellent. The agent identifies 80 percent of abnormal formats and fixes more than 200 field-level quality issues. In the second week, trouble begins. While fixing a timestamp field, the agent rewrites the entire table's updated_at column to the current time. Twelve downstream pipelines receive polluted data. Engineers spend three days finding the root cause: the agent's repair plan executed without human review, and the Verifier checked only format correctness, not semantic reasonableness.

The incident exposes three systemic problems:

Problem 1: Planner and Executor are merged. The planning module and execution module use the same model instance. After the Planner generates a "fix timestamp" plan, the Executor runs it immediately, with no independent checkpoint.

Problem 2: Human Gate is only decorative. The system has a review step, but it auto-approves after a 30-second timeout. During peak hours, engineers cannot review every item, so the gate becomes a 30-second delay.

Problem 3: Lineage records are incomplete. The lineage system records only that "table X changed." It does not record who initiated the change, which rule was used, original values, or whether validation happened before execution. Root cause analysis becomes guesswork.

Core Engineering Pain Points¶

Autonomy versus reliability. More autonomy increases automation benefit, but also raises recovery cost when something fails.
Unclear cross-layer responsibility. When the Planner is wrong, the Verifier misses an issue, or the Executor performs an unsafe operation, responsibility cannot be assigned to a concrete component.
Missing auditability. Without a unified lineage protocol, logs from different layers use different formats and cross-layer queries require manual stitching.
Poor Human Gate design. The gate either blocks pipeline efficiency or becomes a formality.

31.1 Why Data Engineering Needs Agents¶

31.1.1 The Manual Ceiling in Traditional Data Engineering¶

Four high-frequency task categories in LLM data engineering are naturally suited to agentization.

Parser exception repair. As sources grow from dozens to hundreds, PDF layout failures, HTML structure drift, API schema changes, and mixed encodings increase nonlinearly. Each new source does not require only one new rule; it requires a set of rules and coordination with existing rules. An agent can detect failure types, select alternate parsers, generate temporary repair rules, and submit a validation report (Mialon et al. 2023; Wang et al. 2023; Xi et al. 2023).

Cleaning-rule iteration. Data cleaning is not a one-time job. Business changes, model training feedback, and updated quality thresholds all trigger rule changes. In a traditional workflow, cleaning-rule backlogs grow faster than engineers can process them. An agent can generate rule candidates from sampled defects, validate them in a sandbox, and submit a diff report for human approval.

Evaluation writeback. When model evaluation finds a data quality problem, such as low accuracy in a domain slice, the feedback must flow back to data production. Today this often happens through chat messages and manual investigation, taking days. An agent can read evaluation reports, locate relevant data batches and production steps, generate repair suggestions, and route them to the right team (Nakano et al. 2021; Park et al. 2023).

Alert attribution. Data platforms generate many alerts every day: delayed jobs, abnormal volume, declining quality metrics. Human root-cause analysis requires logs, lineage, change records, and business context, often taking more than 30 minutes per alert. An agent can aggregate signals, generate candidate root causes, rank them by confidence, and reduce MTTR.

31.1.2 Benefit and Risk Matrix¶

Table 31-1 summarizes the corresponding comparison and engineering considerations.

Table 31-1: Benefit and risk assessment for agentizing high-frequency data engineering tasks.

Task type	Average manual time	Time after agentization	Risk level	Suggested automation level
Parser exception detection and attribution	30 min	2 min	Low	Semi-automatic: agent suggests, human confirms
Cleaning-rule generation	2 hr	10 min	Medium	Approval-based: sandbox validation plus human approval
Evaluation writeback localization	4 hr	15 min	Medium	Semi-automatic
Alert root-cause analysis	45 min	3 min	Low	Recommendation: agent outputs candidates, human decides
Bulk data repair execution	1 hr	5 min	High	Approval-based plus rollback plan
Schema migration	8 hr	30 min	High	Approval-based plus multi-level review

Agentization does not mean pursuing full automation everywhere. The correct automation level depends on task risk and manual effort. High-risk tasks still require approval gates and rollback even if the execution itself can be automated.

31.1.3 Engineering Evaluation Framework for Agentization¶

Teams need a structured decision framework before agentizing a task.

Dimension 1: Structurability. Agents are strongest not at "creative work" in the abstract, but at structured reasoning: decomposing a task into explicit steps. If the hard part is finding the right sequence of steps, the task is a good candidate. If the hard part is subjective judgment, such as judging the literary value of a passage, agentization is less suitable. Work such as ReAct, Tree of Thoughts, Graph of Thoughts, PAL, Self-Refine, and Reflexion shows that structured reasoning usually requires making thoughts, actions, programmatic intermediate representations, and self-feedback explicit rather than relying on one-shot generation (Yao, Zhao et al. 2023; Yao, Yu et al. 2023; Besta et al. 2024; Gao et al. 2023; Madaan et al. 2023; Shinn et al. 2023).

Dimension 2: Exception frequency. If exceptions dominate the task, agentization can become inefficient: the agent proposes a plan, humans reject it, the agent tries again, and the loop becomes slower than direct manual work. A useful rule of thumb: tasks with exception rates below 30 percent are suitable; above 70 percent, humans should stay in the lead.

Dimension 3: Reversibility of errors. Reversible operations, such as format normalization, can be authorized more aggressively. Irreversible operations, such as deletion, require multi-person approval.

Dimension 4: Marginal benefit of replacing manual work. Prioritize high-frequency, rule-based, time-consuming tasks such as alert attribution, format repair, and rule generation. Delay low-frequency, high-judgment, low-effort tasks such as architecture design and compliance policy definition.

Dimension 5: Compatibility with existing systems. Agentization should not require a full platform rewrite. A good agent architecture can read existing quality reports, call existing cleaning tools, and write into existing lineage systems.

31.1.4 Common Organizational Resistance¶

Resistance is often organizational rather than technical.

Trust crisis: "agents are unreliable." Engineers have seen too much confident nonsense from AI systems. Start in shadow mode, where the agent observes and recommends without acting. Let engineers compare agent decisions with their own over several weeks.

Career anxiety: "automation will replace me." Make the positioning explicit: the agent removes repetitive work, not engineering judgment. Engineers move from rule executors to rule designers and exception specialists.

Efficiency concern: "approval will slow us down." Human Gate design must optimize for speed: present enough context, make decisions easy, and auto-pass low-risk operations through policy rather than through accidental neglect.

31.1.5 From Abstract Capability to Engineering Anchor: Why DataAgent¶

If Agentic Data Engineering is discussed only conceptually, it is easy to reduce it to "let the model call more tools." That misses the real difficulty. Tool calling is not the hardest part; the hard part is keeping tool calls inside configurable, verifiable, replayable, and deliverable system boundaries. Research on MRKL, Toolformer, Gorilla, and ToolLLM similarly shows that tool capability becomes engineering-usable only after tools are registered, selected, parameterized, and verified (Karpas et al. 2022; Schick et al. 2023; Patil et al. 2023; Qin et al. 2024).

DataAgent is useful as an engineering anchor because it puts several key questions into one runnable framework:

YAML as agent. Models, tools, scenario prompts, workspace, databases, and semantic services enter declarative configuration, reducing migration cost from experiment to application.
Main/sub-agent collaboration. The main agent understands business questions and organizes answers. The NL2SQL sub-agent performs structured querying, so the main agent does not guess tables, fields, or SQL directly.
Semantic layer enhancement. The Semantic Service turns tables, fields, metric definitions, and business descriptions into retrievable schema clues, aligning natural-language questions to business semantics before SQL generation.
Result assetization. SQL, CSV, reports, and trajectories are written to the workspace. An answer becomes a reviewable, regression-testable, auditable data asset rather than a one-off chat response.
Service boundaries. CLI, Python SDK, and A2A Server entry points let DataAgent evolve from a local tool into a reusable enterprise data capability.

This chapter uses DataAgent as a reference for how the six-layer architecture lands in practice. The full enterprise semantic query assistant appears later as Part 14 Project 15.

31.2 Agentic Data Engineering Architecture: Six Layers¶

31.2.1 Architecture Overview¶

Task boundaries must be constrained through layering. Each layer owns one type of responsibility, communicates through structured protocols, and does not share internal mutable state with other layers. Multi-agent conversation frameworks and surveys of autonomous agents both emphasize that complex agent systems need role separation, message protocols, and state isolation to reduce runaway risk (Wu et al. 2023; Wang et al. 2023; Xi et al. 2023).

Figure 31-1 illustrates the corresponding workflow or structure.

Figure 31-1: Six-layer architecture for data engineering agents.

Table 31-2 summarizes the corresponding comparison and engineering considerations.

Table 31-2: Responsibility boundaries and failure modes of six layers.

Layer	Core responsibility	Boundary that must not be crossed	Typical failure mode
Planner	Generate task plans, split substeps, select tools	Must not execute tools directly	Bad plan not detected before execution
Tool Executor	Call tools and pass parameters according to plan	Must not change the plan or skip verification	Wrong parameters, unhandled tool output
Verifier	Validate quality and semantics of execution results	Must not create new plans or modify data	Missed anomalies, excessive alerts
Human Gate	Approve high-risk operations	Must not auto-approve or rewrite plans	Timeout auto-approval, reviewer fatigue
Memory	Store agent state, decisions, preferences	Must not participate in the real-time execution path	Memory pollution, stale state
Lineage	Record complete lineage and audit trail	Must not block execution flow	Missing fields, inconsistent log format

31.2.2 Planner: Task Decomposition and Dependency Orchestration¶

The Planner is the "brain" that converts a high-level intent, such as "clean all abnormal values in table A," into atomic executable steps. The core challenge is not whether it can decompose the task, but whether it chooses the right granularity.

Too coarse: "fix all quality issues" is not executable. Too fine: repairing 100 rows as 100 independent steps is inefficient and makes dependencies hard to manage.

Planner design principles:

Step atomicity. Each step must be completable by one tool call with clear inputs and outputs.
Explicit dependencies. Data dependencies between steps must be declared in the plan.
Rollback path. Every write step must include a rollback plan.
Cost estimation. The Planner should estimate time and resource cost for each step and trigger review when thresholds are exceeded.

31.2.3 Tool Executor: Tool Registry and Safe Calls¶

The Tool Executor maps abstract plan steps to concrete tool calls. It maintains a Tool Registry where each tool declares capability boundaries, parameter schema, risk level, required permission, and rollback support.

Table 31-3 summarizes the corresponding comparison and engineering considerations.

Table 31-3: Example agent tool registry.

Tool	Capability	Risk	Required permission	Rollback support
`data_profiler`	Generate data quality report	Low	read	N/A
`field_fixer`	Fix abnormal values in one field	Medium	write	Yes, original values saved
`table_rewriter`	Rewrite table data in bulk	High	write	Yes, snapshot restore
`schema_migrator`	Execute schema changes	High	admin	Yes, version rollback
`rule_generator`	Generate cleaning rules from samples	Medium	read+write	Yes, rule versioning
`lineage_query`	Query lineage graph	Low	read	N/A
`alert_aggregator`	Aggregate alerts	Low	read	N/A
`pipeline_trigger`	Trigger downstream pipelines	High	admin	No, human confirmation required

Hard constraints for the Executor:

It must not call tools outside the registry.
It must check Human Gate approval before high-risk tool calls.
It must write all input parameters and outputs to lineage logs.

31.2.4 Verifier: Multi-Layer and Semantic Validation¶

The Verifier checks more than "did the operation run." It validates results at three levels:

Format layer. Field type, non-null constraints, and value ranges follow the schema.
Statistical layer. Distribution statistics do not drift unexpectedly after repair.
Semantic layer. Cross-field consistency constraints still hold, such as start_time earlier than end_time.

Verification results have three grades: Pass, Warn, and Block. Block-level anomalies must go to the Human Gate and cannot be retried automatically.

31.2.5 Human Gate: Design Principles¶

Human Gate is the layer most easily hollowed out. Effective design follows four principles:

Tiered approval. Not every operation needs human approval. Route by risk level and impact scope.
Timeout escalation, not auto-approval. A timeout should notify backup reviewers, pause related pipelines, or move the system to a safe state.
Complete review context. Approval must include original data, before/after diff, impact analysis, and rollback plan.
Traceable decisions. Reviewer, time, decision basis, and outcome must be written to lineage.

31.2.6 Memory: State Persistence and Context Management¶

Memory stores cross-session state:

Short-term memory. Current task context, intermediate results, pending approvals.
Long-term memory. Historical decision patterns, user preferences, common incident handling.
Shared memory. State needed by multiple agents, such as table owner or ongoing schema changes.

The main governance risk is memory pollution. Stale, wrong, or conflicting memory can degrade agent decisions. Memory must support TTL and version tags so that decisions are not based on obsolete state.

31.2.7 Lineage: End-to-End Operation Audit¶

Lineage records not only where data came from, but also who changed it, why, what the old value was, and what validation happened afterward. A complete record should include:

operation_id: unique operation identifier.
timestamp: operation time.
agent_id: executing agent instance.
tool_name: called tool.
input_snapshot: hash or reference before operation.
output_snapshot: hash or reference after operation.
plan_step_id: corresponding Planner step.
verification_result: Verifier result.
human_approval: whether human approval happened and who approved.

31.2.8 Cross-Layer Communication Protocols¶

The six-layer architecture works only if adjacent layers communicate through structured protocols and cannot bypass one another.

Planner to Executor:

Listing 31-1 provides a JSON data example.

{
  "step_id": "step_20240601_042",
  "tool_name": "field_fixer",
  "tool_params": {
    "target_table": "user_events",
    "target_field": "event_date",
    "fix_strategy": "date_normalization",
    "expected_format": "yyyy-MM-dd"
  },
  "preconditions": ["step_20240601_041.status == 'completed'"],
  "rollback_plan": {
    "method": "snapshot_restore",
    "snapshot_id": "snap_20240601_041"
  },
  "cost_estimate": {
    "estimated_rows": 5000,
    "estimated_duration_sec": 30
  },
  "risk_level": "medium"
}

Listing 31-1: JSON data example.

Executor to Verifier:

Listing 31-2 provides a JSON schema example.

{
  "step_id": "step_20240601_042",
  "tool_name": "field_fixer",
  "status": "completed",
  "input_snapshot_hash": "sha256:abc123...",
  "output_snapshot_hash": "sha256:def456...",
  "affected_rows": 4823,
  "affected_fields": ["user_events.event_date"],
  "execution_duration_ms": 28500,
  "warnings": ["3 rows had ambiguous date formats, defaulted to yyyy-MM-dd"]
}

Listing 31-2: JSON schema example.

Verifier to Human Gate: the Verifier submits a VerificationReport containing format, statistical, and semantic checks, confidence scores, and recommended action.

Hard constraints:

No skipped middle layer. Executor cannot send results directly to Human Gate; Planner cannot query Lineage directly without the Memory-mediated interface.
No shared mutable state. Layers communicate by messages, not shared memory.
Timeout and degradation. Each layer call must have a timeout. On timeout, the system follows predefined degradation rules based on risk, usually escalating to Human Gate or safely terminating.

31.2.9 Failure Isolation and Degradation¶

Table 31-4 summarizes the corresponding comparison and engineering considerations.

Table 31-4: Degradation strategies for six-layer failures.

Failed layer	Failure type	Degradation strategy	Recovery condition
Planner	Plan generation timeout or unreasonable plan	Use templated plan based on known successful plans	Resume after three successful manual plans
Executor	Tool call failure	Skip step and mark for manual handling	Tool recovers and validation passes
Verifier	Verification service unavailable	Escalate all operations to Human Gate	Verification service recovers
Human Gate	Reviewer unavailable	Escalate to backup reviewer or pause nonurgent operations	Reviewer confirms
Memory	Memory read failure	Use default config and mark memory unavailable	Memory storage recovers
Lineage	Log write failure	Cache logs locally and backfill later	Lineage storage recovers

The key principle is that degradation must increase human involvement. When automation is unreliable, the system should become more conservative, not continue in an unsafe automatic mode.

31.2.10 Where DataAgent Fits¶

DataAgent can be understood as an Agentic Data Engineering framework for enterprise data tasks. It does not put every capability into one large model. Instead, configuration, tools, sub-agents, semantic layer, and workspace turn natural-language tasks into manageable engineering units.

Table 31-5 summarizes the corresponding comparison and engineering considerations.

Table 31-5: Mapping DataAgent to the six-layer architecture.

Architecture layer	DataAgent capability	Meaning for data engineering agents
Planner	FlexAgent, ReAct main agent, `SCENARIO`, task prompts	Convert business questions into executable steps and decide when to call tools or sub-agents
Tool Executor	`TOOLS.local_functions`, MCP tools, A2A tools, `nl2sql_sub_agent_tool`	Register tool calls explicitly and avoid bypassing tool boundaries
Verifier	NL2SQL Validator, SQL explain, metadata match, executor preview	Add structured checks before SQL execution and answer interpretation
Human Gate	Human feedback nodes, approval workflows, external approval systems	Keep high-risk queries, writes, and cross-system triggers under human confirmation
Memory	Context, message history, history writer, memory indexer	Preserve session state, historical decisions, and reusable context
Lineage	Context trajectory, workspace files, SQL/CSV/report artifacts, tool return records	Leave replayable evidence for audit, review, and regression tests

DataAgent is not claimed to contain every production governance capability. Its value here is showing the carrying points a practical data engineering agent needs. For example, the NL2SQL sub-agent separates Perceptor, Generator, Validator, Reflector, Executor, and Selector. This is closer to production than allowing the main agent to generate SQL and answer directly, because failures can be located: schema recall failure, SQL generation failure, validation failure, execution failure, or result interpretation failure.

DataAgent has three roles in this part:

Architecture reference. Chapter 31 maps it to the six layers.
Capability base. Chapters 32-34 can connect collection, cleaning, evaluation, and DataOps capabilities as tools, sub-agents, or A2A services.
Governance sample. Chapter 35 extends workspace isolation, path allowlists, tool authorization, audit logs, and service authentication from DataAgent-like runtime boundaries.

31.3 Task Boundaries and Automation Levels¶

31.3.1 Four-Level Automation Model¶

Not all data engineering tasks are suitable for fully automatic execution. This chapter proposes a four-level automation model, where each level corresponds to different agent permissions and human involvement.

Table 31-6 summarizes the corresponding comparison and engineering considerations.

Table 31-6: Four-level automation matrix.

Level	Name	Agent role	Human role	Typical tasks	Hard constraint
L1	Recommendation	Analyze and suggest	Decide and execute	Alert attribution, quality report interpretation	Agent performs no writes
L2	Semi-automatic	Analyze and draft operations	Review draft and confirm execution	Parser repair, cleaning-rule suggestions	Writes require human confirmation
L3	Approval-based	Analyze, propose, sandbox validate	Approve plan and monitor execution	Schema migration, bulk repair	High-risk operations require multi-person approval
L4	Autonomous	Complete bounded flow independently	Post-hoc audit and sampling	Format normalization, deduplication	Scope must be predefined

31.3.2 Steps That Always Require Human Review¶

The following steps must retain human review at any automation level:

Schema changes. Adding, deleting, renaming, or changing field types has broad impact and is difficult to fully roll back.
Data deletion. Physical deletion requires confirmation and recorded rationale.
Rule release. A new cleaning rule must be approved before moving from sandbox to production.
Cross-system action. Triggering downstream pipelines or notifying external teams requires confirmation.
Cost-sensitive operations. Full-table scans or large reprocessing jobs above cost thresholds require approval.

31.3.3 Human-AI Collaboration Flow¶

Figure 31-2 illustrates the corresponding workflow or structure.

Figure 31-2: Human-AI collaboration flow by risk level.

31.4 Minimum Viable Data Engineering Agent¶

31.4.1 MVP Definition¶

A minimum viable Data Engineering Agent should complete this loop independently:

Read a data quality report from a tool such as data_profiler.
Generate a repair plan by ranking defects and producing field-level repair steps.
Call cleaning tools such as field_fixer.
Submit a diff report comparing before/after snapshots and marking repair items and risks.
Wait for acceptance by sending the report to Human Gate before archival.

31.4.2 Technology Choices¶

Table 31-7 summarizes the corresponding comparison and engineering considerations.

Table 31-7: Technology choices for MVP components.

Component	Minimal implementation	Recommended implementation
Planner	Rule-based step templates	LLM plus few-shot prompts and tool descriptions
Executor	Python function wrappers	Structured Tool Registry plus parameter validation
Verifier	SQL constraint checks	Great Expectations / Soda plus custom semantic rules
Human Gate	Slack/DingTalk approval message	Dedicated approval dashboard plus timeout escalation
Memory	JSON file persistence	Vector database plus structured state storage
Lineage	Operation log table	OpenLineage / Marquez plus custom extensions

31.4.3 A Semantic Query MVP With DataAgent¶

The previous MVP focuses on data quality repair. In DataAgent's semantic query scenario, the MVP can be narrower and lower risk: the agent does not modify production data. It converts natural language to structured queries, writes results to disk, and generates a report.

Listing 31-3 provides a data-agent task flow example.

Business question
  -> main agent understands intent
  -> Semantic Service retrieves candidate tables, fields, and metric definitions
  -> NL2SQL sub-agent generates, validates, and executes SQL
  -> SQL and CSV are written to workspace
  -> main agent explains or reports from results
  -> trajectory, tool returns, and artifacts enter audit records

Listing 31-3: Data-agent task flow example.

This usually maps to L1-L2. The agent may read metadata, generate SQL, execute read-only queries, and save results, but it should not rewrite schemas, publish metric definitions, or trigger downstream production pipelines.

Table 31-8 summarizes the corresponding comparison and engineering considerations.

Table 31-8: Configuration gates for a DataAgent semantic query MVP.

Configuration surface	Typical content	MVP gate
`MODEL`	chat model, temperature, base URL, API key	Config loads; secrets are not stored in repository or written config
`SCENARIO`	task description, tool-call constraints, output format	Database queries must go through `nl2sql_sub_agent_tool`
`TOOLS`	local functions, MCP tools, A2A tool registration	Register only necessary tools; do not expose raw SQL executor to main agent
`DATABASE` / `METAVISOR`	database connection, Semantic Service address, candidate retrieval config	Use read-only account first and verify schema recall and SQL scope
`WORKSPACE`	artifact path and allowed path	SQL, CSV, and reports can be written only to authorized directories

31.4.4 MVP Evaluation Metrics¶

Track the following after launch:

Repair coverage: defects repaired by agent divided by total defects.
Repair accuracy: repairs requiring no human correction divided by total agent repairs.
Human approval pass rate: Human Gate approvals divided by submitted approvals. Very high may mean the gate is too loose; very low may mean poor agent quality or bad thresholds.
Mean time to repair: end-to-end time from detection to repair.
Rollback rate: proportion of agent operations rolled back, ideally near zero.

31.4.5 End-to-End MVP Example¶

Input quality report for orders:

order_date: 3 percent of records use MM/dd/yyyy instead of yyyy-MM-dd, about 1,500 rows.
customer_email: 5 percent contain invisible characters, leading/trailing spaces, or full-width characters, about 2,500 rows.
amount: 0.5 percent are negative, possibly refunds, about 250 rows.

Planner output

Table 31-9 summarizes the corresponding comparison and engineering considerations.

Table 31-9: Planner output.

Step	Tool	Target	Estimated rows	Risk
Step 1	`field_fixer`	Normalize `order_date`	~1500	Low
Step 2	`field_fixer`	Clean `customer_email`	~2500	Low
Step 3	`data_profiler`	Analyze context for negative `amount`	~250	Low, read-only
Step 4	`rule_generator`	Generate handling suggestion from negative-value analysis	~250	Medium

The Executor automatically runs Steps 1 and 2. Step 3 finds that 92 percent of negative amounts have corresponding refund IDs and 8 percent remain unexplained.

Verifier results:

Step 1: order_date compliance improves from 97 percent to 100 percent with no side effects. Pass.
Step 2: customer_email anomaly rate drops from 5 percent to 0.1 percent. Pass.
Step 3: analysis report is logically consistent. Pass.

Human Gate approves Step 4's suggestion to mark the unexplained 8 percent as pending human review. Lineage records the full chain: who changed what, with which tool, before/after hashes, validation results, and approval.

31.4.6 MVP Limitations¶

MVPs must not be mistaken for production systems:

Limited plan templates. The Planner may fail on quality problems not covered by templates.
Incomplete verification. Basic format and statistical checks may miss deep semantic errors, such as currency-unit confusion.
Human Gate bottleneck. Early agents often submit many items for approval; poor workflow design can make the gate the slowest layer.
Weak memory. MVP memory is often simple, so the agent cannot reuse historical experience across sessions.

31.5 Case Review: From MVP to Production Agent¶

An e-commerce data team evolves through three stages:

Stage 1: MVP. The agent reads quality reports, generates repair plans, calls cleaning tools, and submits diff reports. In the first month it handles 60 percent of field-level anomalies with about 78 percent accuracy. The main weakness is compound anomalies: one field has both format and semantic issues, and the agent produces contradictory repair steps.

Stage 2: semantic verification. Statistical distribution checks and cross-field consistency are added. Accuracy rises to 91 percent, but the Verifier becomes too sensitive and queues too many warnings for Human Gate.

Stage 3: Human Gate tiering. The team applies the four-level automation model. Field-level format fixes become L4 autonomous within a bounded scope. Table-level schema changes remain L3 approval-based. The daily approval queue drops from more than 200 items to about 15.

Key Lessons¶

Do not chase full automation immediately. The MVP goal is to prove that the agent can do the right thing, not everything.
Verifier thresholds require tuning. Too loose misses problems; too strict causes alert fatigue.
Human Gate design determines deployability. Bad approval experience makes engineers bypass the gate; overly loose approval makes the gate meaningless.

Why Layered Architecture Beats a Monolithic Agent¶

In the opening incident, Planner and Executor shared one model instance, with no independent Verifier or real Human Gate. A hallucination or wrong judgment could pass straight into the data layer.

The value of layering is defense in depth:

Planner failure: Verifier detects unreasonable plans before execution.
Executor failure: Verifier checks format, distribution, and semantics and blocks or rolls back.
Verifier miss: Human reviewers may still notice obvious diff anomalies.
Human Gate mistake: Lineage preserves who approved what, when, and why.

The architecture assumes every layer can fail and prevents one failure from becoming catastrophic.

Production Agent Maturity Model¶

Table 31-10 summarizes the corresponding comparison and engineering considerations.

Table 31-10: Data engineering agent maturity model.

Maturity	Characteristics	Automation	Human intervention	Typical timeline
L0: Manual	All tasks done by humans	0%	100%	Baseline
L1: Recommendation	Agent analyzes and suggests; human executes	< 30%	Frequent	1-2 months
L2: Semi-automatic	Agent executes low-risk tasks; high-risk needs approval	30-60%	Daily	3-6 months
L3: Conditional autonomy	Agent operates within bounded scope; humans handle exceptions	60-85%	Occasional	6-12 months
L4: High autonomy	Agent manages most tasks; humans audit afterward	> 85%	Exceptional	12+ months

The bottleneck is often organizational trust rather than technology. Engineers need time to observe behavior before granting more authority.

31.6 Checklist: Data Engineering Agent Architecture Review¶

[ ] Are Planner and Executor separated at the code level?
[ ] Does every write step include a rollback plan?
[ ] Does every registered tool declare risk level and required permission?
[ ] Does the Verifier cover format, statistical, and semantic checks?
[ ] Does Human Gate use timeout escalation rather than auto-approval?
[ ] Does every approval request include diff, impact scope, and rollback context?
[ ] Does Lineage cover who, when, which tool, what changed, and validation result?
[ ] Does Memory use TTL to avoid stale decisions?
[ ] Are automation levels explicitly assigned?
[ ] Is there an alert threshold for agent operation rollback rate?

31.7 Chapter Links¶

Chapters 18-20: reasoning, tool use, and agent memory provide foundations for Planner and Memory.
Chapters 24-26: DataOps flywheel, data versioning, and platform observability support Lineage and Verifier.
Chapter 32: applies this architecture to collection, parsing, and cleaning.
Chapter 33: extends Human Gate design into labeling and evaluation.
Chapter 34: extends Lineage and Memory into operational autonomy.
Chapter 35: deepens permissions, security, and human-AI collaboration.
Part 14 Project 15: builds an enterprise semantic query assistant with DataAgent.

31.8 Further Reading and Discussion¶

Common Architecture Mistakes¶

Mistake 1: "The stricter the Verifier, the better." Overly strict thresholds create alert floods and reviewer fatigue. Start by blocking only clear data damage, then tighten thresholds based on false-positive and false-negative measurements.

Mistake 2: "More approvers means safer." Multi-person approval increases safety but also delay. Approval count should scale with risk, not be uniform.

Mistake 3: "Memory is optional." Without memory, the agent reasons from scratch for similar issues and may make inconsistent decisions. Even a minimal JSON record of historical decision patterns can have high ROI.

Mistake 4: "Lineage belongs to operations, not agent design." Lineage is not only for postmortems. The agent needs lineage before planning so it can estimate downstream impact.

Integration Path With Existing Data Platforms¶

Start read-only. Deploy L1 recommendation agents first.
Validate in sandbox. Test write behavior against production snapshots.
Pilot on one noncritical table. Grant field-level writes and observe for two to four weeks.
Expand gradually. Increase scope by data classification and automation level.
Deploy platform-wide. Enable agents for eligible tables while retaining Human Gate as the final safety layer.

Chapter Summary¶

This chapter organized the core problems, processing flow, and acceptance criteria for "architecture and task boundaries of data engineering agents" in large-model data engineering. Its contribution is to place concepts, data objects, quality signals, and engineering delivery into one narrative so readers can judge which steps must be explicitly recorded and which results must be verified through sampling, evaluation, or audit.

The methods in this chapter should be applied by jointly considering data sources, business goals, model capability, cost budget, and compliance requirements. For scenarios involving sensitive information, cross-system calls, automated decisions, or public release, human review, version freeze, permission control, and exception rollback should be retained. Example flows should not be generalized directly into production commitments; this is also consistent with production-grade MLOps and AI risk-governance requirements for traceability, risk classification, and human oversight.

For boundary design, the chapter proposed four automation levels: recommendation, semi-automatic, approval-based, and autonomous. It clarified that schema changes, data deletion, rule release, and cross-system actions must require human review, and it proposed assessing whether a task is suitable for agentization through dimensions such as structurability, exception rate, and reversibility. Finally, the chapter demonstrated two minimum viable agent paths: data-quality repair and DataAgent semantic query, and used the maturity model to show that the bottleneck from L0 to L4 is mainly organizational trust rather than technology alone.

References¶

Besta M, Blach N, Kubicek A, Gerstenberger R, Podstawski M, Gianinazzi L, Gajda J, Lehmann T, Niewiadomski H, Nyczyk P, Hoefler T (2024) Graph of Thoughts: Solving Elaborate Problems with Large Language Models. In: Proceedings of the AAAI Conference on Artificial Intelligence 38(16):17682-17690. https://doi.org/10.1609/aaai.v38i16.29720.

Gao L, Madaan A, Zhou S, Alon U, Liu P, Yang Y, Callan J, Neubig G (2023) PAL: Program-aided Language Models. In: Proceedings of the 40th International Conference on Machine Learning, pp 10764-10799. arXiv:2211.10435.

Karpas E, Abend O, Belinkov Y, Lenz B, Lieber O, Ratner N, Shoham Y, Bata H, Levine Y, Leyton-Brown K, Muhlgay D, Rozen N, Schwartz E, Shashua A, Shuster K, Tenenbaum J, Wolf L, Zettlemoyer L, Riedel S (2022) MRKL Systems: A Modular, Neuro-Symbolic Architecture That Combines Large Language Models, External Knowledge Sources and Discrete Reasoning. arXiv preprint arXiv:2205.00445.

Kreuzberger D, Kuhl N, Hirschl S (2023) Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 11:31866-31879. arXiv:2205.02302.

Madaan A, Tandon N, Gupta P, Hallinan S, Gao L, Wiegreffe S, Alon U, Dziri N, Prabhumoye S, Yang Y, Gupta S, Majumder B P, Hermann K, Welleck S, Yazdanbakhsh A, Clark P (2023) Self-Refine: Iterative Refinement with Self-Feedback. In: Advances in Neural Information Processing Systems 36. arXiv:2303.17651.

Mialon G, Dessi R, Lomeli M, Nalmpantis C, Pasunuru R, Raileanu R, Roziere B, Schick T, Dwivedi-Yu J, Celikyilmaz A, Grave E, LeCun Y, Scialom T (2023) Augmented Language Models: A Survey. Transactions on Machine Learning Research.

Nakano R, Hilton J, Balaji S, Wu J, Ouyang L, Kim C, Hesse C, Jain S, Kosaraju V, Saunders W, Jiang X, Cobbe K, Eloundou T, Krueger G, Button K, Knight M, Chess B, Schulman J (2021) WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

Park J S, O'Brien J C, Cai C J, Morris M R, Liang P, Bernstein M S (2023) Generative Agents: Interactive Simulacra of Human Behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, Article 2. https://doi.org/10.1145/3586183.3606763.

Patil S G, Zhang T, Wang X, Gonzalez J E (2023) Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint arXiv:2305.15334.

Qin Y, Liang S, Ye Y, Zhu K, Yan L, Lu Y, Lin Y, Cong X, Tang X, Qian B, Zhao S, Tian R, Xie R, Zhou J, Gerstein M, Li D, Liu Z, Sun M (2024) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In: International Conference on Learning Representations.

Schick T, Dwivedi-Yu J, Dessi R, Raileanu R, Lomeli M, Hambro E, Zettlemoyer L, Cancedda N, Scialom T (2023) Toolformer: Language Models Can Teach Themselves to Use Tools. In: Advances in Neural Information Processing Systems 36. arXiv:2302.04761.

Shinn N, Cassano F, Gopinath A, Narasimhan K, Yao S (2023) Reflexion: Language Agents with Verbal Reinforcement Learning. In: Advances in Neural Information Processing Systems 36. arXiv:2303.11366.

Wang L, Ma C, Feng X, Zhang Z, Yang H, Zhang J, Chen Z, Tang J, Chen X, Lin Y, Zhao W X, Wei Z, Wen J-R (2023) A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2308.11432.

Wu Q, Bansal G, Zhang J, Wu Y, Li B, Zhu E, Jiang L, Zhang X, Zhang S, Liu J, Awadallah A H, White R W, Burger D, Wang C (2023) AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv preprint arXiv:2308.08155.

Xi Z, Chen W, Guo X, He W, Ding Y, Hong B, Zhang M, Wang J, Jin S, Zhou E, Zheng R, Fan X, Wang X, Xiong L, Zhou Y, Wang W, Jiang C, Zou Y, Liu X, Yin Z, Dou S, Weng R, Cheng W, Zhang Q, Qin W, Zheng Y, Qiu X, Huang X, Gui T (2023) The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint arXiv:2309.07864.

Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K, Cao Y (2023) ReAct: Synergizing Reasoning and Acting in Language Models. In: International Conference on Learning Representations. arXiv:2210.03629.

Yao S, Yu D, Zhao J, Shafran I, Griffiths T L, Cao Y, Narasimhan K (2023) Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In: Advances in Neural Information Processing Systems 36. arXiv:2305.10601.