Chapter 34: DataOps Agents and Platform Autonomy¶

Yang Luo; Zhili Wang; Jun Yu

Abstract¶

Data platform operations have long been treated as necessary but low-value labor: monitoring alerts, debugging incidents, rolling back versions, and analyzing cost. Each task consumes engineering time, yet none directly creates business value. As data platforms scale to hundreds of pipelines, PB-level storage, and thousands of tables, manual operations reach their limit. Alert fatigue causes missed signals, root cause analysis depends on personal experience, rollback decisions become hesitant, and cost black holes go unnoticed.

The goal of DataOps Agents is to upgrade the data platform from "being operated" to "self-operating." Agents aggregate alerts and locate root causes, generate rollback and repair plans for approval, analyze cost anomalies and propose optimizations, and draft tickets and postmortems. Autonomy does not mean unattended operation. High-risk actions such as data rollback, index rebuild, and pipeline restart must pass approval gates.

This chapter builds on Chapters 24-26: DataOps flywheel, version management, experiment tracking, and platform observability. It turns manual operations into agent-driven autonomous loops.

Keywords¶

DataOps Agent; platform autonomy; root-cause localization; data rollback; cost governance; operations automation

Learning Objectives¶

After reading this chapter, you should be able to:

Design an agent pipeline from alert aggregation to root cause analysis.
Understand how agents generate rollback and repair plans and route them through approval.
Analyze cost anomalies such as storage growth, reprocessing cost, and GPU feeding failures.
Build workflows for automatic ticket generation and postmortem drafts.
Evaluate MTTR improvement from DataOps Agents in production platforms.

Scenario: An Alert Storm at 3 A.M.¶

At 3:14 a.m., a data platform monitoring system sends the first alert: user_behavior_etl is delayed by more than 30 minutes. During the next 15 minutes, alerts arrive in a flood:

3:17: recommendation_feature fails because it depends on user_behavior_etl.
3:19: training_data_export has abnormal volume: expected 2 million rows, produced 300,000.
3:22: model_training_pipeline pauses due to missing training data.
3:25: storage usage spikes; intermediate tables grow to three times the usual size.
3:28: the labeling platform errors because some tasks reference rolled-back data batches.

The on-call engineer wakes up and starts checking alerts one by one. The alerts are scattered across systems, with no aggregation and no context. It takes 40 minutes to locate the root cause: an upstream business database schema change added a required field, causing an ETL WHERE condition to filter out 85 percent of records. The schema-change email had arrived eight hours earlier and sat unread.

Core Engineering Pain Points¶

No alert aggregation or prioritization. Twelve alerts have one root cause, but the engineer must inspect them one by one.
Root cause analysis depends on personal experience. The engineer who knows the pipeline is on vacation.
Rollback decisions are uncertain. No system suggests whether to roll back, which version to use, or which downstream systems will be affected.
Cost anomalies are not tracked in time. Storage growth and reprocessing cost appear on the monthly bill after losses are real.

34.1 Alert-to-Root-Cause Agents¶

34.1.1 Alert Aggregation and Prioritization¶

The first capability of a DataOps Agent is intelligent alert aggregation. The agent reads four data categories and turns scattered alerts into root-cause candidates.

Figure 34-1 illustrates the corresponding workflow or structure.

Figure 34-1: Alert-to-root-cause agent flow.

Data sources:

Metrics: job latency, volume, error rate, resource usage.
Logs: task execution logs, stack traces, data quality check logs.
Lineage: upstream dependencies, field-level lineage, data flow graph.
Change records: code commits, configuration changes, schema changes, dependency upgrades.

34.1.2 Root Cause Candidate Generation¶

The agent aligns timelines and uses causal signals to generate candidates.

Table 34-1 summarizes the corresponding comparison and engineering considerations.

Table 34-1: Root cause candidate types and detection logic.

Root cause type	Pattern	Detection logic	Confidence
Upstream schema change	Change precedes first alert and affects all downstream tasks	Cross-check change records and alert timeline	0.90
Volume spike/drop	Volume deviates > 3 sigma from historical mean	Time-series anomaly detection	0.85
Insufficient resources	CPU or memory reaches limit	Correlate resource metrics with job performance	0.80
Code defect	Alert coincides with recent deployment	Cross-check deployment records and alert timeline	0.75
External dependency failure	Multiple jobs fail while depending on the same API	Dependency-graph topology analysis	0.70

The agent outputs three to five candidates ranked by confidence. Each candidate includes evidence: log snippets, metric charts, and change records. High-confidence candidates, such as above 0.85, may receive direct repair suggestions. Medium and low-confidence candidates require human confirmation.

34.1.2.1 Confidence Calibration for Root Cause Analysis¶

The confidence score itself must be calibrated. If the agent says "90 percent confidence," is it actually correct 90 percent of the time? Overconfident scoring will erode engineer trust.

Calibration methods:

Historical backtesting. Review the previous month's root-cause candidates and compare them with confirmed causes.
Calibration curve. Bucket confidence from 0 to 1 and compute actual accuracy in each bucket.
Bias correction. If high-confidence buckets are less accurate than claimed, show corrected confidence in alerts.

Confidence-to-action mapping

Table 34-2 summarizes the corresponding comparison and engineering considerations.

Table 34-2: Confidence-to-action mapping.

Raw confidence	Calibrated confidence	Automation action
> 0.90	Usually consistent	Push root cause and repair suggestion
0.75-0.90	Check for bias	Push candidate and recommend confirmation
0.50-0.75	Usually lower than raw	Push candidate and require human analysis
< 0.50	Usually lower than raw	Push alert aggregation only; no root-cause guess

34.1.2.2 Cross-System Root Cause Correlation¶

Data platform incidents often span upstream ETL, storage, downstream consumers, and schedulers. Analysis cannot stay inside one log system.

Implementation strategies:

Unified timeline. Align all logs to one timezone and one minute-level event sequence.
Dependency graph overlay. Overlay alerts on the lineage graph to show propagation.
Change-event correlation. Overlay deployments, config changes, and schema changes on the timeline.

If an alert cluster starts within five minutes of a change event, the agent should mark that change as highly suspicious even before direct proof is available.

34.1.3 Alert Fatigue Governance¶

Alert fatigue is one of the most common anti-patterns in data platform operations. When alerts are too frequent, engineers start ignoring them and miss real incidents.

Aggregation strategies:

Causal aggregation. Merge alerts from the same root cause into one group.
Time-window aggregation. Merge repeated alerts of the same type within a window, such as five minutes, and attach count and trend.
Dynamic priority adjustment. Adjust priority based on business impact and data volume. Low-priority 3 a.m. alerts can wait until working hours.

Table 34-3 summarizes the corresponding comparison and engineering considerations.

Table 34-3: Example alert aggregation effect.

Raw alerts	Aggregated alerts	Aggregation rate	Daily alerts handled by engineers
200+	15-20	~90%	From 50+ to 12
500+	25-35	~93%	From 80+ to 18

34.2 Version Rollback and Repair Plan Agents¶

34.2.1 Automatic Rollback Plan Generation¶

After a data quality issue is confirmed, the agent generates rollback and repair plans. Rollback is not simply "return to the previous version." It must account for dependencies, data loss, and repair cost.

Rollback plan generation:

Impact analysis. Use lineage to identify downstream tables and jobs.
Rollback candidate generation. List recent snapshots, usually three to five, with data differences and rollback cost.
Repair versus rollback comparison. Compare fixing current data with rolling back to a historical version.
Rerun plan. List downstream jobs that must rerun and estimate duration.

34.2.2 Rollback Approval Workflow¶

Rollback is high risk and requires approval.

Figure 34-2 illustrates the corresponding workflow or structure.

Figure 34-2: Rollback approval workflow.

Table 34-4 summarizes the corresponding comparison and engineering considerations.

Table 34-4: Approval matrix for rollback and repair.

Operation	Risk	Approval	Rollback plan
Field-level data repair	Low	Agent automatic plus post-hoc audit	Keep original values
Table-level data repair	Medium	Owner single review	Snapshot rollback
Single-table rollback	Medium	Owner single review	Pre-rollback snapshot
Multi-table rollback	High	Multi-level approval	Global snapshot plus downstream notice
Rule revocation	Medium	Owner plus rule author	Rule version rollback
Index rebuild	High	Platform admin approval	Keep old index until new one passes validation

34.2.3 Pipeline Self-Healing¶

Beyond rollback and repair, a DataOps Agent can attempt limited self-healing after detecting known failures.

Self-healing is bounded. The agent may only execute operations in a predefined safe operation set. Anything outside that set requires human approval.

Table 34-5 summarizes the corresponding comparison and engineering considerations.

Table 34-5: Self-healing permissions by data classification.

Self-healing operation	L0 public data	L1 internal data	L2 sensitive data	L3 confidential data
Job rerun	Automatic	Automatic	Automatic, once	Approval
Parameter adjustment, such as longer timeout	Automatic	Automatic	Approval	Approval
Switch data source replica	Automatic	Automatic	Approval	Not allowed
Skip noncritical step	Automatic	Approval	Approval	Not allowed
Field-level data repair	Automatic	Approval	Approval	Not allowed
Scale compute resources	Automatic with quota	Approval	Approval	Approval

Figure 34-3 illustrates the corresponding workflow or structure.

Figure 34-3: Pipeline self-healing decision flow.

The key principle is: repair known failure patterns; do not explore unknown problems autonomously. If a new failure pattern appears, the agent escalates it and records the pattern. After human-confirmed repair, that pattern may enter the future self-healing set.

34.2.4 Data Consistency During Rollback¶

The hardest part of rollback is not restoring a snapshot. It is preserving business consistency after rollback. If table A is rolled back but table B, which depends on A, is not, B may contain orphan data.

Rollback plans must include:

Dependency analysis. Query all downstream tables and jobs.
Consistency validation. Check references, foreign keys, and aggregate correctness.
Cascade rollback suggestions. If downstream risk exists, suggest cascading rollback or downstream repair.
Notification. Inform model training, analytics, labeling, and other consumers of affected time range and recommended action.

34.2.5 Decision Support Matrix for Data Rollback¶

Table 34-6 summarizes the corresponding comparison and engineering considerations.

Table 34-6: Rollback versus repair decision support.

Factor	Rollback option	Repair option	Agent analysis
Time cost	Snapshot restore plus downstream reruns	Script development plus validation	Estimate both using historical data
Data loss	New data after snapshot may be lost	Repair may introduce new errors	Quantify data loss
Downstream impact	Downstream jobs must rerun and delivery may delay	Downstream jobs may not be affected	Estimate impact through lineage
Risk	Rollback may fail	Incomplete repair may be found later	Estimate from historical success rates

The agent does not decide for the team. It ensures decision-makers have enough information: if rollback takes three hours, loses two hours of new data, and affects three downstream teams, while repair takes eight hours with medium integrity risk, the tradeoff is explicit.

34.3 Cost Alerting and Capacity Optimization Agents¶

34.3.1 Automatic Cost Anomaly Detection¶

Data platform cost is often ignored because billing is monthly. By the time the bill shows an anomaly, the loss has happened. Cost alerting agents provide real-time monitoring.

Storage growth detection. Monitor table growth rates and alert when growth exceeds historical baseline. Common causes include uncleared intermediate tables, unbounded logs, and overly aggressive snapshots.

Reprocessing cost analysis. When pipelines need reruns, estimate compute cost and compare it with repair cost.

Labeling backlog cost. Annotation backlog causes schedule delay, overtime, idle work, and wasted training compute. The agent monitors backlog level and trend.

GPU feeding anomalies. Training jobs consume data at stable rates. If data supply stops or quality fails and training idles, GPU spend is wasted. The agent monitors data starvation in training jobs.

34.3.2 Automatic Optimization Suggestions¶

Table 34-7 summarizes the corresponding comparison and engineering considerations.

Table 34-7: Cost anomaly detection and optimization suggestions.

Cost anomaly	Detection	Suggestion	Estimated saving
Intermediate table growth	Table size growth > daily mean by 5%	Set TTL and archive old partitions	20-40% storage
Duplicate computation	Same input triggers repeated reruns	Add intermediate-result cache	15-30% compute
Excessive snapshots	Snapshot frequency exceeds business need	Reduce frequency and use tiered retention	30-50% snapshot storage
Inefficient queries	Full scans > 20% of queries	Add index or partition pruning	10-25% query cost
Idle resources	CPU/GPU utilization < 30%	Shrink cluster or use Spot instances	20-50% compute

34.3.3 Capacity Planning and Forecasting¶

DataOps Agents should warn before resource exhaustion.

Storage forecast. Use historical growth and upcoming business plans, such as new sources or model training, to predict one-, three-, and six-month storage needs. If storage is forecast to reach 80 percent capacity within 30 days, generate an expansion suggestion.

Compute forecast. Analyze training workload patterns: periodic peaks, event-driven spikes after releases, and trends caused by data growth. The agent can recommend reserved instances, identify batch jobs that can run off-peak, and expand elastic resources when Spot prices are low.

Labeling capacity forecast. Based on downstream training plans and historical annotation speed, forecast whether labeling will become a bottleneck. If so, notify project managers to adjust priority, add annotators, or change quality standards.

34.3.4 Cost Attribution and Team Accountability¶

Optimization requires knowing who spends what. Agents allocate platform cost by team, project, task type, and data source.

Table 34-8 summarizes the corresponding comparison and engineering considerations.

Table 34-8: Cost attribution dimensions.

Attribution dimension	Cost types	Attribution method
Team	Storage, compute, labeling labor	Resource owner tags
Project	GPU training, inference	Project labels on training jobs
Data source	Collection, storage, cleaning	Source tags and lineage
Experiment	Training, evaluation	Experiment tracking ID

Each month, the agent generates team cost reports showing storage, compute, labeling, and collection trends. If a team's cost exceeds budget by more than 20 percent, the agent explains whether the driver is data growth or inefficient use.

34.4 Ticket and Postmortem Automation Agents¶

34.4.1 Automatic Ticket Generation¶

When human intervention is needed, the DataOps Agent creates a structured ticket with:

Issue summary. One sentence explaining what happened.
Impact scope. Affected tables, jobs, and downstream teams.
Root cause candidates. Ranked by confidence.
Suggested actions. Repair plan and estimated duration.
Urgency. P0-P3 based on impact and business priority.
Related links. Monitoring dashboards, log queries, and lineage analysis.

34.4.2 Automatic Postmortem Drafts¶

Postmortems are central to the DataOps flywheel, but engineers often skip them after intense repair work. Agents can draft postmortems from operation logs and lineage.

Postmortem draft template

Table 34-9 summarizes the corresponding comparison and engineering considerations.

Table 34-9: Postmortem draft template.

Element	Auto-fill source
Event timeline	Alert times, operation logs, approval records
Root cause analysis	Root-cause agent output
Impact assessment	Lineage analysis and downstream scope
Response process	Human Gate timestamps and agent execution logs
Repair plan	Executed rollback or repair operations and results
Prevention	Rule suggestions based on failure pattern
Action items	Generated TODO list with owners
Acceptance criteria	Metrics and thresholds to verify next week

34.4.3 Linking Experiment Tracking and Agents¶

DataOps Agents should connect to experiment tracking. When a model experiment shows data-related regression, the agent links it to data changes.

Example: math_reasoning accuracy drops from 78 percent to 71 percent. The agent:

Queries all training-data changes for that dimension during the experiment window.
Finds a cleaning rule released last week that affected 3 percent of math reasoning samples.
Correlates the rule change with performance regression and writes a report.
Recommends rolling back the rule and retraining for validation.

This closes the loop from data change to model performance and can reduce diagnosis from days to hours.

34.4.4 Automatic Operations Knowledge Base¶

Long-term value comes from accumulating reusable knowledge. After each incident, the agent writes these items into the operations knowledge base:

Failure fingerprint: metric combinations and log keywords.
Handling steps: executed steps and whether each was effective.
Timeline: complete sequence from incident to recovery.
Related information: alerts, change records, and communication records.

For future incidents, the agent retrieves similar cases: "this pattern is 87 percent similar to incident-2024-0315; the previous repair was..." Experience reuse is how DataOps Agents move from smart to wise.

34.5 Case Review: DataOps Agent in a Real Platform¶

After six months of deployment on a large data platform:

Alert response. The agent reduces more than 200 daily alerts into 15-20 root-cause candidates. MTTR drops from 45 minutes to 12 minutes. High-confidence root-cause candidates above 0.85 reach 91 percent accuracy.

Cost control. The agent detects seven storage-growth issues and helps clean 30 TB of unused intermediate data. Reprocessing-cost analysis helps teams choose repair instead of rerun in five incidents, saving about $8,000 in compute.

Postmortem culture. Automatic drafts reduce postmortem preparation from two hours to 15 minutes. The team moves from occasional postmortems to one postmortem for every incident.

Key Lessons¶

Alert aggregation must distinguish causality from correlation. Simultaneous alerts are not always causally linked.
Rollback approval cannot wait too long. One incident spreads because a reviewer is on vacation; timeout escalation is added afterward.
Cost suggestions need a soft landing. A table that looks unused may be someone's manual backup; optimization requires confirmation.

Extended Case: From Reactive Operations to Active Autonomy¶

The rollout is gradual:

Stage 1, months 0-2: shadow mode. The agent analyzes alerts and root causes in the background but performs no actions. Initial root-cause accuracy is 62 percent and rises to 85 percent after rule and feedback tuning.

Stage 2, months 3-4: recommendation mode. The agent sends aggregated alerts and root-cause analysis to on-call engineers. Engineers can view alert groups, candidates, and evidence in one click. MTTR drops from 45 minutes to 18 minutes.

Stage 3, months 5-6: semi-automatic mode. For high-confidence candidates above 0.85, the agent generates repair suggestions that engineers confirm before execution. Trust grows through retrospective comparisons between agent suggestions and actual repair results.

Stage 4, after month 7: conditional autonomy. The agent receives bounded automatic permissions for low-risk operations on L0 and L1 data, such as reruns and parameter adjustments, followed by audit.

Success Factors¶

Shadow mode builds trust cheaply.
Accuracy is not the only trust metric; explainability and predictability matter.
Gradual authorization is safer than one-time authorization.
Rollback capability is the foundation of authority.

Continuous MTTR Improvement¶

Table 34-10 summarizes the corresponding comparison and engineering considerations.

Table 34-10: Continuous MTTR Improvement.

Stage	MTTR	Agent automation	Engineer satisfaction
Manual baseline	45 min	0%	3.2/5, alert fatigue
Shadow mode	40 min	0%, analysis only	3.5/5
Recommendation mode	18 min	0%, suggestions	4.1/5
Semi-automatic mode	12 min	30%, low-risk automatic	4.3/5
Conditional autonomy	8 min	60%	4.5/5

34.6 Checklist: DataOps Agent Deployment¶

[ ] Does the alert aggregation agent read metrics, logs, lineage, and change records?
[ ] Are root-cause candidates ranked by confidence with supporting evidence?
[ ] Do rollback plans include impact analysis, candidate comparison, and rerun plans?
[ ] Is rollback approval tiered by impact scope?
[ ] Does cost monitoring cover storage growth, reprocessing, labeling backlog, and GPU feeding?
[ ] Are cost optimization suggestions human-confirmed before execution?
[ ] Do tickets include summary, impact scope, root-cause candidates, suggested actions, and urgency?
[ ] Do postmortem drafts include timeline, root cause, impact, and action items?
[ ] Are MTTR baseline and improvement targets defined?
[ ] Does approval timeout escalate rather than auto-approve?

34.7 Chapter Links¶

Chapter 24: DataOps flywheel and team organization are extended with agent-driven operational autonomy.
Chapter 25: rollback depends on version management and experiment tracking.
Chapter 26: alert agents depend on platform observability.
Chapter 31: this chapter follows the six-layer architecture, especially Human Gate and Lineage.
Chapter 32: collection and cleaning pipelines are part of what this chapter monitors.
Chapter 35: rollback approvals connect to the permission model and human-AI collaboration.

34.8 Further Reading: Deepening DataOps Agents¶

From Reactive to Predictive Operations¶

Most DataOps Agents today are reactive: detect, locate, and repair after failure. The next stage is predictive operations: predict and prevent before failure.

Technical path:

Anomaly pattern learning. Analyze metrics during the 24 hours before historical incidents and learn precursor patterns.
Risk scoring. Score every pipeline run using recent changes, volume fluctuation, dependency health, and failure history.
Preventive intervention. When risk exceeds threshold, scale resources, adjust scheduling, or notify teams.

Reactive versus predictive operations

Table 34-11 summarizes the corresponding comparison and engineering considerations.

Table 34-11: Reactive versus predictive operations.

Dimension	Reactive operations	Predictive operations
Trigger	After failure	Before failure
Detection	Alert thresholds	Pattern recognition and risk scoring
Intervention	Emergency repair	Preventive adjustment
Business impact	Interruption already happened	Interruption avoided
Maturity	Available	Requires historical data
Current accuracy	~85% root-cause localization	~60% failure prediction, needs tuning

Rethinking Permission Boundaries for Platform Autonomy¶

As autonomy increases, the question becomes: where is the final boundary?

Humans should retain final decision rights for:

Architectural changes. Platform topology, core storage migration, and infrastructure upgrades.
Cost and budget tradeoffs. Whether to spend more money for faster processing is a business decision.
Cross-team priority coordination. Agents should not decide whose workloads run first during resource contention.
Legal and compliance decisions. Agents can analyze gray areas, but legal teams decide.

The boundary can expand as reliability and trust grow, but every expansion requires evidence and rollback capability.

From DevOps to DataOps to AgentOps¶

The operations paradigm is evolving:

DevOps automates software delivery through CI/CD and infrastructure as code.
DataOps automates data delivery through pipelines, quality monitoring, and version management.
AgentOps automates agent operations: deployment, monitoring, evaluation, update, and retirement.

AgentOps does not replace DataOps. It adds a layer above it: agents manage data pipelines, and AgentOps manages agents. Operations dashboards must therefore combine data quality metrics with agent behavior metrics.

Chapter Summary¶

Starting from a 3 A.M. alert storm, this chapter discussed how DataOps Agents can move platform operations from passive response toward bounded autonomy. For alert-to-root-cause workflows, it presented alert aggregation and prioritization, root-cause candidate generation, and alert-fatigue governance to reduce mean time to repair. For recovery, it discussed automatic rollback-plan generation and approval workflows, pipeline self-healing, data consistency during version rollback, and decision-support matrices that keep human judgment in the loop for high-risk data rollback.

The methods in this chapter should be applied by jointly considering data sources, business goals, model capability, cost budget, and compliance requirements. For scenarios involving sensitive information, cross-system calls, automated decisions, or public release, human review, version freeze, permission control, and exception rollback should be retained. Example flows should not be generalized directly into production commitments; the governance, mapping, measurement, and management requirements of AI risk-management frameworks also apply to autonomy boundaries for DataOps Agents.

In the book's structure, this chapter sits in the agent-automation layer. It connects earlier platform foundations with later privacy, compliance, and specialized dataset cases. Readers can combine the chapter's frameworks, figures, references, and appendix checklists to turn the methods into reproducible, inspectable, and deliverable engineering workflows.

References¶

Amershi S, Begel A, Bird C, Devanbu P, Gall H, Kamar E, Nagappan N, Nushi B, Zimmermann T (2019) Software Engineering for Machine Learning: A Case Study. In: Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, pp 291-300. https://doi.org/10.1109/icse-seip.2019.00042.

Breck E, Polyzotis N, Roy S, Whang S E, Zinkevich M (2019) Data Validation for Machine Learning. In: Proceedings of Machine Learning and Systems 1, pp 334-347.

Dang Y, Lin Q, Huang P (2019) AIOps: Real-World Challenges and Research Innovations. In: Proceedings of the 41st International Conference on Software Engineering: Companion Proceedings, pp 4-5. https://doi.org/10.1109/icse-companion.2019.00023.

He S, He P, Chen Z, Yang T, Su Y, Lyu M R (2021) A Survey on Automated Log Analysis for Reliability Engineering. ACM Computing Surveys 54(6):1-37.

Huyen C (2022) Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications. O'Reilly Media.

Kreuzberger D, Kuhl N, Hirschl S (2023) Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 11:31866-31879. arXiv:2205.02302.

Makinen S, Skogstrom H, Laaksonen E, Mikkonen T (2021) Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help? In: Proceedings of the 2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI, pp 109-112.

Lwakatare L E, Raj A, Crnkovic I, Bosch J, Olsson H H (2020) Large-scale Machine Learning Systems in Real-world Industrial Settings: A Review of Challenges and Solutions. Information and Software Technology 127:106368. https://doi.org/10.1016/j.infsof.2020.106368.

NIST (2023) Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology.

NIST (2024) Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1.

Paleyes A, Urma R-G, Lawrence N D (2022) Challenges in Deploying Machine Learning: A Survey of Case Studies. ACM Computing Surveys 55(6):1-29. https://doi.org/10.1145/3533378.

Sambasivan N, Kapania S, Highfill H, Akrong D, Paritosh P, Aroyo L M (2021) "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp 1-15. https://doi.org/10.1145/3411764.3445518.

Tamburri D A (2020) Sustainable MLOps: Trends and Challenges. In: Proceedings of the 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp 17-23. https://doi.org/10.1109/synasc51798.2020.00015.

Testi M, Ballabio M, Frontoni E, Iannello G, Moccia S, Soda P, Vessio G (2022) MLOps: A Taxonomy and a Methodology. IEEE Access 10:63606-63618. https://doi.org/10.1109/access.2022.3181730.

Treveil M, Omont N, Stenac C, Lefevre K, Phan D, Zentici J, Lavoillotte A, Miyazaki M, Heidmann L (2020) Introducing MLOps: How to Scale Machine Learning in the Enterprise. O'Reilly Media.

Vela D, Sharp A, Zhang R, Nguyen T, Hoang A, Pianykh O S (2022) Temporal quality degradation in AI models. Scientific Reports 12:11654. https://doi.org/10.1038/s41598-022-15245-z.

Zhu J, He S, Liu J, He P, Xie Q, Zheng Z, Lyu M R (2019) Tools and Benchmarks for Automated Log Parsing. In: Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, pp 121-130.