Skip to content

Chapter 34: DataOps Agents and Platform Autonomy

Yang Luo; Zhili Wang; Jun Yu

Abstract

Data platform operations have long been treated as necessary but low-value labor: monitoring alerts, debugging incidents, rolling back versions, and analyzing cost. Each task consumes engineering time, yet none directly creates business value. As data platforms scale to hundreds of pipelines, PB-level storage, and thousands of tables, manual operations reach their limit. Alert fatigue causes missed signals, root cause analysis depends on personal experience, rollback decisions become hesitant, and cost black holes go unnoticed.

The goal of DataOps Agents is to upgrade the data platform from "being operated" to "self-operating." Agents aggregate alerts and locate root causes, generate rollback and repair plans for approval, analyze cost anomalies and propose optimizations, and draft tickets and postmortems. Autonomy does not mean unattended operation. High-risk actions such as data rollback, index rebuild, and pipeline restart must pass approval gates.

This chapter builds on Chapters 24-26: DataOps flywheel, version management, experiment tracking, and platform observability. It turns manual operations into agent-driven autonomous loops.

Keywords

DataOps Agent; platform autonomy; root-cause localization; data rollback; cost governance; operations automation

Learning Objectives

After reading this chapter, you should be able to:

  • Design an agent pipeline from alert aggregation to root cause analysis.
  • Understand how agents generate rollback and repair plans and route them through approval.
  • Analyze cost anomalies such as storage growth, reprocessing cost, and GPU feeding failures.
  • Build workflows for automatic ticket generation and postmortem drafts.
  • Evaluate MTTR improvement from DataOps Agents in production platforms.

Scenario: An Alert Storm at 3 A.M.

At 3:14 a.m., a data platform monitoring system sends the first alert: user_behavior_etl is delayed by more than 30 minutes. During the next 15 minutes, alerts arrive in a flood:

  • 3:17: recommendation_feature fails because it depends on user_behavior_etl.
  • 3:19: training_data_export has abnormal volume: expected 2 million rows, produced 300,000.
  • 3:22: model_training_pipeline pauses due to missing training data.
  • 3:25: storage usage spikes; intermediate tables grow to three times the usual size.
  • 3:28: the labeling platform errors because some tasks reference rolled-back data batches.

The on-call engineer wakes up and starts checking alerts one by one. The alerts are scattered across systems, with no aggregation and no context. It takes 40 minutes to locate the root cause: an upstream business database schema change added a required field, causing an ETL WHERE condition to filter out 85 percent of records. The schema-change email had arrived eight hours earlier and sat unread.

Core Engineering Pain Points

  1. No alert aggregation or prioritization. Twelve alerts have one root cause, but the engineer must inspect them one by one.
  2. Root cause analysis depends on personal experience. The engineer who knows the pipeline is on vacation.
  3. Rollback decisions are uncertain. No system suggests whether to roll back, which version to use, or which downstream systems will be affected.
  4. Cost anomalies are not tracked in time. Storage growth and reprocessing cost appear on the monthly bill after losses are real.

34.1 Alert-to-Root-Cause Agents

34.1.1 Alert Aggregation and Prioritization

The first capability of a DataOps Agent is intelligent alert aggregation. The agent reads four data categories and turns scattered alerts into root-cause candidates.

Figure 34-1 illustrates the corresponding workflow or structure.

Alert-to-root-cause agent flow

Figure 34-1: Alert-to-root-cause agent flow.

Data sources:

  1. Metrics: job latency, volume, error rate, resource usage.
  2. Logs: task execution logs, stack traces, data quality check logs.
  3. Lineage: upstream dependencies, field-level lineage, data flow graph.
  4. Change records: code commits, configuration changes, schema changes, dependency upgrades.

34.1.2 Root Cause Candidate Generation

The agent aligns timelines and uses causal signals to generate candidates.

Table 34-1 summarizes the corresponding comparison and engineering considerations.

Table 34-1: Root cause candidate types and detection logic.

Root cause type Pattern Detection logic Confidence
Upstream schema change Change precedes first alert and affects all downstream tasks Cross-check change records and alert timeline 0.90
Volume spike/drop Volume deviates > 3 sigma from historical mean Time-series anomaly detection 0.85
Insufficient resources CPU or memory reaches limit Correlate resource metrics with job performance 0.80
Code defect Alert coincides with recent deployment Cross-check deployment records and alert timeline 0.75
External dependency failure Multiple jobs fail while depending on the same API Dependency-graph topology analysis 0.70

The agent outputs three to five candidates ranked by confidence. Each candidate includes evidence: log snippets, metric charts, and change records. High-confidence candidates, such as above 0.85, may receive direct repair suggestions. Medium and low-confidence candidates require human confirmation.

34.1.2.1 Confidence Calibration for Root Cause Analysis

The confidence score itself must be calibrated. If the agent says "90 percent confidence," is it actually correct 90 percent of the time? Overconfident scoring will erode engineer trust.

Calibration methods:

  1. Historical backtesting. Review the previous month's root-cause candidates and compare them with confirmed causes.
  2. Calibration curve. Bucket confidence from 0 to 1 and compute actual accuracy in each bucket.
  3. Bias correction. If high-confidence buckets are less accurate than claimed, show corrected confidence in alerts.

Confidence-to-action mapping

Table 34-2 summarizes the corresponding comparison and engineering considerations.

Table 34-2: Confidence-to-action mapping.

Raw confidence Calibrated confidence Automation action
> 0.90 Usually consistent Push root cause and repair suggestion
0.75-0.90 Check for bias Push candidate and recommend confirmation
0.50-0.75 Usually lower than raw Push candidate and require human analysis
< 0.50 Usually lower than raw Push alert aggregation only; no root-cause guess

34.1.2.2 Cross-System Root Cause Correlation

Data platform incidents often span upstream ETL, storage, downstream consumers, and schedulers. Analysis cannot stay inside one log system.

Implementation strategies:

  • Unified timeline. Align all logs to one timezone and one minute-level event sequence.
  • Dependency graph overlay. Overlay alerts on the lineage graph to show propagation.
  • Change-event correlation. Overlay deployments, config changes, and schema changes on the timeline.

If an alert cluster starts within five minutes of a change event, the agent should mark that change as highly suspicious even before direct proof is available.

34.1.3 Alert Fatigue Governance

Alert fatigue is one of the most common anti-patterns in data platform operations. When alerts are too frequent, engineers start ignoring them and miss real incidents.

Aggregation strategies:

  1. Causal aggregation. Merge alerts from the same root cause into one group.
  2. Time-window aggregation. Merge repeated alerts of the same type within a window, such as five minutes, and attach count and trend.
  3. Dynamic priority adjustment. Adjust priority based on business impact and data volume. Low-priority 3 a.m. alerts can wait until working hours.

Table 34-3 summarizes the corresponding comparison and engineering considerations.

Table 34-3: Example alert aggregation effect.

Raw alerts Aggregated alerts Aggregation rate Daily alerts handled by engineers
200+ 15-20 ~90% From 50+ to 12
500+ 25-35 ~93% From 80+ to 18

34.2 Version Rollback and Repair Plan Agents

34.2.1 Automatic Rollback Plan Generation

After a data quality issue is confirmed, the agent generates rollback and repair plans. Rollback is not simply "return to the previous version." It must account for dependencies, data loss, and repair cost.

Rollback plan generation:

  1. Impact analysis. Use lineage to identify downstream tables and jobs.
  2. Rollback candidate generation. List recent snapshots, usually three to five, with data differences and rollback cost.
  3. Repair versus rollback comparison. Compare fixing current data with rolling back to a historical version.
  4. Rerun plan. List downstream jobs that must rerun and estimate duration.

34.2.2 Rollback Approval Workflow

Rollback is high risk and requires approval.

Figure 34-2 illustrates the corresponding workflow or structure.

Rollback approval workflow

Figure 34-2: Rollback approval workflow.

Table 34-4 summarizes the corresponding comparison and engineering considerations.

Table 34-4: Approval matrix for rollback and repair.

Operation Risk Approval Rollback plan
Field-level data repair Low Agent automatic plus post-hoc audit Keep original values
Table-level data repair Medium Owner single review Snapshot rollback
Single-table rollback Medium Owner single review Pre-rollback snapshot
Multi-table rollback High Multi-level approval Global snapshot plus downstream notice
Rule revocation Medium Owner plus rule author Rule version rollback
Index rebuild High Platform admin approval Keep old index until new one passes validation

34.2.3 Pipeline Self-Healing

Beyond rollback and repair, a DataOps Agent can attempt limited self-healing after detecting known failures.

Self-healing is bounded. The agent may only execute operations in a predefined safe operation set. Anything outside that set requires human approval.

Table 34-5 summarizes the corresponding comparison and engineering considerations.

Table 34-5: Self-healing permissions by data classification.

Self-healing operation L0 public data L1 internal data L2 sensitive data L3 confidential data
Job rerun Automatic Automatic Automatic, once Approval
Parameter adjustment, such as longer timeout Automatic Automatic Approval Approval
Switch data source replica Automatic Automatic Approval Not allowed
Skip noncritical step Automatic Approval Approval Not allowed
Field-level data repair Automatic Approval Approval Not allowed
Scale compute resources Automatic with quota Approval Approval Approval

Figure 34-3 illustrates the corresponding workflow or structure.

Pipeline self-healing decision flow

Figure 34-3: Pipeline self-healing decision flow.

The key principle is: repair known failure patterns; do not explore unknown problems autonomously. If a new failure pattern appears, the agent escalates it and records the pattern. After human-confirmed repair, that pattern may enter the future self-healing set.

34.2.4 Data Consistency During Rollback

The hardest part of rollback is not restoring a snapshot. It is preserving business consistency after rollback. If table A is rolled back but table B, which depends on A, is not, B may contain orphan data.

Rollback plans must include:

  1. Dependency analysis. Query all downstream tables and jobs.
  2. Consistency validation. Check references, foreign keys, and aggregate correctness.
  3. Cascade rollback suggestions. If downstream risk exists, suggest cascading rollback or downstream repair.
  4. Notification. Inform model training, analytics, labeling, and other consumers of affected time range and recommended action.

34.2.5 Decision Support Matrix for Data Rollback

Table 34-6 summarizes the corresponding comparison and engineering considerations.

Table 34-6: Rollback versus repair decision support.

Factor Rollback option Repair option Agent analysis
Time cost Snapshot restore plus downstream reruns Script development plus validation Estimate both using historical data
Data loss New data after snapshot may be lost Repair may introduce new errors Quantify data loss
Downstream impact Downstream jobs must rerun and delivery may delay Downstream jobs may not be affected Estimate impact through lineage
Risk Rollback may fail Incomplete repair may be found later Estimate from historical success rates

The agent does not decide for the team. It ensures decision-makers have enough information: if rollback takes three hours, loses two hours of new data, and affects three downstream teams, while repair takes eight hours with medium integrity risk, the tradeoff is explicit.

34.3 Cost Alerting and Capacity Optimization Agents

34.3.1 Automatic Cost Anomaly Detection

Data platform cost is often ignored because billing is monthly. By the time the bill shows an anomaly, the loss has happened. Cost alerting agents provide real-time monitoring.

Storage growth detection. Monitor table growth rates and alert when growth exceeds historical baseline. Common causes include uncleared intermediate tables, unbounded logs, and overly aggressive snapshots.

Reprocessing cost analysis. When pipelines need reruns, estimate compute cost and compare it with repair cost.

Labeling backlog cost. Annotation backlog causes schedule delay, overtime, idle work, and wasted training compute. The agent monitors backlog level and trend.

GPU feeding anomalies. Training jobs consume data at stable rates. If data supply stops or quality fails and training idles, GPU spend is wasted. The agent monitors data starvation in training jobs.

34.3.2 Automatic Optimization Suggestions

Table 34-7 summarizes the corresponding comparison and engineering considerations.

Table 34-7: Cost anomaly detection and optimization suggestions.

Cost anomaly Detection Suggestion Estimated saving
Intermediate table growth Table size growth > daily mean by 5% Set TTL and archive old partitions 20-40% storage
Duplicate computation Same input triggers repeated reruns Add intermediate-result cache 15-30% compute
Excessive snapshots Snapshot frequency exceeds business need Reduce frequency and use tiered retention 30-50% snapshot storage
Inefficient queries Full scans > 20% of queries Add index or partition pruning 10-25% query cost
Idle resources CPU/GPU utilization < 30% Shrink cluster or use Spot instances 20-50% compute

34.3.3 Capacity Planning and Forecasting

DataOps Agents should warn before resource exhaustion.

Storage forecast. Use historical growth and upcoming business plans, such as new sources or model training, to predict one-, three-, and six-month storage needs. If storage is forecast to reach 80 percent capacity within 30 days, generate an expansion suggestion.

Compute forecast. Analyze training workload patterns: periodic peaks, event-driven spikes after releases, and trends caused by data growth. The agent can recommend reserved instances, identify batch jobs that can run off-peak, and expand elastic resources when Spot prices are low.

Labeling capacity forecast. Based on downstream training plans and historical annotation speed, forecast whether labeling will become a bottleneck. If so, notify project managers to adjust priority, add annotators, or change quality standards.

34.3.4 Cost Attribution and Team Accountability

Optimization requires knowing who spends what. Agents allocate platform cost by team, project, task type, and data source.

Table 34-8 summarizes the corresponding comparison and engineering considerations.

Table 34-8: Cost attribution dimensions.

Attribution dimension Cost types Attribution method
Team Storage, compute, labeling labor Resource owner tags
Project GPU training, inference Project labels on training jobs
Data source Collection, storage, cleaning Source tags and lineage
Experiment Training, evaluation Experiment tracking ID

Each month, the agent generates team cost reports showing storage, compute, labeling, and collection trends. If a team's cost exceeds budget by more than 20 percent, the agent explains whether the driver is data growth or inefficient use.

34.4 Ticket and Postmortem Automation Agents

34.4.1 Automatic Ticket Generation

When human intervention is needed, the DataOps Agent creates a structured ticket with:

  1. Issue summary. One sentence explaining what happened.
  2. Impact scope. Affected tables, jobs, and downstream teams.
  3. Root cause candidates. Ranked by confidence.
  4. Suggested actions. Repair plan and estimated duration.
  5. Urgency. P0-P3 based on impact and business priority.
  6. Related links. Monitoring dashboards, log queries, and lineage analysis.

34.4.2 Automatic Postmortem Drafts

Postmortems are central to the DataOps flywheel, but engineers often skip them after intense repair work. Agents can draft postmortems from operation logs and lineage.

Postmortem draft template

Table 34-9 summarizes the corresponding comparison and engineering considerations.

Table 34-9: Postmortem draft template.

Element Auto-fill source
Event timeline Alert times, operation logs, approval records
Root cause analysis Root-cause agent output
Impact assessment Lineage analysis and downstream scope
Response process Human Gate timestamps and agent execution logs
Repair plan Executed rollback or repair operations and results
Prevention Rule suggestions based on failure pattern
Action items Generated TODO list with owners
Acceptance criteria Metrics and thresholds to verify next week

34.4.3 Linking Experiment Tracking and Agents

DataOps Agents should connect to experiment tracking. When a model experiment shows data-related regression, the agent links it to data changes.

Example: math_reasoning accuracy drops from 78 percent to 71 percent. The agent:

  1. Queries all training-data changes for that dimension during the experiment window.
  2. Finds a cleaning rule released last week that affected 3 percent of math reasoning samples.
  3. Correlates the rule change with performance regression and writes a report.
  4. Recommends rolling back the rule and retraining for validation.

This closes the loop from data change to model performance and can reduce diagnosis from days to hours.

34.4.4 Automatic Operations Knowledge Base

Long-term value comes from accumulating reusable knowledge. After each incident, the agent writes these items into the operations knowledge base:

  • Failure fingerprint: metric combinations and log keywords.
  • Handling steps: executed steps and whether each was effective.
  • Timeline: complete sequence from incident to recovery.
  • Related information: alerts, change records, and communication records.

For future incidents, the agent retrieves similar cases: "this pattern is 87 percent similar to incident-2024-0315; the previous repair was..." Experience reuse is how DataOps Agents move from smart to wise.

34.5 Case Review: DataOps Agent in a Real Platform

After six months of deployment on a large data platform:

Alert response. The agent reduces more than 200 daily alerts into 15-20 root-cause candidates. MTTR drops from 45 minutes to 12 minutes. High-confidence root-cause candidates above 0.85 reach 91 percent accuracy.

Cost control. The agent detects seven storage-growth issues and helps clean 30 TB of unused intermediate data. Reprocessing-cost analysis helps teams choose repair instead of rerun in five incidents, saving about $8,000 in compute.

Postmortem culture. Automatic drafts reduce postmortem preparation from two hours to 15 minutes. The team moves from occasional postmortems to one postmortem for every incident.

Key Lessons

  1. Alert aggregation must distinguish causality from correlation. Simultaneous alerts are not always causally linked.
  2. Rollback approval cannot wait too long. One incident spreads because a reviewer is on vacation; timeout escalation is added afterward.
  3. Cost suggestions need a soft landing. A table that looks unused may be someone's manual backup; optimization requires confirmation.

Extended Case: From Reactive Operations to Active Autonomy

The rollout is gradual:

Stage 1, months 0-2: shadow mode. The agent analyzes alerts and root causes in the background but performs no actions. Initial root-cause accuracy is 62 percent and rises to 85 percent after rule and feedback tuning.

Stage 2, months 3-4: recommendation mode. The agent sends aggregated alerts and root-cause analysis to on-call engineers. Engineers can view alert groups, candidates, and evidence in one click. MTTR drops from 45 minutes to 18 minutes.

Stage 3, months 5-6: semi-automatic mode. For high-confidence candidates above 0.85, the agent generates repair suggestions that engineers confirm before execution. Trust grows through retrospective comparisons between agent suggestions and actual repair results.

Stage 4, after month 7: conditional autonomy. The agent receives bounded automatic permissions for low-risk operations on L0 and L1 data, such as reruns and parameter adjustments, followed by audit.

Success Factors

  1. Shadow mode builds trust cheaply.
  2. Accuracy is not the only trust metric; explainability and predictability matter.
  3. Gradual authorization is safer than one-time authorization.
  4. Rollback capability is the foundation of authority.

Continuous MTTR Improvement

Table 34-10 summarizes the corresponding comparison and engineering considerations.

Table 34-10: Continuous MTTR Improvement.

Stage MTTR Agent automation Engineer satisfaction
Manual baseline 45 min 0% 3.2/5, alert fatigue
Shadow mode 40 min 0%, analysis only 3.5/5
Recommendation mode 18 min 0%, suggestions 4.1/5
Semi-automatic mode 12 min 30%, low-risk automatic 4.3/5
Conditional autonomy 8 min 60% 4.5/5

34.6 Checklist: DataOps Agent Deployment

  • [ ] Does the alert aggregation agent read metrics, logs, lineage, and change records?
  • [ ] Are root-cause candidates ranked by confidence with supporting evidence?
  • [ ] Do rollback plans include impact analysis, candidate comparison, and rerun plans?
  • [ ] Is rollback approval tiered by impact scope?
  • [ ] Does cost monitoring cover storage growth, reprocessing, labeling backlog, and GPU feeding?
  • [ ] Are cost optimization suggestions human-confirmed before execution?
  • [ ] Do tickets include summary, impact scope, root-cause candidates, suggested actions, and urgency?
  • [ ] Do postmortem drafts include timeline, root cause, impact, and action items?
  • [ ] Are MTTR baseline and improvement targets defined?
  • [ ] Does approval timeout escalate rather than auto-approve?
  • Chapter 24: DataOps flywheel and team organization are extended with agent-driven operational autonomy.
  • Chapter 25: rollback depends on version management and experiment tracking.
  • Chapter 26: alert agents depend on platform observability.
  • Chapter 31: this chapter follows the six-layer architecture, especially Human Gate and Lineage.
  • Chapter 32: collection and cleaning pipelines are part of what this chapter monitors.
  • Chapter 35: rollback approvals connect to the permission model and human-AI collaboration.

34.8 Further Reading: Deepening DataOps Agents

From Reactive to Predictive Operations

Most DataOps Agents today are reactive: detect, locate, and repair after failure. The next stage is predictive operations: predict and prevent before failure.

Technical path:

  1. Anomaly pattern learning. Analyze metrics during the 24 hours before historical incidents and learn precursor patterns.
  2. Risk scoring. Score every pipeline run using recent changes, volume fluctuation, dependency health, and failure history.
  3. Preventive intervention. When risk exceeds threshold, scale resources, adjust scheduling, or notify teams.

Reactive versus predictive operations

Table 34-11 summarizes the corresponding comparison and engineering considerations.

Table 34-11: Reactive versus predictive operations.

Dimension Reactive operations Predictive operations
Trigger After failure Before failure
Detection Alert thresholds Pattern recognition and risk scoring
Intervention Emergency repair Preventive adjustment
Business impact Interruption already happened Interruption avoided
Maturity Available Requires historical data
Current accuracy ~85% root-cause localization ~60% failure prediction, needs tuning

Rethinking Permission Boundaries for Platform Autonomy

As autonomy increases, the question becomes: where is the final boundary?

Humans should retain final decision rights for:

  1. Architectural changes. Platform topology, core storage migration, and infrastructure upgrades.
  2. Cost and budget tradeoffs. Whether to spend more money for faster processing is a business decision.
  3. Cross-team priority coordination. Agents should not decide whose workloads run first during resource contention.
  4. Legal and compliance decisions. Agents can analyze gray areas, but legal teams decide.

The boundary can expand as reliability and trust grow, but every expansion requires evidence and rollback capability.

From DevOps to DataOps to AgentOps

The operations paradigm is evolving:

  • DevOps automates software delivery through CI/CD and infrastructure as code.
  • DataOps automates data delivery through pipelines, quality monitoring, and version management.
  • AgentOps automates agent operations: deployment, monitoring, evaluation, update, and retirement.

AgentOps does not replace DataOps. It adds a layer above it: agents manage data pipelines, and AgentOps manages agents. Operations dashboards must therefore combine data quality metrics with agent behavior metrics.

Chapter Summary

Starting from a 3 A.M. alert storm, this chapter discussed how DataOps Agents can move platform operations from passive response toward bounded autonomy. For alert-to-root-cause workflows, it presented alert aggregation and prioritization, root-cause candidate generation, and alert-fatigue governance to reduce mean time to repair. For recovery, it discussed automatic rollback-plan generation and approval workflows, pipeline self-healing, data consistency during version rollback, and decision-support matrices that keep human judgment in the loop for high-risk data rollback.

The methods in this chapter should be applied by jointly considering data sources, business goals, model capability, cost budget, and compliance requirements. For scenarios involving sensitive information, cross-system calls, automated decisions, or public release, human review, version freeze, permission control, and exception rollback should be retained. Example flows should not be generalized directly into production commitments; the governance, mapping, measurement, and management requirements of AI risk-management frameworks also apply to autonomy boundaries for DataOps Agents.

In the book's structure, this chapter sits in the agent-automation layer. It connects earlier platform foundations with later privacy, compliance, and specialized dataset cases. Readers can combine the chapter's frameworks, figures, references, and appendix checklists to turn the methods into reproducible, inspectable, and deliverable engineering workflows.

References

Amershi S, Begel A, Bird C, Devanbu P, Gall H, Kamar E, Nagappan N, Nushi B, Zimmermann T (2019) Software Engineering for Machine Learning: A Case Study. In: Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, pp 291-300. https://doi.org/10.1109/icse-seip.2019.00042.

Breck E, Polyzotis N, Roy S, Whang S E, Zinkevich M (2019) Data Validation for Machine Learning. In: Proceedings of Machine Learning and Systems 1, pp 334-347.

Dang Y, Lin Q, Huang P (2019) AIOps: Real-World Challenges and Research Innovations. In: Proceedings of the 41st International Conference on Software Engineering: Companion Proceedings, pp 4-5. https://doi.org/10.1109/icse-companion.2019.00023.

He S, He P, Chen Z, Yang T, Su Y, Lyu M R (2021) A Survey on Automated Log Analysis for Reliability Engineering. ACM Computing Surveys 54(6):1-37.

Huyen C (2022) Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications. O'Reilly Media.

Kreuzberger D, Kuhl N, Hirschl S (2023) Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 11:31866-31879. arXiv:2205.02302.

Makinen S, Skogstrom H, Laaksonen E, Mikkonen T (2021) Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help? In: Proceedings of the 2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI, pp 109-112.

Lwakatare L E, Raj A, Crnkovic I, Bosch J, Olsson H H (2020) Large-scale Machine Learning Systems in Real-world Industrial Settings: A Review of Challenges and Solutions. Information and Software Technology 127:106368. https://doi.org/10.1016/j.infsof.2020.106368.

NIST (2023) Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology.

NIST (2024) Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1.

Paleyes A, Urma R-G, Lawrence N D (2022) Challenges in Deploying Machine Learning: A Survey of Case Studies. ACM Computing Surveys 55(6):1-29. https://doi.org/10.1145/3533378.

Sambasivan N, Kapania S, Highfill H, Akrong D, Paritosh P, Aroyo L M (2021) "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp 1-15. https://doi.org/10.1145/3411764.3445518.

Tamburri D A (2020) Sustainable MLOps: Trends and Challenges. In: Proceedings of the 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp 17-23. https://doi.org/10.1109/synasc51798.2020.00015.

Testi M, Ballabio M, Frontoni E, Iannello G, Moccia S, Soda P, Vessio G (2022) MLOps: A Taxonomy and a Methodology. IEEE Access 10:63606-63618. https://doi.org/10.1109/access.2022.3181730.

Treveil M, Omont N, Stenac C, Lefevre K, Phan D, Zentici J, Lavoillotte A, Miyazaki M, Heidmann L (2020) Introducing MLOps: How to Scale Machine Learning in the Enterprise. O'Reilly Media.

Vela D, Sharp A, Zhang R, Nguyen T, Hoang A, Pianykh O S (2022) Temporal quality degradation in AI models. Scientific Reports 12:11654. https://doi.org/10.1038/s41598-022-15245-z.

Zhu J, He S, Liu J, He P, Xie Q, Zheng Z, Lyu M R (2019) Tools and Benchmarks for Automated Log Parsing. In: Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, pp 121-130.