⚙️ Monitoring & Optimisation — Activities

Purpose

Overview of core activities and role assignments during the Monitoring & Optimisation phase, from operational monitoring to drift detection and cost control.

When to use this?

Your AI system is live and you need to maintain its performance, control costs, ensure compliance, and plan for continuous improvement. This page guides you through the operational activities of the Monitoring phase.

🎯 Objective

Execute the Monitoring & Optimisation phase activities to maintain system performance, detect and respond to drift, control costs, ensure ongoing compliance, and manage the system through its operational lifespan — including decommissioning when appropriate.

✅ Entry Criteria (Definition of Ready)

System is live with Gate 4 approved.
Monitoring dashboards and alerts are active.
Baseline metrics are recorded from the Go-live configuration.

⚙️ Core Activities

1. Operational Monitoring & MLOps

We monitor the "heartbeat" of the system. This is the continuous observation that detects degradation before it becomes an incident.

Real-time Performance Tracking:

Set Up Dashboards: Configure dashboards for critical metrics:
- Latency: Response time per request. Set alert threshold at p95 > [X] ms.
- Error Rate: Failures per transaction. Set alert threshold at > [X]% over [Y] minutes.
- Uptime: Availability percentage. Target: 99.9% or higher.
- Throughput: Transactions per second. Monitor for capacity planning.
Configure Alerts: Set up alerts for threshold violations. Route alerts to the right people:
- Critical alerts (system down, Hard Boundary violation): Page the on-call engineer immediately.
- Warning alerts (performance degradation, elevated error rate): Notify the team via Slack/email.
- Info alerts (usage spikes, cost anomalies): Log for review in the next cost meeting.
Establish Review Cadence:
- Daily: Check dashboards for anomalies.
- Weekly: Review alert history and resolve any open incidents.
- Monthly: Produce a performance summary for the AI Product Manager and CAIO.
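
The threshold checks and alert routing above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the threshold values stand in for the open [X] placeholders, and the names `Request`, `evaluate_window`, and `route` are hypothetical.

```python
from dataclasses import dataclass

# Placeholder thresholds: the page leaves [X] open, so these values are assumptions.
P95_LATENCY_MS = 800.0
ERROR_RATE_PCT = 0.5

# Routing table mirroring the three alert levels described above.
ROUTES = {
    "critical": "page on-call engineer",
    "warning": "notify team via Slack/email",
    "info": "log for next review",
}


@dataclass
class Request:
    latency_ms: float
    failed: bool


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def evaluate_window(requests, system_up=True):
    """Return (severity, message) pairs for one monitoring window."""
    if not system_up:
        return [("critical", "system down")]
    alerts = []
    if requests:
        latency = p95([r.latency_ms for r in requests])
        error_pct = 100.0 * sum(r.failed for r in requests) / len(requests)
        if latency > P95_LATENCY_MS:
            alerts.append(("warning", f"p95 latency {latency:.0f} ms above threshold"))
        if error_pct > ERROR_RATE_PCT:
            alerts.append(("warning", f"error rate {error_pct:.2f}% above threshold"))
    return alerts


def route(alerts):
    """Attach the destination for each alert based on its severity."""
    return [(ROUTES[severity], message) for severity, message in alerts]
```

In practice the window would come from the monitoring backend; here it is just a list of `Request` objects collected over the evaluation interval.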

Performance Degradation Monitoring (Drift Detection):

Data Drift: Statistically monitor whether production input data deviates from the training data. Use statistics such as the Population Stability Index (PSI) or the Kolmogorov-Smirnov test. Alert when drift exceeds the defined threshold.
Concept Drift: Monitor whether the relationship between input data and outcomes changes. This is harder to detect because it requires comparing model predictions with actual outcomes over time. A decline in accuracy or F1 score is a signal.
Define Significant Degradation: Performance degradation is significant if any of the following occurs relative to the baseline:
- Factual accuracy drops ≥ 2 percentage points.
- Relevance score drops ≥ 0.3 on a 1–5 scale.
- Number of Major errors increases ≥ 50% over two consecutive measurement periods.
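
The three degradation rules can be expressed as a small check. This is a sketch under one stated assumption: "increases ≥ 50% over two consecutive measurement periods" is read here as the Major error count being at least 1.5 times the baseline in both of the two most recent periods. The names `PeriodMetrics` and `degradation_reasons` are hypothetical.

```python
from dataclasses import dataclass

EPS = 1e-9  # tolerance so a drop of exactly 0.3 is not lost to float rounding


@dataclass
class PeriodMetrics:
    factual_accuracy_pct: float  # e.g. 96.2
    relevance_score: float       # on the 1-5 scale
    major_errors: int            # count of Major errors in the period


def degradation_reasons(baseline, previous, current):
    """Return the rules that flag significant degradation (empty list if none)."""
    reasons = []
    # Rule 1: factual accuracy drops by at least 2 percentage points.
    if baseline.factual_accuracy_pct - current.factual_accuracy_pct >= 2.0 - EPS:
        reasons.append("factual accuracy")
    # Rule 2: relevance score drops by at least 0.3 on the 1-5 scale.
    if baseline.relevance_score - current.relevance_score >= 0.3 - EPS:
        reasons.append("relevance score")
    # Rule 3 (assumed reading): Major errors at least 50% above baseline
    # in both of the two most recent measurement periods.
    limit = 1.5 * baseline.major_errors
    if (baseline.major_errors > 0
            and previous.major_errors >= limit
            and current.major_errors >= limit):
        reasons.append("major errors")
    return reasons
```

A non-empty result is a prompt to investigate, not a trigger for an automatic fix.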

Data Loop Integration:

Feed Production Data Back: Set up a pipeline that feeds production data and outcomes back into the development environment. This is the feedback loop that enables continuous improvement.
Label Production Data: Where possible, have humans label production data to create new training examples. This is especially valuable for edge cases that were not in the original Golden Set.
Store for Analysis: Archive production data (with appropriate privacy protections) for drift analysis, retraining, and audit purposes.

Do not make automatic corrections on significant deviations

When the system shows significant degradation, investigate the cause first. Determine what adjustment is needed and how it can be implemented in a controlled manner — including verification and documentation. Automatic "fixes" can introduce new problems.

Practical Example

Situation: An insurance company had been running an AI-assisted claims triage system for 8 months. The system classified incoming motor insurance claims into "straight-through settlement," "investigation required," and "potential fraud referral." Initially, the system achieved 96.2% accuracy against the Golden Set.

Approach: The MLOps Engineer configured dashboards tracking latency (p95 < 800 ms), error rate (< 0.5%), and throughput (target: 200 claims/hour). Drift detection was set up using Population Stability Index (PSI) on input features: claim description length, vehicle age, damage type codes, and claimant history.

After 6 months, the PSI for "damage type codes" crossed the 0.25 threshold, signalling significant data drift. Investigation revealed that a new regulation had changed how garages reported damage categories, introducing codes the model had never seen. The concept drift monitor also showed a 3.1 percentage point drop in factual accuracy over two consecutive months, exceeding the "significant degradation" threshold of ≥ 2pp.

The team triggered a retraining event: they labelled 2,000 recent claims with the new damage codes, added them to the training set, and retrained the model. The new model scored 97.1% on the updated Golden Set. The Retraining Log recorded the date, trigger (drift alert), dataset size, model version, and Guardian approval.

Meanwhile, the monthly cost review identified that 34% of inference calls were for duplicate queries (claimants calling multiple times about the same claim). Implementing a caching layer reduced API costs by 22% without affecting accuracy.

Result: The system recovered to above-baseline accuracy within 3 weeks of the drift alert. The Transparency Report (quarterly) documented the drift event, the retraining, and the cost optimisation, providing an auditable record for the EU AI Act post-market surveillance requirement.

The fairness audit, conducted by the Guardian on a sample of 500 production outputs, found no significant difference in error rates across claimant age groups (difference: 2.1%, well within the 10% threshold for Limited Risk). The decommissioning triggers were reviewed and updated: the economic trigger was adjusted from "cost per claim > €4.50" to "cost per claim > €5.20" to reflect the new baseline after caching optimisation.
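
The PSI check used in this example can be sketched as below. The formula is the standard one, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ) over bin shares; the bin counts in the usage note are invented for illustration, and the `eps` floor for empty bins is an assumption (other smoothing choices are common).

```python
import math

PSI_ALERT_THRESHOLD = 0.25  # threshold quoted in the example above


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts in the same bin order.

    expected_counts: counts per bin in the training (reference) data.
    actual_counts:   counts per bin in the production data.
    eps:             floor for bin shares, so unseen categories do not cause log(0).
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both inputs must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        e_share = max(expected / expected_total, eps)
        a_share = max(actual / actual_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value


def drift_alert(expected_counts, actual_counts):
    """True when PSI exceeds the alert threshold."""
    return psi(expected_counts, actual_counts) > PSI_ALERT_THRESHOLD
```

For instance, `drift_alert([500, 300, 200, 0], [300, 250, 200, 250])` models a damage code that never appeared in training but makes up a quarter of production traffic, which raises the alert; a small shift such as `[500, 300, 200]` versus `[490, 310, 200]` does not.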

2. Continuous Improvement & Retraining

Standing still means falling behind. We use production insights to improve the system through structured experiment loops.

Retraining Strategy:

Define Retraining Triggers:
- Periodic: Retrain on a fixed schedule (e.g., quarterly). Suitable for stable domains with slow change.
- On Drift Alert: Retrain when drift detection signals significant degradation. Suitable for dynamic domains.
- On New Data Threshold: Retrain when a sufficient volume of new labelled data is available. Suitable for high-volume systems.
Plan the Retraining Process:
- Assemble the new training dataset (existing data + new labelled production data).
- Run the training pipeline and evaluate against the Golden Set.
- Compare the new model's performance with the current production model.
- If the new model is better, proceed through the controlled change process (specification, validation, approval).
Document Retraining Events: Record every retraining event in the Retraining Log: date, trigger, dataset, model version, evaluation results, and approval status.
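
The three retraining triggers and the Retraining Log fields might be modelled as follows. This is a sketch only: the 90-day interval stands in for "quarterly", the 2,000-label threshold is an arbitrary placeholder, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RetrainingPolicy:
    max_age: timedelta = timedelta(days=90)  # assumed stand-in for "quarterly"
    min_new_labels: int = 2000               # placeholder volume threshold


@dataclass
class RetrainingLogEntry:
    """Fields listed for the Retraining Log."""
    date: date
    trigger: str
    dataset: str
    model_version: str
    evaluation_results: dict
    approval_status: str


def retraining_triggers(policy, last_retrained, today, drift_alert, new_labelled):
    """Return the names of all triggers that currently apply."""
    triggers = []
    if today - last_retrained >= policy.max_age:
        triggers.append("periodic")
    if drift_alert:
        triggers.append("drift_alert")
    if new_labelled >= policy.min_new_labels:
        triggers.append("new_data_threshold")
    return triggers


def is_improvement(candidate_score, production_score):
    """Only a strictly better candidate enters the controlled change process."""
    return candidate_score > production_score
```

A fired trigger starts the retraining process; it does not bypass the specification, validation, and approval steps.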

Experiment Loops:

Formulate Hypotheses: Based on production insights, user feedback, or drift analysis, formulate hypotheses for improvement. Make each hypothesis measurable. Example: "Adding the new policy documents to the knowledge base will improve accuracy on compliance questions by 5 percentage points."
Run Experiments: Test hypotheses using A/B tests or Canary releases. A/B tests compare two versions with different user groups. Canary releases roll out the new version to a small percentage of traffic and monitor for issues.
Evaluate Results: Compare experiment results with the hypothesis. Did the change improve performance? Did it introduce new issues? Document the findings.
Decide: Promote the change to production, iterate further, or abandon the hypothesis.
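
One common way to implement a Canary release is to assign each user to a stable bucket by hashing an identifier, so the same user always sees the same version. The sketch below assumes a string user ID and a hypothetical experiment name used as salt; it is an illustration, not a required mechanism.

```python
import hashlib


def canary_bucket(user_id, salt="experiment-1"):
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assign_version(user_id, canary_percent, salt="experiment-1"):
    """Route roughly canary_percent of users to the candidate version."""
    if canary_bucket(user_id, salt) < canary_percent:
        return "candidate"
    return "production"
```

Changing the salt per experiment reshuffles users, so the same people are not always the first to receive new versions. Raising `canary_percent` gradually widens the rollout without moving users who are already on the candidate back to production.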

Backlog Management:

Collect Inputs: Gather bugs, improvements, and feature requests from user feedback, monitoring alerts, and stakeholder input.
Prioritise: Rank items based on impact (business value, risk reduction) and effort (development time, complexity).
Plan Sprints: Select items for the next development cycle. Ensure the sprint capacity matches the team's velocity.
Track Progress: Update the backlog as items are completed, deferred, or cancelled.
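
Impact-versus-effort ranking and capacity-bound sprint selection can be sketched as below. The 1-5 scoring scales and the greedy selection are assumptions for illustration; teams often use richer scoring models.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    title: str
    impact: int  # assumed scale: 1 (low) to 5 (high)
    effort: int  # assumed scale: 1 (small) to 5 (large), also used as sprint cost


def prioritise(items):
    """Sort by impact-to-effort ratio (highest first); smaller items win ties."""
    return sorted(items, key=lambda item: (-item.impact / item.effort, item.effort))


def plan_sprint(items, capacity):
    """Greedily pick the highest-ranked items that still fit the remaining capacity."""
    selected = []
    remaining = capacity
    for item in prioritise(items):
        if item.effort <= remaining:
            selected.append(item)
            remaining -= item.effort
    return selected
```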

3. Cost Control & Energy Efficiency

Sustainability in euros and CO₂. We optimise the economic and ecological footprint of the AI system.

Cloud & API Optimisation:

Monthly Cost Review: Review compute (GPU/CPU) and token costs. Break down by model, endpoint, and user group. Identify cost drivers and anomalies.
Optimisation Techniques:
- Model Compression (Quantisation): Reduce model size and inference cost with minimal accuracy loss. Verify the loss against the Golden Set before rollout.
- Caching: Cache frequent queries to avoid redundant inference.
- Batching: Process multiple requests in a single inference call where latency permits.
- Model Selection: Use smaller, cheaper models for tasks that do not require the full capability of the largest model.
Set Budget Alerts: Configure alerts when costs approach or exceed the budget. This prevents surprise bills.
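
Caching and budget alerts can be sketched as follows. The `run_model` function is a stand-in for the real inference call, the normalisation rule (lower-casing and collapsing whitespace) assumes such queries are genuinely equivalent, and the 80% warning level is a placeholder.

```python
from functools import lru_cache


def run_model(query):
    """Stand-in for an expensive inference call (hypothetical)."""
    return f"answer for: {query}"


@lru_cache(maxsize=1024)
def cached_inference(normalised_query):
    """Run inference once per distinct normalised query."""
    return run_model(normalised_query)


def answer(query):
    """Normalise the query so trivially different duplicates share one cache entry."""
    normalised = " ".join(query.lower().split())
    return cached_inference(normalised)


def budget_status(spend_to_date, monthly_budget, warn_at=0.8):
    """Classify spend as 'ok', 'approaching', or 'exceeded' relative to the budget."""
    ratio = spend_to_date / monthly_budget
    if ratio >= 1.0:
        return "exceeded"
    if ratio >= warn_at:
        return "approaching"
    return "ok"
```

`cached_inference.cache_info()` reports hits and misses, which gives a direct measure of how many inference calls the cache avoided. Caching is only appropriate where the same query should produce the same answer; responses that depend on user-specific or fast-changing data need a cache key or expiry policy that reflects that.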

Sustainability Measurement (ESG):

Monitor Energy Consumption: Track the energy consumed per transaction (for example, Wh per request). Use cloud provider tools or third-party calculators.
Report for ESG Goals: Include AI energy consumption in the organisation's ESG reporting. Compare with Green AI benchmarks.
Optimise for Sustainability: Apply the same optimisation techniques (quantisation, caching, batching) to reduce energy consumption alongside cost.
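
Energy per transaction is simple arithmetic once total consumption is known. The sketch below assumes the total energy figure (kWh) comes from a cloud provider tool and that a grid carbon intensity (g CO₂ per kWh) is available for the hosting region; both inputs and both function names are assumptions.

```python
def energy_per_transaction_wh(total_kwh, transactions):
    """Average energy per transaction in watt-hours."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return total_kwh * 1000.0 / transactions


def emissions_kg(total_kwh, grid_intensity_g_per_kwh):
    """Estimated CO2 emissions in kilograms for the given energy use."""
    return total_kwh * grid_intensity_g_per_kwh / 1000.0
```

Tracking these two numbers per month makes the effect of quantisation, caching, or batching visible in the same report as the cost savings.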

Resource Allocation:

Set Up Autoscaling: Configure infrastructure to scale up during peak demand and scale down during low demand. This avoids over-provisioning (waste) and under-provisioning (degraded performance).
Monitor Resource Utilisation: Track CPU, GPU, and memory utilisation. Identify underutilised resources that can be downsized.
Plan Capacity: Use usage trends to forecast future capacity needs. Plan procurement or decommissioning accordingly.
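
Autoscaling is often driven by a proportional rule: scale the replica count by the ratio of observed to target utilisation, then clamp to configured limits. The sketch below follows that rule, which is similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler; the default limits are placeholders.

```python
import math


def desired_replicas(current_replicas, current_utilisation, target_utilisation,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling: replicas grow or shrink with utilisation relative to target."""
    raw = math.ceil(current_replicas * current_utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas at 90% utilisation and a 60% target, the rule asks for 6 replicas; at 30% utilisation it scales down to 2. The upper limit guards against runaway cost and the lower limit against scaling to zero capacity.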

4. Ethical Oversight & Compliance Monitoring

Ongoing legal conformity. The EU AI Act requires post-market surveillance for high-risk systems — but the principle applies to all risk levels.

Post-Market Surveillance:

Continuous Scanning: Monitor production output for unforeseen bias, discrimination, or safety risks. Use automated tools (content filters, toxicity detectors) and human review (Guardian sampling).
Incident Tracking: Log every incident — Hard Boundary violations, user complaints, fairness concerns. Track resolution status and root cause.
Guardian Review: The Guardian performs periodic reviews (monthly for High Risk, quarterly for Limited Risk) of the surveillance findings and recommends actions.

Audit-ready Logging:

Retain Logs: Keep logs of decisions and human interventions for the retention period defined by the risk level:
- Minimal/Limited: standard 90 days, unless otherwise required.
- High Risk: standard 12 months (or longer if legally required).
Ensure Queryability: Logs must be queryable by the AI PM, compliance team, and auditors. Include: date/time, user/role, use case, model version, prompt version, sources used, output, and any human override.
Protect Privacy: Pseudonymise personal data in logs where required by GDPR.
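
An audit log entry with the fields listed above, a pseudonymised user identifier, and a retention date could look like the sketch below. Keyed hashing (HMAC-SHA256) is one possible pseudonymisation technique, not one the page prescribes; treating 12 months as 365 days and the function names are further assumptions.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

# Standard retention periods by risk level (12 months approximated as 365 days).
RETENTION = {
    "minimal": timedelta(days=90),
    "limited": timedelta(days=90),
    "high": timedelta(days=365),
}


def pseudonymise(user_id, secret_key):
    """Keyed hash: stable per user, not reversible without the secret key."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def build_log_entry(*, user_id, role, use_case, model_version, prompt_version,
                    sources, output, human_override, risk_level, secret_key, now=None):
    """Assemble one queryable audit record with its deletion date."""
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),
        "user": pseudonymise(user_id, secret_key),
        "role": role,
        "use_case": use_case,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "sources": list(sources),
        "output": output,
        "human_override": human_override,
        "risk_level": risk_level,
        "delete_after": (now + RETENTION[risk_level]).isoformat(),
    }
```

Because the same user always maps to the same pseudonym, auditors can still trace a sequence of decisions for one person without seeing the identity. The secret key should be stored separately from the logs.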

Transparency Reports:

Produce Reports: Generate periodic reports (quarterly) covering:
- Safety: incidents, Hard Boundary violations, and resolutions.
- Performance: accuracy, drift, and reliability metrics.
- Compliance: audit findings, Guardian reviews, and regulatory updates.
Distribute to Stakeholders: Share reports with the CAIO, Business Sponsor, Guardian, and relevant stakeholders.
Act on Findings: Use the report to identify improvement areas and prioritise the backlog.

Fairness Audit (Bias Audit):

Sample Outputs: The Guardian selects a representative sample of production outputs for review.
Assess Tone and Quality: Evaluate whether outputs are fair, unbiased, and appropriate for the context.
Quantify Where Possible: If relevant groups can be distinguished, measure the difference in Major error rate between groups. The difference should not exceed 10% for Limited Risk or 5% for High Risk.
Document Findings: Record the audit results and any recommended mitigations.
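
The quantitative part of the fairness audit can be sketched as below. One assumption is made explicit: the threshold is read as an absolute gap in percentage points between the highest and lowest group error rate. If the intended measure is relative, the comparison would need to change. All names are hypothetical.

```python
# Maximum allowed gap in Major error rate, in percentage points (assumed reading).
MAX_GAP_PP = {"limited": 10.0, "high": 5.0}


def major_error_rate_pct(outcomes):
    """Share of sampled outputs with a Major error, as a percentage.

    outcomes: list of booleans, True where the output had a Major error.
    """
    return 100.0 * sum(outcomes) / len(outcomes)


def fairness_gap(samples_by_group, risk_level):
    """Return (rates per group, largest gap, whether the gap is within the threshold)."""
    rates = {group: major_error_rate_pct(outcomes)
             for group, outcomes in samples_by_group.items()}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap, gap <= MAX_GAP_PP[risk_level]
```

Small samples make the gap noisy, so the sample size per group is worth recording alongside the result.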

5. Decommissioning

An AI system has a finite lifespan. Define in advance when shutdown is justified and execute a controlled wind-down.

Decommissioning Triggers:

Category	Trigger	Action
Technical	Drift exceeds threshold and retraining does not improve performance	System offline, root cause analysis
Economic	Cost per Productive Outcome rises > 50% above baseline after 2 quarters	CAIO review: stop or re-architect
Ethical/Legal	Critical fairness audit finding or new legislation renders system non-compliant	Immediate stop, Guardian review mandatory
Strategic	Use case disappears due to organisational change or better alternative available	Controlled wind-down per handover plan
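
The trigger table can be evaluated mechanically during periodic reviews. The sketch below is one possible encoding: it assumes the economic trigger means the cost per Productive Outcome was more than 50% above baseline in each of the two most recent quarters, and the `SystemStatus` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SystemStatus:
    drift_exceeds_threshold: bool
    retraining_improved: bool
    baseline_cost: float     # cost per Productive Outcome at baseline
    quarterly_costs: list    # cost per Productive Outcome, most recent quarter last
    critical_fairness_finding: bool = False
    non_compliant: bool = False
    use_case_retired: bool = False


def decommissioning_triggers(status):
    """Return {category: action} for every trigger in the table that has fired."""
    fired = {}
    if status.drift_exceeds_threshold and not status.retraining_improved:
        fired["Technical"] = "System offline, root cause analysis"
    last_two = status.quarterly_costs[-2:]
    # Assumed reading: both of the last two quarters are more than 50% above baseline.
    if len(last_two) == 2 and all(cost > 1.5 * status.baseline_cost for cost in last_two):
        fired["Economic"] = "CAIO review: stop or re-architect"
    if status.critical_fairness_finding or status.non_compliant:
        fired["Ethical/Legal"] = "Immediate stop, Guardian review mandatory"
    if status.use_case_retired:
        fired["Strategic"] = "Controlled wind-down per handover plan"
    return fired
```

The function only reports which triggers apply; the decision and the actions remain with the roles named in the table.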

Decommissioning Process:

Announcement: Inform users and stakeholders in advance (minimum 4 weeks). Explain the reason for decommissioning and the timeline.
Archiving: Retain the technical dossier, validation reports, and Kaizen Log per the retention policy. These may be needed for future audits or reference.
Knowledge Transfer: Document lessons learned in the Lessons Learned register. Share insights with other teams building AI systems.
Data Deletion: Delete or anonymise production data in accordance with GDPR. Document the deletion process and obtain Guardian confirmation.
Infrastructure Shutdown: Shut down compute instances, API keys, and monitoring pipelines. Verify that no residual costs are incurred.
Guardian Sign-off: The Guardian confirms that all Hard Boundary obligations have been fulfilled, including data deletion and archival requirements.

👥 RACI

Role	Responsibility in Monitoring & Optimisation
MLOps Engineer	Responsible: Owner of monitoring pipelines, infrastructure, and stability.
AI Product Manager	Accountable: Guards Business KPIs, manages backlog, and owns the continuous improvement cycle.
Chief AI Officer (CAIO)	Consulted: Evaluates long-term ROI and strategic impact.
Data Scientist	Responsible: Analyses drift, performs retraining, and improves models.
Guardian (Ethicist)	Consulted: Performs ethical reviews, post-market surveillance, and fairness audits.

✅ Exit Criteria (Ongoing Phase)

The Monitoring phase activities are considered "healthy" when:

Dashboards show all metrics within acceptable thresholds.
Drift detection is active with defined alert thresholds.
Retraining schedule is defined and the last retraining event is documented.
Monthly cost review is completed.
Fairness audit is completed per the schedule for the risk level.
Decommissioning triggers are defined and documented.

Collaboration Mode: [Mode X — Name]. Drift thresholds and human checkpoint depend on mode. Required validation for this mode: → See Evidence Standards.

📦 Deliverables

Monitoring Dashboard — real-time metrics for technical, behavioural, and economic performance.
Drift Detection Report — periodic analysis of data drift and concept drift.
Retraining Log — record of retraining events, triggers, and results.
Cost Overview (Monthly) — updated cost analysis with optimisation recommendations.
Transparency Report (Quarterly) — safety, performance, and compliance summary.
Decommissioning Plan — triggers, process, and responsibilities.

Next step: Set drift thresholds and schedule the first quarterly review. → Use the Gate 4 Checklist as your starting point. → See also: Objectives | Drift Detection | Continuous Improvement

Version: 1.1 Date: 07 May 2026 Status: Final
