⚙️ Validation — Activities

Purpose

Overview of the core activities and role assignments during the Validation phase, including the Validation Pilot (PoV) and Business Case preparation.

When to use this?

You have passed Gate 1 and are ready to run the Validation Pilot. This page guides you through assembling the test set, running the experiment, testing reliability, and building the Cost Overview.

🎯 Objective

Execute the Validation Pilot with a representative Golden Set, measure AI performance against the baseline, conduct reliability testing, and produce a Cost Overview that enables an informed Gate 2 decision.

✅ Entry Criteria (Definition of Ready)

Gate 1 (Go/No-Go Discovery) is approved.
The Golden Set is assembled with the minimum number of cases for the risk level (20 for Minimal, 50 for Limited, 150 for High).
The team has access to the models, tools, and data needed for experimentation.

⚙️ Core Activities

1. Validation Pilot (Proof of Value)

A small-scale experiment to test whether the AI understands the specific business context. The pilot is deliberately limited — it tests the core hypothesis, not the full solution.

Step 1 — Assemble the Test Set:

Extract 50–100 representative real-world examples from your data sources, adjusting the count to at least the minimum for your risk level (20 for Minimal, 50 for Limited, 150 for High).
For Limited Risk projects, ensure the test set covers standard cases (80%), complex cases (15%), and adversarial cases (5%).
For each test case, record the expected outcome or assessment criteria. This is the Ground Truth.
Have a domain expert review and approve the test set before running the pilot.
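The size and mix requirements above can be checked mechanically before the domain expert review. The following is a minimal sketch, not a prescribed tool: the function name `check_test_set`, the category labels, and the `tolerance` parameter are assumptions made for illustration. The minimum sizes and the 80/15/5 mix are taken from this page.

```python
from collections import Counter

# Minimum Golden Set sizes per risk level, as listed in the Entry Criteria.
MIN_CASES = {"minimal": 20, "limited": 50, "high": 150}

# Target mix for Limited Risk projects, from Step 1.
TARGET_MIX = {"standard": 0.80, "complex": 0.15, "adversarial": 0.05}


def check_test_set(categories, risk_level, tolerance=0.05):
    """Return a list of problems; an empty list means the set passes.

    `categories` holds one label per test case. `tolerance` is a
    hypothetical allowed deviation (as a fraction) from each target share.
    """
    problems = []
    total = len(categories)
    if total < MIN_CASES[risk_level]:
        problems.append(f"only {total} cases, need {MIN_CASES[risk_level]}")
    counts = Counter(categories)
    for name, target in TARGET_MIX.items():
        share = counts[name] / total if total else 0.0
        if abs(share - target) > tolerance:
            problems.append(f"{name}: {share:.1%} vs target {target:.0%}")
    return problems


# Mix used in the Practical Example on this page: 60 / 11 / 4 out of 75.
example = ["standard"] * 60 + ["complex"] * 11 + ["adversarial"] * 4
print(check_test_set(example, "limited"))  # prints []
```

A check like this only covers size and composition. Whether the cases are representative remains a judgement for the domain expert.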

Step 2 — Baseline Measurement:

Have human operators or the existing system process the same test set.
Record the baseline performance: accuracy, speed, error rate, and any other relevant metrics.
Document the baseline as the benchmark the AI must exceed.

Step 3 — AI Experiment:

Configure the AI with the current Steering Instructions and Knowledge Coupling.
Have the AI process the entire test set.
Record the AI's output for each test case.
Score the AI's output against the Ground Truth using the criteria defined in the Evidence Standards.
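As an illustration of the scoring step, the sketch below compares each recorded output with its Ground Truth and counts errors by severity. It assumes exact-match scoring and a manually assigned severity label per wrong answer. Both are simplifications: the actual criteria are those defined in the Evidence Standards, and the names `ScoredCase` and `summarise` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredCase:
    case_id: str
    expected: str       # Ground Truth recorded in Step 1
    actual: str         # AI output recorded in Step 3
    severity: str = ""  # assumed label for wrong answers: "critical" or "major"


def summarise(cases):
    """Compute factual accuracy (in %) and error counts by severity."""
    errors = [c for c in cases if c.actual != c.expected]
    correct = len(cases) - len(errors)
    return {
        "accuracy_pct": 100.0 * correct / len(cases),
        "critical": sum(1 for c in errors if c.severity == "critical"),
        "major": sum(1 for c in errors if c.severity == "major"),
    }


# Illustrative data only, not results from a real pilot.
cases = [
    ScoredCase("A1", "approve", "approve"),
    ScoredCase("A2", "review", "review"),
    ScoredCase("A3", "decline", "review", severity="major"),
    ScoredCase("A4", "approve", "approve"),
]
print(summarise(cases))  # {'accuracy_pct': 75.0, 'critical': 0, 'major': 1}
```

Keeping one record per test case also yields the per-case scores listed under Deliverables.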

Step 4 — Compare and Conclude:

Compare AI performance against the baseline and the success criteria.
For Minimal Risk: factual accuracy ≥ 98%, 0 critical errors, ≤ 2 major errors.
For Limited Risk: factual accuracy ≥ 99%, 0 critical errors, ≤ 1 major error.
For High Risk: factual accuracy ≥ 99.5%, 0 critical errors, Guardian decides on major errors.
Document the conclusion: does the AI meet the threshold?
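The thresholds above can be written down as a small lookup so the conclusion is reproducible. This is a sketch under stated assumptions: the function name `meets_threshold` and the return labels are hypothetical, and treating a High Risk result with zero major errors as a pass without Guardian involvement is an assumption, not something this page specifies.

```python
# Success criteria per risk level, as listed in Step 4.
THRESHOLDS = {
    "minimal": {"min_accuracy": 98.0, "max_critical": 0, "max_major": 2},
    "limited": {"min_accuracy": 99.0, "max_critical": 0, "max_major": 1},
    # For High Risk the Guardian decides on major errors, so no fixed cap.
    "high": {"min_accuracy": 99.5, "max_critical": 0, "max_major": None},
}


def meets_threshold(risk_level, accuracy_pct, critical, major):
    """Return "pass", "fail", or "guardian review" (hypothetical labels)."""
    t = THRESHOLDS[risk_level]
    if accuracy_pct < t["min_accuracy"] or critical > t["max_critical"]:
        return "fail"
    if t["max_major"] is None:
        # Assumption: with no major errors there is nothing to escalate.
        return "guardian review" if major > 0 else "pass"
    return "pass" if major <= t["max_major"] else "fail"


print(meets_threshold("limited", 97.3, critical=0, major=1))  # fail
print(meets_threshold("minimal", 98.0, critical=0, major=2))  # pass
print(meets_threshold("high", 99.6, critical=0, major=1))     # guardian review
```

A "fail" here is a finding to document, not a prompt to change the test set.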

Do not adjust the test set to improve scores

If the AI does not meet the threshold, the result is valid. Do not remove difficult cases or add easier ones to inflate the score. Instead, investigate why the AI underperforms and decide whether improvement is feasible within the project constraints.

Practical Example

Situation: A mid-sized bank wanted to automate the initial classification of SME loan applications into three categories: "approve for fast-track," "requires manual review," and "decline with standard letter." The existing process took loan officers an average of 45 minutes per application.

Approach: The team assembled a Golden Set of 75 applications (Limited Risk minimum: 50), covering 60 standard cases, 11 complex cases with unusual collateral structures, and 4 adversarial cases with deliberately incomplete financial statements. The baseline measurement showed that loan officers achieved 94% accuracy with an average processing time of 45 minutes. The AI pilot (using a RAG setup with the bank's credit policy documents) scored 97.3% factual accuracy on the first run. This did not meet the Limited Risk threshold of ≥ 99%, but the team noted that 2 of the 3 errors were on adversarial cases where even human reviewers disagreed. Reliability testing across 5 runs showed a variation of only 0.8 percentage points, confirming stability. The bias check revealed no statistically significant difference in error rates across industry sectors.

Result: The Validation Report (template) documented that the AI met the reliability threshold but fell short of the 99% factual accuracy target. The Cost Overview showed a projected ROI of 220% over 18 months based on time savings alone. The Business Sponsor approved Gate 2 on the condition that the team investigate the two adversarial-case errors during Development, either by strengthening the Steering Instructions or by adding specific policy guidance to the Knowledge Coupling. The Technical Model Card (draft) captured the model configuration, prompt version, and Golden Set scores as a baseline for regression testing.

2. Reliability Testing

A statistical check that the results are stable and not the product of chance or a favourable test set composition.

Reproducibility: Run the AI multiple times (minimum 3 runs) on the same test set. Measure the variation in scores. If the variation exceeds 2 percentage points on factual accuracy, the system is not stable enough for production.
Edge Cases: Test the system with unusual, ambiguous, or extreme input. Document how the system responds. Does it refuse appropriately? Does it produce harmful output? Edge cases reveal the boundaries of the AI's capability.
Bias Detection: Analyse whether there are systematic errors in certain categories. For Limited Risk: if relevant groups can be distinguished, the difference in Major error rate between groups must be ≤ 10%. For High Risk: ≤ 5%, plus a described mitigation plan.
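The reproducibility and bias checks reduce to two small calculations, sketched below. Two interpretations are assumed here and should be confirmed against the Evidence Standards: "variation" is taken as the spread between the best and worst run, and the ≤ 10% and ≤ 5% limits are read as percentage-point differences between group error rates. The function names are hypothetical.

```python
def accuracy_spread(run_accuracies_pct):
    """Spread (max minus min) of factual accuracy across runs, in points."""
    if len(run_accuracies_pct) < 3:
        raise ValueError("at least 3 runs are required")
    return max(run_accuracies_pct) - min(run_accuracies_pct)


def major_error_rate_gap(major_errors_by_group, cases_by_group):
    """Largest difference in Major error rate between groups, in points."""
    rates = [
        100.0 * major_errors_by_group.get(group, 0) / n
        for group, n in cases_by_group.items()
        if n > 0
    ]
    return max(rates) - min(rates)


# Illustrative numbers only, not taken from a real pilot.
spread = accuracy_spread([99.0, 98.0, 99.5])
print(spread, "stable" if spread <= 2.0 else "not stable")  # 1.5 stable

gap = major_error_rate_gap({"a": 1, "b": 3}, {"a": 20, "b": 30})
print(gap, "within Limited Risk limit" if gap <= 10.0 else "exceeds limit")
```

With small groups a single error moves the rate by several points, so a gap near the limit is worth inspecting case by case rather than relying on the number alone.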

3. Cost Overview

A complete estimate of investment and operational costs. This enables the Business Sponsor and Finance to make an informed Go/No-Go decision.

Investment Costs

People: Development, training, management (FTEs). Estimate the effort for the Development phase based on the Validation Pilot findings.
Technology: Licences, cloud infrastructure, tools. Include one-time setup costs and recurring subscription fees.
Data: Cleaning, labelling, enrichment. If the Data Evaluation identified quality issues, include the cost of remediation.

Operational Costs (per month/year)

Usage Costs: Cloud/API costs per task or transaction. Estimate based on expected volume and current pricing.
Maintenance: Monitoring, updates, support. Include MLOps effort, model retraining, and incident response.
Risk Costs: Potential costs of errors or incidents. Estimate based on the risk level and the consequences of a Critical error.

Return on Investment (ROI)

Time Savings: How many hours do we save per week/month? Multiply by the hourly cost to get the financial value.
Quality Improvement: Fewer errors, higher customer satisfaction. Quantify where possible (e.g., reduced rework, fewer complaints).
Revenue Growth: New opportunities, faster turnaround. Estimate the incremental revenue attributable to the AI system.
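This page does not prescribe an ROI formula, so the sketch below assumes a common one: net benefit over the horizon divided by total cost over the same horizon. The function names, parameters, and all figures are hypothetical and serve only to show the arithmetic. Finance may require a different method, such as discounting.

```python
def monthly_time_savings_value(hours_saved_per_month, hourly_cost):
    """Financial value of time savings: hours saved times hourly cost."""
    return hours_saved_per_month * hourly_cost


def roi_pct(investment, monthly_operating_cost, monthly_benefit, months):
    """Assumed formula: (total benefit - total cost) / total cost, in %."""
    total_cost = investment + monthly_operating_cost * months
    total_benefit = monthly_benefit * months
    return 100.0 * (total_benefit - total_cost) / total_cost


# Illustrative inputs only.
benefit = monthly_time_savings_value(hours_saved_per_month=200, hourly_cost=50)
print(benefit)  # 10000 per month
print(roi_pct(investment=60_000, monthly_operating_cost=2_000,
              monthly_benefit=benefit, months=18))  # 87.5
```

Quality improvement and revenue growth can be added to `monthly_benefit` once they are quantified. Stating the horizon alongside the percentage keeps the figure comparable across projects.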

Include ecological footprint in the Cost Overview

Sustainability is a cross-cutting concern. Estimate inference and training costs in CO₂ equivalents. Refer to Green AI guidelines for benchmarks.

👥 RACI

Role	Responsibility in Validation
Data Scientist	Responsible: Performing the Validation Pilot and reliability testing.
AI Product Manager	Accountable: Owner of the business case and ROI calculation (Cost Overview).
Business Sponsor	Consulted: Validates the test set and success criteria.
Finance	Consulted: Reviews the cost estimate and ROI calculation.
Stakeholders	Informed: Receive updates on progress.

✅ Exit Criteria (Gate 2 — PoV Investment)

The Validation phase activities are complete when:

Validation Pilot is completed with results documented.
Reliability testing confirms stable results (variation ≤ 2 percentage points).
Cost Overview is completed with ROI calculation.
Validation Report is drafted with evidence against the applicable standards.
Guardian has reviewed fairness check results and approved Hard Boundaries.

Collaboration Mode: [Mode X — Name]. Validate that the mode still matches the risk level at Gate 2. Required validation for this mode: → See Evidence Standards.

📦 Deliverables

Validation Report — pilot results, reliability testing, and conclusion.
Cost Overview — investment, operational costs, and ROI.
Technical Model Card (draft) — model, prompts, and configuration documentation.
Golden Set Test Results — detailed scores per test case.

Next step: Run the Validation Pilot and document the results in the Validation Report.
→ Use the Validation Report as your starting point.
→ See also: Objectives | Business Case | Gate 2 Checklist

Version: 1.1 Date: 07 May 2026 Status: Final
