Cheatsheet — Evidence Standards
Source: Evidence Standards
Evidence Levels
| Level | Description | Example |
|---|---|---|
| L1 — Claim | Assertion without substantiation | "The model is accurate" |
| L2 — Indication | Single measurement or anecdote | One test result |
| L3 — Evidence | Repeatable measurement on representative set | Golden Set score on 200 items |
| L4 — Strong Evidence | Multiple methods, independently validated | Golden Set + human review + A/B test |
Minimum requirement for Gate 2: level L3 or higher.
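The levels are ordered, so the Gate 2 rule reduces to a simple comparison. The following is a minimal sketch, not a prescribed implementation; the names `EvidenceLevel`, `GATE_2_MINIMUM`, and `meets_gate_2` are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """Evidence levels from the table above; a higher value means stronger evidence."""
    CLAIM = 1            # L1: assertion without substantiation
    INDICATION = 2       # L2: single measurement or anecdote
    EVIDENCE = 3         # L3: repeatable measurement on a representative set
    STRONG_EVIDENCE = 4  # L4: multiple methods, independently validated


# Gate 2 minimum as stated above (L3 or higher).
GATE_2_MINIMUM = EvidenceLevel.EVIDENCE


def meets_gate_2(level: EvidenceLevel) -> bool:
    """Return True if the level is L3 or higher."""
    return level >= GATE_2_MINIMUM


print(meets_gate_2(EvidenceLevel.INDICATION))       # False
print(meets_gate_2(EvidenceLevel.STRONG_EVIDENCE))  # True
```

Using `IntEnum` keeps the ordering explicit, so "L3 or higher" does not depend on string comparison of labels.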
Required Evidence per Artefact
| Artefact | Minimum level | Method |
|---|---|---|
| Output quality | L3 | Golden Set + automated metric |
| Fairness | L3 | Segmented analysis per group |
| Safety (High Risk) | L4 | Red Teaming + independent review |
| Latency | L3 | Load test reporting p95 and p99 (p95 is the 95th percentile: 95% of all requests are faster than this value) |
| Cost projection | L2 | Calculator + documented assumptions |
| Traceability | L3 | Audit trail demonstrated |
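The table above can be encoded as a lookup, and the percentiles in the Latency row can be computed from raw load-test samples. The sketch below assumes the nearest-rank percentile definition, which is one of several common conventions; `REQUIRED_LEVEL`, `meets_minimum`, `percentile_nearest_rank`, and the sample latencies are hypothetical and not taken from the source.

```python
import math

# Minimum evidence level per artefact, copied from the table above (L2..L4 as ints).
REQUIRED_LEVEL = {
    "Output quality": 3,
    "Fairness": 3,
    "Safety (High Risk)": 4,
    "Latency": 3,
    "Cost projection": 2,
    "Traceability": 3,
}


def meets_minimum(artefact: str, level: int) -> bool:
    """True if the supplied evidence level reaches the artefact's minimum."""
    return level >= REQUIRED_LEVEL[artefact]


def percentile_nearest_rank(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical load-test latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = [float(i) for i in range(1, 101)]
print(percentile_nearest_rank(latencies_ms, 95))  # 95.0
print(percentile_nearest_rank(latencies_ms, 99))  # 99.0
print(meets_minimum("Safety (High Risk)", 3))     # False
```

Other percentile conventions interpolate between samples and can give slightly different values, so the method used belongs in the evidence documentation alongside the number.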
Evidence Documentation
Each piece of evidence must include at minimum:
- What was measured (metric, definition)
- How measured (method, tool)
- When measured (date, version)
- By whom assessed (reviewer, independence)
- Result (number + comparison with threshold)
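One way to make the checklist above hard to skip is to record each piece of evidence in a fixed structure. This is a sketch under assumptions: the class `EvidenceRecord`, its field names, and the example values are hypothetical illustrations, not a schema defined by the source.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class EvidenceRecord:
    metric: str            # what was measured (metric, definition)
    method: str            # how it was measured (method, tool)
    measured_on: date      # when it was measured
    version: str           # version of the system under test
    reviewer: str          # who assessed it
    independent: bool      # whether the reviewer is independent of the development team
    result: float          # measured value
    threshold: float       # threshold the result is compared against
    higher_is_better: bool = True

    def missing_fields(self) -> list[str]:
        """Names of text fields that were left empty."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]

    def passes(self) -> bool:
        """Compare the result with the threshold in the appropriate direction."""
        if self.higher_is_better:
            return self.result >= self.threshold
        return self.result <= self.threshold


# Hypothetical example values, for illustration only.
record = EvidenceRecord(
    metric="p95 latency in ms",
    method="load test",
    measured_on=date(2024, 1, 15),
    version="",  # left empty on purpose to show the check
    reviewer="QA reviewer",
    independent=True,
    result=420.0,
    threshold=500.0,
    higher_is_better=False,
)
print(record.missing_fields())  # ['version']
print(record.passes())          # True
```

The `higher_is_better` flag is there because some metrics (accuracy) pass above a threshold while others (latency, cost) pass below it.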
Common Mistakes
Insufficient evidence:
- Metric measured on training data instead of independent test set
- No baseline defined ("better than before" is not evidence)
- Only positive results reported (cherry picking)
- Evaluation performed by the development team itself (no independence)
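Some of the mistakes above can be caught mechanically before a review. The sketch below checks three of them (training/test overlap, missing baseline, non-independent reviewer); it does not detect cherry picking, which needs access to the full set of results. The function `evaluation_issues` and its parameters are hypothetical, and it assumes items carry stable identifiers.

```python
def evaluation_issues(train_ids, test_ids, baseline_score, reviewer, dev_team):
    """Return a list of human-readable problems found in an evaluation setup."""
    issues = []

    # Metric measured on training data: any shared item means the test set is not independent.
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        issues.append(f"{len(overlap)} test item(s) also appear in the training data")

    # "Better than before" is not evidence without a defined baseline.
    if baseline_score is None:
        issues.append("no baseline defined")

    # Evaluation by the development team itself lacks independence.
    if reviewer in dev_team:
        issues.append("reviewer is a member of the development team")

    return issues


# Hypothetical inputs, for illustration only.
issues = evaluation_issues(
    train_ids=["a", "b", "c"],
    test_ids=["c", "d"],
    baseline_score=None,
    reviewer="alice",
    dev_team={"alice", "bob"},
)
for issue in issues:
    print(issue)
```

An empty list means none of these three checks fired; it does not by itself show that the evidence is sufficient.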
Source: Evidence Standards | Validation Report