Data Governance
Purpose
Poor data quality is one of the most common reasons AI projects fail. This module provides a concrete framework for data quality, data lineage, data contracts, data versioning and metadata management, so that your AI system rests on a reliable data foundation.
When to use this?
Use this module from the Discovery phase (Phase 1) onward, starting with Data Evaluation. Data governance is not a one-time activity: it runs through all phases. Start early and build incrementally.
DORA: healthy data ecosystems as AI amplifier [so-28]
The DORA AI Capabilities Model (2025) identifies healthy data ecosystems — high-quality, accessible and unified internal data — as one of the seven foundational capabilities that amplify the positive impact of AI adoption. This validates the importance of the data quality framework in this module. See External Evidence: DORA.
1. Data Quality Framework
Data quality is measured along six dimensions. Define concrete thresholds per dimension that match the risk level of the project.
| Dimension | Definition | Measurement Method | Example Threshold |
|---|---|---|---|
| Completeness | All expected records and fields are present | (records with value / total expected records) × 100% | ≥ 95% for critical fields |
| Accuracy | Values correspond to reality | Comparison with trusted sources or manual sample | ≥ 98% on sample of 200 records |
| Consistency | The same facts are represented identically across all systems | Cross-system comparisons, business rule checks | 0 conflicts in primary keys |
| Timeliness | Data is available within the required lead time | Measurement of ingestion latency | ≤ 4 hours for daily batch; ≤ 5 min for near-realtime |
| Uniqueness | No unwanted duplicates | Deduplication analysis on unique keys | ≤ 0.1% duplicates |
| Validity | Values comply with the defined format and domain rules | Schema validation, regex, domain lists | 100% of records match the schema |
Thresholds are project-specific
The example thresholds above are starting points. Adjust them based on risk level: a high-risk system (EU AI Act) requires stricter thresholds than an internal dashboard.
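As a minimal sketch of how three of the dimensions above could be measured, the snippet below computes completeness, uniqueness and validity on a small in-memory sample and compares the results with the example thresholds from the table. The sample records, the `email` field, the deliberately simple e-mail pattern and all function names are assumptions made for illustration; a real pipeline would more likely run such checks in a dedicated validation tool.

```python
import re

# Hypothetical sample records; None marks a missing value.
records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 3, "email": "not-an-email"},
]

# Deliberately simple pattern, used only to illustrate a validity rule.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Percentage of rows where `field` has a value (rows must be non-empty)."""
    filled = sum(1 for row in rows if row.get(field) is not None)
    return 100.0 * filled / len(rows)


def duplicate_rate(rows, key):
    """Percentage of rows that repeat an already-seen value of `key`."""
    seen = set()
    duplicates = 0
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates += 1
        seen.add(value)
    return 100.0 * duplicates / len(rows)


def validity(rows, field, pattern):
    """Percentage of non-missing values of `field` that match `pattern`."""
    values = [row[field] for row in rows if row.get(field) is not None]
    valid = sum(1 for value in values if pattern.match(value))
    return 100.0 * valid / len(values)


def check_thresholds(rows):
    """Compare measured scores with the example thresholds from the table."""
    return {
        "completeness": completeness(rows, "email") >= 95.0,
        "uniqueness": duplicate_rate(rows, "customer_id") <= 0.1,
        "validity": validity(rows, "email", EMAIL_PATTERN) == 100.0,
    }


print(check_thresholds(records))
```

On this sample every check fails: one of four e-mail values is missing, one `customer_id` is repeated and one e-mail value does not match the pattern. Accuracy, consistency and timeliness need external reference data or pipeline timestamps, so they are left out of the sketch.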
2. Data Lineage & Provenance
What is data lineage?
Data lineage is the complete description of the origin, transformations and movements of data — from source to model input and ultimately model output.
Why does it matter?
- Traceability: When unexpected model results occur, you can quickly identify which data is the cause.
- Debugging: Identify exactly where in the pipeline a data error was introduced.
- Compliance: The EU AI Act requires that the provenance of training data is demonstrable for high-risk systems.
- Reproducibility: Without lineage you cannot reliably repeat experiments.
How to implement?
Minimum requirements:
- Every dataset has a unique identifier and version number
- Transformation steps are recorded with input version, output version and timestamp
- Metadata tags include: source, owner, processing date, quality score
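The minimum requirements above can be sketched as a small in-memory lineage log. This is an illustration under stated assumptions, not the data model of any of the tools listed in this module: the class names `DatasetVersion`, `TransformationStep` and `LineageLog` are hypothetical, and each step is assumed to have exactly one input and one output dataset.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetVersion:
    dataset_id: str  # unique identifier of the dataset
    version: str     # version number or content hash


@dataclass(frozen=True)
class TransformationStep:
    name: str
    input: DatasetVersion
    output: DatasetVersion
    timestamp: datetime


@dataclass
class LineageLog:
    steps: list = field(default_factory=list)

    def record(self, name, input_version, output_version):
        """Store one transformation with input version, output version and timestamp."""
        step = TransformationStep(
            name, input_version, output_version, datetime.now(timezone.utc)
        )
        self.steps.append(step)
        return step

    def trace_back(self, target):
        """Walk from `target` back towards its source, newest step first."""
        path = []
        current = target
        while True:
            step = next((s for s in self.steps if s.output == current), None)
            if step is None:
                return path
            path.append(step)
            current = step.input


# Hypothetical three-stage pipeline.
raw = DatasetVersion("customer_interactions_raw", "v1")
clean = DatasetVersion("customer_interactions_clean", "v1")
features = DatasetVersion("customer_features", "v1")

log = LineageLog()
log.record("deduplicate", raw, clean)
log.record("build_features", clean, features)

print([step.name for step in log.trace_back(features)])
```

Tracing back from the feature set lists `build_features` and then `deduplicate`, which is the question lineage has to answer when a model result looks wrong. Steps with several inputs would need a graph rather than a single chain.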
Tooling options:
| Category | Examples | Suitable for |
|---|---|---|
| Lightweight | dbt lineage graph, manual documentation | Small teams, L0-L1 |
| Mid-range | Apache Atlas, DataHub, OpenLineage | Growing organisations, L1-L2 |
| Enterprise | Collibra, Alation, Purview | Large organisations, L2-L3 |
Minimum requirements per risk level:
| Risk Level | Lineage Requirement |
|---|---|
| Low risk | Documentation of sources and main transformations |
| Limited risk | Automated lineage tracking, traceability to source level |
| High risk | Full end-to-end lineage with audit trail, immutable logs |
3. Data Contracts
What are data contracts?
A data contract is a formal agreement between a data producer (the team that delivers data) and a data consumer (the team that uses data). It prevents changes in upstream data from unexpectedly breaking your AI pipeline.
Components of a data contract
| Component | Description | Example |
|---|---|---|
| Schema | Expected fields, data types, nullable rules | customer_id: INT NOT NULL, name: VARCHAR(255) |
| SLA | Availability, refresh frequency, maximum latency | Daily before 06:00 UTC, 99.5% uptime |
| Ownership | Who is responsible for the data? | Customer Service Team (producer), ML Team (consumer) |
| Quality rules | Minimum quality requirements the producer guarantees | Completeness ≥ 98%, no duplicates on customer_id |
| Change policy | How are schema changes communicated? | Minimum 2 sprints advance notice, breaking changes via RFC |
| Escalation procedure | What happens when the contract is violated? | Alert to consumer, incident addressed within 4 hours |
Example contract template
# Data Contract — [Dataset Name]
contract_version: "1.0"
producer:
  team: "Customer Service Team"
  contact: "name@organisation.com"
consumer:
  team: "ML Platform Team"
  contact: "name@organisation.com"
dataset:
  name: "customer_interactions"
  format: "parquet"
  location: "s3://data-lake/customer_interactions/"
schema:
  - field: "customer_id"
    type: "INT"
    nullable: false
  - field: "interaction_date"
    type: "DATE"
    nullable: false
  - field: "channel"
    type: "VARCHAR(50)"
    nullable: false
    allowed_values: ["email", "phone", "chat", "portal"]
sla:
  refresh: "daily before 06:00 UTC"
  availability: "99.5%"
quality_rules:
  completeness: "≥ 98%"
  uniqueness_on: "customer_id + interaction_date"
change_policy: "Breaking changes: minimum 2 sprints advance notice via RFC"
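A contract only protects the pipeline if something checks incoming data against it. The sketch below shows one way the schema section of the template could be enforced on individual records. The schema is written out as Python data rather than parsed from YAML, the mapping of `INT`, `DATE` and `VARCHAR` to Python types is an assumption, and `validate_record` is a hypothetical helper, not part of any library.

```python
from datetime import date

# Schema section of the contract template, written out as Python data.
# The mapping of contract types to Python types is an assumption of this sketch.
CONTRACT_SCHEMA = [
    {"field": "customer_id", "type": int, "nullable": False},
    {"field": "interaction_date", "type": date, "nullable": False},
    {
        "field": "channel",
        "type": str,
        "nullable": False,
        "allowed_values": ["email", "phone", "chat", "portal"],
    },
]


def validate_record(record, schema=CONTRACT_SCHEMA):
    """Return a list of human-readable contract violations for one record."""
    violations = []
    for rule in schema:
        name = rule["field"]
        value = record.get(name)
        if value is None:
            if not rule["nullable"]:
                violations.append(f"{name}: missing but not nullable")
            continue
        if not isinstance(value, rule["type"]):
            violations.append(f"{name}: expected {rule['type'].__name__}")
            continue
        allowed = rule.get("allowed_values")
        if allowed is not None and value not in allowed:
            violations.append(f"{name}: {value!r} not in allowed values")
    return violations


good = {"customer_id": 42, "interaction_date": date(2025, 1, 15), "channel": "chat"}
bad = {"customer_id": None, "interaction_date": "2025-01-15", "channel": "fax"}

print(validate_record(good))
print(validate_record(bad))
```

The first record passes with an empty list; the second yields one violation per field. The SLA and quality rules in the template apply to the dataset as a whole rather than to single records, so they would need separate batch-level checks.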
4. Data Versioning
Why?
Without data versioning you cannot guarantee that a model training run is reproducible. If training data changes without version tracking, debugging and auditing become impossible.
Approach
| Method | Description | When to use |
|---|---|---|
| DVC (Data Version Control) | Git-like versioning for datasets, stores metadata in git and data in remote storage | Small to medium datasets, teams already using git |
| Lakehouse (Delta Lake, Iceberg) | Time-travel via table versioning, ACID transactions on data lake | Large datasets, analytical workloads |
| Snapshots | Periodic copies of datasets with timestamp | Simplest approach, suitable for L0-L1 |
Minimum requirements:
- Every training dataset has a unique version number or hash
- The relationship model version ↔ data version is recorded in the model registry
- Previous versions are queryable for debugging and auditing
- Changes to datasets are logged (what changed, when, by whom)
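The first two requirements above can be illustrated with a content hash and an append-only log. This is a minimal sketch, not how DVC or a lakehouse table format implements versioning: a JSON-lines file stands in for the model registry, and the function names and record fields are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def dataset_hash(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_training_run(registry_path, model_version, data_path, changed_by):
    """Append one model version <-> data version record to a JSON-lines file.

    The file is a stand-in for a model registry (assumption of this sketch).
    """
    entry = {
        "model_version": model_version,
        "data_hash": dataset_hash(data_path),
        "data_path": str(data_path),
        "changed_by": changed_by,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(registry_path, "a", encoding="utf-8") as registry:
        registry.write(json.dumps(entry) + "\n")
    return entry
```

Because the hash depends only on the file contents, retraining on an unchanged file produces the same `data_hash`, while any edit to the data produces a different one. Datasets spread over many files would need a hash over a sorted manifest of per-file hashes instead.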
5. Metadata Management
Good metadata makes data findable, understandable and reusable.
Minimum metadata per dataset
| Metadata Field | Description |
|---|---|
| Name | Unique, descriptive name |
| Description | What does this dataset contain? What is it used for? |
| Owner | Team or person responsible |
| Classification | Public / internal / confidential / secret |
| Schema | Field definitions, data types, constraints |
| Quality score | Current score on the six quality dimensions |
| Provenance | Sources and transformations (link to lineage) |
| Created / Last updated | Timestamps |
| Tags | Free-form tags for discoverability (e.g. customer_data, financial, PII) |
Data catalogue
Start simple
A shared spreadsheet or wiki page with the fields above is a perfectly fine starting point. Scale up to dedicated tooling as the number of datasets grows.
Tooling options: DataHub, Amundsen, Apache Atlas, Collibra, or a simple internal wiki.
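Whatever tool holds the catalogue, the minimum metadata fields from this section can be treated as a record with a completeness check. The sketch below is one possible shape: `CatalogueEntry`, `entry_issues` and `find_by_tag` are hypothetical names, and the example entry is invented for illustration.

```python
from dataclasses import dataclass, field, fields
from datetime import date

ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential", "secret"}


@dataclass
class CatalogueEntry:
    name: str
    description: str
    owner: str
    classification: str
    schema: dict
    quality_score: dict
    provenance: str
    created: date
    last_updated: date
    tags: list = field(default_factory=list)  # free-form, may be empty


def entry_issues(entry):
    """List empty mandatory fields and unknown classification values."""
    issues = []
    for f in fields(entry):
        if f.name != "tags" and not getattr(entry, f.name):
            issues.append(f"missing: {f.name}")
    if entry.classification and entry.classification not in ALLOWED_CLASSIFICATIONS:
        issues.append(f"unknown classification: {entry.classification}")
    return issues


def find_by_tag(catalogue, tag):
    """Names of all catalogue entries carrying `tag`."""
    return [entry.name for entry in catalogue if tag in entry.tags]


# Hypothetical example entry.
entry = CatalogueEntry(
    name="customer_interactions",
    description="Customer contacts per channel (hypothetical example).",
    owner="Customer Service Team",
    classification="internal",
    schema={"customer_id": "INT", "channel": "VARCHAR(50)"},
    quality_score={"completeness": 98.5},
    provenance="link to lineage record",
    created=date(2025, 1, 1),
    last_updated=date(2025, 1, 15),
    tags=["customer_data", "PII"],
)

print(entry_issues(entry))
print(find_by_tag([entry], "PII"))
```

The same check works as a review step for a spreadsheet export: each row becomes one entry, and rows with issues are sent back to the dataset owner.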
6. Practical Checklist per Phase
Phase 1: Discovery
- Data sources inventoried and documented
- Initial quality measurement performed (sample across the six dimensions)
- Data ownership established per source
- Privacy classification assigned (does the data contain PII?)
- Initial data lineage sketched (source → processing → usage)
Phase 2: Validation
- Data contracts established with all relevant producers
- Automated quality controls set up in the pipeline
- Data versioning configured for training sets
- Metadata populated in the data catalogue
- Quality thresholds defined and agreed with the team
Phase 3: Development
- Data contracts actively enforced (monitoring for violations)
- Full lineage tracking operational
- Quality reports automated and visible in dashboards
- Data versioning integrated with model registry
- Metadata up-to-date and searchable
Phase 4+: Monitoring & Ongoing
- Continuous data quality monitoring active
- Drift detection on input data (not just model output)
- Periodic review of data contracts (at least quarterly)
- Data catalogue updated for new or modified datasets
- Audit trail available for compliance reviews
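The checklist item on drift detection for input data can be illustrated with the Population Stability Index (PSI), which compares the distribution of one numeric feature between a reference sample and a current sample. PSI is a swapped-in example: this module does not prescribe a drift metric. The bin count, the floor for empty bins and the function name are assumptions of this sketch.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI of one numeric feature, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Values outside the reference range fall into the first or last bin.
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref = proportions(reference)
    cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values: the current sample is shifted upwards by 50.
reference_sample = list(range(100))
shifted_sample = [value + 50 for value in reference_sample]

print(population_stability_index(reference_sample, reference_sample))
print(population_stability_index(reference_sample, shifted_sample))
```

Identical samples give a PSI of zero, and the shifted sample gives a clearly larger value. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the alert threshold should be agreed per feature and risk level, in the same way as the quality thresholds in section 1.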
7. Related Modules
- Data Pipelines — technical standards for data ingestion, transformation and validation
- Data Evaluation (Phase 1) — initial data quality assessment in the Discovery phase
- Drift Detection — detection of shifts in data and model behaviour
- Data & Privacy Sheet — privacy aspects of data processing
- Evidence Standards — logging and auditability