1. Data Pipelines¶
1. Purpose¶
This module defines the standards for setting up and managing data pipelines that feed AI systems. A robust data pipeline is the backbone of every reliable AI solution.
2. Core Activities¶
Data Ingestion¶
Collecting data from source systems (files, databases, APIs) into a central processing environment.
Minimum requirements:
- Sources are documented (where does the data come from?)
- Access rights are arranged and minimal (least privilege)
- Ingestion is repeatable and automated where possible
- Error handling is implemented (what happens on failed ingestion?)
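The requirements above can be sketched in a minimal ingestion function. The source registry, file paths, and owner names here are hypothetical, assumed for illustration; the point is that every source is documented, ingestion is repeatable, and a failed ingestion logs and stops rather than silently continuing.

```python
import csv
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingestion")

# Hypothetical source registry: every source is documented with a location and owner.
SOURCES = {
    "sales": {"path": "data/sales.csv", "owner": "sales-team"},
}

def ingest(source_name: str) -> list[dict]:
    """Ingest one documented source; fail loudly on errors."""
    source = SOURCES[source_name]
    path = Path(source["path"])
    try:
        with path.open(newline="") as f:
            rows = list(csv.DictReader(f))
    except FileNotFoundError:
        # Error handling: log, then re-raise so the pipeline stops and can alert.
        logger.error("Ingestion failed for %s: %s not found", source_name, path)
        raise
    logger.info("Ingested %d rows from %s (owner: %s)",
                len(rows), path, source["owner"])
    return rows
```

Because `ingest` takes no hidden state beyond the registry, re-running it on the same file is repeatable, and scheduling it (cron, orchestrator) automates it.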
Data Validation & Quality Controls¶
Checking whether incoming data meets expected schemas and quality standards.
Minimum requirements:
- Schema validation: data meets expected format
- Completeness check: critical fields are present
- Range check: values fall within expected bounds
- Anomaly detection: unexpected patterns are flagged
Recommended approach:
| Control Type | Example | Action on Failure |
|---|---|---|
| Critical | Required field missing | Pipeline stops, alert |
| Warning | Value outside expected range | Log, pipeline continues |
| Informational | Statistical deviation from historical baseline | Log for review |
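The control types in the table map naturally onto severity levels in code. The field names and range bounds below are illustrative assumptions, not prescribed values; the structure shows how a critical finding halts the pipeline while a warning only logs.

```python
from dataclasses import dataclass

CRITICAL, WARNING = "critical", "warning"

@dataclass
class Finding:
    severity: str
    message: str

def validate(record: dict) -> list[Finding]:
    """Apply completeness and range checks to one record."""
    findings = []
    # Critical: required field missing -> pipeline stops, alert.
    for field in ("customer_id", "amount"):
        if record.get(field) in (None, ""):
            findings.append(Finding(CRITICAL, f"required field missing: {field}"))
    # Warning: value outside expected range -> log, pipeline continues.
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and not 0 <= amount <= 10_000:
        findings.append(Finding(WARNING, f"amount out of range: {amount}"))
    return findings

def run_controls(record: dict) -> bool:
    """Return True if the pipeline may continue for this record."""
    findings = validate(record)
    for f in findings:
        print(f"[{f.severity}] {f.message}")  # stand-in for real logging/alerting
    return not any(f.severity == CRITICAL for f in findings)
```

Schema validation and anomaly detection follow the same pattern: each check emits findings at a severity, and only critical findings stop the run.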
Data Transformation¶
Converting raw data into a usable format for the AI model.
Minimum requirements:
- Transformation logic is documented and version-controlled
- Personally identifiable information (PII) is pseudonymised where necessary
- Transformations are reproducible (same input = same output)
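A minimal sketch of a reproducible transformation with pseudonymisation, assuming a salted hash is an acceptable pseudonymisation technique for the data in question (whether it is depends on the applicable privacy rules; the field names are hypothetical). Same input and salt always yield the same output, satisfying the reproducibility requirement.

```python
import hashlib

def pseudonymise(value: str, salt: str) -> str:
    """Deterministic pseudonym: same input + salt -> same token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def transform(record: dict, salt: str) -> dict:
    """Convert a raw record into a model-ready record; PII is replaced."""
    return {
        "customer_token": pseudonymise(record["email"], salt),  # PII removed
        "amount_eur": round(float(record["amount"]), 2),
    }
```

Keeping `transform` as a pure function (no clock, no randomness, no external state) is what makes "same input = same output" hold; the salt itself must be stored and access-controlled separately.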
Versioning & Reproducibility¶
Tracking data versions so that results are traceable.
Minimum requirements:
- Datasets are tagged with version numbers or timestamps
- Relationship between data version and model version is recorded
- Historical data is queryable for debugging/auditing
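These requirements can be met without dedicated tooling by tagging datasets with a content hash and recording the link to the model version in an append-only manifest. The manifest filename and entry fields below are assumptions for illustration; tools like DVC or Delta Lake formalise the same idea.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def dataset_version(path: Path) -> str:
    """Content hash as version tag: identical bytes -> identical version."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]

def record_lineage(data_path: Path, model_version: str, manifest: Path) -> dict:
    """Append one lineage entry linking a data version to a model version."""
    entry = {
        "data_version": dataset_version(data_path),
        "model_version": model_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with manifest.open("a") as f:
        f.write(json.dumps(entry) + "\n")  # append-only: history stays queryable
    return entry
```

Because the manifest is line-delimited JSON, historical entries remain queryable for debugging and auditing with standard tools.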
3. Basic vs Advanced¶
| Aspect | Basic (L0-L1) | Advanced (L2-L3) |
|---|---|---|
| Ingestion | Manual or scheduled batch | Event-driven, real-time where needed |
| Validation | Manual sampling | Automated controls in pipeline |
| Transformation | Scripts in repository | Documented, tested transformations |
| Versioning | File names with date | Data versioning tools (DVC, Delta Lake) |
| Monitoring | Periodic manual check | Dashboards with alerts |
4. Integration with Governance¶
- Traceability: Every model output must be traceable to the data version used.
- Privacy: Apply the rules from Data & Privacy Sheet to the pipeline.
- Logging: Log data ingestion and transformations according to Evidence Standards.
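For the logging point, a small helper can emit one structured record per pipeline step. The exact fields required are defined by the Evidence Standards referenced above; the `event` and detail names here are illustrative assumptions.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("pipeline.audit")

def audit_log(event: str, **details) -> str:
    """Emit one machine-readable audit record for a pipeline step."""
    record = {
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **details,  # e.g. source, row count, data_version
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Logging a `data_version` field in each entry is what ties this back to the traceability requirement: a model output can then be walked back to the exact data it was built on.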
5. Go-Live Checklist¶
- Data ingestion runs stably in production environment
- Quality controls are implemented and tested
- Transformation logic has been reviewed and documented
- Data versioning is set up
- Monitoring and alerting are active
- Privacy measures are implemented and validated
6. Related Modules¶