Reliable agentic AI depends on reliable data. Adopted from lakeFS, Microsoft CAF, and Databricks.
| Foundation | Why it matters | Owner |
|---|---|---|
| Data classification (public / internal / confidential / PII / regulated) | The basis for agent scoping and DLP | Platform / Security |
| Data lineage (where data comes from, how it transforms) | Required for audits and incident response | Platform |
| Golden datasets (curated, validated benchmarks for evals) | Lets us know whether agents are getting better or worse over time | CoE |
| Data version control (reproducibility, rollback) | Lets us roll back data, not just code, when something breaks | Platform |
| Validation gates at ingestion | Stops low-quality / biased data poisoning shared pipelines | Platform |
| Semantic consistency across systems | "Customer" must mean the same thing in CRM and ERP, or agents reason inconsistently | CoE + Data team |
Agents are downstream of data. Garbage in stays garbage out — except faster, at higher volume, with less human checkpointing.