Overview
A practical framework for Raw, Cleaned, and Curated layers, including traceability, quality checks, metric definitions, and AI-ready context.
- When to preserve raw source fidelity and when to standardize.
- How curated datasets become reusable data products.
- Why AI workflows require table schemas, business glossary, and historical reports.
Foundation
Keep the raw layer auditable
The raw layer should preserve source shape, arrival time, and ownership metadata so teams can replay ingestion and resolve data disputes without guessing.
- Capture source identifiers, file or stream offsets, and ingestion timestamps.
- Avoid business transformations before audit and replay requirements are satisfied.
Quality
Move standardization into governed cleaned layers
Cleaned data is where schema normalization, entity alignment, quality checks, and operational exceptions become explicit platform behavior.
- Validate freshness, uniqueness, completeness, and referential consistency.
- Record rejected records and quality exceptions as first-class operational data.
AI Context
Publish curated datasets as reusable data products
AI analytics depends on more than tables. Curated products should include metric definitions, business glossary terms, lineage, ownership, and retrieval-ready documentation.
- Attach semantic descriptions to tables, columns, metrics, and allowed joins.
- Expose historical reports and decision notes as retrievable business context.