| name | data-validation |
| description | **DATA VALIDATION SKILL** — Validate and enrich dataset sources using official reference files. USE FOR: ensuring data consistency with master sources like Subproceso-Proceso-Area.xlsx; validating process hierarchies; enriching datasets with official process/subprocess mappings; checking data source integrity. DO NOT USE FOR: general data analysis; creating new data sources; modifying existing business logic. INVOKES: pandas operations, file system tools for data validation and enrichment. |
Data Validation Skill
Overview
This skill provides reusable logic for validating and enriching datasets using official reference sources, particularly Subproceso-Proceso-Area.xlsx as the master source for process and subprocess hierarchies.
Key Components
Process Hierarchy Validation
The skill implements the official hierarchy where:
- Proceso = Parent (Padre)
- Subproceso = Child (Hijo)
Data Enrichment Logic
def enrich_with_process_hierarchy(df: pd.DataFrame, excel_path: Path) -> pd.DataFrame:
"""
Enrich dataset with official process hierarchy from Subproceso-Proceso-Area.xlsx
Args:
df: Dataset with 'Subproceso' column
excel_path: Path to Subproceso-Proceso-Area.xlsx
Returns:
Enriched DataFrame with 'Proceso', 'Unidad', 'Tipo de proceso' columns
"""
try:
df_procesos = pd.read_excel(excel_path, engine="openpyxl")
df_procesos["Subproceso_norm"] = df_procesos["Subproceso"].astype(str).str.strip()
df["Subproceso_norm"] = df["Subproceso"].astype(str).str.strip()
df = df.merge(
df_procesos[["Subproceso_norm", "Proceso", "Unidad", "Tipo de proceso"]]
.drop_duplicates("Subproceso_norm"),
on="Subproceso_norm",
how="left",
suffixes=("", "_excel")
)
df["Proceso"] = df["Proceso_excel"].fillna(df["Proceso"])
df["Unidad"] = df["Unidad"].fillna("")
df["Tipo de proceso"] = df["Tipo de proceso"].fillna("")
df = df.drop(columns=["Subproceso_norm", "Proceso_excel"])
except Exception as e:
st.warning(f"Could not enrich with process hierarchy: {e}")
return df
Validation Functions
def validate_process_sources(dataset_df: pd.DataFrame, excel_path: Path) -> dict:
"""
Validate that dataset processes match official Excel source
Returns:
{
"excel_processes": int,
"dataset_processes": int,
"missing_in_dataset": set,
"extra_in_dataset": set,
"hierarchy_consistent": bool
}
"""
df_excel = pd.read_excel(excel_path, engine="openpyxl")
excel_processes = set(df_excel['Proceso'].dropna().str.strip().str.upper())
dataset_processes = set(dataset_df['Proceso'].dropna().str.strip().str.upper())
return {
"excel_processes": len(excel_processes),
"dataset_processes": len(dataset_processes),
"missing_in_dataset": excel_processes - dataset_processes,
"extra_in_dataset": dataset_processes - excel_processes,
"hierarchy_consistent": len(excel_processes - dataset_processes) == 0
}
Usage Examples
In Data Loader
@st.cache_data(ttl=300, show_spinner="Loading and validating data...")
def cargar_dataset() -> pd.DataFrame:
from .data_validation import enrich_with_process_hierarchy
df = enrich_with_process_hierarchy(df, DATA_RAW / "Subproceso-Proceso-Area.xlsx")
return df
Validation Check
validation_result = validate_process_sources(df, DATA_RAW / "Subproceso-Proceso-Area.xlsx")
if not validation_result["hierarchy_consistent"]:
st.warning(f"⚠️ {len(validation_result['missing_in_dataset'])} processes missing from dataset")
Integration Points
- services/data_loader.py: Use
enrich_with_process_hierarchy() in cargar_dataset()
- Data validation: Call
validate_process_sources() to check data integrity
- Process filters: Ensure UI filters use validated process lists
File Structure
.github/skills/data-validation/
├── SKILL.md # This file
├── data_validation.py # Reusable functions
└── test_validation.py # Unit tests
Dependencies
- pandas
- openpyxl
- pathlib
- streamlit (for UI feedback)
Best Practices
- Cache enrichment: Use
@st.cache_data for expensive operations
- Error handling: Gracefully handle missing files or columns
- Normalization: Always strip and normalize text for reliable matching
- Validation: Run validation checks in development/testing environments
- Documentation: Update this skill when new reference sources are added