Operational framework for the DAAF orchestrator. Defines engagement modes, confirmation protocol, subagent dispatch, context budget, and reference-loading. Loaded exclusively by the orchestrator — not for subagents or user questions.
Data science methodology and method-selection routing for quantitative research. Covers EDA, data validation, descriptive analysis, causal inference (IV, DiD, RD, synthetic control), clustering/PCA/UMAP, supervised ML, geospatial analysis, and visualization design. Contains the canonical method-to-library routing tree: statsmodels (OLS/GLM/time series), pyfixest (FE/DiD), linearmodels (RE/GMM/SUR), svy (complex surveys), scikit-learn (clustering/prediction ML), geopandas (spatial). For implementation syntax, load the routed tool-specific skill.
Interpretation guidance for Urban Institute Education Data Portal datasets. The Portal is a curation layer over federal data: lowercase variable names, integer-encoded categoricals, standardized missing codes (-1 missing, -2 not applicable, -3 suppressed). Covers year definitions, grade encoding (grade=-1 is Pre-K, not missing), suppression rates, ODC-By licensing, and cross-source join identifiers. Load before analyzing any Portal data. Routes to source-specific deep-dive skills for individual datasets.
NHGIS — census geography crosswalks via Portal: links schools (ncessch) and colleges (unitid) to census tracts, block groups, CBSAs, and regions (1990-2020). Portal provides geography linkage tables ONLY — census demographic variables (population, income, poverty, race, educational attainment) are NOT available through the Portal and must be accessed directly from NHGIS (free IPUMS registration required). Use for linking education or institutional data to census geography for contextual analysis.
County Presidential Returns 2000-2024 (MIT MEDSL). Vote shares, party trends, turnout by county_fips (joins census/education data). Requires HARVARD_DATAVERSE_API_KEY set via environment_settings.txt. Critical: naive mode='TOTAL' filtering silently drops ~1,000 counties in post-2020 data where states report by vote mode (absentee, election-day, provisional) instead of totals — use 3-pattern reconstruction (TOTAL-present rows kept, breakdown-only counties summed across modes, empty-string mode rows reclassified). Categorical variables use uppercase strings, not Portal integer codes.
Reactive Python notebook system with cell reactivity, UI elements, SQL cells, plotting, and app deployment. DAAF's standard notebook format — stored as Git-friendly .py files, not .ipynb. For DAAF pipelines: Stage 9 notebooks compile existing executed scripts into cells verbatim as audit artifacts — no new analysis code or interactive widgets. Use when assembling Stage 9 research notebooks, building standalone interactive data apps, or converting Jupyter notebooks to marimo format.
High-performance data manipulation with lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, and pandas interop. Covers performance optimization patterns and common anti-patterns. DAAF's default DataFrame library — all pipeline code uses Polars, not pandas. Use for any DataFrame operation, reading/writing Parquet files, or migrating existing pandas code to Polars.
Standards for Bash and PowerShell scripts in DAAF: preambles, quoting, error handling, cleanup, testing. Use when writing or reviewing .sh/.ps1 files (hooks, lifecycle, utilities). Not runtime safety — that is bash-safety.sh hook.