| name | metadata-manager |
| description | Use this skill when creating or updating DAG configurations (dags.yaml), schema.yaml, and metadata.yaml files for BigQuery tables. Handles creating new DAGs when needed and coordinates test updates when queries are modified (invokes sql-test-generator as needed). Works with bigquery-etl-core, query-writer, and sql-test-generator skills. |
- Composable: Works with bigquery-etl-core (for conventions), query-writer (for queries), and sql-test-generator (for test updates)
- When to use: Creating/updating DAG configurations, schema.yaml and metadata.yaml files, coordinating test updates when queries are modified
Generate and manage schema.yaml files, metadata.yaml files, and DAG configurations following Mozilla BigQuery ETL conventions. This skill handles:

- Creating and updating schema.yaml and metadata.yaml files for BigQuery tables
- Finding or creating DAG configurations in dags.yaml
- Discovering field descriptions from the Glean Dictionary (for _live/_stable tables), the /sql directory, or DataHub

For comprehensive documentation, see the references below.
BEFORE creating or modifying metadata/schema/DAG files, READ these references:
Schema Discovery: READ references/schema_discovery_guide.md (for schema generation)
DAG Discovery: READ references/dag_discovery.md
DAG Creation: READ references/dag_creation_guide.md (when creating new DAGs)
Metadata YAML Guide: READ references/metadata_yaml_guide.md
Schema YAML Guide: READ references/schema_yaml_guide.md
Schema Description Improvements: READ references/schema_description_improvements.md
Glean Dictionary Patterns: READ references/glean_dictionary_patterns.md (for _live/_stable tables)
When creating a new DAG, READ and COPY from these templates:
Daily scheduled DAG? → READ assets/dag_template_daily.yaml
Hourly scheduled DAG? → READ assets/dag_template_hourly.yaml
Custom schedule DAG? → READ assets/dag_template_custom.yaml
When creating metadata.yaml, READ and COPY from these templates:
Daily partitioned table? → READ assets/metadata_template_daily_partitioned.yaml
Hourly partitioned table? → READ assets/metadata_template_hourly_partitioned.yaml
Full refresh table? → READ assets/metadata_template_full_refresh.yaml
When creating schema.yaml, READ and COPY from these templates:
Simple flat schema? → READ assets/schema_template_basic.yaml
Nested/repeated fields? → READ assets/schema_template_with_nested.yaml
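For nested/repeated fields, schema.yaml uses RECORD types with nested fields lists. A minimal sketch (field names and descriptions are illustrative; copy the actual structure from the template above):

```yaml
fields:
- name: events
  type: RECORD
  mode: REPEATED
  description: Events recorded for this client.
  fields:
  - name: category
    type: STRING
    mode: NULLABLE
    description: Event category.
  - name: timestamp
    type: TIMESTAMP
    mode: NULLABLE
    description: When the event occurred.
```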
Step 1: Find or create the appropriate DAG (use this priority order)
1. Search existing dags.yaml files using grep for keywords related to the dataset/product
2. Check references/dag_discovery.md for common DAG patterns
3. Check references/dag_creation_guide.md for when to create vs reuse
4. If creating, copy from assets/dag_template_daily.yaml, dag_template_hourly.yaml, or dag_template_custom.yaml
5. Add the new entry to dags.yaml
6. Validate with ./bqetl dag validate <dag_name>

Step 2: Create directory structure
./bqetl query create <dataset>.<table> --dag <dag_name>
Step 3: Create metadata.yaml
- Copy a template from assets/ as a starting point
- See references/metadata_yaml_guide.md for detailed options

Step 4: Create schema.yaml with intelligent descriptions
- Run ./bqetl query schema update <dataset>.<table> to auto-generate the structure
- Discover schema structure and descriptions (see references/schema_discovery_guide.md):
  1. Check the /sql directory for schema structure (ALWAYS FIRST)
  2. Use DataHub only when needed (not in /sql)
  3. Look up _live/_stable table descriptions via https://dictionary.telemetry.mozilla.org/
  4. Check /sql directory schema.yaml files for derived tables
- Apply base schema descriptions:
  ./bqetl query schema update <dataset>.<table> --use-global-schema
  ./bqetl query schema update <dataset>.<table> --use-dataset-schema --use-global-schema
- See references/schema_description_improvements.md for the improvement workflow
- See references/glean_dictionary_patterns.md for token-efficient Glean Dictionary usage

Step 5: Deploy schema (if updating existing table)
./bqetl query schema deploy <dataset>.<table>
When modifying an existing query.sql, follow the test update workflow:
1. Check for existing tests:
ls tests/sql/<project>/<dataset>/<table>/
2. Update the query.sql file with your changes
3. Update schema.yaml if output changed:
./bqetl query schema update <dataset>.<table>
4. If tests exist, update them (invoke the sql-test-generator skill as needed; see the layout sketch below)
5. Run tests:
pytest tests/sql/<project>/<dataset>/<table>/ -v
6. Handle test failures (see the workflow doc referenced below)
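For reference, a minimal test layout might look like this (a sketch under the usual bigquery-etl conventions; file names and values are illustrative, and the referenced workflow doc is authoritative):

```yaml
# tests/sql/<project>/<dataset>/<table>/test_basic/
#   <source_dataset>.<source_table>.yaml   # mocked input rows for each source table
#   expect.yaml                            # expected query output
#
# expect.yaml - one list item per expected output row (illustrative):
- submission_date: "2024-01-01"
  client_count: 2
```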
See ../query-writer/references/test_update_workflow.md for complete details and the query modification checklist.
Before creating a new DAG, always search for existing DAGs that could be reused; create a new one only when nothing suitable exists.
DO NOT create a new DAG if an existing DAG covers the same product area or schedule.
When no suitable DAG exists:
1. Present options to the user (reuse an existing DAG vs create a new one)
2. Choose appropriate template:
- Daily (assets/dag_template_daily.yaml) - Most common for daily aggregations
- Hourly (assets/dag_template_hourly.yaml) - For real-time/frequent processing
- Custom (assets/dag_template_custom.yaml) - Weekly, multi-hour, or specific times

3. Gather required information:
- DAG name following the convention bqetl_<product> or bqetl_<product>_<schedule>

4. Add DAG to dags.yaml:
- Append the new entry to dags.yaml

5. Validate:
./bqetl dag validate <dag_name>
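For orientation, a new dags.yaml entry generally looks like the following (a sketch based on the templates in assets/; the DAG name, owner, schedule, and tag are illustrative):

```yaml
bqetl_example_product:            # hypothetical DAG name following bqetl_<product>
  schedule_interval: 0 4 * * *    # daily at 04:00 UTC
  description: |
    Derived tables for the example product.
  default_args:
    owner: example@mozilla.com
    email:
      - example@mozilla.com
    start_date: "2024-01-01"
    retries: 2
    retry_delay: 30m
  tags:
    - impact/tier_2
```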
See references/dag_creation_guide.md for complete details.
Schema files define BigQuery table structure with field names, types, modes, and descriptions.
Quick reference:
- fields: a top-level list of field definitions (name, type, mode, description); a minimal sketch follows
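A minimal schema.yaml sketch (field names and description text are illustrative):

```yaml
fields:
- name: submission_date
  type: DATE
  mode: NULLABLE
  description: The date the data was submitted.
- name: client_id
  type: STRING
  mode: NULLABLE
  description: Unique identifier for the client.
```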
Location:
- bigquery_etl/schema/global.yaml (common telemetry fields)
- bigquery_etl/schema/<dataset_name>.yaml (dataset-specific fields)

Apply base schema descriptions:
```bash
# Use global schema for common fields (submission_date, client_id, etc.)
./bqetl query schema update <dataset>.<table> --use-global-schema

# Use dataset-specific schema (e.g., ads_derived.yaml for ads data)
./bqetl query schema update <dataset>.<table> --use-dataset-schema

# Use both (dataset schema takes priority over global schema)
./bqetl query schema update <dataset>.<table> --use-dataset-schema --use-global-schema
```
How it works:
- Field names match directly, or via alias (e.g., a query field named sub_date matches submission_date via the alias sub_date → submission_date)
- Priority order: Dataset schema → Global schema → Missing description warning
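For example, after running with --use-global-schema, a field that matched a base schema entry might end up in schema.yaml like this (a sketch; the description text is illustrative, not the actual global.yaml wording):

```yaml
fields:
- name: submission_date          # matched a global.yaml entry by name
  type: DATE
  mode: NULLABLE
  description: The date the data was submitted.  # copied from the base schema
```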
Example global.yaml fields:
- submission_date, client_id, country, sample_id, normalized_channel
- dau, wau, mau (activity metrics)
- attribution_campaign, attribution_source, attribution_medium
- Full list: bigquery_etl/schema/global.yaml

Example ads_derived.yaml fields:
- impressions, clicks, revenue, cpm, click_rate
- campaign_id, advertiser, creative_id, flight_id
- partner_name, provider, rate_type
- Full list: bigquery_etl/schema/ads_derived.yaml

Preview before applying (use helper script):
# Preview what descriptions would be applied without making changes
python scripts/preview_base_schema.py <dataset>.<table>
When generating schemas for derived tables, follow this workflow:
1. Discover source table schemas (use priority order):
- Check the /sql directory for schema structure (ALWAYS FIRST)
- Use DataHub only when needed (not in /sql)

2. Discover descriptions (use priority order):
- Look up _live/_stable table descriptions via https://dictionary.telemetry.mozilla.org/
- Check /sql directory schema.yaml files for derived tables

3. Apply base schema descriptions (RECOMMENDED for standard fields):
- Run ./bqetl query schema update <dataset>.<table> --use-global-schema
- For ads data, add --use-dataset-schema (applies ads_derived.yaml)
- Covers common fields like submission_date, client_id, etc.
- Preview first with python scripts/preview_base_schema.py <dataset>.<table>

4. Check for ANY missing descriptions:
5. Use source descriptions as a base.
6. Improve descriptions when needed (see references/schema_description_improvements.md).
7. Recommend source updates for existing descriptions that can be improved.
See references/schema_discovery_guide.md for the full discovery workflow across /sql, the Glean Dictionary, and DataHub.
See references/glean_dictionary_patterns.md for token-efficient Glean Dictionary usage.
See references/schema_description_improvements.md for the description improvement workflow.
Auto-generate from query:
./bqetl query schema update <dataset>.<table>
See references/schema_yaml_guide.md for complete details.
Metadata files define table ownership, scheduling, partitioning, and BigQuery configuration.
Required fields:
- friendly_name - Human-readable table name
- description - Multi-line description of purpose
- owners - List of email addresses and/or GitHub teams (e.g., mozilla/ads_data_team)
python scripts/detect_teams.py
This script scans existing metadata.yaml files under /sql to suggest owners.

Common sections:
- labels - application, schedule, table_type, dag, owner1
- scheduling - dag_name, date_partition_parameter, start_date
- bigquery.time_partitioning - type, field, expiration_days
- bigquery.clustering - fields for clustering (max 4)

See references/metadata_yaml_guide.md for complete details.
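For orientation, a minimal metadata.yaml combining these sections might look like this (a sketch based on the templates in assets/; all names and values are illustrative):

```yaml
friendly_name: Example Users
description: |
  Daily aggregation of users for the example product.
owners:
- example@mozilla.com
- mozilla/example_team
labels:
  application: example_product
  schedule: daily
  dag: bqetl_example_product
  owner1: example
scheduling:
  dag_name: bqetl_example_product
  date_partition_parameter: submission_date
bigquery:
  time_partitioning:
    type: day
    field: submission_date
    expiration_days: null
  clustering:
    fields:
    - country
```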
Quick command reference, covering both creating a new table and updating an existing query:
```bash
# Create new query directory with templates
./bqetl query create <dataset>.<table> --dag <dag_name>

# Auto-generate/update schema from query output
./bqetl query schema update <dataset>.<table>

# Deploy schema to BigQuery (updates existing table)
./bqetl query schema deploy <dataset>.<table>

# Validate query and metadata
./bqetl query validate <dataset>.<table>

# Run tests
pytest tests/sql/<project>/<dataset>/<table>/ -v
```
CRITICAL: Only use documented configurations or patterns from existing metadata files.
Acceptable sources are existing metadata files under /sql and the templates in assets/.

When uncertain about a configuration:
- Check references/metadata_yaml_guide.md
- Search existing usage: grep -r "field_name" sql/*/metadata.yaml
- Use the GitHub team format (e.g., mozilla/team_name)
- Run python scripts/detect_teams.py to find relevant GitHub teams (see "Script Maintenance" for testing)
- Run ./bqetl query schema update <dataset>.<table> --use-global-schema for common fields (submission_date, client_id, etc.)

This skill provides helper scripts in scripts/:
- detect_teams.py - Find GitHub teams from metadata.yaml files
- datahub_lineage.py - Generate DataHub lineage query parameters
- preview_base_schema.py - Preview base schema matches before applying

See references/script_maintenance.md for testing and troubleshooting details.
In the repository:
- sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/metadata.yaml
- sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/metadata.yaml
- sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/schema.yaml
- sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/schema.yaml

In this skill:
- assets/ directory for metadata and schema templates
- references/ directory for complete documentation

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:
- READ ../bigquery-etl-core/references/datahub_best_practices.md - Token-efficient patterns for schema and DAG discovery

Schema Discovery Priority:
1. Check the /sql directory first (schema.yaml files)
2. Use DataHub only when needed (not in /sql)

Description Discovery Priority (for _live/_stable tables):
1. Glean Dictionary (https://dictionary.telemetry.mozilla.org/)
2. DataHub
IMPORTANT: Glean Dictionary provides descriptions only, NOT schemas. Always get schema structure from /sql or DataHub.
Use DataHub for:
- Schema structure not available in /sql
- Descriptions not found in the Glean Dictionary (for _live/_stable tables) or /sql

For lineage queries, use the helper script:
python scripts/datahub_lineage.py <table_identifier>
This provides parameters for efficient DataHub queries that return only essential lineage information.
See "Script Maintenance" section below for testing and troubleshooting.