| name | source-system-analyser |
| description | Analyze source systems for ingestion readiness and data quality across databases, APIs, and flat files. Use when generating schema metadata, identifying data quality risks, mapping source structures, evaluating delete/late-arrival/timezone behavior, or producing a normalized schema.json contract for downstream ingestion. Triggers: analyze, profile, assess, audit, inspect source, ingestion readiness check, schema contract, data quality risks, source system, database analysis, API analysis, flat file, CSV schema, volume projection, capacity forecast. |
Source System Analyser
Quick Routing
- Determine source type (database, API, or flat file).
- For database sources, run the database preflight before analysis.
- Route to the matching module.
- Produce normalized output.
Source routes:
- Database sources: references/databases/postgresql/README.md, references/databases/mssql/README.md, references/databases/oracle/README.md
- API sources: references/apis/test-api/README.md (current provider); fallback: references/apis/generic/README.md
- Flat file sources (CSV/Excel): references/flat/generic/README.md
- Volume and capacity forecasting: references/volume-projection/README.md
- Unknown or mixed source types: start with references/routing.md
Shared Requirements
Load these before executing any source workflow:
- Prerequisites: references/shared/prerequisites.md
- Output contract: references/shared/output-schema.md
- Classification review workflow: references/shared/classification-review-workflow.md
- Troubleshooting: references/shared/troubleshooting.md
Database Preflight
Before starting database analysis:
- Check whether db-analysis-config.json already exists in the working directory.
- If it exists, reuse it and do not ask database exclusion or row-limit questions again.
- If it does not exist, ask the user:
- whether they want to exclude any schemas
- whether they want to exclude any tables
- whether they want to set a maximum row limit
- If the user answers yes to any of those, create db-analysis-config.json with this shape:
{
"exclude_schemas": [],
"exclude_tables": [],
"max_row_limit": null
}
- If the user answers no to all three questions, do not create the JSON file.
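A populated db-analysis-config.json might look like this (the schema and table names below are placeholders, not recommendations):
{
  "exclude_schemas": ["staging", "audit"],
  "exclude_tables": ["public.event_log"],
  "max_row_limit": 100000
}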
This preflight applies only to database sources. API and flat-file workflows should not ask these questions.
Description Enrichment Continuation
After database analysis writes schema.json, check whether any table has an empty table_description or any column has an empty column_description.
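A minimal sketch of that check, assuming schema.json holds a top-level "tables" list whose entries carry a "columns" list (those two keys are an assumed layout, not the actual contract):
import json

def has_blank_descriptions(path="schema.json"):
    """Return True if any table_description or column_description is blank."""
    with open(path) as fh:
        schema = json.load(fh)
    for table in schema.get("tables", []):  # "tables" key is an assumed layout
        if not (table.get("table_description") or "").strip():
            return True
        for column in table.get("columns", []):  # "columns" key is likewise assumed
            if not (column.get("column_description") or "").strip():
                return True
    return False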
If any descriptions are missing:
- Build a checklist file:
python3 scripts/build_description_enrichment_checklist.py schema.json
- Use the checklist as the ordered worklist for missing descriptions.
- Work table by table in checklist order.
- For each table:
- complete missing column_description items first
- query up to 3 sample rows per unresolved column when needed (3 rows give enough context to infer a column's meaning without pulling significant data volume)
- write generated column descriptions into each checklist item's proposed_description
- only after that table's column descriptions are complete, generate the table's table_description from the completed column descriptions for that table
- Do not do a separate table-query step unless the column-level context is still insufficient.
- Merge the checklist back into the main analyzer JSON:
python3 scripts/apply_description_enrichment.py schema.json schema_description_checklist.json
Do not treat the analysis as complete while the final analyzer JSON still has blank table or column descriptions.
Backward Compatibility
Keep the existing database analyzer entrypoint unchanged:
.venv/bin/python scripts/source_system_analyzer.py <database_url> <output_json_path> [schema] [--dialect postgresql|mssql|oracle]
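For example, against a local PostgreSQL instance (host, database, and schema names here are placeholders; see the credential caveat under Common Mistakes before putting a password in the URL):
.venv/bin/python scripts/source_system_analyzer.py postgresql://analyst@localhost:5432/appdb schema.json public --dialect postgresql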
The merged API and tabular flows are now available directly inside this skill:
- API script: scripts/apis/api_reader.py
- Test API wrapper: scripts/apis/test_api/test_api_reader.py
- API analyzer: scripts/apis/api_analyzer.py
- Tabular schema script: scripts/flat/tabular_schema_json.py
- Volume projection collector: scripts/volume_projection/collector.py
- Volume projection predictor: scripts/volume_projection/predictor.py
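Illustrative call for the tabular flow (the argument order shown is an assumption, not the script's documented interface; inspect the script before relying on it):
python3 scripts/flat/tabular_schema_json.py exports/customers.csv schema.json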
Common Mistakes
- Re-asking preflight questions on rerun: Always check for db-analysis-config.json first. If it exists, reuse it silently; do not ask the user about exclusions again.
- Leaving blank descriptions in schema.json: The analysis is not complete until all table_description and column_description fields are filled. Always run the description enrichment continuation after the analyzer.
- Passing a database URL with credentials as a CLI argument: <database_url> embeds passwords that are visible in ps aux and shell history. Prefer --database-url-secret (reads from Azure Key Vault) in shared or production environments; a sketch follows this list.
- Running the full analyzer to fix null classifications: Use the classification review workflow (one family at a time) — not a full rerun — when improving concept assignments.
- Using source_system_analyzer.py for flat files: CSV/Excel inputs use tabular_schema_json.py, not the database analyzer.
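A sketch of the safer invocation (how --database-url-secret composes with the positional arguments is an assumption here, and <secret-name> is a placeholder for the Key Vault secret that holds the connection URL):
# Avoid: the password is visible in ps aux and shell history
# .venv/bin/python scripts/source_system_analyzer.py postgresql://user:p4ss@host/appdb schema.json
# Prefer: resolve the connection URL from Azure Key Vault
.venv/bin/python scripts/source_system_analyzer.py --database-url-secret <secret-name> schema.json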
Fallback Rules
- If source type is unclear, classify in this order: URL scheme, protocol (http/https), file extension, then available metadata (a sketch of this order follows this list).
- If still ambiguous, ask the user for source type before running scripts.
- If a provider-specific parser is missing, use the generic workflow and map output to the shared schema contract.
- Do not hardcode credentials; use environment variables or user-provided secure values.
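A minimal sketch of that classification order (the function name and return labels are illustrative, not part of the skill's contract):
from urllib.parse import urlparse

DB_SCHEMES = {"postgresql", "postgres", "mssql", "oracle"}
FLAT_EXTENSIONS = (".csv", ".xlsx", ".xls")

def classify_source(source: str) -> str:
    """Classify a source string as database, api, flat, or unknown."""
    scheme = urlparse(source).scheme
    # 1. URL scheme: known database dialects go to the database workflow
    if scheme in DB_SCHEMES:
        return "database"
    # 2. Protocol: http/https points at the API workflow
    if scheme in {"http", "https"}:
        return "api"
    # 3. File extension: tabular files go to the flat-file workflow
    if source.lower().endswith(FLAT_EXTENSIONS):
        return "flat"
    # 4. Otherwise fall back to available metadata, or ask the user
    return "unknown"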