| name | dbxmetagen-development |
| description | Development guidance for the dbxmetagen project -- AI-powered metadata generation for Databricks Unity Catalog. Use when working on dbxmetagen source code, tests, DDL generation, metadata review workflow, semantic layer, FK prediction, ontology, the FastAPI+React dashboard app, or deployment via Databricks Asset Bundles. Triggers on dbxmetagen, metadata generation, DDL, review workflow, knowledge base, knowledge graph, Genie builder, metric views. |
dbxmetagen Development Guide
For full architecture details, read CLAUDE.md. This skill covers
the essentials and hard-won pitfalls.
Current Environment
Always use these values for deployment and CLI operations:
| Setting | Value |
|---|
| Databricks CLI profile | DMVM |
| Deploy target | demo |
| Catalog | eswanson_demo |
| Schema | metadata_results |
./deploy.sh --profile DMVM --target demo --no-app
databricks bundle run <job_name> -t demo -p DMVM
databricks jobs run-now -p DMVM --json '{"job_id": ..., "job_parameters": [...]}'
Project Overview
What: Python library + Databricks App for automated metadata generation on Unity Catalog tables.
Uses LLMs to generate comments, detect PII, classify domains, build knowledge graphs, predict FKs,
generate metric views, and create Genie spaces.
Key entry points:
| File | Role |
|---|
src/dbxmetagen/main.py | CLI entry point, config validation, mode dispatch |
src/dbxmetagen/processing.py | Core pipeline (~3500 lines): chunking, sampling, DDL generation, persistence |
src/dbxmetagen/ddl_regenerator.py | DDL review workflow: load TSV/Excel, replace comments/tags, export SQL, apply |
src/dbxmetagen/fk_prediction.py | FK candidate generation, scoring, AI judgment, direction enforcement |
src/dbxmetagen/semantic_layer.py | Metric view generation, SQL autofix, Genie space context |
apps/dbxmetagen-app/app/api_server.py | FastAPI backend (~7300 lines, ~100 endpoints) |
apps/dbxmetagen-app/app/src/App.jsx | React frontend entry point |
Build and Test
uv sync
uv sync --extra pi
uv build
./run_tests.sh
./run_tests.sh -q
uv run pytest tests/test_domain_classifier.py -v
uv run pytest tests/test_ddl_regenerator.py::TestReplaceCommentInDDL::test_comment_containing_single_quote -v
Critical Testing Rules
-
Never run pytest tests/ directly. DDL regenerator tests and binary/variant tests must
run in separate processes due to mlflow/databricks-sdk import conflicts. Use run_tests.sh
or explicitly --ignore the conflicting files.
-
Use skip_yaml_loading=True when constructing MetadataConfig in tests:
config = MetadataConfig(
skip_yaml_loading=True,
catalog_name="test", schema_name="test", table_names="test.table",
)
-
conftest.py stubs heavy deps (pyspark, mlflow, openai, grpc, databricks_langchain)
as MagicMock. Key fixtures: mock_spark, test_config, sample_table_rows.
-
Integration tests live in notebooks/integration_tests/ as Databricks notebooks
(not pytest). They run on a Databricks cluster via DAB jobs.
Frontend
cd apps/dbxmetagen-app/app/src
npm install
npm run build
npm run dev
Rebuild after any frontend changes -- src/dist/ is committed and synced via DAB.
Key Patterns and Pitfalls
DDL Comment Handling (the #1 source of bugs)
The review workflow data flow:
AI generates comment
-> _create_*_ddl_func() sanitizes " -> ', wraps in double quotes
-> Export to TSV/Excel
-> User edits column_content (or ddl) in spreadsheet
-> load_metadata_file() reads back with pd.read_csv
-> update_ddl_row() calls replace_comment_in_ddl()
-> export_metadata() writes .sql file (one DDL per line)
-> extract_ddls_from_file() splits by semicolon
-> apply_ddl_to_databricks() executes via spark.sql()
Rules that MUST be followed:
-
Quote regex uses backreference \2, not a separate capture group:
r'(COMMENT ON TABLE [^"\']+ IS\s+)(["\'])(.*?)\2'
r'(COMMENT ON TABLE [^"\']+ IS\s+)(["\'])(.*?)(["\'])'
-
is_column_comment must check for "COMMENT ON COLUMN" or "ALTER COLUMN",
never bare "COLUMN" -- the word "Column" commonly appears in AI-generated comment text.
-
Sanitize new_comment in replace_comment_in_ddl the same way DDL generators do:
new_comment.replace('""', "'").replace('"', "'"). This keeps the review path consistent
with the generation path.
-
All pd.read_csv calls for TSV files MUST include keep_default_na=False, na_values=[].
Without this, literal "None" strings (used for column_name on table-level rows) become NaN.
-
extract_ddls_from_file splits on ; naively (content.split(";")). If a comment
contains a semicolon, the DDL gets corrupted. This is a known limitation -- avoid semicolons
in comment text when using the SQL export path.
LLM Response Parsing
AIMessage.content can be a list (not just str) when the model returns structured blocks.
chat_client.py handles this in invoke_structured.
- Pydantic validators in response models normalize malformed LLM output (misspellings,
nested lists, stringified arrays). Don't add redundant validation -- check existing validators first.
SQL Autofix (Metric Views)
Functions in semantic_layer.py (and duplicated in api_server.py):
_fix_unquoted_literals -- quotes multi-word string literals in =, !=, IN clauses
_fix_then_else_literals -- quotes unquoted values after THEN/ELSE
_fix_case_quoting -- fixes CASE expression quoting
All autofix functions need guards to avoid corrupting:
- Already-quoted strings
- Numeric values
- Column references (contain
.)
- Function calls (contain
()
- SQL reserved words
FK Prediction
- FK prediction accepts
table_names. When set, candidate pairs are filtered post-union to
include only pairs where at least one of table_a/table_b matches. It still reads from
graph_nodes/graph_edges (the full knowledge graph) for cross-table discovery, but limits
expensive AI_QUERY calls to relevant pairs only.
- Self-referential FKs (
src_table == dst_table AND src_column == dst_column) are filtered
- Direction is enforced via AI judge + cardinality fallback (
_enforce_direction)
- Reverse-pair cleanup runs before MERGE to prevent duplicates
- Query history candidates extracted from
system.query.history JOIN patterns
write_graph_edges() uses merge_edges(source_system='fk_predictions') -- every edge
must include edge_id and source_system
Graph Edges Write Contract
All writes to graph_edges must use merge_edges() from knowledge_graph.py.
from dbxmetagen.knowledge_graph import merge_edges
merge_edges(spark, target_table, edges_df, source_system="my_source")
Rules:
- Never use DELETE+INSERT,
saveAsTable("append"), or raw SQL INSERT for graph_edges
- Every edge DataFrame must include
edge_id (CONCAT_WS('::', src, dst, relationship))
and source_system
- Four
source_system values: knowledge_graph, ontology, fk_predictions, embedding_similarity
- MERGE is scoped by
source_system -- each module only sweeps its own stale edges
- If a module produces zero edges,
merge_edges deletes all rows for that source_system
align_edge_schema() pads missing columns as typed NULLs to match the 16-column canonical schema
Deployment
./deploy.sh --profile DMVM --target demo
./deploy.sh --profile DMVM --target demo --no-app
./deploy.sh --profile DMVM --target demo --permissions
Key rules:
deploy.sh generates databricks.yml from databricks.yml.template by stamping env vars.
Never edit databricks.yml directly.
- Similarly,
app.yaml is generated from app.yaml.template.
- Requires
{target}.env file (copy from example.env). Required vars:
DATABRICKS_HOST, catalog_name, schema_name, warehouse_id.
configurations/ is copied into the app source dir temporarily for DAB sync (cleaned up after).
Running the Eval Pipeline from CLI
The eval harness lives in a git worktree at ../dbxmetagen-eval on branch feat/evaluation-job.
It uses synthetic FHIR healthcare tables (patients, providers, encounters, diagnoses, medications,
lab_results) with known ground truth for comments, PII, domain, ontology, and FK predictions.
Three-step workflow
cd /path/to/dbxmetagen-eval
./deploy.sh --profile DMVM --target demo --no-app
JOB_ID=$(databricks bundle summary -t demo -p DMVM -o json | python3 -c "import sys,json; print(json.load(sys.stdin)['resources']['jobs']['full_analytics_pipeline_job']['id'])")
databricks jobs run-now -p DMVM --json "{
\"job_id\": $JOB_ID,
\"job_parameters\": {
\"ontology_bundle\": \"fhir_r4\",
\"incremental\": \"false\",
\"table_names\": \"eswanson_demo.metadata_results.patients,eswanson_demo.metadata_results.providers,eswanson_demo.metadata_results.encounters,eswanson_demo.metadata_results.diagnoses,eswanson_demo.metadata_results.medications,eswanson_demo.metadata_results.lab_results\"
}
}"
databricks bundle run eval_compare_job -t demo -p DMVM \
--param run_label="describe-your-change"
Alternatively, run step 2 through the app UI where you select the FHIR ontology
and the specific eval tables -- the app passes ontology_bundle and table_names
as job parameters, overriding the defaults correctly.
What the eval jobs do
| Job | What it does |
|---|
eval_setup_job | Creates synthetic tables + expected-value tables in eval_data schema (run once) |
full_analytics_pipeline_job | Runs the full pipeline (KB, ontology, embeddings, FK, etc.) against the eval tables |
eval_compare_job | Compares pipeline output tables against expected values, writes scores to eval_results |
Key gotcha
The full_analytics_pipeline_job YAML has ontology_bundle defaulting to "healthcare"
(hardcoded, not ${var.ontology_bundle}). When triggered from the app, the app overrides this
with whatever you select. When running from CLI, you must pass --param ontology_bundle=fhir_r4
or it will use the wrong bundle.
Investment Ops E2E Eval (Built-In)
The repo includes a self-contained e2e eval that builds a star schema from public market
data and scores pipeline outputs. Unlike the FHIR eval (which lives in a separate worktree),
this runs entirely from the main repo.
databricks bundle run eval_e2e_job -t demo -p DMVM
databricks bundle run eval_e2e_job -t demo -p DMVM --params="skip_generation=true"
databricks bundle run eval_e2e_job -t demo -p DMVM \
-- --skip_generation true --exclude_tables "dim_index,fct_index_membership"
Key parameters: skip_generation (skip yfinance download), skip_pipeline (skip everything),
skip_idempotency (skip re-run), exclude_tables (comma-separated short table names).
Caution: skip_generation only skips data table creation. Metadata generation (comment/PI/domain)
still runs unless skip_pipeline=true. See docs/E2E_EVAL_AND_INCREMENTALITY.md for full details.
Common Tasks
Adding a new unit test
- Create in
tests/test_*.py (or add to existing test class)
- Use
_make_config() helper or MetadataConfig(skip_yaml_loading=True, ...)
- If testing DDL functions, add to
tests/test_ddl_regenerator.py (runs in separate process)
- Mark known pre-existing bugs as
@pytest.mark.xfail(reason="...") -- do not skip them
Modifying DDL generation or review
Read the full data flow above. The two code paths that generate DDLs:
- Generation:
processing.py -> _create_table_comment_ddl_func() / _create_column_comment_ddl_func()
- Review:
ddl_regenerator.py -> replace_comment_in_ddl() / replace_pii_tags_in_ddl()
Both paths must apply the same sanitization. If you change one, check the other.
Adding a new mode or generator
Follow the Factory pattern in processing.py:
- Add mode to
MetadataConfig validation
- Create
Prompt subclass in prompts.py
- Create
MetadataGenerator subclass in metadata_generator.py
- Register in
PromptFactory and MetadataGeneratorFactory
- Add processing logic in
processing.py's mode dispatch
Version
Version is tracked in 4 places -- keep all in sync:
pyproject.toml (version = "X.Y.Z")
src/dbxmetagen/__init__.py (__version__ = "X.Y.Z")
apps/dbxmetagen-app/app/src/package.json
apps/dbxmetagen-app/app/src/package-lock.json