| name | dqx-profile-and-generate |
| description | Profile a DataFrame or table and generate DQX quality rule candidates with summary statistics. Use when the user asks to "profile a table", "generate DQX rules from data", "suggest data quality checks", "bootstrap a checks.yml", or "generate DLT expectations". Covers DQProfiler, DQGenerator, DQDltGenerator, the profiler workflow, sampling / filter options, and AI-assisted variants.
|
DQX — Profile and generate rule candidates
Typical one-shot bootstrap for a new table:
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
profiler = DQProfiler(ws)
generator = DQGenerator(ws)
df = spark.read.table("catalog.schema.input")
summary_stats, profiles = profiler.profile(df)
checks = generator.generate_dq_rules(profiles)
for c in checks:
print(c)
Profiling is a one-time bootstrap action per dataset. The candidate checks need human review before apply — don't auto-apply the raw output to production data.
Scoping the profile
DQProfiler.profile(df, columns=None, options=None) — columns is a top-level kwarg limiting the profiled columns; the following optional keys are set via the options dict:
sample_fraction — float 0–1 (e.g. 0.1 for 10% sample). Use on large tables.
sample_seed — int; pair with sample_fraction for reproducible runs.
limit — absolute row cap (e.g. 1_000_000).
filter — SQL string applied before profiling ("event_date >= '2026-01-01'").
criticality — default for every generated rule ("error" or "warn", default "error").
summary_stats, profiles = profiler.profile(
df,
columns=["order_id", "total_amount", "country_code"],
options={"sample_fraction": 0.1, "sample_seed": 42, "criticality": "warn"},
)
Generating DLT / Lakeflow expectations
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
dlt_expectations = DQDltGenerator(ws).generate_dlt_rules(profiles, language="python")
AI-assisted rule generation
DQX can generate rules from natural-language requirements via DSPy-backed LLMs — see the companion skills / docs rather than hand-rolling prompts:
No-code / workflow path (DQX installed as a workspace tool)
databricks labs dqx install
databricks labs dqx profile
databricks labs dqx profile --run-config default
databricks labs dqx profile --run-config default \
--patterns "main.product001.*;main.product002" \
--exclude-patterns "*_output;*_quarantine"
The workflow writes the generated candidates + summary stats to the checks_location on the run config (see dqx-storage).
Do / Don't
- Do review the generated checks and tighten
criticality / bounds before rolling to production.
- Do re-run profiling after a schema change or a large distribution shift — not on a schedule.
- Don't profile the output / quarantine table — the CLI auto-excludes
_dq_output / _dq_quarantine suffixes; keep the convention.
- Don't run profiling on the full streaming firehose — use
limit or sample_fraction against the current backfill.
Canonical docs: https://databrickslabs.github.io/dqx/docs/guide/data_profiling.