Name: Dqx Define Checks
Author: databrickslabs

Create DQX quality rules (checks) for a PySpark DataFrame or Delta table. Use when the user asks to "add a DQX check", "define a data quality rule", "validate that column X is not null / unique / in a set", or wants checks expressed in YAML/JSON for storage. Covers DQRowRule, DQDatasetRule, DQForEachColRule, built-in check_funcs, filters, user_metadata, custom SQL/Python checks, and the declarative metadata form.

stars: 419

forks: 119

updated: 4 May 2026, 14:39

Related skills from the same repository

dqx-apply-checks.md

from "databrickslabs/dqx"

Validate a PySpark DataFrame or Delta table against a set of DQX quality rules using DQEngine. Use when the user asks to "run data quality checks", "apply DQX rules to a DataFrame/table", "split valid and invalid rows", "quarantine bad records", or "integrate DQX into a streaming pipeline". Covers apply_checks, apply_checks_and_split, the by_metadata variants, and the shape of the result columns.

dqx-end-to-end.md

from "databrickslabs/dqx"

Run DQX validation end-to-end — read an input table or path, apply checks, and write valid and quarantined rows to output locations — in a single call. Use when the user asks for "apply and save", "quality-check a table and split the output", "DQX on a whole table", "save valid and invalid rows", or wants to drop DQX into a Lakeflow / workflow that runs on a table or path. Covers apply_checks_and_save_in_table, the by_metadata variant, InputConfig / OutputConfig, and incremental streaming mode.

dqx-profile-and-generate.md

from "databrickslabs/dqx"

Profile a DataFrame or table and generate DQX quality rule candidates with summary statistics. Use when the user asks to "profile a table", "generate DQX rules from data", "suggest data quality checks", "bootstrap a checks.yml", or "generate DLT expectations". Covers DQProfiler, DQGenerator, DQDltGenerator, the profiler workflow, sampling / filter options, and AI-assisted variants.

dqx-storage.md

from "databrickslabs/dqx"

Load and save DQX checks (quality rules) to a file, workspace path, Unity Catalog volume, Delta table, Lakebase, or the DQX installation folder. Use when the user asks to "load DQX checks from YAML", "save checks to a Delta table", "read checks from a volume", "share checks across notebooks", or "use the DQX workspace install's default checks location". Covers every *ChecksStorageConfig and the matching load/save calls.

DQX — Define quality checks

DQX rules come in two interchangeable forms. Pick based on where the checks will live.

DQX classes (DQRowRule, DQDatasetRule, DQForEachColRule): use when checks are authored in code next to the pipeline. This form gives you static typing and IDE autocomplete.
Dict / YAML / JSON metadata: use when checks are loaded from a file, workspace path, volume, or Delta table. This form is required for the apply_checks_by_metadata* path.

Every check has a criticality: error (a failing row is quarantined) or warn (a failing row passes but is flagged). The default is error.
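
To make the two criticality levels concrete, here is a toy model in plain Python. It is not DQX's implementation and does not use Spark; the split_rows helper and the _errors / _warnings field names are illustrative assumptions. It only mirrors the rule stated above: an error failure quarantines the row, a warn failure flags it but lets it through.

```python
from typing import Callable

Row = dict
# (rule name, criticality, predicate that returns True when the row passes)
Check = tuple[str, str, Callable[[Row], bool]]


def split_rows(rows: list[Row], checks: list[Check]) -> tuple[list[Row], list[Row]]:
    """Toy split: rows with any failed 'error' check are quarantined."""
    valid, quarantined = [], []
    for row in rows:
        errors = [name for name, crit, ok in checks if crit == "error" and not ok(row)]
        warnings = [name for name, crit, ok in checks if crit == "warn" and not ok(row)]
        annotated = {**row, "_errors": errors, "_warnings": warnings}
        (quarantined if errors else valid).append(annotated)
    return valid, quarantined


checks = [
    ("col1_is_not_null", "error", lambda r: r.get("col1") is not None),
    ("col3_is_not_null_and_not_empty", "warn", lambda r: r.get("col3") not in (None, "")),
]
rows = [
    {"col1": 1, "col3": "a"},      # passes both checks
    {"col1": 2, "col3": ""},       # warn only: stays valid, flagged
    {"col1": None, "col3": "c"},   # error: quarantined
]
valid, quarantined = split_rows(rows, checks)
```

In this sketch the second row stays in valid with a non-empty _warnings list, while the third row lands in quarantined.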

Minimal — class form

from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule, DQForEachColRule

checks = [
    # row-level: one column
    DQRowRule(
        name="col3_is_not_null",
        criticality="warn",
        check_func=check_funcs.is_not_null_and_not_empty,
        column="col3",
    ),

    # same check across many columns
    *DQForEachColRule(
        columns=["col1", "col2"],
        criticality="error",
        check_func=check_funcs.is_not_null,
    ).get_rules(),

    # dataset-level: uniqueness across a composite key
    DQDatasetRule(
        criticality="error",
        check_func=check_funcs.is_unique,
        columns=["order_id", "line_item_id"],
    ),
]

Minimal — metadata form (YAML)

Load into Python via yaml.safe_load(...), then pass the resulting list[dict] to any apply_checks_by_metadata* call, or save through a storage config (see dqx-storage).

- name: col3_is_not_null
  criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      column: col3

- criticality: error
  check:
    function: is_not_null
    for_each_column: [col1, col2]

- criticality: error
  check:
    function: is_unique
    arguments:
      columns: [order_id, line_item_id]
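
For reference, this is the list[dict] that parsing the YAML above yields, written out as a Python literal. The JSON round trip uses only the standard library and is shown as one possible way to keep the same structure in a JSON file; it is a sketch, not a DQX storage call.

```python
import json

# The same three checks as the YAML above, as plain Python data.
checks_metadata = [
    {
        "name": "col3_is_not_null",
        "criticality": "warn",
        "check": {
            "function": "is_not_null_and_not_empty",
            "arguments": {"column": "col3"},
        },
    },
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "for_each_column": ["col1", "col2"]},
    },
    {
        "criticality": "error",
        "check": {
            "function": "is_unique",
            "arguments": {"columns": ["order_id", "line_item_id"]},
        },
    },
]

# Round trip through JSON: the structure survives unchanged.
as_json = json.dumps(checks_metadata, indent=2)
restored = json.loads(as_json)
functions = [check["check"]["function"] for check in restored]
```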

Common variants

Filtered check — evaluate only when a predicate holds: add filter="col1 < 3" (class) or filter: "col1 < 3" (YAML).
Positional args — check_func_args=[[1, 2]]; keyword args — check_func_kwargs={"allowed": [1, 2]}.
Struct / map / array element — use F.try_element_at(...) or dotted path (col7.field1) as the column value.
User metadata — annotate the rule with a user_metadata dict (e.g. {"check_type": "completeness"}) that flows into the result struct.
Custom check — pass any function that returns a PySpark Column as check_func. For inline SQL, use the fallback section below — only after confirming no built-in fits.
Aggregate dataset-level — is_aggr_not_greater_than, is_aggr_not_less_than, is_aggr_equal, is_aggr_not_equal; supply aggr_type (count, avg, stddev, percentile, count_distinct…), optional group_by, and limit.
Uniqueness dataset-level — is_unique, with columns, nulls_distinct (bool), and optional row_filter. Not an aggregate check — no aggr_type.

Full reference: https://databrickslabs.github.io/dqx/docs/reference/quality_checks.
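
The variants above could look roughly like this in metadata form, written as plain Python dicts. Treat it as a sketch under assumptions: the rule names and column names are made up, is_in_list is assumed to be the built-in that takes allowed, and placing filter and user_metadata at the top level of each check (next to criticality) is an assumption to verify against the full reference.

```python
variant_checks = [
    {   # filtered check: only evaluated where col1 < 3
        "name": "col4_not_empty_when_col1_small",
        "criticality": "error",
        "filter": "col1 < 3",
        "check": {"function": "is_not_null_and_not_empty", "arguments": {"column": "col4"}},
        "user_metadata": {"check_type": "completeness"},
    },
    {   # keyword argument: allowed values
        "name": "col1_in_allowed_set",
        "criticality": "warn",
        "check": {"function": "is_in_list", "arguments": {"column": "col1", "allowed": [1, 2]}},
    },
    {   # struct field addressed by dotted path
        "name": "col7_field1_is_not_null",
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "col7.field1"}},
    },
    {   # aggregate dataset-level check with group_by and limit
        "name": "at_most_10_rows_per_col3",
        "criticality": "error",
        "check": {
            "function": "is_aggr_not_greater_than",
            "arguments": {"column": "col2", "aggr_type": "count", "group_by": ["col3"], "limit": 10},
        },
    },
    {   # uniqueness: no aggr_type, but nulls_distinct is available
        "name": "order_line_is_unique",
        "criticality": "error",
        "check": {
            "function": "is_unique",
            "arguments": {"columns": ["order_id", "line_item_id"], "nulls_distinct": False},
        },
    },
]

filtered_names = [c["name"] for c in variant_checks if "filter" in c]
```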

Fallback: custom SQL

Search check_funcs first — the built-ins cover null/empty, range, set membership, regex, referential, aggregate, uniqueness, schema, freshness, comparison, and outlier cases with typed error messages and tested edge handling. Drop down to SQL only when no built-in fits.

sql_expression — row-level SQL boolean expression. Use when one row's validity depends on its own columns.
sql_query — dataset-level SQL query against {{ input_view }}. Use for cross-row aggregates, joins to reference DataFrames, or anything needing GROUP BY. Queries are validated by is_sql_query_safe() — read-only SELECT, no DDL/DML.

# row-level: SQL expression evaluated per row
- name: amount_positive_or_refunded
  criticality: error
  check:
    function: sql_expression
    arguments:
      expression: amount > 0 OR refunded = true
      msg: amount must be positive unless refunded

# dataset-level: SQL query, joined back to rows via merge_columns
- name: order_total_matches_lines
  criticality: error
  check:
    function: sql_query
    arguments:
      query: |
        SELECT order_id,
               SUM(line_amount) <> order_total AS condition
        FROM {{ input_view }}
        GROUP BY order_id, order_total
      merge_columns: [order_id]    # row-level: joins back per order_id
      condition_column: condition  # column in query output; true = fail
      # omit merge_columns for dataset-level (one verdict applies to every row)

For the equivalent class form, use DQRowRule(check_func=check_funcs.sql_expression, check_func_kwargs={"expression": "..."}) or DQDatasetRule(check_func=check_funcs.sql_query, check_func_kwargs={...}).
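
To see what the sql_query example computes, the following sketch runs the same query on a few sample rows with Python's built-in sqlite3 module. SQLite stands in for Spark SQL here, the {{ input_view }} placeholder is replaced by a table literally named input_view, and the merge back onto rows is done by hand; none of this is DQX code, and the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input_view (order_id INTEGER, order_total INTEGER, line_amount INTEGER)"
)
conn.executemany(
    "INSERT INTO input_view VALUES (?, ?, ?)",
    [
        (1, 30, 10), (1, 30, 20),   # lines sum to 30: matches order_total
        (2, 50, 15), (2, 50, 25),   # lines sum to 40: does not match 50
    ],
)

query = """
    SELECT order_id,
           SUM(line_amount) <> order_total AS condition
    FROM input_view
    GROUP BY order_id, order_total
"""
# One verdict per order_id; a true condition means the check fails.
verdict_by_order = {order_id: bool(cond) for order_id, cond in conn.execute(query)}

# Emulate merge_columns: [order_id] by looking up each row's verdict.
rows = conn.execute(
    "SELECT order_id, line_amount FROM input_view ORDER BY order_id, line_amount"
).fetchall()
failing_rows = [(oid, amount) for oid, amount in rows if verdict_by_order[oid]]
conn.close()
```

Every line of order 2 is marked as failing because the verdict is joined back per order_id, which is the row-level behaviour the merge_columns comment describes.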

Converting between forms

from databricks.labs.dqx.checks_serializer import serialize_checks, deserialize_checks
checks_metadata = serialize_checks(checks)              # classes → list[dict]
checks_classes  = deserialize_checks(checks_metadata)   # list[dict] → classes

Validation before apply

Catch syntax errors without running the pipeline:

from databricks.labs.dqx.engine import DQEngine
status = DQEngine.validate_checks(checks)   # raises / returns ValidationStatus
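
As a rough illustration of the kind of mistake worth catching early, here is a small structural pre-check in plain Python. It is a hypothetical helper, not DQEngine.validate_checks and not a substitute for it; it only checks the two facts stated in this document: criticality is error or warn (defaulting to error), and each check names a function.

```python
ALLOWED_CRITICALITY = {"error", "warn"}


def structural_problems(checks: list[dict]) -> list[str]:
    """Return human-readable problems found in metadata-form checks."""
    problems = []
    for index, check in enumerate(checks):
        criticality = check.get("criticality", "error")  # default is error
        if criticality not in ALLOWED_CRITICALITY:
            problems.append(f"check {index}: unknown criticality {criticality!r}")
        spec = check.get("check")
        if not isinstance(spec, dict) or not spec.get("function"):
            problems.append(f"check {index}: missing check.function")
    return problems


problems = structural_problems([
    {"criticality": "warn", "check": {"function": "is_not_null", "arguments": {"column": "col1"}}},
    {"criticality": "fatal", "check": {"function": "is_not_null", "arguments": {"column": "col2"}}},
    {"criticality": "error", "check": {"arguments": {"column": "col3"}}},
])
```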

Do / Don't

Do give each rule a stable, snake_case name — it ends up in result columns and dashboards.
Do put shared rules in a YAML/JSON/Delta file and load them (see dqx-storage) — classes are fine for a handful, metadata scales.
Don't use a regex check for null / empty / range / referential / aggregate cases — DQX has a built-in for each; see check_funcs.
Don't reach for sql_expression / sql_query when a built-in covers the case — they bypass typed error messages and security guards. Search check_funcs first.
Don't put side effects in a custom check_func — it must return a Column expression only.
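
The naming advice can be enforced with a few lines of standard-library Python, for example in a CI step that lints a checks file. This is a sketch: the non_snake_case_names helper and the exact regular expression are assumptions about what "snake_case" should mean for your project, and unnamed checks are simply skipped.

```python
import re

# Lowercase words of letters/digits separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def non_snake_case_names(checks: list[dict]) -> list[str]:
    """Return the names of checks whose name is not snake_case."""
    return [
        check["name"]
        for check in checks
        if "name" in check and not SNAKE_CASE.fullmatch(check["name"])
    ]


checks_metadata = [
    {"name": "col3_is_not_null", "criticality": "warn",
     "check": {"function": "is_not_null_and_not_empty", "arguments": {"column": "col3"}}},
    {"name": "Col1-NotNull", "criticality": "error",
     "check": {"function": "is_not_null", "arguments": {"column": "col1"}}},
    {"criticality": "error",  # unnamed: not reported by this helper
     "check": {"function": "is_not_null", "arguments": {"column": "col2"}}},
]
bad_names = non_snake_case_names(checks_metadata)
```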

Canonical docs: https://databrickslabs.github.io/dqx/docs/guide/quality_checks_definition.
