name	node-catalog-sync
description	Enforce synchronization between Kedro node files and catalog YAML files in the OptimusKG project. Use when editing any Python node file under optimuskg/pipelines/*/nodes/, including modifying run() functions, changing node() inputs/outputs, adding/removing/renaming DataFrame columns, or changing column types. Also use when creating new nodes or deleting existing ones.

Node–Catalog Sync

When a node file is edited, the corresponding catalog YAML files must be updated to stay in sync.

Path Mapping

Node outputs are namespace-prefixed by pipeline.py. Use this table to locate catalog files:

| Layer | Node `outputs` | Catalog ID | YAML path | `filepath` |
| --- | --- | --- | --- | --- |
| Bronze | `"{src}.{name}"` | `bronze.{src}.{name}` | `conf/base/catalog/bronze/{src}/{name}.yml` | `data/bronze/{src}/{name}.parquet` |
| Silver nodes | `"nodes.{entity}"` | `silver.nodes.{entity}` | `conf/base/catalog/silver/nodes/{entity}.yml` | `data/silver/nodes/{entity}.parquet` |
| Silver edges | `"edges.{e1}_{e2}"` | `silver.edges.{e1}_{e2}` | `conf/base/catalog/silver/edges/{e1}_{e2}.yml` | `data/silver/edges/{e1}_{e2}.parquet` |
| Gold | `"kg.{fmt}"` | `gold.kg.{fmt}` | `conf/base/catalog/gold/{fmt}.yml` | `data/gold/kg/{fmt}/` |

When a node declares multiple outputs (a list), each output has its own YAML file.
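
The conventions in the table can be sketched as a small Python helper. This is only an illustration: `catalog_paths` is a hypothetical name, not a function from the OptimusKG codebase, and it assumes that no segment of the catalog ID contains a dot of its own.

```python
def catalog_paths(catalog_id: str) -> tuple[str, str]:
    """Return (yaml_path, filepath) for a full catalog ID, following the table above."""
    layer, *rest = catalog_id.split(".")
    if not rest:
        raise ValueError(f"catalog ID has no dataset part: {catalog_id!r}")
    if layer == "gold":
        # Gold IDs look like gold.kg.{fmt}: the YAML path drops the "kg"
        # segment and the data filepath is a directory, not a parquet file.
        fmt = rest[-1]
        return f"conf/base/catalog/gold/{fmt}.yml", f"data/gold/kg/{fmt}/"
    if layer in ("bronze", "silver"):
        # Bronze and Silver mirror the dotted ID as nested directories.
        rel = "/".join(rest)
        return f"conf/base/catalog/{layer}/{rel}.yml", f"data/{layer}/{rel}.parquet"
    raise ValueError(f"unknown layer: {layer!r}")
```

For a made-up ID such as `silver.nodes.gene`, the helper returns `conf/base/catalog/silver/nodes/gene.yml` and `data/silver/nodes/gene.parquet`.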

Sync Workflow

Follow all 4 steps in order after every node file edit.

Step 1: Identify affected catalog entries

1. Read the `node()` call in the edited file to find `outputs` (string or list).
2. Determine the namespace from the corresponding `pipeline.py` (e.g., `namespace="bronze"`).
3. Construct the full catalog ID: `{namespace}.{output}`.
4. Locate the YAML file using the path mapping table above.
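
As a rough sketch of the ID construction in this step, the function below turns a namespace and an `outputs` value into full catalog IDs. The name `full_catalog_ids` is hypothetical, and only the two shapes mentioned above (a single string or a list of strings) are handled.

```python
from typing import Union


def full_catalog_ids(namespace: str, outputs: Union[str, list[str]]) -> list[str]:
    """Prefix each node output with the pipeline namespace."""
    # A single string output is treated as a one-element list.
    names = [outputs] if isinstance(outputs, str) else list(outputs)
    return [f"{namespace}.{name}" for name in names]
```

Each returned ID corresponds to one YAML file, so a node with a two-element `outputs` list yields two catalog entries to check.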

Step 2: Update dataset ID and filepath

Only if the `outputs` value in the `node()` call changed:

1. Update the YAML top-level key to the new full catalog ID.
2. Update `filepath` following the convention in the path mapping table.
3. Rename the YAML file to match the new output name.
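
The key and `filepath` update can be pictured on a parsed catalog. The sketch below assumes the YAML has already been loaded into a plain dict; `rename_catalog_entry` is a hypothetical helper, and reading, writing, and renaming the file on disk are deliberately left out.

```python
def rename_catalog_entry(
    catalog: dict, old_id: str, new_id: str, new_filepath: str
) -> dict:
    """Return a copy of `catalog` with `old_id` renamed and its filepath updated."""
    if old_id not in catalog:
        raise KeyError(old_id)
    # Shallow copy: load_args and metadata (including the checksum) are carried
    # over unchanged and get refreshed later by the sync step.
    entry = dict(catalog[old_id])
    entry["filepath"] = new_filepath
    renamed = {key: value for key, value in catalog.items() if key != old_id}
    renamed[new_id] = entry
    return renamed
```

The helper returns a new dict rather than mutating its input, which makes it easy to diff the old and new entries before writing anything.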

Step 3: Rerun node and sync catalog

1. Rerun the node: `uv run kedro run --nodes={node_name}`
2. Sync schema and checksum: `uv run cli sync-catalog --dataset {catalog_id}`
   - This reads the parquet file on disk and updates both `load_args.schema` and `metadata.checksum` in the YAML automatically.
   - Use `--dry-run` to preview changes first.
   - Use `--validate` to check without writing (useful in CI).

Never delete the `checksum` property.
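
If this step is scripted, the two commands can be assembled as argument lists. The sketch below is an assumption-laden illustration: `step3_commands` is a hypothetical helper, and placing `--dry-run` after `--dataset {catalog_id}` is assumed to be accepted by the CLI.

```python
def step3_commands(
    node_name: str, catalog_id: str, dry_run: bool = False
) -> list[list[str]]:
    """Build the rerun and sync commands for one node and one catalog entry."""
    rerun = ["uv", "run", "kedro", "run", f"--nodes={node_name}"]
    sync = ["uv", "run", "cli", "sync-catalog", "--dataset", catalog_id]
    if dry_run:
        # Preview the catalog changes without writing them (assumed flag position).
        sync.append("--dry-run")
    # Order matters: the parquet file must exist before it can be synced.
    return [rerun, sync]
```

Each list can then be passed to `subprocess.run(cmd, check=True)` in order, so that a failed rerun stops the script before the sync.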

Step 4: Cascade downstream

List all downstream nodes using DryRunner (no execution, just shows the DAG):

uv run kedro run --from-nodes={node_name} --runner=optimuskg.runners.DryRunner

Rerun the edited node and all its downstream dependents:
```
uv run kedro run --from-nodes={node_name}
```
Sync all affected catalog entries:
```
uv run cli sync-catalog
```
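
Conceptually, the cascade covers the edited node plus everything reachable from it in the pipeline DAG. The sketch below shows that traversal on a hypothetical adjacency map (`edges` maps a node name to the nodes that consume its outputs); it illustrates the idea and is not how Kedro or `DryRunner` is implemented.

```python
from collections import deque


def downstream_nodes(edges: dict[str, list[str]], start: str) -> list[str]:
    """Return `start` plus every node reachable from it, in breadth-first order."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # a node with several parents is listed once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Every node in the returned list may have written a new parquet file, which is why the sync above runs over all affected catalog entries rather than a single `--dataset`.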

Catalog YAML Structure

{catalog_id}:
  type: optimuskg.datasets.polars.ParquetDataset
  filepath: data/{layer}/{path}.parquet
  load_args:
    schema:
      column_name: pl.Type
      struct_column:
        nested_field: pl.Type
  metadata:
    checksum: {blake2b_hex_digest}
    kedro-viz:
      layer: {layer}
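
To give a sense of what a `{blake2b_hex_digest}` value is, the sketch below hashes a file with Python's standard `hashlib`. It assumes the checksum is taken over the raw file bytes with the default 64-byte digest; the actual `sync-catalog` command may use different settings, so treat its output as the source of truth and never write this value into the YAML by hand.

```python
import hashlib


def blake2b_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the BLAKE2b hex digest of the file at `path`."""
    digest = hashlib.blake2b()  # default digest size: 64 bytes (128 hex chars)
    with open(path, "rb") as handle:
        # Read in chunks so large parquet files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends on every byte of the file, any change to a node's output columns or types produces a different checksum after the rerun.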
