| name | debug-bag-contents |
| description | Diagnose missing data in DerivaML dataset bag (BDBag) exports — FK traversal issues, missing tables, materialization problems, export timeouts. Use when a downloaded dataset bag is missing expected records, images, or feature values. |
| disable-model-invocation | true |
Debugging Dataset Bag Contents
When a dataset bag export is missing expected data, follow this step-by-step diagnostic process to identify and fix the issue.
Every tool below takes hostname= and catalog_id= arguments explicitly. Substitute your catalog's hostname (e.g., "data.example.org") and catalog ID (e.g., "1") wherever the examples show them.
Recommended First Step: Discover with rag_search
Before diving into specific resources, use rag_search to understand the catalog's schema and data landscape. This provides context that makes subsequent debugging more effective:
rag_search("dataset element types and FK paths", doc_type="catalog-schema")
rag_search("dataset bag export traversal", doc_type="user-guide")
This helps you understand which tables exist, how they relate via foreign keys, and what element types are registered -- all essential context for diagnosing missing bag data. After this initial discovery, use the specific tools listed below for targeted investigation.
Step 1: Check Dataset Members
Dataset members are the explicit records that belong to a dataset. If data is missing from a bag, the first question is whether the right members are in the dataset.
- Tool: deriva_ml_get_dataset(hostname="data.example.org", catalog_id="1", dataset_rid="<rid>") for the dataset's summary and member counts.
- Tool: deriva_ml_list_dataset_members(hostname="data.example.org", catalog_id="1", dataset_rid="<rid>") to get the full list of members, grouped by table.
- Verify that the records you expect are listed as members. If they are missing, add them with deriva_ml_add_dataset_members(hostname="data.example.org", catalog_id="1", dataset_rid="<rid>", members={"Image": ["2-IMG1", ...]}).
Step 2: Check Element Type Registration
Every table that contributes members to a dataset must be registered as a dataset element type. If a table is not registered, its members will be silently excluded from the bag.
- Tool: deriva_ml_list_dataset_element_types(hostname="data.example.org", catalog_id="1") to see which tables are registered as element types in the catalog.
- Tool: deriva_ml_add_dataset_element_type(hostname="data.example.org", catalog_id="1", dataset_rid="<rid>", element_table="<table>") to register a table as an element type if it is missing.
- Common tables that should be registered: Subject, Observation, Image (or other asset tables), and any custom tables whose records appear as dataset members.
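The registration check above can be sketched as a simple comparison. This is a minimal illustration that assumes the outputs of deriva_ml_list_dataset_members and deriva_ml_list_dataset_element_types have already been converted to a plain dict and list; those shapes, the sample RIDs, and the helper name unregistered_member_tables are assumptions, not the tools' documented return formats.

```python
def unregistered_member_tables(members_by_table, registered_types):
    """Return tables that hold members but are not registered element types."""
    registered = set(registered_types)
    return sorted(
        table
        for table, rids in members_by_table.items()
        if rids and table not in registered
    )


# Hypothetical data, for illustration only.
members_by_table = {"Subject": ["2-SUB1"], "Image": ["2-IMG1"], "Protocol": []}
registered_types = ["Subject"]

unregistered = unregistered_member_tables(members_by_table, registered_types)
print(unregistered)
```

Any table this reports is a candidate for deriva_ml_add_dataset_element_type, since its members would otherwise be dropped from the bag.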
Step 3: Preview Bag Export Paths
Before downloading a full bag, preview what the export will contain.
- Tool: deriva_ml_bag_info(hostname="data.example.org", catalog_id="1", dataset_rid="<rid>", version="1.0.0") returns row counts, asset sizes, and the projected manifest per table. (This subsumes the legacy estimate_bag_size.)
- This preview shows which tables will be included and how many rows each will have, without actually downloading anything.
- Compare the preview counts against your expectations to spot discrepancies early.
Step 4: Understand FK Path Traversal
The bag export algorithm uses foreign key (FK) path traversal to determine which related records to include. Understanding this is critical for diagnosing missing data.
Key rules:
- Starting points are dataset members only from registered element types. Records in tables that are not registered as element types will not serve as starting points for traversal, even if they are dataset members.
- FK traversal follows both directions. From each starting point record, the export follows foreign keys both outward (this table references another) and inward (another table references this one).
- Vocabulary table endpoints are exported separately. Vocabulary/controlled-vocabulary tables encountered during traversal are collected and exported in their own section of the bag, not inline with the data tables.
- Traversal depth is bounded. The export does not follow FK chains indefinitely. It follows direct FK relationships from the member records.
How traversal works in practice:
- If Subject is a registered element type and you have Subject members, the export will:
- Include those Subject records.
- Follow FKs from Subject to related tables (e.g., Subject_Phenotype).
- Follow FKs pointing back to Subject from other tables (e.g., Image.Subject_RID -> Subject).
- Export vocabulary terms referenced by any included records.
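The rules above can be illustrated with a toy table-level model. This is only a sketch of the rules as stated in this step, not the export implementation: it assumes the schema is given as (referencing_table, referenced_table) pairs, follows each FK once in either direction, and ignores vocabulary handling. The Protocol and Annotation tables and the function name one_hop_tables are hypothetical.

```python
def one_hop_tables(member_tables, registered_types, foreign_keys):
    """Toy model: tables reached by one FK hop, in either direction,
    from member tables that are registered element types."""
    starts = set(member_tables) & set(registered_types)
    reached = set(starts)
    for referencing, referenced in foreign_keys:
        if referencing in starts:
            reached.add(referenced)   # outward: start table references another
        if referenced in starts:
            reached.add(referencing)  # inward: another table references start
    return starts, reached


# Hypothetical schema as (referencing_table, referenced_table) pairs.
foreign_keys = [
    ("Subject", "Subject_Phenotype"),
    ("Image", "Subject"),       # Image.Subject_RID -> Subject
    ("Annotation", "Image"),    # two hops away from Subject
]

starts, reached = one_hop_tables(
    member_tables=["Subject", "Protocol"],  # Protocol is not registered
    registered_types=["Subject", "Image"],
    foreign_keys=foreign_keys,
)
print(sorted(starts), sorted(reached))
```

In this model Protocol is not a starting point because it is not registered, and Annotation is not reached because it is two hops from Subject. Both are the kinds of gaps the scenarios in Step 5 describe.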
Step 5: Diagnose Common Scenarios
Scenario: Images missing from a Subject-only dataset
Problem: Dataset has Subject members but the exported bag does not include the associated Image records.
Diagnosis:
- Images are in a separate asset table with an FK to Subject.
- The FK traversal should find Images that reference the Subject members.
Fix checklist:
- Verify the Image table has a direct FK to Subject (not through an intermediate table).
- If the FK path goes through an intermediate table (e.g., Observation), that intermediate table may need to be registered as an element type, or intermediate records need to be added as members.
- Alternatively, add the Image records directly as dataset members and register the Image table as an element type.
Scenario: Observation data missing
Problem: Observations associated with Subjects are not in the bag.
Diagnosis:
- Check whether Observation has a direct FK to Subject.
- If yes, the FK traversal from Subject members should pick up Observations.
- If not, the path may be indirect and not traversed.
Fix:
- Add Observation records as explicit dataset members and register Observation as an element type.
- Or ensure there is a direct FK link between the tables.
Scenario: Denormalize raises "Ambiguous path" error
Problem: Calling denormalize_as_dataframe(include_tables=["Image", "Subject"]) raises a DerivaMLException with "Ambiguous path between Image and Subject".
Diagnosis:
- The schema has multiple FK paths between Image and Subject.
- For example: Image → Subject (direct FK) AND Image → Observation → Subject (multi-hop).
- Denormalize cannot determine which path to use for the join.
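To see whether a schema has more than one path between two tables, the FK graph can be searched directly. The sketch below is a generic depth-first path enumeration over a hypothetical list of FK pairs; it is not how denormalize_as_dataframe detects ambiguity internally, and the function name fk_paths is an assumption for illustration.

```python
def fk_paths(foreign_keys, start, goal):
    """Enumerate simple paths between two tables, treating FKs as undirected edges."""
    neighbors = {}
    for a, b in foreign_keys:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    paths = []

    def walk(table, path):
        if table == goal:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(table, ())):
            if nxt not in path:  # never revisit a table, so paths stay simple
                walk(nxt, path + [nxt])

    walk(start, [start])
    return paths


# Hypothetical schema matching the example above.
foreign_keys = [
    ("Image", "Subject"),
    ("Image", "Observation"),
    ("Observation", "Subject"),
]

paths = fk_paths(foreign_keys, "Image", "Subject")
for path in paths:
    print(" -> ".join(path))
```

More than one path in the result means the join between the two tables is ambiguous in the sense described above.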
Fix:
Scenario: Denormalize returns null for joined columns
Problem: Denormalize returns rows but all columns from a joined table (e.g., Observation) are null.
Diagnosis:
- The joined table may not be FK-reachable from the primary table members.
- The FK column on the primary table may be null for all members.
- The FK path may require intermediate tables not listed in include_tables.
Fix:
- Check the FK column values: does the primary table actually have non-null FK values pointing to the joined table?
- If the path goes through intermediate tables, include them in include_tables.
- Verify the joined table has records matching the FK values.
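The first check, whether the FK column is actually populated, can be sketched as follows. It assumes the primary table's rows have been fetched (for example with get_entities) and are available as a list of dicts. The column name "Observation", the sample RIDs, and the helper null_fraction are illustrative assumptions.

```python
def null_fraction(rows, fk_column):
    """Fraction of rows whose FK column is missing or None (0.0 for no rows)."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(fk_column) is None)
    return nulls / len(rows)


# Hypothetical Image rows, for illustration only.
rows = [
    {"RID": "2-IMG1", "Observation": None},
    {"RID": "2-IMG2", "Observation": None},
    {"RID": "2-IMG3", "Observation": "2-OBS1"},
]

fraction = null_fraction(rows, "Observation")
print(f"{fraction:.0%} of rows have no Observation link")
```

A fraction of 1.0 would explain all-null joined columns without any problem in the FK path itself.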
Scenario: Vocabulary terms missing
Problem: Controlled vocabulary values referenced by data records are not in the bag.
Diagnosis:
- Vocabulary terms are exported separately from data tables.
- Check that the vocabulary table is properly configured as a vocabulary (not a regular table).
Fix:
- Vocabulary terms referenced by included records should be automatically exported. If they are missing, verify the FK relationship between the data table and the vocabulary table is intact.
- Use get_table(hostname="data.example.org", catalog_id="1", schema="<schema>", table="<vocab_table>") to confirm the vocabulary table's structure.
Step 6: Download and Validate the Bag
Use the validation tool to get a detailed comparison of expected vs. actual bag contents.
- Tool: Python API bag inspection with the dataset RID.
- Returns a per-table comparison showing:
- Expected RIDs (based on dataset members and FK traversal).
- Actual RIDs present in the downloaded bag.
- Missing RIDs: records that should be in the bag but are not.
- Extra RIDs: records in the bag that were not expected (usually not a problem but worth investigating).
- Use the missing RIDs to identify exactly which records are being dropped and from which tables.
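The per-table comparison described above reduces to two set differences. A minimal sketch, assuming the expected and actual RIDs are already available as dicts mapping table name to a list of RIDs; how those are obtained from the bag is not shown, and the function name diff_rids and the sample RIDs are assumptions.

```python
def diff_rids(expected, actual):
    """Per table, list RIDs missing from the bag and RIDs that were not expected."""
    report = {}
    for table in sorted(set(expected) | set(actual)):
        exp = set(expected.get(table, []))
        act = set(actual.get(table, []))
        report[table] = {
            "missing": sorted(exp - act),
            "extra": sorted(act - exp),
        }
    return report


# Hypothetical RIDs, for illustration only.
expected = {"Subject": ["2-SUB1", "2-SUB2"], "Image": ["2-IMG1"]}
actual = {"Subject": ["2-SUB1"], "Image": ["2-IMG1", "2-IMG9"]}

report = diff_rids(expected, actual)
for table, result in report.items():
    print(table, result)
```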
Step 7: Check FK Paths for All Element Types
For each registered element type, examine the FK paths that the export will follow.
- Tool: deriva_ml_list_dataset_element_types(hostname, catalog_id) to see element types and the projected FK paths each will follow.
- Look for:
- Missing links: Tables you expect to be reachable but are not connected by FKs.
- Indirect paths: FK chains that go through intermediate tables, which may not be traversed if those intermediates are not included.
- Circular references: These are handled correctly but may cause confusion when reading the path graph.
Step 8: Fix Common Issues
Deep join timeouts
Problem: FK traversal through many intermediate tables causes slow exports or timeouts.
Fix — Option A (preferred): Increase the download timeout.
The default network timeout is (10, 610) seconds — 10s to connect, 610s (~10 min) to read each query response. For large datasets with deep FK joins, increase the read timeout:
dataset.download_dataset_bag(version="1.0.0", timeout=[10, 1800]) # Python API
This gives the server 30 minutes per query instead of 10. The connect timeout (first value) rarely needs changing.
For Hydra-Zen configs, add timeout to DatasetSpecConfig:
DatasetSpecConfig(rid="28EA", version="0.4.0", timeout=[10, 1800])
Fix — Option B: Exclude unnecessary tables from the FK graph.
If you don't need data from certain tables, prune them from the FK traversal:
dataset.download_dataset_bag(version="1.0.0", exclude_tables=["Study", "Protocol"]) # Python API
This prevents the export from traversing into those tables entirely. Use this when the excluded tables' data is not needed in the bag.
For Hydra-Zen configs:
DatasetSpecConfig(rid="28EA", version="0.4.0", exclude_tables=["Study", "Protocol"])
Fix — Option C: Flatten the traversal by adding direct members.
Add records from intermediate tables as direct dataset members rather than relying on deep FK traversal. This replaces the deep join with simpler association-based lookups.
Missing element type registration
Problem: Records from a table are added as members but the table is not a registered element type, so those records are ignored during export.
Fix:
- Tool: deriva_ml_add_dataset_element_type(hostname="data.example.org", catalog_id="1", dataset_rid="<rid>", element_table="<table>") to register the table.
- Then re-export the bag.
Stale dataset version
Problem: The bag reflects an older version of the dataset, missing recently added members.
Fix:
- Tool: deriva_ml_increment_dataset_version(hostname="data.example.org", catalog_id="1", dataset_rid="<rid>", description="...") to create a new version that captures current membership.
- Re-export the bag after incrementing.
Records exist but FK not established
Problem: Related records exist in the catalog but are not linked via FK to the member records.
Fix:
- Check the FK columns on the related records. Ensure they contain the correct RID values pointing to the dataset member records.
- Tool: get_entities(hostname="data.example.org", catalog_id="1", schema="<schema>", table="<table>", filters={...}) to verify FK column values (or query_attribute with a path expression if you only want specific columns / FK joins).
Quick Diagnostic Checklist
Use this checklist when data is missing from a bag:
- Are the records dataset members? deriva_ml_list_dataset_members(hostname=..., catalog_id=..., dataset_rid=...) -- check if expected records appear.
  - If not: deriva_ml_add_dataset_members.
- Is the table a registered element type? deriva_ml_list_dataset_element_types(hostname, catalog_id).
  - If not: deriva_ml_add_dataset_element_type.
- Is there a direct FK path?
  - Inspect the schema (get_table, rag_search) for the element type.
  - If not: add intermediate records as members, or restructure FKs.
- Does validation show the discrepancy?
  - Python API bag inspection -- look at missing RIDs per table.
- Is the version current? deriva_ml_increment_dataset_version if members were recently changed.
- Is the download timing out?
  - First try increasing the timeout: timeout=[10, 1800] (30 min read timeout).
  - If that's not enough, use exclude_tables to prune expensive FK branches.
  - Or add intermediate records as direct members to flatten the joins.
- Preview before full download. deriva_ml_bag_info(hostname=..., catalog_id=..., dataset_rid=..., version=...) -- shows row counts, asset sizes, and manifest before downloading.
Reference Resources
deriva://docs/datasets — Full guide to bag export traversal, FK paths, and troubleshooting. Read this for detailed examples and edge cases beyond what this skill covers.
deriva://catalog/{h}/{c}/ml/dataset/{rid}/bag-preview — Preview bag contents before downloading
deriva_ml_list_dataset_element_types(hostname, catalog_id) — Check which element types are registered
Related Tools
| Tool | Purpose |
|---|---|
| deriva_ml_list_dataset_members | List all members of a dataset |
| deriva_ml_add_dataset_members | Add records to a dataset |
| deriva_ml_delete_dataset_members | Remove records from a dataset |
| deriva_ml_add_dataset_element_type | Register a table as dataset element type |
| Python API bag inspection | Validate bag contents against expectations |
| deriva_ml_increment_dataset_version | Bump dataset version after changes |
| deriva_ml_get_dataset_spec | View dataset specification |
| deriva_ml_bag_info | Preview row counts, asset sizes, and manifest before downloading |
| Python API dataset.download_dataset_bag(version) | Download the dataset bag (supports exclude_tables and timeout) |
| deriva_ml_denormalize_dataset | Schema shape + size estimates (no dataset needed), or flatten dataset for analysis |
| query_attribute | Inspect FK column values via filtered queries |
| get_table | Check table schema and FK relationships |