mit einem Klick
debug-clickhouse-queries
// Investigate query evaluation failures in the Knowledge Graph synthetic data pipeline. Use when queries fail or return unexpected results after running the evaluate binary.
// Investigate query evaluation failures in the Knowledge Graph synthetic data pipeline. Use when queries fail or return unexpected results after running the evaluate binary.
Query the GitLab Knowledge Graph (Orbit) via `glab orbit remote` CLI subcommands or run a local copy with `glab orbit local`. Use for code-structure questions (who calls this function, where is this symbol defined), cross-project dependency and blast-radius analysis, merge-request and contributor queries, and any question answerable by traversing GitLab's unified entity graph (projects, users, MRs, issues, pipelines, files, definitions, vulnerabilities).
Audit and update documentation after code changes. Use when architecture, APIs, or behavior changed and docs may have drifted.
Profile GKG queries against ClickHouse with the query-profiler CLI. For optimizing query performance, comparing query plans, investigating slow queries, or checking ClickHouse resource usage.
AST-based code search and rewrite via tree-sitter patterns. Use instead of Grep/Edit for structural matching, batch rewrites, or context-aware queries (e.g. "unwrap inside impl blocks").
Investigate the history, usage, and liveness of code using search and git blame/log. Use when determining if code is dead, understanding why something exists, finding all callers before refactoring, or deciding whether something is safe to remove. Also useful for answering "who added this and why" or "is anything still using this".
GitLab Pajamas Design System expert for building UIs with Pajamas components and patterns. Use when: (1) implementing UI that should follow GitLab's Pajamas design system, (2) selecting or configuring Pajamas/GlComponent components (GlButton, GlAlert, GlModal, etc.), (3) translating Figma designs into Pajamas-compliant code, (4) questions about Pajamas component usage, variants, categories, or accessibility, (5) building GitLab-style interfaces, or (6) the user mentions "Pajamas", "GitLab UI", "Gl components", or "design system" in a GitLab context. Works hand-in-hand with the implement-design skill and Figma MCP tools.
| name | debug-clickhouse-queries |
| description | Investigate query evaluation failures in the Knowledge Graph synthetic data pipeline. Use when queries fail or return unexpected results after running the evaluate binary. |
When the evaluate binary reports failures or empty results, the goal is figuring out whether the problem is in the generated data, the query compilation, or something else.
Connect to ClickHouse and check what's really there:
clickhouse client --query "YOUR_QUERY"
Some starting points:
-- What columns exist?
DESCRIBE TABLE gl_<entity>
-- What values does a field actually contain?
SELECT <field>, count(*) FROM gl_<entity> GROUP BY <field>
-- What edge types were generated?
SELECT relationship_kind, count(*) FROM gl_edge
GROUP BY relationship_kind ORDER BY count(*) DESC
-- Sample some traversal paths
SELECT traversal_path FROM gl_<entity> LIMIT 10
The orbit query command lets you see what SQL the query engine produces without running the full pipeline. The query input format is defined in config/schemas/graph_query.schema.json.
First, sample some traversal paths from your data:
SELECT DISTINCT traversal_path FROM gl_group LIMIT 5
Then pass them to the query command (org ID is parsed from the first segment):
# From the SDLC queries file
cargo run -p orbit -- query -t "1/2/3/" fixtures/queries/sdlc_queries.json
# Multiple traversal paths
cargo run -p orbit -- query -t "1/2/" -t "1/3/" fixtures/queries/sdlc_queries.json
# Single query inline
cargo run -p orbit -- query -t "1/" --json '{"test": {"nodes": [{"type": "Pipeline"}]}}'
# JSON output for scripting
cargo run -p orbit -- query -t "1/" --format json fixtures/queries/sdlc_queries.json
This shows each query's input JSON, generated SQL, and parameters. You can then run the SQL directly in ClickHouse to see what's happening and use it to guide your investigations.
Empty results can come from three sources:
The evaluation report shows sampling metadata for each empty result:
path-scoped (N entities in 'path/') - IDs sampled from within the security contextglobal (N entities) - Fell back to global sampling (path had no matching entities)no sampling needed - Query has no node_ids parametersIf you see global sampling with empty results, the sampled IDs likely don't exist in the security context path. If you see path-scoped with empty results, the data exists but query predicates filter it out.
To tell them apart, check if matching data exists globally:
-- Does matching data exist anywhere?
SELECT count(*) FROM gl_project
WHERE visibility_level = 'public' AND star_count >= 100
-- How many entities are in the sampled path?
SELECT count(*) FROM gl_project
WHERE startsWith(traversal_path, '3/514/522/523/524/')
If global count is high but sampled path has few entities, it's a sampling issue, not a data bug.
The evaluate report shows the SQL and parameters for each query. Take a failing query and run it manually, then start removing predicates to narrow down what's breaking.
If UNKNOWN_COLUMN: Check if the column name in the query matches what's in the table. Could be a mismatch between the query engine and data generator.
If TYPE_MISMATCH: Check if the query is comparing incompatible types (e.g., timestamp column vs string date literal).
If 0 rows but data exists: Compare what the query filters for against what values actually exist. The enum values might not match, or an edge type might not be generated.
If traversal_path format error: Look at some actual paths in the data. Malformed paths (like double slashes) would fail validation.
If you see values like val188ebe1e3a382996 instead of proper enums, the fake data generator is falling back to random strings instead of using ontology values.
Quick diagnostic:
SELECT state, count(*) FROM gl_user GROUP BY state ORDER BY count(*) DESC
SELECT visibility_level, count(*) FROM gl_project GROUP BY visibility_level
SELECT user_type, count(*) FROM gl_user GROUP BY user_type
The ontology defines enum values in two ways:
type: enum with values: - explicitly an enumtype: string with values: - semantically an enum but typed as stringBoth should use the ontology's values: mapping. If not, check fake_data.rs to ensure it checks field.enum_values before falling back to pattern-based generation.
Traversal paths form a trie structure for access control:
1/2/ can access itself and descendants (1/2/3/, 1/2/3/4/, etc.)1/2/3/4/ can only access itself and descendants, NOT ancestors like 1/2/The startsWith(entity.traversal_path, security_context_path) filter enforces this: entities must be at or below the security context level.
Root entities (User, Group) have shallow paths. Nested entities (Project, MergeRequest, etc.) have deeper paths like "1/2/3/4/".
If queries return empty, check if the sampled path is too deep for the entities being queried:
-- Projects have nested paths
SELECT DISTINCT traversal_path FROM gl_project LIMIT 5
-- Returns: 1/2/3/, 1/2/4/, etc.
The synth config (crates/xtask/simulator.yaml) directly affects what data exists for queries to find.
Check which edge variants are configured:
-- What edges were actually generated?
SELECT relationship_kind, source_kind, target_kind, count(*)
FROM gl_edge
GROUP BY relationship_kind, source_kind, target_kind
ORDER BY count(*) DESC
If a query expects MergeRequest -> Pipeline edges but only User -> Pipeline exists, the query will return empty. Compare against crates/xtask/simulator.yaml associations section.
Associations can iterate per-source or per-target:
# Per-target (default): For each User, link 1 MR
AUTHORED:
"User -> MergeRequest": 1
# Per-source: For each MR, maybe link a User
MERGED:
"User -> MergeRequest":
ratio: 0.3
per: target
Why this matters: With 1000 Users and 100k MRs:
per: target with ratio 1 → 1000 edges (1 per User)per: source with ratio 0.3 → 30k edges (30% of MRs)If queries return empty for User-related edges, check:
Queries specify edge direction: {"type": "MERGED", "from": "merger", "to": "mr"}
This translates to: merger.id = edge.source_id AND edge.target_id = mr.id
All person-action edges use person-is-source convention (User is always source_id). Queries starting from the entity must use "direction": "incoming" or swap from/to.
Association edges must be queryable given the security filter rules. The generator uses edge_is_queryable():
This matches the query engine's behavior (crates/query-engine/compiler/src/security.rs):
When path-scoped sampling returns no results, the sampler falls back to org-scoped sampling using random_ids_in_org(). This ensures sampled entities are at least in the correct organization.
If you still see mismatches, check that the sampled entity exists in the org:
-- Check if sampled user is in the right org
SELECT id, traversal_path FROM gl_user WHERE id = <sampled_user_id>
-- Path should start with the security context's org_id
Column name definitions:
crates/query-engine/compiler/src/security.rs - security filter column namecrates/xtask/src/synth/arrow_schema.rs - data generation schemaEnum value sources:
config/ontology/nodes/**/*.yaml - ontology definitionscrates/xtask/src/synth/generator/fake_data.rs - fake value generation logicEdge configuration:
crates/xtask/simulator.yaml - relationships and associationsconfig/ontology/edges/*.yaml - edge type definitions (source/target kinds)Traversal path construction:
crates/xtask/src/synth/generator/traversal.rscrates/xtask/src/synth/generator/mod.rsAssociation generation:
crates/xtask/src/synth/config.rs - AssociationConfig, IterationDirectioncrates/xtask/src/synth/generator/mod.rs - generate_association_edges()config/ontology/edges/<type>.yaml define the right direction?crates/xtask/simulator.yaml associations?per: source vs per: target match the cardinality?cargo xtask synth generate
cargo xtask synth load
cargo xtask synth evaluate