with one click
data-join
// Join two datasets with automatic strategy selection (joinp vs join vs sqlp)
// Join two datasets with automatic strategy selection (joinp vs join vs sqlp)
Standard workflow order, tool selection matrix, and composition patterns for qsv CSV data wrangling
Respond to all pending review comments on the current PR — fetch comments, apply fixes, verify accuracy, test, commit, and reply. Use when addressing Copilot reviews, GitHub PR reviews, or any batch of review feedback.
Prepare an MCP server and plugin release by bumping versions across all files and updating changelog
Run SQL queries against CSV/TSV/Excel files using Polars SQL engine
Clean a CSV/TSV/Excel file - fix headers, trim whitespace, remove duplicates, validate
Convert between CSV, TSV, Excel, JSONL, Parquet, and other tabular formats
| name | data-join |
| description | Join two datasets with automatic strategy selection (joinp vs join vs sqlp) |
| user-invocable | true |
| argument-hint | <file1> <file2> |
| allowed-tools | ["mcp__qsv__qsv_sniff","mcp__qsv__qsv_count","mcp__qsv__qsv_headers","mcp__qsv__qsv_index","mcp__qsv__qsv_stats","mcp__qsv__qsv_select","mcp__qsv__qsv_sqlp","mcp__qsv__qsv_joinp","mcp__qsv__qsv_command","mcp__qsv__qsv_list_files","mcp__qsv__qsv_search_tools","mcp__qsv__qsv_get_working_dir","mcp__qsv__qsv_set_working_dir"] |
Join two tabular data files on common columns.
Cowork note: If relative paths don't resolve, call
mcp__qsv__qsv_get_working_dirandmcp__qsv__qsv_set_working_dirto sync the working directory.
| Scenario | Best Tool | Why |
|---|---|---|
| Standard equi-join | mcp__qsv__qsv_joinp | Polars engine, fastest |
| Non-equi join (>, <, BETWEEN) | mcp__qsv__qsv_sqlp | SQL supports complex conditions |
| Cross join / cartesian | mcp__qsv__qsv_sqlp | CROSS JOIN syntax |
| Memory-constrained | mcp__qsv__qsv_command with command: "join" | Streaming, lower memory |
| Fuzzy/approximate match | mcp__qsv__qsv_joinp with asof: true | Nearest-match join |
Index both files: Run mcp__qsv__qsv_index on both files for fast random access.
Inspect both files: Run mcp__qsv__qsv_headers on both files to identify column names. Determine which columns to join on.
Profile join columns: Run mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true on both files. Check the cardinality of join columns to determine optimal table order.
Choose strategy:
joinp: smaller cardinality table should be on the right for best performancemcp__qsv__qsv_sqlpmcp__qsv__qsv_joinp with asof: trueExecute join: Use mcp__qsv__qsv_joinp for standard joins:
joinp
columns1: "id"
input1: "file1.csv"
columns2: "id"
input2: "file2.csv"
# Join type: omit for inner (default), or set one of:
# left: true, full: true, cross: true
Or use mcp__qsv__qsv_sqlp for complex joins:
SELECT a.*, b.col1, b.col2
FROM file1 a
JOIN file2 b ON a.id = b.id AND a.date BETWEEN b.start_date AND b.end_date
For ASOF (nearest-match) joins, use mcp__qsv__qsv_joinp with asof: true:
joinp
columns1: "date"
input1: "events.csv"
columns2: "date"
input2: "reference.csv"
asof: true
strategy: "backward"
allow_exact_matches: true
strategy: "backward" (default) — match to the last right row with key < left keystrategy: "forward" — match to the first right row with key > left keystrategy: "nearest" — match to the numerically closest row (supports tolerance parameter)left_by/right_by parameters to restrict matching within subgroups (e.g., per jurisdiction)allow_exact_matches: true to include equal keys (<=, >=); default is strict inequality (<, >)Clean up result: Use mcp__qsv__qsv_select to remove duplicate join columns or unnecessary columns from the result.
Verify: Run mcp__qsv__qsv_count on the result. Compare with input counts to validate join behavior:
Before executing a join, read .stats.csv for both files and validate:
| Check | Stats Column | Red Flag | Action |
|---|---|---|---|
| Type match | type | Join columns have different types (e.g., Integer vs String) | Cast one column before joining: sqlp with CAST(col AS INTEGER) |
| Null density | nullcount, sparsity | sparsity > 0.3 on join column | Nulls don't match — expect unmatched rows; consider filtering nulls first |
| Value overlap | min, max | Non-overlapping ranges across files | No rows will match — verify correct join column |
| Skew detection | mode, mode_count | One value dominates (mode_count > 50% of rows) | Join will be heavily skewed many-to-one; verify this is expected |
| Uniqueness | uniqueness_ratio | Both files have uniqueness_ratio < 1.0 on join column | Many-to-many join risk — expect row explosion; verify with mcp__qsv__qsv_count after |
| Outlier keys | outliers_percentage | outliers_percentage > 5% on numeric join column | Outlier keys may not match across files; consider trimming first |
| Type | joinp Flag | SQL | Behavior |
|---|---|---|---|
| Inner | (default) | JOIN | Only matching rows |
| Left | --left | LEFT JOIN | All left + matching right |
| Full outer | --full | FULL OUTER JOIN | All rows from both |
| Cross | --cross | CROSS JOIN | Cartesian product |
| Left Anti | --left-anti | NOT IN / NOT EXISTS | Left rows without match |
| Left Semi | --left-semi | EXISTS | Left rows with match (no right cols) |
| ASOF | --asof | (use joinp) | Nearest-key match (temporal/numeric) |
joinp uses the Polars engine and is significantly faster than join for large filesjoinp optimize join executioncolumns1: "col1,col2"columns1: "id", columns2: "customer_id"joinp handles null values in join columns (nulls don't match by default)--try-parsedates — no need to pass it explicitly--left_by and --right_by (e.g., match nearest date per jurisdiction)--tolerance option (nearest strategy only) limits how far the nearest match can be: use duration strings for dates (1d, 30d, 365d) or positive integers for numeric keys--no-sort is set