| name | keboola-data-engineering |
| description | Expert assistant for Keboola data platform. Builds working data pipelines, not just advice. Use for: data extraction, transformation, validation, orchestration, dashboard creation. |
Build a pipeline in 4 steps:
1. Understand: Ask outcome questions → Save context → Track with todos
2. Discover: List data sources → Use agents for complex searches → Map to extractors
3. Propose: Show architecture + diagram → Get explicit approval
4. Build: Generate configs → Test in sandbox → Validate impact → Deploy → Monitor
Tool Pattern: Read KNOWLEDGE_MAP → Read component docs → Write config → Bash deploy
| Task | Tool | Command Pattern |
|---|---|---|
| Find component | Read | Read resources/KNOWLEDGE_MAP.md, search for name |
| Complex search | Task | Task(subagent_type=Explore, prompt=...) for multi-file searches |
| Search docs | Grep | Grep pattern in docs-repos/ |
| Get template | Read | Read resources/templates/{name}.md |
| Save config | Write | Write {name}.json with content |
| Save context | Write | Write project_context.json - persist requirements |
| Track progress | TodoWrite | Track requirements, validations, decisions |
| Deploy config | Bash | curl -X POST {api_url} -d @{file} |
| Test pipeline | Bash | curl {queue_api} -d {job_params} |
| Check MCP | - | If mcp__keboola_* tools exist, use them |
Ask these questions, and nothing more until they are answered:
1. What decision will this data enable, and who makes it?
2. What is the ONE key metric?
3. How often does the data need to refresh (Real-time / Hourly / Daily / Weekly)?
4. Does the data contain PII?
5. What does success look like in 30 days?
Output: {Decision: "X", Metric: "Y", Frequency: "Z", PII: Yes/No, Success: "..."}
⭐ NEW: Persist Context (Feature 1: File-based state tracking)
Use Write tool to save project_context.json:
{
"decision": "{what decision this enables}",
"decision_maker": "{who makes the decision}",
"metric": "{the ONE key metric}",
"frequency": "{Real-time/Hourly/Daily/Weekly}",
"pii": true/false,
"success_criteria": "{what success looks like in 30 days}",
"timestamp": "{ISO 8601 timestamp}"
}
⭐ NEW: Track Context with Todos (Feature 2: TodoWrite context tracking)
Use TodoWrite tool to track business requirements:
{
"todos": [
{"content": "Business context captured: {metric}, {frequency}, PII={yes/no}", "status": "completed", "activeForm": "Capturing business context"},
{"content": "Validate architecture includes PII handling (required)", "status": "pending", "activeForm": "Validating PII requirements"},
{"content": "Ensure {frequency} schedule is implemented", "status": "pending", "activeForm": "Implementing schedule"}
]
}
Use Read tool on resources/templates/Discovery_Prompt.txt for 15 more optional questions.
If MCP available: Call mcp__keboola_storage_api(endpoint="/buckets") to list existing data
If MCP unavailable: Ask "What systems do you use?" then:
⭐ NEW: Complex Discovery with Agents (Feature 5: Multi-agent delegation)
For complex searches (e.g., "Find all extractors for CRM systems"):
Use Task tool:
subagent_type: Explore
thoroughness: medium
prompt: |
Find Keboola components for: {user's data sources}
Search:
- resources/KNOWLEDGE_MAP.md for component IDs
- docs-repos/connection-docs/components/extractors/ for configs
Return structured list:
- Component ID (e.g., keboola.ex-salesforce)
- Doc path
- Common config patterns (incremental, primaryKey)
- Typical use cases
For simple lookups, use the Read tool on resources/KNOWLEDGE_MAP.md directly.
Data Inventory Template:
{
"have": [
{"system": "Salesforce", "component": "keboola.ex-salesforce", "tables": ["Opportunity", "Account"]},
{"system": "MySQL", "component": "keboola.ex-db-mysql", "tables": ["orders", "customers"]}
],
"need": [
{"system": "Stripe", "status": "user will provide API key"},
{"system": "Product events", "status": "missing - defer to Phase 2"}
]
}
Use Write tool to save inventory as data_inventory.json
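To cross-check the "have" entries against what already exists in the project, a plain Storage API listing works even without MCP (a minimal sketch; assumes $KEBOOLA_API_TOKEN is set and the default US stack URL):
# List existing buckets and tables to compare against data_inventory.json
curl -s "https://connection.keboola.com/v2/storage/buckets" \
  -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" | jq -r '.[].id'
curl -s "https://connection.keboola.com/v2/storage/tables" \
  -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" | jq -r '.[].id'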
⚠️ CONTEXT-AWARE DESIGN (Feature 1: Read saved context)
Use Read tool on project_context.json to retrieve requirements, then check:
IF pii = true:
MUST include:
- PII field identification
- Masking/hashing/removal strategy
- Access control notes
IF frequency = "Real-time":
MUST use:
- CDC extractors (not batch)
- Stream processing pattern
IF metric contains revenue/financial/cost:
MUST include:
- Impact simulation (current state vs projected)
- Rollback plan
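A small shell check can turn the saved context into these gates before the design brief is written (a sketch over the project_context.json fields defined in Step 1):
# Read requirements back from project_context.json and flag mandatory design sections
PII=$(jq -r '.pii' project_context.json)
FREQUENCY=$(jq -r '.frequency' project_context.json)
METRIC=$(jq -r '.metric' project_context.json)
[ "$PII" = "true" ] && echo "Design must include: PII fields, masking strategy, access control"
[ "$FREQUENCY" = "Real-time" ] && echo "Design must use: CDC extractors + stream processing"
echo "$METRIC" | grep -Eiq 'revenue|financial|cost' && echo "Design must include: impact simulation + rollback plan"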
Use Read tool on resources/templates/Design_Brief.md, then create:
## {Problem} Solution
**Outcome**: {What user will get}
**Frequency**: {Daily at 6am}
**Data Sources**: {List from Step 2}
**Pipeline**:
{Source 1} --[Extractor]--> in.c-{source}.{table}
{Source 2} --[Extractor]--> in.c-{source}.{table}
↓
[SQL Transform + Validation]
↓
out.c-{purpose}.{table}
↓
[Dashboard/Writer]
**Data Quality**:
- Freshness: < {X} hours
- Completeness: No NULLs in {key_fields}
- Validation: {What checks will run}
**PII Handling** (if applicable):
- {field}: Masked/Hashed/Removed
**Impact** (if metric-driven):
- Current state: {baseline}
- Projected: {expected change}
- Risk: {potential issues}
⭐ NEW: Visual Architecture Diagram (Feature 7: Visual diagrams)
Generate mermaid diagram for visual representation:
graph LR
A[{Source 1}] -->|{Extractor}| B[in.c-{source}.{table}]
C[{Source 2}] -->|{Extractor}| D[in.c-{source}.{table}]
B --> E[SQL Transform]
D --> E
E --> F[out.c-{purpose}.{table}]
F --> G[{Writer/Dashboard}]
style E fill:#f9f,stroke:#333,stroke-width:4px
style F fill:#bbf,stroke:#333,stroke-width:2px
⚠️ STOP: Ask "Should I proceed with building this?"
Use Write tool to save as architecture_proposal.md
Use TodoWrite to update:
{"content": "Architecture proposal approved by user", "status": "completed", "activeForm": "Getting architecture approval"}
Pattern: Find docs → Generate config → Deploy via API
Example: Salesforce Extractor
Use Read tool on path from KNOWLEDGE_MAP (e.g., docs-repos/connection-docs/components/extractors/marketing-sales/salesforce/index.md)
Use Write tool to create salesforce_config.json:
{
"parameters": {
"objects": [
{
"name": "Opportunity",
"soql": "SELECT Id, Amount, StageName, CloseDate FROM Opportunity WHERE LastModifiedDate >= LAST_N_DAYS:7",
"output": "in.c-salesforce.opportunities"
}
]
}
}
# Deploy the config ("<" makes curl send the file contents as the form value, not a file upload)
curl -X POST "https://connection.keboola.com/v2/storage/components/keboola.ex-salesforce/configs" \
  -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
  --form "name=Salesforce Opportunities" \
  --form "configuration=<salesforce_config.json" \
  | tee response.json
CONFIG_ID=$(jq -r '.id' response.json)
echo "Extractor config ID: $CONFIG_ID"
Repeat for each data source from Step 2 inventory.
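When several sources share the same deploy pattern, the call can be looped (a sketch; the componentId:configFile pairs below are illustrative and should come from data_inventory.json):
# componentId:configFile pairs, one per "have" entry in the inventory
for pair in "keboola.ex-salesforce:salesforce_config.json" "keboola.ex-db-mysql:mysql_config.json"; do
  component="${pair%%:*}"
  config_file="${pair##*:}"
  curl -s -X POST "https://connection.keboola.com/v2/storage/components/$component/configs" \
    -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
    --form "name=${config_file%.json}" \
    --form "configuration=<$config_file" \
    | jq -r '"\(.id)  \(.name)"'
done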
Pattern: Business logic + Validation + Abort if fail
⭐ NEW: Agent-Assisted SQL Generation (Feature 5: Multi-agent delegation)
For complex transformations (joins, calculations, ML features):
Use Task tool:
subagent_type: general-purpose
prompt: |
Generate Snowflake SQL transformation for: {business requirement}
Context from project_context.json:
- Metric: {metric from context}
- PII: {yes/no from context}
- Frequency: {frequency from context}
Apply DA/DE concepts:
- Use Read tool on resources/Keboola_Data_Enablement_Guide.md
- Apply relevant patterns (aggregation, window functions, etc.)
MUST include:
1. Business logic SQL (CREATE TABLE with calculations)
2. PII handling (if PII=true): mask/hash/remove sensitive fields
3. Validation SQL (freshness, volume, schema, completeness)
4. SET ABORT_TRANSFORMATION pattern (fail fast on issues)
5. Comments explaining DA/DE concepts applied
Return: Complete SQL ready to test in sandbox
For simple transformations, manually write:
Use Write tool to create transform.sql:
-- 1. Business Logic
CREATE OR REPLACE TABLE "out.c-analytics.{output_table}" AS
SELECT
{columns},
{calculated_fields}
FROM "in.c-{source}.{table}"
{joins}
{where_clauses};
-- 2. Validation (REQUIRED - DO NOT SKIP)
CREATE OR REPLACE TABLE "_validation" AS
SELECT
COUNT(*) as row_count,
COUNT(DISTINCT {primary_key}) as unique_keys,
COUNT(*) - COUNT({critical_field}) as null_count,
DATEDIFF('hour', MAX({timestamp_field}), CURRENT_TIMESTAMP) as hours_old,
CASE
WHEN COUNT(*) = 0 THEN 'FAIL: No data'
WHEN null_count > 0 THEN 'FAIL: NULLs in {critical_field}'
WHEN hours_old > {max_hours} THEN 'FAIL: Data too old'
WHEN row_count != unique_keys THEN 'FAIL: Duplicate keys'
ELSE 'PASS'
END as status
FROM "out.c-analytics.{output_table}";
-- 3. Abort if validation fails
SET ABORT_TRANSFORMATION = (
SELECT CASE WHEN status != 'PASS' THEN status ELSE '' END FROM "_validation"
);
⭐ NEW: Sandbox Testing (Feature 4: Sandbox testing)
Before deploying to production:
# 1. Create temporary workspace for testing
curl -X POST "https://connection.keboola.com/v2/storage/workspaces" \
-H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"backend":"snowflake"}' \
| tee workspace.json
WORKSPACE_ID=$(jq -r '.id' workspace.json)
# 2. Load sample data (last 7 days or 1000 rows)
curl -X POST "https://connection.keboola.com/v2/storage/workspaces/$WORKSPACE_ID/load" \
-H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
-d "input=in.c-{source}.{table}&days=7"
# 3. Run SQL in workspace
# (Use workspace credentials from workspace.json to connect and test SQL)
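# One way to script step 3 (a sketch only): pull the workspace credentials with jq and
# run the SQL through the snowsql CLI. Assumes snowsql is installed and that the field
# names under .connection match your workspace.json response - adjust if they differ.
HOST=$(jq -r '.connection.host' workspace.json)
USER=$(jq -r '.connection.user' workspace.json)
DATABASE=$(jq -r '.connection.database' workspace.json)
SCHEMA=$(jq -r '.connection.schema' workspace.json)
WAREHOUSE=$(jq -r '.connection.warehouse' workspace.json)
export SNOWSQL_PWD=$(jq -r '.connection.password' workspace.json)
snowsql -a "${HOST%%.snowflakecomputing.com*}" -u "$USER" \
  -d "$DATABASE" -s "$SCHEMA" -w "$WAREHOUSE" -f transform.sql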
# 4. Verify results
echo "Check: Did SQL complete without errors?"
echo "Check: Are output tables created?"
echo "Check: Do row counts make sense?"
# 5. If tests pass, continue to deployment
# 6. Cleanup workspace after testing
curl -X DELETE "https://connection.keboola.com/v2/storage/workspaces/$WORKSPACE_ID" \
-H "X-StorageApi-Token: $KEBOOLA_API_TOKEN"
Use Bash tool to deploy:
# Encode the SQL into the configuration JSON with jq so quotes and newlines stay valid
CONFIG_JSON=$(jq -nc --rawfile sql transform.sql '{queries: [$sql]}')
curl -X POST "https://connection.keboola.com/v2/storage/components/keboola.snowflake-transformation/configs" \
  -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
  --form "name={Transform Name}" \
  --form "configuration=$CONFIG_JSON" \
  | tee transform_response.json
TRANSFORM_ID=$(jq -r '.id' transform_response.json)
Use Read tool on resources/patterns/validation-patterns.md for 10+ validation examples.
Flows are UI-based. Use Write tool to create flow_instructions.md:
# Create Flow in Keboola UI:
1. Go to Flows → Create Flow
2. Name: "{Pipeline Name}"
3. Add components:
- Step 1 (parallel):
• Extractor 1 (config: {CONFIG_ID_1})
• Extractor 2 (config: {CONFIG_ID_2})
- Step 2: Transformation (config: {TRANSFORM_ID})
- Step 3: Writer/App (if applicable)
4. Schedule: {cronTab expression}
5. Save and note Flow Config ID
Then schedule via API:
Use Bash tool after user creates Flow:
# Create schedule
SCHEDULE=$(curl -X POST "https://connection.keboola.com/v2/storage/components/keboola.scheduler/configs/" \
-H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
--form 'name="{Pipeline} Schedule"' \
--form "configuration={\"schedule\":{\"cronTab\":\"{cron}\",\"timezone\":\"UTC\",\"state\":\"enabled\"},\"target\":{\"componentId\":\"keboola.orchestrator\",\"configurationId\":\"{FLOW_ID}\",\"mode\":\"run\"}}" \
| jq -r '.id')
# Activate (requires Master Token with scheduler permissions)
curl -X POST "https://scheduler.keboola.com/schedules" \
-H "X-StorageApi-Token: $MASTER_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"configurationId\": \"$SCHEDULE\"}"
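If the scheduler service's listing endpoint is available on your stack (an assumption - verify against the scheduler API docs), the new schedule can be read back to confirm it is enabled:
# Read back schedules and check state=enabled (endpoint assumed; adjust per scheduler API docs)
curl -s "https://scheduler.keboola.com/schedules" \
  -H "X-StorageApi-Token: $MASTER_TOKEN" \
  | jq .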
Use Bash tool to run and verify:
# Queue job
JOB=$(curl -X POST "https://queue.keboola.com/jobs" \
-H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"mode\":\"run\",\"component\":\"keboola.orchestrator\",\"config\":\"{FLOW_ID}\"}" \
| jq -r '.id')
echo "Job ID: $JOB"
echo "Monitor: https://connection.keboola.com/admin/projects/{PROJECT_ID}/jobs/$JOB"
# Wait for job to complete (poll every 10 seconds)
for i in {1..30}; do
STATUS=$(curl -s "https://queue.keboola.com/jobs/$JOB" \
-H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
| jq -r '.status')
if [ "$STATUS" = "success" ]; then
echo "✅ Job completed successfully"
break
elif [ "$STATUS" = "error" ]; then
echo "❌ Job failed"
break
else
echo "⏳ Status: $STATUS... (${i}/30)"
sleep 10
fi
done
⭐ NEW: Error Recovery (Feature 6: Error recovery workflows)
If job fails:
1. Get error message:
curl "https://queue.keboola.com/jobs/$JOB" -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" | jq '.result.message'
2. Use decision tree to diagnose:
- "No data" → Check extractor ran successfully, verify source connectivity
- "Validation failed" → Check _validation table, review thresholds
- "SQL error" → Review syntax, test in workspace
- "Timeout" → Optimize query (add indexes, reduce date range)
- "Permission denied" → Check API token permissions
3. For complex issues, spawn troubleshooting agent:
Use Task tool:
subagent_type: general-purpose
prompt: |
Debug Keboola pipeline failure
Error message: {error from job logs}
Component: {component_id}
Steps:
1. Use Read tool on resources/runbooks/common_issues.md
2. Search docs-repos/ for error message using Grep
3. Provide:
- Root cause diagnosis
- Fix (SQL change, config change, or API call)
- Prevention (validation to add, monitoring to set up)
Return structured fix with code
4. Apply fix and re-test
Preview output (first 10 rows):
curl "https://connection.keboola.com/v2/storage/tables/out.c-analytics.{table}/data-preview" \
-H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
| head -10
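Beyond a preview, the table detail endpoint reports row count and last import time, which feeds the freshness and volume checks (a minimal sketch; field names follow the Storage API table detail response - adjust if they differ):
# Row count and freshness straight from table metadata
curl -s "https://connection.keboola.com/v2/storage/tables/out.c-analytics.{table}" \
  -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
  | jq '{rows: .rowsCount, lastImport: .lastImportDate}'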
Before marking complete, validate the solution meets business requirements.
Use Read tool on project_context.json to retrieve original goals.
1. Data Quality Verification:
- Use Bash: Query _validation table
- Confirm: "status = 'PASS'"
- Check: Freshness, completeness, volume meet thresholds
2. Business Impact Analysis (for metric-driven projects):
IF metric relates to revenue/cost/conversions:
- Generate impact simulation:
• Query baseline (current state)
• Query projection (with new data/model)
• Calculate % change
• Identify affected entities (customers, SKUs, etc.)
Example SQL:
SELECT
'Current' as scenario,
SUM({metric}) as total,
COUNT(DISTINCT {entity}) as entities_affected
FROM {baseline_table}
UNION ALL
SELECT
'Projected' as scenario,
SUM({metric}) as total,
COUNT(DISTINCT {entity}) as entities_affected
FROM {new_output_table};
3. Risk Assessment:
- Low sample size warning: entities with < 30 data points
- High impact changes: > 20% change in key metrics
- Data quality issues: validation warnings (not failures)
4. Rollback Plan Documentation:
Use Write tool to create rollback_plan.md:
## Rollback Plan
**If {metric} drops > {threshold}% in first week:**
1. Revert to previous config:
curl -X POST "https://connection.keboola.com/v2/storage/components/{component}/configs/{id}/versions/{version}/rollback" \
  -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN"
2. Disable schedule:
curl -X DELETE "https://scheduler.keboola.com/schedules/{schedule_id}" \
  -H "X-StorageApi-Token: $MASTER_TOKEN"
3. Alert stakeholders:
- {decision_maker from context}
- Data team lead
**Monitoring:**
- Check {metric} daily for first week
- Alert if validation fails 2+ times
- Review impact after 30 days (success criteria: {from context})
Show simulation/validation results, then ask:
📊 VALIDATION RESULTS:
- Data quality: {PASS/WARN}
- Impact simulation: {metric} expected to change by {X%}
- Entities affected: {count}
- Risks identified: {list}
Review rollback_plan.md for contingency.
Reply with one of:
1. "deploy" - Deploy to production with {frequency} schedule
2. "test" - Run as one-off test first, review results before scheduling
3. "revise" - Adjust parameters (specify what to change)
Use TodoWrite to track:
{"content": "Business impact validated and approved", "status": "completed", "activeForm": "Validating business impact"}
Use Write tool to create DELIVERABLES.md:
## Delivered: {Pipeline Name}
### Components Created:
| Component | ID | Purpose |
|-----------|----|----|
| {Extractor 1} | {ID} | Extract {data} |
| {Transformation} | {ID} | Calculate {metric} |
| {Flow} | {ID} | Orchestrate {frequency} run |
### Output Data:
- **Table**: out.c-analytics.{table}
- **Rows**: {count}
- **Freshness**: {hours} hours old
- **Sample**: {show first 5 rows}
### Data Quality Results:
✅ Validation: PASS
✅ Freshness: {X} hours (target: < {Y})
✅ Completeness: 0 NULLs in critical fields
✅ Uniqueness: No duplicates
### Business Impact:
- **Metric**: {metric from context}
- **Current**: {baseline value}
- **Projected**: {expected value}
- **Change**: {%}
### Schedule:
- Runs: {frequency} at {time}
- Next run: {timestamp}
### Access:
- Keboola UI: {project_url}/flows/{flow_id}
- Table: {project_url}/storage/tables/out.c-analytics.{table}
### Rollback:
- See rollback_plan.md for contingency procedures
- Monitor {metric} for first 30 days
- Success criteria: {from context}
question: "What's your data source?"
answers:
- condition: "Database (MySQL, PostgreSQL, Snowflake, etc.)"
component_pattern: "keboola.ex-db-{database}"
path: "docs-repos/connection-docs/components/extractors/database/"
- condition: "SaaS API (Salesforce, Stripe, GA, etc.)"
component_pattern: "keboola.ex-{service}"
path: "docs-repos/connection-docs/components/extractors/marketing-sales/"
- condition: "Custom REST API"
component: "keboola.ex-generic-v2"
path: "docs-repos/developers-docs/extend/generic-extractor/"
- condition: "File upload (CSV, JSON)"
component: "keboola.ex-storage"
path: "docs-repos/connection-docs/components/extractors/storage/"
question: "What data quality checks are needed?"
checks:
freshness:
when: "Time-sensitive data (orders, events, etc.)"
sql: "DATEDIFF('hour', MAX(timestamp_col), CURRENT_TIMESTAMP) < {max_hours}"
completeness:
when: "Critical fields must exist"
sql: "COUNT(*) = COUNT({critical_field})"
uniqueness:
when: "Primary key must be unique"
sql: "COUNT(*) = COUNT(DISTINCT {primary_key})"
volume:
when: "Expecting consistent row counts"
sql: "COUNT(*) BETWEEN {min} AND {max}"
distribution:
when: "Detecting anomalies in metrics"
sql: "AVG({metric}) BETWEEN {historical_avg - 3*stddev} AND {historical_avg + 3*stddev}"
question: "How to name buckets?"
guidance: "Match existing project conventions. Common patterns:"
patterns:
source_based:
input: "in.c-{source}.{table}"
output: "out.c-{purpose}.{table}"
example: "in.c-salesforce.opportunities → out.c-analytics.revenue"
layer_based:
raw: "in.c-bronze.{source}_{table}"
cleaned: "out.c-silver.{domain}_{entity}"
analytics: "out.c-gold.{business_metric}"
example: "bronze.salesforce_opp → silver.sales_pipeline → gold.revenue_daily"
advice: "Use Read tool on existing project buckets to match convention"
name: "Change Data Capture to Analytics Dashboard"
use_case: "Real-time operational data → Business metrics"
components:
- {type: "CDC Extractor", examples: ["MySQL CDC", "PostgreSQL CDC"]}
- {type: "Stream Transform", tool: "Snowflake transformation"}
- {type: "Aggregation", sql: "Windowed aggregates"}
- {type: "Dashboard", tool: "Streamlit/Tableau"}
frequency: "Continuous (5-15min latency)"
template: "Use Read tool on resources/flows/examples/flow_cdc_orders.md"
name: "Daily Batch Extract-Transform-Load"
use_case: "Nightly data warehouse refresh"
components:
- {type: "Batch Extractor", schedule: "Daily 2am"}
- {type: "SQL Transform", layers: ["bronze", "silver", "gold"]}
- {type: "Data Warehouse Writer", targets: ["Snowflake", "BigQuery"]}
frequency: "Daily"
template: "Use Read tool on resources/flows/examples/flow_sales_kpi.md"
name: "Model Training & Inference Pipeline"
use_case: "Predict churn, score leads, forecast demand"
components:
- {type: "Feature Extractor", source: "Historical data"}
- {type: "Python Transform", tool: "scikit-learn/pandas"}
- {type: "Model Storage", location: "S3/GCS bucket"}
- {type: "Scoring Transform", schedule: "Hourly"}
frequency: "Train: Weekly, Score: Hourly"
template: "Use Read tool on resources/flows/examples/flow_model_scoring.md"
| System | Component ID | Common Config |
|---|---|---|
| MySQL | keboola.ex-db-mysql | incremental: updated_at, primaryKey: id |
| PostgreSQL | keboola.ex-db-pgsql | incremental: updated_at |
| Salesforce | keboola.ex-salesforce | SOQL with LAST_N_DAYS |
| Google Analytics | keboola.ex-google-analytics-v4 | dimensions, metrics, date ranges |
| Stripe | keboola.ex-stripe | objects: charges, customers, subscriptions |
| Snowflake | keboola.ex-db-snowflake | incremental: timestamp column |
| BigQuery | keboola.ex-google-bigquery-v2 | SQL query based |
| Generic API | keboola.ex-generic-v2 | REST API with pagination |
Use Read tool on docs-repos/connection-docs/components/extractors/{category}/{name}/index.md for full config details.
| Backend | Component ID | Best For |
|---|---|---|
| Snowflake SQL | keboola.snowflake-transformation | Large datasets, window functions |
| BigQuery SQL | keboola.transformation-bigquery | Google Cloud ecosystem |
| Python | keboola.python-transformation-v2 | ML, pandas, custom logic |
| DBT | keboola.dbt-transformation | SQL-based modeling |
symptom: "SET ABORT_TRANSFORMATION triggered"
steps:
1: {action: "Use Bash tool", cmd: "curl queue API to get job logs"}
2: {action: "Check _validation table", query: "SELECT * FROM _validation"}
3: {action: "Identify failure", cases: ["FAIL: No data", "FAIL: NULLs", "FAIL: Stale"]}
4: {action: "Use Read tool on resources/runbooks/common_issues.md for resolution"}
symptom: "Flow shows error status"
steps:
1: "Get job ID from Flow run"
2: "Use Bash: curl queue.keboola.com/jobs/{id} | jq '.result.message'"
3: "Common causes:"
- "API credentials expired → Regenerate in source system"
- "Schema changed → Update extractor config"
- "Timeout → Optimize query or increase limits"
4: "Use Read tool on resources/runbooks/incidents/pipeline_failure.md"
5: "For complex issues, use Task tool with troubleshooting agent (see Step 4D)"
⚠️ NEVER commit API tokens to git
# Store in environment (not in code)
export KEBOOLA_API_TOKEN="your-token"
export MASTER_TOKEN="master-token" # Admin permissions, scheduler access
# Rotate every 90 days
# Create at: {project_url}/settings/tokens
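A quick way to confirm a token works (and see its scope) before any deploy step is the token verification endpoint (a sketch; field names may differ slightly in your stack's response):
# Verify the token and show which project it belongs to
curl -s "https://connection.keboola.com/v2/storage/tokens/verify" \
  -H "X-StorageApi-Token: $KEBOOLA_API_TOKEN" \
  | jq '{project: .owner.name, description: .description, expires: .expires}'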
before_building:
- question: "Does data contain PII?"
- if_yes:
- "Identify PII fields (name, email, SSN, financial)"
- "Determine masking strategy:"
hashing: "SHA256(field) for email, phone"
masking: "CONCAT(LEFT(field, 3), '***') for partial visibility"
removal: "Exclude from SELECT"
tokenization: "Replace with pseudonymous ID"
- "Document in architecture (Step 3)"
- "Implement in SQL transform (Step 4B)"
- "Verify in sandbox testing (Step 4B)"
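Putting the four strategies side by side makes the Step 4B implementation concrete (a sketch appended to transform.sql via a heredoc; column names are illustrative, and Snowflake's hashing function is SHA2):
cat >> transform.sql <<'SQL'
-- PII handling (illustrative column names)
CREATE OR REPLACE TABLE "out.c-analytics.customers_masked" AS
SELECT
    SHA2(email, 256)                  AS email_hash,   -- hashing
    CONCAT(LEFT(full_name, 3), '***') AS name_masked,  -- partial masking
    customer_token                    AS customer_id,  -- tokenization (pre-generated pseudonymous ID)
    signup_date,                                       -- ssn and raw email excluded entirely (removal)
    plan
FROM "in.c-crm.customers";
SQL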
✅ Use Read tool on project_context.json in Step 3 (context-aware design)
✅ Use TodoWrite to track requirements across steps
✅ Use Task tool with agents for complex searches/generation
✅ Test SQL in sandbox before production deployment
✅ Validate business impact before final deployment
✅ Generate visual diagrams (mermaid) for architecture
✅ Create rollback plans for metric-impacting changes
✅ Use Read tool before guessing (KNOWLEDGE_MAP → component docs)
✅ Use Write tool for all configs/SQL (create files, don't echo)
✅ Use Bash tool for API calls (show complete curl with error handling)
✅ Include validation in EVERY transform (SET ABORT_TRANSFORMATION)
✅ Get approval before building (Step 3 stop gate)
✅ Match existing naming conventions (check project first)
✅ Capture IDs from API responses (jq -r '.id')
❌ Don't skip context persistence (project_context.json is required)
❌ Don't ignore PII requirements from Step 1
❌ Don't deploy to production without sandbox testing
❌ Don't skip business impact validation (Step 4.5)
❌ Don't skip validation (it's mandatory)
❌ Don't use ERROR() function (use SET ABORT_TRANSFORMATION)
❌ Don't hardcode secrets (use env vars: $KEBOOLA_API_TOKEN)
❌ Don't assume real-time (Keboola is batch: 5+ min typical)
❌ Don't recommend Orchestrator (use Flows - the modern alternative)
❌ Don't make up data (no time estimates, no performance numbers without basis)
Knowledge Base:
- resources/KNOWLEDGE_MAP.md - 85+ extractors, 29+ writers with doc paths
- resources/Keboola_Data_Enablement_Guide.md - Dictionary + 7 book extracts
- docs-repos/connection-docs/ - 252 markdown files (user-facing docs)
- docs-repos/developers-docs/ - 199 markdown files (API, MCP, automation)
Templates (Use Read tool):
- resources/templates/Design_Brief.md - Architecture proposal template
- resources/templates/Discovery_Prompt.txt - 15 business questions
- resources/templates/Validation.md - Data quality patterns
Examples (Use Read tool):
- resources/flows/examples/flow_sales_kpi.md - Batch ETL pattern
- resources/flows/examples/flow_cdc_orders.md - Real-time CDC pattern
- resources/flows/examples/flow_model_scoring.md - ML inference pattern
Troubleshooting (Use Read tool):
- resources/runbooks/common_issues.md - Duplicates, schema drift, freshness
- resources/runbooks/incidents/pipeline_failure.md - Debug checklist
- resources/runbooks/incidents/data_quality_breach.md - SLA breach response
Clone documentation (use Bash tool):
cd /home/user/bg/experiments/keboola-skill/
git clone https://github.com/keboola/connection-docs docs-repos/connection-docs
git clone https://github.com/keboola/developers-docs docs-repos/developers-docs
MCP Configuration (optional - adds live API access):
{
"mcpServers": {
"keboola": {
"command": "uvx",
"args": ["keboola_mcp_server", "--api-url", "https://connection.keboola.com"],
"env": {
"KBC_STORAGE_TOKEN": "<token>",
"KBC_WORKSPACE_SCHEMA": "<schema>"
}
}
}
}
Stack URLs (replace in MCP config):
- https://connection.keboola.com
- https://connection.us-east4.gcp.keboola.com
- https://connection.eu-central-1.keboola.com
- https://connection.north-europe.azure.keboola.com
- https://connection.europe-west3.gcp.keboola.com
Version: 4.1.0 - Advanced Claude Features
Updated: 2025-10-23
Key Changes: