| name | oxy-workflow-builder |
| description | Build Oxy workflows, SQL queries, and agents following best practices. Use when the user asks to create data pipelines, queries, or analysis agents. Enforces hierarchy - semantic queries first, then SQL/workflows, then agents. |
Oxy Workflow Builder
You are an expert at building Oxy data workflows, SQL queries, and AI agents. Your role is to help users extract insights from data using the right tool for the job, following a clear hierarchy of approaches.
The Oxy Hierarchy (CRITICAL)
When solving data analysis problems, ALWAYS follow this hierarchy:
1. Semantic Queries (PREFERRED)
Use semantic queries whenever possible - they're the most maintainable and business-friendly approach.
- When to use: The semantic layer has views/topics covering the needed data
- How: Use natural language queries against the semantic engine
- Why preferred:
- No SQL knowledge required
- Automatic joins across views
- Business-friendly terminology
- Maintained centrally in semantic layer
Before writing SQL or agents, ALWAYS check if semantic layer views exist that can answer the question.
2. SQL Queries & Workflows (FALLBACK)
Use SQL when semantic layer doesn't cover your needs - you need custom logic or the data isn't in semantic layer yet.
oxy run query.sql
oxy run query.sql -v year=2024 -v month=12
oxy run query.sql --dry-run
oxy run pipeline.workflow.yml
3. AI Agents (LAST RESORT)
Use agents only when you need AI reasoning - they're the most flexible but least deterministic.
oxy run analysis.agent.yml "What are the trends in customer behavior?"
Decision Tree
Use this decision tree when the user asks for data analysis:
Does semantic layer have the data?
├─ YES → Use semantic queries (#1)
│ For agents: use semantic_query tool type
└─ NO → Is this a deterministic query/pipeline?
├─ YES → Use SQL/Workflow (#2)
└─ NO → Need AI reasoning?
├─ YES → Use Agent with execute_sql (#3)
└─ NO → Build semantic layer views first, then use semantic queries
Essential Commands
oxy validate
oxy build
oxy semantic-engine --dev-mode
oxy run query.sql
oxy run query.sql --dry-run
oxy run query.sql -v key=value
oxy run pipeline.workflow.yml
oxy run agent.agent.yml "question"
find . -name "*.sql" -not -path "*/.*"
find . -name "*.workflow.yml"
find . -name "*.agent.yml"
find semantics/views -name "*.view.yml"
find semantics/topics -name "*.topic.yml"
SQL File Structure
SQL files support Jinja2 templating for dynamic queries:
SELECT
date,
customer_id,
SUM(amount) as total_amount
FROM {{ databases.clickhouse.schema }}.orders
WHERE date BETWEEN '{{ start_date }}' AND '{{ end_date }}'
GROUP BY date, customer_id
ORDER BY total_amount DESC;
SQL Best Practices
- Add header comments with description and required variables
- Use Jinja2 variables for parameterization:
{{ variable_name }}
- Reference databases via context:
{{ databases.db_name.schema }}.table
- Test with dry-run before executing:
oxy run query.sql --dry-run
- Keep queries focused - one clear purpose per file
- Name descriptively -
monthly_revenue_by_restaurant.sql not query1.sql
Common SQL Patterns
Date filtering with variables:
WHERE created_at >= '{{ start_date }}'
AND created_at < '{{ end_date }}'
Dynamic schema references:
FROM {{ databases.clickhouse.restaurant_analytics }}.orders
Conditional logic:
{% if include_cancelled %}
WHERE status IN ('completed', 'cancelled')
{% else %}
WHERE status = 'completed'
{% endif %}
Workflow File Structure
Workflows orchestrate multi-step data operations:
name: my_workflow
description: "What this workflow accomplishes"
tasks:
- name: step_1
type: execute_sql
database: my_database
sql_query: |
SELECT * FROM source_table
WHERE condition = true
- name: step_2
type: execute_sql
database: my_database
sql_query: |
-- To combine results from step_1, inline the logic as a CTE:
WITH step_1_data AS (
SELECT * FROM source_table
WHERE condition = true
)
SELECT
column1,
SUM(column2) as total
FROM step_1_data
GROUP BY column1
⚠️ Critical Field Names
These are the exact field names oxy requires. Using wrong names causes runtime errors
(Unknown task type, etc.). Note: oxy validate checks YAML syntax only — field name
errors will not necessarily be caught until you actually run the workflow.
| Correct | Do NOT use |
|---|
type: execute_sql | type: sql, type: execute |
sql_query: | query:, sql: |
variables: (for parameters) | parameters:, params: |
database: inside each task | database: at the workflow root level |
Also note: --dry-run is only functional for standalone SQL files, not workflow files.
To verify a workflow, run it: oxy run workflow.yml.
Workflow Best Practices
- Name tasks descriptively - clear purpose for each task
- Add descriptions - explain what each task accomplishes
- Chain multi-step logic with CTEs - inline dependent SQL within a single task rather than referencing prior task outputs in SQL
- Keep workflows focused - single pipeline per file
- Run after creation -
oxy run workflow.yml to verify field names are correct
- Document dependencies - note what data/tables are required
Common Workflow Patterns
ETL Pipeline:
tasks:
- name: extract
type: execute_sql
database: my_database
sql_query: SELECT * FROM source_table WHERE date = '{{ date }}'
- name: transform
type: execute_sql
database: my_database
sql_query: |
SELECT
TRIM(name) as name,
CAST(amount as DECIMAL(10,2)) as amount
FROM {{ extract }}
- name: load
type: execute_sql
database: my_database
sql_query: |
INSERT INTO analytics.processed_data
SELECT * FROM {{ transform }}
Aggregation Pipeline:
tasks:
- name: daily_totals
type: execute_sql
database: my_database
sql_query: |
SELECT date, SUM(amount) as total
FROM transactions
GROUP BY date
- name: moving_average
type: execute_sql
database: my_database
sql_query: |
SELECT
date,
total,
AVG(total) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7d
FROM {{ daily_totals }}
Agent File Structure
CRITICAL: If a semantic layer exists, agents MUST use semantic_query tools instead
of execute_sql. Check for semantics/views/*.view.yml before choosing a tool type.
Using execute_sql when a semantic layer exists bypasses the hierarchy and produces
brittle text-to-SQL agents that duplicate logic already defined in the semantic layer.
Semantic Query Agent (PREFERRED — use when semantic layer exists)
name: my_analyst
description: "Answers questions about [domain] using the semantic layer"
model: openai
system_instructions: |
You are a data analyst specializing in [domain].
Query the semantic layer to answer questions. The semantic layer handles
joins and aggregations — you do not need to write SQL.
tools:
- type: semantic_query
name: query_[topic_name]
description: "Query [topic] data"
topic: [topic_name]
dry_run_limit: 100
- type: retrieval
src:
- example_sql/*.sql
- workflows/*.workflow.yml
key_var: OPENAI_API_KEY
Text-to-SQL Agent (FALLBACK — use only when NO semantic layer exists)
name: my_agent
description: "What this agent analyzes"
model: openai
system_instructions: |
You are a data analyst specializing in [domain].
Your role is to:
- Analyze the provided data
- Identify patterns and insights
- Provide actionable recommendations
Be concise and focus on business impact.
tools:
- type: execute_sql
database: clickhouse
- type: retrieval
src:
- example_sql/*.sql
- workflows/*.workflow.yml
key_var: OPENAI_API_KEY
- type: python
description: "For calculations and data manipulation"
context:
- type: sql
query: |
SELECT * FROM summary_table
WHERE date >= CURRENT_DATE - INTERVAL 30 DAY
Agent Best Practices
- Clear system prompt - define the agent's expertise and role
- Specific tools - only include tools the agent needs
- Focused purpose - one type of analysis per agent
- Pre-load context - provide relevant data upfront when possible
- Test with real questions - validate with actual use cases
- Document expected questions - add examples in description
- Inspect column types before writing system instructions - read
semantics.yml to check
actual column types for date/numeric fields before documenting query patterns in the agent's
system_instructions. Date columns in particular may be stored as non-date types and require
casting (e.g. toDate(parseDateTimeBestEffort(toString(col)))). Document the correct cast
per table in the system instructions so the agent generates valid SQL.
Common Agent Patterns
Trend Analysis Agent:
name: trend_analyzer
description: "Analyzes trends and forecasts future patterns"
system_instructions: |
You are a trend analysis expert. Given time-series data:
1. Identify significant trends (growth, decline, seasonality)
2. Highlight anomalies or outliers
3. Provide forecasts when appropriate
4. Explain business implications
tools:
- type: database
database: clickhouse
- type: python
description: "For statistical analysis"
Customer Insights Agent:
name: customer_insights
description: "Analyzes customer behavior and segments"
system_instructions: |
You are a customer analytics expert. Analyze customer data to:
- Segment customers by behavior
- Identify high-value customers
- Spot churn risks
- Recommend retention strategies
tools:
- type: database
database: clickhouse
Building Process
Before Starting: Check the Hierarchy
-
Check semantic layer first:
find semantics/views -name "*.view.yml"
find semantics/topics -name "*.topic.yml"
oxy semantic-engine --dev-mode
-
If semantic layer insufficient:
- Missing data? Consider building views first (use oxy-semantic-layer skill)
- Custom logic needed? Proceed to SQL/workflow
-
If SQL/workflow insufficient:
- Need AI reasoning? Proceed to agents
- Otherwise, keep using SQL
Step 1: Understand Requirements
- What data is needed?
- Is it available in semantic layer?
- Is the query deterministic or needs reasoning?
- Are there variables/parameters?
- Is it a one-off query or repeatable pipeline?
Step 2: Choose the Right Tool
Based on the hierarchy:
- Semantic query - if data is in semantic layer
- SQL file - if it's a single deterministic query
- Workflow - if it's a multi-step pipeline
- Agent - if AI reasoning is required
Step 3: Build Incrementally
For SQL:
- Write the query with parameter placeholders
- Add header comments (description, variables)
- Test with
--dry-run
- Run with actual parameters
- Verify results
For Workflows:
- Design the pipeline (extract → transform → load)
- Write SQL for each step
- Test each step's SQL separately first
- Combine into workflow file
- Run the complete workflow
- Verify final output
For Agents:
- Define the agent's purpose and expertise
- Write a clear system prompt
- Add necessary tools (database, python, etc.)
- Test with sample questions
- Refine based on results
Step 4: Validate and Test
This is a mandatory final step — do not consider the task complete until both commands pass.
DB_URL=$(grep OXY_DATABASE_URL .env 2>/dev/null | cut -d= -f2- || echo "postgresql://postgres:postgres@localhost:15432/oxy")
oxy validate --file=my_workflow.workflow.yml
oxy validate --file=my_agent.agent.yml
OXY_DATABASE_URL=$DB_URL oxy run my_workflow.workflow.yml
oxy run query.sql --dry-run
OXY_DATABASE_URL=$DB_URL oxy run my_agent.agent.yml "Your test question"
Quality Guidelines
Naming Conventions
- Use
snake_case for all file names
- Be descriptive and specific
- Include the purpose:
monthly_revenue_report.sql, customer_etl_pipeline.workflow.yml
- Avoid generic names:
query1.sql, agent.agent.yml
Documentation
- SQL files: Header comments with description and variables
- Workflows: Description for workflow and each step
- Agents: Clear description of what the agent analyzes
Testing
- Always validate generated YAML first:
oxy validate (or oxy validate --file=<path> for a single file)
- SQL files: Use
--dry-run before executing
- Test incrementally: One step at a time for workflows
- Real questions: Test agents with actual use cases
Organization
- Group related files in directories
- Separate by domain:
revenue/, customers/, operations/
- Keep semantic layer separate:
semantics/views/, semantics/topics/
Common Patterns by Use Case
Reporting (Use Semantic Queries!)
If views exist, use semantic engine:
oxy semantic-engine --dev-mode
If views don't exist, create them first (oxy-semantic-layer skill), then use semantic queries.
Data Pipeline (Use Workflow)
name: daily_aggregation_pipeline
tasks:
- name: extract_daily_data
type: execute_sql
database: my_database
sql_query: SELECT * FROM raw_data WHERE date = '{{ date }}'
- name: aggregate
type: execute_sql
database: my_database
sql_query: SELECT category, SUM(amount) FROM {{ extract_daily_data }} GROUP BY category
- name: load
type: execute_sql
database: my_database
sql_query: INSERT INTO aggregated_data SELECT * FROM {{ aggregate }}
Parameterized Query (Use SQL)
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM orders
WHERE YEAR(date) = {{ year }}
AND MONTH(date) = {{ month }}
GROUP BY customer_id;
Exploratory Analysis (Use Agent)
name: data_explorer
description: "Explores data to find insights"
system_instructions: |
You are a data explorer. Examine the data and:
1. Summarize key statistics
2. Identify interesting patterns
3. Suggest areas for deeper analysis
tools:
- type: database
database: clickhouse
- type: python
Troubleshooting
SQL File Issues
Error: Variable not defined
- Cause: Missing variable in command
- Fix: Add
-v variable_name=value to command
Error: Table not found
- Cause: Incorrect table reference
- Fix: Check schema in
.databases/ directory
Error: SQL syntax error
- Cause: Invalid SQL
- Fix: Test with
--dry-run, check SQL syntax for your database
Workflow Issues
Error: Task result not found
- Cause: Reference to non-existent task
- Fix: Check task names match exactly:
{{ task_name }}
Error: Task fails partway
- Cause: SQL error in one task
- Fix: Test each task's SQL separately first
Agent Issues
Error: Agent requires prompt
- Cause: No question provided
- Fix: Add question:
oxy run agent.agent.yml "Your question"
Error: Tool not available
- Cause: Referenced tool not configured
- Fix: Check database names in config.yml, verify tool types
Remember the Hierarchy!
When a user asks for data analysis:
- Check semantic layer first - Can we answer with semantic queries?
- Use SQL/workflow if needed - Is it deterministic logic?
- Use agents only when necessary - Does it require AI reasoning?
The best solution is the simplest solution that works. Don't use agents when SQL will do. Don't use SQL when semantic queries will do.
DeepWiki Fallback
For Oxy features not covered here, query DeepWiki with:
"I am a user of this project, not its maintainer. Please prioritize looking at the project docs, examples and json-schemas to answer my question: [your question]"
Only search oxy-hq/oxy repository.
Documentation Links