| name | Eval Analyzer |
| description | Analyze evaluation baseline results, identify failure patterns, and generate actionable insights. Use after running eval baselines or when user asks to analyze eval results, check benchmarks, investigate failures, or understand what's failing. |
Analyze AILANG evaluation baseline results to identify failure patterns, compare model performance, and generate actionable insights.
Most common usage:
# User says: "Analyze the v0.3.24 eval results"
# This skill will:
# 1. Run eval-analyze to categorize failures (standard eval)
# 2. Run agent KPIs to analyze efficiency (agent eval)
# 3. Generate summary with jq queries
# 4. Identify top failing benchmarks
# 5. Show model performance comparison
# 6. Provide optimization recommendations
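Taken together, a typical pass over a fresh baseline looks like this (every command is documented in detail below; the version path is illustrative):
# Categorize failures (standard eval; note the -results flag)
ailang eval-analyze -results eval_results/baselines/v0.3.24 -dry-run
# Agent efficiency KPIs (agent eval)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.24
# Generate summary.jsonl for jq queries
ailang eval-summary eval_results/baselines/v0.3.24
# Model performance comparison
.claude/skills/eval-analyzer/scripts/compare_models.sh eval_results/baselines/v0.3.24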
For agent evaluation analysis (NEW - optimization focus):
# Step 1: Get efficiency metrics (turns, tokens, cost)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.24
# Step 2: Investigate expensive benchmarks
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/v0.3.24 simple_print
# Step 3: Compare Python vs AILANG
./tools/compare_agents.sh eval_results/baselines/v0.3.24
See resources/agent_optimization_guide.md for complete optimization strategies.
Invoke this skill when:
- An eval baseline has just been run and needs analysis
- The user asks to analyze eval results or check benchmarks
- The user wants to investigate failures or understand what's failing
All commands work on baseline directories like eval_results/baselines/v0.3.16/.
eval-matrix: Shows comprehensive statistics with model/language breakdowns.
ailang eval-matrix eval_results/baselines/v0.3.16 0.3.16 | head -60
Shows: Overall stats, per-model performance, per-language breakdown, top error codes.
eval-analyze: Categorizes failures and can generate design docs for issues.
# Dry run (no design docs, just analysis)
ailang eval-analyze -results eval_results/baselines/v0.3.16 -dry-run
# Full analysis with design doc generation
ailang eval-analyze -results eval_results/baselines/v0.3.16
⚠️ CRITICAL: Must use the -results flag, NOT a positional argument!
Output: Categorized failures (compile_error, logic_error, runtime_error) with frequency, affected benchmarks, models, and sample errors.
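Once summary.jsonl exists (see eval-summary below), the same categories can be cross-checked with a quick jq tally; this is a sketch using the fields the queries in this document rely on:
# Count failing AILANG runs per error category
jq -s 'map(select(.lang == "ailang" and .stdout_ok != true)) |
  group_by(.error_category) |
  map({category: .[0].error_category, count: length}) |
  sort_by(-.count)' \
  eval_results/baselines/v0.3.16/summary.jsonl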
eval-summary: Generates JSONL for easy querying with jq.
ailang eval-summary eval_results/baselines/v0.3.16
Output: eval_results/baselines/v0.3.16/summary.jsonl
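To see which fields each record carries (the queries below use .benchmark, .model, .lang, .stdout_ok, and .error_category), pretty-print a single line:
head -1 eval_results/baselines/v0.3.16/summary.jsonl | jq .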
eval-compare: Shows what changed between two versions.
ailang eval-compare eval_results/baselines/v0.3.15 eval_results/baselines/v0.3.16
fair_comparison.py: Use this for accurate version comparisons! The eval-compare command may include duplicates or different model sets; this script normalizes the data for an apples-to-apples comparison.
.claude/skills/eval-analyzer/scripts/fair_comparison.py
What it does:
- Deduplicates results and restricts both versions to the same benchmark and model set
- Computes the success rate for each version and the delta between them
- Lists which benchmarks were fixed and which broke
Output:
v0.4.0: 56/123 = 45.5%
v0.4.2: 59/123 = 48.0%
Delta: +3 (+2.4pp)
✅ Fixed: 11 benchmarks
❌ Broken: 8 benchmarks
NET: +3 benchmarks
When to use: Before making decisions based on eval results (e.g., reverting changes, merging PRs).
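A sketch of a typical invocation, assuming the script takes the two baseline directories being compared (the argument order is an assumption; check the script itself if unsure):
.claude/skills/eval-analyzer/scripts/fair_comparison.py \
  eval_results/baselines/v0.4.0 \
  eval_results/baselines/v0.4.2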
validate_eval_results.py: Checks for output corruption and race conditions in eval results.
python3 tools/validate_eval_results.py eval_results/baselines/v0.4.2
Checks:
- Output corruption in result files
- Race conditions between concurrent runs
When to use: After running eval baselines, especially if results look suspicious.
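To sanity-check several baselines in one pass, a small wrapper loop works (the glob is illustrative):
for dir in eval_results/baselines/v0.4.*; do
  echo "== $dir"
  python3 tools/validate_eval_results.py "$dir"
done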
For agent-based evaluation results (Python vs AILANG comparisons with Claude Code):
agent_kpis.sh: Shows efficiency metrics for agent runs - key for optimizing the language and prompts.
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/WITH_ALL_FIXES
Output:
Goal: Minimize agent turns and tokens → indicates clearer prompts and simpler language.
agent_transcripts.sh: View full agent conversation logs to understand what happened.
# View all transcripts
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES
# View only failures
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES --failed-only
# View specific benchmark
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES fizzbuzz
Output:
Use for: Understanding why AILANG solutions fail or take many turns.
Use the existing tools/compare_agents.sh script for side-by-side comparison:
./tools/compare_agents.sh eval_results/WITH_ALL_FIXES
Output:
# Show overall statistics
ailang eval-matrix eval_results/baselines/v0.3.16 0.3.16 | head -60
Look for:
# Categorize all failures
ailang eval-analyze -results eval_results/baselines/v0.3.16 -dry-run
Key metrics:
Use jq queries on summary.jsonl for custom analysis:
# Ensure summary exists
ailang eval-summary eval_results/baselines/v0.3.20
# AILANG-only success rate (all models)
jq -s 'map(select(.lang == "ailang")) |
{total: length, success: (map(select(.stdout_ok == true)) | length),
rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' \
eval_results/baselines/v0.3.20/summary.jsonl
# Dev models only (useful for prompt testing)
jq -s 'map(select(.lang == "ailang" and
(.model == "gpt5-mini" or .model == "claude-haiku-4-5" or .model == "gemini-2-5-flash"))) |
{total: length, success: (map(select(.stdout_ok == true)) | length),
rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' \
eval_results/baselines/v0.3.20/summary.jsonl
# Check specific benchmark across all models
jq -s 'map(select(.benchmark == "explicit_state_threading" and .lang == "ailang")) |
map({model, success: .stdout_ok, error: .error_category})' \
eval_results/baselines/v0.3.20/summary.jsonl
# Compare two versions (dev models, AILANG-only) - run the same query on each file
for f in eval_results/baselines/v0.3.20/summary.jsonl \
         eval_results/baselines/v0.3.21/summary.jsonl; do
  echo "$f"
  jq -s 'map(select(.lang == "ailang" and
    (.model == "gpt5-mini" or .model == "claude-haiku-4-5" or .model == "gemini-2-5-flash"))) |
    {total: length, success: (map(select(.stdout_ok == true)) | length),
     rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' "$f"
done
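Another pattern that comes up often is ranking benchmarks by how many AILANG model runs fail on them:
# Top 10 benchmarks by failure count (AILANG only)
jq -s 'map(select(.lang == "ailang" and .stdout_ok != true)) |
  group_by(.benchmark) |
  map({benchmark: .[0].benchmark, failures: length}) |
  sort_by(-.failures) | .[:10]' \
  eval_results/baselines/v0.3.20/summary.jsonl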
For more jq patterns, see resources/jq_queries.md
Use the provided helper scripts for detailed code inspection:
# Failure analysis with error categorization
.claude/skills/eval-analyzer/scripts/analyze_failures.sh eval_results/baselines/v0.3.16
# Model performance comparison
.claude/skills/eval-analyzer/scripts/compare_models.sh eval_results/baselines/v0.3.16
# Examine specific benchmark failures
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json
# Show regressions and improvements
ailang eval-compare eval_results/baselines/v0.3.15 eval_results/baselines/v0.3.16
Based on the data, identify:
Symptom: eval-analyze only finds 6 results
Cause: Used positional argument instead of -results flag
Solution:
# ❌ WRONG
ailang eval-analyze eval_results/baselines/v0.3.16
# ✅ CORRECT
ailang eval-analyze -results eval_results/baselines/v0.3.16
Symptom: jq queries fail with "file not found"
Cause: Need to run eval-summary first
Solution:
ailang eval-summary eval_results/baselines/v0.3.16
Symptom: eval-analyze shows issues but doesn't create docs
Cause: Using -dry-run flag
Solution: Run without -dry-run to generate design docs
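For example, to re-run the same analysis with design doc generation enabled:
ailang eval-analyze -results eval_results/baselines/v0.3.16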
The skill includes helper scripts in the scripts/ directory:
quick_summary.sh: Fast overview using eval-matrix.
.claude/skills/eval-analyzer/scripts/quick_summary.sh eval_results/baselines/v0.3.16
Output: Overall stats, model performance, language breakdown, top error codes.
analyze_failures.sh: Detailed failure analysis with error categorization.
.claude/skills/eval-analyzer/scripts/analyze_failures.sh eval_results/baselines/v0.3.16 ailang
Output: Overall statistics, error categories, top failing benchmarks, model performance, error codes.
compare_models.sh: Model-by-model performance comparison.
.claude/skills/eval-analyzer/scripts/compare_models.sh eval_results/baselines/v0.3.16
Output: Success rates, first-attempt vs final, cost analysis, token usage, best model per benchmark.
examine_code.sh: Inspect generated code from specific benchmarks.
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json gpt5
Output: Generated code, compiler errors, success status, error codes for each model run.
examine_prompts.sh: View the prompts used for specific benchmarks.
.claude/skills/eval-analyzer/scripts/examine_prompts.sh eval_results/baselines/v0.3.16 api_call_json
Output: System prompt, user prompt, success status for benchmark runs.
verify_prompt_accuracy.sh: Check if prompt documentation matches the actual implementation.
.claude/skills/eval-analyzer/scripts/verify_prompt_accuracy.sh v0.3.16
Output: Reports false limitations, undocumented features, and prompt-code mismatches.
Use this: After creating new prompt versions to catch documentation bugs!
resources/failure_analysis_v0.3.16.md - Comprehensive analysis of v0.3.16 eval results with root cause analysis
See resources/jq_queries.md for more query examples and patterns.
This skill loads information progressively:
- ailang eval-* commands and helper scripts first
- resources/jq_queries.md and analysis documents as needed
- eval-analyze generates design docs using an LLM (default: gpt5); use -dry-run to preview before generating design docs
- Baselines live in eval_results/baselines/vX.X.X/ and are produced by the post-release skill (which runs baselines)