---
name: llm-benchmark
description: |
  Retrieve and compare LLM performance metrics (latency, throughput, pricing,
  intelligence scores) using the Artificial Analysis API. Use this skill
  whenever the user wants to:
  - Compare two or more LLMs on speed, cost, or quality metrics
  - Find the fastest, cheapest, or smartest model for a given use case
  - Get current benchmark scores (MMLU, GPQA, coding, math) for a model or set of models
  - Analyze trade-offs between latency, throughput, and pricing across AI providers
  - Answer questions like "which model has the lowest TTFT?", "what's the cheapest model above X intelligence score?", "compare GPT-4o vs Claude vs Gemini on speed and cost"

  Trigger any time the user mentions model comparison, LLM benchmarks, token speed, time to first token, or asks which model to pick. Requires a valid Artificial Analysis API key (set as env var ARTIFICIAL_ANALYSIS_API_KEY or provided inline).
---
# LLM Benchmark Skill

This skill fetches live LLM performance data from the Artificial Analysis API and presents it as a clean, filterable comparison. Data attribution: artificialanalysis.ai
## Prerequisites

The user must have:

- An account at artificialanalysis.ai
- An API key from their Insights Platform dashboard
- The key available as the environment variable `ARTIFICIAL_ANALYSIS_API_KEY`, or ready to provide inline

If the key is not set, ask the user to provide it before proceeding.
## Workflow

### Step 1 – Understand the user's goal

Before fetching data, clarify what the user wants to compare:

- **Which models?** Specific models by name, a provider (e.g., "all OpenAI models"), or "all available"
- **Which metrics?** Latency (TTFT), throughput (tokens/sec), pricing, intelligence index, coding, math, or a specific benchmark
- **What decision are they making?** (e.g., "I want the fastest model under $5/M tokens") — this helps prioritize the output
If the user's request is already clear enough, proceed without asking and state your assumptions in the output.
### Step 2 – Run the fetch+compare script

Use the bundled script to fetch data and produce output. Always capture both stdout and stderr (`2>&1`) so error messages are visible:

```bash
python /path/to/skill/scripts/fetch_metrics.py \
  --api-key "USER_KEY_HERE" \
  [--models "model-slug-1,model-slug-2"] \
  [--providers "openai,anthropic,google"] \
  [--metric latency|throughput|pricing|intelligence|coding|math|all] \
  [--top N] \
  [--filter-min-intelligence 20] \
  [--output table|json|csv] \
  2>&1
```
Replace `/path/to/skill/scripts/fetch_metrics.py` with the actual path to the script (it lives in `scripts/`, in the same directory as this SKILL.md).
If the script exits with an error, the output will begin with `ERROR:` and explain what went wrong. Stop immediately and show that message to the user — do not try to fabricate or estimate data. Common errors:

- `ERROR: Invalid API key` — ask the user to double-check their key
- `ERROR: API rate limit exceeded` — tell the user their daily quota is exhausted (1,000 req/day on the free tier) and that it resets after 24 hours; suggest they check https://artificialanalysis.ai for their quota status or a plan upgrade
- `ERROR: Network error` — likely a connectivity issue; ask the user to retry
**Comparing a specific list of models** — use `--models` with a comma-separated list of slugs:

```bash
python .../fetch_metrics.py --api-key "..." \
  --models "gpt-4o,claude-4-5-sonnet,gemini-2-0-flash" \
  --metric all
```

Slugs are the URL-safe identifiers (e.g. `gpt-4o`, `claude-4-5-sonnet`, `gemini-2-0-flash`). The script also accepts partial name matches, so "claude sonnet" will find the right model. To discover the exact slug for any model, run:

```bash
python .../fetch_metrics.py --api-key "..." --list-models
```

If `--models` and `--providers` are both omitted, all available models are fetched.
### Step 3 – Present results

After running the script, present the output in a clear, readable way:

- **Table format (default)**: render as a markdown table grouped by provider
- **Highlight winners**: call out the top model per metric with a brief summary
- **Explain the metrics**: briefly explain what each metric means if the user seems unfamiliar
  - **Median TTFT**: how long until the first token appears (lower = more responsive)
  - **Output tokens/sec**: generation speed after the first token (higher = faster streaming)
  - **Input/output price**: cost per million tokens
  - **Intelligence Index**: Artificial Analysis composite score across multiple benchmarks

Always end with a "💡 Recommendation" section based on the user's stated goal, and always include the attribution line:

> Data source: artificialanalysis.ai — fetched live
## Output Format

### Standard Comparison Table

```markdown
## LLM Benchmark Comparison
*Fetched live from Artificial Analysis*

| Model | Provider | TTFT (s) | Tokens/s | Price In ($/M) | Price Out ($/M) | Intelligence |
|-------|----------|----------|----------|----------------|-----------------|--------------|
| ...   | ...      | ...      | ...      | ...            | ...             | ...          |

💡 **Recommendation**: [1–2 sentence summary based on the user's goal]

Data source: artificialanalysis.ai
```
### Top-N Format (when the user asks "what's the fastest/cheapest/best")

Show only the top N models for the requested metric, sorted by that metric, with a brief explanation of why each ranks where it does.
## Error Handling

- **Missing API key**: ask the user to provide it; show instructions for setting `ARTIFICIAL_ANALYSIS_API_KEY`
- **API rate limit hit**: inform the user (free tier: 1,000 req/day); suggest waiting or checking their quota
- **Model not found**: list similar model slugs that were found and let the user pick
- **Empty results after filtering**: relax the filter and show what's available, explaining the trade-off
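For the model-not-found case, the "similar slugs" suggestion can be sketched with the standard library's `difflib`. This is a hypothetical helper (`suggest_slugs` and the sample slug list are not from the bundled script, which may match names differently):

```python
# Hypothetical sketch: suggest close slug matches when a requested
# model is not found. Uses difflib from the standard library; the real
# fetch_metrics.py may implement partial-name matching differently.
import difflib


def suggest_slugs(query: str, known_slugs: list[str], n: int = 3) -> list[str]:
    """Return up to n known slugs most similar to the normalized query."""
    # Normalize "claude sonnet" -> "claude-sonnet" to resemble slug form.
    normalized = query.lower().replace(" ", "-")
    return difflib.get_close_matches(normalized, known_slugs, n=n, cutoff=0.4)


slugs = ["gpt-4o", "claude-4-5-sonnet", "gemini-2-0-flash"]
print(suggest_slugs("claude sonet", slugs))
```

With this sample list, even the typo'd query "claude sonet" resolves to `claude-4-5-sonnet`, which can then be offered back to the user as a pick list.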
## Notes on the API

- **Base URL**: `https://artificialanalysis.ai/api/v2`
- **Key header**: `x-api-key: <your-key>`
- **Main endpoint**: `GET /data/llms/models` — returns all 470+ LLM models in one call
- **Response structure**: `evaluations` and `pricing` are nested objects; the script flattens them automatically
- **Intelligence index scale**: the real API uses a score roughly in the 0–55 range for current models (not 0–100). Adjust `--filter-min-intelligence` accordingly — e.g., `--filter-min-intelligence 20` for above-average models, `--filter-min-intelligence 30` for top-tier.
- **Benchmark scores**: raw API values are 0–1 (e.g., `gpqa: 0.543` = 54.3%); the script converts them to percentages automatically
- **Missing speed data**: some models show TTFT=0 and Tokens/s=0 — this means Artificial Analysis hasn't measured that model in a real-time benchmark run yet (common for newer or less popular models)
- **Fields used by this skill**:
  - `name`, `slug`, `model_creator.name` — identity
  - `median_time_to_first_token_seconds` — TTFT latency (top-level)
  - `median_output_tokens_per_second` — throughput (top-level)
  - `pricing.price_1m_input_tokens`, `pricing.price_1m_output_tokens`, `pricing.price_1m_blended_3_to_1` — pricing
  - `evaluations.artificial_analysis_intelligence_index` — overall quality score
  - `evaluations.artificial_analysis_coding_index`, `evaluations.artificial_analysis_math_index` — domain scores
  - `evaluations.mmlu_pro`, `evaluations.gpqa`, `evaluations.livecodebench`, `evaluations.math_500`, `evaluations.aime` — specific benchmarks
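The notes above can be sketched as a minimal direct call to the endpoint, for understanding what the bundled script does under the hood. This is illustrative only: the top-level `data` envelope and the exact flattening are assumptions based on the field list above, and real error handling is omitted.

```python
# Illustrative sketch of the raw API call and the flattening the bundled
# script performs. Assumes the response wraps the model list in a top-level
# "data" key; field names follow the "Fields used by this skill" list.
import json
import urllib.request

API_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"


def fetch_models(api_key: str) -> list[dict]:
    """Fetch all models in one call, authenticating via the x-api-key header."""
    req = urllib.request.Request(API_URL, headers={"x-api-key": api_key})
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return payload.get("data", [])


def flatten(model: dict) -> dict:
    """Pull nested evaluations/pricing fields up into one flat row."""
    evals = model.get("evaluations") or {}
    pricing = model.get("pricing") or {}
    return {
        "name": model.get("name"),
        "slug": model.get("slug"),
        "provider": (model.get("model_creator") or {}).get("name"),
        "ttft_s": model.get("median_time_to_first_token_seconds"),
        "tokens_per_s": model.get("median_output_tokens_per_second"),
        "price_in": pricing.get("price_1m_input_tokens"),
        "price_out": pricing.get("price_1m_output_tokens"),
        "intelligence": evals.get("artificial_analysis_intelligence_index"),
        # Raw benchmark scores are 0-1; convert to percent for display.
        "gpqa_pct": (evals.get("gpqa") or 0) * 100,
    }
```

A caller would do `rows = [flatten(m) for m in fetch_models(key)]` and then sort or filter the flat rows by any metric.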