---
name: llm-benchmark
description: |
  Retrieve and compare LLM performance metrics (latency, throughput, pricing,
  intelligence scores) using the Artificial Analysis API. Use this skill
  whenever the user wants to:
  - Compare two or more LLMs on speed, cost, or quality metrics
  - Find the fastest, cheapest, or smartest model for a given use case
  - Get current benchmark scores (MMLU, GPQA, coding, math) for a model or set of models
  - Analyze trade-offs between latency, throughput, and pricing across AI providers
  - Answer questions like "which model has the lowest TTFT?", "what's the cheapest model above X intelligence score?", "compare GPT-4o vs Claude vs Gemini on speed and cost"

  Trigger any time the user mentions model comparison, LLM benchmarks, token speed, time to first token, or asks which model to pick. Requires a valid Artificial Analysis API key (set as env var ARTIFICIAL_ANALYSIS_API_KEY or provided inline).
---
# LLM Benchmark Skill

This skill fetches live LLM performance data from the Artificial Analysis API and presents it as a clean, filterable comparison. Data attribution: artificialanalysis.ai
## Prerequisites

The user must have:

- An account at artificialanalysis.ai
- An API key from their Insights Platform dashboard
- The key available as the environment variable `ARTIFICIAL_ANALYSIS_API_KEY`, or ready to provide inline

If the key is not set, ask the user to provide it before proceeding.
## Workflow

### Step 1 – Understand the user's goal

Before fetching data, clarify what the user wants to compare:

- **Which models?** Specific models by name, a provider (e.g., "all OpenAI models"), or "all available"
- **Which metrics?** Latency (TTFT), throughput (tokens/sec), pricing, intelligence index, coding, math, or a specific benchmark
- **What decision are they making?** (e.g., "I want the fastest model under $5/M tokens") — this helps prioritize the output
If the user's request is already clear enough, proceed without asking and state your assumptions in the output.
### Step 2 – Run the fetch+compare script

Use the bundled script to fetch data and produce output. Always capture both stdout and stderr (`2>&1`) so error messages are visible:

```bash
python /path/to/skill/scripts/fetch_metrics.py \
  --api-key "USER_KEY_HERE" \
  [--models "model-slug-1,model-slug-2"] \
  [--providers "openai,anthropic,google"] \
  [--metric latency|throughput|pricing|intelligence|coding|math|all] \
  [--top N] \
  [--filter-min-intelligence 20] \
  [--output table|json|csv] \
  2>&1
```
Replace `/path/to/skill/scripts/fetch_metrics.py` with the actual path to the script (it lives in `scripts/`, in the same directory as this SKILL.md).
If the script exits with an error, the output will begin with `ERROR:` and explain what went wrong. Stop immediately and show that message to the user — do not try to fabricate or estimate data. Common errors:

- `ERROR: Invalid API key` — ask the user to double-check their key
- `ERROR: API rate limit exceeded` — tell the user their daily quota is exhausted (1,000 req/day on the free tier) and that it resets after 24 hours; suggest they check https://artificialanalysis.ai for their quota status or a plan upgrade
- `ERROR: Network error` — likely a connectivity issue; ask the user to retry
**Comparing a specific list of models** — use `--models` with a comma-separated list of slugs:

```bash
python .../fetch_metrics.py --api-key "..." \
  --models "gpt-4o,claude-4-5-sonnet,gemini-2-0-flash" \
  --metric all
```

Slugs are the URL-safe identifiers (e.g. `gpt-4o`, `claude-4-5-sonnet`, `gemini-2-0-flash`). The script also accepts partial name matches, so "claude sonnet" will find the right model. To discover the exact slug for any model, run:

```bash
python .../fetch_metrics.py --api-key "..." --list-models
```

If `--models` and `--providers` are both omitted, all available models are fetched.
### Step 3 – Present results

After running the script, present the output in a clear, readable way:

- **Table format (default)**: render as a markdown table grouped by provider
- **Highlight winners**: call out the top model per metric with a brief summary
- **Explain the metrics**: briefly explain what each metric means if the user seems unfamiliar
  - **Median TTFT**: how long until the first token appears (lower = more responsive)
  - **Output tokens/sec**: generation speed after the first token (higher = faster streaming)
  - **Input/output price**: cost per million tokens
  - **Intelligence Index**: Artificial Analysis composite score across multiple benchmarks

Always end with a "💡 Recommendation" section based on the user's stated goal, and always include the attribution line:

> Data source: artificialanalysis.ai — fetched live
## Output Format

### Standard Comparison Table

```markdown
## LLM Benchmark Comparison
*Fetched live from Artificial Analysis*

| Model | Provider | TTFT (s) | Tokens/s | Price In ($/M) | Price Out ($/M) | Intelligence |
|-------|----------|----------|----------|----------------|-----------------|--------------|
| ...   | ...      | ...      | ...      | ...            | ...             | ...          |

💡 **Recommendation**: [1–2 sentence summary based on the user's goal]

Data source: artificialanalysis.ai
```
### Top-N Format (when the user asks "what's the fastest/cheapest/best")

Show only the top N models for the requested metric, sorted by that metric, with a brief explanation of why each ranks where it does.
## Error Handling

- **Missing API key**: ask the user to provide it; show instructions for setting `ARTIFICIAL_ANALYSIS_API_KEY`
- **API rate limit hit**: inform the user (free tier: 1,000 req/day); suggest waiting or checking their quota
- **Model not found**: list similar model slugs that were found and let the user pick
- **Empty results after filtering**: relax the filter and show what's available, explaining the trade-off
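For the model-not-found case, the "similar slugs" suggestion can be sketched with the standard library's `difflib`. This is a hypothetical helper (`suggest_slugs` and the sample slug list are not from the bundled script, which may match names differently):

```python
# Hypothetical sketch: suggest close slug matches when a requested
# model is not found. Uses difflib from the standard library; the real
# fetch_metrics.py may implement partial-name matching differently.
import difflib


def suggest_slugs(query: str, known_slugs: list[str], n: int = 3) -> list[str]:
    """Return up to n known slugs most similar to the normalized query."""
    # Normalize "claude sonnet" -> "claude-sonnet" to resemble slug form.
    normalized = query.lower().replace(" ", "-")
    return difflib.get_close_matches(normalized, known_slugs, n=n, cutoff=0.4)


slugs = ["gpt-4o", "claude-4-5-sonnet", "gemini-2-0-flash"]
print(suggest_slugs("claude sonet", slugs))
```

With this sample list, even the typo'd query "claude sonet" resolves to `claude-4-5-sonnet`, which can then be offered back to the user as a pick list.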
## Notes on the API

- **Base URL**: `https://artificialanalysis.ai/api/v2`
- **Key header**: `x-api-key: <your-key>`
- **Main endpoint**: `GET /data/llms/models` — returns all 470+ LLM models in one call
- **Response structure**: `evaluations` and `pricing` are nested objects; the script flattens them automatically
- **Intelligence index scale**: the real API uses a score roughly in the 0–55 range for current models (not 0–100). Adjust `--filter-min-intelligence` accordingly — e.g., `--filter-min-intelligence 20` for above-average models, `--filter-min-intelligence 30` for top-tier.
- **Benchmark scores**: raw API values are 0–1 (e.g., `gpqa: 0.543` = 54.3%); the script converts them to percentages automatically
- **Missing speed data**: some models show TTFT=0 and Tokens/s=0 — this means Artificial Analysis hasn't measured that model in a real-time benchmark run yet (common for newer or less popular models)
- **Fields used by this skill**:
  - `name`, `slug`, `model_creator.name` — identity
  - `median_time_to_first_token_seconds` — TTFT latency (top-level)
  - `median_output_tokens_per_second` — throughput (top-level)
  - `pricing.price_1m_input_tokens`, `pricing.price_1m_output_tokens`, `pricing.price_1m_blended_3_to_1` — pricing
  - `evaluations.artificial_analysis_intelligence_index` — overall quality score
  - `evaluations.artificial_analysis_coding_index`, `evaluations.artificial_analysis_math_index` — domain scores
  - `evaluations.mmlu_pro`, `evaluations.gpqa`, `evaluations.livecodebench`, `evaluations.math_500`, `evaluations.aime` — specific benchmarks
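The notes above can be sketched as a minimal direct call to the endpoint, for understanding what the bundled script does under the hood. This is illustrative only: the top-level `data` envelope and the exact flattening are assumptions based on the field list above, and real error handling is omitted.

```python
# Illustrative sketch of the raw API call and the flattening the bundled
# script performs. Assumes the response wraps the model list in a top-level
# "data" key; field names follow the "Fields used by this skill" list.
import json
import urllib.request

API_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"


def fetch_models(api_key: str) -> list[dict]:
    """Fetch all models in one call, authenticating via the x-api-key header."""
    req = urllib.request.Request(API_URL, headers={"x-api-key": api_key})
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return payload.get("data", [])


def flatten(model: dict) -> dict:
    """Pull nested evaluations/pricing fields up into one flat row."""
    evals = model.get("evaluations") or {}
    pricing = model.get("pricing") or {}
    return {
        "name": model.get("name"),
        "slug": model.get("slug"),
        "provider": (model.get("model_creator") or {}).get("name"),
        "ttft_s": model.get("median_time_to_first_token_seconds"),
        "tokens_per_s": model.get("median_output_tokens_per_second"),
        "price_in": pricing.get("price_1m_input_tokens"),
        "price_out": pricing.get("price_1m_output_tokens"),
        "intelligence": evals.get("artificial_analysis_intelligence_index"),
        # Raw benchmark scores are 0-1; convert to percent for display.
        "gpqa_pct": (evals.get("gpqa") or 0) * 100,
    }
```

A caller would do `rows = [flatten(m) for m in fetch_models(key)]` and then sort or filter the flat rows by any metric.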