Run any Skill in Manus with one click

$pwd:

evaluate

Name: Evaluate
Author: damionrashford

// Systematic evaluation of ML models, experiments, and AI system outputs. Multi-dimensional rubrics, LLM-as-judge, bias detection, and structured comparison frameworks. Use when the user asks to "evaluate model performance", "compare models", "build evaluation rubrics", "assess output quality", "detect model bias", or mentions evaluation frameworks, LLM-as-judge, model comparison, or quality assessment.

Run Skill in Manus

$ git log --oneline --stat

stars:2

forks:0

updated:April 9, 2026 at 01:21

File Explorer

3 files

SKILL.md

readonly

related-skills.json

same repository

claude-code.md

from "damionrashford/mlx"

Comprehensive Claude Code knowledge base — plugins, hooks, skills, agents, MCP, channels, headless mode, permissions, settings, and all extensibility features. Use when building, configuring, debugging, or extending Claude Code.

2026-04-092

skill-creator.md

from "damionrashford/mlx"

Use when building a skill, creating a SKILL.md, packaging a workflow, making a slash command, or asked "how do I make a skill". Scaffolds the folder, generates SKILL.md from a template, validates against spec. Produces a complete ready-to-deploy skill folder: scripts, references, assets. Also use to review or improve an existing skill.

2026-04-092

analyze.md

from "damionrashford/mlx"

Statistical analysis, hypothesis testing, A/B testing, cohort analysis, segmentation, trend detection, business metrics, pre-delivery validation, and data visualization. Use when the user asks to "analyze this data", "run a statistical test", "compare groups", "find trends", "do A/B test analysis", "segment customers", "calculate KPIs", "validate this analysis", "check my work", "sanity check", "review my numbers", "make a chart", "create a dashboard", "plot the data", "visualize results", or mentions hypothesis testing, cohort analysis, business analytics, data validation, bar charts, line charts, heatmaps, scatter plots, or data storytelling.

2026-04-092

autoexperiment.md

from "damionrashford/mlx"

Autonomous time-budget experiment loop. Modify a training script, train for a fixed wall-clock budget, evaluate, record, repeat. Inspired by karpathy/autoresearch. Use for overnight architecture search, systematic hyperparameter sweeps, or any iterative model improvement workflow.

2026-04-092

context-engineering.md

from "damionrashford/mlx"

Context engineering for building production LLM applications: context window management, degradation patterns, optimization strategies, memory system selection, multi-agent architecture, filesystem context patterns, and tool design principles. Use when building LLM apps, RAG pipelines, AI agents, multi-agent systems, or when designing memory, tool APIs, or context strategies for any language model application.

2026-04-092

data-prep.md

from "damionrashford/mlx"

Explore, clean, and engineer datasets end-to-end: statistical profiling, distribution checks, missing value analysis, duplicate detection, outlier removal, type fixing, encoding, create features, encode categories, transform columns, add rolling windows, build interaction terms, and feature engineering. Supports pandas, polars, and PySpark. Use when the user wants to explore data, profile columns, understand a dataset, clean data, handle missing values, remove duplicates, fix data types, preprocess a dataset before modeling, create features, encode categories, transform columns, add rolling windows, build interaction terms, or do feature engineering.

2026-04-092

package.json

"author": "damionrashford"

"repository": "damionrashford/mlx"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Data ScientistsComputer and Mathematical Occupations15-2051L4

name	evaluate
description	Systematic evaluation of ML models, experiments, and AI system outputs. Multi-dimensional rubrics, LLM-as-judge, bias detection, and structured comparison frameworks. Use when the user asks to "evaluate model performance", "compare models", "build evaluation rubrics", "assess output quality", "detect model bias", or mentions evaluation frameworks, LLM-as-judge, model comparison, or quality assessment.
allowed-tools	Bash, Read, Write, Glob, Grep
argument-hint	path to predictions, results.tsv, or model outputs (e.g. "results.tsv" or "predictions.csv")
model	sonnet
effort	medium
paths	/results.tsv,/predictions.csv,/metrics.json
compatibility	>=1.0
metadata	{"category":"model-evaluation","tags":["evaluation","llm-as-judge","bias-detection","metrics","model-comparison","rubrics"],"phase":"evaluate"}

Model & System Evaluation

Frameworks for systematic evaluation of ML models, experiment results, and AI system outputs. Covers traditional ML metrics, LLM-as-judge patterns, bias detection, and structured comparison.

When to use

Comparing multiple trained models beyond val_score
Evaluating LLM/AI application outputs (RAG quality, prompt effectiveness)
Building evaluation rubrics for subjective tasks
Detecting bias in model predictions
Creating test sets for systematic assessment
Deciding between experiment results in results.tsv

Traditional ML evaluation

Classification metrics

Metric	When to use	Formula
Accuracy	Balanced classes	correct / total
Precision	False positives costly	TP / (TP + FP)
Recall	False negatives costly	TP / (TP + FN)
F1	Imbalanced classes	2 * P * R / (P + R)
AUC-ROC	Ranking / threshold selection	Area under ROC curve
Log loss	Probability calibration	-mean(y*log(p))

Regression metrics

Metric	When to use	Sensitivity
RMSE	Penalize large errors	Outlier sensitive
MAE	Robust to outliers	Linear penalty
R-squared	Variance explained	Scale independent
MAPE	Percentage error	Zero sensitive

Beyond single metrics

from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Multi-dimensional evaluation
report = classification_report(y_true, y_pred, output_dict=True)

# Per-class performance (find weak spots)
cm = confusion_matrix(y_true, y_pred)
per_class_acc = cm.diagonal() / cm.sum(axis=1)

# Calibration (are probabilities reliable?)
from sklearn.calibration import calibration_curve
fraction_pos, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)

# Fairness across subgroups
for group in ['group_a', 'group_b']:
    mask = df['subgroup'] == group
    group_score = metric(y_true[mask], y_pred[mask])
    print(f"{group}: {group_score:.4f}")

Experiment comparison framework

When comparing experiments in results.tsv:

Multi-dimensional comparison table

| Experiment | val_score | test_score | memory_mb | train_time | complexity |
|------------|-----------|------------|-----------|------------|------------|
| exp000     | 0.8523    | 0.8401     | 4096      | 2m         | low        |
| exp003     | 0.8634    | 0.8521     | 4352      | 5m         | medium     |
| exp007     | 0.8641    | 0.8530     | 8192      | 45m        | high       |

Decision criteria (beyond raw score)

Statistical significance: Is the improvement real or noise?
- Bootstrap confidence intervals (1000 resamples)
- Paired t-test or Wilcoxon for matched samples
- If delta < 2x std of CV folds, it's likely noise
Efficiency trade-off: Score per resource unit
- Score improvement / memory increase
- Score improvement / training time increase
- Diminishing returns detection
Robustness: Does it hold across conditions?
- Cross-validation variance (low = robust)
- Performance on minority classes
- Sensitivity to random seed
Deployability: Can it run in production?
- Inference latency (ms per prediction)
- Memory footprint at serving time
- Dependency complexity

LLM-as-judge evaluation

For evaluating LLM application outputs (RAG, chatbots, agents):

Direct scoring

EVAL_PROMPT = """Rate the following AI response on a scale of 1-5 for each dimension.

**Task**: {task_description}
**Input**: {user_input}
**Response**: {model_output}
**Reference** (if available): {reference}

Rate each dimension:
- Relevance (1-5): Does the response address the question?
- Accuracy (1-5): Are the facts correct?
- Completeness (1-5): Does it cover all aspects?
- Clarity (1-5): Is it well-organized and clear?
- Helpfulness (1-5): Would this actually help the user?

Output as JSON:
{{"relevance": X, "accuracy": X, "completeness": X, "clarity": X, "helpfulness": X, "reasoning": "..."}}
"""

Pairwise comparison

More reliable than direct scoring for subjective quality:

PAIRWISE_PROMPT = """Which response better answers the question?

**Question**: {question}

**Response A**: {response_a}
**Response B**: {response_b}

Choose: A is better / B is better / Tie
Explain your reasoning in 2-3 sentences.
"""

Bias mitigation: Run twice with A/B swapped. If results disagree, mark as tie.

Known biases in LLM judges

Bias	Description	Mitigation
Position	Prefers first response	Swap positions, average
Length	Prefers longer responses	Normalize by length
Self-enhancement	Prefers own model's style	Use different judge model
Verbosity	Equates detail with quality	Explicit rubric criteria
Authority	Prefers confident tone	Focus on factual accuracy

Evaluation taxonomy: when to use each approach

Approach	Best for	Limitation
Direct scoring	Objective criteria — factual accuracy, instruction following, toxicity	Score calibration drift, inconsistent scale interpretation
Pairwise comparison	Subjective preferences — tone, style, persuasiveness, overall quality	Position bias, length bias
Rubric-based	Multi-dimensional quality with defined criteria	Requires upfront rubric design

Research (MT-Bench, Zheng et al. 2023): pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation. Use direct scoring for objective criteria with clear ground truth; use pairwise for subjective quality comparisons.

Metric selection by evaluation task

Task type	Primary metrics	Secondary
Binary pass/fail	Recall, Precision, F1	Cohen's κ
Ordinal scale (1-5)	Spearman's ρ, Kendall's τ	Cohen's κ (weighted)
Pairwise preference	Agreement rate, position consistency	Confidence calibration
Multi-label	Macro-F1, Micro-F1	Per-label precision/recall

Production evaluation pipeline

def evaluate_with_bias_mitigation(question, response_a, response_b, judge_model):
    # Forward pass
    result_1 = judge_model(PAIRWISE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    ))
    # Swapped pass (position bias mitigation)
    result_2 = judge_model(PAIRWISE_PROMPT.format(
        question=question, response_a=response_b, response_b=response_a
    ))
    # Normalize result_2 (A/B labels are swapped)
    if result_1 == result_2:
        return result_1  # Consistent — high confidence
    else:
        return "tie"  # Inconsistent — call it a tie

def build_rubric(criterion, weight, levels):
    """
    criterion: "Factual Accuracy"
    weight: 0.40
    levels: {5: "All claims verified", 3: "Mostly accurate", 1: "Multiple errors"}
    """
    return {"criterion": criterion, "weight": weight, "levels": levels}

Performance driver — token budget matters

Research (BrowseComp benchmark) shows three factors explain 95% of agent performance variance:

Token usage: 80% of variance — more context/turns = better performance
Number of tool calls: ~10%
Model choice: ~5%

Implication: evaluate with realistic token budgets, not unlimited resources. Upgrading from an older model to Claude Sonnet 4.5 or GPT-5.2 provides larger gains than doubling token budget on the same model.

Test set design

Complexity stratification

Build test sets covering multiple difficulty levels:

test_set = {
    "simple": [
        # Single-step, clear answer, common patterns
    ],
    "medium": [
        # Multi-step, some ambiguity, less common patterns
    ],
    "complex": [
        # Many steps, significant ambiguity, edge cases
    ],
    "adversarial": [
        # Deliberately tricky, boundary conditions, known failure modes
    ]
}

Coverage checklist

Happy path (typical inputs)
Edge cases (boundary values, empty inputs, extreme values)
Subgroup fairness (performance across demographics/categories)
Out-of-distribution (inputs unlike training data)
Adversarial (inputs designed to fool the model)
Temporal (does performance change with data from different time periods?)

Evaluation report format

=== Evaluation Report ===
Task: [classification/regression/generation/retrieval]
Models compared: [list]
Test set: [size, composition]

Best model: [name]
Primary metric: [metric] = [value] (95% CI: [low, high])

Multi-dimensional comparison:
| Dimension       | Model A | Model B | Winner |
|-----------------|---------|---------|--------|
| Primary metric  |         |         |        |
| Inference speed |         |         |        |
| Memory usage    |         |         |        |
| Robustness      |         |         |        |
| Fairness        |         |         |        |

Recommendation: [which model and why]
Caveats: [limitations of this evaluation]

Rules

Never evaluate on training data
Use multiple metrics, not just one number
Report confidence intervals, not just point estimates
Check subgroup performance, not just aggregate
Test set is touched ONCE for final evaluation
Document evaluation methodology alongside results
Acknowledge what the evaluation does NOT cover

evaluate

More from this repository

More from this repository

Model & System Evaluation

When to use

Traditional ML evaluation

Classification metrics

Regression metrics

Beyond single metrics

Experiment comparison framework

Multi-dimensional comparison table

Decision criteria (beyond raw score)

LLM-as-judge evaluation

Direct scoring

Pairwise comparison

Known biases in LLM judges

Evaluation taxonomy: when to use each approach

Metric selection by evaluation task

Production evaluation pipeline

Performance driver — token budget matters

Test set design

Complexity stratification

Coverage checklist

Evaluation report format

Rules

Model & System Evaluation

When to use

Traditional ML evaluation

Classification metrics

Regression metrics

Beyond single metrics

Experiment comparison framework

Multi-dimensional comparison table

Decision criteria (beyond raw score)

LLM-as-judge evaluation

Direct scoring

Pairwise comparison

Known biases in LLM judges

Evaluation taxonomy: when to use each approach

Metric selection by evaluation task

Production evaluation pipeline

Performance driver — token budget matters

Test set design

Complexity stratification

Coverage checklist

Evaluation report format

Rules