umbrela-eval

Name: Umbrela Eval
Author: castorini

Use when analyzing umbrela evaluation results: comparing nDCG@10 scores across backends, interpreting confusion matrices, computing kappa agreement, and comparing modified qrels against human judgments. Use after running evaluate to interpret results.

stars: 56

forks: 8

updated: March 25, 2026, 12:41

SKILL.md

Umbrela Eval

Analyze and compare umbrela evaluation results across backends, models, and configurations.

When to Use

After running umbrela evaluate, to interpret nDCG@10 scores and confusion matrices
When comparing LLM judge backends (e.g., gpt-4o vs gemini-pro)
When comparing prompt types (bing vs basic)
When assessing agreement between LLM and human judgments

What It Does

Backend Comparison

Compare nDCG@10 scores across different backends/models on the same qrel
Show original vs modified nDCG@10 deltas
Identify which backends agree most with human judgments
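
The original-versus-modified delta can be illustrated with a small, self-contained sketch. This is not umbrela's or pyserini's implementation: it assumes linear gains with a log2 rank discount, treats unjudged documents as gain 0, and uses made-up toy qrels and a toy ranking. The function names are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_docids, qrels, k=10):
    """nDCG@k for one query; docids missing from qrels count as gain 0."""
    gains = [qrels.get(d, 0) for d in ranked_docids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy data (hypothetical): d3 is unjudged in the original qrels.
original = {"d1": 3, "d2": 0}
modified = {"d1": 3, "d2": 0, "d3": 2}
run = ["d3", "d1", "d2"]

ndcg_original = ndcg_at_k(run, original)
ndcg_modified = ndcg_at_k(run, modified)
delta = ndcg_modified - ndcg_original
print(f"original={ndcg_original:.4f} modified={ndcg_modified:.4f} delta={delta:+.4f}")
```

In this toy case the score rises only because a previously unjudged document at rank 1 received a label, which is the effect the delta is meant to surface.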

Confusion Matrix Analysis

Load confusion matrix PNGs from conf_matrix/ directory
Report per-category accuracy (0, 1, 2, 3)
Identify systematic biases (e.g., model over-predicts category 2)
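
The PNGs are images, so the numbers behind them have to come from the underlying labels. As a rough sketch, assuming the raw (human, LLM) label pairs are available as a Python list, per-category accuracy and over-prediction could be computed as follows. The helper names and the sample pairs are hypothetical, not part of umbrela.

```python
from collections import Counter

LABELS = (0, 1, 2, 3)

def confusion_matrix(pairs):
    """Rows are human labels, columns are LLM labels."""
    counts = Counter(pairs)
    return [[counts[(h, m)] for m in LABELS] for h in LABELS]

def per_category_accuracy(matrix):
    """Fraction of each human category that the LLM labelled identically."""
    result = {}
    for i, label in enumerate(LABELS):
        total = sum(matrix[i])
        result[label] = matrix[i][i] / total if total else None
    return result

def prediction_bias(matrix):
    """LLM count minus human count per category; positive means over-predicted."""
    return {
        label: sum(row[j] for row in matrix) - sum(matrix[j])
        for j, label in enumerate(LABELS)
    }

# Hypothetical (human, llm) label pairs.
pairs = [(0, 0), (0, 0), (1, 2), (1, 1), (2, 2), (2, 2), (3, 2), (3, 3)]
matrix = confusion_matrix(pairs)
accuracy = per_category_accuracy(matrix)
bias = prediction_bias(matrix)
print(accuracy)
print(bias)
```

A positive bias for a category, as for category 2 in this sample, is the kind of systematic over-prediction mentioned above.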

Agreement Metrics

Cohen's kappa between LLM labels and human labels
Per-category precision, recall, F1
Overall accuracy
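
For reference, these metrics can be sketched in plain Python. The example assumes unweighted Cohen's kappa over two equal-length label lists; whether umbrela uses a weighted variant is not stated here, and the function names and labels below are illustrative only.

```python
from collections import Counter

def cohen_kappa(human, llm):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n
    h_counts, m_counts = Counter(human), Counter(llm)
    categories = set(human) | set(llm)
    expected = sum(h_counts[c] * m_counts[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both sides used one identical label
    return (observed - expected) / (1.0 - expected)

def precision_recall_f1(human, llm, category):
    """Treat human labels as truth and LLM labels as predictions."""
    tp = sum(h == category and m == category for h, m in zip(human, llm))
    predicted = sum(m == category for m in llm)
    actual = sum(h == category for h in human)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels on the 0-3 relevance scale.
human = [0, 0, 1, 1, 2, 2, 3, 3]
llm = [0, 0, 2, 1, 2, 2, 2, 3]

kappa = cohen_kappa(human, llm)
accuracy = sum(h == m for h, m in zip(human, llm)) / len(human)
p2, r2, f2 = precision_recall_f1(human, llm, category=2)
print(f"kappa={kappa:.3f} accuracy={accuracy:.3f} P/R/F1(2)={p2:.2f}/{r2:.2f}/{f2:.2f}")
```

Kappa discounts the agreement expected by chance, so it is usually lower than raw accuracy when a few categories dominate.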

Usage

Compare evaluation results:

python3 .claude/skills/umbrela-eval/scripts/compare.py \
  --qrel dl19-passage \
  --run-a modified_qrels/dl19-passage_gpt-4o_01230_0_1.txt \
  --run-b modified_qrels/dl19-passage_gemini-pro_01230_0_1.txt

Or use the CLI directly:

# Run evaluation
umbrela evaluate --backend gpt --model gpt-4o \
  --qrel dl19-passage --result-file run.trec --output json

# View judgment artifacts
umbrela view judgments.jsonl --records 10

Reference Files

references/qrels.md — Standard qrels, nDCG@10 baselines, and evaluation conventions

Comparison Script

See scripts/compare.py for the side-by-side comparison tool.

Gotchas

evaluate uses cached modified qrels by default. Use --regenerate to force re-judging.
Confusion matrices compare LLM labels to human labels only for pairs where both exist; "holes" (newly judged pairs that have no human label) are not in the matrix.
nDCG@10 differences between original and modified qrels reflect the impact of judging previously-unjudged documents, not just label agreement.
--judge-cat 2,3 means only relevant-category pairs are judged — this changes which holes are filled and affects nDCG differently than judging all categories.
Ensemble results represent majority vote. Check individual backend confusion matrices for per-backend quality.
pyserini is required for evaluation; install it with uv sync --extra pyserini.
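
The distinction between holes and overlapping pairs can be made concrete with a short sketch. It assumes both files use the standard four-column TREC qrels layout (qid, iteration, docid, relevance) and that the modified file also carries LLM labels for pairs that already had human labels; neither assumption is confirmed by this document. The inline data and the parse_qrels helper are hypothetical.

```python
def parse_qrels(text):
    """Parse TREC-format qrels lines: qid, iteration, docid, relevance."""
    qrels = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _, docid, rel = parts
        qrels[(qid, docid)] = int(rel)
    return qrels

# Hypothetical file contents, inlined so the sketch runs on its own.
original_text = """\
q1 0 d1 3
q1 0 d2 0
"""
modified_text = """\
q1 0 d1 2
q1 0 d2 0
q1 0 d3 1
"""

original = parse_qrels(original_text)
modified = parse_qrels(modified_text)

# Holes: pairs judged only in the modified qrels (no human label).
holes = sorted(set(modified) - set(original))
# Overlap: pairs with both labels, the only ones a confusion matrix can use.
overlap = sorted(set(modified) & set(original))
label_pairs = [(original[key], modified[key]) for key in overlap]

print("holes:", holes)
print("(human, llm) pairs:", label_pairs)
```

Holes affect nDCG@10 but never the confusion matrix, so the two views can move independently.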

related-skills.json

Same repository

umbrela-install.md

from "castorini/umbrela"

Set up an umbrela development environment — checks Python 3.11+, installs via uv or pip with cloud extras, and verifies with doctor. Use when someone is onboarding, setting up a fresh clone, or troubleshooting their environment.

2026-03-18 (stars: 56)

umbrela-verify.md

from "castorini/umbrela"

Use when validating umbrela judge outputs — checks label range (0–3), qid/docid completeness, result_status consistency, backend metadata, and JSONL integrity. Wraps `umbrela validate` plus custom assertions. Use after running judge or evaluate to verify output correctness.

2026-03-17 (stars: 56)

umbrela-quickstart.md

from "castorini/umbrela"

Use when working with umbrela CLI commands (judge, evaluate), backend selection (gpt, gemini, hf, os, ensemble), qrel handling (dl19-passage, dl20-passage, etc.), relevance labels (0–3), or introspection (doctor, describe, schema, validate). Covers all entry points, flags, and evaluation workflows.

2026-03-17 (stars: 56)

package.json

"author": "castorini"

"repository": "castorini/umbrela"
