name	rank-llm-eval
description	Use when analyzing rank_llm evaluation outputs across runs or models. Covers aggregated trec_eval JSONL files, response-analysis metrics, retrieval-cache handoff files, and side-by-side comparison of stored evaluation artifacts.

rank_llm Eval

Analyze and compare RankLLM evaluation outputs across models, prompt templates, or retrieval settings.

When to Use

After rank-llm evaluate
After rank-llm analyze
When comparing two rerank runs or model variants
When checking whether a retrieval-cache or rerank artifact is ready for downstream evaluation

What It Does

Aggregated Metric Comparison

Load two trec_eval_aggregated_results_<model>.jsonl files
Compare per-file metric values side by side
Report deltas for metrics such as ndcg_cut_10, map_cut_100, or recall-style scores

Response Analysis Interpretation

Read analyze summaries and error counts
Flag mixed valid and malformed outputs when partial_success appears
Separate response-format problems from ranking-quality problems

Retrieval Handoff Checks

Confirm that a run has the expected rerank JSONL or TREC artifacts before aggregation
Confirm that cached retrieval JSON exists when evaluation depends on precomputed retrieval results

Usage

Compare two evaluation outputs:

python3 .claude/skills/rank-llm-eval/scripts/compare.py \
  --run-a trec_eval_aggregated_results_a.jsonl \
  --run-b trec_eval_aggregated_results_b.jsonl

Or use the CLI directly:

rank-llm evaluate --model-name castorini/rank_zephyr_7b_v1_full
rank-llm analyze --files invocations.json --verbose
rank-llm view rerank_results.jsonl

Reference Files

references/metrics.md - Metric names, output files, and interpretation guidance

Comparison Script

See scripts/compare.py for the side-by-side comparison tool.

Gotchas

evaluate aggregates stored rerank outputs. It does not execute reranking by itself.
Comparing two aggregated evaluation files only makes sense when both runs cover the same datasets or TREC output files.
analyze focuses on response and parsing quality, not ranking quality. A clean analyze result does not imply good ndcg_cut_10.
partial_success means some responses were usable and some were malformed. Treat that as a data-quality issue before reading too much into the metric deltas.

name	rank-llm-eval
description	Use when analyzing rank_llm evaluation outputs across runs or models. Covers aggregated trec_eval JSONL files, response-analysis metrics, retrieval-cache handoff files, and side-by-side comparison of stored evaluation artifacts.

rank_llm Eval

Analyze and compare RankLLM evaluation outputs across models, prompt templates, or retrieval settings.

When to Use

After rank-llm evaluate
After rank-llm analyze
When comparing two rerank runs or model variants
When checking whether a retrieval-cache or rerank artifact is ready for downstream evaluation

What It Does

Aggregated Metric Comparison

Load two trec_eval_aggregated_results_<model>.jsonl files
Compare per-file metric values side by side
Report deltas for metrics such as ndcg_cut_10, map_cut_100, or recall-style scores

Response Analysis Interpretation

Read analyze summaries and error counts
Flag mixed valid and malformed outputs when partial_success appears
Separate response-format problems from ranking-quality problems

Retrieval Handoff Checks

Confirm that a run has the expected rerank JSONL or TREC artifacts before aggregation
Confirm that cached retrieval JSON exists when evaluation depends on precomputed retrieval results

Usage

Compare two evaluation outputs:

python3 .claude/skills/rank-llm-eval/scripts/compare.py \
  --run-a trec_eval_aggregated_results_a.jsonl \
  --run-b trec_eval_aggregated_results_b.jsonl

Or use the CLI directly:

rank-llm evaluate --model-name castorini/rank_zephyr_7b_v1_full
rank-llm analyze --files invocations.json --verbose
rank-llm view rerank_results.jsonl

Reference Files

references/metrics.md - Metric names, output files, and interpretation guidance

Comparison Script

See scripts/compare.py for the side-by-side comparison tool.

Gotchas

evaluate aggregates stored rerank outputs. It does not execute reranking by itself.
Comparing two aggregated evaluation files only makes sense when both runs cover the same datasets or TREC output files.
analyze focuses on response and parsing quality, not ranking quality. A clean analyze result does not imply good ndcg_cut_10.
partial_success means some responses were usable and some were malformed. Treat that as a data-quality issue before reading too much into the metric deltas.

rank-llm-eval

rank_llm Eval

When to Use

What It Does

Aggregated Metric Comparison

Response Analysis Interpretation

Retrieval Handoff Checks

Usage

Reference Files

Comparison Script

Gotchas

More from this repository

More from this repository

rank_llm Eval

When to Use

What It Does

Aggregated Metric Comparison

Response Analysis Interpretation

Retrieval Handoff Checks

Usage

Reference Files

Comparison Script

Gotchas