
evaluating-llms-harness

Name: Evaluating LLMs Harness
Author: OpenRaiser

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.


Stars: 1,403

Forks: 100

Updated: May 7, 2026, 02:44

File explorer

5 files

SKILL.md

readonly

related-skills.json

Same repository

peft-fine-tuning.md

from "OpenRaiser/NanoResearch"

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.

2026-05-07 · 1.4k

unsloth.md

from "OpenRaiser/NanoResearch"

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

2026-05-07 · 1.4k

nanoresearch-experiment.md

from "OpenRaiser/NanoResearch"

Generate a Python code skeleton from an experiment blueprint

2026-03-05 · 1.4k

package.json

"author": "OpenRaiser"

"repository": "OpenRaiser/NanoResearch"


Relevant occupation (SOC): Data Scientists, Computer and Mathematical Occupations, 15-2051, L4

name	evaluating-llms-harness
description	Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["Evaluation","LM Evaluation Harness","Benchmarking","MMLU","HumanEval","GSM8K","EleutherAI","Model Quality","Academic Benchmarks","Industry Standard"]
dependencies	["lm-eval","transformers","vllm"]


lm-evaluation-harness - LLM Benchmarking

Quick start

Common workflows

Workflow 1: Standard benchmark evaluation

Workflow 2: Track training progress

Workflow 3: Compare multiple models

Workflow 4: Evaluate with vLLM (faster inference)

When to use vs alternatives

Common issues

Advanced topics

Hardware requirements

Resources


Other skills in this repository
