$pwd:

evaluating-llms-harness

Name: Evaluating LLMs Harness
Author: OpenRaiser

// Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

$ git log --oneline --stat

stars: 1,403

forks: 100

updated: May 7, 2026 at 02:44

File explorer

5 files

SKILL.md

readonly

related-skills.json

same repository

peft-fine-tuning.md

from "OpenRaiser/NanoResearch"

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.

2026-05-07 | 1.4k

unsloth.md

from "OpenRaiser/NanoResearch"

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

2026-05-07 | 1.4k

nanoresearch-experiment.md

from "OpenRaiser/NanoResearch"

Generate a Python code skeleton from an experiment blueprint

2026-03-05 | 1.4k

package.json

"author": "OpenRaiser"

"repository": "OpenRaiser/NanoResearch"

$ useful --forSOC

Data Scientists | Computer and Mathematical Occupations | 15-2051 | L4

name	evaluating-llms-harness
description	Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["Evaluation","LM Evaluation Harness","Benchmarking","MMLU","HumanEval","GSM8K","EleutherAI","Model Quality","Academic Benchmarks","Industry Standard"]
dependencies	["lm-eval","transformers","vllm"]

evaluating-llms-harness

lm-evaluation-harness - LLM Benchmarking

Quick start

Common workflows

Workflow 1: Standard benchmark evaluation

Workflow 2: Track training progress

Workflow 3: Compare multiple models

Workflow 4: Evaluate with vLLM (faster inference)

When to use vs alternatives

Common issues

Advanced topics

Hardware requirements

Resources
