Run any Skill in Manus with one click

$pwd:

evaluating-llms-harness

Name: Evaluating LLMs Harness
Author: OpenRaiser

// Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

$ git log --oneline --stat

stars: 1403

forks: 100

updated: May 7, 2026, 02:44

File explorer

5 files

SKILL.md

readonly

related-skills.json

same repository

peft-fine-tuning.md

from "OpenRaiser/NanoResearch"

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.

2026-05-07 · 1.4k

unsloth.md

from "OpenRaiser/NanoResearch"

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

2026-05-07 · 1.4k

nanoresearch-experiment.md

from "OpenRaiser/NanoResearch"

Generate a Python code skeleton from an experiment blueprint

2026-03-05 · 1.4k

package.json

"author": "OpenRaiser"

"repository": "OpenRaiser/NanoResearch"

$ useful --forSOC

Data Scientists · Computer and Mathematical Occupations · 15-2051 · L4

name	evaluating-llms-harness
description	Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["Evaluation","LM Evaluation Harness","Benchmarking","MMLU","HumanEval","GSM8K","EleutherAI","Model Quality","Academic Benchmarks","Industry Standard"]
dependencies	["lm-eval","transformers","vllm"]

evaluating-llms-harness

lm-evaluation-harness - LLM Benchmarking

Quick start

Common workflows

Workflow 1: Standard benchmark evaluation

Workflow 2: Track training progress

Workflow 3: Compare multiple models

Workflow 4: Evaluate with vLLM (faster inference)

When to use vs alternatives

Common issues

Advanced topics

Hardware requirements

Resources
