$pwd:

evaluating-llms-harness

Name: Evaluating LLMs Harness
Author: OpenRaiser

// Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

$ git log --oneline --stat

stars: 1,403

forks: 100

updated: May 7, 2026 at 02:44

File explorer

5 files

SKILL.md

readonly

related-skills.json

Same repository

peft-fine-tuning.md

from "OpenRaiser/NanoResearch"

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.

2026-05-07 · 1.4k

unsloth.md

from "OpenRaiser/NanoResearch"

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

2026-05-07 · 1.4k

nanoresearch-experiment.md

from "OpenRaiser/NanoResearch"

Generate a Python code skeleton from an experiment blueprint

2026-03-05 · 1.4k

package.json

"author": "OpenRaiser"

"repository": "OpenRaiser/NanoResearch"

$ useful --forSOC

Data Scientists · Computer and Mathematical Occupations · 15-2051 · L4

name	evaluating-llms-harness
description	Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["Evaluation","LM Evaluation Harness","Benchmarking","MMLU","HumanEval","GSM8K","EleutherAI","Model Quality","Academic Benchmarks","Industry Standard"]
dependencies	["lm-eval","transformers","vllm"]


lm-evaluation-harness - LLM Benchmarking

Quick start

Common workflows

Workflow 1: Standard benchmark evaluation

Workflow 2: Track training progress

Workflow 3: Compare multiple models

Workflow 4: Evaluate with vLLM (faster inference)

When to use vs alternatives

Common issues

Advanced topics

Hardware requirements

Resources
