evaluating-code-models

Updated: December 14, 2025 at 00:38

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Source: Orchestra-Research/AI-Research-SKILLs

name	evaluating-code-models
description	Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["Evaluation","Code Generation","HumanEval","MBPP","MultiPL-E","Pass@k","BigCode","Benchmarking","Code Models"]
dependencies	["bigcode-evaluation-harness","transformers>=4.25.1","accelerate>=0.13.2","datasets>=2.6.1"]

BigCode Evaluation Harness - Code Model Benchmarking

Quick Start

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).

Installation:

git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config

Evaluate on HumanEval:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations

View available tasks:

python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"

Common Workflows

Workflow 1: Standard Code Benchmark Evaluation

Evaluate a model on the core code benchmarks (HumanEval, MBPP, HumanEval+).

Checklist:

Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results

Step 1: Choose benchmark suite

Python code generation (most common):

HumanEval: 164 handwritten problems, function completion
HumanEval+: Same 164 problems with 80× more tests (stricter)
MBPP: 500 crowd-sourced problems, entry-level difficulty
MBPP+: 399 curated problems with 35× more tests

Multi-language (18 languages):

MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.

Advanced:

APPS: 10,000 problems (introductory/interview/competition)
DS-1000: 1,000 data science problems across 7 libraries

Step 2: Configure model and generation

# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution

# Custom/private model
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution
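
When the same settings are reused across runs, it can help to assemble the command programmatically. The sketch below is one way to do that in Python, assuming the harness is started with `accelerate launch main.py` as in the commands above. `build_eval_command` is a hypothetical helper, not part of the harness; it only formats flags and does not validate them.

```python
import shlex


def build_eval_command(model, tasks, allow_code_execution=True, **options):
    """Assemble an `accelerate launch main.py` command as an argument list.

    Hypothetical helper: flag names are passed through unchanged.
    """
    cmd = ["accelerate", "launch", "main.py",
           "--model", model,
           "--tasks", ",".join(tasks)]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            # Bare switch such as --load_in_4bit. For flags that expect an
            # explicit value (e.g. --do_sample True), pass the string "True".
            cmd.append(flag)
        elif value is not False and value is not None:
            cmd.extend([flag, str(value)])
    if allow_code_execution:
        cmd.append("--allow_code_execution")
    return cmd


cmd = build_eval_command(
    "bigcode/starcoder2-7b",
    ["humaneval"],
    max_length_generation=512,
    temperature=0.2,
    n_samples=200,
    batch_size=50,
)
print(shlex.join(cmd))  # shell-quoted command, ready to paste or log
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the harness directory.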

Step 3: Run evaluation

# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json

Step 4: Analyze results

Results in results/starcoder2-humaneval.json:

{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
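
pass@k is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021): for a problem with n samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch, assuming per-problem pass counts are available; the function names and the counts in the usage example are illustrative, not harness output.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts for three problems with n=20 samples each.
counts = [0, 5, 20]
for k in (1, 10):
    print(f"pass@{k} = {mean_pass_at_k(counts, n=20, k=k):.3f}")
```

The estimator needs n >= k, which is why reporting pass@100 requires at least 100 samples per problem.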

Workflow 2: Multi-Language Evaluation (MultiPL-E)

Evaluate code generation across 18 programming languages.

Checklist:

Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages

Step 1: Generate solutions on host

# Generate without execution (safe)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json
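
Before moving generations into the container, a quick shape check can catch truncated or mismatched files. The sketch below assumes the saved file is a JSON list with one inner list of generated strings per problem; that layout is an assumption, so inspect a real file first. Both function names are hypothetical.

```python
import json


def summarize_generations(generations, expected_samples):
    """Return (num_problems, indices of problems with a wrong sample count).

    Assumes `generations` is a list with one inner list of strings per problem.
    """
    bad = [
        i for i, samples in enumerate(generations)
        if not isinstance(samples, list) or len(samples) != expected_samples
    ]
    return len(generations), bad


def check_generations_file(path, expected_samples):
    """Load a saved generations file and summarize its shape."""
    with open(path) as f:
        return summarize_generations(json.load(f), expected_samples)


# In-memory stand-in for a saved file: two problems, two samples each.
example = [["def f():\n    return 1", "def f():\n    return 2"],
           ["def g():\n    pass", "def g():\n    return 0"]]
print(summarize_generations(example, expected_samples=2))
```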

Step 2: Evaluate in Docker container

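# Pull the MultiPL-E Docker image and tag it with the short name used below
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple

# Run evaluation inside the container (generations are mounted read-only).
# Some harness versions append the task name to saved generation files;
# mount the file that was actually written in Step 1.
docker run -v "$(pwd)/generations_multi.json:/app/generations.json:ro" \
  -it evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50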

Supported languages: Python (multiple-py), plus the 18 MultiPL-E translation targets: Bash, C++, C#, D, Go, Java, JavaScript, Julia, Lua, Perl, PHP, R, Racket, Ruby, Rust, Scala, Swift, TypeScript
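
For Step 3 of the checklist (compare across languages), one option is to pull pass@1 for each `multiple-*` task out of the metrics JSON. The sketch below assumes the metrics file maps task names to pass@k values alongside a "config" entry, as in the Workflow 1 example; `pass_at_1_by_language` is a hypothetical helper and the numbers are placeholders, not real results.

```python
def pass_at_1_by_language(metrics, prefix="multiple-"):
    """Map language suffix -> pass@1 for every task starting with `prefix`."""
    scores = {
        task[len(prefix):]: values["pass@1"]
        for task, values in metrics.items()
        if task.startswith(prefix) and isinstance(values, dict) and "pass@1" in values
    }
    # Highest score first
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Placeholder numbers for illustration only, not real results.
metrics = {
    "multiple-py": {"pass@1": 0.30},
    "multiple-js": {"pass@1": 0.25},
    "multiple-java": {"pass@1": 0.28},
    "config": {"model": "bigcode/starcoder2-7b"},
}
for lang, score in pass_at_1_by_language(metrics).items():
    print(f"{lang:<6} {score:.3f}")
```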

Workflow 3: Instruction-Tuned Model Evaluation

Evaluate chat/instruction models with proper formatting.

Checklist:

Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation

Step 1: Choose instruction tasks

instruct-humaneval: HumanEval with instruction prompts
humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks

Step 2: Configure instruction tokens

# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution
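
The `--instruction_tokens` value is a single comma-separated string with three parts: the user token, the end token, and the assistant token. The sketch below only shows how such a string splits into those roles; how the harness places them around each prompt is not shown here, and `split_instruction_tokens` is a hypothetical helper.

```python
def split_instruction_tokens(value):
    """Split 'user,end,assistant' into a dict of its three tokens."""
    parts = value.split(",")
    if len(parts) != 3:
        # A token that itself contains a comma cannot be expressed this way.
        raise ValueError("expected exactly three comma-separated tokens")
    user, end, assistant = parts
    return {"user": user, "end": end, "assistant": assistant}


tokens = split_instruction_tokens("<s>[INST],</s>,[/INST]")
for role, token in tokens.items():
    print(f"{role:<10} {token!r}")
```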

Step 3: HumanEvalPack for instruction models

# Test code synthesis (Python and JavaScript here; HumanEvalPack covers 6 languages)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution

Workflow 4: Compare Multiple Models

Run the same benchmark suite on several models and tabulate the results.

Step 1: Create evaluation script

#!/bin/bash
# eval_models.sh: run the same benchmarks for each model in MODELS
set -euo pipefail

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

# Metrics files are written here
mkdir -p results

for model in "${MODELS[@]}"; do
  # Replace "/" so the model ID can be used as a file name
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"

  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done

Step 2: Generate comparison table

import json

import pandas as pd

# File stems match the model_name values produced by eval_models.sh ("/" replaced by "-")
models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}",
    })

df = pd.DataFrame(results)
# DataFrame.to_markdown requires the optional "tabulate" package
print(df.to_markdown(index=False))

When to Use vs Alternatives

Use BigCode Evaluation Harness when:

Evaluating code generation models specifically
Need multi-language evaluation (18 languages via MultiPL-E)
Testing functional correctness with unit tests (pass@k)
Benchmarking for BigCode/HuggingFace leaderboards
Evaluating fill-in-the-middle (FIM) capabilities

Use alternatives instead:

lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
SWE-bench: Real-world GitHub issue resolution
LiveCodeBench: Contamination-free, continuously updated problems
CodeXGLUE: Code understanding tasks (clone detection, defect prediction)

Supported Benchmarks

Benchmark	Problems	Languages	Metric	Use Case
HumanEval	164	Python	pass@k	Standard code completion
HumanEval+	164	Python	pass@k	Stricter evaluation (80× tests)
MBPP	500	Python	pass@k	Entry-level problems
MBPP+	399	Python	pass@k	Stricter evaluation (35× tests)
MultiPL-E	164×18	18 languages	pass@k	Multi-language evaluation
APPS	10,000	Python	pass@k	Introductory to competition-level
DS-1000	1,000	Python	pass@k	Data science (pandas, numpy, etc.)
HumanEvalPack	164×3×6	6 languages	pass@k	Synthesis/fix/explain
Mercury	1,889	Python	Efficiency	Computational efficiency

Common Issues

Issue: Different results than reported in papers

Check these factors:

# 1. Verify n_samples (papers commonly use 200 samples to estimate pass@1/10/100)
--n_samples 200

# 2. Check temperature (0.2 is commonly used for pass@1, 0.8 for pass@10/100)
--temperature 0.8

# 3. Verify task name matches exactly
--tasks humaneval  # Not "human_eval" or "HumanEval"

# 4. Check max_length_generation
--max_length_generation 512  # Increase for longer problems
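
To spot which settings differ between two runs, the "config" sections of their metrics files can be compared directly. A minimal sketch, assuming each metrics file carries a "config" object like the one in the Workflow 1 example; `diff_configs` and the default key list are hypothetical.

```python
def diff_configs(config_a, config_b,
                 keys=("temperature", "n_samples", "max_length_generation")):
    """Return {key: (value_a, value_b)} for the settings that differ."""
    return {
        key: (config_a.get(key), config_b.get(key))
        for key in keys
        if config_a.get(key) != config_b.get(key)
    }


# Stand-ins for the "config" section of two metrics files.
run_a = {"model": "bigcode/starcoder2-7b", "temperature": 0.2, "n_samples": 20}
run_b = {"model": "bigcode/starcoder2-7b", "temperature": 0.8, "n_samples": 200}
print(diff_configs(run_a, run_b))
```

An empty result means the listed settings match, so the score gap likely comes from somewhere else (task name, prompt format, model revision).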

Issue: CUDA out of memory

# Use quantization
--load_in_8bit
# OR
--load_in_4bit

# Reduce batch size
--batch_size 1

# Set memory limit
--max_memory_per_gpu "20GiB"

Issue: Code execution hangs or times out

Use Docker for safe execution:

# Generate on host (no execution)
--generation_only --save_generations

# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
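
To illustrate the timeout idea on its own: generated code can be run in a separate process with a hard time limit, so an infinite loop is reported as a timeout instead of hanging the evaluation. This is a sketch of the general technique using the standard library, not the harness's execution code, and it provides no sandboxing; Docker is still the safer route for untrusted code. `run_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_s=5.0):
    """Run `code` in a separate Python process; return (status, stdout)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        return "timeout", ""
    return ("passed" if proc.returncode == 0 else "failed"), proc.stdout


print(run_with_timeout("print(1 + 1)"))
print(run_with_timeout("while True: pass", timeout_s=1.0))
```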

Issue: Low scores on instruction models

Ensure proper instruction formatting:

# Use instruction-specific tasks
--tasks instruct-humaneval

# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"

Issue: MultiPL-E language failures

Use the dedicated Docker image:

docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

Command Reference

Argument	Default	Description
`--model`	-	HuggingFace model ID or local path
`--tasks`	-	Comma-separated task names
`--n_samples`	1	Samples per problem (200 for pass@k)
`--temperature`	0.2	Sampling temperature
`--max_length_generation`	512	Max tokens (prompt + generation)
`--batch_size`	1	Batch size per GPU
`--allow_code_execution`	False	Enable execution of generated code (required for tasks that run unit tests)
`--generation_only`	False	Generate without evaluation
`--load_generations_path`	-	Load pre-generated solutions
`--save_generations`	False	Save generated code
`--metric_output_path`	evaluation_results.json	Output file for metrics
`--load_in_8bit`	False	8-bit quantization
`--load_in_4bit`	False	4-bit quantization
`--trust_remote_code`	False	Allow custom model code
`--precision`	fp32	Model precision (fp32/fp16/bf16)

Hardware Requirements

Model Size	VRAM (fp16)	VRAM (4-bit)	Time (HumanEval, n=200)
7B	14GB	6GB	~30 min (A100)
13B	26GB	10GB	~1 hour (A100)
34B	68GB	20GB	~2 hours (A100)
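
The fp16 column matches a weight-only estimate of 2 bytes per parameter. The sketch below reproduces that arithmetic, assuming 1 GB = 1e9 bytes; it ignores activations, KV cache, and quantization overhead, which is why the 4-bit figures in the table are higher than the weight-only numbers it prints. `weight_memory_gb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params, precision):
    """Weight-only estimate in GB (1 GB = 1e9 bytes); excludes runtime overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for size in (7e9, 13e9, 34e9):
    print(f"{size / 1e9:.0f}B: fp16 {weight_memory_gb(size, 'fp16'):.0f} GB, "
          f"4-bit {weight_memory_gb(size, '4bit'):.1f} GB (weights only)")
```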

Resources

GitHub: https://github.com/bigcode-project/bigcode-evaluation-harness
Documentation: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs
BigCode Leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
HumanEval Dataset: https://huggingface.co/datasets/openai/openai_humaneval
MultiPL-E: https://github.com/nuprl/MultiPL-E