
Name: Reproduce
Author: fcakyon

End-to-end paper reproduction from arxiv URL through smoke runs to replication experiments. Handles missing or partial official code, missing training scripts, missing hyperparameters, and private datasets via similar-public-dataset substitution. Use when the user asks to reproduce, implement, replicate, or re-run a paper from scratch, or pastes an arxiv URL with reproduction intent.


stars:250

forks:24

updated:April 30, 2026 at 11:03


SKILL.md



Reproduce: paper reproduction from scratch

Reproducing an ML paper often means filling gaps the authors didn't ship: training scripts, hyperparameter tables, augmentation specifics, exact dataset splits. This skill walks through seven stages, from "I have an arxiv link" to "I have a replication run with a measurable delta vs the paper's number."

Each stage has a separate reference file under references/ so this overview stays scannable.

When to run

The user just did any of the following:

said "reproduce / implement / replicate / re-run paper X"
pasted an arxiv URL with reproduction intent ("can you redo this", "let's try this approach")
pointed at an OpenReview / proceedings link with the same intent
said "the paper has no code, can we build it"

Workflow

Stage	What	Reference
1	Paper acquisition (arxiv HTML → structured extract)	references/01-paper-fetch.md
2	Existing code discovery + inventory	references/02-code-clone.md
3	Gap analysis (extract every missing hyperparam from the prose)	references/03-gap-analysis.md
4	Implementation (uv venv, fill gaps, commit per gap)	references/04-implement.md
5	Dataset acquisition (HF datasets first; substitute if private)	references/05-dataset.md
6	Smoke runs (forward pass → 1 step → 20 iters)	references/06-smoke.md
7	Replication runs + comparison at paper's reported epochs	references/07-replicate.md

Walk them in order. Each stage has its own success criteria; do not advance to the next until the current one passes.
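
The gating rule above can be sketched as a small runner. This is a minimal illustration, not part of the skill itself: the `Stage` class, `run_stages`, and the lambda checks are hypothetical stand-ins for the success criteria defined in each reference file.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    number: int
    name: str
    check: Callable[[], bool]  # hypothetical success criterion for the stage


def run_stages(stages: List[Stage]) -> List[int]:
    """Walk stages in order and stop at the first whose check fails.

    Returns the numbers of the stages that passed.
    """
    passed = []
    for stage in sorted(stages, key=lambda s: s.number):
        if not stage.check():
            print(f"stage {stage.number} ({stage.name}) failed; not advancing")
            break
        passed.append(stage.number)
    return passed


# Illustrative checks only; the real criteria live in the references/ files.
stages = [
    Stage(1, "paper acquisition", lambda: True),
    Stage(2, "code discovery", lambda: True),
    Stage(3, "gap analysis", lambda: False),  # pretend a hyperparam is still missing
    Stage(4, "implementation", lambda: True),
]
passed = run_stages(stages)
```

Here stage 4 never runs, even though its check would pass, because stage 3 has not met its criterion yet.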

Working directory layout

For each paper reproduction, set up a dedicated workspace:

repro/<paper-arxiv-id>/
├── paper.md              # structured extract from stage 1
├── inventory.md          # what exists / missing from stage 2
├── gaps_filled.md        # hyperparam table with provenance from stage 3
├── code/                 # implementation from stage 4 (or cloned + extended)
├── data/                 # dataset symlinks or actual data from stage 5
├── dataset_substitution.md  # if a public dataset stood in for a private one
├── smoke_logs/           # outputs from stage 6
└── results.md            # replication outcomes from stage 7

This keeps reproductions self-contained and easy to revisit later.
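
One way to create this layout is a short scaffolding helper. The sketch below assumes Python with only the standard library; `scaffold_workspace` is a hypothetical name, and `dataset_substitution.md` is deliberately not created up front because the layout lists it only for the substitution case.

```python
from pathlib import Path

# File and directory names are taken from the layout above.
FILES = ["paper.md", "inventory.md", "gaps_filled.md", "results.md"]
DIRS = ["code", "data", "smoke_logs"]


def scaffold_workspace(root: Path, arxiv_id: str) -> Path:
    """Create repro/<arxiv_id>/ under root without overwriting existing files."""
    workspace = root / "repro" / arxiv_id
    for d in DIRS:
        # parents=True also creates repro/ and the workspace directory itself
        (workspace / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        # touch() leaves the contents of an existing file untouched
        (workspace / f).touch(exist_ok=True)
    return workspace
```

Calling `scaffold_workspace(Path("."), "<paper-arxiv-id>")` with a real ID is safe to repeat: existing notes and logs are left in place.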

Cross-references

After stage 3, hand the gap analysis off to the paper-verification skill for a round-trip check ("did I really capture every hyperparam the paper mentions").
Stage 4 implementation should be committed in small, reviewable pieces: each commit references the paper section that justified the filled value.
Stage 6 smoke failures route to the /phd-skills:debug skill, not to ad-hoc fixes.
Stage 7 launches go through the /phd-skills:launch checklist before any multi-hour run.
Stage 7 comparisons go through the /phd-skills:compare skill at the paper's reported epochs (never current-vs-final).
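
The same-epoch rule in the last item can be illustrated with a small helper. This is a sketch under assumptions, not the compare skill's implementation: the function names, the dict-of-epochs history format, and all numbers are hypothetical.

```python
from typing import Dict, Optional


def metric_at_epoch(history: Dict[int, float], epoch: int) -> Optional[float]:
    """Return the metric logged at exactly `epoch`, or None if it was never logged."""
    return history.get(epoch)


def same_epoch_delta(
    ours: Dict[int, float], paper_value: float, paper_epoch: int
) -> Optional[float]:
    """Absolute delta (ours - paper) at the paper's reported epoch."""
    value = metric_at_epoch(ours, paper_epoch)
    if value is None:
        # Run has not reached the reported epoch: no comparison beats a misleading one.
        return None
    return value - paper_value


# Hypothetical numbers for illustration only.
our_run = {10: 61.2, 20: 68.9, 30: 71.4}
delta = same_epoch_delta(our_run, paper_value=70.0, paper_epoch=20)
early = same_epoch_delta({10: 61.2}, paper_value=70.0, paper_epoch=20)
```

Note that the run's value at epoch 30 would look better than the paper's epoch-20 number, which is exactly the mismatched comparison the rule forbids.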

Output

For each reproduction, the final artifact is results.md with absolute deltas (not just %) and one of three labels per metric:

[matched within 0.X pp]: within the paper's reported variance
[gap, hypothesis: ...]: measurable underperformance, with a stated hypothesis for the cause
[fundamental disagreement, see X]: the result and the paper's claim are inconsistent in a way that needs investigation, not just more compute
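
A results.md line could be produced by a helper along these lines. This is a minimal sketch: `label_metric` and its parameters are hypothetical, `tolerance_pp` stands in for the paper's reported variance, and the choice between a gap and a fundamental disagreement is left to the caller because it is a judgment, not a computation.

```python
from typing import Optional


def label_metric(
    ours: float,
    paper: float,
    tolerance_pp: float,
    hypothesis: Optional[str] = None,
    see: Optional[str] = None,
) -> str:
    """Format one metric with its absolute delta (in pp) and one of the three labels."""
    delta = ours - paper
    if abs(delta) <= tolerance_pp:
        label = f"[matched within {tolerance_pp} pp]"
    elif see is not None:
        # Caller has judged the mismatch to need investigation, not more compute.
        label = f"[fundamental disagreement, see {see}]"
    elif hypothesis is not None:
        label = f"[gap, hypothesis: {hypothesis}]"
    else:
        raise ValueError("a result outside tolerance needs a hypothesis or a reference")
    return f"{label} delta={delta:+.2f} pp"


# Hypothetical numbers for illustration only.
matched = label_metric(69.8, 70.0, tolerance_pp=0.3)
gap = label_metric(68.9, 70.0, tolerance_pp=0.3, hypothesis="shorter warmup")
```

Raising on a missing hypothesis keeps an unexplained gap from being written into results.md silently.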

If the workspace is on a public repo, link the workspace README from the project's main reproduction-tracking doc.

related-skills.json

same repository

compare.md

from "fcakyon/phd-skills"

Same-epoch comparison of training runs across wandb, neptune, tensorboard, or mlflow. Aligns runs at the student's current step (never current-vs-final-of-baseline) and separates proxy metrics from downstream targets. Use when the user asks to compare runs, check if a run is improving, track lag against a baseline, rank experiments, or evaluate run-vs-run performance.

2026-04-30 · 250 stars

debug.md

from "fcakyon/phd-skills"

Evidence-before-action diagnosis of failing ML experiments. Probes the system before guessing causes (process list, dmesg, GPU stats, log scrollback, checkpoint state), then states a hypothesis as a hypothesis and runs a smoke test before claiming a root cause. Use when the user asks why a run is failing, diverging, OOMing, hanging, slow, producing weird metrics, or has crashed, or asks to debug, diagnose, troubleshoot, or investigate a training issue.

2026-04-30 · 250 stars

launch.md

from "fcakyon/phd-skills"

Pre-flight checklist for long-running ML training jobs covering config diff, run naming, path verification, monitoring setup, and restart-cleanup. Use when the user asks to launch, kick off, start, restart, or kill a training run, or mentions launching a multi-hour or multi-day GPU job (python train, accelerate launch, torchrun, deepspeed, sbatch, tmux training).

2026-04-30 · 250 stars

dataset-curation.md

from "fcakyon/phd-skills"

Use when the user wants to analyze dataset bias, create stratified samples, evaluate fairness, or plan dataset collection. Triggers on phrases like "dataset bias", "stratified sample", "class imbalance", "data distribution", "fairness analysis", or "ethical review".

2026-03-12 · 250 stars

experiment-design.md

from "fcakyon/phd-skills"

Use when the user wants to design experiments, plan ablation studies, structure baselines, or create incremental evaluation strategies. Triggers on phrases like "design ablation", "plan experiment", "what experiments should I run", "baseline comparison", or "experiment matrix".

2026-03-12 · 250 stars

latex-setup.md

from "fcakyon/phd-skills"

Use when the user wants to set up or troubleshoot a LaTeX environment, choose between biber and bibtex, install packages for a specific venue template, or configure compilation. Triggers on phrases like "setup latex", "biber vs bibtex", "latex compilation error", "install latex packages", "venue template", or "texlive setup".

2026-03-12 · 250 stars

package.json

"author": "fcakyon"

"repository": "fcakyon/phd-skills"

