skills-autoresearch

Use when converting an existing agent skill project or scaffolding a new project to run through the skills-autoresearch Flue harness, including project layout, config.json, eval cases, rubric, seed skill, baseline artifacts, and run commands.

Stars: 3

Forks: 0

Updated: May 31, 2026 at 10:13

Source: schalkneethling/skills-autoresearch-flue (GitHub repository by schalkneethling)

name: skills-autoresearch
description: Use when converting an existing agent skill project or scaffolding a new project to run through the skills-autoresearch Flue harness, including project layout, config.json, eval cases, rubric, seed skill, baseline artifacts, and run commands.

Skills Autoresearch

Use this skill to prepare a project for the skills-autoresearch harness. The goal is to create a small, auditable autoresearch project that can evaluate a seed agent skill, let a researcher model improve it, run producer evals against the candidate skill, and have a separate judge model score only the producer output.

Required Workflow

Follow these gates in order. Do not skip a gate, and do not tell the user the project is ready until the Final Validation Gate passes.

Discovery Gate: Locate the harness checkout, read the harness docs/example, inspect the target project, and check git state.
Shape Gate: Choose the single-skill or multi-skill project shape and create the required directories.
Skill Gate: Create or move a valid seed skill with frontmatter and enough instructions to run.
Eval Gate: Create config.json, evals/eval-cases.json, evals/rubric.md, input files, and reference files.
Baseline Gate: Either import/hand-author workspace/baseline/ or explicitly prepare the generated-baseline run path.
Final Validation Gate: Validate files, paths, schemas, rubric wording, baseline state, and git status before handing off.

Discovery Gate

Ask the user for the full path to the local skills-autoresearch checkout and read README.md, docs/using-the-harness.md, and the alpha fixture under fixtures/projects/release-notes-alpha/.
Inspect the project being converted before writing files. Identify the skill or workflow to improve, representative inputs, expected outputs, and any stable reference material.
Preserve user work. Check git status --short before editing. If there are uncommitted changes, notify the user and pause to ask whether they want to proceed, stash, commit, or switch branches before starting.
If appropriate, start a new feature branch.

Shape Gate

For a single-skill project, create:

project-root/
  config.json
  evals/
    eval-cases.json
    rubric.md
  input/
    ...
  reference/
    ...
  seed-skill/
    SKILL.md
  workspace/
    baseline/
      ...

For a multi-skill project, keep shared config/evals/reference at the root and put target skills under skills/:

project-root/
  config.json
  program.md
  evals/
    eval-cases.json
    rubric.md
  input/
    ...
  reference/
    ...
  skills/
    skill-one/
      SKILL.md
    skill-two/
      SKILL.md
  workspace/
    baseline/
      ...

The current alpha harness improves one seed skill per run. For multi-skill projects, run once per target skill and pass seedSkillDir in the Flue payload.

Skill Gate

When scaffolding a new project:

Create the directories shown above.
Write a minimal valid seed SKILL.md. It must include YAML frontmatter with name and description, plus enough body guidance for a producer agent to attempt the target task.
Add one or two small eval cases with concrete inputs and expectations.
Add a direct rubric that describes high-quality output.
Decide whether to import an existing baseline or generate one as the first harness run before model-backed research.

A minimal seed skill can be intentionally imperfect, but it should be runnable as an agent skill:

---
name: my-skill
description: Use when producing the target output for this autoresearch project.
---

# My Skill

Use the provided task input and reference material to produce the requested output.

Write the result in the format requested by the eval task.
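
The frontmatter requirement can be checked mechanically. The following is a minimal sketch, assuming the frontmatter is a plain block of single-line `key: value` entries between two `---` lines. `missingFrontmatterKeys` is a hypothetical helper, not part of the harness, and a real check would likely use a full YAML parser.

```typescript
// Hypothetical helper, not part of the harness: returns the required
// frontmatter keys that are missing or empty in a SKILL.md string.
// Assumes single-line `key: value` entries between two `---` lines.
function missingFrontmatterKeys(
  skillMd: string,
  required: string[] = ["name", "description"],
): string[] {
  const lines = skillMd.split(/\r?\n/);
  // Frontmatter must open on the very first line.
  if ((lines[0] ?? "").trim() !== "---") return required.slice();

  // Find the closing delimiter.
  let end = -1;
  for (let i = 1; i < lines.length; i++) {
    if ((lines[i] ?? "").trim() === "---") {
      end = i;
      break;
    }
  }
  if (end === -1) return required.slice();

  // Collect key/value pairs from the lines between the delimiters.
  const found: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const sep = line.indexOf(":");
    if (sep > 0) found[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  return required.filter((key) => !found[key]);
}
```

An empty result means both `name` and `description` are present and non-empty. The check says nothing about whether the body gives the producer enough guidance, which still needs a human read.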

When converting an existing project:

Move or copy the existing skill instructions into seed-skill/SKILL.md or skills/<name>/SKILL.md.
Put task inputs under input/.
Put stable background material, examples, API notes, policies, or domain facts under reference/.
Convert existing tests, examples, or acceptance criteria into evals/eval-cases.json and evals/rubric.md.
Import any known output as workspace/baseline/.
If there is no existing baseline, ask the user whether they want to:
- Generate a baseline as the first skills-autoresearch harness run.
- Create a small baseline by hand so the smoke run can validate loading and aggregation before spending model calls.

Eval Gate

Write `config.json`

Use this as the starting point and adjust names, paths, target score, models, and tracks:

{
  "skill_name": "my-skill",
  "topic_group": "my-topic",
  "origin_skill": "seed-skill",
  "target_score": 0.8,
  "max_iterations": 1,
  "max_concurrency": 1,
  "model": {
    "provider": "anthropic",
    "name": "claude-sonnet-4-6"
  },
  "models": {
    "producer": {
      "provider": "anthropic",
      "name": "claude-haiku-4-5"
    },
    "judge": {
      "provider": "anthropic",
      "name": "claude-sonnet-4-6"
    },
    "researcher": {
      "provider": "anthropic",
      "name": "claude-sonnet-4-6"
    }
  },
  "roles": {
    "judge": "eval-judge",
    "skill_builder": "skill-builder"
  },
  "tracks": [
    {
      "id": "main",
      "eval_type": "my-eval-type",
      "role": "task-producer",
      "target_skill": "my-skill",
      "requires_description": false
    }
  ]
}

Notes:

origin_skill is relative to the autoresearch project root unless absolute.
models.producer writes eval outputs.
models.judge scores producer outputs.
models.researcher patches the skill.
tracks[].eval_type must match the eval cases.
tracks[].role and roles.judge refer to Flue roles in the harness checkout under .flue/roles/.
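
As a rough illustration of what a config check might cover, the sketch below tests for the fields this skill later asks you to confirm (`skill_name`, `topic_group`, `target_score`, `max_iterations`, `max_concurrency`, `roles`, and at least one `tracks[]` entry). `validateConfig` and `isPlainObject` are hypothetical names, and the per-track string checks are an assumption based on the example config; the harness's own schema remains authoritative.

```typescript
// Hypothetical helpers, not part of the harness.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of human-readable problems; an empty list means the
// fields named in this skill's validation checklist are present.
function validateConfig(config: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const requiredKeys = [
    "skill_name",
    "topic_group",
    "target_score",
    "max_iterations",
    "max_concurrency",
    "roles",
    "tracks",
  ];
  for (const key of requiredKeys) {
    if (config[key] === undefined || config[key] === null) {
      problems.push(`missing field: ${key}`);
    }
  }

  const tracks = config["tracks"];
  if (tracks === undefined || tracks === null) return problems;
  if (!Array.isArray(tracks) || tracks.length === 0) {
    problems.push("tracks must contain at least one entry");
    return problems;
  }

  // Assumption: each track needs non-empty id, eval_type, and role strings,
  // as in the example config.
  tracks.forEach((track: unknown, index: number) => {
    for (const key of ["id", "eval_type", "role"]) {
      const value = isPlainObject(track) ? track[key] : undefined;
      if (typeof value !== "string" || value === "") {
        problems.push(`tracks[${index}].${key} must be a non-empty string`);
      }
    }
  });
  return problems;
}
```

Pass it the result of `JSON.parse` on the contents of config.json. It does not verify that `origin_skill` exists on disk or that role names resolve under .flue/roles/, which require file system access.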

Write Eval Cases

Create evals/eval-cases.json:

{
  "evals": [
    {
      "id": "case-001",
      "eval_type": "my-eval-type",
      "title": "Short human-readable title",
      "input": {
        "file": "INPUT.md"
      },
      "expectations": {
        "must_include": ["important requirement"]
      },
      "scoring_dimensions": [
        {
          "id": "correctness",
          "label": "Correct and useful output",
          "max_score": 1
        }
      ]
    }
  ]
}

Keep early evals small, concrete, and easy to inspect. Prefer one or two targeted cases over a broad suite.
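
The cross-checks between eval cases and tracks are easy to get wrong by eye. The sketch below is one way to express them, assuming the JSON shape shown above. `validateEvalCases` and `staysInside` are hypothetical helpers, and the path check is a simplified, string-only stand-in for real path resolution.

```typescript
// Shape inferred from the example above; the harness may accept more fields.
interface EvalCase {
  id?: string;
  eval_type?: string;
  title?: string;
  input?: { file?: string };
  expectations?: Record<string, unknown>;
  scoring_dimensions?: Array<{ id: string; label: string; max_score: number }>;
}

// True when a relative path cannot climb out of its base directory.
function staysInside(relativePath: string): boolean {
  if (relativePath.charAt(0) === "/" || /^[A-Za-z]:/.test(relativePath)) return false;
  let depth = 0;
  for (const segment of relativePath.split(/[\\/]+/)) {
    if (segment === "" || segment === ".") continue;
    depth += segment === ".." ? -1 : 1;
    if (depth < 0) return false;
  }
  return depth > 0;
}

// Hypothetical helper: trackEvalTypes is the list of tracks[].eval_type
// values from config.json.
function validateEvalCases(doc: { evals?: EvalCase[] }, trackEvalTypes: string[]): string[] {
  const evals = doc.evals;
  if (!Array.isArray(evals)) return ["top-level object needs an evals array"];

  const problems: string[] = [];
  const usedTypes: string[] = [];
  for (const evalCase of evals) {
    const label = evalCase.id || "(missing id)";
    if (!evalCase.id || !evalCase.eval_type || !evalCase.title) {
      problems.push(`${label}: id, eval_type, and title are required`);
    }
    if (!evalCase.scoring_dimensions || evalCase.scoring_dimensions.length === 0) {
      problems.push(`${label}: needs at least one scoring_dimensions entry`);
    }
    if (evalCase.eval_type) {
      usedTypes.push(evalCase.eval_type);
      if (trackEvalTypes.indexOf(evalCase.eval_type) === -1) {
        problems.push(`${label}: no track for eval_type ${evalCase.eval_type}`);
      }
    }
    const file = evalCase.input?.file;
    if (file !== undefined && !staysInside(file)) {
      problems.push(`${label}: input.file must resolve under input/`);
    }
  }
  // Every track should be exercised by at least one eval case.
  for (const evalType of trackEvalTypes) {
    if (usedTypes.indexOf(evalType) === -1) {
      problems.push(`track eval_type ${evalType} is not used by any eval case`);
    }
  }
  return problems;
}
```

The function checks both directions: every eval case needs a matching track, and every track needs at least one eval case. It does not confirm that the input file exists.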

Write The Rubric

Create evals/rubric.md with direct scoring guidance. The rubric should explain quality criteria and scoring expectations, but it must not instruct the judge to return a legacy or custom JSON shape.

# Rubric

A high-scoring answer:

- Satisfies the concrete user request.
- Uses the provided input and reference material accurately.
- Avoids unsupported claims.
- Produces the expected files or output format.

The judge should evaluate producer output against the eval case and rubric, not reward the candidate skill instructions directly.

The Flue harness expects judge output to match this EvalScore shape:

{
  "eval_id": "case-001",
  "eval_type": "my-eval-type",
  "track_id": "main",
  "total_score": 0.5,
  "max_score": 1,
  "dimensions": [
    {
      "id": "correctness",
      "score": 0.5,
      "max_score": 1,
      "rationale": "Brief evidence-grounded rationale."
    }
  ],
  "summary": "Brief overall assessment."
}

When converting an existing project, remove stale rubric instructions that mention legacy fields such as focus_dimensions, scores, composite_score, expectations_met, expectations_missed, additional_observations, or any output example that does not match EvalScore.
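
Finding stale rubric wording can be partly automated. The sketch below scans rubric text for the legacy field names listed above. `findLegacyRubricFields` is a hypothetical helper, and whole-word matching is an assumption that can produce false positives, especially for the ordinary word "scores", so treat hits as prompts for review rather than as errors.

```typescript
// Legacy field names taken from the guidance above.
const LEGACY_RUBRIC_FIELDS = [
  "focus_dimensions",
  "scores",
  "composite_score",
  "expectations_met",
  "expectations_missed",
  "additional_observations",
];

// Hypothetical helper: returns the legacy names that appear as whole words.
// Underscores count as word characters, so "total_score" and
// "scoring_dimensions" do not trigger the "scores" entry.
function findLegacyRubricFields(rubric: string): string[] {
  return LEGACY_RUBRIC_FIELDS.filter((field) =>
    new RegExp(`\\b${field}\\b`).test(rubric),
  );
}
```

Hits are returned in the order of the list, so the output is stable across runs.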

Baseline Gate

The project can start with an imported baseline or create one as an initial harness run. Do not assume workspace/baseline/ already exists.

If the user wants to import or hand-author a baseline, create this shape:

workspace/baseline/
  scores-0.json
  summary.json
  case-001/
    task.md
    input/
    output/

Each scores-*.json file should match:

{
  "eval_id": "case-001",
  "eval_type": "my-eval-type",
  "track_id": "main",
  "total_score": 0.5,
  "max_score": 1,
  "dimensions": [
    {
      "id": "correctness",
      "score": 0.5,
      "max_score": 1,
      "rationale": "Baseline is partially correct but misses an important requirement."
    }
  ],
  "summary": "Baseline is usable but incomplete."
}
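
For hand-authored baselines, a structural check catches typos before the harness loads the file. The following sketch models the shape above as TypeScript types with a runtime guard. The type names, `isRecord`, and `isEvalScore` are hypothetical, and the guard checks only field presence and primitive types, not score ranges or totals.

```typescript
// Types mirror the JSON example above; names are illustrative.
interface DimensionScore {
  id: string;
  score: number;
  max_score: number;
  rationale: string;
}

interface EvalScore {
  eval_id: string;
  eval_type: string;
  track_id: string;
  total_score: number;
  max_score: number;
  dimensions: DimensionScore[];
  summary: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical runtime guard for a parsed scores-*.json file.
function isEvalScore(value: unknown): value is EvalScore {
  if (!isRecord(value)) return false;
  for (const key of ["eval_id", "eval_type", "track_id", "summary"]) {
    if (typeof value[key] !== "string") return false;
  }
  for (const key of ["total_score", "max_score"]) {
    if (typeof value[key] !== "number") return false;
  }
  const dimensions = value["dimensions"];
  if (!Array.isArray(dimensions)) return false;
  for (const dim of dimensions) {
    if (!isRecord(dim)) return false;
    if (
      typeof dim["id"] !== "string" ||
      typeof dim["score"] !== "number" ||
      typeof dim["max_score"] !== "number" ||
      typeof dim["rationale"] !== "string"
    ) {
      return false;
    }
  }
  return true;
}
```

Call `isEvalScore(JSON.parse(text))` on each scores file. A legacy object keyed by `scores` or `composite_score` fails the guard because the required EvalScore fields are absent.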

summary.json should aggregate the baseline scores. Use the alpha fixture as the concrete example if the schema is unclear.

If the user wants the harness to generate the baseline, run without withBaseline and with runResearch:false. This creates workspace/baseline/ and does not count as a research iteration.

Run From The Harness Checkout

Install and validate the harness:

pnpm install
pnpm test
pnpm run typecheck
pnpm run build

Run a baseline smoke test without model calls:

pnpm exec flue run autoresearch --target node --root . --id my-smoke --payload '{"projectRoot":"path/to/project-root","withBaseline":true,"runResearch":false,"sessionId":"my-smoke"}'

The smoke run's events should end with research-loop-ready.

To generate the initial baseline with the harness instead of importing one, validate credentials first:

pnpm run env:check

Then run without withBaseline:

varlock run -- pnpm exec flue run autoresearch --target node --root . --id my-baseline --payload '{"projectRoot":"path/to/project-root","runResearch":false,"sessionId":"my-baseline"}'

Expected generated-baseline events include baseline-started and baseline-generated.

For model-backed research after a baseline exists, run:

varlock run -- pnpm exec flue run autoresearch --target node --root . --id my-research --payload '{"projectRoot":"path/to/project-root","withBaseline":true,"runResearch":true,"seedSkillDir":"path/to/project-root/seed-skill","sessionId":"my-research"}'

For a multi-skill project, point seedSkillDir at the specific skill directory for this run, for example path/to/project-root/skills/security-audit.
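
The three commands above differ only in their `--payload` JSON. The sketch below builds that JSON for each run mode, using only the payload fields that appear in the commands. `buildPayload` and the `RunMode` names are hypothetical, and defaulting `seedSkillDir` to `<projectRoot>/seed-skill` is an assumption that mirrors the single-skill example.

```typescript
// Hypothetical run modes matching the three commands above.
type RunMode = "smoke" | "generate-baseline" | "research";

// Builds the --payload JSON string. Shell quoting is left to the caller.
function buildPayload(
  mode: RunMode,
  projectRoot: string,
  sessionId: string,
  seedSkillDir?: string,
): string {
  const payload: Record<string, string | boolean> = { projectRoot };
  // Baseline generation is the only mode that omits withBaseline.
  if (mode !== "generate-baseline") payload["withBaseline"] = true;
  payload["runResearch"] = mode === "research";
  // seedSkillDir is only sent for research runs; for multi-skill projects
  // pass the specific skill directory explicitly.
  if (mode === "research") {
    payload["seedSkillDir"] = seedSkillDir ?? `${projectRoot}/seed-skill`;
  }
  payload["sessionId"] = sessionId;
  return JSON.stringify(payload);
}
```

For a multi-skill project, call it once per target skill, for example `buildPayload("research", "path/to/project-root", "my-research", "path/to/project-root/skills/security-audit")`.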

Final Validation Gate

Before reporting success, perform this validation and fix any failures.

Re-run git status --short in the target project and report the changed, deleted, and untracked paths. If unexpected deletions or moves appear, pause and ask the user before continuing.
Confirm these files exist:
- config.json
- evals/eval-cases.json
- evals/rubric.md
- input/
- reference/
- seed-skill/SKILL.md for single-skill projects, or the selected skills/<name>/SKILL.md for multi-skill projects
Validate config.json against the harness fields:
- skill_name, topic_group, target_score, max_iterations, max_concurrency, roles, and at least one tracks[] entry are present.
- Every tracks[].eval_type is used by at least one eval case.
- origin_skill points to an existing skill directory unless the run will always pass seedSkillDir.
Validate evals/eval-cases.json:
- The top-level object has an evals array.
- Every eval has id, eval_type, title, and at least one scoring_dimensions[] entry.
- Every input.file path resolves under input/.
- Every eval's eval_type has a matching track in config.json.
Validate evals/rubric.md:
- It references scoring_dimensions, not focus_dimensions.
- It does not include legacy output examples using scores, composite_score, expectations_met, expectations_missed, or additional_observations.
- Any output-shape example matches EvalScore: eval_id, eval_type, track_id, total_score, max_score, dimensions, and summary.
Validate the seed skill:
- SKILL.md has YAML frontmatter with name and description.
- The body gives enough task guidance for the producer to attempt the evals.
- Supporting references are colocated with the skill or under project reference/ and are mentioned where useful.
Validate baseline readiness:
- If workspace/baseline/ exists, confirm it has summary.json, at least one scores-*.json, and one directory per eval id with task.md, input/, and output/.
- If workspace/baseline/ does not exist, the next recommended command must be the generated-baseline command without withBaseline and with runResearch:false.
Run a schema/path validation command if the harness checkout is available. A direct Node check is enough; do not rely only on visual inspection.
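
One piece of that validation, the baseline readiness check, might look like the sketch below. It is written as a pure function over a list of paths relative to workspace/baseline/, so collecting that list (for example with Node's fs module) is left to the caller. `checkBaseline` is a hypothetical helper, and marking directories with a trailing slash is an assumption of this sketch.

```typescript
// Hypothetical helper: `paths` are entries relative to workspace/baseline/,
// with directories written with a trailing slash (an assumption of this sketch).
function checkBaseline(paths: string[], evalIds: string[]): string[] {
  const problems: string[] = [];
  const has = (entry: string): boolean => paths.indexOf(entry) !== -1;

  if (!has("summary.json")) problems.push("missing summary.json");
  // At least one top-level scores-*.json file is required.
  if (!paths.some((entry) => /^scores-[^/]+\.json$/.test(entry))) {
    problems.push("missing scores-*.json");
  }
  // Each eval id needs its own directory with task.md, input/, and output/.
  for (const id of evalIds) {
    for (const entry of ["task.md", "input/", "output/"]) {
      if (!has(`${id}/${entry}`)) problems.push(`missing ${id}/${entry}`);
    }
  }
  return problems;
}
```

An empty result means the baseline has the expected layout; it does not confirm that the score files themselves are well-formed.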

Inspect Results

After research, inspect artifacts under the autoresearch project root:

workspace/iterations/1/
  skill/SKILL.md
  skill/RESEARCH.md
  skill/.autoresearch-flue-transcript.json
  outputs/<eval-id>/RESULT.md
  outputs/<eval-id>/producer-flue-transcript.json
  outputs/<eval-id>/judge-flue-transcript.json
  scores-0.json
  summary.json

Check that:

The candidate skill/SKILL.md addresses observed baseline weaknesses without overfitting to one fixture.
Producer output is a real answer to the eval task.
Judge rationale cites producer output, not the skill instructions.
Scores improved for the intended reasons.
Transcripts do not leak secrets such as API keys or provider key prefixes.
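
The secret check in particular is tedious to do by eye on long transcripts. The sketch below flags secret-like strings in transcript text. `findSecretLikeStrings` is a hypothetical helper, and both patterns are illustrative assumptions rather than an exhaustive or provider-endorsed list, so a clean result is not proof that nothing leaked.

```typescript
// Illustrative patterns only (assumptions, not a complete list).
const SECRET_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  // Keys that start with an "sk-" prefix followed by a long token.
  { label: "sk- style API key", pattern: /\bsk-[A-Za-z0-9_-]{8,}/ },
  // Assignments such as SOME_API_KEY=value or "SOME_API_KEY": "value".
  {
    label: "API key assignment",
    pattern: /\b[A-Z0-9_]*API_KEY["']?\s*[=:]\s*["']?[^\s"']+/,
  },
];

// Hypothetical helper: returns the labels of every pattern that matches.
function findSecretLikeStrings(text: string): string[] {
  const hits: string[] = [];
  for (const entry of SECRET_PATTERNS) {
    if (entry.pattern.test(text)) hits.push(entry.label);
  }
  return hits;
}
```

Run it over each transcript file's contents and review any hit by hand before sharing the artifacts.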

Reruns

Iteration artifacts are created with exclusive writes, so a rerun will not overwrite files left by an earlier run. To rerun from scratch, remove generated iterations in the autoresearch project:

rm -rf path/to/project-root/workspace/iterations

Do not remove baseline artifacts unless the user is intentionally replacing the baseline.

