Execute qualquer Skill no Manus
com um clique

Execute qualquer Skill no Manus com um clique

$pwd:

synthesize-task

Name: Synthesize Task
Author: PrimeIntellect-ai

// Synthesize a new general-agent task family from a seed task, evolving through difficulty tiers with empirical pass-rate gating. Use when asked to create new tasks or grow the task set.

Executar no Manus

$ git log --oneline --stat

stars:54

forks:18

updated:30 de abril de 2026 às 00:22

SKILL.md

readonly

related-skills.json

mesmo repositório

env-sync-push.md

from "PrimeIntellect-ai/research-environments"

Push all local environments to the Prime Intellect Environments Hub. Use when environments are out of sync and need to be published.

2026-05-1954

websearch.md

from "PrimeIntellect-ai/research-environments"

Search the web via the Exa API. Accepts up to 10 queries in parallel. Returns titles, URLs, and highlighted snippets from each result.

2026-05-1154

websearch.md

from "PrimeIntellect-ai/research-environments"

Search Google via the Serper API. Accepts up to 10 queries in parallel. Returns titles, URLs, snippets, and knowledge-graph data.

2026-05-1154

edit.md

from "PrimeIntellect-ai/research-environments"

Replace a unique string in a file. old_str must appear exactly once in the file.

2026-04-3054

cleanup-tasks.md

from "PrimeIntellect-ai/research-environments"

Diagnose and repair-or-remove broken tasks under environments/general_agent/tasks. Use after `general-agent validate` reports [FAIL] entries (post-synthesis, after a refactor, or on a periodic sweep).

2026-04-3054

open-webpage.md

from "PrimeIntellect-ai/research-environments"

Fetch a webpage via Exa and return an LLM-generated summary. Accepts an optional query that steers what the summary focuses on.

2026-04-2454

package.json

"author": "PrimeIntellect-ai"

"repository": "PrimeIntellect-ai/research-environments"

Abrir repositório GitHub Ver repositórios do creator

$ install --global

$ download --local

Executar no Manus

$ useful --forSOC

Professores de ciência da computação, pós-secundárioEducação e Biblioteconomia25-1021L4

name	synthesize-task
description	Synthesize a new general-agent task family from a seed task, evolving through difficulty tiers with empirical pass-rate gating. Use when asked to create new tasks or grow the task set.

Synthesize Task

Create a new task family for the general-agent environment. A task family is a sequence of task instances (tier 0 → tier N) in the same task family, where each tier is a superset of the previous: higher tiers may extend the DB schema with new entities and relationships, add new tools, and introduce harder constraints. Tier i+1 should be strictly more complex than tier i.

The overarching goal is to evolve a trivial task into an ultra-challenging one. Tier 0 should be dead simple (1-3 tool calls, small DB). By tier 4, the task should push even the strongest models to near-zero pass rates through a combination of large-scale data, layered constraints, and multi-step reasoning.

Task format

Each task instance is a directory under tasks/. Naming convention: <task>_t<tier> (e.g. hotel_booking_t0, hotel_booking_t1). The tier 0 seed is always <task>_t0.

<task>_t0/
├── task.toml              # [metadata] name, description, tier, parent, difficulty_methods, [[pass_rates]]
├── instruction.md         # agent prompt (what to do)
├── db.json                # initial database state (Pydantic-serialized)
├── tools.py               # TaskDB(DB), TaskTools(Tools), verify(db) — schema + tools + verifier
├── gold.json              # gold tool-call chain: [["tool_name", {kwargs}], ...]
└── gen_db.py              # (tier 2+ only) script that generates db.json — kept for reproducibility

The key abstractions (defined in general_agent/tools.py):

DB: Pydantic BaseModel loaded from db.json. The agent reads/writes this through tools.
Tools: Class with @tool-decorated methods that read/write self.db.
verify(db): A function in tools.py that checks whether the task goal is satisfied. Returns 1.0 on success, 0.0 on failure. This is REQUIRED — it allows scoring alternative valid solutions beyond the exact gold path.
@tool decorator: Plain decorator, no arguments. Methods get self (with self.db) pre-bound.

Example task tools.py:

from general_agent.tools import DB, Tools, tool
from pydantic import BaseModel

class Order(BaseModel):
    id: str
    customer: str
    total: float
    status: str = "pending"

class TaskDB(DB):
    orders: list[Order] = []

class TaskTools(Tools):
    db: TaskDB

    @tool
    def get_order(self, order_id: str) -> dict:
        """Look up an order by ID.

        Args:
            order_id: The order ID.
        """
        for o in self.db.orders:
            if o.id == order_id:
                return o.model_dump()
        raise ValueError(f"Order {order_id} not found")

    @tool
    def cancel_order(self, order_id: str) -> str:
        """Cancel an order.

        Args:
            order_id: The order ID to cancel.
        """
        for o in self.db.orders:
            if o.id == order_id:
                o.status = "cancelled"
                return f"Order {order_id} cancelled"
        raise ValueError(f"Order {order_id} not found")


def verify(db: TaskDB) -> float:
    """Check whether the task goal is satisfied.

    REQUIRED for every task. Must return 1.0 on success, 0.0 on failure.
    Should check the goal semantically, not just match the gold solution exactly.
    For example: "order ORD-001 is cancelled" rather than "DB matches gold hash".
    """
    order = next((o for o in db.orders if o.id == "ORD-001"), None)
    if order is None:
        return 0.0
    return 1.0 if order.status == "cancelled" else 0.0

IMPORTANT: Every tools.py MUST define a verify(db: TaskDB) -> float function. This function checks the task goal semantically — it should accept ANY valid solution, not just the exact gold path. The rubric uses verify(db) as a fallback when the agent's solution differs from the gold but is still correct.

Task Synthesis Procedure

Stage 0: Choose a task and design the DB schema

First, list every existing family so you don't collide with one:

general-agent list

Pick a family root name (the prefix before _t<N>, e.g. food_truck, hotel_booking). The following are FORBIDDEN — every root that already appears in general-agent list is taken, and the rubric rejects any rollout whose manifest collides with an existing family. Your root name MUST differ from every existing one — not just a suffix variant (food_truck_2, food_truck_v2), a different root word entirely. If unsure, pick a more specific niche (crepe_stand instead of food_truck).

Good tasks have:

Multiple entity types with relationships (e.g., customers + orders + products)
Natural tools to manipulate (read, modify, write) the data
Room for constraints that scale difficulty (numerical thresholds, cross-entity coupling)

Commit to the name before doing any other work. Write the chosen root name to a file so the harness can audit it:

mkdir -p /workspace/.synthesizer
echo "<root>" > /workspace/.synthesizer/family_name.txt

GATE 0a: the chosen root does not collide with any existing family. This must pass before you write any code:

root=$(cat /workspace/.synthesizer/family_name.txt)
if general-agent list | awk '{print $1}' | grep -E "^${root}(_t[0-9]+)?$" >/dev/null; then
  echo "COLLISION: '${root}' is already a family — pick a different root" >&2
  exit 1
fi
echo "OK: ${root} is available"

If this fails, pick a new root, overwrite family_name.txt, and re-run the check. Do NOT proceed to writing tools.py until it passes.

Now design the Pydantic DB schema and write tools.py with TaskDB(DB) and TaskTools(Tools).

GATE 0b: tools.py loads cleanly:

python -c "
from general_agent.utils import load_attr
from pathlib import Path
p = Path('tasks/<name>/tools.py')
assert load_attr(p, 'TaskDB') is not None, 'TaskDB not found'
assert load_attr(p, 'TaskTools') is not None, 'TaskTools not found'
assert load_attr(p, 'verify') is not None, 'verify function not found'
print('OK')
"

Stage 1: Write the seed task (tier 0)

Create the simplest useful tier for this task:

Write db.json with realistic seed data (5-20 entities). At tier 0, hand-writing the JSON is fine.
Write instruction.md — the task prompt, written as a natural user request (conversational tone, one or two paragraphs, no bullet points or numbered steps). The agent should figure out the steps from the description, not be told them explicitly.
Write gold.json — the gold tool-call chain
Write task.toml with tier = 0

The seed should require 1-3 tool calls and be easily solvable by any capable model.

IMPORTANT: Instructions must read like something a real user would type — not structured task specs. Bad: "Step 1: call get_recipe. Step 2: call check_pantry." Good: "I want to cook pasta tonight, can you check what ingredients I'm missing and add them to my shopping list?"

Dates must be unambiguous. If the gold solution passes a date string to a tool (e.g. date="2026-05-15"), the instruction must include the same explicit year (e.g. "for May 15, 2026" or "on 2026-05-15"). Never write dates as "May 15" / "May 15th" / "next Tuesday" and expect the agent to infer the year — solver models with pre-2026 pretraining cutoffs will default to 2025 (or whatever their cutoff implies) and the rollout will fail verify(db) despite otherwise solving the task. This was a corpus-wide failure mode in the GLM-5.1-FP8 RLM eval; ~30% of single-tier zero-pass-rate tasks were caused by it. Same rule applies to anything else that depends on "today's date" — make it explicit in the instruction.

GATE 1: Gold solution validates:

general-agent validate <name>

Stage 2: Empirical pass-rate check (tier 0)

Run the task against the solver model with exactly 20 rollouts (-r 20) to get a reliable difficulty estimate:

vf-eval general-agent-solver-local --disable-env-server -c -1 -b <solver_base_url> -k <solver_api_key_var> -m <solver_model> -n 1 -r 20 -d -a '{"task":"<name>"}'

You MUST use -r 20. Fewer rollouts give unreliable estimates — do not reduce this number.

Note: This command can take a few minutes. Set the command timeout high enough to avoid premature termination.

GATE 2: avg reward must be ≥ 0.80. If not, the seed is too hard — simplify the instruction or DB, then re-run. Do NOT proceed to tier 1 until this gate passes.

This gate is critical. If the solver command fails (e.g. API errors), fix the issue and retry. Do NOT skip pass-rate gating — tasks without empirical difficulty validation are worthless.

Record the pass rate (the reward: avg line from the output) in task.toml as an entry in the [[metadata.pass_rates]] array of tables:

[metadata]
name = "<task>_t0"
description = "..."
tier = 0
parent = ""
difficulty_methods = []

[[metadata.pass_rates]]
solver = "local"            # solver type: local, opencode, rlm
model = "openai/gpt-5-mini"
k = 20                      # number of rollouts (must be 20)
value = 0.95                # mean reward from the empirical pass-rate check

Each measurement is keyed on (solver, model, k); multiple entries are allowed if the task has been measured under more than one solver/model.

STAGE 3: Difficulty evolution (tiers 1-4)

For each tier k = 1, 2, 3, 4:

Copy the full previous tier directory to tasks/<task>_t<k>/ as a starting point. Use cp -rT tasks/<task>_t<k-1> tasks/<task>_t<k> — the -T flag treats the destination as a literal name and prevents the footgun where cp -r src dst/ nests src inside an already-existing dst/ (producing e.g. tasks/<task>_t4/<task>_t3/…). After copying, verify with ls tasks/<task>_t<k>/ — expect exactly db.json, gold.json, instruction.md, task.toml, tools.py and no nested tier directory.
Extend the task — each tier should be a superset of the previous. You are encouraged to:
- Add new DB entity types and relationships
- Add new tools (including distractor tools)
- Make the schema more complex
Add at least one constraint from this menu (pick the most natural for the task):
- Stricter numerical thresholds: budget limits, rating minimums, capacity caps
- Conditional rules: "if X then Y must also hold" (e.g., if luxury hotel then 4+ stars)
- Cross-entity coupling: no repeats across days, sum constraints, dependency chains
- Multi-step reasoning: answer requires combining results from 2+ tool calls
- Larger DB: more entities to search through, more distractors
- Ambiguity resolution: instruction requires the agent to disambiguate via tool calls, or tools have less informative names/docstrings/arg names that force exploration
- Tool proliferation: add more tools, including distractor tools that are irrelevant to the task but plausible-looking
- Noisy instructions: introduce realistic typos, misspellings, or grammatical errors into instruction.md — the kind a real user would make when typing quickly (e.g. "restraunt" for "restaurant", "teh" for "the", swapped letters, missing words). The agent must parse intent despite surface noise. Don't overdo it — 2-5 typos per instruction is enough to add friction without making it unreadable.
Update instruction.md — weave the new constraint into the natural language request. Do NOT list steps or mention tool names. The instruction should read like a user message, not a spec.
Update db.json if needed (more entities, edge cases).
- Tier 0-1: Hand-written JSON is fine (5-30 entities).
- Tier 2+: Generate db.json programmatically using a Python script (gen_db.py) placed in the task directory. The DB should contain hundreds or thousands of entities to force the agent to search, filter, and reason over large datasets. Use random.seed(42) for reproducibility. The script writes db.json to the same directory. Keep gen_db.py in the task directory — it is not part of the task runtime, but it must be committed for reproducibility so anyone can re-generate the DB. Example: a tier-3 hotel booking task might generate 500 hotels across 50 cities, 2000 room listings, and 300 guest profiles, so the agent must filter and cross-reference rather than eyeball the data.
Write gold.json — the gold solution MUST require strictly more tool calls than tier k-1
Write task.toml, recording which difficulty method(s) you used and the empirical pass rate. The difficulty_methods and [[metadata.pass_rates]] entries are required metadata for every tier (including tier 0). They document how each tier was made harder and how hard it actually is empirically.
```
[metadata]
name = "<task>_t<k>"
description = "..."
tier = <k>
parent = "<task>_t<k-1>"   # or "<task>_t0" for k=1
difficulty_methods = ["cross_entity_coupling", "conditional_rules"]  # mechanisms used to increase difficulty vs parent

[[metadata.pass_rates]]
solver = "local"
model = "openai/gpt-5-mini"
k = 20
value = 0.55              # avg@20 from empirical pass-rate check (GATE 3b)
```
Diversity requirement: across the entire task family (tiers 0-4), you must use at least 5 unique difficulty methods. Don't reuse the same method in consecutive tiers. The full set is: multi_step_reasoning, conditional_rules, cross_entity_coupling, stricter_thresholds, larger_db, ambiguity_resolution, tool_proliferation, schema_extension, noisy_instructions.

GATE 3a: Gold solution validates:

general-agent validate <task>_t<k>

GATE 3b: Empirical pass-rate is in the target band (exactly 50 rollouts):

vf-eval general-agent-solver-local --disable-env-server -c -1 -b <solver_base_url> -k <solver_api_key_var> -m <solver_model> -n 1 -r 20 -d -a '{"task":"<task>_t<k>"}'

Target bands per tier:

Tier 0: ≥ 80% (seed, must be easy)
Tier 1: 60-80%
Tier 2: 40-60%
Tier 3: 20-40%
Tier 4: 0-20%

If pass rate is above the band: constraint too weak — tighten it or add another. If pass rate is below the band: constraint too hard — relax it or simplify the DB.

Do NOT proceed to tier k+1 until tier k's gate passes. Do NOT skip this check — if the solver command fails, fix and retry. A task family where the solver gets 100% on all tiers is useless for training.

STAGE 4: Final validation

Lint and type-check all created task files:

ruff check tasks/<task>_t*/tools.py
ruff format --check tasks/<task>_t*/tools.py

Fix any issues before proceeding.

Run all tiers for the task you just created:

general-agent validate <task>
vf-eval general-agent-solver-local --disable-env-server -c -1 -b <solver_base_url> -k <solver_api_key_var> -m <solver_model> -n -1 -r 1 -d -v -a '{"task":"<task>"}'

Verify:

All task files pass ruff lint and format
All gold solutions validate
Pass rates decrease monotonically across tiers
No tier has 0% or 100% pass rate

Example: Evolved prompt with layered constraints (from DeepSeek V3.2)

This is a high-tier trip-planning task with cross-entity coupling, conditional budget rules, and multi-day constraints:

I'm planning a three-day trip starting from Hangzhou, and I need help creating an itinerary from October 1st to October 3rd, 2025. A few important requirements: I don't want to repeat any cities, hotels, attractions, or restaurants during the entire trip. Also, please make sure that every hotel, restaurant, and attraction you recommend is actually located in the city where I'll be staying that day. One more thing about the second day - I'm trying to be smart about my budget. If I end up booking a luxury hotel that costs 800 CNY or more per night, then I need to be more careful with other expenses: my total spending on both restaurants (lunch and dinner) should stay under 350 CNY, both restaurants should be rated at least 4.0 stars, and the afternoon attraction ticket needs to be less than 120 CNY. If the hotel on day 2 is in the mid-to-high range (500-800 CNY), then I have a bit more flexibility - I just need to make sure at least one of my restaurant choices is rated 4.0 or higher, and the attraction ticket should be below 180 CNY. For more affordable hotels (200-500 CNY range), I only need to ensure that at least one restaurant has a rating of 3.2 or above. Can you help me put together this itinerary?

Notice how this single prompt layers multiple difficulty levers:

Cross-entity coupling: no repeats across days for cities, hotels, attractions, restaurants
Location consistency: every recommendation must be in the correct city
Conditional budget rules: hotel price tier determines restaurant/attraction constraints
Numerical thresholds: specific price caps and rating minimums per budget tier

A tier-0 seed for this task would be: "Book a hotel in Hangzhou for October 1st." Each subsequent tier adds one more constraint.

synthesize-task

Mais deste repositório

Mais deste repositório

Synthesize Task

Task format

Task Synthesis Procedure

Stage 0: Choose a task and design the DB schema

Stage 1: Write the seed task (tier 0)

Stage 2: Empirical pass-rate check (tier 0)

STAGE 3: Difficulty evolution (tiers 1-4)

STAGE 4: Final validation

Example: Evolved prompt with layered constraints (from DeepSeek V3.2)

Synthesize Task

Task format

Task Synthesis Procedure

Stage 0: Choose a task and design the DB schema

Stage 1: Write the seed task (tier 0)

Stage 2: Empirical pass-rate check (tier 0)

STAGE 3: Difficulty evolution (tiers 1-4)

STAGE 4: Final validation

Example: Evolved prompt with layered constraints (from DeepSeek V3.2)