ds-baseline
// Use when a quest needs to attach, import, reproduce, repair, verify, compare, or publish a baseline and its metrics.
| name | ds-baseline |
| description | Use when a quest needs to attach, import, reproduce, repair, verify, compare, or publish a baseline and its metrics. |
| skill_role | stage |
| license | MIT |
| metadata | {"author":"ResearAI/DeepScientist","version":"1.0.0"} |
This skill establishes the reference system the quest will compare against. The target is one trustworthy baseline line, not an endless reproduction diary.
- bash_exec; do not use any other terminal path for setup, reproduction, monitoring, verification, Git, Python, package-manager, or file-inspection commands.
- bash_exec for setup, reproduction, monitoring, and verification commands so the baseline line stays durable and auditable.
- shell_command / command_execution in this skill.
- bash_exec(...).
- artifact.git(...) before raw shell git commands.
- bash_exec(...) in an isolated scratch repository.
- artifact.arxiv(paper_id=..., full_text=False) for actually reading a source arXiv paper when it exists
- full_text=True only when the short form is insufficient
- uv

The baseline stage should produce a usable reference point through one of four routes:
Keep the classic control flow:
These are control gates, not paperwork walls.
- PLAN.md and CHECKLIST.md; short-form files are enough for simple fast-path work.
- 1-2 sentence summary of trust status and next anchor.

Default to the lightest baseline path that can still establish a trustworthy comparison. Default to a fast path when it can establish trust with less work.
Fast path is the default when any of the following is true:
- requested_baseline_ref or confirmed_baseline_ref already points to the active baseline object
- attach or import

Fast path means:
- PLAN.md, a minimal CHECKLIST.md, one bounded smoke test when needed, and then one real validation or run

Escalate from fast path to fuller audit only when:
Do not proceed to comparison-heavy downstream work unless one of the following is durably true:
Operationally:
- artifact.confirm_baseline(...) once the accepted baseline root and trusted comparison contract are clear
- artifact.waive_baseline(...) when the quest must continue without a baseline

Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible PLAN.md and CHECKLIST.md.
- references/baseline-plan-template.md as the canonical structure for PLAN.md.
- references/baseline-checklist-template.md as the canonical structure for CHECKLIST.md.
- analysis_plan.md and REPRO_CHECKLIST.md remain acceptable compatibility alias files when an older quest already depends on them.
- PLAN.md and CHECKLIST.md are enough.
- PLAN.md before continuing.

Default retry discipline:
- repair, record blocked, or route through decision

The baseline stage should usually leave behind:
- baselines/local/ or baselines/imported/
- PLAN.md and CHECKLIST.md
- artifact.confirm_baseline(...), or an explicit waiver via artifact.waive_baseline(...)

For simple attach/import flows or a straightforward reproduce flow, do not stall just to precreate every optional note file.
Useful optional notes:
- setup.md
- execution.md
- verification.md
- STRUCTURE.md when the layout is non-obvious

- PLAN.md or compatibility alias analysis_plan.md is the required route contract before substantial setup, code edits, or a real run; it should state the route, source identity, command path, expected outputs, acceptance condition, main risks, and fallback.
- CHECKLIST.md or compatibility alias REPRO_CHECKLIST.md is the required living state tracker; it should show whether the baseline object, smoke decision, real run decision, and final accept / block / waive outcome are explicit.
- setup.md is optional unless environment or layout choices are non-trivial; if used, record the working directory, environment route, important config paths, source revision, and notable setup deviations.
- execution.md is optional unless the run is long, multi-step, or rerun-heavy; if used, record the launched commands, durable log paths, checkpoints, exit state, and any reruns or repairs.
- verification.md is optional as a filename but required in substance before acceptance or blocked closeout; either this file or an equivalent report should record trusted metrics, expected-versus-observed comparison, caveats, canonical output paths, and the next anchor.
- STRUCTURE.md becomes required when the workspace layout, mounts, symlinks, or generated outputs are non-obvious or meant for reuse; it should map the important directories and say which paths are canonical.
- attachment.yaml is required for attached or imported baselines under baselines/imported/; preserve source identity, selected variant when relevant, and attachment provenance there.
- <baseline_root>/json/metric_contract.json is the canonical accepted comparison contract; once the baseline is accepted, do not leave the authoritative metric surface only in chat, memory, or prose.
- Result/metric.md is scratch-only; it may help during execution, but it is never the final source of truth.

Minimum stability rules:
Use the real runtime paths consistently.
Quest-local paths:
- <quest_root>/baselines/local/<baseline_id>/
- <quest_root>/baselines/imported/<baseline_id>/
- <quest_root>/baselines/imported/<baseline_id>/attachment.yaml
- <baseline_root>/json/metric_contract.json
- <quest_root>/artifacts/baselines/<artifact_id>.json
- <quest_root>/artifacts/reports/<artifact_id>.json
- quest.yaml -> confirmed_baseline_ref

Global reusable registry paths:
- ~/DeepScientist/config/baselines/index.jsonl
- ~/DeepScientist/config/baselines/entries/<baseline_id>.yaml

- baseline_id should be short, stable, and filesystem-safe.
- `.`, `_`, or `-`
- `/`, `\\`, or `..`
- baseline_id with structured variants instead of inventing many near-duplicate entries
- default_variant_id, baseline_variants, and per-variant metric summaries stable enough that later experiment and write stages can cite them directly

Do not invent parallel durable locations when these runtime contracts already exist. Do not leave the authoritative metric contract only in chat, memory, or prose once the baseline is accepted.
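As a rough illustration of the naming and path rules above, the following sketch checks a baseline_id and derives the quest-local paths from it. The helper names `is_safe_baseline_id` and `baseline_paths` are hypothetical, and the allowed character set (letters, digits, `.`, `_`, `-`) and the 64-character cap are assumptions; only the forbidden `/`, `\\`, and `..` and the path layout come from the text.

```python
import re
from pathlib import Path

# Assumption: "filesystem-safe" means letters, digits, '.', '_' and '-' only.
_BASELINE_ID_RE = re.compile(r"[A-Za-z0-9._-]+")
MAX_ID_LENGTH = 64  # assumption: the text only says "short"


def is_safe_baseline_id(baseline_id: str) -> bool:
    """Return True when the id looks short and filesystem-safe."""
    if not baseline_id or len(baseline_id) > MAX_ID_LENGTH:
        return False
    # Reject path traversal; '/' and '\\' are already excluded by the regex.
    if baseline_id == "." or ".." in baseline_id:
        return False
    return _BASELINE_ID_RE.fullmatch(baseline_id) is not None


def baseline_paths(quest_root: str, baseline_id: str, imported: bool = False) -> dict:
    """Build the quest-local paths for one baseline (hypothetical helper)."""
    if not is_safe_baseline_id(baseline_id):
        raise ValueError(f"unsafe baseline_id: {baseline_id!r}")
    kind = "imported" if imported else "local"
    root = Path(quest_root) / "baselines" / kind / baseline_id
    paths = {
        "baseline_root": root,
        "metric_contract": root / "json" / "metric_contract.json",
    }
    if imported:
        # attachment.yaml applies to attached or imported baselines.
        paths["attachment"] = root / "attachment.yaml"
    return paths
```

Validating the id before building any path keeps a malformed id from escaping the baselines/ directory.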
If a baseline is reproduced only because an analysis campaign needs an extra comparator:
- artifact.confirm_baseline(...) for that supplementary case unless the quest truly intends to replace the canonical baseline

One quest may legitimately need more than one baseline.
Prefer this order:
Prefer reuse over redundant reproduction.
Before running anything substantial, determine:
Default analysis discipline:
- requested_baseline_ref or confirmed_baseline_ref, validate that concrete object before restarting broad discovery

Escalate to a fuller audit only when the command path is unclear, the repo is large or confusing, repair mode is active, or custom code changes look likely.
When the fuller audit is necessary, capture only what later stages truly need:
If the source paper is available, record:
You may inspect local feasibility with shell-based checks for OS, GPU, CPU, RAM, disk, Python version, and whether uv is available.
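A minimal sketch of such a feasibility probe, written in Python with the standard library only. The function name `probe_local_feasibility` is hypothetical, RAM is omitted because the standard library has no portable call for it, and GPU presence is approximated by checking whether `nvidia-smi` is on the PATH, which is an assumption about the machine rather than a real device query. In this skill the script would still be launched through bash_exec.

```python
import os
import platform
import shutil


def probe_local_feasibility(workdir: str = ".") -> dict:
    """Collect a small, read-only snapshot of local resources."""
    disk = shutil.disk_usage(workdir)
    return {
        "os": platform.platform(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "disk_free_gb": round(disk.free / 1e9, 1),
        # Presence on PATH only; this does not prove the tool works.
        "uv_available": shutil.which("uv") is not None,
        "nvidia_smi_available": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in probe_local_feasibility().items():
        print(f"{key}: {value}")
```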
The analysis phase should leave behind a concrete plan rather than only conversational intent.
Prepare the selected route:
For Python baselines, standardize environment setup around uv.
- uv
- uv.lock or a solid pyproject.toml, use uv sync
- uv venv
- uv pip install ...
- uv run ...

Practical rules:
- .venv
- uv run python ... or uv run bash ... over relying on shell activation state
- uv venv --python 3.11 or uv run --python 3.11 ...
- uv pip
- uv route when there is a concrete blocker that cannot be resolved locally

Common uv patterns:
- uv sync
- uv venv --python 3.11
- uv pip install -r requirements.txt
- uv run python scripts/smoke_test.py
- uv run python train.py --config ...

Setup should record:
- uv route and Python version

Fallbacks:
- analysis_plan.md or REPRO_CHECKLIST.md, keep the compatibility alias explicit rather than splitting truth across two active plans

Run only the work required to establish the baseline credibly.
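The uv environment route from the setup section can be sketched as a small decision helper. The function name `choose_uv_setup_commands` is hypothetical. Treating the mere presence of pyproject.toml as "solid", and falling back to `uv venv` plus `uv pip install -r requirements.txt` when no lock or project file exists, are assumptions; the commands themselves are the ones listed under the common uv patterns. The helper only returns command strings, which would then be run through bash_exec.

```python
from pathlib import Path


def choose_uv_setup_commands(repo_dir: str, python_version: str = "3.11") -> list[str]:
    """Pick uv setup commands for a Python baseline repo (illustrative only)."""
    repo = Path(repo_dir)
    # A lock file or project file means uv can resolve the environment itself.
    if (repo / "uv.lock").is_file() or (repo / "pyproject.toml").is_file():
        return ["uv sync"]
    # Assumed fallback: create a pinned venv, then install requirements if any.
    commands = [f"uv venv --python {python_version}"]
    if (repo / "requirements.txt").is_file():
        commands.append("uv pip install -r requirements.txt")
    return commands
```

Recording the returned list in setup.md would also cover the "uv route and Python version" item that setup is expected to capture.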
Execution rules:
Long-running execution discipline:
- bash_exec(mode='detach', ...)
- bash_exec(mode='history') or bash_exec(mode='list')
- bash_exec(mode='read', id=...) returns the full saved log when it is 2000 lines or fewer; for longer logs, inspect omitted middle windows with start and tail
- bash_exec(mode='read', id=..., tail_limit=..., order='desc'), and after the first read prefer incremental checks with after_seq=last_seen_seq
- silent_seconds, progress_age_seconds, signal_age_seconds, and watchdog_overdue as the default staleness clues
- bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...), document why, and relaunch cleanly
- 30-minute visibility bound pass without a real inspection and a next expected update time
- tqdm progress reporter and periodic __DS_PROGRESS__ markers when feasible

Keep retries bounded:
Verification is mandatory before baseline acceptance.
Verify:
Classify the outcome as one of:
- verified_match
- verified_close
- verified_diverged
- broken

Verification must explicitly separate:
Verification should answer:
A verification report should be self-contained enough that a later stage can answer:
The baseline stage is not complete just because something ran. It is complete when later stages can compare against it fairly.
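One way to make the outcome classes named earlier (verified_match, verified_close, verified_diverged, broken) mechanical is a relative-error check on the primary metric. This is only a sketch: the function name `classify_verification` is hypothetical, and the tolerances (0.5% for a match, 2% for close) are illustrative assumptions, not thresholds defined by this skill. A real verification should still weigh protocol differences and caveats, not just one number.

```python
import math
from typing import Optional


def classify_verification(
    expected: float,
    observed: Optional[float],
    match_tol: float = 0.005,  # assumption: 0.5% relative error counts as a match
    close_tol: float = 0.02,   # assumption: 2% relative error counts as close
) -> str:
    """Map an expected-versus-observed metric pair to an outcome label."""
    # No usable metric at all means the run is broken, not merely divergent.
    if observed is None or math.isnan(observed):
        return "broken"
    relative_error = abs(observed - expected) / max(abs(expected), 1e-12)
    if relative_error <= match_tol:
        return "verified_match"
    if relative_error <= close_tol:
        return "verified_close"
    return "verified_diverged"
```

For example, with an expected accuracy of 0.90, an observed 0.89 falls inside the assumed 2% band and is labeled verified_close, while 0.80 is labeled verified_diverged.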
Before declaring a baseline usable, make the comparability contract explicit:
Unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract.
If any of these fields are still materially unknown, do not pretend the baseline is a clean downstream reference.
For the fuller checklist and verdict meanings, read references/comparability-contract.md.
Before acceptance, classify feasibility as one of:
- full_reproducible
- degraded_but_acceptable
- blocked

And classify downstream trust as one of:
- verified
- partially_verified
- operational_but_incomparable
- failed

Do not silently upgrade a degraded or merely operational result into a normal trusted baseline.
The accepted baseline artifact should include at least:
- baseline_id
- baseline_kind
- path
- task
- dataset
- primary_metric
- metrics_summary
- environment
- source
- summary

If variants exist, also include:
- default_variant_id
- baseline_variants

Metric-contract rules:
- <baseline_root>/json/metric_contract.json
- primary_metric as the headline metric only; do not let it erase the rest of the comparison surface
- metrics_summary as a flat top-level dictionary keyed by the paper-facing metric ids
- description, either derivation or origin_path, and source_ref
- metrics_summary plus structured rows rather than one cherry-picked scalar
- json/metric_contract.json, reuse that richer contract instead of hand-writing a thinner one that keeps only one averaged scalar
- Result/metric.md is optional temporary scratch memory only; reconcile against it before calling artifact.confirm_baseline(...), but do not treat it as a required durable file

Use the registry deliberately, not as an afterthought.
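The field list and the flat metrics_summary rule above lend themselves to a quick pre-acceptance check. The sketch below is an assumption-laden illustration: `find_artifact_problems` is a hypothetical helper, and it checks only key presence and flatness, not whether the values are trustworthy. The field names are the ones listed in this section.

```python
REQUIRED_FIELDS = (
    "baseline_id", "baseline_kind", "path", "task", "dataset",
    "primary_metric", "metrics_summary", "environment", "source", "summary",
)


def find_artifact_problems(artifact: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in artifact]

    summary = artifact.get("metrics_summary")
    if summary is not None:
        if not isinstance(summary, dict):
            problems.append("metrics_summary must be a dictionary")
        else:
            # "Flat" is read here as: no nested dicts or lists as values.
            for key, value in summary.items():
                if isinstance(value, (dict, list)):
                    problems.append(f"metrics_summary[{key!r}] is not flat")

    # Variants come as a pair: a variant list needs a default.
    if "baseline_variants" in artifact and "default_variant_id" not in artifact:
        problems.append("baseline_variants present without default_variant_id")
    return problems
```

Running such a check before calling artifact.confirm_baseline(...) would catch a thin or nested metric surface early; it does not replace the verification note.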
If the result is reusable beyond the current quest:
- artifact.publish_baseline(...)
- publish_global: true only when verification is complete and reuse is justified

If the current quest should reuse an existing baseline:
- artifact.attach_baseline(...)
- baseline_id
- variant_id when one is used
- baselines/imported/

If runtime state already includes requested_baseline_ref or a matching confirmed_baseline_ref:
- baseline

For a clearer attach/import/reproduce/repair rubric, read references/route-selection.md.
For reusable-package expectations, read references/publishable-baseline-package.md.
Stage-start requirement:
- memory.list_recent(scope='quest', limit=5)
- memory.search(...) before new baseline analysis, repair, or rerun work
- requested_baseline_ref or confirmed_baseline_ref and the immediate task is only to validate or reattach that concrete baseline, you may skip broad retrieval

Write memory only for reusable lessons such as:
When calling memory.write(...), pass tags as an array like ["stage:baseline", "baseline:<baseline_id>", "type:repro-lesson"], not as one comma-joined string.
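The difference between the two tag shapes is easy to get wrong, so here is a small sketch of building the array form and of repairing a comma-joined string. Both helper names are hypothetical; the tag values follow the pattern shown above, and memory.write(...) itself is not called here because its full signature is not given in this document.

```python
def build_memory_tags(baseline_id: str, lesson_type: str = "repro-lesson") -> list[str]:
    """Build tags as a list of separate strings, one tag per element."""
    return ["stage:baseline", f"baseline:{baseline_id}", f"type:{lesson_type}"]


def normalize_tags(tags) -> list[str]:
    """Accept a list or a comma-joined string and always return a list."""
    if isinstance(tags, str):
        # A single comma-joined string is the shape to avoid; split it apart.
        return [part.strip() for part in tags.split(",") if part.strip()]
    return list(tags)
```

Passing the result of `build_memory_tags("my-baseline")` as the tags argument yields three separate tags instead of one long string.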
Stage-end requirement:
- memory.write(...) before leaving the stage

Typical artifact sequence:
- progress for long-running setup or execution checkpoints
- report for analysis notes or verification notes
- decision for route choice, blocked routing, or accept/reject/rerun/repair calls
- baseline only for an accepted baseline record

For stable field shapes, read references/artifact-payload-examples.md.
The baseline handoff should make these items obvious:
- baseline_id
- baseline_variant_id when relevant

If this packet is not obvious from the accepted artifact plus verification note, the baseline line is not stable enough yet.
Do not hide failures.
If blocked, record the class explicitly:
- missing_source
- missing_code
- missing_metric_contract
- environment_infeasible
- command_unknown
- run_failed
- verification_failed

A blocked result must state:
Reasonable autonomous fixes before escalation:
If a fix would change confirmed scope, metrics, permissions, or resource assumptions, stop and return to analysis rather than applying it silently.
Exit the baseline stage once one of the following is durably true:
Typical next anchors:
- idea
- experiment in tightly scoped follow-on cases
- decision if the baseline line remains contested