execution-grounded-selection

Pick among candidate outputs (code, configs, plans) by running them on diverse inputs and clustering by behavioural fingerprint, rather than by textual aggregation or log-probability. Activates when an executor returns multiple plausible candidates that need disambiguation, when output-majority voting would be the default choice, or when reviewing generated code that has not yet been validated. The 2026 evidence (Semantic Voting, arxiv 2605.08680v1) is that any execution-based selector dominates output-majority voting by 19-52pp; sketch-generated inputs beat random fuzz by 11.3pp. Triggers: "pick the best candidate", "majority vote on code", "select from N samples", "validate the generated output", "behavioural verification".

Stars: 64

Forks: 7

Updated: June 4, 2026 at 20:18

Source

Tibsfox

Tibsfox/gsd-skill-creator

SKILL.md

name	execution-grounded-selection
description	Pick among candidate outputs (code, configs, plans) by running them on diverse inputs and clustering by behavioural fingerprint, rather than by textual aggregation or log-probability. Activates when an executor returns multiple plausible candidates that need disambiguation, when output-majority voting would be the default choice, or when reviewing generated code that has not yet been validated. The 2026 evidence (Semantic Voting, arxiv 2605.08680v1) is that any execution-based selector dominates output-majority voting by 19-52pp; sketch-generated inputs beat random fuzz by 11.3pp. Triggers: "pick the best candidate", "majority vote on code", "select from N samples", "validate the generated output", "behavioural verification".
user-invocable	true
version	1.0.0
format	"2025-10-02T00:00:00.000Z"
triggers	["pick the best candidate","majority vote on code","select from N samples","validate the generated output","behavioural verification of code"]
updated	"2026-05-16T00:00:00.000Z"
status	ACTIVE
source	arxiv 2605.08680v1 (Semantic Voting), 2605.07248v1 (Plan-on-Trigger)

Execution-Grounded Selection

Why

Output-majority voting (pick the most common string output) is dominated by any selector that actually runs the candidates. Semantic Voting (arxiv 2605.08680v1) reports a 19-52 percentage-point (pp) improvement over output voting across multiple code benchmarks. The choice of aggregation rule (majority, weighted, MBR-Exec) makes no statistically distinguishable difference once execution evidence is present: execution is the dominant signal and aggregation is the residual.
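A minimal sketch of the contrast above, assuming toy candidates held as Python source strings that each define a function `f`. The names `CANDIDATES`, `INPUTS`, `textual_majority`, and `behavioural_majority` are illustrative and do not come from the paper. Textual voting picks the string that appears most often; clustering by execution picks the behaviour that most candidates agree on.

```python
from collections import Counter, defaultdict

# Hypothetical candidate pool: three textually different but equivalent
# doublers, plus one wrong answer that happens to be sampled twice.
CANDIDATES = [
    "def f(x): return x * 2",
    "def f(x): return x + x",
    "def f(x): return 2 * x",
    "def f(x): return x ** 2",
    "def f(x): return x ** 2",
]
INPUTS = [0, 1, 2, 3]


def textual_majority(candidates):
    """Output-majority voting: the most frequent string wins."""
    return Counter(candidates).most_common(1)[0][0]


def fingerprint(source, inputs):
    """Run one candidate on every input and return its outputs as a tuple."""
    namespace = {}
    # Assumption: exec is acceptable for a toy example only.
    # Real candidates should run inside a sandbox.
    exec(source, namespace)
    return tuple(namespace["f"](x) for x in inputs)


def behavioural_majority(candidates, inputs):
    """Group candidates by fingerprint and return one from the largest group."""
    clusters = defaultdict(list)
    for source in candidates:
        clusters[fingerprint(source, inputs)].append(source)
    largest = max(clusters.values(), key=len)
    return largest[0]


text_pick = textual_majority(CANDIDATES)
exec_pick = behavioural_majority(CANDIDATES, INPUTS)
print("textual vote:  ", text_pick)
print("execution vote:", exec_pick)
```

Here the textual vote lands on the duplicated squaring function, while the execution vote lands on a doubler, because three distinct strings share one fingerprint.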

This is the code-domain analogue of the noise-as-exploration Rosetta concept: execution diversity is the exploration mechanism; behavioural fingerprint is the equilibrium signal.

How

The Semantic Voting pipeline:

1. Sample N candidates: generate at temperature > 0 (typical N = 5-10).
2. Generate diverse inputs: sketch-generated inputs (derived from candidate population structure) beat random fuzz by ~11pp. If sketch generation is infeasible, fall back to LLM-generated test inputs in preference to random fuzz.
3. Execute each candidate on each input: collect (candidate_i, input_j, output_ij) tuples. A crash counts as a distinct fingerprint, not a discard.
4. Cluster by fingerprint: two candidates belong to the same cluster when [output_ij for j in inputs] is identical for both.
5. Pick the largest cluster: break ties by candidate self-confidence or by a Pareto comparison on execution cost.
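The execute, cluster, and pick steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: candidates are plain Python callables, the inputs are already generated, outputs are compared by `repr`, and ties are broken by a caller-supplied `confidence` list. The names `run_one`, `fingerprint`, and `select` are hypothetical.

```python
from collections import defaultdict


def run_one(candidate, value):
    """Return a hashable outcome; a crash is recorded, not discarded."""
    try:
        return ("ok", repr(candidate(value)))
    except Exception as exc:
        return ("crash", type(exc).__name__)


def fingerprint(candidate, inputs):
    """The behavioural fingerprint: one outcome per input, in order."""
    return tuple(run_one(candidate, value) for value in inputs)


def select(candidates, inputs, confidence=None):
    """Return (index of the chosen candidate, fingerprint -> member indices)."""
    if confidence is None:
        confidence = [0.0] * len(candidates)
    clusters = defaultdict(list)
    for index, candidate in enumerate(candidates):
        clusters[fingerprint(candidate, inputs)].append(index)

    def cluster_key(members):
        # Largest cluster first; ties go to the cluster whose best
        # member has the highest self-confidence (an assumed tie-break).
        return (len(members), max(confidence[i] for i in members))

    best = max(clusters.values(), key=cluster_key)
    winner = max(best, key=lambda i: confidence[i])
    return winner, dict(clusters)


# Hypothetical candidates for "absolute value".
CANDIDATES = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,            # wrong for negative inputs
    lambda x: x * x // x,   # raises ZeroDivisionError on 0
]
INPUTS = [-3, 0, 5]

winner, clusters = select(CANDIDATES, INPUTS)
for fp, members in clusters.items():
    print(members, fp)
print("chosen candidate index:", winner)
```

The crashing candidate forms its own cluster instead of being dropped, so a population that mostly crashes in the same way remains visible in the cluster sizes. Comparing outputs by `repr` is a simplification; a real setup may need a domain-specific equivalence (for example, tolerance on floats).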

When to skip

Generation is deterministic (temperature = 0) — there's nothing to select among.
Execution is expensive or has side effects (touches the network, modifies state) — fall back to dry-run / static analysis.
The output isn't executable (e.g., natural-language summary) — use paired-trace audit instead.
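The skip conditions above can be expressed as a small routing check. This is a sketch only: the `Job` fields, the `Route` labels, and the order in which the conditions are tested are assumptions made for illustration, not part of the skill's interface.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    EXECUTE = "execution-grounded selection"
    NO_SELECTION = "nothing to select among"
    STATIC = "dry-run / static analysis"
    TRACE_AUDIT = "paired-trace audit"


@dataclass(frozen=True)
class Job:
    # Hypothetical description of a selection request.
    temperature: float
    executable: bool
    has_side_effects: bool
    expensive: bool


def route(job: Job) -> Route:
    """Map a job to the fallback named in the skip list, or to execution."""
    if job.temperature == 0:
        # Deterministic generation yields a single candidate.
        return Route.NO_SELECTION
    if not job.executable:
        # For example, a natural-language summary.
        return Route.TRACE_AUDIT
    if job.has_side_effects or job.expensive:
        # Touching the network or modifying state is not safe to repeat.
        return Route.STATIC
    return Route.EXECUTE


decision = route(Job(temperature=0.8, executable=True,
                     has_side_effects=False, expensive=False))
print(decision.value)
```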

Integration

wrap:verify and gsd-verify-work — pre-merge gate when verification produces multiple candidate fixes.
gsd-code-fixer — when the fixer proposes more than one fix per finding.
code-review — adds a "did you actually run it?" subsection to the review rubric.

Cross-references

Rosetta concept #9 (Execution-Grounded Selection) — canonical definition
College: agent-systems / agentic-code-generation / agent-execution-grounded-selection
Related skills: test-generator (generates the inputs; pair with this skill for the full loop)

