research-planning-cleanup

Use when doing post-hoc meta-review of completed cards, scanning for shallow/confabulated work that passed gate review, cleaning up planning debt, or identifying systemic Jerry-behaviour patterns across cards. Load when "planning cleanup", "meta-review", "completed card audit", or "card quality scan" is mentioned.

Install command

npx skills add https://github.com/dzackgarza/lattice-research --skill research-planning-cleanup

Copy and paste this command into Claude Code to install the skill.

Source

dzackgarza/lattice-research

Stars: 0

Forks: 0

Last updated: 18 May 2026 at 13:31

SKILL.md

name	research-planning-cleanup
description	Use when doing post-hoc meta-review of completed cards, scanning for shallow/confabulated work that passed gate review, cleaning up planning debt, or identifying systemic Jerry-behaviour patterns across cards. Load when "planning cleanup", "meta-review", "completed card audit", or "card quality scan" is mentioned.

Research Planning Cleanup

Meta-review and cleanup of completed cards. This skill operates one layer above gate-based per-card review. Gate review (see research-state-machine) checks that the work satisfies formal criteria. This skill asks: was the work substantively done, or did a subagent fill in plausible text without doing the work?

When to Use

Use this skill for:

Scanning recently completed cards for Jerry-behaviour (paraphrase-as-review, checklist theater, evidence-shaped evidence, self-certification).
Identifying cards whose evidence claims are plausible but whose claimed work could not have been done without producing evidence that is absent from the card.
Flagging completed cards whose reviewers produced zero findings — a structural Jerry signal (real artifacts have problems).
Cleaning up planning debt: cards that are misclassified, oversized, duplicated, or blocking downstream work by sitting in the wrong state.
Finding systemic patterns (three cards with the same shallow review, identical gate-justification text, no line numbers anywhere).

Do NOT use this skill for:

Initial gate review of a needs-agent-review card. That is the review kernel's job.
Plan approval. Plans are human-gated.
Proving mathematical correctness. That is research-proof-auditing.

Core Philosophy

Agents give their best effort, and this scan does not require every line to be proved. It looks for POSSIBLE positives: cards that may show any of the following failure modes.

Task value blindness: a review spends model judgment on checking receipts the user did not need a model for: file existence, checked boxes, row counts, or whether a worker reported doing the work. This is economically incoherent when the hard part is deciding whether the work is mathematically, technically, or strategically sound.
Self-certification: the worker's own report is treated as evidence that the work happened correctly. In this failure mode, self-report is biased toward approval: the agent knows what a good report should say, whether or not it performed the underlying inspection.
Reward-hacking: work that satisfies the formal shape of a task (status updated, gates marked, review log present) without the substance.
Brainless adherence: a subagent given a task card follows the letter of the acceptance criteria but fills in plausible text rather than discovering the actual answer through research, computation, or inspection.
Confabulation of card content: claims in the card body that read like work was done (commit hashes, source citations, evidence summaries) but which collapse under inspection — the commit doesn't contain what the card says, the source doesn't say what the card claims, the evidence is a paraphrase of the card's own claim.
Shallow work reported as complete: a task that required reading 11 spec files produced a review log that could have been written without opening any of them.

The Economic Argument

Falsified data, weak claims, and shallow reviews are not harmless shortcuts. They are technical and process debt that accrues interest and compounds:

A weak claim in card A becomes the foundation for card B.
Card B's implementer trusts the claim and builds on it.
Card C's review checks B against A, finds agreement, and passes.
The error is now three layers deep and requires disproving an entire chain of dependent work to correct.

The compounding cost means weak pillars must be found and vetted early. A card that took 10 minutes of shallow work to produce can cost hours to unwind if it poisons downstream cards. The meta-review pass is not bureaucracy; it is the cheapest possible intervention against compounding process debt.
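
The A, B, C chain above can be sketched as a reachability question: given a weak card, which completed cards build on it directly or transitively? This is a minimal sketch, assuming card dependencies are available as a plain mapping from each card to the cards it builds on. The `depends_on` mapping, the function name, and the card ids are hypothetical and not part of the skill.

```python
from collections import deque


def downstream_cards(depends_on: dict[str, list[str]], weak_card: str) -> list[str]:
    """Return every card that directly or transitively builds on weak_card."""
    # Invert the edges: for each card, which cards build on it?
    dependents: dict[str, list[str]] = {}
    for card, bases in depends_on.items():
        for base in bases:
            dependents.setdefault(base, []).append(card)

    # Breadth-first walk outward from the weak card.
    seen = {weak_card}
    order: list[str] = []
    queue = deque([weak_card])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical chain from the text: B builds on A; C checks B against A.
deps = {"B": ["A"], "C": ["B", "A"]}
print(downstream_cards(deps, "A"))
```

The length of the returned list is a rough proxy for how much work would have to be re-examined if the weak card's claim turns out to be false, which is why foundational cards are worth vetting first.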

Scanning Procedure

1. Select Cards

Scan cards with status: complete or status: done that were completed recently (last N days, last review session, current phase). Prioritize:

Cards whose reviewers share a model family with the implementer.
Cards whose review logs are short relative to the claimed work (a 5-line review for a card that claimed to audit 11 spec files).
Cards where multiple reviews produced identical or near-identical gate justifications.
Cards with zero negative findings across all gates.
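
The four prioritization criteria above could be turned into a rough ordering, for example by counting how many apply to each card. The sketch below is an illustration only: the `CardSummary` fields, the `min_lines_per_file` threshold, and the family labels are all assumptions, and "near-identical" is simplified to "identical after lowercasing and trimming". It decides where to read first; it does not replace reading.

```python
from dataclasses import dataclass


@dataclass
class CardSummary:
    """Hypothetical per-card summary; field names are assumptions."""
    card_id: str
    implementer_family: str
    reviewer_families: list[str]
    review_log_lines: int
    files_claimed: int
    gate_justifications: list[str]
    negative_findings: int


def priority_score(card: CardSummary, min_lines_per_file: int = 2) -> int:
    """Count how many of the four prioritization criteria apply (0 to 4)."""
    score = 0
    # Reviewer shares a model family with the implementer.
    if card.implementer_family in card.reviewer_families:
        score += 1
    # Review log is short relative to the claimed work (assumed threshold).
    if card.review_log_lines < card.files_claimed * min_lines_per_file:
        score += 1
    # Multiple reviews produced the same gate justification text.
    normalized = {text.strip().lower() for text in card.gate_justifications}
    if len(card.gate_justifications) > 1 and len(normalized) == 1:
        score += 1
    # No negative findings across all gates.
    if card.negative_findings == 0:
        score += 1
    return score


thin = CardSummary("card-012", "family-a", ["family-a"], 5, 11,
                   ["All acceptance criteria are met.",
                    "all acceptance criteria are met. "], 0)
solid = CardSummary("card-013", "family-a", ["family-b"], 60, 3,
                    ["Gate 1: rerun failed on the empty case.",
                     "Gate 2: source checked at section 4."], 2)

ranked = sorted([solid, thin], key=priority_score, reverse=True)
print([(card.card_id, priority_score(card)) for card in ranked])
```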

2. Apply Jerry Signals

For each selected card, check the Jerry structural invariants (see jerry-behaviour/references/jerry-patterns.md):

Signal	What to check
Zero negative findings	Did every gate pass? Real artifacts have problems.
No line numbers or code excerpts	Does the review cite specific files and lines, or only card-body paraphrases?
No external cross-checks	Did the reviewer check any claim against a source outside the card body?
Generic gate justifications	Could the same gate text apply to any card of this type?
Convergent language	Do multiple cards share nearly identical review language?
Fluency-biased praise	Does the review evaluate presentation ("well-structured") rather than correctness?
Status-only card diff	Does `git diff` of the card file change only the `status` line? A real review adds evidence to the card body under ## Review Log — the card grows content. A status line change with no body growth is a box check, not a review.
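
The "Status-only card diff" signal is mechanical enough to pre-screen. The sketch below takes the text of a unified diff for one card file (for example, saved `git diff` output) and reports whether every changed line is a `status:` line. The function name, sample diffs, and file path are hypothetical, and the check assumes the status field appears on a line beginning with `status:`.

```python
def is_status_only_diff(diff_text: str) -> bool:
    """True if the diff changes something and every changed line is a status line."""
    changed: list[str] = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file header follows; skip its ---/+++ lines
        elif line.startswith("@@"):
            in_hunk = True   # hunk header; added/removed lines come next
        elif in_hunk and line[:1] in ("+", "-"):
            text = line[1:].strip()
            if text:         # ignore blank-line changes
                changed.append(text)
    return bool(changed) and all(text.startswith("status:") for text in changed)


STATUS_ONLY = """\
diff --git a/cards/card-012.md b/cards/card-012.md
--- a/cards/card-012.md
+++ b/cards/card-012.md
@@ -1,4 +1,4 @@
 ---
 id: card-012
-status: needs-agent-review
+status: complete
 ---
"""

WITH_EVIDENCE = STATUS_ONLY + """\
@@ -20,0 +21,2 @@
+## Review Log
+- specs/hom.py:42 composition check misses the empty case
"""

print(is_status_only_diff(STATUS_ONLY))
print(is_status_only_diff(WITH_EVIDENCE))
```

A positive result only says the card body did not grow. Whether added body text is real evidence still has to be judged by reading it.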

2.5. Demand Synthesis

Before spot-checking details, ask what synthesis the review produced. A substantive review should change the reader's understanding of the source, mathematical owner, implementation boundary, QC-tooling obligation, or failure mode. It should state that change directly.

If the review only inventories files, confirms rows, quotes worker reports, or restates the card's acceptance criteria, classify it as shallow even if every listed item is true. Inventories may support synthesis; they do not replace it.

3. Spot-Check Evidence

For cards that trigger Jerry signals, do one spot-check:

If the review cites a commit hash, open that commit and verify it contains what the review claims.
If the review claims a source was checked, open the source and verify the specific claim.
If the review claims a test passed, run the test and verify the output.
If the card claims a certain input shape, configuration, or behavior was assumed, check whether that assumption was stated in the task card or was silently invented by the agent. An agent that assumes facts not in evidence is confabulating certainty. The Karpathy observation: "The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking."
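
For the commit-hash spot-check, one cheap first step is to compare the files a review says were changed against the files the commit actually touches. This is a sketch under assumptions: the helper names and paths are hypothetical, and `git diff-tree` with these flags lists nothing for a root commit (which needs `--root`) or, by default, for a merge commit. It shows which files are missing from the commit, not whether the content matches the claim, so the diff itself still needs to be opened.

```python
import subprocess


def files_touched_by_commit(repo_dir: str, commit: str) -> set[str]:
    """List paths changed by a commit, using `git diff-tree`."""
    result = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if the hash does not resolve
    )
    return {line for line in result.stdout.splitlines() if line}


def unsupported_claims(claimed_files: list[str], touched: set[str]) -> list[str]:
    """Return claimed files that the commit does not touch at all."""
    return [path for path in claimed_files if path not in touched]


# Hypothetical example with the commit's file list already in hand.
touched = {"specs/lattice.py"}
claimed = ["specs/lattice.py", "specs/hom.py"]
print(unsupported_claims(claimed, touched))
```

In a real repository the first argument to `unsupported_claims` would come from the review text and the second from `files_touched_by_commit(repo_dir, commit)`.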

One disproven claim is sufficient to flag the entire card as suspicious. You do not need to verify every claim — the goal is to find grounds for kicking back.

4. Classify Findings

For each suspicious card, classify the finding:

Classification	Description	Action
Confabulated evidence	A specific evidence claim is false (commit doesn't contain claimed change, source doesn't say what card claims).	Kick back with the disproven claim cited.
Shallow review	The review log passes gates but contains no evidence of actual inspection (no line numbers, no source cross-checks, zero findings).	Kick back. Require re-review by a different model family.
Undersized work	The task scope required substantial work (audit 11 files, implement a new category surface), but the evidence suggests a subagent filled in plausible prose.	Kick back. Require specific evidence: file-and-line findings, test output, diff excerpts.
Wrong assumptions	The agent assumed facts not stated in the task card (a specific input shape, a configuration value, a behavior contract) and proceeded without surfacing the assumption. The output is plausible but built on premises the task never supplied.	Kick back. Ask: "Where in the task card did you find this assumption?" If the agent cannot point to it, the work is confabulated.
Systemic pattern	Multiple cards from the same session show the same shallow-review pattern.	Flag the batch. Kick back the weakest exemplars. Create a phase-level note.
Genuine but thin	The card is probably correct but the review is too thin to be confident.	Kick back with a request for specific evidence. Not a rejection — a request for proof of work.
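
The "Systemic pattern" row depends on noticing that several cards carry nearly the same review language. A rough way to surface candidates is pairwise text similarity over review logs, sketched here with the standard library's `difflib.SequenceMatcher`. The 0.9 threshold, the function names, and the sample reviews are assumptions chosen for illustration. A flagged pair is a prompt to read both reviews, not a finding.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def convergent_pairs(reviews: dict[str, str], threshold: float = 0.9):
    """Return (card_a, card_b, ratio) for review texts at or above the threshold."""
    pairs = []
    for (a, text_a), (b, text_b) in combinations(sorted(reviews.items()), 2):
        ratio = SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs


# Hypothetical review logs.
reviews = {
    "card-1": "All gates pass. The work is well-structured and complete.",
    "card-2": "All gates pass.  The work is well-structured and complete.",
    "card-3": "Found an off-by-one at specs/hom.py:42; test_compose fails on empty input.",
}
flagged = convergent_pairs(reviews)
print([(a, b) for a, b, _ in flagged])
```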

5. Kick Back

When kicking back a card:

Set status: revision-required.
Add a dated entry to the card's Review Log explaining:
- Which specific Jerry signal was triggered (cite the invariant).
- What the spot-check found (quote the disproven claim, show the actual source content, show the test output).
- What concrete evidence would satisfy re-review (specific files to check, values to verify, tests to run with expected output).
- What made the card suspicious — not "this looks bad" but "the review claims X but the commit only contains Y" or "the review cites no line numbers despite claiming to audit 11 files."
Do NOT punish. The implementer and reviewer were acting in good faith. The feedback should be: "This specific claim needs verification. Here is how to verify it. When you have done so, resubmit with the evidence."
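
The four required parts of a kick-back entry can be kept together by building the entry from named fields, so that none is dropped. This is a sketch only: the entry layout, the function name, and the example values are hypothetical, and the skill does not prescribe a format beyond a dated entry in the card's Review Log.

```python
from datetime import date
from typing import Optional


def kickback_entry(signal: str, spot_check: str, suspicion: str,
                   evidence_needed: list[str], on: Optional[date] = None) -> str:
    """Format a dated Review Log entry for a card set to revision-required."""
    on = on or date.today()
    lines = [
        f"- {on.isoformat()}: status set to revision-required",
        f"  - Signal triggered: {signal}",
        f"  - Spot-check result: {spot_check}",
        f"  - Why the card looked suspicious: {suspicion}",
        "  - Evidence that would satisfy re-review:",
    ]
    # One sub-bullet per concrete item the implementer can act on.
    lines += [f"    - {item}" for item in evidence_needed]
    return "\n".join(lines)


entry = kickback_entry(
    signal="No line numbers or code excerpts",
    spot_check="The cited commit touches specs/lattice.py only, not specs/hom.py.",
    suspicion="The review cites no line numbers despite claiming to audit 11 files.",
    evidence_needed=["File-and-line findings for each audited spec file",
                     "Output of the test run the review refers to"],
    on=date(2026, 5, 18),
)
print(entry)
```

Keeping the suspicion and the requested evidence as separate fields makes it harder to write "this looks bad" without saying what would resolve it.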

6. Avoid Churn

This scan is surgical, not bureaucratic. Do not:

Kick back cards because the review could have been more thorough, when the existing review is substantively adequate.
Require re-review of every card in a batch when only some are suspicious.
Create new process rules, tracking systems, or metadata fields to "prevent this from happening again." The Jerry patterns are already documented.
Turn the scan into a checklist. If you find yourself checking items off a list rather than reading cards and thinking about whether the work was real, you have become the next Jerry.

A good scan kicks back 2-5 suspicious cards and leaves the rest alone. A scan that kicks back 20 cards is either identifying a systemic crisis (which should be escalated to a phase-level note, not 20 individual kicks) or is itself a Jerry scan — checking boxes rather than reading evidence.

Cross-References

jerry-behaviour: The structural invariants, detection signals, and countermeasure principles. Load before scanning.
anti-slop: Surface-level slop patterns (boilerplate prose, generic names). Jerry-behaviour is the meta-level evaluator failure; anti-slop is the surface-level artifact failure. Both are relevant.
research-state-machine/references/review-kernel.md: The gate protocol that every card passes through before reaching complete. The meta-review checks whether the gate review was substantive or performative.
research-proof-auditing: For spot-checking mathematical claims. Use when the suspicious card contains mathematical assertions to verify.
category-spec-audit: For Red Flag Log requirements. A card whose review includes no introspection red flags despite touching implementation code is suspicious — real code has isinstance/hasattr patterns, and a reviewer who found none was not looking.