improvement-methodology

Methodology for analyzing factory telemetry and proposing evidence-based improvements. Query patterns, evidence standards, and risk classification. Injected into Oracle's context.

Install command

npx skills add https://github.com/gulati8/justice-league-factory --skill improvement-methodology

Copy and paste this command into Claude Code to install the skill

Source

gulati8/justice-league-factory

Stars: 3

Forks: 0

Updated: April 28, 2026 at 16:46

SKILL.md

name	improvement-methodology
description	Methodology for analyzing factory telemetry and proposing evidence-based improvements. Query patterns, evidence standards, and risk classification. Injected into Oracle's context.
user-invocable	false
disable-model-invocation	true
last_reviewed	"2026-04-28T00:00:00.000Z"

Improvement Methodology

This methodology guides how you analyze the factory's performance and propose changes. Every proposal must be backed by evidence from the telemetry database: not intuition, not best practices, not "I think this would be better."

Querying Telemetry

The telemetry database is at eval/factory.db (SQLite). Use Bash to query it.

Common Queries

Agent failure rates:

SELECT agent, 
       COUNT(*) as total_runs,
       SUM(CASE WHEN verdict='fail' THEN 1 ELSE 0 END) as failures,
       ROUND(100.0 * SUM(CASE WHEN verdict='fail' THEN 1 ELSE 0 END) / COUNT(*), 1) as fail_rate
FROM agent_runs 
GROUP BY agent 
ORDER BY fail_rate DESC;

Token usage by agent (cost optimization):

SELECT agent, model,
       AVG(output_tokens) as avg_tokens,
       MIN(output_tokens) as min_tokens,
       MAX(output_tokens) as max_tokens
FROM agent_runs
GROUP BY agent, model;

Duration by agent (slowest first):

SELECT agent,
       AVG(duration_ms) as avg_duration,
       MAX(duration_ms) as max_duration
FROM agent_runs
GROUP BY agent
ORDER BY avg_duration DESC;

Failed run transcripts (for root cause analysis):

SELECT ar.agent, ar.verdict, at.prompt_text, at.response_text
FROM agent_runs ar
JOIN agent_transcripts at ON ar.id = at.agent_run_id
WHERE ar.verdict = 'fail'
ORDER BY ar.started_at DESC;
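
The queries above can also be run from a script. The following is a minimal sketch using Python's standard sqlite3 module against an in-memory stand-in for eval/factory.db. The table layout in demo_connection is an assumption inferred from the column names in the queries, not the factory's actual schema, and both function names are hypothetical.

```python
import sqlite3

# Same failure-rate query as in the text, kept as a constant for reuse.
FAIL_RATE_SQL = """
SELECT agent,
       COUNT(*) AS total_runs,
       SUM(CASE WHEN verdict = 'fail' THEN 1 ELSE 0 END) AS failures,
       ROUND(100.0 * SUM(CASE WHEN verdict = 'fail' THEN 1 ELSE 0 END) / COUNT(*), 1) AS fail_rate
FROM agent_runs
GROUP BY agent
ORDER BY fail_rate DESC;
"""


def failure_rates(conn: sqlite3.Connection) -> list:
    """Return (agent, total_runs, failures, fail_rate) rows, worst agent first."""
    return conn.execute(FAIL_RATE_SQL).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in database (ASSUMED, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, agent TEXT, verdict TEXT)"
    )
    rows = [
        ("wonder-woman", "fail"),
        ("wonder-woman", "pass"),
        ("cyborg", "pass"),
        ("cyborg", "pass"),
    ]
    conn.executemany("INSERT INTO agent_runs (agent, verdict) VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    for row in failure_rates(demo_connection()):
        print(row)
```

To point the sketch at real telemetry, pass sqlite3.connect("eval/factory.db") instead of demo_connection().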

Evidence Standards

What Counts as a Pattern

A single failure is noise — don't propose changes for one-off issues
Two similar failures are suspicious — note them but don't propose yet
Three or more similar failures are a pattern — propose a change
A consistent behavior across ALL runs (e.g., token waste) is a pattern even if no individual run "failed"
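
The thresholds above can be sketched as a small classifier. The function name and the "none" label for zero failures are hypothetical; the three tiers come from the list above.

```python
def classify_recurrence(similar_failures: int, seen_in_all_runs: bool = False) -> str:
    """Map a count of similar failures to an evidence tier."""
    if seen_in_all_runs:
        # Consistent behavior across every run is a pattern even without failures.
        return "pattern"
    if similar_failures >= 3:
        return "pattern"      # propose a change
    if similar_failures == 2:
        return "suspicious"   # note it, but do not propose yet
    if similar_failures == 1:
        return "noise"        # one-off issue, no proposal
    return "none"             # hypothetical label: nothing observed
```

Only the "pattern" tier justifies a proposal; the other tiers are worth recording so that a later run can tip them over the threshold.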

Citing Evidence

Every proposal must reference specific data:

"Wonder Woman failed in 4 of 8 runs (50% failure rate)"
"In runs #3, #5, #7, the failure was TypeScript 'any' types flagged as critical"
"Cyborg averages 45,000 tokens on Opus vs 38,000 tokens on Sonnet with no quality difference (both 100% pass rate from Wonder Woman)"

Reading Transcripts

When metrics show a pattern, read the transcripts to understand WHY:

What did the agent actually say/do?
Was the issue in the prompt, the skill, or the task itself?
Would a prompt change fix it, or is this a fundamental limitation?

Bumping last_reviewed When You Modify a Skill

Every skill carries a last_reviewed: YYYY-MM-DD field in frontmatter. When your improvements.json proposes changing a skill, the proposed_value MUST also update last_reviewed to today's date. This is non-negotiable: without the bump, the field stops tracking validation cadence and decays into noise within a few cycles.

If a proposal modifies skill text but leaves last_reviewed unchanged, the proposal is incomplete. Wonder Woman should reject the resulting PR.
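
A reviewer could check the bump mechanically. This is a minimal sketch, assuming proposed_value contains the full skill text with a `last_reviewed: YYYY-MM-DD` line; the function names are hypothetical, and the regex searches the whole text rather than parsing the frontmatter strictly.

```python
import re
from datetime import date
from typing import Optional

# Matches a `last_reviewed: YYYY-MM-DD` line, with or without quotes.
LAST_REVIEWED_RE = re.compile(
    r"^last_reviewed:\s*[\"']?(\d{4}-\d{2}-\d{2})", re.MULTILINE
)


def extract_last_reviewed(skill_text: str) -> Optional[str]:
    """Return the YYYY-MM-DD value of last_reviewed, or None if absent."""
    match = LAST_REVIEWED_RE.search(skill_text)
    return match.group(1) if match else None


def bumps_last_reviewed(proposed_value: str, today: Optional[date] = None) -> bool:
    """True when the proposed skill text sets last_reviewed to today's date."""
    today = today or date.today()
    return extract_last_reviewed(proposed_value) == today.isoformat()
```

A proposal for which bumps_last_reviewed returns False would be treated as incomplete under the rule above.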

Skill Attribution

Skills are loaded statically per agent through .claude/agents/<name>.md frontmatter (skills: line, comma-separated). To attribute a run's outcome to the skills that shaped it, two paths are available:

1. From the skills_loaded column (preferred when present): The telemetry hook records the skills active at the time of each agent run.

SELECT agent, skills_loaded FROM agent_runs WHERE id = ?;

2. From current agent frontmatter (fallback): Read .claude/agents/<agent_name>.md, extract the skills: line, split on commas. Use this when analyzing runs older than the column's introduction or when validating recent changes against the latest config.
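
The fallback path can be sketched as follows. It assumes the agent file opens with a frontmatter block delimited by `---` lines and that `skills:` sits on a single line; the function name is hypothetical.

```python
def skills_from_frontmatter(agent_markdown: str) -> list:
    """Extract the comma-separated `skills:` line from an agent file's frontmatter."""
    lines = agent_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block (ASSUMED delimiter: a line of ---)
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; ignore anything in the body
        if line.startswith("skills:"):
            raw = line[len("skills:"):]
            return [name.strip() for name in raw.split(",") if name.strip()]
    return []
```

Because this reads the current file, it reflects today's configuration, which may differ from what was loaded when an older run executed.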

Staleness as Evidence Weight

last_reviewed is a prior on improvement priority, not a signal on its own. Combine staleness with metrics before proposing:

Skill state and metrics	Interpretation	Action
Stale (>90d), high failure rate	Two signals agreeing	Propose change (high priority)
Stale (>90d), healthy metrics	Don't fix what isn't broken	No proposal; staleness alone isn't evidence
Recent (<14d edit), high failure rate	Edit hasn't soaked yet	Wait one or two more runs before re-proposing
Recent (<14d edit), healthy metrics	Working as intended	No proposal
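
The table can be read as a small decision function. This is a sketch only: the source does not define "high failure rate", so HIGH_FAIL_RATE is an assumed placeholder, and the function returns None for skills between 14 and 90 days old because the table gives no guidance there.

```python
from typing import Optional

STALE_DAYS = 90         # "Stale (>90d)" in the table
RECENT_DAYS = 14        # "Recent (<14d edit)" in the table
HIGH_FAIL_RATE = 0.25   # ASSUMPTION: the threshold for "high" is not given in the source


def staleness_action(days_since_review: int, fail_rate: float) -> Optional[str]:
    """Combine staleness with metrics; staleness alone never triggers a proposal."""
    high = fail_rate >= HIGH_FAIL_RATE
    if days_since_review > STALE_DAYS:
        return "propose change (high priority)" if high else "no proposal"
    if days_since_review < RECENT_DAYS:
        return "wait one or two more runs" if high else "no proposal"
    return None  # 14 to 90 days: not covered by the table
```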

Compute staleness from last_reviewed first. If the field is missing, or you suspect a recent edit didn't bump it, fall back to the date of the file's last commit:

git log -1 --format=%ai -- .claude/skills/<name>/SKILL.md
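
Turning a last_reviewed value into a day count can be sketched as below. The helper name is hypothetical; it reads only the leading YYYY-MM-DD, so it accepts both the plain date form and a full timestamp such as the one in this skill's frontmatter.

```python
from datetime import date


def staleness_days(last_reviewed: str, today: date) -> int:
    """Days elapsed between the last_reviewed date and today."""
    reviewed = date.fromisoformat(last_reviewed[:10])  # keep only YYYY-MM-DD
    return (today - reviewed).days
```

The same helper works on the first ten characters of the `git log --format=%ai` output, which also begin with a YYYY-MM-DD date.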

Skill Health Pass

Before producing improvements.json, build a per-skill health snapshot:

SELECT
  json_each.value AS skill,
  COUNT(*)        AS runs,
  SUM(CASE WHEN verdict='fail' THEN 1 ELSE 0 END) AS fails,
  MAX(started_at) AS last_run
FROM agent_runs, json_each(agent_runs.skills_loaded)
WHERE skills_loaded IS NOT NULL
GROUP BY skill;

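(json_each has been built into SQLite by default since 3.38.0; older builds need the JSON1 extension compiled in. The query assumes skills_loaded holds a JSON array. If skills_loaded is unavailable for a given run, fall back to grouping by agent and joining via current frontmatter.)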

For each skill, compute the failure rate (fails / runs) from the snapshot, join it with the skill's last_reviewed, and emit a skill_health entry in improvements.json. Sort by (staleness_days × failure_rate) descending; the top of that list is where to focus the next analysis cycle. A skill's appearance there does NOT itself constitute a proposal; it is a pointer to where to look harder.
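
The ranking step can be sketched like this. The SkillHealth shape is hypothetical, since the source does not specify the fields of a skill_health entry; only the sort key (staleness_days × failure_rate) is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SkillHealth:
    """Hypothetical per-skill snapshot; field names are illustrative."""
    skill: str
    runs: int
    fails: int
    staleness_days: int

    @property
    def failure_rate(self) -> float:
        # Guard against skills with no recorded runs.
        return self.fails / self.runs if self.runs else 0.0

    @property
    def priority(self) -> float:
        # Sort key from the text: staleness_days x failure_rate.
        return self.staleness_days * self.failure_rate


def rank_skills(entries: list) -> list:
    """Order snapshots so the skills most worth a closer look come first."""
    return sorted(entries, key=lambda entry: entry.priority, reverse=True)
```

A stale skill with zero failures scores 0 and sinks to the bottom, which matches the rule that staleness alone is not evidence.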

Risk Classification

Safe (auto-apply to branch)

Changes that can only help, never hurt:

Adding examples to prompts (shows desired behavior)
Clarifying ambiguous instructions (reduces misinterpretation)
Tightening severity definitions (reduces false positives)
Adding a constraint that prevents observed errors

Needs Review (document in PR body only)

Changes that could affect other agents:

Modifying output schema fields
Changing workflow steps
Altering what artifacts an agent produces
Changing model assignments

Dangerous (document in PR body with warning)

Changes that weaken safety:

Expanding tool access for any agent
Removing constraints
Adding a new agent to the roster
Changing read-only agents to read-write
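
The three tiers map to three handling rules, sketched below as a lookup. Only "safe" appears as a risk_level value in the source; the identifiers "needs_review" and "dangerous" are assumptions, as are the function names.

```python
# Handling text is taken from the tier headings above.
RISK_HANDLING = {
    "safe": "auto-apply to branch",
    "needs_review": "document in PR body only",            # ASSUMED identifier
    "dangerous": "document in PR body with warning",       # ASSUMED identifier
}


def handling_for(risk_level: str) -> str:
    """Return the handling rule for a tier, rejecting unknown tiers."""
    if risk_level not in RISK_HANDLING:
        raise ValueError(f"unknown risk level: {risk_level!r}")
    return RISK_HANDLING[risk_level]


def may_auto_apply(risk_level: str) -> bool:
    """Only the safe tier is applied without human review."""
    return handling_for(risk_level) == "auto-apply to branch"
```

Rejecting unknown tiers rather than defaulting to "safe" keeps a typo from silently auto-applying a change.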

PR Format

Your PR should be structured for easy human review:

Title: "Oracle: [N] improvements based on [M] factory runs"

Body:

## Applied Changes (safe)
- [Change 1]: [rationale] — Evidence: [citation]
- [Change 2]: [rationale] — Evidence: [citation]

## Proposed Changes (needs review)
- [Change 3]: [rationale] — Evidence: [citation]

## Flagged Concerns (dangerous)
- [Change 4]: [rationale] — Evidence: [citation]

## Telemetry Summary
- Runs analyzed: [N]
- Overall pass rate: [X]%
- Most frequent failures: [agent] ([rate]%)
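
Building the title and sorting proposals into the template's sections can be sketched as follows. The section headings are copied from the template; the risk_level identifiers other than "safe" and both function names are assumptions.

```python
# (risk_level, section heading) pairs in template order.
SECTIONS = [
    ("safe", "## Applied Changes (safe)"),
    ("needs_review", "## Proposed Changes (needs review)"),   # ASSUMED identifier
    ("dangerous", "## Flagged Concerns (dangerous)"),         # ASSUMED identifier
]


def pr_title(improvement_count: int, run_count: int) -> str:
    """Fill in the title pattern from the template."""
    return f"Oracle: {improvement_count} improvements based on {run_count} factory runs"


def group_by_section(proposals: list) -> dict:
    """Group proposal dicts under their section heading, keeping input order.

    Raises KeyError for a risk_level that has no section.
    """
    heading_for = dict(SECTIONS)
    grouped = {heading: [] for _, heading in SECTIONS}
    for proposal in proposals:
        grouped[heading_for[proposal["risk_level"]]].append(proposal)
    return grouped
```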

Improvement Schema

Your output must conform to .claude/schemas/improvement.schema.json:

{
  "run_count_analyzed": 8,
  "patterns_detected": ["description of each pattern"],
  "proposals": [
    {
      "target_agent": "wonder-woman",
      "change_type": "prompt",
      "current_value": "current text",
      "proposed_value": "proposed text",
      "rationale": "why this change helps",
      "evidence": ["run #3: ...", "run #5: ..."],
      "risk_level": "safe"
    }
  ]
}
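
Before writing the output, a quick structural check against the example can catch missing keys. This sketch assumes every key shown in the example is required; the actual rules live in .claude/schemas/improvement.schema.json, which is not reproduced here, and the function name is hypothetical.

```python
# Keys taken from the example above; whether each is required is an ASSUMPTION.
REQUIRED_TOP = {"run_count_analyzed", "patterns_detected", "proposals"}
REQUIRED_PROPOSAL = {
    "target_agent", "change_type", "current_value", "proposed_value",
    "rationale", "evidence", "risk_level",
}


def missing_fields(output: dict) -> list:
    """List missing keys, e.g. 'proposals[0].evidence'; empty when none are missing."""
    problems = sorted(REQUIRED_TOP - output.keys())
    for index, proposal in enumerate(output.get("proposals", [])):
        for key in sorted(REQUIRED_PROPOSAL - proposal.keys()):
            problems.append(f"proposals[{index}].{key}")
    return problems
```

This checks key presence only. Types and allowed values still need validation against the schema file itself.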