
kayba-stage-2-domain-context

Name: Kayba Stage 2 Domain Context
Author: kayba-ai


stars:2,243

forks:277

updated:March 17, 2026 at 10:21

SKILL.md


name	kayba-stage-2-domain-context
description	Gather domain context about the repository and agent — system prompt, tool definitions, domain docs, and behavior patterns from traces. Trigger when the user says "run stage 2", "gather context", "domain context", or when invoked by the kayba-pipeline orchestrator.

Stage 2: Domain Context Gathering

Understand the agent's world — what it does, what tools it has, and what "success" looks like.

Inputs

TRACES_FOLDER — path to directory containing trace JSON files

Process

0. Detect trace format

Before reading traces, identify the framework that produced them. Read 1 trace file and check:

| Signal | Framework |
|--------|-----------|
| `info.agent_info.implementation`, `info.environment_info`, `simulation.messages[]` with `role`/`tool_calls`/`turn_idx` | tau2-bench |
| `runs[].steps[]` with `type: "tool"`, `lc_kwargs` | LangChain / LangSmith |
| `events[]` with `event_type`, `span_id`, `parent_id` | LlamaIndex |
| `choices[].message.tool_calls[]` at top level | Raw OpenAI API logs |
| `trace.spans[]` with `attributes`, `trace_id` | OpenTelemetry / Arize / Langfuse |

Record the detected format in the output under Trace Format. All subsequent trace-reading steps use the field paths appropriate for that format.

If the format is unrecognized, note the top-level keys and structure, then proceed best-effort with field names found in the data.
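The signal table can be read as a small lookup. The sketch below is one possible interpretation, not the skill's own implementation: `detect_trace_format`, `_has_path`, and `SIGNALS` are hypothetical names, only a subset of each row's signals is checked, and treating a format as matched only when every listed path exists is an assumption.

```python
from typing import Any


def _has_path(obj: Any, path: str) -> bool:
    """Check a dotted path; 'key[]' means 'key' is a non-empty list whose
    first element is used for the rest of the walk."""
    current = obj
    for part in path.split("."):
        is_list = part.endswith("[]")
        key = part[:-2] if is_list else part
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
        if is_list:
            if not isinstance(current, list) or not current:
                return False
            current = current[0]
    return True


# Paths are taken from the signal table; the subset chosen per row is an assumption.
SIGNALS = [
    ("tau2-bench", ["info.agent_info.implementation",
                    "info.environment_info",
                    "simulation.messages[]"]),
    ("LangChain / LangSmith", ["runs[].steps[]"]),
    ("LlamaIndex", ["events[].event_type"]),
    # tool_calls may be absent on text-only responses, so only message is checked.
    ("Raw OpenAI API logs", ["choices[].message"]),
    ("OpenTelemetry / Arize / Langfuse", ["trace.spans[]"]),
]


def detect_trace_format(trace: dict) -> dict:
    """Return the first framework whose signal paths all exist in the trace."""
    for framework, paths in SIGNALS:
        if all(_has_path(trace, p) for p in paths):
            return {"framework": framework, "matched_paths": paths}
    # Unrecognized: note the top-level keys so later steps can proceed best-effort.
    return {"framework": "unrecognized", "top_level_keys": sorted(trace)}
```

The `unrecognized` branch mirrors the best-effort rule: it records the top-level keys rather than guessing a framework.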

1. Detect architecture

Read 2-3 traces and determine if this is a single-agent or multi-agent system:

Single agent: one agent_info entry, one conversation thread, tool calls from one identity
Multi-agent / router: multiple agent_info entries, routing tool calls (e.g., transfer_to_*, delegate_to_*), sub-conversation arrays, or distinct system prompts per agent identity

If multi-agent: document each agent separately (name, role, tools, handoff triggers) and note the routing logic. The remaining steps apply per-agent.
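Two of the multi-agent signals (routing tool calls and distinct system prompts) are easy to check mechanically. This is a rough sketch under stated assumptions: each trace has already been normalized into a list of message dicts with `role`, `content`, and `tool_calls[].name`, and `detect_architecture` is a hypothetical helper name. It does not inspect agent_info entries or sub-conversation arrays.

```python
import re

# Routing tool prefixes from the examples above (transfer_to_*, delegate_to_*).
ROUTING_TOOL = re.compile(r"^(?:transfer_to_|delegate_to_)")


def detect_architecture(traces: list) -> dict:
    """Classify as multi-agent if any routing-style tool call appears
    or more than one distinct system prompt is seen."""
    routing_tools = set()
    system_prompts = set()
    for messages in traces:
        for msg in messages:
            if msg.get("role") == "system" and msg.get("content"):
                system_prompts.add(msg["content"])
            for call in msg.get("tool_calls") or []:
                name = call.get("name", "")
                if ROUTING_TOOL.match(name):
                    routing_tools.add(name)
    is_multi = bool(routing_tools) or len(system_prompts) > 1
    return {
        "type": "multi-agent" if is_multi else "single-agent",
        "routing_tools": sorted(routing_tools),
        "distinct_system_prompts": len(system_prompts),
    }
```

A `single-agent` result here only means these two signals were absent in the sampled traces; the remaining signals still need a manual look.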

2. Find the system prompt

Use a fallback chain and stop at the first hit:

1. Config files: grep for the keys system_prompt, system_message, instructions, AGENT_INSTRUCTION, SYSTEM_PROMPT in YAML/JSON/TOML/Python/JS files
2. Source code: search for prompt template strings, f-strings, or .format() calls that build the system message (look in agent implementation files)
3. Trace extraction: read 3 trace files from {TRACES_FOLDER}:
   - Check info.environment_info.policy (tau2-bench format)
   - Check the first message with role: "system" in the messages array
   - Check raw_data fields for system-level content
4. Not found: if none of the above yields a system prompt, explicitly record SYSTEM_PROMPT_STATUS: NOT_FOUND in the output and flag this for the orchestrator. Do not fabricate or guess.

When found, record both the prompt content and its source location (file path + line, or trace field path).
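A fallback chain of this kind can be sketched as a sequence of finders that each return a source location or nothing. The version below is illustrative only: `find_in_files`, `find_in_trace`, and `find_system_prompt` are hypothetical names, file contents are passed in as a `{path: text}` mapping, a top-level `messages` array is assumed, and the source-code template search and the raw_data check are omitted.

```python
import re
from typing import Optional

PROMPT_KEYS = ("system_prompt", "system_message", "instructions",
               "AGENT_INSTRUCTION", "SYSTEM_PROMPT")
KEY_PATTERN = re.compile(r"\b(?:" + "|".join(PROMPT_KEYS) + r")\b")


def find_in_files(files: dict) -> Optional[dict]:
    """Config tier: first line in any file that mentions a prompt key.
    The hit is a pointer for review, not necessarily the full prompt."""
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if KEY_PATTERN.search(line):
                return {"source": f"{path}:{lineno}", "content": line.strip()}
    return None


def find_in_trace(trace: dict) -> Optional[dict]:
    """Trace tier: tau2-bench policy field, then the first system message."""
    policy = trace.get("info", {}).get("environment_info", {}).get("policy")
    if policy:
        return {"source": "info.environment_info.policy", "content": policy}
    for i, msg in enumerate(trace.get("messages", [])):
        if msg.get("role") == "system":
            return {"source": f"messages[{i}]", "content": msg.get("content")}
    return None


def find_system_prompt(files: dict, traces: list) -> dict:
    """Walk the tiers in order and stop at the first hit."""
    hit = find_in_files(files)
    if hit is None:
        for trace in traces[:3]:  # the chain reads 3 trace files
            hit = find_in_trace(trace)
            if hit is not None:
                break
    if hit is None:
        # Never fabricate: report the miss explicitly.
        return {"source": "NOT_FOUND", "content": None}
    return hit
```

Every hit carries its source location, so the prompt content and where it came from are recorded together.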

3. Extract tool definitions

Two-pass approach: source code first (ground truth), then traces (usage evidence).

Pass 1 — Source code discovery:

Search for tool/function definition patterns: @tool, @is_tool, def tool_, function schema arrays, OpenAPI specs, tools=[] arguments
For each tool, extract from source:
- Name
- Input parameters with types and defaults
- Return type / output schema (document the structure, not just "returns a dict")
- Side effects (recorded as Category in the tool inventory): READ (no state change), WRITE (mutates state), GENERIC (neither)
- Validation rules the tool does NOT enforce (critical — grep for comments like "API does not check", "agent must enforce")

Pass 2 — Trace usage evidence:

Read ALL traces (if <= 20) or a stratified sample (see step 5 for sampling)
Extract every unique tool_calls[].name from assistant messages
Extract every role: "tool" response to document actual output shapes
For each tool, record one example input/output pair from traces
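Pairing each tool call with its response is the fiddly part of this pass. The sketch below assumes normalized messages in which an assistant tool call has `id`, `name`, and `arguments`, and the matching `role: "tool"` message carries the same `id`; those field names are assumptions and differ between trace formats. `collect_tool_examples` is a hypothetical helper.

```python
def collect_tool_examples(messages: list) -> dict:
    """Map each tool name seen in assistant messages to its first complete
    input/output pair, or None if no response was ever observed."""
    examples = {}
    pending = {}  # tool call id -> the call awaiting its response
    for msg in messages:
        role = msg.get("role")
        if role == "assistant":
            for call in msg.get("tool_calls") or []:
                examples.setdefault(call["name"], None)
                pending[call.get("id")] = call
        elif role == "tool":
            call = pending.pop(msg.get("id"), None)
            if call is not None and examples[call["name"]] is None:
                examples[call["name"]] = {
                    "input": call.get("arguments"),
                    "output": msg.get("content"),
                }
    return examples
```

The keys of the returned mapping are the unique tool names observed in traces, which is the set needed when comparing against the source-code pass.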

Reconcile the two passes:

Tools in source but NOT in traces = "available but unused" — flag these; they may be relevant for edge cases the agent should handle
Tools in traces but NOT in source = possible dynamic tools or external APIs — investigate

Output the full tool inventory as a table with columns: Tool, Category (READ/WRITE/GENERIC), Input Schema, Output Schema, In Traces? (Y/N), Unvalidated Rules.
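Reconciling the two passes amounts to two set differences. A minimal sketch, with `reconcile_tools` as a hypothetical name and the result keys chosen for illustration:

```python
def reconcile_tools(source_tools: set, trace_tools: set) -> dict:
    """Compare tools defined in source with tools observed in traces."""
    return {
        # Defined but never called: flag as possible edge-case coverage gaps.
        "available_but_unused": sorted(source_tools - trace_tools),
        # Called but not defined: dynamic tools or external APIs to investigate.
        "in_traces_not_in_source": sorted(trace_tools - source_tools),
        # Present in both passes.
        "confirmed": sorted(source_tools & trace_tools),
    }
```

For example, `reconcile_tools({"get_user", "cancel"}, {"get_user", "search_web"})` puts `cancel` under `available_but_unused` and `search_web` under `in_traces_not_in_source`.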

4. Find domain documentation

READMEs, product docs, wiki links
Policy files (e.g., data/*/policy.md, domain-specific docs)
Inline code comments explaining business logic
Test files that describe expected behavior
Anything that explains what the agent does and what "success" means for its users
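A first pass over the repository's file list can bucket likely documentation sources before reading them. The rules below are illustrative assumptions rather than part of the skill: `classify_doc` and `find_domain_docs` are hypothetical helpers working on relative POSIX-style paths, and inline code comments are not detected.

```python
from pathlib import PurePosixPath
from typing import Optional


def classify_doc(path: str) -> Optional[str]:
    """Assign a relative path to a documentation bucket, or None."""
    p = PurePosixPath(path)
    name = p.name.lower()
    if name.startswith("readme"):
        return "readme"
    if name == "policy.md":
        return "policy"
    if name.startswith("test_") or "tests" in p.parts:
        return "tests"
    if p.suffix.lower() in {".md", ".rst"}:
        return "docs"
    return None


def find_domain_docs(paths: list) -> dict:
    """Group candidate documentation files by bucket."""
    grouped = {}
    for path in paths:
        kind = classify_doc(path)
        if kind is not None:
            grouped.setdefault(kind, []).append(path)
    return grouped
```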

5. Catalogue agent behavior patterns

Trace selection — stratified sampling (do not just grab "5-10 random traces"):

Count total traces in {TRACES_FOLDER}. If <= 20, read ALL of them.
If > 20, select a stratified sample:
- Sort by termination_reason: include at least 2 per unique reason
- Sort by conversation length (message count): include the shortest, the longest, and the 2 closest to the median
- Sort by tool call count: include the lowest and the highest
- If task outcomes are available (pass/fail), include at least 3 of each
- Target: ~15 traces total, or 30% of the corpus, whichever is larger
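The sampling rules can be sketched as a function over per-trace summaries. This assumes each trace has already been reduced to a dict with `id`, `termination_reason`, `message_count`, `tool_call_count`, and an optional boolean `passed`; those field names and `stratified_sample` itself are hypothetical. Topping up to the target in corpus order is also an assumption, since the rules do not say how to fill the remainder.

```python
import math


def stratified_sample(traces: list) -> list:
    """Return trace ids covering every stratum named in the sampling rules."""
    if len(traces) <= 20:
        return [t["id"] for t in traces]

    picked = []

    def take(items):
        for t in items:
            if t["id"] not in picked:
                picked.append(t["id"])

    # At least 2 traces per unique termination reason.
    by_reason = {}
    for t in traces:
        by_reason.setdefault(t.get("termination_reason"), []).append(t)
    for group in by_reason.values():
        take(group[:2])

    # Shortest, longest, and the 2 closest to the median conversation length.
    by_len = sorted(traces, key=lambda t: t["message_count"])
    mid = len(by_len) // 2
    take([by_len[0], by_len[-1], by_len[mid - 1], by_len[mid]])

    # Lowest and highest tool call count.
    by_tools = sorted(traces, key=lambda t: t["tool_call_count"])
    take([by_tools[0], by_tools[-1]])

    # At least 3 passing and 3 failing traces when outcomes are known.
    for outcome in (True, False):
        take([t for t in traces if t.get("passed") is outcome][:3])

    # Top up to ~15 traces or 30% of the corpus, whichever is larger.
    target = max(15, math.ceil(0.3 * len(traces)))
    for t in traces:
        if len(picked) >= target:
            break
        take([t])
    return picked
```

The mandatory strata are filled first, so the result can exceed the target when a corpus has many distinct termination reasons.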

Across the selected traces, document:

Function call frequency — which tools are called most, in what order
Tool call sequences — common tool chains (e.g., get_user -> get_reservation -> cancel)
Success patterns — what does a thread that accomplishes its goal look like?
Failure patterns — what does a thread that fails or gets stuck look like?
Error patterns — what error strings appear in tool outputs? Group by root cause
Policy violation patterns — where does the agent break its own rules? (e.g., multiple tool calls per turn, acting without confirmation)
User feedback signals — reverts, ratings, explicit corrections, escalations, stop tokens, transfer tokens
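Call frequency and common tool chains can be counted with `collections.Counter`. The sketch assumes each trace is a list of normalized messages whose assistant entries carry `tool_calls[].name`; `tool_sequence` and `summarize_tool_patterns` are hypothetical helpers, and a chain is taken to be any run of `chain_length` consecutive calls within one trace.

```python
from collections import Counter


def tool_sequence(messages: list) -> list:
    """Ordered tool names called by the assistant in one trace."""
    return [
        call["name"]
        for msg in messages
        if msg.get("role") == "assistant"
        for call in msg.get("tool_calls") or []
    ]


def summarize_tool_patterns(traces: list, chain_length: int = 2):
    """Return (per-tool call counts, counts of consecutive tool chains)."""
    frequency = Counter()
    chains = Counter()
    for messages in traces:
        seq = tool_sequence(messages)
        frequency.update(seq)
        for i in range(len(seq) - chain_length + 1):
            chains[" -> ".join(seq[i:i + chain_length])] += 1
    return frequency, chains
```

Calling `frequency.most_common()` and `chains.most_common()` on the results gives the ranked lists for the function call frequency and tool call sequence entries.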

6. Write findings

Write all findings to eval/stage2_domain_context.md:

# Domain Context

## Trace Format
- Framework: [detected framework name]
- Key field paths: [e.g., simulation.messages[], info.environment_info.policy]

## Architecture
- Type: [single-agent | multi-agent]
- [If multi-agent: agent roster with roles and handoff triggers]

## Agent Purpose
[1-2 sentence summary of what this agent does]

## System Prompt
- **Source**: [file path + line, or trace field path, or NOT_FOUND]
- **Status**: [verbatim | reconstructed | not_found]

[The system prompt content, or "NOT_FOUND — downstream stages should account for missing system prompt"]

## Tools
| Tool | Category | Input Schema | Output Schema | In Traces? | Unvalidated Rules |
|------|----------|-------------|---------------|------------|-------------------|
| tool_name | READ/WRITE/GENERIC | `{param: type}` | `{field: type}` | Y/N | "API does not check X" |

### Tools available but never called in traces
- [tool_name — why it matters]

## Domain Rules
[Key business rules, constraints, policies the agent must follow]

## Behavior Patterns

### Success patterns
- [pattern 1]

### Failure patterns
- [pattern 1]

### Policy violation patterns
- [violation with frequency: N/M turns]

### Error patterns
| Error | Frequency | Root cause |
|-------|-----------|------------|
| error string | N traces | cause |

### User feedback signals
- [signal 1]

Outputs

eval/stage2_domain_context.md

related-skills.json

same repository

kayba-pipeline.md

from "kayba-ai/agentic-context-engine"

End-to-end agent evaluation and improvement pipeline. Takes a traces folder and optional HITL flag, then orchestrates sub-agents through 7 stages — each stage is its own skill invoked by a dedicated sub-agent. Trigger when the user says "run the pipeline", "kayba pipeline", "evaluate and fix", "full eval", "analyze traces and fix", or provides a traces folder with intent to improve their agent.

2026-03-17 | 2.2k

kayba-stage-1-api-analysis.md

from "kayba-ai/agentic-context-engine"

Fetch pre-computed insights from the Kayba API and build a structured summary. Does NOT upload traces or trigger generation — analysis is assumed to already exist. Trigger when the user says "run stage 1", "get insights", "fetch skills", "kayba analyze", or when invoked by the kayba-pipeline orchestrator. Requires the kayba CLI to be installed and KAYBA_API_KEY to be set.

2026-03-17 | 2.2k

kayba-stage-3-metrics.md

from "kayba-ai/agentic-context-engine"

Define metrics from Kayba insights, implement them as Python measurement code, run against traces, and iterate until the metrics are clean and meaningful. Trigger when the user says "run stage 3", "define metrics", "build metrics", "compute baselines", or when invoked by the kayba-pipeline orchestrator. Requires eval/stage1_insights_summary.md and eval/stage2_domain_context.md to exist.

2026-03-17 | 2.2k

kayba-stage-4-rubric.md

from "kayba-ai/agentic-context-engine"

Organize computed metrics into a tiered evaluation rubric with leading, lagging, and quality indicators. Trigger when the user says "run stage 4", "build rubric", "tier metrics", or when invoked by the kayba-pipeline orchestrator. Requires eval/baseline_metrics.json and eval/compute_baselines.py to exist.

2026-03-17 | 2.2k

kayba-stage-5-action-plan.md

from "kayba-ai/agentic-context-engine"

Triage each insight into discard/code-fix/prompt-fix and produce a prioritized action plan with specific recommendations. Trigger when the user says "run stage 5", "make action plan", "triage skills", or when invoked by the kayba-pipeline orchestrator. Requires eval outputs from stages 1-4.

2026-03-17 | 2.2k

kayba-stage-6-hitl.md

from "kayba-ai/agentic-context-engine"

Human-In-The-Loop gate that presents the action plan with full context, collects an informed approval/modification/rejection decision, and records the outcome. Trigger when the user says "run stage 6", "HITL review", "approve action plan", or when invoked by the kayba-pipeline orchestrator. Requires eval/action_plan.md and eval/baseline_metrics.md to exist.

2026-03-17 | 2.2k

package.json

"author": "kayba-ai"

"repository": "kayba-ai/agentic-context-engine"


