| name | otto-as-user |
| description | Simulate real users using Otto end to end. Use this for behavioral testing, result auditing, bug hunting, and pressure tests on real-world projects through Mission Control Web first, with CLI as the power-user/agent path. Requires real workflows, success measures, logs, artifacts, browser evidence when relevant, and external verification. |
Otto As User
Purpose
Simulate a real person using Otto to accomplish real software work from end to
end. This skill is for:
- behavioral testing of Otto itself
- auditing whether Otto's output actually satisfies the user's goal
- finding bugs, UX issues, reliability gaps, and design gaps
- pressure testing Otto on increasingly complex real-world projects
Pressure tests are not "intended to fail." They are intended to reveal whether
Otto holds under realistic work. When a pressure project fails, treat the
failure as evidence: expose the bug or design gap, fix generic root causes when
that is in scope, add regression coverage, and rerun a meaningful scenario. Only
defer a finding when the remaining work is explicitly tracked with evidence and
a reason it cannot be fixed in the current pass.
The primary entry point is Mission Control Web. Most users experience Mission
Control as Otto's main product, not as an optional dashboard, so the default
Mission Control mode is a true browser test. Use CLI as a secondary surface
for power users, agents, scripted runs, direct otto run pressure tests,
queue runners, proof inspection, and recovery/debug work.
Default rule: if the claim is about Mission Control usability, real-user
confidence, or whether Otto works for a person, run the true Web path unless
the scenario explicitly says CLI/API/backend diagnostics are the realistic user
path. Non-Web as-user work must say why it is valid for that scenario.
Do not treat this as a command cookbook. Treat it as a real-user simulation
protocol.
True User Simulation
When the scenario is Mission Control Web, behave like a real person using the
product. The browser is the interface under test. What the user would see and
do is what the agent should see and do. This is the default for Mission
Control-facing otto-as-user work, not an optional higher tier.
Focus true-user Web runs on Mission Control itself. Generated-app browsing
is external product verification after Otto finishes; it must not replace the
Mission Control journey. The primary user journey is launcher -> project
workspace -> job dialog -> queue/watch progress -> run detail/logs/diff/proof
-> review/land/retry.
Do not use hidden shortcuts for user actions:
- Do not enqueue, cancel, retry, merge, approve, or inspect work through direct
API calls when a visible UI path exists.
- Do not seed fake UI state unless the scenario is explicitly a seeded
regression harness, not a true as-user run.
- Do not bypass the project launcher for first-user or project-selection flows.
- Do not use CLI submission unless the scenario is explicitly CLI/Web interop
or a power-user path.
- Do not treat API/state checks as proof that the user experience worked. API
and disk checks are supporting evidence after the UI path is exercised.
Exercise normal user behavior, including behavior during waits:
- click through visible tabs, logs, diffs, proof, artifacts, and history while a
job is queued or running
- use Mission Control's visible refresh controls, not only browser reload
- move through the project launcher or project switcher and return to the
original project when that surface is enabled
- use browser back/forward and verify the UI recovers honestly
- reload during a long wait when a real user might wonder whether progress is
stale
- scroll long logs and panels
- open and close drawers, dialogs, and detail views
- switch tabs or background the page for long-running work when relevant
- use keyboard navigation and typing for realistic text entry and at least one
accessibility-oriented path when relevant
- try plausible mistaken or exploratory clicks without breaking the scenario
For every user action, write down the expected visible outcome before judging
the page:
- Action: what the user did, such as click New job, submit intent, open Logs,
browser Back, refresh, close a drawer, or switch projects.
- User expectation: what a reasonable person expects to see next.
- Observation: what Mission Control actually shows in the browser: URL,
visible controls, loading/error/success state, selected run, drawer/dialog
state, and any confusing or missing feedback.
- Verdict: expected, surprising-but-acceptable, confusing UX, blocked, or
state contradiction.
Do not treat a click as tested just because Playwright did not throw. The test
must inspect the post-action page against the user's expectation. API, disk,
and git checks can confirm durable truth, but they do not replace the visible
expectation check.
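A minimal sketch of one ledger step, using Playwright's Python sync API; the
Logs button name and the logs-panel selector are hypothetical, not real
Mission Control markup:

```python
# Sketch of one expectation-ledger step: record the expectation before the
# click, then judge the post-action page, not the click itself.
from playwright.sync_api import Page, expect

def logs_click_step(page: Page, ledger: list[dict]) -> None:
    entry = {
        "action": "click run-level Logs",
        "expectation": "a Logs panel opens with log lines or a loading state",
    }
    page.get_by_role("button", name="Logs").click()
    panel = page.locator("[data-testid=logs-panel]")  # hypothetical selector
    try:
        # inspect the post-action page; Playwright not throwing is not a pass
        expect(panel).to_be_visible(timeout=5_000)
        entry["observation"] = f"url={page.url}; Logs panel visible"
        entry["verdict"] = "expected"
    except AssertionError:
        entry["observation"] = f"url={page.url}; no visible Logs panel"
        entry["verdict"] = "blocked"
    ledger.append(entry)
```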
False-confidence blockers for Mission Control Web:
- If a button exists, click it and assert the requested evidence surface
changes visibly. Especially cover run-level and group-level Logs, Diff,
Artifacts, and Proof actions. A visible button whose click only logs to the
console or leaves the same page content is a hard failure.
- Evidence checks must be panel-specific. An already-open drawer, unchanged
route, or generic page shell is not proof that Logs, Diff, Artifacts, or
Proof opened. Diff must resolve out of loading state, and "no changes" is
acceptable only when git/API/disk evidence agrees.
- Cancel, merge, retry, pause/resume, approve, reject, and similar operations
must be driven by visible UI controls in true-web scenarios. API calls,
disk reads, and git commands are verification layers after the click, not
substitutes for the user action.
- If Mission Control shows a running job, compare active stage, elapsed time,
  current group, and evidence panels against each other. `Build active` with
  `Spec review pending`, a stale `WALL 0s` after a nontrivial wait, or
  activity text that does not identify the active work is a hard failure.
- User-facing stage names must explain the work. Internal terms such as
  `seed` must be flagged unless the UI also explains them in user language.
- A Diff panel that says there are no changes while supporting evidence or the
  worktree shows active uncommitted/untracked product files is a hard failure
  (a deterministic cross-check is sketched after this list).
- Top-level PASS/FAIL for the true-web run must be derived from the expectation
ledger and semantic contradictions, not only from whether the script reached
the end or Otto produced a terminal run packet.
- The harness wrapper must honor scenario-level FAIL/INFRA results and any
collected failures. INFRA is not green unless the user explicitly asks to
ignore infra. Summary-level errors, Playwright timeouts, console/page errors,
and network failures cannot coexist with a PASS.
- Recovery and outage scenarios must actually perform the claimed recovery
action. "Could not simulate outage" is a FAIL/INFRA finding, not a note.
- Product-verification scenarios must verify the generated product as a user:
launch the app or service, interact with core behavior, inspect layout/UX,
and run native tests where available. Scanning for source files is smoke
evidence only.
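The "no changes" Diff rule above has a deterministic cross-check; a minimal
sketch, assuming plain git against the run's worktree:

```python
# Sketch: accept a "no changes" Diff panel only when the worktree agrees.
import subprocess

def worktree_is_clean(repo_path: str) -> bool:
    # --porcelain lists modified and untracked files; any output means the
    # "no changes" panel contradicts disk truth and is a hard failure
    out = subprocess.run(
        ["git", "-C", repo_path, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip() == ""
```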
Use bounded randomness only for realistic Mission Control exploration:
- keep golden scenarios deterministic unless user-behavior variance is
explicitly enabled
- randomize within a fixed, stratified set of plausible Mission Control
actions, not blind page clicks
- force coverage across categories over time: evidence inspection, UI refresh,
browser reload, back/forward, project switcher/launcher round trip,
background/return, keyboard open/type/cancel, scrolling, and layout checks
- record the seed, phase, chosen action, user expectation, observation,
selector, URL, verdict, and screenshot/artifact path so the run can be
replayed
- treat randomized UX findings as hard failures only when they block, mislead,
hide state, corrupt workflow state, or contradict API/disk/git truth
- prefer seeded variants such as inspect logs first vs proof first, reload vs
browser back/forward during a wait, desktop vs mobile viewport, keyboard vs
pointer submission, and close/reopen drawer
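A minimal sketch of the seeded, stratified selection described above; the
category and action names are illustrative, not the harness's real action
vocabulary:

```python
# Sketch: seeded, stratified action planning for realistic exploration.
import random

CATEGORIES = {  # illustrative strata, not the harness's real action set
    "evidence": ["open logs", "open proof", "open diff"],
    "navigation": ["browser back/forward", "reload", "launcher round trip"],
    "input": ["keyboard open/type/cancel", "scroll long log"],
}

def plan_actions(seed: int, per_category: int = 1) -> list[tuple[str, str]]:
    rng = random.Random(seed)  # record the seed so the run can be replayed
    plan: list[tuple[str, str]] = []
    for category, actions in CATEGORIES.items():
        # force coverage across categories instead of blind page clicks
        plan.extend((category, a) for a in rng.sample(actions, per_category))
    rng.shuffle(plan)
    return plan
```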
Inspect UI/UX as part of the run, not as a separate screenshot pass:
- first-screen clarity: the next useful action is obvious within a few seconds
- layout: no overlapping text, clipped buttons, unreadable logs, or cramped
  controls at the tested viewport (see the bounding-box sketch after this
  list)
- interaction feedback: submits, loading states, disabled controls, errors, and
success states are visible and honest
- navigation model: back/forward, project home, task cards, tabs, drawers, and
dialogs behave predictably
- responsiveness: the same core workflow is usable on the target desktop and
mobile/tablet viewports when those are in scope
- recovery: refresh/reconnect/restart states do not strand the user or imply
false success
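A coarse layout probe can come straight from Playwright bounding boxes; a
sketch, where the button selector is an assumption about which controls
matter at the tested viewport:

```python
# Sketch: flag overlapping visible controls from Playwright bounding boxes.
from playwright.sync_api import Page

def overlaps(a: dict, b: dict) -> bool:
    return not (a["x"] + a["width"] <= b["x"] or b["x"] + b["width"] <= a["x"]
                or a["y"] + a["height"] <= b["y"]
                or b["y"] + b["height"] <= a["y"])

def overlapping_controls(page: Page) -> list[tuple[int, int]]:
    # "button" is an illustrative selector; real runs would target the
    # controls the user actually depends on
    boxes = [el.bounding_box() for el in page.locator("button").all()]
    boxes = [b for b in boxes if b]  # hidden elements have no box
    return [(i, j) for i in range(len(boxes))
            for j in range(i + 1, len(boxes)) if overlaps(boxes[i], boxes[j])]
```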
Core Rule
For a real-user scenario, success means:
real user goal -> Otto workflow -> inspect evidence -> external verification -> user-acceptable result
An Otto PASS, a harness PASS, or a screenshot alone is not enough.
Doorway Smoke
Before claiming Mission Control Web is usable for a real user, run a doorway
smoke in the same mode the user will use. This is a skill-level requirement,
not a separate script requirement.
Minimum doorway smoke:
1. Start the real server with launcher mode and remote access:
   uv run --extra dev python -m otto.cli web --host 0.0.0.0 --port 9000 --allow-remote --project-launcher --projects-root /Users/yuxuan/otto-projects --no-open
2. Open the actual URL in a browser with Playwright or agent-browser.
3. Assert the first screen is actionable:
   - project launcher is visible when no project is selected
   - no raw backend error such as "Failed to load sessions"
   - no unexpected first-load 4xx/5xx in normal UI requests
   - static bundle is fresh and served from the running server
   - the primary user job is obvious within five seconds
   - useful controls appear above the fold on 1280x800 and 1440x900 desktop
   - decorative or explanatory content does not dominate the first viewport
4. Create or open a managed project.
5. Confirm the UI transitions into the project workflow:
   - selected project identity is visible
   - run cards/open actions use the intended drawer or detail surface
   - project-home/brand navigation returns to the launcher in launcher mode
6. Save at least one screenshot, the server log path, and any bounding-box or
   browser evidence used for the first-screen verdict.
Doorway smoke proves the entrance works. It does not prove Otto can build or
repair a real product; continue with a real scenario for that.
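A minimal doorway-smoke sketch under these requirements, assuming Playwright
and the launcher command above; the error-string and status checks mirror the
first-screen assertions and are not official probes:

```python
# Sketch: doorway smoke against a locally started Mission Control server.
from playwright.sync_api import sync_playwright

def doorway_smoke(url: str = "http://127.0.0.1:9000") -> None:
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        bad: list[str] = []
        # collect unexpected first-load 4xx/5xx responses
        page.on("response",
                lambda r: bad.append(r.url) if r.status >= 400 else None)
        page.goto(url, wait_until="networkidle")
        assert "Failed to load sessions" not in page.content()
        assert not bad, f"unexpected first-load 4xx/5xx: {bad}"
        page.screenshot(path="/tmp/doorway-smoke.png")  # first-screen evidence
        browser.close()
```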
Mission Control Lifecycle Truth Smoke
When testing Mission Control as a real user, run at least one live lifecycle
smoke after the doorway smoke. The point is to verify that Web can drive Otto,
not merely display routes.
Minimum lifecycle truth checks:
- Queue a real build from Web with the intended provider. Inspect the submitted
request and the queued task row.
- The provider/model summary must not combine one provider with another
  provider's default model. Example failure: `codex · model sonnet`.
- Build jobs must not expose or submit legacy controls whose flags are
  ignored by i2p, such as `--split`, `--agentic`, or `build --rounds`.
- Start the queue runner from Web when that is the visible user path.
- The UI must show that work has actually started within a few seconds.
- If `otto_logs/sessions/<id>/spec/compile-agent/` contains live logs, the
  run must not still look merely "queued".
- Open the run drawer/detail while the job is active.
- The task row story/feature count must match the compiled spec or proof.
- Groups and features must be visible when `spec/spec.json` contains
  `groups[*].feature_ids`.
- Phase token/time telemetry must be visible when provider usage exists:
header token spend, stage token spend, and top expensive agent calls from
proof-packet.json / RunView.
- Logs and diffs actions must either open visible evidence or show a clear
loading/error state. Silent clicks fail.
- Inspect logs before diagnosing:
  - otto_logs/sessions/<id>/spec/spec.json
  - otto_logs/sessions/<id>/spec/repo-index.json for brownfield compile
  - otto_logs/sessions/<id>/spec-state.jsonl
  - otto_logs/sessions/<id>/build/**/narrative.log
  - otto_logs/sessions/<id>/build/**/context-packet.json
  - otto_logs/sessions/<id>/audit/**/evidence-packet.json
  - otto_logs/sessions/<id>/proof-packet.json when present
If any of these fail, classify it as a Web/API truthfulness bug unless the logs
prove Otto itself never started.
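A sketch of compact-packets-first inspection; the `groups[*].feature_ids`
shape follows the checks above, and any other field name here is an
assumption:

```python
# Sketch: read compact packets from a session directory before diagnosing.
import json
from pathlib import Path

def inspect_session(session_dir: str) -> dict:
    root = Path(session_dir)
    found: dict = {}
    for rel in ("spec/spec.json", "spec/repo-index.json", "proof-packet.json"):
        path = root / rel
        if path.exists():
            found[rel] = json.loads(path.read_text())
    groups = found.get("spec/spec.json", {}).get("groups", [])
    # compare this count against the task row's story/feature count in the UI
    found["feature_count"] = sum(len(g.get("feature_ids", [])) for g in groups)
    return found
```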
If the work under test changed Mission Control UI, also run the
otto-frontend-rua product-level gate. Do not accept a component-level check
that only proves elements exist or routes click.
Surface Selection
Default to true Mission Control Web when the task is about how a person would
use Otto:
- create or open a project
- submit build/improve/certify work
- review spec, logs, diffs, proof, and artifacts
- watch queue or run progress
- pause/resume/retry/cancel
- land, reject, or repair work
- capture browser screenshots/video/traces
Use CLI when it is the realistic user path:
- direct i2p pressure: `otto run ...`
- agent/script automation
- provider/model/budget/turn overrides
- queue runner operation
- proof/debug inspection
- recovery after interruption
- reproducing a failure with tighter control
Use both surfaces when testing operator workflows: submit or inspect in Web,
run/repair from CLI, then return to Web to review and land.
Real-User Loop
1. Choose a real project and user goal.
   - Prefer existing repos, brownfield apps, open-source projects, or
     realistic internal products.
   - Toy projects only count as smoke tests.
   - For pressure tests, escalate complexity after each pass.
2. Define success measures before running Otto.
   - Product behavior: what must work for the end user?
   - Preservation: what existing behavior must not regress?
   - External verifier: native tests, build, lint, API checks, browser
     checks, database/state checks, or manual product assertions.
   - Evidence: which logs, proof packets, screenshots, videos, diffs, or
     artifacts will prove the result?
   - Acceptance: would a real user accept the output?
3. Execute like a user.
   - Start in Mission Control Web unless CLI is explicitly the natural path.
   - Use real provider runs when the user asks for live/paid/pressure
     evidence.
   - Capture session ids, run ids, wall time, provider/model, and cost when
     available.
4. Inspect evidence before judging.
   - Read `otto_logs/sessions/<id>/summary.json`.
   - Read the compact packets first: `spec/repo-index.json`,
     `build/**/context-packet.json`, and `audit/**/evidence-packet.json`
     when present.
   - Read narrative logs, audit logs, proof packet JSON/HTML, screenshots,
     and videos when present.
   - Inspect phase usage and hot calls from RunView or `proof-packet.json`.
     A pressure run is incomplete if it cannot explain which phase or agent
     call consumed the most wall time/tokens.
   - For Web/queue flows, inspect live records, history, command requests,
     and acks when behavior is unclear.
5. Verify outside Otto.
   - Run the target repo's native verifier.
   - Check durable state for persistence-backed apps.
   - Use browser or HTTP assertions for web behavior.
   - Do not weaken the oracle to make Otto pass.
6. Classify every issue.
   - Otto bug
   - Otto UX issue
   - output/product bug
   - external verifier/oracle bug
   - project-specific invalid request
   - provider/runtime flake
   - design gap
   - beyond current capability
7. Fix only generic root causes when asked to fix bugs.
   - Do not overfit to one test project.
   - Add regression tests for code fixes.
   - Re-run the smallest meaningful gate, then re-run the scenario or a
     narrower reproduction.
Success Measures
Define these per scenario. A good scenario has at least one measure from each
required category.
Required:
- Workflow success: the real user can complete the intended Otto workflow from
start to finish.
- Product success: the produced or modified software satisfies the user goal.
- Regression safety: existing project behavior still works.
- External verification: an independent command or check passes outside Otto.
- Evidence completeness: logs, proof artifacts, and UI/browser evidence are
sufficient to debug or trust the outcome.
Use when applicable:
- UX success: Mission Control is understandable, responsive, and recoverable.
- Persistence success: durable state survives reload/restart.
- Queue success: queued/running/ready/landed states are coherent.
- Recovery success: resume/retry/cancel behaves honestly.
- Provider success: requested provider/model/effort propagated to child agents.
- Cost/time success: budget, max-turns, and reported spend are credible.
- Telemetry success: phase-level wall/token spend and top hot agent calls are
visible enough for a user to know where the run got expensive.
- Accessibility success: controls have labels, focus behavior, and usable
keyboard/mouse paths where relevant.
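Measures are easiest to audit when pinned down as data before the run; a
sketch with illustrative values only:

```python
# Sketch: success measures declared before running Otto. Every value below
# is illustrative; define real ones per scenario.
SCENARIO = {
    "project": "existing Flask todo app (brownfield)",
    "user_goal": "add CSV export to the reports page",
    "measures": {
        "workflow": "submit an improve job from Mission Control and land it",
        "product": "the export endpoint returns a valid CSV",
        "regression": "the repo's existing pytest suite still passes",
        "external_verification": "pytest -q run outside Otto",
        "evidence": ["proof-packet.json", "audit screenshots", "diff"],
    },
}
```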
Scenario Types
Use scenario types rather than memorized harness names.
- First product from scratch: create a new app from a real product intent.
- Brownfield feature: add behavior to an existing app while preserving tests.
- Bug repair: reproduce a real bug, ask Otto to fix it, verify the fix.
- Spec review: edit or approve a generated spec before work starts.
- Operator queue: submit multiple jobs, run watcher, inspect progress, land
successful work, handle failures.
- Recovery: interrupt, resume, retry, cancel, or restart Mission Control.
- Proof audit: inspect proof packet, logs, screenshots, and verifier evidence.
- Full-stack/persistence: verify durable state, migrations, API, and UI.
- Open-source repo: run the repo's native test/lint/build commands.
- Beyond-capability probe: large repo, ambiguous requirement, migration,
concurrency, auth, multi-service setup, or deep data-model change.
Commands
Run from the active worktree. Start with:
pwd
git branch --show-current
git status --short --branch
Mission Control Web
uv run --extra dev python -m otto.cli web --no-open
uv run --extra dev python -m otto.cli web --host 0.0.0.0 --port 9000 --allow-remote --project-launcher --projects-root /Users/yuxuan/otto-projects --no-open
otto dashboard is a compatibility alias for otto web.
otto queue dashboard has been removed.
Queue runner:
uv run --extra dev python -m otto.cli queue run --no-dashboard
Direct i2p CLI
Use for power-user/agent pressure tests:
uv run --extra dev python -m otto.cli run "build a product..." --project-kind webapp --budget 3600 --max-turns 160 --verbose
uv run --extra dev python -m otto.cli run "build a CLI tool..." --project-kind cli --effort high --budget 2400
uv run --extra dev python -m otto.cli run --from-spec otto_logs/sessions/<id>/spec/spec.json --budget 2400
uv run --extra dev python -m otto.cli run --resume --auto-approve
Important otto run flags:
--provider, --model, --effort
--build-provider, --build-model, --build-effort
--certifier-provider, --certifier-model, --certifier-effort
--fix-provider, --fix-model, --fix-effort
--budget, --max-turns
--review-gate, --auto-approve, --gate-timeout
--resume, --reset-budget, --force
--break-lock, --base-url, --from-spec
otto build, otto improve, and otto certify remain compatibility or
specialized surfaces. Use them when testing those surfaces or when Mission
Control queues them. Prefer otto run for direct pressure on the redesigned
i2p path.
Harnesses
Harnesses support the simulation; they do not replace judgment.
| Harness | Use | Counts As |
|---|---|---|
| scripts/e2e_web_mission_control.py | cheap browser regression over seeded Mission Control states | UI regression evidence |
| scripts/web_as_user.py | paid real-provider Mission Control/operator workflows | live web/operator evidence |
| direct otto run on real projects | pressure testing i2p output quality | direct i2p evidence |
Useful commands:
uv run --extra dev python scripts/e2e_web_mission_control.py --scenario all --artifacts /tmp/otto-web-e2e --viewport 1440x900
uv run --extra dev python scripts/web_as_user.py --list
OTTO_ALLOW_REAL_COST=1 uv run --extra dev python scripts/web_as_user.py --mode quick
OTTO_ALLOW_REAL_COST=1 uv run --extra dev python scripts/web_as_user.py --scenario W1 --user-behavior mc-realistic --user-seed 42
OTTO_ALLOW_REAL_COST=1 uv run --extra dev python scripts/web_as_user.py --tier nightly --scenario-delay 10
Current high-signal browser scenarios:
- project-launcher
- fresh-queue
- ready-land
- dirty-blocked
- watcher-stop-ui
- job-submit-matrix
- bulk-land
- long-log-layout
- control-tour
Current high-signal live web-as-user scenarios:
- W1: first-time user creates project, submits build, inspects tabs.
- W7: mobile/iPhone W1 flow.
- W11: operator day, CLI/web interop, queue, watcher, cancel, merge.
- W4: merge happy path.
- W5: merge blocked with clear reason.
- W12b: CLI-queued task through Web start/run/merge.
- W13: restart otto web mid-build and verify recovery.
Evidence To Collect
Always capture:
- project path/source
- user goal and scenario type
- surface used: Web, CLI, or mixed
- exact command(s)
- provider/model/effort if visible
- Mission Control user-behavior mode, seed, and action log when enabled
- session id and run id
- wall time and cost when available
- token spend, phase breakdown, and top expensive agent calls when available
- Otto verdict
- external verifier command/result
- proof packet path
- bugs/issues found
- root cause if known
- fix made and regression tests if applicable
- final user verdict
Primary artifact paths:
otto_logs/sessions/<session-id>/summary.json
otto_logs/sessions/<session-id>/spec/spec.json
otto_logs/sessions/<session-id>/spec/repo-index.json
otto_logs/sessions/<session-id>/spec-state.jsonl
otto_logs/sessions/<session-id>/proof-packet.html
otto_logs/sessions/<session-id>/proof-packet.json
otto_logs/sessions/<session-id>/build/**/context-packet.json
otto_logs/sessions/<session-id>/audit/**/evidence-packet.json
otto_logs/sessions/<session-id>/**/narrative.log
otto_logs/sessions/<session-id>/audit/**/feature-verdicts.json
otto_logs/sessions/<session-id>/audit/**/screenshots/
otto_logs/sessions/<session-id>/audit/**/videos/
otto_logs/cross-sessions/runs/live/<run_id>.json
otto_logs/cross-sessions/history.jsonl
otto_logs/sessions/<run_id>/commands/requests.jsonl
otto_logs/sessions/<run_id>/commands/acks.jsonl
bench-results/web-as-user/<run_id>/<scenario>/mc-user-behavior.jsonl
bench-results/web-as-user/<run_id>/<scenario>/mc-layout-*.json
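A sketch for pairing command requests with acks from the JSONL artifacts
above; one JSON object per line and a shared "id" field are assumptions:

```python
# Sketch: find queued command requests that never received an ack.
import json
from pathlib import Path

def unacked_requests(commands_dir: str) -> list[dict]:
    root = Path(commands_dir)
    requests = [json.loads(line) for line in
                (root / "requests.jsonl").read_text().splitlines() if line]
    acked = {json.loads(line).get("id") for line in
             (root / "acks.jsonl").read_text().splitlines() if line}
    return [r for r in requests if r.get("id") not in acked]
```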
Bug Hunting Rules
- Logs first, then diagnosis.
- Compact packets first, raw transcripts last. Do not bulk-read
messages.jsonl when summary.json, proof-packet.json,
context-packet.json, or evidence-packet.json already answer the
question.
- Deterministic-first audit: contract tests, cross-slice checks, and
browser/walkthrough artifacts beat source inspection and model narrative.
If a deterministic oracle is wrong, document why and fix the oracle
generically rather than ignoring it.
- Reproduce or narrow before fixing when behavior is ambiguous.
- Fix generic root causes only.
- Add regression tests for code fixes.
- Do not hide product failures behind "provider flake" without evidence.
- Do not declare beyond-current-capability without failed logs, external
verifier evidence, and at least one retry or narrower reproduction.
Report Format
Use this compact report for each scenario:
Scenario:
User goal:
Project:
Surface: Web / CLI / mixed
Provider/model:
MC user behavior:
Session/run ids:
Success measures:
Steps performed:
Evidence inspected:
External verifier:
Bugs/issues:
Fixes/tests:
Verdict: accept / reject / retry / fix / defer / beyond-current-capability
Next escalation:
For pressure-test campaigns, append every run to the requested handoff/results
document and keep escalating until the stop condition is evidence-based.