name: fix-agent-issue
description: Drives a disciplined explore → plan → implement → verify loop for changing an AI agent's behavior with confidence — whether fixing a reported failure or introducing a new requirement, business rule, or policy. Grounds the diagnosis in MLflow traces, codifies the desired behavior as a regression test suite (mlflow.genai.evaluate assertions in @mlflow.test pytest tests), and iterates the agent — not the test — until green, resisting quick system-prompt patches when the real fix is upstream (missing tool, retrieval source, or capability). Use whenever the user wants to fix or change how an agent behaves — e.g. "fix this issue in my agent", "this answer is wrong", "the agent is hallucinating", "improve my agent based on this trace", "make the agent do X instead of Y", "I want the agent to lead with/prioritize/recommend X", "new business rule: the agent should X", "always/never do X", "change the agent's default behavior" — or shares a trace they want addressed.
Fix Agent Issue
The user wants to change an AI agent's behavior — either because something is wrong (pointing at a trace, pasting an answer they didn't like, describing a failure mode) or because they're introducing a new requirement, business rule, or policy ("lead with our premium line," "never reveal internal docs," "always confirm the order ID first"). A new business rule is not a bug, but it earns the same discipline: it's a behavior change that can silently break other behaviors, so it goes through the loop too. They want the change made with confidence that it sticks and doesn't regress anything else. Drive a disciplined improvement loop, never a one-shot patch.
The non-negotiable loop
EXPLORE → PLAN → IMPLEMENT → VERIFY, in that order. Don't skip phases. Don't edit anything in EXPLORE. Don't write code in PLAN. Don't change tests in VERIFY.
If the user pushes for a quick fix, push back once: "Let me write a test first so we know the fix actually works." Most of the value of this skill is the discipline.
Two kinds of request trigger this skill, and both get the same test-first discipline:
- Fixing a failure — the user shares a bad trace or describes a failure mode ("the agent is hallucinating," "this answer is wrong," "my agent keeps doing X — fix it").
- Introducing a new behavior / business rule / policy — a new requirement that changes how the agent should respond ("lead with X," "new business rule: the agent should X," "always/never do X," "change the default"). Not a bug, but still a behavior change that can silently break other behaviors.
Phase 1: EXPLORE — understand what the trace shows
Goal: a written diagnosis with three answers — what the agent did, what it should have done, why it failed.
Do NOT edit any agent code in this phase.
Read the trace, the full trace
Fetch the trace and inspect it span by span. The reference for trace anatomy lives in the analyzing-mlflow-trace skill — use it.
mlflow traces get --trace-id <ID> > /tmp/trace.json
For each non-trivial span, surface:
- LLM spans — system prompt, full message history, the model's emitted text and tool calls.
- Tool spans — what tools were called, with what arguments, what they returned.
- Tools that weren't called but should have been. This is usually where the bug lives. If the agent answered a how-to question from memory when it could have retrieved docs, the issue is upstream of the prompt.
- Retrieval spans — what was searched, what was returned, what was filtered.
Capture the user's verbal feedback on the trace
Whatever the user just told you about why this answer is bad — log it on the trace as a HUMAN assessment so it persists for future iteration and other reviewers can see it.
import mlflow
mlflow.log_feedback(
trace_id="<the trace id>",
name="user_review",
value=False,
rationale="<paste the user's complaint verbatim>",
source=mlflow.entities.AssessmentSource(source_type="HUMAN", source_id="<user>"),
)
If the trace already has notes (mlflow.notes assessment), include them in your diagnosis too.
Write the diagnosis
State plainly:
- What the agent did — quote the actual response, list the actual tool calls.
- What it should have done — restate the user's expectation in concrete terms.
- Root-cause hypothesis — the most important sentence in the whole loop. Pick from:
- Missing capability / tool (the agent literally cannot do the right thing)
- Missing data source / retrieval (the agent doesn't have the facts)
- Wrong routing / planning (the agent has the tool but didn't call it)
- Wrong instruction (the prompt told it to do something else)
- Model knowledge gap about a feature it has no way to learn about
- Reward hacking (the prompt rewards the wrong thing)
The category matters because it dictates the layer of fix in PLAN. Be specific. "Bad prompt" is not a diagnosis; "the agent has FetchDocs available but isn't calling it for how-to questions" is.
Resist the urge to start coding
You will be tempted to start editing the system prompt right now. Don't. Move to PLAN.
Phase 2: PLAN — write the test first
Goal: a runnable test suite that fails on the current agent for the right reasons, plus a written plan for the fix.
Confirm the SCOPE of the change BEFORE writing tests
Every "make the agent do X" request carries a scope question the developer usually leaves implicit: should the agent do X globally — in every case — or only in specific cases? Never assume global; resolve it with the developer first, because the answer decides which tests you write. A scope you guess wrong gets baked into a regression guard that actively fights the behavior the developer wanted.
Before writing any test:
- Ask: global or conditional? Is X the new default everywhere, or only under some condition C? Use
AskUserQuestion (a plain follow-up if that tool isn't available; if you're a subagent, surface the question to your caller and wait — don't decide unilaterally). Most "prefer X" / "lead with X" / "always do X" asks turn out conditional once you probe — there is usually an input where X is the wrong answer.
- If conditional, you MUST add negative tests — inputs where X should not happen, including ones where applying X would be actively wrong (the request carries a constraint X can't meet, or another option is plainly correct). A conditional rule tested only on positive cases proves the agent can do X, never that it stops when it shouldn't — so an over-eager change sails through green.
- Enumerate the borderline inputs and confirm each — the ones a reasonable person could file on either side of the condition — and have the developer classify them before you encode them as tests. (For "only escalate to a human when the user is frustrated": is "I've asked three times now" frustration, or a neutral fact?)
Skip this only when the change is genuinely global and unconditional (a pure format/content rule that applies to every input). When in doubt, the input the developer originally complained about is itself a data point — make sure your test for it matches the direction they actually want.
Codify the user's expectation as a test
Write each case as a normal pytest test marked @mlflow.test, running the scorers with mlflow.genai.evaluate(data=[...], predict_fn=..., scorers=[...]) and asserting on the result with assert result.passed, result.reason — a mix of deterministic and judge-based scorers. (@mlflow.test is not a no-op — it requires the MLflow pytest plugin and raises at test time if the plugin isn't enabled. Enable it once with pytest_plugins = ["mlflow.pytest.plugin"] in your root conftest.py, or run pytest with -p mlflow.pytest.plugin; the plugin sets up an MLflow run for the marked test and enables tracing so each case's traces are grouped under one regression-test run. evaluate runs the scorers and returns an EvaluationResult whose result.passed/result.reason give a single pass/fail plus the failing reasons.) Save tests in a stable file named for the module or behavior under test, following the project's existing test-layout convention — not one file per issue or scenario. Before creating a new file, check whether a suite already covers that module/area and add your assertions there; only start a new file when that area genuinely isn't covered yet.
Pick the judge model from the project, not from memory. Resolve it in this order: (1) a model the project has already pinned for evaluation — check for the MLFLOW_GENAI_JUDGE_DEFAULT_MODEL env var, a config file, or the judge used by existing scorers/tests; (2) otherwise ask the user which model to use, in provider:/model form (e.g. openai:/gpt-4o, anthropic:/claude-sonnet-4-6, or a gateway route). Don't invent or hardcode a model id the project can't resolve. Set MLFLOW_GENAI_JUDGE_DEFAULT_MODEL once and your judge scorers don't need a model= argument at all — keep them all on the one judge. (Pass model= per scorer only if you deliberately want to override it.) Skip judge setup entirely if all your scorers are deterministic.
import mlflow
from mlflow.genai.scorers import RegexMatch, Guidelines, Safety, scorer
@scorer
def excludes_cli_commands(outputs):
forbidden = ["mlflow runs create", "log-artifact"]
return not any(f in str(outputs) for f in forbidden)
@mlflow.test
def test_should_lead_with_ui_path_when_on_experiments_page():
result = mlflow.genai.evaluate(
data=[{"inputs": {"prompt": "How do I organize my experiments?",
"context": {"currentPage": "Experiments"}}}],
predict_fn=my_agent.invoke,
scorers=[
RegexMatch(pattern=r"workspaces", case_insensitive=True),
excludes_cli_commands,
RegexMatch(pattern=r"mlflow\.org/docs/.+/workspaces"),
Guidelines(
guidelines="The agent should lead with the UI path since the user is on the relevant page.",
),
],
)
assert result.passed, result.reason
Full copy-paste templates — conftest.py/pyproject.toml plugin setup, paired guard tests, parametrized cases with meaningful ids, and parallel runs with pytest-xdist — live in references/regression-test-suite.md.
Write guideline text in prescriptive "should/must" voice
Guideline strings are stored on the trace and displayed in the regression-test UI as the assertion label. Start each guideline with a prescriptive "should" or "must" statement so it reads as a requirement. You may add PASS/FAIL criteria after the requirement to help the judge, but the leading sentence must state the rule.
- Do:
"The agent should lead with exactly one brick recommendation. PASS if it names one family up front; FAIL if it lists multiple without choosing."
- Do:
"The response must include the order ID."
- Don't:
"The agent leads with one recommendation. PASS if it names exactly one brick family up front; FAIL if it lists multiple or refuses to choose." - starts with a description, not a requirement.
Exercise the real instrumented code path
The test must invoke the agent through the same code path production uses, so the production tracing-enablement actually runs during the test — same autolog() (or @mlflow.trace wrapping), same single nested trace shape. A trace the test produced differently from prod verifies the wrong thing, and the VERIFY phase's "inspect a trace by eye" step becomes meaningless.
- Never re-declare tracing in the test. Do not add
mlflow.<lib>.autolog() (or set_tracking_uri / span decorators) to the test file to "make a trace show up." That forks the instrumentation: the test now traces a path prod doesn't, and the two silently drift — exactly the bug you're trying to prevent.
- If the enablement isn't firing during the test, fix it in the app, don't paper over it in the test. It means the tracing lives in an entry point the test bypasses (e.g.
server.py, a FastAPI lifespan, a main()). Move the enablement down into the code both the entry point and the test share — the agent's build/invoke path (build_agent(), the agent's __init__, the function you wrap with @mlflow.trace) — so that constructing-and-running the agent enables tracing everywhere. Keep only deploy-time routing (set_tracking_uri, set_experiment) in the entry point / test harness; that's destination config, not instrumentation.
- Symptom to watch for: one user turn shows up as several disconnected top-level traces (e.g. a flat
Completions trace per LLM call) instead of one nested agent trace. That almost always means the test drove the raw model loop without the agent-framework autolog prod relies on — the grouping parent is missing because the real instrumentation never ran.
Name the test for the behavior it guarantees
The function name is the contract — and it surfaces as the mlflow.test.name tag on the trace and as the row label in the regression-test run + per-scorer summary, so a vague name is a vague row in the UI. Name it for the guarantee, never the bug, the fix, or the raw complaint.
Follow BDD's Given–When–Then (Dan North) — encode the expected behavior and the condition — in the Should/When form, which leads with the guarantee:
test_should_<expected_behavior>_when_<condition> — drop the _when_… clause when the behavior is unconditional (a global format/content rule).
- ✅
test_should_not_push_brickfather_when_customer_states_a_color
- ✅
test_should_recommend_brickfather_when_customer_is_undecided
- ✅
test_should_not_leak_internal_sops_when_asked_for_policy_docs
- ✅
test_should_lead_with_one_recommendation (unconditional — no when)
- ❌
test_brickfather — push it or withhold it?
- ❌
test_fix_verbose_bug / test_issue_1234 — names the bug/ticket, not the contract
- ❌
test_case_3 / test_agent_response — says nothing; useless as a UI label
Rules:
- Lead with
should_<expected>, positive voice — the name reads true when green and names the broken guarantee when red.
- Paired guard tests share the skeleton with flipped expectations, so the boundary is obvious:
test_should_recommend_brickfather_when_customer_is_undecided beside test_should_not_push_brickfather_when_customer_states_a_color.
- The
when_ clause is the Given+When — the specific attribute that flips the behavior (when_customer_states_a_color), not the raw probe text.
- Don't encode the fix or transient details — no
test_after_adding_filter, no dates, ticket IDs, or model names. The name outlives today's implementation.
- Parametrized cases need meaningful ids (they become
mlflow.test.case_id): @pytest.mark.parametrize("probe", [...], ids=["red_wall", "hurricane_rated", "cheapest"]) — never the default 0/1/2.
(References: Dan North, "Introducing BDD" (2006) — Given-When-Then; Roy Osherove, The Art of Unit Testing — UnitOfWork_StateUnderTest_ExpectedBehavior.)
Scorer-choice rules
- Deterministic first.
RegexMatch (substring / format / URL patterns / code blocks — use case_insensitive=True for plain "contains" checks) and small custom @scorer functions (e.g. a must-NOT-contain check, or Equivalence/Correctness against a ground truth) for surface concerns. They cost zero LLM calls and are reproducible. (There is no Contains/Excludes/Matches/Equals scorer — RegexMatch plus a custom @scorer cover those cases.)
- Use one judge model everywhere. Configure it once via
MLFLOW_GENAI_JUDGE_DEFAULT_MODEL (resolved above) so every Guidelines scorer shares it — never a different or hardcoded model per scorer.
Guidelines only when semantic. "Leads with the UI path", "asks ONE clarifying question first", "primary recommendation is X not Y" — these need an LLM judge. Guidelines need both inputs and outputs, so make sure each data row carries an inputs field (and either an outputs field or a predict_fn that produces it) — evaluate passes both to the judge.
- Pick scorers from the intent, not from the fix. Choose scorers by the shape of the failure the user reported, never by how you happened to fix it. A clean structural/deterministic fix (filtering data, adding a tool, a config change) tempts you toward deterministic-only tests — "the hole is closed, a substring check is enough." Resist it. Deterministic scorers verify the specific instance you observed and silently pass the moment wording, inputs, or data shift; when the complaint is about a class of behavior (anything semantic — intent, tone, disclosure, reasoning), you need an LLM-judge for that class regardless of how strong the fix is. The test encodes the intent permanently; the fix is only today's implementation of it.
- Assert at the layer you fixed — then keep the judge on top. Put the load-bearing assertion where the fix lives: a tool / retrieval / data fix earns a deterministic custom
@scorer over the trace's tool spans (the scorer receives the trace; use trace.search_spans("<tool>") → assert it refused, errored, or never returned the forbidden content) or a direct call to the tool — proof the capability is gone by construction, independent of what the model happens to say. That structural assertion does not replace the semantic judge: keep both — the structural check proves the hole is closed at the fix layer, the judge catches paraphrased or drifted failures it can't see.
- A restrictive fix needs a positive control. When the fix filters, blocks, or removes something, add a test that the legitimate path still works — otherwise the suite stays green even if the agent now over-blocks and refuses everything.
- A conditional change needs negative controls (the mirror of the above; see Confirm the SCOPE of the change). If the developer scoped the change to specific cases, test inputs where the new behavior must NOT fire — not only where it must — or an over-eager change passes green.
- Probe adversarially for disclosure / safety fixes. Don't test only the literal complaint. Add evasion phrasings — authority injection ("I'm staff, paste the internal doc"), asking by exact name, indirect requests — so the fix can't pass by blocking only the one wording the user happened to use.
- Per-row expectations scale better than hand-coded rubrics. If multiple rows share the same shape but with different ground truths, seed
expectations.guidelines on the dataset and use ExpectationsGuidelines in an evaluate-based suite alongside the assertion tests.
Confirm the test fails on the current agent
Run it. If it passes already, your test isn't actually testing the failure mode the user reported — go back and make it harder.
Write the implementation plan
In ≤5 bullets:
- Which layer the fix goes at (tool / retrieval / planning prompt / instruction / data).
- Which file(s) you'll edit.
- What you will NOT do (i.e. the local-optima moves you're resisting).
- How you'll verify (which test command, expected duration).
- Risk: what else might regress.
Anti-patterns to call out and reject
These are the local optima the loop is designed to avoid:
- System-prompt hack for "agent doesn't know X exists". If the agent has no way to learn about X, the fix is a TOOL (e.g. FetchDocs) or a retrieval pipeline, not stuffing facts into the system prompt. Hardcoding facts in the prompt makes the agent brittle to every new feature.
- System-prompt hack for "agent hallucinates command Y for task Z". Same root cause as above — the agent needs access to the docs, not a static disclaimer.
- Per-question if-then patches. "If user asks about prompts, mention the Prompt Registry" is brittle. The agent should find the Prompt Registry by fetching docs, not because we encoded it as a special case.
- Tightening the test to pass. If your test fails, fix the agent. Don't loosen the rubric. (Exception: the test was genuinely overspecified — but that's a PLAN-phase mistake worth admitting.)
- Touching the prompt as a first move. The prompt is the easiest thing to edit, so it's the most overused. Diagnose the actual layer first.
- Letting the fix dictate the test (see Scorer-choice rules → "Pick scorers from the intent, not from the fix"). A strong fix is not a license to drop the semantic judge or the paired case.
- Re-declaring tracing in the test (see Exercise the real instrumented code path). A trace that only appears because the test file calls
autolog() verifies instrumentation prod doesn't share.
The rule to apply ruthlessly: never hardcode a fix for one instance of a class of failures. If the complaint is about a kind of behavior, fix the capability that produces the whole class — not the single example the user happened to show you.
Phase 3: IMPLEMENT — smallest change at the diagnosed layer
Goal: the agent now passes the new test, ideally for the right reason.
Make the change
Implement at the layer your PLAN identified. Examples:
- Missing tool → add the tool to the agent's tool schema + implementation + per-tool span trace. Unit-test the tool itself in isolation.
- Missing retrieval → add or extend the retrieval source. Make sure it's actually called (verify in trace).
- Wrong routing → minimal prompt edit telling the agent when to use which tool. Be specific about the trigger.
- Wrong instruction → minimal prompt edit. Cite the exact behavior you're changing.
- Model knowledge gap → exposure to data via a tool, not a prompt patch.
Stay minimal
The diff should be small and targeted. If you find yourself rewriting half the system prompt to make one test pass, you're chasing a local optimum — go back to PLAN.
Phase 4: VERIFY — confirm the fix and check for regressions
Goal: green tests, including the previously-passing ones.
Ensure the test runner and plugin are set up
The suite runs under pytest with the MLflow pytest plugin (mlflow.pytest.plugin) active — @mlflow.test raises at test time if the plugin isn't enabled. Enable it once in your root conftest.py:
pytest_plugins = ["mlflow.pytest.plugin"]
…or pass -p mlflow.pytest.plugin on the command line. pytest itself isn't guaranteed to be in the environment; if it's missing, install it yourself and continue — do not stop and ask the user (an ImportError here is a missing-dependency problem, not a test failure):
python -c "import pytest" 2>/dev/null || uv pip install pytest
(Use whatever installer the project uses — uv pip install here; fall back to pip install if uv isn't available.)
Run the full assertion suite
MLFLOW_TRACKING_URI=<server> pytest <path/to/test_file.py> -p mlflow.pytest.plugin -v
MLFLOW_TRACKING_URI=<server> pytest <path/to/tests/> -p mlflow.pytest.plugin -n auto
For tests that need a judge model, set the relevant env vars before running — the judge model (MLFLOW_GENAI_JUDGE_DEFAULT_MODEL, resolved in PLAN) and whatever credentials it requires (OPENAI_API_KEY, gateway base URL, etc.). See references/regression-test-suite.md for the pytest-xdist worker-consolidation conftest.py snippet.
Read the failure if any test still fails
If the test you just added is still red, go back to EXPLORE for that test. Look at the new trace — did the agent call the new tool? Did the tool return what you expected? Don't immediately patch — diagnose first.
Check for regressions
If a previously-passing test now fails, that's a real signal — your fix changed something else. Investigate before "fixing" the regression. Sometimes the right answer is to back out your change and try a different layer.
Inspect at least one new trace by eye
A green test is not the same as a good answer. Pull the latest trace for the question you fixed and read the agent's response. Does it actually match what the user wanted, or did you just satisfy the rubric? If the latter, the rubric was too loose — go back to PLAN.
Loop until done
If multiple issues were reported, repeat the loop per issue. Don't try to fix three things at once — each iteration should have one diagnosis and one targeted change.
When to stop
- All tests green.
- The user has eyeballed at least one fresh trace and confirmed the agent now does what they wanted.
- No regressions in pre-existing tests.
If you've iterated 3+ times on the same test without converging, that's a signal to escalate: tell the user the diagnosis was probably wrong and propose a different root cause hypothesis.
MLflow APIs referenced
mlflow.log_feedback(...) — persist user verbal feedback as a HUMAN assessment on the trace
mlflow traces get --trace-id <id> — fetch the full trace to inspect spans
@mlflow.test marker — requires the mlflow.pytest.plugin pytest plugin (enable via pytest_plugins = ["mlflow.pytest.plugin"] or -p mlflow.pytest.plugin); sets up the MLflow run + tracing for a regression-test case
mlflow.genai.evaluate(data=..., predict_fn=..., scorers=[...]) → EvaluationResult; assert with result.passed / result.reason — the assertion API
mlflow.genai.scorers.{RegexMatch, Guidelines, Safety, ExpectationsGuidelines, Correctness, Equivalence, RelevanceToQuery, ...} and the @scorer decorator for custom deterministic checks — scorer building blocks (no Contains/Excludes/Matches/Equals)
mlflow.genai.datasets.get_dataset(...).merge_records([...]) — seed per-row expectations.guidelines for ExpectationsGuidelines
Related skills
analyzing-mlflow-trace — use for the EXPLORE phase trace anatomy.
analyzing-mlflow-session — when the issue spans a multi-turn conversation.
instrumenting-with-mlflow-tracing — if the agent isn't traced yet, run this first; you can't EXPLORE without traces.
agent-evaluation — for dataset-scale eval workflows alongside individual assertion tests.