Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

fix-agent-issue

Sterne57

Forks18

Aktualisiert18. Juni 2026 um 07:01

Drives a disciplined explore → plan → implement → verify loop for changing an AI agent's behavior with confidence — whether fixing a reported failure or introducing a new requirement, business rule, or policy. Grounds the diagnosis in MLflow traces, codifies the desired behavior as a regression test suite (`mlflow.genai.evaluate` assertions in `@mlflow.test` pytest tests), and iterates the agent — not the test — until green, resisting quick system-prompt patches when the real fix is upstream (missing tool, retrieval source, or capability). Use whenever the user wants to fix or change how an agent behaves — e.g. "fix this issue in my agent", "this answer is wrong", "the agent is hallucinating", "improve my agent based on this trace", "make the agent do X instead of Y", "I want the agent to lead with/prioritize/recommend X", "new business rule: the agent should X", "always/never do X", "change the agent's default behavior" — or shares a trace they want addressed.

Installation

Mit Codex oder Claude installieren Kopieren Sie diesen Prompt, fügen Sie ihn in Codex, Claude oder einen anderen Assistant ein und lassen Sie die Skill-Seite prüfen und installieren.

In Manus ausführen

Quelle

mlflow

mlflow/skills

GitHub-Repository öffnen Creator-Repositorys ansehen

Download

In Manus ausführen

Verwandte BerufeSOC

Basierend auf der SOC-Berufsklassifikation

SoftwareentwicklerInformatik- und Mathematikberufe·SOC 15-1252

Datei-Explorer

2 Dateien

SKILL.md

readonly

Mehr aus diesem Repository

gleiches Repository

agent-evaluation

mlflow/skills

Use this when you need to EVALUATE OR IMPROVE or OPTIMIZE an existing LLM agent's output quality - including improving tool selection accuracy, answer quality, reducing costs, or fixing issues where the agent gives wrong/incomplete responses. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. IMPORTANT - Always also load the instrumenting-with-mlflow-tracing skill before starting any work. Covers end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).

2026-05-2957

instrumenting-with-mlflow-tracing

mlflow/skills

Instruments Python and TypeScript code with MLflow Tracing for observability. Must be loaded when setting up tracing as part of any workflow including agent evaluation. Triggers on adding tracing, instrumenting agents/LLM apps, getting started with MLflow tracing, tracing specific frameworks (LangGraph, LangChain, OpenAI, Gemini, DSPy, CrewAI, AutoGen), or when another skill references tracing setup. Examples - "How do I add tracing?", "Instrument my agent", "Trace my LangChain app", "Set up tracing for evaluation"

2026-05-2957

mlflow-onboarding

mlflow/skills

Onboards users to MLflow by determining their use case (GenAI agents/apps or traditional ML/deep learning) and guiding them through relevant quickstart tutorials and initial integration. If an experiment ID is available, it should be supplied as input to help determine the use case. Use when the user asks to get started with MLflow, set up tracking, add observability, or integrate MLflow into their project. Triggers on "get started with MLflow", "set up MLflow", "onboard to MLflow", "add MLflow to my project", "how do I use MLflow".

2026-05-2957

mlflow-agent

mlflow/skills

Master dispatcher for all MLflow workflows. Use this skill when the user wants to do anything with MLflow — tracing, evaluating, debugging, or improving an agent. Routes to the right MLflow sub-skill automatically. Triggers on: "use mlflow", "help with mlflow", "mlflow agent", "add mlflow to my project", "trace my agent", "evaluate my agent", or any MLflow task without a specific skill in mind.

2026-03-3157

analyzing-mlflow-session

mlflow/skills

Analyzes an MLflow session — a sequence of traces from a multi-turn chat conversation or interaction. Use when the user asks to debug a chat conversation, review session or chat history, find where a multi-turn chat went wrong, or analyze patterns across turns. Triggers on "analyze this session", "what happened in this conversation", "debug session", "review chat history", "where did this chat go wrong", "session traces", "analyze chat", "debug this chat".

2026-01-2857

analyzing-mlflow-trace

mlflow/skills

Analyzes a single MLflow trace to answer a user query about it. Use when the user provides a trace ID and asks to debug, investigate, find issues, root-cause errors, understand behavior, or analyze quality. Triggers on "analyze this trace", "what went wrong with this trace", "debug trace", "investigate trace", "why did this trace fail", "root cause this trace".

2026-01-2857

name: fix-agent-issue description: Drives a disciplined explore → plan → implement → verify loop for changing an AI agent's behavior with confidence — whether fixing a reported failure or introducing a new requirement, business rule, or policy. Grounds the diagnosis in MLflow traces, codifies the desired behavior as a regression test suite (`mlflow.genai.evaluate` assertions in `@mlflow.test` pytest tests), and iterates the agent — not the test — until green, resisting quick system-prompt patches when the real fix is upstream (missing tool, retrieval source, or capability). Use whenever the user wants to fix or change how an agent behaves — e.g. "fix this issue in my agent", "this answer is wrong", "the agent is hallucinating", "improve my agent based on this trace", "make the agent do X instead of Y", "I want the agent to lead with/prioritize/recommend X", "new business rule: the agent should X", "always/never do X", "change the agent's default behavior" — or shares a trace they want addressed.

Fix Agent Issue

The user wants to change an AI agent's behavior — either because something is wrong (pointing at a trace, pasting an answer they didn't like, describing a failure mode) or because they're introducing a new requirement, business rule, or policy ("lead with our premium line," "never reveal internal docs," "always confirm the order ID first"). A new business rule is not a bug, but it earns the same discipline: it's a behavior change that can silently break other behaviors, so it goes through the loop too. They want the change made with confidence that it sticks and doesn't regress anything else. Drive a disciplined improvement loop, never a one-shot patch.

The non-negotiable loop

EXPLORE → PLAN → IMPLEMENT → VERIFY, in that order. Don't skip phases. Don't edit anything in EXPLORE. Don't write code in PLAN. Don't change tests in VERIFY.

If the user pushes for a quick fix, push back once: "Let me write a test first so we know the fix actually works." Most of the value of this skill is the discipline.

Two kinds of request trigger this skill, and both get the same test-first discipline:

Fixing a failure — the user shares a bad trace or describes a failure mode ("the agent is hallucinating," "this answer is wrong," "my agent keeps doing X — fix it").
Introducing a new behavior / business rule / policy — a new requirement that changes how the agent should respond ("lead with X," "new business rule: the agent should X," "always/never do X," "change the default"). Not a bug, but still a behavior change that can silently break other behaviors.

Phase 1: EXPLORE — understand what the trace shows

Goal: a written diagnosis with three answers — what the agent did, what it should have done, why it failed.

Do NOT edit any agent code in this phase.

Read the trace, the full trace

Fetch the trace and inspect it span by span. The reference for trace anatomy lives in the analyzing-mlflow-trace skill — use it.

mlflow traces get --trace-id <ID> > /tmp/trace.json

For each non-trivial span, surface:

LLM spans — system prompt, full message history, the model's emitted text and tool calls.
Tool spans — what tools were called, with what arguments, what they returned.
Tools that weren't called but should have been. This is usually where the bug lives. If the agent answered a how-to question from memory when it could have retrieved docs, the issue is upstream of the prompt.
Retrieval spans — what was searched, what was returned, what was filtered.

Capture the user's verbal feedback on the trace

Whatever the user just told you about why this answer is bad — log it on the trace as a HUMAN assessment so it persists for future iteration and other reviewers can see it.

import mlflow
mlflow.log_feedback(
    trace_id="<the trace id>",
    name="user_review",
    value=False,                            # or a numeric score / category
    rationale="<paste the user's complaint verbatim>",
    source=mlflow.entities.AssessmentSource(source_type="HUMAN", source_id="<user>"),
)

If the trace already has notes (mlflow.notes assessment), include them in your diagnosis too.

Write the diagnosis

State plainly:

What the agent did — quote the actual response, list the actual tool calls.
What it should have done — restate the user's expectation in concrete terms.
Root-cause hypothesis — the most important sentence in the whole loop. Pick from:
- Missing capability / tool (the agent literally cannot do the right thing)
- Missing data source / retrieval (the agent doesn't have the facts)
- Wrong routing / planning (the agent has the tool but didn't call it)
- Wrong instruction (the prompt told it to do something else)
- Model knowledge gap about a feature it has no way to learn about
- Reward hacking (the prompt rewards the wrong thing)

The category matters because it dictates the layer of fix in PLAN. Be specific. "Bad prompt" is not a diagnosis; "the agent has FetchDocs available but isn't calling it for how-to questions" is.

Resist the urge to start coding

You will be tempted to start editing the system prompt right now. Don't. Move to PLAN.

Phase 2: PLAN — write the test first

Goal: a runnable test suite that fails on the current agent for the right reasons, plus a written plan for the fix.

Confirm the SCOPE of the change BEFORE writing tests

Every "make the agent do X" request carries a scope question the developer usually leaves implicit: should the agent do X globally — in every case — or only in specific cases? Never assume global; resolve it with the developer first, because the answer decides which tests you write. A scope you guess wrong gets baked into a regression guard that actively fights the behavior the developer wanted.

Before writing any test:

Ask: global or conditional? Is X the new default everywhere, or only under some condition C? Use AskUserQuestion (a plain follow-up if that tool isn't available; if you're a subagent, surface the question to your caller and wait — don't decide unilaterally). Most "prefer X" / "lead with X" / "always do X" asks turn out conditional once you probe — there is usually an input where X is the wrong answer.
If conditional, you MUST add negative tests — inputs where X should not happen, including ones where applying X would be actively wrong (the request carries a constraint X can't meet, or another option is plainly correct). A conditional rule tested only on positive cases proves the agent can do X, never that it stops when it shouldn't — so an over-eager change sails through green.
Enumerate the borderline inputs and confirm each — the ones a reasonable person could file on either side of the condition — and have the developer classify them before you encode them as tests. (For "only escalate to a human when the user is frustrated": is "I've asked three times now" frustration, or a neutral fact?)

Skip this only when the change is genuinely global and unconditional (a pure format/content rule that applies to every input). When in doubt, the input the developer originally complained about is itself a data point — make sure your test for it matches the direction they actually want.

Codify the user's expectation as a test

Write each case as a normal pytest test marked @mlflow.test, running the scorers with mlflow.genai.evaluate(data=[...], predict_fn=..., scorers=[...]) and asserting on the result with assert result.passed, result.reason — a mix of deterministic and judge-based scorers. (@mlflow.test is not a no-op — it requires the MLflow pytest plugin and raises at test time if the plugin isn't enabled. Enable it once with pytest_plugins = ["mlflow.pytest.plugin"] in your root conftest.py, or run pytest with -p mlflow.pytest.plugin; the plugin sets up an MLflow run for the marked test and enables tracing so each case's traces are grouped under one regression-test run. evaluate runs the scorers and returns an EvaluationResult whose result.passed/result.reason give a single pass/fail plus the failing reasons.) Save tests in a stable file named for the module or behavior under test, following the project's existing test-layout convention — not one file per issue or scenario. Before creating a new file, check whether a suite already covers that module/area and add your assertions there; only start a new file when that area genuinely isn't covered yet.

Pick the judge model from the project, not from memory. Resolve it in this order: (1) a model the project has already pinned for evaluation — check for the MLFLOW_GENAI_JUDGE_DEFAULT_MODEL env var, a config file, or the judge used by existing scorers/tests; (2) otherwise ask the user which model to use, in provider:/model form (e.g. openai:/gpt-4o, anthropic:/claude-sonnet-4-6, or a gateway route). Don't invent or hardcode a model id the project can't resolve. Set MLFLOW_GENAI_JUDGE_DEFAULT_MODEL once and your judge scorers don't need a model= argument at all — keep them all on the one judge. (Pass model= per scorer only if you deliberately want to override it.) Skip judge setup entirely if all your scorers are deterministic.

import mlflow
from mlflow.genai.scorers import RegexMatch, Guidelines, Safety, scorer

# Judge model comes from MLFLOW_GENAI_JUDGE_DEFAULT_MODEL — no model= needed below.

# There is no built-in "excludes" scorer — write a small custom @scorer for must-NOT-contain checks.
@scorer
def excludes_cli_commands(outputs):
    forbidden = ["mlflow runs create", "log-artifact"]
    return not any(f in str(outputs) for f in forbidden)

@mlflow.test
def test_should_lead_with_ui_path_when_on_experiments_page():
    result = mlflow.genai.evaluate(
        # inputs keys are passed to predict_fn as kwargs; predict_fn exercises the
        # real, instrumented agent path (see "Exercise the real instrumented code path").
        data=[{"inputs": {"prompt": "How do I organize my experiments?",
                          "context": {"currentPage": "Experiments"}}}],
        predict_fn=my_agent.invoke,
        scorers=[
            RegexMatch(pattern=r"workspaces", case_insensitive=True),  # deterministic
            excludes_cli_commands,                                     # deterministic (custom)
            RegexMatch(pattern=r"mlflow\.org/docs/.+/workspaces"),     # deterministic
            Guidelines(
                guidelines="The agent should lead with the UI path since the user is on the relevant page.",
            ),                                                         # semantic (uses default judge)
        ],
    )
    assert result.passed, result.reason

Full copy-paste templates — conftest.py/pyproject.toml plugin setup, paired guard tests, parametrized cases with meaningful ids, and parallel runs with pytest-xdist — live in references/regression-test-suite.md.

Write guideline text in prescriptive "should/must" voice

Guideline strings are stored on the trace and displayed in the regression-test UI as the assertion label. Start each guideline with a prescriptive "should" or "must" statement so it reads as a requirement. You may add PASS/FAIL criteria after the requirement to help the judge, but the leading sentence must state the rule.

Do: "The agent should lead with exactly one brick recommendation. PASS if it names one family up front; FAIL if it lists multiple without choosing."
Do: "The response must include the order ID."
Don't: "The agent leads with one recommendation. PASS if it names exactly one brick family up front; FAIL if it lists multiple or refuses to choose." - starts with a description, not a requirement.

Exercise the real instrumented code path

The test must invoke the agent through the same code path production uses, so the production tracing-enablement actually runs during the test — same autolog() (or @mlflow.trace wrapping), same single nested trace shape. A trace the test produced differently from prod verifies the wrong thing, and the VERIFY phase's "inspect a trace by eye" step becomes meaningless.

Never re-declare tracing in the test. Do not add mlflow.<lib>.autolog() (or set_tracking_uri / span decorators) to the test file to "make a trace show up." That forks the instrumentation: the test now traces a path prod doesn't, and the two silently drift — exactly the bug you're trying to prevent.
If the enablement isn't firing during the test, fix it in the app, don't paper over it in the test. It means the tracing lives in an entry point the test bypasses (e.g. server.py, a FastAPI lifespan, a main()). Move the enablement down into the code both the entry point and the test share — the agent's build/invoke path (build_agent(), the agent's __init__, the function you wrap with @mlflow.trace) — so that constructing-and-running the agent enables tracing everywhere. Keep only deploy-time routing (set_tracking_uri, set_experiment) in the entry point / test harness; that's destination config, not instrumentation.
Symptom to watch for: one user turn shows up as several disconnected top-level traces (e.g. a flat Completions trace per LLM call) instead of one nested agent trace. That almost always means the test drove the raw model loop without the agent-framework autolog prod relies on — the grouping parent is missing because the real instrumentation never ran.

Name the test for the behavior it guarantees

The function name is the contract — and it surfaces as the mlflow.test.name tag on the trace and as the row label in the regression-test run + per-scorer summary, so a vague name is a vague row in the UI. Name it for the guarantee, never the bug, the fix, or the raw complaint.

Follow BDD's Given–When–Then (Dan North) — encode the expected behavior and the condition — in the Should/When form, which leads with the guarantee:

test_should_<expected_behavior>_when_<condition> — drop the _when_… clause when the behavior is unconditional (a global format/content rule).

✅ test_should_not_push_brickfather_when_customer_states_a_color
✅ test_should_recommend_brickfather_when_customer_is_undecided
✅ test_should_not_leak_internal_sops_when_asked_for_policy_docs
✅ test_should_lead_with_one_recommendation (unconditional — no when)
❌ test_brickfather — push it or withhold it?
❌ test_fix_verbose_bug / test_issue_1234 — names the bug/ticket, not the contract
❌ test_case_3 / test_agent_response — says nothing; useless as a UI label

Rules:

Lead with should_<expected>, positive voice — the name reads true when green and names the broken guarantee when red.
Paired guard tests share the skeleton with flipped expectations, so the boundary is obvious: test_should_recommend_brickfather_when_customer_is_undecided beside test_should_not_push_brickfather_when_customer_states_a_color.
The when_ clause is the Given+When — the specific attribute that flips the behavior (when_customer_states_a_color), not the raw probe text.
Don't encode the fix or transient details — no test_after_adding_filter, no dates, ticket IDs, or model names. The name outlives today's implementation.
Parametrized cases need meaningful ids (they become mlflow.test.case_id): @pytest.mark.parametrize("probe", [...], ids=["red_wall", "hurricane_rated", "cheapest"]) — never the default 0/1/2.

(References: Dan North, "Introducing BDD" (2006) — Given-When-Then; Roy Osherove, The Art of Unit Testing — UnitOfWork_StateUnderTest_ExpectedBehavior.)

Scorer-choice rules

Deterministic first. RegexMatch (substring / format / URL patterns / code blocks — use case_insensitive=True for plain "contains" checks) and small custom @scorer functions (e.g. a must-NOT-contain check, or Equivalence/Correctness against a ground truth) for surface concerns. They cost zero LLM calls and are reproducible. (There is no Contains/Excludes/Matches/Equals scorer — RegexMatch plus a custom @scorer cover those cases.)
Use one judge model everywhere. Configure it once via MLFLOW_GENAI_JUDGE_DEFAULT_MODEL (resolved above) so every Guidelines scorer shares it — never a different or hardcoded model per scorer.
Guidelines only when semantic. "Leads with the UI path", "asks ONE clarifying question first", "primary recommendation is X not Y" — these need an LLM judge. Guidelines need both inputs and outputs, so make sure each data row carries an inputs field (and either an outputs field or a predict_fn that produces it) — evaluate passes both to the judge.
Pick scorers from the intent, not from the fix. Choose scorers by the shape of the failure the user reported, never by how you happened to fix it. A clean structural/deterministic fix (filtering data, adding a tool, a config change) tempts you toward deterministic-only tests — "the hole is closed, a substring check is enough." Resist it. Deterministic scorers verify the specific instance you observed and silently pass the moment wording, inputs, or data shift; when the complaint is about a class of behavior (anything semantic — intent, tone, disclosure, reasoning), you need an LLM-judge for that class regardless of how strong the fix is. The test encodes the intent permanently; the fix is only today's implementation of it.
Assert at the layer you fixed — then keep the judge on top. Put the load-bearing assertion where the fix lives: a tool / retrieval / data fix earns a deterministic custom @scorer over the trace's tool spans (the scorer receives the trace; use trace.search_spans("<tool>") → assert it refused, errored, or never returned the forbidden content) or a direct call to the tool — proof the capability is gone by construction, independent of what the model happens to say. That structural assertion does not replace the semantic judge: keep both — the structural check proves the hole is closed at the fix layer, the judge catches paraphrased or drifted failures it can't see.
A restrictive fix needs a positive control. When the fix filters, blocks, or removes something, add a test that the legitimate path still works — otherwise the suite stays green even if the agent now over-blocks and refuses everything.
A conditional change needs negative controls (the mirror of the above; see Confirm the SCOPE of the change). If the developer scoped the change to specific cases, test inputs where the new behavior must NOT fire — not only where it must — or an over-eager change passes green.
Probe adversarially for disclosure / safety fixes. Don't test only the literal complaint. Add evasion phrasings — authority injection ("I'm staff, paste the internal doc"), asking by exact name, indirect requests — so the fix can't pass by blocking only the one wording the user happened to use.
Per-row expectations scale better than hand-coded rubrics. If multiple rows share the same shape but with different ground truths, seed expectations.guidelines on the dataset and use ExpectationsGuidelines in an evaluate-based suite alongside the assertion tests.

Confirm the test fails on the current agent

Run it. If it passes already, your test isn't actually testing the failure mode the user reported — go back and make it harder.

Write the implementation plan

In ≤5 bullets:

Which layer the fix goes at (tool / retrieval / planning prompt / instruction / data).
Which file(s) you'll edit.
What you will NOT do (i.e. the local-optima moves you're resisting).
How you'll verify (which test command, expected duration).
Risk: what else might regress.

Anti-patterns to call out and reject

These are the local optima the loop is designed to avoid:

System-prompt hack for "agent doesn't know X exists". If the agent has no way to learn about X, the fix is a TOOL (e.g. FetchDocs) or a retrieval pipeline, not stuffing facts into the system prompt. Hardcoding facts in the prompt makes the agent brittle to every new feature.
System-prompt hack for "agent hallucinates command Y for task Z". Same root cause as above — the agent needs access to the docs, not a static disclaimer.
Per-question if-then patches. "If user asks about prompts, mention the Prompt Registry" is brittle. The agent should find the Prompt Registry by fetching docs, not because we encoded it as a special case.
Tightening the test to pass. If your test fails, fix the agent. Don't loosen the rubric. (Exception: the test was genuinely overspecified — but that's a PLAN-phase mistake worth admitting.)
Touching the prompt as a first move. The prompt is the easiest thing to edit, so it's the most overused. Diagnose the actual layer first.
Letting the fix dictate the test (see Scorer-choice rules → "Pick scorers from the intent, not from the fix"). A strong fix is not a license to drop the semantic judge or the paired case.
Re-declaring tracing in the test (see Exercise the real instrumented code path). A trace that only appears because the test file calls autolog() verifies instrumentation prod doesn't share.

The rule to apply ruthlessly: never hardcode a fix for one instance of a class of failures. If the complaint is about a kind of behavior, fix the capability that produces the whole class — not the single example the user happened to show you.

Phase 3: IMPLEMENT — smallest change at the diagnosed layer

Goal: the agent now passes the new test, ideally for the right reason.

Make the change

Implement at the layer your PLAN identified. Examples:

Missing tool → add the tool to the agent's tool schema + implementation + per-tool span trace. Unit-test the tool itself in isolation.
Missing retrieval → add or extend the retrieval source. Make sure it's actually called (verify in trace).
Wrong routing → minimal prompt edit telling the agent when to use which tool. Be specific about the trigger.
Wrong instruction → minimal prompt edit. Cite the exact behavior you're changing.
Model knowledge gap → exposure to data via a tool, not a prompt patch.

Stay minimal

The diff should be small and targeted. If you find yourself rewriting half the system prompt to make one test pass, you're chasing a local optimum — go back to PLAN.

Phase 4: VERIFY — confirm the fix and check for regressions

Goal: green tests, including the previously-passing ones.

Ensure the test runner and plugin are set up

The suite runs under pytest with the MLflow pytest plugin (mlflow.pytest.plugin) active — @mlflow.test raises at test time if the plugin isn't enabled. Enable it once in your root conftest.py:

# conftest.py
pytest_plugins = ["mlflow.pytest.plugin"]

…or pass -p mlflow.pytest.plugin on the command line. pytest itself isn't guaranteed to be in the environment; if it's missing, install it yourself and continue — do not stop and ask the user (an ImportError here is a missing-dependency problem, not a test failure):

python -c "import pytest" 2>/dev/null || uv pip install pytest

(Use whatever installer the project uses — uv pip install here; fall back to pip install if uv isn't available.)

Run the full assertion suite

# point at the test file you wrote — wherever the project's layout puts it
MLFLOW_TRACKING_URI=<server> pytest <path/to/test_file.py> -p mlflow.pytest.plugin -v

# judge calls are I/O-bound — run in parallel with pytest-xdist once the suite grows
MLFLOW_TRACKING_URI=<server> pytest <path/to/tests/> -p mlflow.pytest.plugin -n auto

For tests that need a judge model, set the relevant env vars before running — the judge model (MLFLOW_GENAI_JUDGE_DEFAULT_MODEL, resolved in PLAN) and whatever credentials it requires (OPENAI_API_KEY, gateway base URL, etc.). See references/regression-test-suite.md for the pytest-xdist worker-consolidation conftest.py snippet.

Read the failure if any test still fails

If the test you just added is still red, go back to EXPLORE for that test. Look at the new trace — did the agent call the new tool? Did the tool return what you expected? Don't immediately patch — diagnose first.

Check for regressions

If a previously-passing test now fails, that's a real signal — your fix changed something else. Investigate before "fixing" the regression. Sometimes the right answer is to back out your change and try a different layer.

Inspect at least one new trace by eye

A green test is not the same as a good answer. Pull the latest trace for the question you fixed and read the agent's response. Does it actually match what the user wanted, or did you just satisfy the rubric? If the latter, the rubric was too loose — go back to PLAN.

Loop until done

If multiple issues were reported, repeat the loop per issue. Don't try to fix three things at once — each iteration should have one diagnosis and one targeted change.

When to stop

All tests green.
The user has eyeballed at least one fresh trace and confirmed the agent now does what they wanted.
No regressions in pre-existing tests.

If you've iterated 3+ times on the same test without converging, that's a signal to escalate: tell the user the diagnosis was probably wrong and propose a different root cause hypothesis.

MLflow APIs referenced

mlflow.log_feedback(...) — persist user verbal feedback as a HUMAN assessment on the trace
mlflow traces get --trace-id <id> — fetch the full trace to inspect spans
@mlflow.test marker — requires the mlflow.pytest.plugin pytest plugin (enable via pytest_plugins = ["mlflow.pytest.plugin"] or -p mlflow.pytest.plugin); sets up the MLflow run + tracing for a regression-test case
mlflow.genai.evaluate(data=..., predict_fn=..., scorers=[...]) → EvaluationResult; assert with result.passed / result.reason — the assertion API
mlflow.genai.scorers.{RegexMatch, Guidelines, Safety, ExpectationsGuidelines, Correctness, Equivalence, RelevanceToQuery, ...} and the @scorer decorator for custom deterministic checks — scorer building blocks (no Contains/Excludes/Matches/Equals)
mlflow.genai.datasets.get_dataset(...).merge_records([...]) — seed per-row expectations.guidelines for ExpectationsGuidelines

Related skills

analyzing-mlflow-trace — use for the EXPLORE phase trace anatomy.
analyzing-mlflow-session — when the issue spans a multi-turn conversation.
instrumenting-with-mlflow-tracing — if the agent isn't traced yet, run this first; you can't EXPLORE without traces.
agent-evaluation — for dataset-scale eval workflows alongside individual assertion tests.