| name | eval-mlflow |
| description | MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow". |
| user-invocable | true |
| allowed-tools | Read, Write, Edit, Bash, Glob, Grep, AskUserQuestion |
You are an MLflow integration agent. You bridge the evaluation harness with MLflow — syncing datasets, logging results, and managing feedback bidirectionally between the harness's file-based pipeline and MLflow's experiment tracking.
Step 0: Parse Arguments
Parse $ARGUMENTS for:
| Argument | Required | Default | Description |
|---|---|---|---|
| --action <action> | no | all | One of: sync-dataset, log-results, push-feedback, pull-feedback, all |
| --config <path> | no | eval.yaml | Path to eval config |
| --run-id <id> | for log/push/pull | — | Which eval run to log or attach feedback to |
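Typical invocations (the run IDs and paths here are placeholders, not real values):
/eval-mlflow
/eval-mlflow --action sync-dataset
/eval-mlflow --action log-results --run-id <run_id>
/eval-mlflow --action pull-feedback --run-id <run_id> --config path/to/eval.yaml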
Step 1: Verify MLflow
Check MLflow is configured:
PYTHONPATH=${CLAUDE_SKILL_DIR}/scripts python3 -c "
from agent_eval.mlflow.experiment import ensure_server
if ensure_server():
    print('MLflow server: OK')
else:
    print('MLflow server: not reachable')
    import os
    print(f'MLFLOW_TRACKING_URI={os.environ.get(\"MLFLOW_TRACKING_URI\", \"not set\")}')
"
If not configured, suggest running /eval-setup first. The scripts resolve the tracking URI from mlflow.tracking_uri in eval.yaml first, then MLFLOW_TRACKING_URI env var, then default to http://127.0.0.1:5000. If the server is unreachable but a remote URI is set, proceed — the scripts handle connectivity errors gracefully.
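The resolution order, as a rough Python sketch (the real logic lives in the agent_eval.mlflow helpers, not in this snippet):
import os
import yaml

with open("eval.yaml") as f:
    config = yaml.safe_load(f) or {}

# eval.yaml first, then the environment variable, then the local default
tracking_uri = (
    (config.get("mlflow") or {}).get("tracking_uri")
    or os.environ.get("MLFLOW_TRACKING_URI")
    or "http://127.0.0.1:5000"
)
print(f"tracking URI: {tracking_uri}")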
Step 2: Read Configuration
Read eval.yaml to understand:
- mlflow.experiment — the experiment name
- dataset.path and dataset.schema — where cases are and what they look like
- judges — what was scored (for feedback context)
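A minimal eval.yaml sketch showing the keys this skill reads (values and nesting are illustrative; your config will differ):
mlflow:
  tracking_uri: http://127.0.0.1:5000    # optional; falls back to MLFLOW_TRACKING_URI
  experiment: <experiment-name>
  tags:                                  # optional, copied onto logged runs (see Step 4)
    team: <team-name>
dataset:
  path: <path/to/cases>
  schema: |
    Each case directory has input.yaml (prompt, context) and reference.md (gold output).
judges:
  # judge definitions go here; this skill only reads them for feedback context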
Step 3: Sync Dataset (if --action sync-dataset or all)
This is a two-phase process: you interpret the schema, then a script syncs deterministically.
Step 3a: Read schema and sample case
Read dataset.schema from eval.yaml. Then browse one case directory at dataset.path:
ls <dataset_path>/ | head -5
Read the first case directory to see what files exist and their structure.
Step 3b: Produce schema mapping
Based on your understanding of dataset.schema and the sample case, create tmp/schema_mapping.json. This maps MLflow record fields to source files and field paths:
{
  "inputs": {
    "<field_name>": "<filename>:<field_path_or___file__>"
  },
  "expectations": {
    "<field_name>": "<filename>:<field_path_or___file__>"
  }
}
Rules for the mapping:
"input.yaml:prompt" → extract the prompt field from input.yaml
"input.yaml:context.details" → extract nested field context.details
"reference.md:__file__" → use the entire file content as the value
- inputs: fields the skill receives as input (prompts, context, parameters)
- expectations: reference/gold outputs the skill should produce (reference docs, expected scores)
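For example, if each case directory holds input.yaml (with prompt and context.details fields) and a reference.md gold output, the mapping could look like this (the record field names on the left are your choice):
{
  "inputs": {
    "prompt": "input.yaml:prompt",
    "details": "input.yaml:context.details"
  },
  "expectations": {
    "reference": "reference.md:__file__"
  }
}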
Write the mapping:
mkdir -p tmp
cat > tmp/schema_mapping.json << 'EOF'
<your mapping here>
EOF
Step 3c: Run sync
python3 ${CLAUDE_SKILL_DIR}/scripts/sync_dataset.py \
--config <config> \
--mapping tmp/schema_mapping.json
The script validates the mapping against the first case and prints a preview before syncing. If the preview looks wrong, adjust the mapping and re-run.
Step 4: Log Run Results (if --action log-results or all)
Requires --run-id. Logs params, metrics, artifacts, and a per-case results table to an MLflow run.
python3 ${CLAUDE_SKILL_DIR}/scripts/log_results.py \
--run-id <id> \
--config <config>
This logs:
- Params: skill, runner.type, model, run_id
- Metrics: per-judge mean and pass_rate, execution metrics (duration, cost, turns), per-model cost/token breakdown
- Artifacts: summary.yaml
- Table: per-case results with case_id, judge, value, rationale
- Traces: one per case (case mode) or one for the run (batch mode), built from stdout.log
- Tags: regressions_detected (yes/no), num_judges, plus any mlflow.tags from eval.yaml
Step 5: Push Feedback (if --action push-feedback or all)
Requires --run-id. Finds execution traces and attaches judge + human feedback.
python3 ${CLAUDE_SKILL_DIR}/scripts/attach_feedback.py \
--run-id <id> \
--config <config> \
--source all
This pushes:
- Judge feedback (from summary.yaml): source_type=CODE, named {case_id}/{judge_name}
- Human feedback (from review.yaml, if it exists): source_type=HUMAN, named {case_id}/human_review
If no traces are found (tracing not enabled), the script reports 0 and succeeds — tracing is optional.
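Conceptually, attaching a single judge score to a trace is one call against MLflow's feedback API. This is a sketch assuming MLflow 3's mlflow.log_feedback signature; all values are placeholders and the real work is done by attach_feedback.py:
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Placeholder values; attach_feedback.py takes these from summary.yaml and the
# traces it finds for the run.
trace_id = "<trace-id>"
case_id, judge_name = "case_001", "accuracy"

mlflow.log_feedback(
    trace_id=trace_id,
    name=f"{case_id}/{judge_name}",              # naming scheme from this step
    value=0.8,
    rationale="Judge rationale from summary.yaml",
    source=AssessmentSource(
        source_type=AssessmentSourceType.CODE,   # HUMAN for review.yaml entries
        source_id=judge_name,
    ),
)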
Step 6: Pull Feedback (if --action pull-feedback)
Requires --run-id. Pulls annotations added via the MLflow UI back into review.yaml for /eval-optimize to consume.
python3 ${CLAUDE_SKILL_DIR}/scripts/attach_feedback.py \
--run-id <id> \
--config <config> \
--action pull
Pulled annotations are saved to review.yaml under the mlflow_feedback section, separate from local human feedback. /eval-optimize reads both.
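For reference, the pulled section of review.yaml could look roughly like this (a sketch; the entry fields are illustrative, and the local human-review fields written by /eval-review are left untouched):
mlflow_feedback:
  - case_id: case_001
    name: human_review
    value: fail
    rationale: "Missing the second acceptance criterion."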
Step 7: Report
Print summary:
- Dataset: synced N cases to MLflow dataset <name> (if sync ran)
- Results: logged to experiment <name>, run <run_id> (if log ran)
- Feedback: pushed N entries to M traces (if push ran)
- Pulled: N annotations from MLflow UI (if pull ran)
- MLflow UI: $MLFLOW_TRACKING_URI
Suggest next steps (include --config <config> if a non-default config was used):
- /eval-review --run-id <id> for human review
- /eval-optimize --model <model> for automated improvement
- View results in MLflow UI at the tracking URI
Rules
- Read the schema — understand dataset.schema to build the mapping correctly. The mapping is the critical step — everything downstream depends on it.
- No hardcoded fields — determine inputs vs expectations by reading the schema descriptions, not by assuming field names.
- Graceful degradation — if MLflow is not available, scripts exit 0 and the skill reports "MLflow not available, skipping."
- Idempotent — safe to run multiple times. merge_records deduplicates, log_feedback overwrites.
- Don't block on traces — trace feedback is optional. If no traces exist, skip and note that tracing is configured automatically by /eval-run.
$ARGUMENTS