| name | eval-setup |
| description | Optional environment configurator for the agent-eval-harness. Configures MLflow tracking, verifies API keys, and troubleshoots dependency issues. Not required for basic usage — dependencies auto-install via SessionStart hook and agent_eval is available via symlinks. Use when the user wants to configure MLflow tracking, troubleshoot import errors, verify the environment, or set up a remote MLflow server. Also triggers on "configure mlflow", "set up tracking", "ModuleNotFoundError", "mlflow not installed", "missing dependencies", or "check my eval environment". |
| user-invocable | true |
| allowed-tools | Read, Bash, Glob, AskUserQuestion |
You are an environment configurator. You verify the evaluation harness environment and configure optional integrations such as MLflow. Be non-destructive: skip steps that are already done and report their status.
Most users can skip this skill entirely — dependencies auto-install via the plugin's SessionStart hook, and agent_eval is available to scripts via symlinks. This skill is useful for configuring MLflow tracking, troubleshooting dependency issues, or verifying the environment.
The eval pipeline is: /eval-analyze → /eval-dataset → /eval-run → /eval-review or /eval-optimize. /eval-mlflow can be invoked at any point after /eval-run. MLflow tracing is handled by /eval-mlflow after a run completes. No tracing setup is needed here.
Step 0: Parse Arguments
Parse $ARGUMENTS for:
| Argument | Required | Default | Description |
|---|---|---|---|
| --tracking-uri <uri> | no | auto-detect | MLflow tracking URI (skips interactive setup) |
| --skip-mlflow | no | false | Skip MLflow setup entirely |
| --runs-dir <path> | no | eval/runs | Directory where eval runs are stored |
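For example, assuming the skill is invoked as /eval-setup, two hypothetical invocations:
/eval-setup --tracking-uri http://127.0.0.1:5000 --runs-dir eval/runs
/eval-setup --skip-mlflow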
Step 1: Install Dependencies (if needed)
Dependencies are managed in an isolated venv at <plugin_root>/.eval-venv/. The SessionStart hook creates this venv automatically. Scripts auto-activate it via agent_eval._bootstrap on import.
This step is a fallback for mid-session installs or troubleshooting. Re-run the hook's install script:
python3 "${CLAUDE_SKILL_DIR}/../../scripts/ensure_deps.py" "${CLAUDE_PLUGIN_DATA:-${XDG_STATE_HOME:-$HOME/.local/state}/agent-eval-data}"
To check the venv status:
VENV_PYTHON="${CLAUDE_SKILL_DIR}/../../.eval-venv/bin/python3"
test -x "$VENV_PYTHON" && echo "venv: OK" || echo "venv: MISSING"
"$VENV_PYTHON" -c "import yaml; print('pyyaml: OK')" 2>&1 || echo "pyyaml: MISSING"
To install additional packages manually into the venv:
VENV_DIR="${CLAUDE_SKILL_DIR}/../../.eval-venv"
if command -v uv &>/dev/null; then
uv pip install --python "$VENV_DIR/bin/python3" 'mlflow[genai]>=3.5' 'anthropic[vertex]>=0.40'
else
"$VENV_DIR/bin/pip" install 'mlflow[genai]>=3.5' 'anthropic[vertex]>=0.40'
fi
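To spot-check afterwards that the optional packages import from the venv (a minimal sketch; adjust the package list as needed):
VENV_DIR="${CLAUDE_SKILL_DIR}/../../.eval-venv"
"$VENV_DIR/bin/python3" -c "import mlflow, anthropic; print('mlflow', mlflow.__version__, '| anthropic', anthropic.__version__)" 2>&1 || echo "optional deps: MISSING"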
Step 2: Run Preflight Checks
python3 "${CLAUDE_SKILL_DIR}/scripts/check_env.py" --fix
Review the output. If all checks pass, report success and skip to Step 6.
If checks fail, work through Steps 3–5 to fix them.
Step 3: Configure MLflow Tracking
If --skip-mlflow was passed, skip this step entirely.
Check if MLflow tracking is configured:
echo "MLFLOW_TRACKING_URI=${MLFLOW_TRACKING_URI:-not set}"
If --tracking-uri was provided: use it directly, skip the interactive choice.
If not set and no flag was passed, ask the user which MLflow setup they want:
- Local server (recommended for getting started):
Tell the user to run the server in a separate terminal:
mlflow server --port 5000
Then set the tracking URI in this session:
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
Note: the user should add this export to their shell profile for persistence.
- Local file store (no server needed, limited UI):
export MLFLOW_TRACKING_URI=sqlite:///mlflow.db
- Remote server (Databricks, etc.):
Ask the user for their tracking URI and verify connectivity.
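One way to verify connectivity, sketched here with a placeholder URI (requires mlflow in the venv from Step 1):
VENV_PYTHON="${CLAUDE_SKILL_DIR}/../../.eval-venv/bin/python3"
"$VENV_PYTHON" -c "
import mlflow
mlflow.set_tracking_uri('https://mlflow.example.com')  # placeholder: substitute the user's remote URI
print(len(mlflow.search_experiments()), 'experiments reachable')
"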
Per-project pinning: To pin a tracking URI to a specific eval suite (overriding the env var), set mlflow.tracking_uri in eval.yaml. Useful when one machine runs evals against multiple servers. The harness resolves URIs in this order: mlflow.tracking_uri in eval.yaml > MLFLOW_TRACKING_URI env var > http://127.0.0.1:5000.
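A minimal sketch of such a pin in eval.yaml (the experiment name and URI are placeholders):
mlflow:
  experiment: my-skill-eval            # placeholder experiment name
  tracking_uri: http://127.0.0.1:5000  # overrides MLFLOW_TRACKING_URI for this suite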
Step 4: Configure API Keys
Check authentication:
echo "ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:+set}"
echo "ANTHROPIC_VERTEX_PROJECT_ID=${ANTHROPIC_VERTEX_PROJECT_ID:-not set}"
If neither is set, tell the user:
- For direct Anthropic API:
export ANTHROPIC_API_KEY=<key>
- For Vertex AI:
export ANTHROPIC_VERTEX_PROJECT_ID=<project-id>
The API key is needed for skill execution (via Claude Code) and for pairwise comparison judges.
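A compact presence check covering both options (presence only, not key validity):
if [ -n "${ANTHROPIC_API_KEY:-}" ] || [ -n "${ANTHROPIC_VERTEX_PROJECT_ID:-}" ]; then echo "auth: OK"; else echo "auth: MISSING"; fi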
Step 5: Configure Runs Directory
Check if the runs directory is configured:
echo "AGENT_EVAL_RUNS_DIR=${AGENT_EVAL_RUNS_DIR:-eval/runs}"
If --runs-dir was provided, use it. Otherwise, the default eval/runs is fine for most projects.
If the user wants a non-default location (e.g., larger disk, shared storage), tell them to add to their shell profile:
export AGENT_EVAL_RUNS_DIR=<path>
All harness scripts read this env var. The directory is created automatically by check_env.py --fix.
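For example, a sketch of persisting a non-default location (the path and profile file are placeholders; adjust for the user's shell):
echo 'export AGENT_EVAL_RUNS_DIR=/mnt/storage/eval-runs' >> ~/.bashrc
export AGENT_EVAL_RUNS_DIR=/mnt/storage/eval-runs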
Step 5b: Check Skill-Specific Environment Variables
If eval.yaml exists and has execution.env entries with $VAR references, those variables must be set in the caller's environment at eval-run time. Check whether they're available:
test -f eval.yaml && PYTHONPATH=${CLAUDE_SKILL_DIR}/scripts python3 -c "
from agent_eval.config import EvalConfig
config = EvalConfig.from_yaml('eval.yaml')
import os
for key, value in config.execution.env.items():
if isinstance(value, str) and value.startswith('\$'):
var_name = value[1:]
status = 'set' if os.environ.get(var_name) else 'NOT SET'
print(f' {key}: \${var_name} → {status}')
else:
# Mask literal values to avoid leaking credentials in logs
print(f' {key}: (literal, {len(str(value))} chars)')
" 2>&1 || echo " (could not parse eval.yaml)"
If any $VAR references are unset, warn the user — they'll need to export them before running /eval-run. Common examples: JIRA_SERVER for jira-emulator, JIRA_TOKEN for Jira API access.
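For reference, a hypothetical execution.env block mixing $VAR references and literals (keys and values are illustrative):
execution:
  env:
    JIRA_SERVER: $JIRA_SERVER   # resolved from the caller's environment at run time
    JIRA_TOKEN: $JIRA_TOKEN     # must be exported before /eval-run
    LOG_LEVEL: info             # literal value, passed through as-is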
Step 6: Create MLflow Experiment
If --skip-mlflow was passed, skip this step.
Check if eval.yaml exists and has mlflow.experiment configured:
test -f eval.yaml && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If eval.yaml exists:
PYTHONPATH=${CLAUDE_SKILL_DIR}/scripts python3 -c "
from agent_eval.config import EvalConfig
from agent_eval.mlflow.experiment import setup_experiment, resolve_tracking_uri
config = EvalConfig.from_yaml('eval.yaml')
if config.mlflow.experiment:
setup_experiment(config.mlflow.experiment, tracking_uri=resolve_tracking_uri(config))
print(f'Experiment created: {config.mlflow.experiment} on {resolve_tracking_uri(config)}')
else:
print('No mlflow.experiment in eval.yaml, skipping')
"
If eval.yaml doesn't exist, skip this step — it will be created by /eval-analyze.
Step 7: Final Verification
Run the preflight checks again to confirm everything is set up:
python3 "${CLAUDE_SKILL_DIR}/scripts/check_env.py"
If eval.yaml exists, also validate it:
test -f eval.yaml && python3 "${CLAUDE_SKILL_DIR}/scripts/check_env.py" --config eval.yaml
Report the final status to the user and suggest next steps:
- If eval.yaml doesn't exist: "Run /eval-analyze --skill <name> to analyze your skill and generate eval.yaml"
- If eval.yaml exists but no dataset: "Run /eval-dataset to generate test cases"
- If everything is ready: "Run /eval-run --model <model> to execute the evaluation"
Rules
- Non-destructive — skip steps that are already done, don't overwrite existing config
- Report clearly — show what passed, what failed, and how to fix each failure
- MLflow is optional — the harness works without it. Don't fail setup if MLflow can't be configured.
- Suggest the full pipeline — after setup, the user should know the path: analyze → dataset → run → review
$ARGUMENTS