| name | querying-mlflow-metrics |
| description | Fetches aggregated trace metrics (token usage, latency, trace counts, quality evaluations) from MLflow tracking servers. Triggers on requests to show metrics, analyze token usage, view LLM costs, check usage trends, or query trace statistics. |
MLflow Metrics
Run scripts/fetch_metrics.py to query metrics from an MLflow tracking server.
Examples
Token usage summary:
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG
Output: AVG: 223.91 SUM: 7613
Hourly token trend (last 24h):
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
-t 3600 --start-time="-24h" --end-time=now
Output: Time-bucketed token sums per hour
Latency percentiles by trace:
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m latency -a AVG,P95 -d trace_name
Error rate by status:
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m trace_count -a COUNT -d trace_status
Quality scores by evaluator (assessments):
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
-m assessment_value -a AVG,P50 -d assessment_name
Output: Average and median scores for each evaluator (e.g., correctness, relevance)
Assessment count by name:
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
-m assessment_count -a COUNT -d assessment_name
JSON output: Add -o json to any command.
Arguments
| Arg | Required | Description |
|---|
-s, --server | Yes | MLflow server URL |
-x, --experiment-ids | Yes | Experiment IDs (comma-separated) |
-m, --metric | Yes | trace_count, latency, input_tokens, output_tokens, total_tokens |
-a, --aggregations | Yes | COUNT, SUM, AVG, MIN, MAX, P50, P95, P99 |
-d, --dimensions | No | Group by: trace_name, trace_status |
-t, --time-interval | No | Bucket size in seconds (3600=hourly, 86400=daily) |
--start-time | No | -24h, -7d, now, ISO 8601, or epoch ms |
--end-time | No | Same formats as start-time |
-o, --output | No | table (default) or json |
For SPANS metrics (span_count, latency), add -v SPANS.
For ASSESSMENTS metrics, add -v ASSESSMENTS.
See references/api_reference.md for filter syntax and full API details.