
evaluation


Updated March 31, 2026 at 18:30

Model evaluation framework for comparing LLM outputs (Haiku vs Sonnet vs fine-tuned). Use when building eval infrastructure, running model comparisons, or setting up the RLHF training pipeline. Status: PLANNED — build in Phase 2.2.


Source

fedickinson

fedickinson/budget-buddy-2


SKILL.md


name	evaluation
description	Model evaluation framework for comparing LLM outputs (Haiku vs Sonnet vs fine-tuned). Use when building eval infrastructure, running model comparisons, or setting up the RLHF training pipeline. Status: PLANNED — build in Phase 2.2.

Evaluation Framework

STATUS: PLANNED — Build in Phase 2.2, after the steel thread (Phase 1) is working. The gateway (Phase 1.2) must exist first — evaluation is built on top of it.

Purpose

Two connected goals:

Production model selection — A/B test Haiku vs Sonnet on cost/quality tradeoffs to optimize routing
CS 5788 / RLHF research — Measure the quality gap between general-purpose Claude models and domain fine-tuned models on financial insight generation

EvalComparison Table

CREATE TABLE eval_comparisons (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    experiment_id       TEXT NOT NULL,          -- groups related comparisons
    surface             TEXT NOT NULL,          -- "classification" | "ambient" | "chat" | "review"
    model_a             TEXT NOT NULL,          -- model key from MODEL_REGISTRY
    model_b             TEXT NOT NULL,
    input_prompt        TEXT NOT NULL,          -- the exact prompt sent to both models
    output_a            TEXT NOT NULL,          -- model_a response
    output_b            TEXT NOT NULL,          -- model_b response
    trace_id_a          TEXT,                   -- Langfuse trace for model_a call
    trace_id_b          TEXT,                   -- Langfuse trace for model_b call
    cost_usd_a          FLOAT,
    cost_usd_b          FLOAT,
    human_preferred     TEXT,                   -- "a" | "b" | "tie" | null (if not rated)
    judge_preferred     TEXT,                   -- "a" | "b" | "tie" (from LLM-as-judge)
    judge_rationale     TEXT,
    judge_score_a       FLOAT,                  -- 1-5 scale
    judge_score_b       FLOAT,
    created_at          TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON eval_comparisons (experiment_id, surface, created_at DESC);
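
A minimal sketch of how rows from this table might be rolled up into a per-experiment summary. It assumes the rows have already been fetched as dictionaries keyed by the column names above; the function name summarize_experiment and the shape of the returned dictionary are illustrative, not part of the planned codebase.

```python
from collections import Counter
from statistics import mean
from typing import Iterable, Optional


def summarize_experiment(rows: Iterable[dict]) -> dict:
    """Aggregate eval_comparisons rows (dicts keyed by column name)."""
    rows = list(rows)

    # Count judge verdicts, skipping rows the judge has not rated yet.
    judged = Counter(r["judge_preferred"] for r in rows if r.get("judge_preferred"))
    n_judged = sum(judged.values())

    # Judge/human agreement is only defined where both ratings exist.
    both = [r for r in rows if r.get("judge_preferred") and r.get("human_preferred")]
    agree = sum(r["judge_preferred"] == r["human_preferred"] for r in both)

    def column_mean(name: str) -> Optional[float]:
        values = [r[name] for r in rows if r.get(name) is not None]
        return mean(values) if values else None

    return {
        "n": len(rows),
        "win_rate_a": judged["a"] / n_judged if n_judged else None,
        "win_rate_b": judged["b"] / n_judged if n_judged else None,
        "tie_rate": judged["tie"] / n_judged if n_judged else None,
        "judge_human_agreement": agree / len(both) if both else None,
        "mean_cost_usd_a": column_mean("cost_usd_a"),
        "mean_cost_usd_b": column_mean("cost_usd_b"),
        "mean_judge_score_a": column_mean("judge_score_a"),
        "mean_judge_score_b": column_mean("judge_score_b"),
    }
```

Reporting judge/human agreement alongside win rates gives a quick check on whether the LLM judge can be trusted for rows that have no human rating.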

Model Router (A/B Experiments)

The model router integrates with the gateway. Callers use call_llm() normally — routing is transparent.

# backend/services/model_router.py

from typing import Optional


async def route_model(
    surface: str,
    user_id: str,
    experiment_id: Optional[str] = None,  # None means default routing
) -> str:  # returns model key ("haiku" | "sonnet" | ...)
    """
    Default routing: surface → model per the routing table in llm-gateway skill.
    Experiment routing: assigns user to experiment arm (deterministic by user_id hash).
    """
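
The "deterministic by user_id hash" assignment described in the docstring could look roughly like the sketch below. Everything here is an assumption for illustration: the helper names assign_arm and route_model_sketch, the default_routes argument (standing in for the routing table in the llm-gateway skill), and the two-arm default. It uses hashlib rather than Python's built-in hash(), because hash() on strings is randomized per process and would not give stable assignments.

```python
import hashlib
from typing import Optional, Sequence


def assign_arm(user_id: str, experiment_id: str, arms: Sequence[str]) -> str:
    """Pick an experiment arm deterministically from a hash of the user id."""
    # Salting with experiment_id keeps a user's arm independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(arms)
    return arms[bucket]


def route_model_sketch(
    surface: str,
    user_id: str,
    default_routes: dict,  # hypothetical stand-in for the gateway routing table
    experiment_id: Optional[str] = None,
    arms: Sequence[str] = ("haiku", "sonnet"),  # assumed two-arm experiment
) -> str:
    """Return a model key: default route, or an experiment arm if one is active."""
    if experiment_id is None:
        return default_routes[surface]
    return assign_arm(user_id, experiment_id, arms)
```

Because the assignment depends only on the experiment and user ids, the same user sees the same model for the life of an experiment without any stored assignment state.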

LLM-as-Judge Pattern

# backend/services/buddy_prompts/eval_judge.py

JUDGE_PROMPT = """
You are evaluating two AI responses to the same financial assistant prompt.

USER CONTEXT: {context_summary}

PROMPT SENT TO BOTH MODELS:
{input_prompt}

RESPONSE A:
{output_a}

RESPONSE B:
{output_b}

Score each response on:
1. Accuracy (does it reflect the actual financial data?) — 1-5
2. Tone (warm, direct, non-judgmental — like a smart friend) — 1-5
3. Actionability (does it help the user know what to do?) — 1-5
4. Specificity (does it reference actual amounts/merchants?) — 1-5

Which response is better overall? ("A" | "B" | "tie")
Explain in 1-2 sentences why.

Respond as JSON: {"score_a": X, "score_b": X, "preferred": "A|B|tie", "rationale": "..."}
"""
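
A sketch of how the template might be filled and the judge's reply parsed, under stated assumptions. Two details of the prompt motivate it: the last line contains literal JSON braces, so str.format() on the whole template would fail, and the prompt asks for "A|B|tie" while the judge_preferred column comment lists lowercase values. The helper names fill_judge_prompt and parse_judge_response are hypothetical.

```python
import json
import re


def fill_judge_prompt(template: str, fields: dict) -> str:
    """Replace only the named {placeholders}, leaving other braces untouched."""
    pattern = re.compile(r"\{(" + "|".join(re.escape(k) for k in fields) + r")\}")
    # A single pass means text inserted for one field is never re-scanned.
    return pattern.sub(lambda m: fields[m.group(1)], template)


def parse_judge_response(raw: str) -> dict:
    """Validate the judge's JSON and map it onto eval_comparisons column names."""
    data = json.loads(raw)

    preferred = str(data["preferred"]).strip().lower()  # "A" -> "a"
    if preferred not in {"a", "b", "tie"}:
        raise ValueError(f"unexpected preferred value: {data['preferred']!r}")

    scores = {}
    for key in ("score_a", "score_b"):
        value = float(data[key])
        if not 1.0 <= value <= 5.0:
            raise ValueError(f"{key} outside the 1-5 scale: {value}")
        scores[key] = value

    return {
        "judge_preferred": preferred,
        "judge_score_a": scores["score_a"],
        "judge_score_b": scores["score_b"],
        "judge_rationale": str(data.get("rationale", "")),
    }
```

Rejecting malformed replies at parse time keeps out-of-range scores and unexpected verdict strings from reaching the table.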

Four-Way Comparison (CS 5788 Design)

The academic contribution requires comparing:

base_mistral      — Mistral-7B with no fine-tuning
sft_mistral       — Mistral-7B after supervised fine-tuning on Budget Buddy data
sft_dpo_mistral   — After SFT + DPO (preference alignment)
claude_ceiling    — Claude Sonnet as the quality ceiling
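
Since eval_comparisons stores one pair of models per row (model_a, model_b), a four-way comparison decomposes into six pairwise comparisons. A small sketch of enumerating them follows; the function name is illustrative, and whether all six pairs are actually run is a design decision this document does not settle.

```python
from itertools import combinations

ARMS = ["base_mistral", "sft_mistral", "sft_dpo_mistral", "claude_ceiling"]


def pairwise_comparisons(arms):
    """Return every unordered (model_a, model_b) pair, in list order."""
    return list(combinations(arms, 2))


pairs = pairwise_comparisons(ARMS)
# Four arms yield C(4, 2) = 6 pairs, each of which maps to eval_comparisons rows.
```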

Training data sources (collected automatically from Phase 1 onwards):

Ambient insight engagement (acknowledged/suppressed) → preference pairs
Chat thumbs up/down → direct quality signal
Action confirm/undo → implicit quality signal
Review Act 2 user responses → rich labeled data
User model communication preferences → personalization signal

The hypothesis: real user interaction data (from the context system) produces a better fine-tuning signal than synthetic data. The suppression system supports implicit preference learning without requiring explicit ratings.
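
One way the acknowledged/suppressed signal might be turned into DPO-style preference pairs is sketched below. The event shape (context, insight, outcome keys) and the pairing rule (an acknowledged insight is "chosen" over a suppressed one shown in the same context) are assumptions made for illustration; the source only states that engagement yields preference pairs.

```python
from collections import defaultdict
from itertools import product


def build_preference_pairs(events):
    """Pair acknowledged insights against suppressed ones sharing a context.

    Each event is assumed to be a dict with hypothetical keys:
    "context" (the prompt), "insight" (model output), "outcome".
    """
    by_context = defaultdict(lambda: {"acknowledged": [], "suppressed": []})
    for event in events:
        if event["outcome"] in ("acknowledged", "suppressed"):
            by_context[event["context"]][event["outcome"]].append(event["insight"])

    pairs = []
    for context, group in by_context.items():
        # Contexts with only one kind of outcome produce no pairs.
        for chosen, rejected in product(group["acknowledged"], group["suppressed"]):
            pairs.append({"prompt": context, "chosen": chosen, "rejected": rejected})
    return pairs
```

The prompt/chosen/rejected layout is the usual input shape for preference-optimization training, so the output can be fed to a DPO trainer with little further processing.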

Eval Runner CLI

# backend/scripts/run_eval_comparison.py

# Compare Haiku vs Sonnet on last 50 ambient insights:
python -m backend.scripts.run_eval_comparison \
  --experiment "haiku_vs_sonnet_ambient" \
  --surface ambient \
  --model_a haiku \
  --model_b sonnet \
  --sample_size 50

# Output: CSV + summary table in docs/v2/eval-results/
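
A sketch of the argument parsing such a script could use, built with the standard-library argparse module. The flag names come from the command above and the surface choices from the eval_comparisons schema; the default sample size of 50 and the function name build_parser are assumptions.

```python
import argparse

SURFACES = ["classification", "ambient", "chat", "review"]


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser for a pairwise eval sweep."""
    parser = argparse.ArgumentParser(
        prog="run_eval_comparison",
        description="Compare two models on sampled inputs for one surface.",
    )
    parser.add_argument("--experiment", required=True, help="experiment_id to group rows")
    parser.add_argument("--surface", required=True, choices=SURFACES)
    parser.add_argument("--model_a", required=True, help="model key, e.g. haiku")
    parser.add_argument("--model_b", required=True, help="model key, e.g. sonnet")
    # Default of 50 is an assumption taken from the example invocation.
    parser.add_argument("--sample_size", type=int, default=50)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"{args.experiment}: {args.model_a} vs {args.model_b} "
          f"on {args.sample_size} {args.surface} samples")
```

Restricting --surface to the four known values makes a typo fail at the command line instead of producing an empty sample.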

Files to Build (Phase 2.2)

backend/services/model_router.py              — A/B experiment routing
backend/database/models/eval_comparison.py    — EvalComparison model + migration
backend/services/eval_runner.py               — Run comparisons across models
backend/services/buddy_prompts/eval_judge.py  — LLM-as-judge prompt
backend/scripts/run_eval_comparison.py        — CLI for eval sweeps
tests/test_eval_runner.py                     — Unit tests (write FIRST)

Acceptance Criteria (Phase 2.2)

Model router integrates transparently with gateway (callers don't change)
Eval runner can compare Haiku vs Sonnet on same inputs
Results stored with quality scores + cost data
LLM-as-judge produces structured scores in EvalComparison table
Provider abstraction supports adding Modal without changing call sites
Langfuse traces for both model arms linked to same experiment_id
All tests pass

Metrics Tracked in Langfuse

Metric	Source	What it tells you
Engagement rate	Ambient insights acknowledged / shown	Are insights relevant?
Suppression rate	Topics suppressed / shown	Is Buddy annoying?
Action success rate	Actions confirmed / proposed	Is Buddy accurate?
Undo rate	Actions undone / executed	Is Buddy trustworthy?
Thumbs up ratio	Positive / total feedback	Is quality acceptable?
Cost per interaction	Token usage × model price	Is routing optimal?
Latency p50/p95	Trace duration	Is it fast enough?
Repetition rate	Same topic repeated within 24 hours	Is memory working?
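
Two of the rows above reduce to simple arithmetic, sketched here under assumptions. Cost per interaction is token usage times model price; the per-million-token prices are passed in as parameters because actual prices are not stated in this document. Latency percentiles use statistics.quantiles from the standard library, which needs at least two samples. Function names are illustrative.

```python
from statistics import quantiles


def interaction_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,   # caller-supplied price per million input tokens
    usd_per_mtok_out: float,  # caller-supplied price per million output tokens
) -> float:
    """Token usage x model price, with separate input and output rates."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


def latency_percentiles(durations_ms):
    """Return p50 and p95 of trace durations (requires at least two samples)."""
    cuts = quantiles(durations_ms, n=100)  # 99 cut points: cuts[0] is p1, cuts[98] is p99
    return {"p50": cuts[49], "p95": cuts[94]}
```

Tracking p95 next to p50 matters for routing decisions because a model with an acceptable median can still have a slow tail.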

Last Updated

2026-03-31 (Phase 0 scaffold)

