| name | evaluation |
| description | Model evaluation framework for comparing LLM outputs (Haiku vs Sonnet vs fine-tuned). Use when building eval infrastructure, running model comparisons, or setting up the RLHF training pipeline. Status: PLANNED — build in Phase 2.2. |
Evaluation Framework
STATUS: PLANNED — Build in Phase 2.2, after the steel thread (Phase 1) is working. The gateway (Phase 1.2) must exist first — evaluation is built on top of it.
Purpose
Two connected goals:
- Production model selection — A/B test Haiku vs Sonnet on cost/quality tradeoffs to optimize routing
- CS 5788 / RLHF research — Measure the quality gap between general-purpose Claude models and domain fine-tuned models on financial insight generation
EvalComparison Table
CREATE TABLE eval_comparisons (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    experiment_id TEXT NOT NULL,
    surface TEXT NOT NULL,
    model_a TEXT NOT NULL,
    model_b TEXT NOT NULL,
    input_prompt TEXT NOT NULL,
    output_a TEXT NOT NULL,
    output_b TEXT NOT NULL,
    trace_id_a TEXT,
    trace_id_b TEXT,
    cost_usd_a FLOAT,
    cost_usd_b FLOAT,
    human_preferred TEXT,
    judge_preferred TEXT,
    judge_rationale TEXT,
    judge_score_a FLOAT,
    judge_score_b FLOAT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON eval_comparisons (experiment_id, surface, created_at DESC);
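Once rows accumulate, a natural first check is how often the LLM judge agrees with human raters. The following is a minimal sketch, assuming both preference columns hold "A", "B", or "tie" (the schema itself does not constrain them) and that rows have already been loaded into memory. The `EvalComparison` dataclass and the `summarize` helper are hypothetical names for illustration, not part of the planned codebase.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalComparison:
    """In-memory mirror of the eval_comparisons columns this sketch needs."""
    experiment_id: str
    surface: str
    model_a: str
    model_b: str
    human_preferred: Optional[str] = None  # assumed values: "A", "B", "tie"
    judge_preferred: Optional[str] = None  # assumed values: "A", "B", "tie"


def summarize(rows: list[EvalComparison]) -> dict:
    """Count judge verdicts and measure judge/human agreement."""
    judged = [r for r in rows if r.judge_preferred is not None]
    verdicts = Counter(r.judge_preferred for r in judged)
    # Agreement is only defined where both a human and the judge gave a verdict.
    both = [r for r in judged if r.human_preferred is not None]
    agree = sum(1 for r in both if r.human_preferred == r.judge_preferred)
    return {
        "n_judged": len(judged),
        "judge_verdicts": dict(verdicts),
        "agreement": agree / len(both) if both else None,
    }
```

A low agreement rate would suggest the judge prompt needs revision before its scores are trusted for routing decisions.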
Model Router (A/B Experiments)
The model router integrates with the gateway. Callers use call_llm() normally — routing is transparent.
async def route_model(
    surface: str,
    user_id: str,
    experiment_id: str | None = None,
) -> str:
    """
    Default routing: surface → model per the routing table in the llm-gateway skill.
    Experiment routing: assigns the user to an experiment arm (deterministic by user_id hash).
    """
    raise NotImplementedError("Planned for Phase 2.2")
LLM-as-Judge Pattern
JUDGE_PROMPT = """
You are evaluating two AI responses to the same financial assistant prompt.
USER CONTEXT: {context_summary}
PROMPT SENT TO BOTH MODELS:
{input_prompt}
RESPONSE A:
{output_a}
RESPONSE B:
{output_b}
Score each response on:
1. Accuracy (does it reflect the actual financial data?) — 1-5
2. Tone (warm, direct, non-judgmental — like a smart friend) — 1-5
3. Actionability (does it help the user know what to do?) — 1-5
4. Specificity (does it reference actual amounts/merchants?) — 1-5
Which response is better overall? ("A" | "B" | "tie")
Explain in 1-2 sentences why.
Respond as JSON: {"score_a": X, "score_b": X, "preferred": "A|B|tie", "rationale": "..."}
"""
Four-Way Comparison (CS 5788 Design)
The academic contribution requires comparing:
- base_mistral: Mistral-7B with no fine-tuning
- sft_mistral: Mistral-7B after supervised fine-tuning on Budget Buddy data
- sft_dpo_mistral: Mistral-7B after SFT + DPO (preference alignment)
- claude_ceiling: Claude Sonnet as the quality ceiling
Training data sources (collected automatically from Phase 1 onwards):
- Ambient insight engagement (acknowledged/suppressed) → preference pairs
- Chat thumbs up/down → direct quality signal
- Action confirm/undo → implicit quality signal
- Review Act 2 user responses → rich labeled data
- User model communication preferences → personalization signal
The hypothesis: Real user interaction data (from the context system) produces better fine-tuning signal than synthetic data. The suppression system provides implicit preference learning without requiring explicit ratings.
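To make the "engagement → preference pairs" step concrete, here is a rough sketch of how binary feedback could be turned into the prompt/chosen/rejected records commonly used for DPO datasets. It assumes that the same prompt has received at least one positively and one negatively rated response, which may be rare in practice; `FeedbackEvent` and `build_preference_pairs` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FeedbackEvent:
    prompt: str
    response: str
    positive: bool  # True: acknowledged / thumbs up. False: suppressed / thumbs down.


def build_preference_pairs(events: list[FeedbackEvent]) -> list[dict]:
    """Pair every liked response with every disliked response to the same prompt."""
    by_prompt: dict[str, tuple[list[str], list[str]]] = defaultdict(lambda: ([], []))
    for event in events:
        liked, disliked = by_prompt[event.prompt]
        (liked if event.positive else disliked).append(event.response)

    pairs = []
    for prompt, (liked, disliked) in by_prompt.items():
        # Prompts with feedback of only one polarity yield no pairs.
        for chosen, rejected in product(liked, disliked):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Prompts with one-sided feedback are dropped here; whether to pair them against a baseline model's output instead is a design decision this sketch leaves open.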
Eval Runner CLI
python -m backend.scripts.run_eval_comparison \
    --experiment "haiku_vs_sonnet_ambient" \
    --surface ambient \
    --model_a haiku \
    --model_b sonnet \
    --sample_size 50
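The flags in that invocation could be parsed with `argparse` along these lines. This is only a sketch of the argument surface: which flags are required, the default sample size, and the help strings are assumptions, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser matching the flags used in the example invocation."""
    parser = argparse.ArgumentParser(
        prog="run_eval_comparison",
        description="Run a pairwise model comparison and record the results.",
    )
    parser.add_argument("--experiment", required=True, help="Experiment ID")
    parser.add_argument("--surface", required=True, help="Surface to sample prompts from")
    parser.add_argument("--model_a", required=True, help="First model to compare")
    parser.add_argument("--model_b", required=True, help="Second model to compare")
    # Defaulting to 50 mirrors the example invocation; it is an assumption.
    parser.add_argument("--sample_size", type=int, default=50, help="Prompts to sample")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"{args.experiment}: {args.model_a} vs {args.model_b} on {args.surface}")
```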
Files to Build (Phase 2.2)
- backend/services/model_router.py: A/B experiment routing
- backend/database/models/eval_comparison.py: EvalComparison model + migration
- backend/services/eval_runner.py: Run comparisons across models
- backend/services/buddy_prompts/eval_judge.py: LLM-as-judge prompt
- backend/scripts/run_eval_comparison.py: CLI for eval sweeps
- tests/test_eval_runner.py: Unit tests (write FIRST)
Acceptance Criteria (Phase 2.2)
Metrics Tracked in Langfuse
| Metric | Source | What it tells you |
|---|---|---|
| Engagement rate | Ambient insight acknowledged | Are insights relevant? |
| Suppression rate | Topics suppressed / shown | Is Buddy annoying? |
| Action success rate | Actions confirmed / proposed | Is Buddy accurate? |
| Undo rate | Actions undone / executed | Is Buddy trustworthy? |
| Thumbs up ratio | Positive / total feedback | Is quality acceptable? |
| Cost per interaction | Token usage × model price | Is routing optimal? |
| Latency p50/p95 | Trace duration | Is it fast enough? |
| Repetition rate | Same topic within 24hr | Is memory working? |
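Most rows in this table are simple ratios, while cost and latency need slightly more arithmetic. The sketch below shows the three calculations in plain Python, independent of Langfuse. The helper names are hypothetical, per-million-token pricing is an assumption, and the prices in the usage example are placeholders rather than real model prices.

```python
from statistics import quantiles
from typing import Optional


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Rate metrics (engagement, suppression, undo); None when there is no data."""
    return numerator / denominator if denominator else None


def cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Token usage x model price, with prices quoted per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def latency_percentiles(durations_ms: list[float]) -> dict:
    """p50 and p95 of trace durations; needs at least two samples."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


if __name__ == "__main__":
    print(ratio(3, 4))                           # undo rate: 3 undone of 4 executed
    print(cost_usd(1200, 300, 1.0, 5.0))         # placeholder prices
    print(latency_percentiles([120.0, 180.0, 250.0, 900.0]))
```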
Last Updated
2026-03-31 (Phase 0 scaffold)