evaluation

Updated: March 31, 2026, 18:30

Model evaluation framework for comparing LLM outputs (Haiku vs Sonnet vs fine-tuned). Use when building eval infrastructure, running model comparisons, or setting up the RLHF training pipeline. Status: PLANNED — build in Phase 2.2.

Source: fedickinson/budget-buddy-2 (GitHub)

SKILL.md

name: evaluation
description: Model evaluation framework for comparing LLM outputs (Haiku vs Sonnet vs fine-tuned). Use when building eval infrastructure, running model comparisons, or setting up the RLHF training pipeline. Status: PLANNED — build in Phase 2.2.
allowed-tools: [Read, Grep, Bash(python*)]

Evaluation Framework

STATUS: PLANNED — Build in Phase 2.2, after the steel thread (Phase 1) is working. The gateway (Phase 1.2) must exist first — evaluation is built on top of it.

Purpose

Two connected goals:

Production model selection — A/B test Haiku vs Sonnet on cost/quality tradeoffs to optimize routing
CS 5788 / RLHF research — Measure the quality gap between general-purpose Claude models and domain fine-tuned models on financial insight generation

EvalComparison Table

CREATE TABLE eval_comparisons (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    experiment_id       TEXT NOT NULL,          -- groups related comparisons
    surface             TEXT NOT NULL,          -- "classification" | "ambient" | "chat" | "review"
    model_a             TEXT NOT NULL,          -- model key from MODEL_REGISTRY
    model_b             TEXT NOT NULL,
    input_prompt        TEXT NOT NULL,          -- the exact prompt sent to both models
    output_a            TEXT NOT NULL,          -- model_a response
    output_b            TEXT NOT NULL,          -- model_b response
    trace_id_a          TEXT,                   -- Langfuse trace for model_a call
    trace_id_b          TEXT,                   -- Langfuse trace for model_b call
    cost_usd_a          FLOAT,
    cost_usd_b          FLOAT,
    human_preferred     TEXT,                   -- "a" | "b" | "tie" | null (if not rated)
    judge_preferred     TEXT,                   -- "a" | "b" | "tie" (from LLM-as-judge)
    judge_rationale     TEXT,
    judge_score_a       FLOAT,                  -- 1-5 scale
    judge_score_b       FLOAT,
    created_at          TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON eval_comparisons (experiment_id, surface, created_at DESC);
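
Once rows exist, the cost/quality tradeoff can be read from a per-experiment summary. The sketch below is one possible aggregation, not part of the planned code: `summarize_comparisons` is a hypothetical helper, and it assumes each row arrives as a mapping keyed by the column names above, with `judge_preferred` already stored in lowercase.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def summarize_comparisons(rows: Iterable[Mapping[str, object]]) -> dict:
    """Aggregate judge preferences and mean cost over eval_comparisons rows.

    Hypothetical helper: rows are assumed to be mappings keyed by the
    eval_comparisons column names. Unrated rows and NULL costs are skipped.
    """
    prefs: Counter = Counter()
    cost_a: list = []
    cost_b: list = []
    for row in rows:
        pref = row.get("judge_preferred")
        if pref in ("a", "b", "tie"):
            prefs[pref] += 1
        if row.get("cost_usd_a") is not None:
            cost_a.append(float(row["cost_usd_a"]))
        if row.get("cost_usd_b") is not None:
            cost_b.append(float(row["cost_usd_b"]))

    rated = sum(prefs.values())

    def mean(values: list) -> Optional[float]:
        return sum(values) / len(values) if values else None

    def rate(key: str) -> Optional[float]:
        return prefs[key] / rated if rated else None

    return {
        "rated": rated,
        "win_rate_a": rate("a"),
        "win_rate_b": rate("b"),
        "tie_rate": rate("tie"),
        "mean_cost_usd_a": mean(cost_a),
        "mean_cost_usd_b": mean(cost_b),
    }
```

Rates are reported over rated rows only, so an experiment with no judge output yields `None` rather than a misleading zero.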

Model Router (A/B Experiments)

The model router integrates with the gateway. Callers use call_llm() normally — routing is transparent.

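# backend/services/model_router.py
from typing import Optional

async def route_model(
    surface: str,
    user_id: str,
    experiment_id: Optional[str] = None,  # None means no experiment: use default routing
) -> str:  # returns model key ("haiku" | "sonnet" | ...)
    """
    Default routing: surface → model per the routing table in llm-gateway skill.
    Experiment routing: assigns user to experiment arm (deterministic by user_id hash).
    """
    # Stub only: the router is planned for Phase 2.2.
    raise NotImplementedError("route_model is planned for Phase 2.2")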

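The deterministic arm assignment could look like the sketch below. It is an illustration under assumptions, not the planned implementation: `assign_arm` is a hypothetical helper, and mixing `experiment_id` into the hash (so a user can land in different arms across experiments) is a design choice the source does not specify.

```python
import hashlib
from typing import Sequence


def assign_arm(user_id: str, experiment_id: str, arms: Sequence[str]) -> str:
    """Pick an experiment arm deterministically from a stable hash.

    Hypothetical helper: the same (user_id, experiment_id) pair always maps
    to the same arm, across processes and restarts.
    """
    if not arms:
        raise ValueError("arms must not be empty")
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned integer and bucket by arm count.
    bucket = int.from_bytes(digest[:8], "big") % len(arms)
    return arms[bucket]
```

A cryptographic digest is used instead of Python's built-in `hash()` because string hashing is randomized per process by default, which would reassign users on every restart.
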
LLM-as-Judge Pattern

# backend/services/buddy_prompts/eval_judge.py

JUDGE_PROMPT = """
You are evaluating two AI responses to the same financial assistant prompt.

USER CONTEXT: {context_summary}

PROMPT SENT TO BOTH MODELS:
{input_prompt}

RESPONSE A:
{output_a}

RESPONSE B:
{output_b}

Score each response on:
1. Accuracy (does it reflect the actual financial data?) — 1-5
2. Tone (warm, direct, non-judgmental — like a smart friend) — 1-5
3. Actionability (does it help the user know what to do?) — 1-5
4. Specificity (does it reference actual amounts/merchants?) — 1-5

Which response is better overall? ("A" | "B" | "tie")
Explain in 1-2 sentences why.

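Respond as JSON: {{"score_a": X, "score_b": X, "preferred": "A|B|tie", "rationale": "..."}}
"""
# Assumes the template is filled with JUDGE_PROMPT.format(...). The doubled braces
# above render as single literal braces; unescaped, they would raise KeyError.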

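The judge's reply still has to be validated before it is written to eval_comparisons. The prompt asks for "A|B|tie" while the table comments use lowercase "a" | "b" | "tie", so some normalization step is needed. The sketch below shows one way to do it; `parse_judge_response` and `JudgeResult` are hypothetical names, and it assumes the judge returns bare JSON with no surrounding text.

```python
import json
from typing import TypedDict


class JudgeResult(TypedDict):
    judge_preferred: str
    judge_rationale: str
    judge_score_a: float
    judge_score_b: float


def parse_judge_response(raw: str) -> JudgeResult:
    """Validate the judge's JSON reply and map it to eval_comparisons columns.

    Hypothetical helper. Raises ValueError on malformed JSON (json.JSONDecodeError
    is a ValueError subclass), on an unknown preference, or on out-of-range scores,
    and KeyError if a required field is missing.
    """
    data = json.loads(raw)
    preferred = str(data["preferred"]).strip().lower()
    if preferred not in ("a", "b", "tie"):
        raise ValueError(f"unexpected preference: {data['preferred']!r}")

    scores = {}
    for key in ("score_a", "score_b"):
        value = float(data[key])
        if not 1.0 <= value <= 5.0:
            raise ValueError(f"{key} outside the 1-5 scale: {value}")
        scores[key] = value

    return {
        "judge_preferred": preferred,
        "judge_rationale": str(data["rationale"]),
        "judge_score_a": scores["score_a"],
        "judge_score_b": scores["score_b"],
    }
```

Rejecting bad replies outright, rather than coercing them, keeps unparseable judge output out of the table so it can be retried or counted separately.
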
Four-Way Comparison (CS 5788 Design)

The academic contribution requires comparing:

base_mistral      — Mistral-7B with no fine-tuning
sft_mistral       — Mistral-7B after supervised fine-tuning on Budget Buddy data
sft_dpo_mistral   — After SFT + DPO (preference alignment)
claude_ceiling    — Claude Sonnet as the quality ceiling

Training data sources (collected automatically from Phase 1 onwards):

Ambient insight engagement (acknowledged/suppressed) → preference pairs
Chat thumbs up/down → direct quality signal
Action confirm/undo → implicit quality signal
Review Act 2 user responses → rich labeled data
User model communication preferences → personalization signal

The hypothesis: Real user interaction data (from the context system) produces better fine-tuning signal than synthetic data. The suppression system provides implicit preference learning without requiring explicit ratings.
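
As an illustration of how acknowledged/suppressed signals could become preference pairs, the sketch below pairs outputs that received opposite outcomes for the same prompt. Everything here is an assumption: `InsightEvent`, its fields, and `build_preference_pairs` are hypothetical, and the real pipeline may pair on something other than an identical prompt. The `prompt` / `chosen` / `rejected` layout is a common convention for DPO training data.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from typing import Iterable


@dataclass(frozen=True)
class InsightEvent:
    """Hypothetical record of one ambient insight and how the user reacted."""
    prompt: str
    output: str
    outcome: str  # "acknowledged" or "suppressed"; anything else is ignored


def build_preference_pairs(events: Iterable[InsightEvent]) -> list[dict]:
    """Pair acknowledged outputs (chosen) with suppressed ones (rejected) per prompt."""
    by_prompt = defaultdict(lambda: {"acknowledged": [], "suppressed": []})
    for event in events:
        if event.outcome in ("acknowledged", "suppressed"):
            by_prompt[event.prompt][event.outcome].append(event.output)

    pairs = []
    for prompt, outcomes in by_prompt.items():
        for chosen, rejected in product(outcomes["acknowledged"], outcomes["suppressed"]):
            if chosen != rejected:  # identical text carries no preference signal
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Prompts that only ever received one kind of outcome produce no pairs, which is why no explicit ratings are required but also why coverage depends on how often both outcomes occur.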

Eval Runner CLI

# backend/scripts/run_eval_comparison.py

# Compare Haiku vs Sonnet on last 50 ambient insights:
python -m backend.scripts.run_eval_comparison \
  --experiment "haiku_vs_sonnet_ambient" \
  --surface ambient \
  --model_a haiku \
  --model_b sonnet \
  --sample_size 50

# Output: CSV + summary table in docs/v2/eval-results/
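
The flags in the command above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the surface choices are taken from the `surface` column comment in the table, and the default of 50 for `--sample_size` is a guess based on the example, not a documented default.

```python
import argparse
from typing import Optional, Sequence

# Surfaces listed in the eval_comparisons.surface column comment.
SURFACES = ("classification", "ambient", "chat", "review")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="run_eval_comparison",
        description="Compare two models on the same sampled inputs.",
    )
    parser.add_argument("--experiment", required=True, help="experiment_id to group rows under")
    parser.add_argument("--surface", required=True, choices=SURFACES)
    parser.add_argument("--model_a", required=True, help="model key, e.g. haiku")
    parser.add_argument("--model_b", required=True, help="model key, e.g. sonnet")
    # Assumed default; the source only shows 50 in an example invocation.
    parser.add_argument("--sample_size", type=int, default=50)
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse and sanity-check CLI arguments (argv=None reads sys.argv)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.sample_size <= 0:
        parser.error("--sample_size must be a positive integer")
    if args.model_a == args.model_b:
        parser.error("--model_a and --model_b must differ")
    return args
```

Rejecting identical models up front avoids spending tokens on a comparison that cannot produce a preference.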

Files to Build (Phase 2.2)

backend/services/model_router.py              — A/B experiment routing
backend/database/models/eval_comparison.py    — EvalComparison model + migration
backend/services/eval_runner.py               — Run comparisons across models
backend/services/buddy_prompts/eval_judge.py  — LLM-as-judge prompt
backend/scripts/run_eval_comparison.py        — CLI for eval sweeps
tests/test_eval_runner.py                     — Unit tests (write FIRST)

Acceptance Criteria (Phase 2.2)

Model router integrates transparently with gateway (callers don't change)
Eval runner can compare Haiku vs Sonnet on same inputs
Results stored with quality scores + cost data
LLM-as-judge produces structured scores in EvalComparison table
Provider abstraction supports adding Modal without changing call sites
Langfuse traces for both model arms linked to same experiment_id
All tests pass

Metrics Tracked in Langfuse

Metric	Source	What it tells you
Engagement rate	Ambient insight acknowledged	Are insights relevant?
Suppression rate	Topics suppressed / shown	Is Buddy annoying?
Action success rate	Actions confirmed / proposed	Is Buddy accurate?
Undo rate	Actions undone / executed	Is Buddy trustworthy?
Thumbs up ratio	Positive / total feedback	Is quality acceptable?
Cost per interaction	Token usage × model price	Is routing optimal?
Latency p50/p95	Trace duration	Is it fast enough?
Repetition rate	Same topic within 24hr	Is memory working?
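
Two of these metrics reduce to short calculations: cost per interaction (token usage × model price) and latency p50/p95. The sketch below shows both under stated assumptions. The helper names are hypothetical, and the prices are placeholder numbers chosen for easy arithmetic, not Anthropic's actual pricing.

```python
import statistics
from typing import Mapping, Sequence, Tuple

# PLACEHOLDER prices in USD per million tokens as (input, output).
# These are illustrative values only, not real model pricing.
PLACEHOLDER_PRICES: Mapping[str, Tuple[float, float]] = {
    "haiku": (1.0, 5.0),
    "sonnet": (3.0, 15.0),
}


def interaction_cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    prices: Mapping[str, Tuple[float, float]] = PLACEHOLDER_PRICES,
) -> float:
    """Cost of one call: tokens times the per-million-token price, per direction."""
    input_price, output_price = prices[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def latency_percentiles(durations_ms: Sequence[float]) -> dict:
    """Return p50 and p95 of trace durations (needs at least two samples)."""
    # With n=100, quantiles() returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Comparing `interaction_cost_usd` across the two arms of an experiment, alongside the judge scores, is what makes the "Is routing optimal?" question answerable.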

Last Updated

2026-03-31 (Phase 0 scaffold)
