| name | annotate |
| description | Create flexible annotation workflows for AI applications. Contains common tools to explore raw AI agent logs/transcripts, extract relevant evaluation data, and create LLM-as-a-judge evaluators. |
| allowed-tools | Read, Grep, Glob, jq |
The workflow consists of 3 main phases, run in a loop: (1) ingest raw agent data, (2) design a feedback config, (3) collect human/AI annotations.
NOTE: use the multichoice/checklist/survey tool as much as possible when getting feedback from the user; typing is onerous.
This document covers setup tasks that should be completed before starting the annotation workflow.
This skill requires Python and Node.js dependencies. If the user encounters errors about missing packages, ask for permission to install dependencies:
# From the annotate skill directory
pip install -e .
# Frontend dependencies (from frontend/ subdirectory)
cd frontend && yarn install
All state and data should be contained under .haize_annotations in the current working directory. For Claude Code, this is the same directory that contains .claude.
Make sure this is the .claude folder in the CURRENT working directory, not the global one.
ALWAYS start by checking for existing work; we may be resuming an annotation session where a user has already made progress.
If .haize_annotations does not exist, create it.
If .haize_annotations does exist, inspect its contents:
- If ingested_data exists, we've already done the work to translate raw trace data into the haize format.
- If feedback_config.json exists, we've already arrived at criteria for how the human should provide feedback.
- If test_cases exists with human/AI annotations, the user has already made some annotation progress based on a particular feedback config; those cases don't necessarily match the CURRENT feedback config, though.

(A quick sketch of this resume check appears after the directory tree below.)

All artifacts will be saved to the .haize_annotations/ directory:
Important note: ingest.py is the ONLY file you are meant to manually EDIT/WRITE (all files are free to read)
.haize_annotations/
├── ingested_data/
│ └── interactions/ # list of all ai interactions
│ ├── {interaction_id}/
│ │ ├── metadata.json # Interaction fields (id, name, group_id, etc)
│ │ └── steps.jsonl # One InteractionStep per line
│ └── ...
│
├── ingest.py # Script to ingest raw data under some folder path into `ingested_data`
│
├── feedback_config.json # Evaluation criteria and any state about an annotation session tied to this specific feedback config
└── test_cases/ # Test cases human/ai produced via annotation sessions that human/ai must give feedback on
├── tc_{uuid}.json # Contains: raw_judge_input, judge_input,
│ # ai_annotation, human_annotation
└── ...
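To make the resume check concrete, here is a minimal sketch that reports which of the artifacts above already exist. It assumes only the layout documented in the tree:

```python
# Sketch: report which .haize_annotations artifacts already exist (resume check).
from pathlib import Path

root = Path(".haize_annotations")
interactions = list((root / "ingested_data" / "interactions").glob("*"))
test_cases = list((root / "test_cases").glob("tc_*.json"))

print("ingested interactions:", len(interactions))
print("feedback_config.json exists:", (root / "feedback_config.json").exists())
print("test cases on disk:", len(test_cases))
```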
Important note: You will NOT be directly modifying the feedback config file; instead, you'll modify it through a FastAPI endpoint.
The goal of this step is to ingest raw agent data, which can be in arbitrary form, into a normalized data format that we can work with.
To start, generate ingest.py with all models and boilerplate included:
bash <path-to-annotate-skill>/scripts/generate_ingestion_script.sh .haize_annotations/ingest.py
IMPORTANT:
- Replace <path-to-annotate-skill> with the actual path to the annotate skill directory (the one containing scripts/generate_ingestion_script.sh).
- The output argument (.haize_annotations/ingest.py) specifies where the script will be created inside the .haize_annotations/ directory, keeping all artifacts together.

This creates ingest.py with:
- All data models from scripts/_models.py
- An ingest() function with TODO comments and an example implementation
- Boilerplate that writes output to .haize_annotations/ingested_data/

REQUIRED READING BEFORE CONTINUING: You MUST read:
references/normalization_patterns.md
(no need to call out or cite specific patterns you are using - these reference docs are for YOUR knowledge)

Sample and inspect raw trace files to understand their structure.
See references/normalization_patterns.md for detailed guidance on analyzing trace formats and transformation patterns.
Careful: Raw agent log files can get very large, and not all files in the working directory are agent transcript data. Check file sizes before opening trace data.
Important: Python is a great tool here, but make use of other safer tools like jq as well if they exist.
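As one hedged example of a size-aware first pass, the sketch below lists candidate trace files by size and peeks at the first line of the largest ones. The raw_traces folder name and the .json/.jsonl extensions are assumptions about your data, not guarantees:

```python
# Sketch: list candidate trace files by size and peek at the first line of the largest,
# instead of loading whole files blindly. Folder name and extensions are assumptions.
from pathlib import Path

candidates = sorted(
    (p for p in Path("raw_traces").rglob("*") if p.suffix in {".json", ".jsonl"}),
    key=lambda p: p.stat().st_size,
    reverse=True,
)
for p in candidates[:10]:
    print(f"{p.stat().st_size / 1e6:8.2f} MB  {p}")
    with p.open(errors="ignore") as f:
        print("  first 200 chars:", f.readline()[:200])
```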
NOTE: If the user wants to annotate something that is not super feasible with the data at hand (e.g., they want to annotate the quality of complete sessions but the associated data doesn't have a stable session identifier to group the data into sessions, or the user wants to annotate a multi-step RAG bot but the data doesn't have an identifier to group together all steps into a single interaction), you HAVE to mention that to them to manage expectations and tell them WHAT IS MISSING for them to annotate that aspect. In these cases, suggest ALTERNATIVE aspects of the AI app to annotate.
Your job: Implement the ingest(folder_path) function based on your trace format in ingest.py. You're responsible for ALL file loading and parsing logic.
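If it helps to picture the shape of that work, here is a rough, hypothetical sketch of the loading/grouping half of ingest(). It assumes JSONL event files with a session_id field, which is almost certainly not your exact format, and it deliberately stops short of constructing the Interaction/InteractionStep models defined in scripts/_models.py - that mapping belongs in the generated ingest.py itself:

```python
# Hypothetical sketch only: parse raw JSONL event files and group events into
# per-interaction buckets, ready to be mapped onto Interaction/InteractionStep.
# Field names (session_id, timestamp) are assumptions about the raw data.
import json
from collections import defaultdict
from pathlib import Path

def load_and_group(folder_path: str) -> dict[str, list[dict]]:
    grouped: dict[str, list[dict]] = defaultdict(list)
    for path in Path(folder_path).rglob("*.jsonl"):
        with path.open(errors="ignore") as f:
            for line in f:
                if not line.strip():
                    continue
                event = json.loads(line)
                grouped[event.get("session_id", path.stem)].append(event)
    # Sort each interaction's events chronologically if a timestamp exists.
    for events in grouped.values():
        events.sort(key=lambda e: e.get("timestamp", ""))
    return grouped
```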
See references/normalization_patterns.md for transformation patterns and complete examples. Then run the ingestion script:
python .haize_annotations/ingest.py --input /path/to/traces/folder
This will:
- Call your ingest() function with the input folder path
- Write output to .haize_annotations/ingested_data/interactions/{interaction_id}/:
  - metadata.json - interaction metadata
  - steps.jsonl - one InteractionStep per line

Critical validation step! Always validate that the ingested data is as expected before continuing, using a combination of:
- scripts/run_validate_ingested_data.py (very quick, high-level stats; must be run as a module!!):

  cd <path-to>/skills/annotate_skill && python -m scripts.run_validate_ingested_data --ingested-dir <path-to>/.haize_annotations/ingested_data/interactions

- Manual spot-checks of a few ingested interactions (see the sketch below)

You might need to come back to this later, in case the way we've ingested the data makes giving feedback on agent traces difficult.
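In addition to the validation script, here is a sketch of a quick spot-check that relies only on the documented output layout (metadata.json plus steps.jsonl per interaction):

```python
# Sketch: spot-check the ingested output using only the documented layout.
import json
from pathlib import Path

interactions_dir = Path(".haize_annotations/ingested_data/interactions")
dirs = sorted(d for d in interactions_dir.iterdir() if d.is_dir())
print("interactions:", len(dirs))

if dirs:
    sample = dirs[0]
    metadata = json.loads((sample / "metadata.json").read_text())
    steps = [json.loads(line) for line in (sample / "steps.jsonl").read_text().splitlines() if line.strip()]
    print("sample id / metadata keys:", metadata.get("id"), sorted(metadata.keys()))
    print("steps in sample:", len(steps))
```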
Before creating feedback configurations and collecting annotations, you need to start the annotation session servers. A single command starts both the FastAPI backend and the frontend annotation UI.
Command:
YOU MUST run the annotation session script as a module!!!!
# Option 1: Navigate to the annotate skill directory first
cd <path-to-annotate-skill>
python -m scripts.run_annotation_session \
--haize-annotations-dir <path-to-project>/.haize_annotations \
--source-data-directory <path-to-project>/ \
--port 8000 \
--frontend-port 5173
# Option 2: Run from any directory using module path
python -m <path_to>.scripts.run_annotation_session \
--haize-annotations-dir <path-to-project>/.haize_annotations \
--source-data-directory <path-to-project>/ \
--port 8000 \
--frontend-port 5173
Replace <path-to-annotate-skill> and <path-to-project> with actual paths.
What happens on startup:
- Loads feedback_config.json (if it exists; otherwise, you'll have to interact with an endpoint to generate it)

Servers:
- Backend API: http://localhost:<backend-port>
- Frontend UI: http://localhost:<frontend-port>

To stop: Press Ctrl+C once (gracefully shuts down all servers)
Important: Keep this running throughout your annotation session. If you update the feedback config, the servers will automatically reload and regenerate test cases.
REQUIRED STEP: Call curl -s http://localhost:8000/openapi.json to get documentation on interacting with the FastAPI server.
⚠️ REQUIRED READING BEFORE CONTINUING: You MUST read:
- references/feedback_config_design.md
- references/rubric_design.md

YOU MUST ONLY MODIFY THE FEEDBACK CONFIG via the FastAPI server, since it has validation built in.
You can create temporary files to work on and then do:
curl -s -X POST "http://localhost:8000/feedback-config" -H "Content-Type: application/json" -d @.haize_annotations/new_config.json
Or just directly hit the server.
REMEMBER! Here's the request model for feedback configs:
class FeedbackConfigRequest(BaseModel):
"""Request to create or update the active feedback configuration."""
config: FeedbackConfig = Field(
...,
description="Complete FeedbackConfig object defining evaluation criteria, granularity, rubrics, and filtering rules.",
)
Still check the OpenAPI spec, but note that the payload is wrapped in a config object.
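As a sketch of that wrapping step, assuming you drafted the config into the temp file named above (the envelope check is there because the draft may or may not already contain the config wrapper):

```python
# Sketch: wrap a drafted feedback config in the {"config": ...} envelope and POST it.
# Build the actual config per references/feedback_config_design.md; this only shows the call.
import json
import urllib.request

draft = json.load(open(".haize_annotations/new_config.json"))
payload = draft if "config" in draft else {"config": draft}

req = urllib.request.Request(
    "http://localhost:8000/feedback-config",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode()[:500])
```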
Goal: Define WHAT to evaluate and HOW - this is usually the part of the workflow that requires the most back and forth with the user.
At this point, if you haven't already, you should try reading scripts/_models.py to understand the various data models we are working with, especially the FeedbackConfig object.
Usage:
# Import and inspect models
python -c "from scripts.models import InteractionStep, Interaction, TestCase, Annotation, FeedbackConfig"
python -c "from scripts.models import *; print(InteractionStep.model_json_schema())"
python -c "from scripts.models import *; print(InteractionStep.__doc__)"
Important: You should NOT be exploring the original raw data to design the feedback config. The feedback config should be designed based on the ingested data. Of course, if the ingested data does not contain what you need to meet the user's needs, you can reconsider the ingestion script, explore the raw data, and re-ingest the data. For the majority of cases, this isn't needed.
After you have a good idea of the feedback configuration, you must use the feedback configuration FastAPI endpoint (POST /feedback-config) to modify feedback_config.json (don't directly edit the file EVER!!)
This endpoint will return basic validation information; even if it returns 200, it's a good idea to quickly scan through the generated test case data as a gut check that the attribute matchers / granularity contain the necessary eval data (see the sketch below).
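Here is a sketch of that gut check, using only the test-case fields listed in the directory layout above:

```python
# Sketch: scan generated test cases and count which documented fields are populated,
# as a gut check that matchers/granularity captured real eval data.
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for path in sorted(Path(".haize_annotations/test_cases").glob("tc_*.json")):
    tc = json.loads(path.read_text())
    counts["total"] += 1
    for field in ("raw_judge_input", "judge_input", "ai_annotation", "human_annotation"):
        if tc.get(field):
            counts[field] += 1
print(dict(counts))
```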
Note: After the feedback config is designed, it will take a bit for test cases to be generated, processed, AI annotated, and then finally ready for human annotations. Feel free to start annotating when there are any test cases that are AI annotated!
Reminder: Use this command to understand the annotations API:
curl -s http://localhost:8000/openapi.json > .haize_annotations/tmp/annotation_api_spec.json && wc -l .haize_annotations/tmp/annotation_api_spec.json
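To turn that saved spec into a quick endpoint inventory (paths and HTTP methods are standard OpenAPI structure):

```python
# Sketch: list every path + method from the saved OpenAPI spec so you know
# exactly which annotation endpoints exist before calling them.
import json

spec = json.load(open(".haize_annotations/tmp/annotation_api_spec.json"))
for path, methods in sorted(spec.get("paths", {}).items()):
    for method in methods:
        print(f"{method.upper():6} {path}")
```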
It's open-ended now, but your main goal is to get AS MANY ANNOTATIONS and HIGH QUALITY COMMENTS from the user as possible while minimizing effort and time from the human.
START OFF WITH THIS WORKFLOW (unless the user expresses preferences otherwise or you have a reason not to, this is a good default):
- Hit the /api/test-cases/next endpoint to get the next test case that's ready to be annotated.
- Hit the POST /api/test-cases/{test_case_id}/visualize endpoint to open the test case in the browser for the user to see.
  - Handy helper: TC_ID=$(cat /tmp/tc_id.txt) && curl -s -X POST "http://localhost:<backend-port>/api/test-cases/$TC_ID/visualize"

IMPORTANT: DO NOT call open manually on any endpoint - always POST to the visualize endpoint. (A minimal sketch of this default loop follows below.)
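Here is that sketch, using only the two endpoints named above; the port and the shape of the /next response (an id field) are assumptions to confirm against the OpenAPI spec:

```python
# Sketch: fetch the next annotatable test case and open it in the user's browser
# via the visualize endpoint. Assumes port 8000 and an "id" field in the /next
# response - verify both against the OpenAPI spec.
import json
import urllib.request

BASE = "http://localhost:8000"

with urllib.request.urlopen(f"{BASE}/api/test-cases/next") as resp:
    test_case = json.loads(resp.read())
tc_id = test_case.get("id") or test_case.get("test_case_id")

req = urllib.request.Request(f"{BASE}/api/test-cases/{tc_id}/visualize", method="POST")
with urllib.request.urlopen(req) as resp:
    print("visualize:", resp.status)
```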
Some more creative strategies could be:
Important guidelines:
- Keep the ai_rubric up to date as you learn from the user's comments and from disagreements with the AI annotator.

REMINDER! Here's the request model for annotations:
class AnnotationRequest(BaseModel):
"""Request to submit an annotation (human or AI) for a test case."""
annotation: Annotation = Field(
...,
description="Annotation to submit. Type depends on the feedback spec: categorical (labels), continuous (scores), or ranking (ordering).",
)
Once it seems like the user has landed on something satisfactory - e.g., a non-trivial annotation sample size with good alignment between them and the AI annotator - suggest looking into next steps: see references/next_steps.md for how to turn this into a repeatable workflow wired into their live production data.
You will always be juggling a mix of prompting the user for information while doing analysis on your own. You should try to minimize thrashing between "interview mode" (asking the user for stuff) and "analysis mode" (doing discovery on your own).
Avoid proactively surfacing internal concepts and jargon to the user - anything related to AI alignment / expert feedback on AI falls into this bucket.
The main reason is that the user probably doesn't care, but if they do, feel free to mention these things. In general, it is completely okay to be transparent about what's going on under the hood (they can see anyway), but we don't want to bother the user with concepts/names/details.
In general, heavy lifting and setup details should happen silently in the background unless there is truly an urgent issue to expose.
LAST REMINDERS:
WARNING: FILE SIZES CAN GET LARGE - for raw trace files, normalized interactions, and test cases alike. Approach smartly: CHECK FILE SIZES and read chunk by chunk instead of going in blind and loading everything at once.
CRITICAL: NEVER directly edit the feedback config or test cases directory. Managing these will be handled by the annotation server