| name | neuron-evaluation-engineer |
| description | Create and run AI evaluations with datasets, assertions, and output drivers in Neuron AI. Use this skill whenever the user mentions evaluation, testing AI systems, creating evaluators, dataset-driven testing, assertion-based validation, or wants to measure AI system performance. Also trigger for tasks involving evaluator discovery, output configuration, result analysis, or building custom assertions. |
Neuron AI Evaluation Engineer
This skill helps you create and run evaluations for AI systems in Neuron AI. The evaluation system provides dataset-driven testing with flexible assertions, comprehensive result reporting, and extensible output drivers.
Core Concepts
The Evaluation System
Evaluations test AI systems using three main components:
- Evaluators - Test classes that define what to run and how to validate
- Datasets - Test data sources (arrays, JSON files)
- Assertions - Validation rules for checking outputs
Dataset Items → Evaluator::run() → Output → Evaluator::evaluate() → Assertions → Results
Evaluation Flow
The runner calls setUp() once per evaluator to initialize resources, then for each dataset item:
- run(datasetItem) - Execute your AI logic
- evaluate(output, datasetItem) - Assert against expected results
- Repeat for the next item
Note: Each evaluation starts with a fresh assertion executor - no manual reset needed.
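The flow above can be pictured as a simple loop. What follows is a minimal, library-free sketch and not Neuron AI's actual runner: `runEvaluationSketch`, its callables, and the shape of the result array are hypothetical names chosen only for illustration.

```php
<?php
/**
 * Hypothetical stand-in for the runner loop; NOT the Neuron AI implementation.
 *
 * @param list<array<string, mixed>> $dataset
 * @param callable(): void $setUp
 * @param callable(array): mixed $run
 * @param callable(mixed, array): list<bool> $evaluate one boolean per assertion
 * @return list<array{index: int, passed: bool}>
 */
function runEvaluationSketch(array $dataset, callable $setUp, callable $run, callable $evaluate): array
{
    $setUp(); // runs once, before any dataset item is processed

    $results = [];
    foreach ($dataset as $index => $item) {
        $output = $run($item);                   // execute the AI logic
        $assertions = $evaluate($output, $item); // fresh list of assertion outcomes per item
        $results[] = [
            'index' => $index,
            'passed' => !in_array(false, $assertions, true),
        ];
    }

    return $results;
}

$results = runEvaluationSketch(
    [
        ['text' => 'I love this product!', 'content' => 'product'],
        ['text' => 'This is terrible.', 'content' => 'positive'],
    ],
    static function (): void {},
    // Placeholder for an agent call: simply upper-cases the input text
    static fn (array $item): string => strtoupper($item['text']),
    static fn (mixed $output, array $item): array => [
        str_contains(strtolower($output), $item['content']),
    ],
);
```

An item counts as passed here only when every assertion for that item succeeded, which mirrors the per-item pass/fail counts reported later in this document.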
Creating Custom Evaluators
Basic Evaluator
use NeuronAI\Evaluation\BaseEvaluator;
use NeuronAI\Evaluation\Contracts\DatasetInterface;
use NeuronAI\Evaluation\Assertions\StringContains;
use NeuronAI\Evaluation\Dataset\ArrayDataset;
use NeuronAI\Agent;
use NeuronAI\Agent\SystemPrompt;
use NeuronAI\Chat\Messages\UserMessage;
class ContainsEvaluator extends BaseEvaluator
{
    public function getDataset(): DatasetInterface
    {
        return new ArrayDataset([
            [
                'text' => 'I love this product!',
                'content' => 'product',
            ],
            [
                'text' => 'This is terrible.',
                'content' => 'positive',
            ],
        ]);
    }

    public function run(array $datasetItem): mixed
    {
        $response = MyAgent::make()->chat(
            new UserMessage($datasetItem['text'])
        )->getMessage();

        return $response->getContent();
    }

    public function evaluate(mixed $output, array $datasetItem): void
    {
        $this->assert(
            new StringContains($datasetItem['content']),
            $output
        );
    }
}
JSON Dataset
For larger datasets, use JSON files:
use NeuronAI\Evaluation\Dataset\JsonDataset;
public function getDataset(): DatasetInterface
{
return new JsonDataset(__DIR__ . '/datasets/sentiment.json');
}
JSON format (sentiment.json):
[
{"text": "I love this!", "expected": "positive"},
{"text": "This is bad.", "expected": "negative"}
]
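Each object in the JSON array corresponds to one `$datasetItem` array passed to run() and evaluate(). The sketch below shows that mapping with plain `json_decode`; `loadDatasetItemsSketch` is a hypothetical helper, and JsonDataset's real loading and validation logic may differ.

```php
<?php
/**
 * Hypothetical helper: decode a JSON dataset into a list of associative arrays.
 *
 * @return list<array<string, mixed>>
 */
function loadDatasetItemsSketch(string $json): array
{
    // Decode objects as associative arrays and throw on malformed JSON
    $items = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($items) || !array_is_list($items)) {
        throw new InvalidArgumentException('Dataset must be a JSON array of objects.');
    }

    foreach ($items as $index => $item) {
        if (!is_array($item)) {
            throw new InvalidArgumentException("Dataset item {$index} must be a JSON object.");
        }
    }

    return $items;
}

$items = loadDatasetItemsSketch(
    '[{"text": "I love this!", "expected": "positive"}, {"text": "This is bad.", "expected": "negative"}]'
);
```

With this shape, an evaluator would read `$datasetItem['text']` in run() and `$datasetItem['expected']` in evaluate().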
Built-in Assertions
String Assertions
StringContains
Check if the output contains a substring:
$this->assert(new StringContains('positive'), $output);
StringContainsAll
Check if the output contains all keywords:
$this->assert(new StringContainsAll(['hello', 'world']), $output);
StringContainsAny
Check if the output contains any of the keywords:
$this->assert(new StringContainsAny(['success', 'completed']), $output);
StringStartsWith
Check if the output starts with a prefix:
$this->assert(new StringStartsWith('Hello'), $output);
StringEndsWith
Check if the output ends with a suffix:
$this->assert(new StringEndsWith('!'), $output);
StringLengthBetween
Check if the string length is within range:
$this->assert(new StringLengthBetween(10, 100), $output);
StringDistance
Check string similarity using Levenshtein distance:
$this->assert(new StringDistance(
reference: 'expected text',
threshold: 0.5, // Minimum similarity score
maxDistance: 50 // Maximum allowed edits
), $output);
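To build intuition for choosing threshold and maxDistance, the sketch below turns PHP's built-in `levenshtein()` edit distance into a 0.0 to 1.0 similarity score. The normalization (dividing by the longer string's length) and the name `levenshteinSimilaritySketch` are assumptions for illustration; StringDistance may compute its score differently.

```php
<?php
/**
 * Hypothetical normalization of Levenshtein distance into a similarity score.
 * Note: levenshtein() operates on bytes, so multi-byte characters count as several edits.
 */
function levenshteinSimilaritySketch(string $reference, string $actual): float
{
    $maxLength = max(strlen($reference), strlen($actual));
    if ($maxLength === 0) {
        return 1.0; // two empty strings are identical
    }

    return 1.0 - levenshtein($reference, $actual) / $maxLength;
}

// "kitten" -> "sitting" needs 3 edits (k->s, e->i, append g)
$distance = levenshtein('kitten', 'sitting');
$score = levenshteinSimilaritySketch('kitten', 'sitting'); // 1 - 3/7
```

Under this normalization, a threshold of 0.5 would accept the pair above, while a stricter threshold such as 0.8 would reject it.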
StringSimilarity
Check string similarity using embeddings:
use NeuronAI\Evaluation\Assertions\StringSimilarity;
use NeuronAI\RAG\Embeddings\OpenAI\OpenAIEmbeddings;
$this->assert(new StringSimilarity(
reference: 'The quick brown fox',
embeddingsProvider: new OpenAIEmbeddings(key: 'YOUR_KEY'),
threshold: 0.6
), $output);
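Embedding-based comparisons are commonly scored with cosine similarity between the two vectors. The following self-contained sketch shows that calculation on small hand-written vectors; whether StringSimilarity uses exactly this measure is an assumption, and `cosineSimilaritySketch` is a hypothetical name.

```php
<?php
/**
 * Cosine similarity between two equal-length vectors.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilaritySketch(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction to compare
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

$parallel = cosineSimilaritySketch([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]); // same direction
$orthogonal = cosineSimilaritySketch([1.0, 0.0], [0.0, 1.0]);         // unrelated directions
```

Vectors pointing the same way score close to 1.0 and unrelated ones close to 0.0, which is the scale a threshold such as 0.6 is compared against under this assumption.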
Pattern Assertions
MatchesRegex
Match against regular expression:
$this->assert(new MatchesRegex('/^\d{3}-\d{2}-\d{4}$/'), $output);
Structure Assertions
IsValidJson
Check if the output is valid JSON:
$this->assert(new IsValidJson(), $output);
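For reference, a strict JSON validity check can be expressed in plain PHP as follows. This is a sketch of the general technique, not the library's implementation; `isValidJsonSketch` is a hypothetical name.

```php
<?php
/** Hypothetical strict check: true only for a string that parses as JSON. */
function isValidJsonSketch(mixed $value): bool
{
    if (!is_string($value)) {
        return false;
    }

    try {
        json_decode($value, false, 512, JSON_THROW_ON_ERROR);
        return true;
    } catch (JsonException) {
        return false;
    }
}

$valid = isValidJsonSketch('{"sentiment": "positive"}');
$truncated = isValidJsonSketch('{"sentiment": "positive"');
// Model output wrapped in a Markdown fence is not valid JSON on its own
$fenced = isValidJsonSketch("```json\n{}\n```");
```

A strict check like this rejects JSON wrapped in Markdown fences or surrounded by explanatory prose, so it may be worth stripping such wrappers in run() if the agent tends to add them.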
AI Judge Assertions
AgentJudge
Use an AI agent to evaluate outputs with custom criteria:
use NeuronAI\Evaluation\Assertions\AgentJudge;
use NeuronAI\Agent;
$judge = Agent::make()
->setInstructions('You are an expert evaluator for customer support responses.');
// Basic usage: score the output against free-form criteria
$this->assert(new AgentJudge(
    judge: $judge,
    criteria: 'Response should be helpful, polite, and address the customer\'s question directly',
    threshold: 0.7
), $output);

// With a reference answer to compare against
$this->assert(new AgentJudge(
    judge: $judge,
    criteria: 'The response should convey the same meaning as the reference',
    threshold: 0.8,
    reference: $datasetItem['expected_answer']
), $output);

// With scored examples to calibrate the judge
$this->assert(new AgentJudge(
    judge: $judge,
    criteria: 'Rate the factual accuracy of the response',
    threshold: 0.7,
    examples: [
        [
            'input' => 'What is 2+2?',
            'output' => '2+2 equals 4',
            'score' => 1.0,
            'reasoning' => 'Mathematically correct and clear.',
        ],
    ]
), $output);
Pre-configured Judges
Built-in judges for common evaluation scenarios:
use NeuronAI\Evaluation\Assertions\Judges\{FaithfulnessJudge, CorrectnessJudge, RelevanceJudge, HelpfulnessJudge};
// Faithfulness: judged against the retrieved context
$this->assert(new FaithfulnessJudge(
    judge: $judge,
    context: $retrievedDocuments,
    threshold: 0.7
), $output);

// Correctness: judged against the expected answer
$this->assert(new CorrectnessJudge(
    judge: $judge,
    expected: $datasetItem['expected_answer'],
    threshold: 0.7
), $output);

// Relevance: judged against the original question
$this->assert(new RelevanceJudge(
    judge: $judge,
    question: $datasetItem['question'],
    threshold: 0.7
), $output);

// Helpfulness: judged on the output alone
$this->assert(new HelpfulnessJudge(
    judge: $judge,
    threshold: 0.7
), $output);
Creating Custom Assertions
use NeuronAI\Evaluation\Assertions\AbstractAssertion;
use NeuronAI\Evaluation\AssertionResult;
class GreaterThanAssertion extends AbstractAssertion
{
    public function __construct(
        private readonly float $threshold
    ) {}

    public function evaluate(mixed $actual): AssertionResult
    {
        if (!is_numeric($actual)) {
            return AssertionResult::fail(
                0.0,
                'Expected numeric value, got ' . gettype($actual),
            );
        }

        if ($actual > $this->threshold) {
            return AssertionResult::pass(1.0);
        }

        return AssertionResult::fail(
            0.0,
            "Expected {$actual} to be greater than {$this->threshold}",
        );
    }
}
Use it:
$this->assert(new GreaterThanAssertion(0.8), $score);
Running Evaluations
CLI Command
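# Run all evaluators found in a directory
vendor/bin/neuron evaluation /path/to/evaluators
# Same, with verbose output
vendor/bin/neuron evaluation --verbose /path/to/evaluators
# Pass the directory through the --path option
vendor/bin/neuron evaluation --path=/path/to/evaluators
# Show command help
vendor/bin/neuron evaluation --help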
Programmatic Execution
use NeuronAI\Evaluation\Runner\EvaluatorRunner;
$runner = new EvaluatorRunner();
$evaluator = new MyEvaluator();
$summary = $runner->run($evaluator);
echo "Passed: {$summary->getPassedCount()}\n";
echo "Failed: {$summary->getFailedCount()}\n";
// Arithmetic is not allowed inside "{$...}" interpolation, so format the value explicitly
printf("Success Rate: %.1f%%\n", $summary->getSuccessRate() * 100);
Output Configuration
Config File
Create evaluation.php in project root:
<?php
use NeuronAI\Evaluation\Output\ConsoleOutput;
use NeuronAI\Evaluation\Output\JsonOutput;
return [
    'output' => [
        ConsoleOutput::class,
        JsonOutput::class => [
            'path' => 'evaluation-results.json',
        ],
    ],
];
Default behavior: If no config exists, uses ConsoleOutput.
Built-in Output Drivers
ConsoleOutput
ConsoleOutput::class => ['verbose' => true]
verbose - Show detailed input/output for failures
JsonOutput
JsonOutput::class => ['path' => 'results.json']
JsonOutput::class
Creating Custom Output Drivers
use NeuronAI\Evaluation\Contracts\EvaluationOutputInterface;
use NeuronAI\Evaluation\Runner\EvaluatorSummary;
class DatabaseOutput implements EvaluationOutputInterface
{
    public function __construct(
        private readonly \PDO $pdo,
        private readonly string $table = 'evaluations'
    ) {}

    public function output(EvaluatorSummary $summary): void
    {
        $stmt = $this->pdo->prepare(
            "INSERT INTO {$this->table}
            (passed, failed, success_rate, total_time, created_at)
            VALUES (?, ?, ?, ?, NOW())"
        );

        $stmt->execute([
            $summary->getPassedCount(),
            $summary->getFailedCount(),
            $summary->getSuccessRate(),
            $summary->getTotalExecutionTime(),
        ]);
    }
}
Register in config:
DatabaseOutput::class => [
'pdo' => new \PDO('mysql:host=localhost;dbname=evaluations', 'user', 'pass'),
'table' => 'evaluations',
]
Project Setup
Configuring Autoloader
Add evaluators directory to composer.json:
{
    "autoload-dev": {
        "psr-4": {
            "App\\Evaluators\\": "evaluators/"
        }
    }
}
Directory Structure
project/
├── evaluators/
│   ├── SentimentEvaluator.php
│   ├── SummarizationEvaluator.php
│   └── datasets/
│       ├── sentiment.json
│       └── summarization.json
├── evaluation.php
└── vendor/bin/neuron
Result Analysis
Accessing Results
$summary = $runner->run($evaluator);
$summary->getPassedCount();
$summary->getFailedCount();
$summary->getTotalCount();
$summary->getSuccessRate();
$summary->getTotalExecutionTime();
$summary->getAverageExecutionTime();
$summary->getTotalAssertions();
$summary->getTotalAssertionsPassed();
$summary->getTotalAssertionsFailed();
$summary->getAssertionSuccessRate();
$summary->getResults();
$summary->getFailedResults();
$summary->getAssertionFailuresByLocation();
EvaluatorResult
foreach ($summary->getResults() as $result) {
$result->getIndex();
$result->isPassed();
$result->getInput();
$result->getOutput();
$result->getExecutionTime();
$result->getError();
$result->getAssertionsPassed();
$result->getAssertionsFailed();
$result->getAssertionFailures();
}
AssertionFailure
$failure->getEvaluatorClass();
$failure->getShortEvaluatorClass();
$failure->getAssertionMethod();
$failure->getMessage();
$failure->getLineNumber();
$failure->getContext();
$failure->getFullDescription();
Common Patterns
Evaluating Multiple Metrics
public function evaluate(mixed $output, array $datasetItem): void
{
$this->assert(new StringContains($datasetItem['topic']), $output);
$this->assert(new StringLengthBetween(50, 500), $output);
$this->assert(new IsValidJson(), $output);
}
Using AI Judge for Scoring
Use the built-in AgentJudge assertion for AI-powered evaluation:
use NeuronAI\Evaluation\Assertions\AgentJudge;
use NeuronAI\Evaluation\Assertions\Judges\CorrectnessJudge;
public function setUp(): void
{
$this->judge = Agent::make()
->setInstructions('You are an expert evaluator for AI responses.');
}
public function evaluate(mixed $output, array $datasetItem): void
{
$this->assert(new AgentJudge(
judge: $this->judge,
criteria: 'Rate the quality and accuracy of the response',
threshold: 0.7
), $output);
$this->assert(new CorrectnessJudge(
judge: $this->judge,
expected: $datasetItem['expected'],
threshold: 0.7
), $output);
}
Testing RAG Systems
class RAGEvaluator extends BaseEvaluator
{
    private MyRAGAgent $rag;
    private OpenAIEmbeddings $embeddings;

    public function setUp(): void
    {
        $this->rag = new MyRAGAgent();
        // evaluate() needs an embeddings provider for StringSimilarity
        $this->embeddings = new OpenAIEmbeddings(key: 'YOUR_KEY');
    }

    public function run(array $datasetItem): mixed
    {
        return $this->rag->chat(
            new UserMessage($datasetItem['question'])
        )->getMessage()->getContent();
    }

    public function evaluate(mixed $output, array $datasetItem): void
    {
        // key_facts is expected to be an array of strings
        $this->assert(new StringContainsAny($datasetItem['key_facts']), $output);
        $this->assert(new StringSimilarity(
            reference: $datasetItem['expected_answer'],
            embeddingsProvider: $this->embeddings,
            threshold: 0.7
        ), $output);
    }
}
Comparing Multiple Agents
public function setUp(): void
{
$this->agentA = new AgentOne();
$this->agentB = new AgentTwo();
}
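public function run(array $datasetItem): mixed
{
    // Assumes each dataset item has a 'question' key; adjust to your dataset
    $message = new UserMessage($datasetItem['question']);

    return [
        'agent_a' => $this->agentA->chat($message)->getMessage()->getContent(),
        'agent_b' => $this->agentB->chat($message)->getMessage()->getContent(),
    ];
}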
public function evaluate(mixed $output, array $datasetItem): void
{
$similarity = $this->calculateSimilarity(
$output['agent_a'],
$output['agent_b']
);
$this->assert(new GreaterThanAssertion(0.8), $similarity);
}
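The calculateSimilarity() helper used above is not defined in this document. One possible shape for it, sketched here as a standalone function built on PHP's `similar_text()`, is shown below; the name `calculateSimilaritySketch` and the choice of `similar_text()` are assumptions, and an embedding-based or judge-based comparison may be a better fit for longer answers.

```php
<?php
/**
 * Hypothetical helper: character-level similarity as a 0.0-1.0 score.
 * similar_text() reports a percentage through its third (by-reference) argument.
 */
function calculateSimilaritySketch(string $a, string $b): float
{
    if ($a === '' && $b === '') {
        return 1.0; // avoid treating two empty outputs as different
    }

    similar_text($a, $b, $percent);

    return $percent / 100;
}

$identical = calculateSimilaritySketch('The capital is Paris.', 'The capital is Paris.');
$unrelated = calculateSimilaritySketch('abc', 'xyz');
```

Inside an evaluator this could be added as a private method so that `$this->calculateSimilarity(...)` resolves; the resulting score is then compared against the threshold passed to GreaterThanAssertion.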
Best Practices
Evaluator Design
- Keep evaluators focused - One evaluator per use case
- Use descriptive dataset items - Include expected values, metadata
- Leverage setUp() - Initialize expensive resources once
- Test in isolation - Make run() and evaluate() pure functions
Assertion Usage
- Use specific assertions - Prefer StringContains over generic checks
- Set appropriate thresholds - Balance sensitivity vs. false positives
- Combine multiple assertions - Check different aspects of output
- Use embeddings for semantic similarity - Don't rely only on string matching
Dataset Management
- Separate test data - Keep evaluators in dedicated directory
- Use JSON for large datasets - Easier to maintain than arrays
- Include diverse cases - Edge cases, typical cases, boundary values
- Version control datasets - Track changes to test cases
Output Configuration
- Configure multiple drivers - Console for quick checks, JSON for CI/CD
- Use verbose mode during development for detailed failure info
- Custom drivers for integration with existing systems (databases, APIs)
CLI Generation
Testing Evaluators
use PHPUnit\Framework\TestCase;
use NeuronAI\Evaluation\Runner\EvaluatorRunner;
class MyEvaluatorTest extends TestCase
{
public function testEvaluatorRuns(): void
{
$runner = new EvaluatorRunner();
$evaluator = new MyEvaluator();
$summary = $runner->run($evaluator);
$this->assertGreaterThan(0, $summary->getTotalCount());
}
public function testEvaluatorHasNoFailures(): void
{
$runner = new EvaluatorRunner();
$evaluator = new MyEvaluator();
$summary = $runner->run($evaluator);
$this->assertEquals(0, $summary->getFailedCount());
}
}
Integration with CI/CD
GitHub Actions
name: Evaluation Tests
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'
      - name: Install dependencies
        run: composer install
      - name: Run evaluations
        run: vendor/bin/neuron evaluation evaluators --verbose
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Failing on Thresholds
vendor/bin/neuron evaluation evaluators || exit 1
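The shell form above only propagates whatever exit code the command returns. To fail a build when the success rate drops below a chosen minimum, one option is a small script around the programmatic runner. The sketch below shows just the gating logic on plain integers so that it runs without the library; `thresholdExitCodeSketch` is a hypothetical name, and in practice the counts would come from getPassedCount() and getTotalCount() on the summary.

```php
<?php
/**
 * Hypothetical gate: map pass counts to a process exit code.
 * Returns 0 when the success rate meets the minimum, 1 otherwise.
 */
function thresholdExitCodeSketch(int $passed, int $total, float $minimumRate): int
{
    if ($total <= 0) {
        return 1; // treat an empty run as a failure rather than a silent pass
    }

    return ($passed / $total) >= $minimumRate ? 0 : 1;
}

// 9 of 10 items passed, and at least 85% is required
$exitCode = thresholdExitCodeSketch(passed: 9, total: 10, minimumRate: 0.85);

// A CI script would typically end with: exit($exitCode);
```

Treating an empty run as a failure is a design choice in this sketch: it keeps a misconfigured evaluator path from passing the build unnoticed.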
Key Decision Points
When helping users with evaluations:
- Dataset format depends on:
  - Small datasets → ArrayDataset (in code)
  - Large/external datasets → JsonDataset (files)
- Assertion choice depends on:
  - Exact matching → StringContains, StringStartsWith
  - Pattern matching → MatchesRegex
  - Semantic similarity → StringSimilarity (embeddings)
  - Fuzzy matching → StringDistance
- Output configuration based on:
  - Development → ConsoleOutput with verbose mode
  - CI/CD → JsonOutput to file
  - Analytics → Custom driver to database/API
- Evaluation granularity:
  - Unit tests → Single assertion per evaluator
  - Integration tests → Multiple assertions
  - System tests → Multiple evaluators covering different scenarios