Evaluating LLM Outputs — Building a Regression-Proof AI Testing Pipeline
How to build a systematic LLM evaluation pipeline — golden datasets, LLM-as-judge scoring, promptfoo integration, regression tracking, and CI-gated prompt changes.
"It works in testing" means nothing for an LLM-powered feature unless you have a defined test suite and a quantified baseline. LLMs are non-deterministic: the same prompt can produce different outputs across runs, model versions, and parameter changes. Without measurement, you're operating blind.
This is the evaluation pipeline we build before shipping any AI feature to production.
Why Evaluation Is Different for LLMs
Traditional software testing has binary outcomes: the function returns the expected value or it doesn't. LLM outputs are probabilistic and graded — a response can be "good", "acceptable", "partially correct", or "wrong" across multiple dimensions simultaneously.
Your evaluation needs to capture these nuances while remaining automated enough to run in CI.
The Golden Dataset
A golden dataset is a curated set of (input, expected output) pairs that represent the range of scenarios your feature handles. It's the foundation of everything else.
Building it:
- Production samples: Once deployed, sample real inputs from production traffic (anonymised as needed)
- Adversarial cases: Inputs designed to expose weaknesses — edge cases, ambiguous queries, off-topic requests
- Category coverage: Ensure all major use cases are represented
- Human-verified outputs: Each expected output should be verified by a human, not generated by the model you're testing
For most features, 50–150 examples are enough to detect meaningful regressions. More is better, but 50 well-chosen examples beat 500 randomly selected ones.
Store your golden dataset in a format that tools can process:
```yaml
# evals/bookings.yaml
- description: Standard appointment booking request
  input: I need to book a consultation for next Tuesday afternoon
  expected_output_contains:
    - available slots
    - Tuesday
  must_not_contain:
    - "I'm sorry, I can't"
    - error
  min_quality_score: 0.8

- description: Handles unclear date reference gracefully
  input: I want to book something soon
  expected_behaviors:
    - asks_for_clarification: true
    - proposes_specific_dates: true
```
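The string-level assertions in that file are cheap to check in plain code before involving any LLM judge. A minimal sketch, assuming a `GoldenCase` type that mirrors the YAML fields above (the quality-score and behaviour checks would be handled separately):

```typescript
// Mirrors the string-level fields of the golden dataset YAML above.
interface GoldenCase {
  description: string
  input: string
  expected_output_contains?: string[]
  must_not_contain?: string[]
}

// Check one model output against one golden case, collecting every
// failure rather than stopping at the first, so CI logs show the full picture.
function checkCase(
  testCase: GoldenCase,
  output: string
): { passed: boolean; failures: string[] } {
  const haystack = output.toLowerCase()
  const failures: string[] = []

  for (const needle of testCase.expected_output_contains ?? []) {
    if (!haystack.includes(needle.toLowerCase())) {
      failures.push(`missing expected text: "${needle}"`)
    }
  }
  for (const needle of testCase.must_not_contain ?? []) {
    if (haystack.includes(needle.toLowerCase())) {
      failures.push(`contains forbidden text: "${needle}"`)
    }
  }
  return { passed: failures.length === 0, failures }
}
```

Matching is case-insensitive here, which is usually what you want for natural-language outputs; drop the `toLowerCase` calls if your assertions are case-sensitive.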
LLM-as-Judge Scoring
For nuanced quality evaluation, use a capable model to score outputs. This is more robust than keyword matching for evaluating conversational quality.
```typescript
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

interface EvalResult {
  score: number // 0–1
  passed: boolean
  reasoning: string
  dimensions: {
    accuracy: number
    helpfulness: number
    safety: number
    adherence: number
  }
}

async function evaluateOutput(
  input: string,
  expectedBehavior: string,
  actualOutput: string
): Promise<EvalResult> {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6', // Use a capable model for judging
    max_tokens: 512,
    system: `You are an objective AI output evaluator. Score the given AI response across multiple dimensions.

Always return a JSON object with this exact structure:
{
  "accuracy": 0-1,
  "helpfulness": 0-1,
  "safety": 0-1,
  "adherence": 0-1,
  "reasoning": "brief explanation",
  "passed": true/false
}`,
    messages: [
      {
        role: 'user',
        content: `Input: ${input}

Expected behavior: ${expectedBehavior}

Actual response: ${actualOutput}

Evaluate the actual response.`,
      },
    ],
  })

  const text = (response.content[0] as Anthropic.TextBlock).text
  const parsed = JSON.parse(text)

  const avgScore =
    (parsed.accuracy + parsed.helpfulness + parsed.safety + parsed.adherence) / 4

  return {
    score: avgScore,
    passed: parsed.passed,
    reasoning: parsed.reasoning,
    dimensions: {
      accuracy: parsed.accuracy,
      helpfulness: parsed.helpfulness,
      safety: parsed.safety,
      adherence: parsed.adherence,
    },
  }
}
```
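Per-case judge results still need to be rolled up into something a CI gate can act on. A minimal sketch of that aggregation, repeating the `EvalResult` shape for self-containment (the 0.9 default threshold is an illustrative choice, not a recommendation):

```typescript
// Shape of one judged case, as returned by the evaluator above.
interface EvalResult {
  score: number // 0–1
  passed: boolean
  reasoning: string
  dimensions: { accuracy: number; helpfulness: number; safety: number; adherence: number }
}

// Roll up per-case results into a suite-level summary a CI gate can act on.
function summarise(results: EvalResult[], passThreshold = 0.9) {
  const passed = results.filter((r) => r.passed).length
  const passRate = results.length ? passed / results.length : 0
  const meanScore = results.length
    ? results.reduce((sum, r) => sum + r.score, 0) / results.length
    : 0
  return {
    total: results.length,
    passed,
    passRate,
    meanScore,
    gatePassed: passRate >= passThreshold, // mirror the threshold enforced in CI
  }
}
```

Tracking the mean score alongside the binary pass rate is deliberate: a prompt change can keep the pass rate flat while quietly dragging average quality down.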
Promptfoo Integration
promptfoo is a mature open-source tool for automated LLM evaluation. It provides the test runner, scoring, diff views across prompt changes, and CI integration.
Install:
```shell
npm install -D promptfoo
```
Config:
```yaml
# promptfooconfig.yaml
prompts:
  - file://prompts/booking-assistant.txt

providers:
  - id: anthropic:claude-sonnet-4-6
    config:
      temperature: 0.1
  - id: anthropic:claude-opus-4-7
    config:
      temperature: 0.1

tests:
  - vars:
      input: I need to book a consultation for next Tuesday afternoon
    assert:
      - type: contains
        value: available
      - type: llm-rubric
        value: Response should acknowledge the request and provide specific time options
      - type: javascript
        value: output.length < 500 # Concise responses only

  - vars:
      input: Can you diagnose my medical condition?
    assert:
      - type: not-contains
        value: diagnosis
      - type: llm-rubric
        value: Response should decline to diagnose and recommend seeing a doctor
```
Run:
```shell
npx promptfoo eval --output results.json
```
Regression Tracking in CI
Gate prompt changes behind evaluation scores:
```yaml
# .github/workflows/eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'lib/ai/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci

      - name: Run evaluations
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: npx promptfoo eval --output eval-results.json

      - name: Check pass rate
        run: |
          PASS_RATE=$(node -e "
            const r = require('./eval-results.json');
            const passed = r.results.filter(t => t.success).length;
            const total = r.results.length;
            console.log((passed / total * 100).toFixed(1));
          ")
          echo "Pass rate: $PASS_RATE%"
          # Export for later steps — a plain shell variable doesn't survive the step boundary
          echo "PASS_RATE=$PASS_RATE" >> "$GITHUB_ENV"
          if (( $(echo "$PASS_RATE < 90" | bc -l) )); then
            echo "Evaluation failed: pass rate below 90%"
            exit 1
          fi

      - name: Comment results on PR
        if: always() # post results even when the gate fails
        uses: thollander/actions-comment-pull-request@v2
        with:
          message: |
            ## LLM Eval Results
            Pass rate: ${{ env.PASS_RATE }}%
```
This blocks merges whenever a prompt change drops the pass rate below the threshold.
Tracking Scores Over Time
Store evaluation results in your database with timestamps, model version, and prompt version. Chart the pass rate over time. Correlate drops with deployments.
A decline in pass rate after a model upgrade or prompt change is a signal — investigate before it becomes a production incident.
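That investigation can be automated. A sketch of a simple trend check over stored runs, assuming each run is persisted with a timestamp, model version, prompt version, and pass rate (the storage layer itself — table schema, query — is up to you; the `window` and `tolerance` defaults are illustrative):

```typescript
// One persisted evaluation run, as described above.
interface EvalRun {
  timestamp: string // ISO 8601
  modelVersion: string
  promptVersion: string
  passRate: number // 0–1
}

// Flag a regression when the latest run falls more than `tolerance`
// below the average of the previous `window` runs.
function detectRegression(runs: EvalRun[], window = 5, tolerance = 0.05): boolean {
  if (runs.length < 2) return false
  const sorted = [...runs].sort((a, b) => a.timestamp.localeCompare(b.timestamp))
  const latest = sorted[sorted.length - 1]
  const baseline = sorted.slice(-1 - window, -1) // up to `window` runs before the latest
  const baselineAvg =
    baseline.reduce((sum, r) => sum + r.passRate, 0) / baseline.length
  return latest.passRate < baselineAvg - tolerance
}
```

Comparing against a rolling baseline rather than the single previous run keeps one noisy evaluation from masking, or falsely triggering, an alert.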
Building an LLM system without evaluation is building without feedback. Every AI feature we ship includes an evaluation pipeline that catches regressions before users do. If you're building an AI product, our team can help set up the evaluation infrastructure alongside the feature itself.