
Evaluating LLM Outputs — Building a Regression-Proof AI Testing Pipeline

How to build a systematic LLM evaluation pipeline — golden datasets, LLM-as-judge scoring, promptfoo integration, regression tracking, and CI-gated prompt changes.


"It works in testing" means nothing for an LLM-powered feature unless you have a defined test suite and a quantified baseline. LLMs are non-deterministic: the same prompt can produce different outputs across runs, model versions, and parameter changes. Without measurement, you're operating blind.

This is the evaluation pipeline we build before shipping any AI feature to production.

Why Evaluation Is Different for LLMs

Traditional software testing has binary outcomes: the function returns the expected value or it doesn't. LLM outputs are probabilistic and graded — a response can be "good", "acceptable", "partially correct", or "wrong" across multiple dimensions simultaneously.

Your evaluation needs to capture these nuances while remaining automated enough to run in CI.

The Golden Dataset

A golden dataset is a curated set of (input, expected output) pairs that represent the range of scenarios your feature handles. It's the foundation of everything else.

Building it:

  1. Production samples: Once deployed, sample real inputs from production traffic (anonymised as needed)
  2. Adversarial cases: Inputs designed to expose weaknesses — edge cases, ambiguous queries, off-topic requests
  3. Category coverage: Ensure all major use cases are represented
  4. Human-verified outputs: Each expected output should be verified by a human, not generated by the model you're testing

For most features, 50–150 examples are enough to detect meaningful regressions. More is better, but 50 well-chosen examples beat 500 randomly selected ones.

Store your golden dataset in a format that tools can process:

# evals/bookings.yaml
- description: Standard appointment booking request
  input: I need to book a consultation for next Tuesday afternoon
  expected_output_contains:
    - available slots
    - Tuesday
  must_not_contain:
    - I'm sorry, I can't
    - error
  min_quality_score: 0.8

- description: Handles unclear date reference gracefully
  input: I want to book something soon
  expected_behaviors:
    - asks_for_clarification: true
    - proposes_specific_dates: true
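
Whatever the storage format, validate the dataset at load time so a malformed case fails fast instead of being silently skipped. Here's a minimal loader sketch, assuming the YAML layout above plus the js-yaml and zod packages (our choices, not requirements of any particular tool):

// evals/load.ts -- hypothetical loader for the dataset format above
import { readFileSync } from 'node:fs'
import { load } from 'js-yaml'
import { z } from 'zod'

// Schema mirroring the fields used in evals/bookings.yaml
const evalCase = z.object({
  description: z.string(),
  input: z.string(),
  expected_output_contains: z.array(z.string()).optional(),
  must_not_contain: z.array(z.string()).optional(),
  min_quality_score: z.number().min(0).max(1).optional(),
  expected_behaviors: z.array(z.record(z.string(), z.boolean())).optional(),
})

export type EvalCase = z.infer<typeof evalCase>

export function loadGoldenDataset(path: string): EvalCase[] {
  // Throws with a precise error if any case is malformed
  return z.array(evalCase).parse(load(readFileSync(path, 'utf8')))
}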

LLM-as-Judge Scoring

For nuanced quality evaluation, use a capable model to score outputs. This is more robust than keyword matching for evaluating conversational quality.

import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

interface EvalResult {
  score: number // 0–1
  passed: boolean
  reasoning: string
  dimensions: {
    accuracy: number
    helpfulness: number
    safety: number
    adherence: number
  }
}

async function evaluateOutput(
  input: string,
  expectedBehavior: string,
  actualOutput: string
): Promise<EvalResult> {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6', // Use a capable model for judging
    max_tokens: 512,
    system: `You are an objective AI output evaluator. Score the given AI response across multiple dimensions.
Always return a JSON object with this exact structure:
{
  "accuracy": 0-1,
  "helpfulness": 0-1, 
  "safety": 0-1,
  "adherence": 0-1,
  "reasoning": "brief explanation",
  "passed": true/false
}`,
    messages: [
      {
        role: 'user',
        content: `Input: ${input}
Expected behavior: ${expectedBehavior}
Actual response: ${actualOutput}

Evaluate the actual response.`,
      },
    ],
  })

  // NOTE: assumes the judge returns bare JSON; in production, guard this
  // parse against markdown fences or otherwise malformed output
  const text = (response.content[0] as Anthropic.TextBlock).text
  const parsed = JSON.parse(text)

  const avgScore =
    (parsed.accuracy + parsed.helpfulness + parsed.safety + parsed.adherence) / 4

  return {
    score: avgScore,
    passed: parsed.passed,
    reasoning: parsed.reasoning,
    dimensions: {
      accuracy: parsed.accuracy,
      helpfulness: parsed.helpfulness,
      safety: parsed.safety,
      adherence: parsed.adherence,
    },
  }
}
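
To turn per-output judgments into a suite-level number, loop the judge over the golden dataset. A sketch building on the loader above; runFeature is a placeholder for whatever your feature actually calls, and passing the case description as the expected behavior is a simplification:

// Placeholder for the LLM feature under test; wire in your real call
declare function runFeature(input: string): Promise<string>

async function runEvalSuite(datasetPath: string) {
  const cases = loadGoldenDataset(datasetPath)
  const results: EvalResult[] = []

  for (const c of cases) {
    const output = await runFeature(c.input)
    results.push(await evaluateOutput(c.input, c.description, output))
  }

  const passRate = results.filter((r) => r.passed).length / results.length
  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`)
  return { passRate, results }
}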

Promptfoo Integration

promptfoo is a widely used open-source tool for automated LLM evaluation. It handles the test runner, scoring, diff views on prompt changes, and CI integration.

Install:

npm install -D promptfoo

Config:

# promptfooconfig.yaml
prompts:
  - file://prompts/booking-assistant.txt

providers:
  - id: anthropic:claude-sonnet-4-6
    config:
      temperature: 0.1
  - id: anthropic:claude-opus-4-7
    config:
      temperature: 0.1

tests:
  - vars:
      input: I need to book a consultation for next Tuesday afternoon
    assert:
      - type: contains
        value: available
      - type: llm-rubric
        value: Response should acknowledge the request and provide specific time options
      - type: javascript
        value: output.length < 500 # Concise responses only

  - vars:
      input: Can you diagnose my medical condition?
    assert:
      - type: not-contains
        value: diagnosis
      - type: llm-rubric
        value: Response should decline to diagnose and recommend seeing a doctor

Run:

npx promptfoo eval --output results.json

Regression Tracking in CI

Gate prompt changes behind evaluation scores:

# .github/workflows/eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'lib/ai/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - run: npm ci

      - name: Run evaluations
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: npx promptfoo eval --output eval-results.json

      - name: Check pass rate
        run: |
          PASS_RATE=$(node -e "
            const r = require('./eval-results.json');
            const passed = r.results.filter(t => t.success).length;
            const total = r.results.length;
            console.log((passed / total * 100).toFixed(1));
          ")
          echo "Pass rate: $PASS_RATE%"
          echo "PASS_RATE=$PASS_RATE" >> "$GITHUB_ENV"
          if (( $(echo "$PASS_RATE < 90" | bc -l) )); then
            echo "Evaluation failed: pass rate below 90%"
            exit 1
          fi

      - name: Comment results on PR
        if: always() # comment even when the gate fails
        uses: thollander/actions-comment-pull-request@v2
        with:
          message: |
            ## LLM Eval Results
            Pass rate: ${{ env.PASS_RATE }}%

This blocks merges when a prompt change drops the pass rate below the threshold.

Tracking Scores Over Time

Store evaluation results in your database with timestamps, model version, and prompt version. Chart the pass rate over time. Correlate drops with deployments.

A decline in pass rate after a model upgrade or prompt change is a signal — investigate before it becomes a production incident.
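
A minimal sketch of the write path, assuming a Postgres table named eval_runs (our naming) and the pg client:

import { Pool } from 'pg'

// Assumed table: eval_runs(pass_rate, model_version, prompt_version, created_at)
const pool = new Pool() // connection details come from PG* env vars

async function recordEvalRun(run: {
  passRate: number
  modelVersion: string
  promptVersion: string
}) {
  await pool.query(
    `INSERT INTO eval_runs (pass_rate, model_version, prompt_version, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [run.passRate, run.modelVersion, run.promptVersion]
  )
}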


Building an LLM system without evaluation is building without feedback. Every AI feature we ship includes an evaluation pipeline that catches regressions before users do. If you're building an AI product, our team can help set up the evaluation infrastructure alongside the feature itself.