Building a Fixture-Driven LLM Evaluation Framework

From Strategy to Infrastructure

In a previous post, I covered the conceptual two-tier strategy for testing LLM applications: deterministic validation (Tier 1) that blocks CI/CD, and AI judge quality assessment (Tier 2) that runs in advisory mode. That post answered “what should you test?”; this one answers “what does the evaluation infrastructure actually look like?”

This is the story of building a fixture-driven evaluation framework that tests AI-generated ski resort banners across 31 scenarios, validates 4 external API contracts nightly, and produces markdown reports I can diff over time. No watch mode. No visual regression. Just fixtures, assertions, and reports.

Our Summit AI dashboard displays a dynamic banner at the top of the page — a short message with contextual badges (powder alert, highway chains, peak day warning, crowd level, etc.). The banner is generated by an LLM, but the badge selection is deterministic: a badge calculator service examines real data from 5 sources and decides which badges to attach.

The challenge: you can’t unit test this in isolation. The banner quality depends on:

Which data sources are available (some go down seasonally)
The specific combination of weather + schedule + highway + crowd data
The LLM’s ability to summarize conditions into ~200 characters
Badge mutual exclusion rules (e.g., peakDay and crowd can never coexist)
Safety-first ordering (highway restrictions must lead)

This needs scenario-level testing — complete, realistic data snapshots that exercise the entire production code path.

Fixture Anatomy: A Complete World Snapshot

Each fixture is a self-contained JSON file representing one ski day scenario. Here’s a simplified view of the powder day fixture:

{
  "id": "powder-day",
  "name": "Powder Day - Fresh Snow, All Areas Open",
  "description": "Epic powder day: 10\" overnight, clear skies, full operations",
  "expectedBadges": [{ "type": "powder", "tier": "legendary" }],
  "dataPrep": {
    "sources": {
      "summitReport": {
        "date": "2026-01-15",
        "reportText": "POWDER ALERT! 10 inches of fresh snow overnight. All lifts spinning...",
        "hasReport": true
      },
      "summitWeather": {
        "temperature": 28,
        "conditions": "Partly Cloudy",
        "snowfall24h": 10,
        "snowfall48h": 14,
        "baseDepth": 72
      },
      "summitSchedule": {
        "liftsOpen": 14,
        "liftsTotal": 14,
        "trailsOpen": 65,
        "trailsTotal": 65,
        "areas": [
          { "name": "Summit West", "status": "Open", "hours": "9:00 AM - 9:30 PM" },
          { "name": "Alpental", "status": "Open", "hours": "9:00 AM - 4:00 PM" }
        ]
      },
      "noaaForecast": {
        "daily": [{ "date": "2026-01-15", "conditions": "Partly Cloudy", "tempHigh": 32, "snowfall": "0" }]
      }
    }
  },
  "wsdotCondition": {
    "passName": "Snoqualmie Pass",
    "roadCondition": "Wet",
    "restriction": { "type": "none", "details": "" }
  },
  "peakDayResult": {
    "isPeakDay": false,
    "crowdLevel": "moderate"
  },
  "highlightExpectations": {
    "requiredCategories": ["conditions", "operations"],
    "requiredKeywords": ["10", "open"],
    "minHighlights": 3,
    "maxHighlights": 4
  }
}

This fixture encodes everything: the mountain report text the LLM will see, the weather data that drives badge calculation, the schedule data, highway conditions, crowd predictions, and — critically — what we expect the output to contain.

We have 31 of these. They cover:

| Category | Fixtures | Examples |
| --- | --- | --- |
| Core scenarios | 6 | Powder day, holiday weekend, chain control, poor conditions, midweek, all-sources-fail |
| Data source coverage | 7 | Missing report, missing weather, missing forecast, highway-only, combined failures |
| Badge edge cases | 4 | Legendary powder, multiple highways, improving visibility, Saturday peak |
| Highway parsing | 2 | Traction advised, clear roads |
| Seasonal edge cases | 4 | Early season, spring conditions, wind hold, incoming storm |
| Time-of-day | 4 | Late night closed, early morning planning, evening session, stale report |
| Robustness | 4 | Peak/crowd conflict, malformed weather, zero snowfall, link syntax bleed |
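To make the fixture shape concrete, here is a minimal TypeScript sketch of a fixture type plus a sanity check a loader could run before any API calls. The `BannerFixture` name appears in the runner code; the field types and the `findFixtureProblems` helper are assumptions inferred from the simplified JSON above, not the project's actual definitions.

```typescript
// Hypothetical types inferred from the simplified powder-day fixture.
// The real definitions may carry more fields.
interface ExpectedBadge {
  type: string;
  tier?: string;
}

interface HighlightExpectations {
  requiredCategories: string[];
  requiredKeywords: string[];
  minHighlights: number;
  maxHighlights: number;
}

interface BannerFixture {
  id: string;
  name: string;
  description: string;
  expectedBadges: ExpectedBadge[];
  dataPrep: { sources: Record<string, unknown> };
  wsdotCondition?: unknown;
  peakDayResult?: { isPeakDay: boolean; crowdLevel: string };
  highlightExpectations?: HighlightExpectations;
}

// Returns a list of problems; an empty list means the fixture is usable.
function findFixtureProblems(raw: unknown): string[] {
  if (typeof raw !== 'object' || raw === null) {
    return ['fixture is not an object'];
  }
  const problems: string[] = [];
  const f = raw as Partial<BannerFixture>;

  if (typeof f.id !== 'string' || f.id.length === 0) {
    problems.push('missing id');
  }
  if (!Array.isArray(f.expectedBadges)) {
    problems.push('expectedBadges must be an array');
  }
  const sources = f.dataPrep ? f.dataPrep.sources : undefined;
  if (typeof sources !== 'object' || sources === null) {
    problems.push('missing dataPrep.sources');
  }
  const h = f.highlightExpectations;
  if (h && h.minHighlights > h.maxHighlights) {
    problems.push('minHighlights exceeds maxHighlights');
  }
  return problems;
}
```

Running a check like this at load time means a malformed fixture fails fast with a readable message instead of surfacing later as a confusing evaluation failure.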

The Evaluation Runner

When you run `nx run llm-evaluations:eval:banner`, the runner does this:

1. Loads all 31 fixtures from JSON files
2. Calls the actual production API at `localhost:3001/api/summit/banner?fixture=<id>`
3. Validates each response against the fixture’s expectations
4. Generates a markdown report saved to `evaluation-reports/`

The key design decision: we test through the real API, not the LLM directly. This ensures the deterministic badge calculator, the data transformation layer, and the LLM prompt all get exercised together. If we mocked the badge calculator, we’d miss the bugs that only appear once real data has passed through the transformation layer, which is exactly the kind that bit us in production.

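// Calls the production banner endpoint for one fixture and returns the fields we assert on.
async function callBannerAPI(fixture: BannerFixture, apiUrl: string) {
  const url = `${apiUrl}/api/summit/banner?fixture=${encodeURIComponent(fixture.id)}`;
  const response = await fetch(url);
  // Fail loudly on HTTP errors instead of trying to parse an error page as JSON
  if (!response.ok) throw new Error(`Banner API returned ${response.status} for fixture ${fixture.id}`);
  const data = await response.json();
  return { message: data.message, badges: data.badges };
}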
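The load, call, validate loop around that function can be sketched as follows. This is an illustration under assumptions, not the project's runner: `runEvaluation`, `FixtureResult`, and the injected `callApi` parameter are hypothetical names, and the API call is passed in so the sketch runs without a server.

```typescript
interface Badge { type: string }
interface BannerResponse { message: string; badges: Badge[] }
interface Fixture { id: string; expectedBadges: Badge[] }

interface FixtureResult {
  id: string;
  passed: boolean;
  missing: string[];
  durationMs: number;
  error?: string;
}

// The API call is injected so the loop can be exercised with a fake.
type CallApi = (fixture: Fixture) => Promise<BannerResponse>;

async function runEvaluation(fixtures: Fixture[], callApi: CallApi): Promise<FixtureResult[]> {
  const results: FixtureResult[] = [];
  // Sequential on purpose (an assumption): per-fixture timings stay
  // meaningful and the local API is not flooded with parallel LLM calls.
  for (const fixture of fixtures) {
    const started = Date.now();
    try {
      const response = await callApi(fixture);
      const actual = response.badges.map((b) => b.type);
      const missing = fixture.expectedBadges
        .map((b) => b.type)
        .filter((t) => !actual.includes(t));
      results.push({
        id: fixture.id,
        passed: missing.length === 0,
        missing,
        durationMs: Date.now() - started,
      });
    } catch (err) {
      // A failed request is a failed fixture, not a crashed run.
      results.push({
        id: fixture.id,
        passed: false,
        missing: [],
        durationMs: Date.now() - started,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return results;
}
```

Recording an API error as a failed fixture, rather than letting it abort the loop, keeps one flaky request from hiding the results of the remaining fixtures.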

Assertion-Based Scoring (Not Numeric)

We don’t score responses 1-10. We use binary assertions:

// Check 1: All expected badges must be present (subset check)
const missingBadges = expectedBadgeTypes.filter((expected) => !actualBadgeTypes.includes(expected));
const hasAllExpected = missingBadges.length === 0;

// Check 2: Mutual exclusion — peakDay and crowd NEVER coexist
const mutualExclusionViolation = actualBadgeTypes.includes('peakDay') && actualBadgeTypes.includes('crowd');

// Check 3: No link syntax bleeding through from scraped data
const linkPatterns = [/\[link\]/i, /https?:\/\//, /summitatsnoqualmie\.com/i];
const hasLinkBleed = linkPatterns.some((p) => p.test(result.message));

This is intentional. Numeric scoring introduces subjectivity and drift. Binary assertions tell you exactly what broke and why.
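Put together, the three checks above fit in one function that returns a pass/fail flag plus the reasons for any failure. This is a self-contained sketch: `assertBanner` and `AssertionOutcome` are hypothetical names, and the patterns mirror the ones shown above.

```typescript
interface Badge { type: string }
interface BannerResult { message: string; badges: Badge[] }
interface AssertionOutcome { passed: boolean; failures: string[] }

// No `g` flag on these patterns, so RegExp.test() carries no state between calls.
const LINK_PATTERNS: RegExp[] = [/\[link\]/i, /https?:\/\//, /summitatsnoqualmie\.com/i];

function assertBanner(expectedBadgeTypes: string[], result: BannerResult): AssertionOutcome {
  const failures: string[] = [];
  const actual = result.badges.map((b) => b.type);

  // Check 1: every expected badge is present (extra badges are allowed).
  const missing = expectedBadgeTypes.filter((t) => !actual.includes(t));
  if (missing.length > 0) {
    failures.push(`missing badges: ${missing.join(', ')}`);
  }

  // Check 2: peakDay and crowd must never appear together.
  if (actual.includes('peakDay') && actual.includes('crowd')) {
    failures.push('mutual exclusion violated: peakDay and crowd both present');
  }

  // Check 3: no link syntax from scraped data in the message.
  const bleed = LINK_PATTERNS.find((p) => p.test(result.message));
  if (bleed) {
    failures.push(`link syntax bleed matched /${bleed.source}/`);
  }

  return { passed: failures.length === 0, failures };
}
```

Collecting every failure instead of stopping at the first one means a single run reports everything that is wrong with a fixture.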

Data Transparency Analysis

Each badge carries provenance metadata — where the data came from, with a URL and preview:

{
  "type": "powder",
  "tier": "heavy",
  "snowfall": 10,
  "provenance": {
    "reason": "10\" of heavy powder fell in the last 24 hours",
    "source": "Summit Weather API",
    "url": "https://summitatsnoqualmie.com/mountain-report",
    "dataPreview": {
      "label": "Recent Snowfall",
      "rows": [
        { "key": "Last 24 hours", "value": "10\"" },
        { "key": "Last 48 hours", "value": "14\"" }
      ]
    }
  }
}

The evaluation runner tracks transparency metrics across all fixtures: what percentage of badges include source URLs, what percentage include data previews. This catches regressions where prompt changes accidentally strip provenance.
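As a rough illustration of those transparency metrics, the sketch below computes the two percentages over all badges from a run. The `transparencyMetrics` function and its return shape are assumptions; only the `provenance.url` and `provenance.dataPreview` fields come from the example above.

```typescript
interface DataPreview { label: string; rows: { key: string; value: string }[] }
interface Provenance { reason?: string; source?: string; url?: string; dataPreview?: DataPreview }
interface Badge { type: string; provenance?: Provenance }

interface TransparencyMetrics {
  totalBadges: number;
  withUrlPct: number;
  withPreviewPct: number;
}

// Percentage rounded to one decimal place; 0 when there are no badges
// (an arbitrary choice for the empty case).
function pct(part: number, whole: number): number {
  return whole === 0 ? 0 : Math.round((part / whole) * 1000) / 10;
}

// Takes the badges returned for each fixture and aggregates across the run.
function transparencyMetrics(badgesPerFixture: Badge[][]): TransparencyMetrics {
  let total = 0;
  let withUrl = 0;
  let withPreview = 0;
  for (const badges of badgesPerFixture) {
    for (const badge of badges) {
      total += 1;
      if (badge.provenance?.url) withUrl += 1;
      // A preview with no rows shows the user nothing, so it does not count.
      const rows = badge.provenance?.dataPreview?.rows;
      if (rows && rows.length > 0) withPreview += 1;
    }
  }
  return {
    totalBadges: total,
    withUrlPct: pct(withUrl, total),
    withPreviewPct: pct(withPreview, total),
  };
}
```

Writing these two numbers into each report makes a drop in provenance coverage visible in a diff between runs.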

Golden Prompt Versioning

The system prompt that generates banners has gone through three major versions, each stored as a golden prompt file:

v1-baseline — Free-form generation, high variability
v2-structured — JSON output with badge array, more consistent
v3-badge-focused — Decision tree format, deterministic badge logic moved out of LLM

The v3 prompt is the current production version. It uses a decision tree that the LLM follows:

Q1: Is there highway restriction? → ADD highway badge
Q2: Is there fresh snow? → ADD powder badge (light/fresh/heavy/legendary tiers)
Q3: Is it a peak day? → ADD peakDay badge
Q4: Are crowds expected? → ADD crowd badge
Q5: Is visibility poor? → ADD visibility badge

By moving badge selection to deterministic code and keeping only the message generation in the LLM, we reduced evaluation failures from ~15% (v1) to 0% (v3) across all 31 fixtures.
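For illustration, here is roughly what that decision tree looks like once it lives in deterministic code. This is a sketch, not the production badge calculator. The `Conditions` input shape is hypothetical, the powder tier is taken as an input so no snowfall thresholds are invented here, and letting `peakDay` win over `crowd` is my assumption about how the mutual exclusion rule is resolved.

```typescript
type BadgeType = 'highway' | 'powder' | 'peakDay' | 'crowd' | 'visibility';
type PowderTier = 'light' | 'fresh' | 'heavy' | 'legendary';

interface Badge { type: BadgeType; tier?: PowderTier }

// Hypothetical, already-interpreted view of the data sources.
interface Conditions {
  hasHighwayRestriction: boolean;
  powderTier: PowderTier | null; // null means no fresh snow
  isPeakDay: boolean;
  crowdsExpected: boolean;
  poorVisibility: boolean;
}

function selectBadges(c: Conditions): Badge[] {
  const badges: Badge[] = [];

  // Q1 goes first so highway restrictions always lead (safety-first ordering).
  if (c.hasHighwayRestriction) badges.push({ type: 'highway' });

  // Q2: fresh snow.
  if (c.powderTier !== null) badges.push({ type: 'powder', tier: c.powderTier });

  // Q3/Q4: peakDay and crowd are mutually exclusive.
  // Assumption: peakDay takes precedence when both apply.
  if (c.isPeakDay) {
    badges.push({ type: 'peakDay' });
  } else if (c.crowdsExpected) {
    badges.push({ type: 'crowd' });
  }

  // Q5: visibility.
  if (c.poorVisibility) badges.push({ type: 'visibility' });

  return badges;
}
```

Because the exclusion is an `else if`, the function cannot return both badges, so the mutual exclusion assertion in the evaluator acts as a regression guard rather than a frequent failure.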

Contract Tests: Catching External API Breakage

Fixtures test our code. Contract tests test everyone else’s code.

We validate 4 external APIs nightly. The OpenAI suite shows the pattern:

describe('API Contract Tests - OpenAI', () => {
  it('validates GPT-5 integration (Responses API)', async () => {
    const response = await gpt5Service.generateText('What is 2+2?', {
      maxTokens: 200,
    });

    expect(response).toHaveProperty('content');
    expect(response.model).toContain('gpt-5');
    expect(response.usage?.promptTokens).toBeGreaterThan(0);
  });

  it('validates model availability', async () => {
    const models = await openai.models.list();
    const availableIds = models.data.map((m) => m.id);
    expect(availableIds).toContain('gpt-5');
    expect(availableIds).toContain('gpt-4o-mini');
  });
});

Each external API gets its own contract test suite:

| API | What We Validate | Cost/Run |
| --- | --- | --- |
| OpenAI (GPT-5, GPT-4o-mini) | Response structure, model availability, pricing, API routing (Responses vs Chat Completions) | ~$0.005 |
| Google Gemini | Flash integration, service configuration, response structure | Free |
| Scryfall | Card data schema (`name`, `mana_cost`, `prices`), search endpoint structure | Free |
| Summit at Snoqualmie | Schedule integration (both our library AND the raw upstream API), area status values, 30-day date range | Free |

The Summit tests are particularly interesting — we test both our transformation library (does our code work?) and the raw upstream API (has Summit changed their data format?):

// Test OUR library
const schedule = await fetchSummitSkiSchedule();
expect(schedule[firstDate]).toHaveProperty('status');
expect(['Open', 'Closed', 'TBD']).toContain(firstArea.status);

// Test Summit's raw API directly
const response = await fetch(SUMMIT_UPSTREAM_URL);
const data = await response.json();
expect(data[0].data[0].data).toHaveProperty('status');
expect(['OPEN', 'CLOSED', 'TBD']).toContain(data[0].data[0].data.status);

This dual-layer approach means we know whether a bug is in Summit’s API or in our transformation code.
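The two snippets above suggest that the upstream API reports statuses in upper case while our library exposes title case. Under that assumption, the transformation and the "whose bug is it?" decision can be sketched like this. The `normalizeStatus` and `locateFailure` helpers are hypothetical illustrations, not functions from the library.

```typescript
type LibraryStatus = 'Open' | 'Closed' | 'TBD';

// Assumed mapping from raw upstream values to the library's values.
const UPSTREAM_TO_LIBRARY = new Map<string, LibraryStatus>([
  ['OPEN', 'Open'],
  ['CLOSED', 'Closed'],
  ['TBD', 'TBD'],
]);

// Throws on unknown values so a new upstream status is noticed
// instead of being silently passed through.
function normalizeStatus(raw: string): LibraryStatus {
  const mapped = UPSTREAM_TO_LIBRARY.get(raw.trim().toUpperCase());
  if (mapped === undefined) {
    throw new Error(`Unknown upstream status: ${raw}`);
  }
  return mapped;
}

type FailureLocation = 'none' | 'upstream' | 'library';

// Combines the results of the two contract test layers.
function locateFailure(upstreamContractOk: boolean, libraryContractOk: boolean): FailureLocation {
  // If the raw API no longer matches its contract, the library result
  // cannot be trusted either way, so upstream is the first suspect.
  if (!upstreamContractOk) return 'upstream';
  if (!libraryContractOk) return 'library';
  return 'none';
}
```

The ordering in `locateFailure` encodes the triage rule: check whether the raw API still honors its contract before debugging our own transformation.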

Total contract test cost: ~$0.01 per run. Cheap insurance.

Evaluation Reports: Diffable Over Time

Every evaluation run produces a timestamped markdown report:

evaluation-reports/
├── banner-api-eval-2026-01-10T07-09-39.md
├── banner-api-eval-2026-01-13T05-15-51.md
├── banner-api-eval-2026-01-21T20-55-57.md
├── banner-api-eval-2026-02-01T21-57-47.md
├── banner-api-eval-2026-02-07T05-31-14.md   ← Latest: 31/31, 100%, 22.6s
└── ... (60+ reports accumulated)

Each report contains per-fixture results: badge accuracy (pass/fail), mutual exclusion validation, link syntax check, the actual badges generated, the LLM message, and timing data. Because they’re markdown, I can git diff between runs to see exactly what changed.
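A minimal sketch of how such a report could be named and rendered is below. The filename pattern matches the listing above, assuming the timestamps are UTC; the report layout and the `FixtureResult` shape are simplified assumptions rather than the real report format.

```typescript
interface FixtureResult {
  id: string;
  passed: boolean;
  badges: string[];
  durationMs: number;
}

// 2026-02-07T05:31:14.000Z -> banner-api-eval-2026-02-07T05-31-14.md
// Colons are replaced because they are not valid in filenames on every OS.
function reportFileName(runAt: Date): string {
  const stamp = runAt.toISOString().slice(0, 19).replace(/:/g, '-');
  return `banner-api-eval-${stamp}.md`;
}

function renderReport(results: FixtureResult[], runAt: Date): string {
  const passed = results.filter((r) => r.passed).length;
  const totalMs = results.reduce((sum, r) => sum + r.durationMs, 0);
  const lines: string[] = [
    `# Banner API Evaluation (${runAt.toISOString()})`,
    '',
    `${passed}/${results.length} fixtures passed in ${(totalMs / 1000).toFixed(1)}s`,
    '',
    '| Fixture | Result | Badges | Time (ms) |',
    '| --- | --- | --- | --- |',
  ];
  // Sort by fixture id so a diff shows behavior changes, not ordering noise.
  const sorted = results.slice().sort((a, b) => a.id.localeCompare(b.id));
  for (const r of sorted) {
    const badges = r.badges.length > 0 ? r.badges.join(', ') : 'none';
    lines.push(`| ${r.id} | ${r.passed ? 'pass' : 'fail'} | ${badges} | ${r.durationMs} |`);
  }
  return lines.join('\n') + '\n';
}
```

A stable row order matters more than it looks: if fixtures were written in completion order, every diff would be dominated by reshuffled lines.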

The latest run shows:

31 fixtures tested, 100% success rate
22.6s total (~730ms per fixture average)
Cost: ~$0.06 per full run

This is the heartbeat of the evaluation system. I run it after prompt changes, after model updates, and before releases.

Multiple Evaluation Modules

The banner evaluator is the most mature, but the framework supports multiple domains:

| Module | Command | What It Tests |
| --- | --- | --- |
| `eval:banner` | `nx run llm-evaluations:eval:banner` | Banner generation across 31 fixtures |
| `eval:cribbage` | `nx run llm-evaluations:eval:cribbage` | Multi-model cribbage strategy comparison |
| `eval:cribbage-production` | `nx run llm-evaluations:eval:cribbage-production` | Production discard quality validation |
| `eval:function-calling` | `nx run llm-evaluations:eval:function-calling` | Tool selection model comparison |
| `eval:daily-briefing` | `nx run llm-evaluations:eval:daily-briefing` | Ski conditions narrative quality |

Each module follows the same pattern: load fixtures → call production API → validate assertions → generate report. The cribbage evaluator adds expert-validated constraint checking (specific cards that must/must not be discarded), while function-calling adds multi-model comparison across different providers.
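The must/must-not constraint idea is simple enough to sketch. The `DiscardConstraints` shape, the function name, and the card notation (rank plus suit, such as `5H`) are assumptions for illustration; the actual cribbage fixtures may encode constraints differently.

```typescript
// Hypothetical constraint shape for one cribbage fixture.
interface DiscardConstraints {
  mustDiscard: string[];    // cards an expert says should go to the crib
  mustNotDiscard: string[]; // cards an expert says must be kept
}

// Returns human-readable violations; an empty array means the
// model's discard satisfies every constraint.
function findConstraintViolations(discarded: string[], constraints: DiscardConstraints): string[] {
  const violations: string[] = [];
  for (const card of constraints.mustDiscard) {
    if (!discarded.includes(card)) {
      violations.push(`expected ${card} to be discarded`);
    }
  }
  for (const card of constraints.mustNotDiscard) {
    if (discarded.includes(card)) {
      violations.push(`${card} must not be discarded`);
    }
  }
  return violations;
}
```

Like the banner checks, this stays binary: the discard either respects the expert constraints or it does not, and the violation text says which card caused the failure.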

The Full Testing Ecosystem

Here’s how all the layers work together in practice:

| Layer | Runs When | Blocks? | Cost | What It Catches |
| --- | --- | --- | --- | --- |
| Unit tests | Every commit (pre-push hook) | ✅ Yes | Free | Logic bugs, type errors, regressions |
| Contract tests | Nightly CI + on-demand | ⚠️ CI only | ~$0.01 | External API changes, model deprecations, pricing changes |
| Banner evals (31 fixtures) | After prompt changes, before releases | ❌ Advisory | ~$0.06 | Badge miscalculation, prompt drift, link bleed, mutual exclusion violations |
| Golden prompts | Nightly (gated) | ❌ Advisory | ~$0.02 | Prompt version drift, model behavior changes |
| E2E tests | Weekly + pre-deploy | ⚠️ Manual | Free | Full user journey regressions |

The key insight: these layers are not redundant. Unit tests catch code bugs. Contract tests catch other people’s code bugs. Evaluations catch behavior bugs that only emerge when real data flows through the whole system.

Lessons Learned Building This

1. Test the Production Path, Not the LLM

Our biggest improvement came from switching eval:banner from calling the LLM SDK directly to calling the production API endpoint. The SDK tests were passing while production was broken — because the badge calculator had a bug that only manifested with the real API’s data transformation.

2. Fixtures > Generated Data

We tried generating test scenarios with another LLM. They were plausible but missed edge cases that matter in production: stale morning reports left over from the night before, highway data where “none” can mean either “no restriction” or “no data”, and Summit reports containing raw HTML links that bleed into the LLM’s output.

Every fixture we have was inspired by a real production scenario.

3. Binary Assertions > Numeric Scoring

We experimented with AI judge scoring (1-10 quality ratings). The scores were noisy and hard to action. Did a drop from 8.2 to 7.8 mean something broke, or just that the LLM rephrased slightly? Binary assertions are boring, but when one turns red, you know exactly what to fix.

4. Reports Should Be Diffable

Storing evaluation reports as timestamped markdown files in the repo was initially a lazy choice. It turned out to be the most useful tool for understanding drift. Running `git diff evaluation-reports/banner-api-eval-2026-01-21*.md evaluation-reports/banner-api-eval-2026-02-07*.md` shows exactly which fixture changed behavior and how.

5. Cost Tracking Matters

At ~$0.06 per banner eval run, costs are trivial. But we built cost tracking and budget circuit breakers into the evaluation infrastructure early. When we added the cribbage evaluator (which runs multiple models per fixture), per-run costs climbed to ~$0.30. Circuit breakers prevent runaway evaluation loops from draining API budgets.
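A budget circuit breaker does not need to be elaborate. The sketch below shows the idea under assumptions: the class name, the per-call cost estimate, and the "trip once, stay tripped" behavior are illustrative choices, not a description of the actual implementation.

```typescript
// Tracks estimated spend for one evaluation run and refuses further
// calls once the budget would be exceeded.
class BudgetCircuitBreaker {
  private spentUsd = 0;
  private tripped = false;

  constructor(private readonly limitUsd: number) {}

  // Returns true and reserves the cost if the call may proceed.
  // Once the breaker trips it stays open for the rest of the run.
  tryCharge(estimatedCostUsd: number): boolean {
    if (this.tripped) return false;
    if (this.spentUsd + estimatedCostUsd > this.limitUsd) {
      this.tripped = true;
      return false;
    }
    this.spentUsd += estimatedCostUsd;
    return true;
  }

  get isOpen(): boolean {
    return this.tripped;
  }

  get totalSpentUsd(): number {
    return this.spentUsd;
  }
}
```

In a runner loop, the check goes before each model call: if `tryCharge` returns false, stop iterating and write a partial report. That way a retry bug or an accidentally nested loop costs at most the configured limit.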

What’s Next

Cosmos DB persistence — We’ve built the infrastructure for storing evaluation results in Cosmos DB with drift detection queries (query:drift --modelId=gpt-5-mini --days=7). Next step is wiring this into automated alerting.
Cross-model evaluation — Running the same 31 banner fixtures against different models (GPT-5 vs Gemini Flash vs GPT-4o-mini) to quantify cost/quality tradeoffs.
Fixture generation from production logs — Mining real production requests to generate new fixtures covering scenarios we haven’t thought of yet.

Try It

The evaluation framework is open source in our portfolio monorepo:

Fixtures: apps/evaluation/llm-evaluations/src/fixtures/
Evaluation runner: apps/evaluation/llm-evaluations/src/evaluations/
Contract tests: apps/testing/contract/api-contracts/
Reports: evaluation-reports/
Golden prompts: apps/evaluation/llm-evaluations/src/golden-prompts/

The Summit AI dashboard shows the banner this system validates.

This is Part 2 of a series on testing AI applications. Part 1 covers the two-tier testing strategy for LLM function calling. This post covers the evaluation infrastructure that makes that strategy practical.