From Strategy to Infrastructure
In a previous post, I covered the conceptual two-tier strategy for testing LLM applications: deterministic validation (Tier 1) blocking CI/CD, and AI judge quality assessment (Tier 2) running advisory. That post answered “what should you test?” — this one answers “what does the evaluation infrastructure actually look like?”
This is the story of building a fixture-driven evaluation framework that tests AI-generated ski resort banners across 31 scenarios, validates 4 external API contracts nightly, and produces markdown reports I can diff over time. No watch mode. No visual regression. Just fixtures, assertions, and reports.
The Problem: Banner Generation Is Deceptively Complex
Our Summit AI dashboard displays a dynamic banner at the top of the page — a short message with contextual badges (powder alert, highway chains, peak day warning, crowd level, etc.). The banner is generated by an LLM, but the badge selection is deterministic: a badge calculator service examines real data from 5 sources and decides which badges to attach.
The challenge: you can’t unit test this in isolation. The banner quality depends on:
- Which data sources are available (some go down seasonally)
- The specific combination of weather + schedule + highway + crowd data
- The LLM’s ability to summarize conditions into ~200 characters
- Badge mutual exclusion rules (e.g., peakDay and crowd can never coexist)
- Safety-first ordering (highway restrictions must lead)
This needs scenario-level testing — complete, realistic data snapshots that exercise the entire production code path.
Fixture Anatomy: A Complete World Snapshot
Each fixture is a self-contained JSON file representing one ski day scenario. Here’s a simplified view of the powder day fixture:
{
  "id": "powder-day",
  "name": "Powder Day - Fresh Snow, All Areas Open",
  "description": "Epic powder day: 10\" overnight, clear skies, full operations",
  "expectedBadges": [{ "type": "powder", "tier": "legendary" }],
  "dataPrep": {
    "sources": {
      "summitReport": {
        "date": "2026-01-15",
        "reportText": "POWDER ALERT! 10 inches of fresh snow overnight. All lifts spinning...",
        "hasReport": true
      },
      "summitWeather": {
        "temperature": 28,
        "conditions": "Partly Cloudy",
        "snowfall24h": 10,
        "snowfall48h": 14,
        "baseDepth": 72
      },
      "summitSchedule": {
        "liftsOpen": 14,
        "liftsTotal": 14,
        "trailsOpen": 65,
        "trailsTotal": 65,
        "areas": [
          { "name": "Summit West", "status": "Open", "hours": "9:00 AM - 9:30 PM" },
          { "name": "Alpental", "status": "Open", "hours": "9:00 AM - 4:00 PM" }
        ]
      },
      "noaaForecast": {
        "daily": [{ "date": "2026-01-15", "conditions": "Partly Cloudy", "tempHigh": 32, "snowfall": "0" }]
      }
    }
  },
  "wsdotCondition": {
    "passName": "Snoqualmie Pass",
    "roadCondition": "Wet",
    "restriction": { "type": "none", "details": "" }
  },
  "peakDayResult": {
    "isPeakDay": false,
    "crowdLevel": "moderate"
  },
  "highlightExpectations": {
    "requiredCategories": ["conditions", "operations"],
    "requiredKeywords": ["10", "open"],
    "minHighlights": 3,
    "maxHighlights": 4
  }
}
This fixture encodes everything: the mountain report text the LLM will see, the weather data that drives badge calculation, the schedule data, highway conditions, crowd predictions, and — critically — what we expect the output to contain.
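The runner code refers to a BannerFixture type that this post never shows. As a rough sketch, the JSON above could map onto something like the following. The optionality of each field and the validateFixture helper are assumptions reconstructed from the simplified example, not the repository's actual definitions.

```typescript
// Hypothetical types inferred from the simplified powder-day fixture.
interface ExpectedBadge {
  type: string;
  tier?: string;
}

interface HighlightExpectations {
  requiredCategories: string[];
  requiredKeywords: string[];
  minHighlights: number;
  maxHighlights: number;
}

interface BannerFixture {
  id: string;
  name: string;
  description: string;
  expectedBadges: ExpectedBadge[];
  // Sources are kept loose here: fixtures deliberately omit or corrupt
  // individual sources to simulate outages.
  dataPrep: { sources: Record<string, unknown> };
  wsdotCondition?: unknown;
  peakDayResult?: { isPeakDay: boolean; crowdLevel: string };
  highlightExpectations?: HighlightExpectations;
}

// Returns a list of problems; an empty list means the fixture looks usable.
function validateFixture(raw: unknown): string[] {
  if (typeof raw !== 'object' || raw === null) {
    return ['fixture is not an object'];
  }
  const problems: string[] = [];
  const f = raw as Partial<BannerFixture>;
  if (typeof f.id !== 'string' || f.id.length === 0) {
    problems.push('missing id');
  }
  if (!Array.isArray(f.expectedBadges)) {
    problems.push('expectedBadges must be an array');
  }
  const sources = f.dataPrep?.sources;
  if (typeof sources !== 'object' || sources === null) {
    problems.push('dataPrep.sources must be an object');
  }
  const h = f.highlightExpectations;
  if (h && h.minHighlights > h.maxHighlights) {
    problems.push('minHighlights exceeds maxHighlights');
  }
  return problems;
}
```

Validating fixtures at load time like this would separate "the fixture file is malformed" from "the banner is wrong", which are very different failures to debug.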
We have 31 of these. They cover:
| Category | Fixtures | Examples |
|---|---|---|
| Core scenarios | 6 | Powder day, holiday weekend, chain control, poor conditions, midweek, all-sources-fail |
| Data source coverage | 7 | Missing report, missing weather, missing forecast, highway-only, combined failures |
| Badge edge cases | 4 | Legendary powder, multiple highways, improving visibility, Saturday peak |
| Highway parsing | 2 | Traction advised, clear roads |
| Seasonal edge cases | 4 | Early season, spring conditions, wind hold, incoming storm |
| Time-of-day | 4 | Late night closed, early morning planning, evening session, stale report |
| Robustness | 4 | Peak/crowd conflict, malformed weather, zero snowfall, link syntax bleed |
The Evaluation Runner
When you run nx run llm-evaluations:eval:banner, the runner does this:
- Loads all 31 fixtures from JSON files
- Calls the actual production API at localhost:3001/api/summit/banner?fixture=<id>
- Validates each response against the fixture’s expectations
- Generates a markdown report saved to evaluation-reports/
The key design decision: we test through the real API, not the LLM directly. This ensures the deterministic badge calculator, the data transformation layer, and the LLM prompt all get exercised together. If we mocked the badge calculator, we’d miss the most common class of bugs.
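async function callBannerAPI(fixture: BannerFixture, apiUrl: string) {
  const response = await fetch(`${apiUrl}/api/summit/banner?fixture=${encodeURIComponent(fixture.id)}`);
  // Fail fast on HTTP errors so a 500 is not misread as a banner with no badges
  if (!response.ok) {
    throw new Error(`Banner API returned ${response.status} for fixture ${fixture.id}`);
  }
  const data = await response.json();
  return { message: data.message, badges: data.badges };
}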
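The four runner steps above can be sketched as a single loop. This is a minimal illustration, not the actual runner: runBannerEval, FixtureOutcome, and BannerCaller are hypothetical names, and the API caller is passed in as a parameter so the loop can be exercised with a fake instead of a live server.

```typescript
interface FixtureLike {
  id: string;
  expectedBadges: { type: string }[];
}

interface BannerResponse {
  message: string;
  badges: { type: string }[];
}

interface FixtureOutcome {
  fixtureId: string;
  passed: boolean;
  missingBadges: string[];
  durationMs: number;
  error?: string;
}

// Hypothetical seam: in production this would wrap the real HTTP call.
type BannerCaller = (fixture: FixtureLike) => Promise<BannerResponse>;

async function runBannerEval(
  fixtures: FixtureLike[],
  callApi: BannerCaller,
): Promise<FixtureOutcome[]> {
  const outcomes: FixtureOutcome[] = [];
  // Sequential in this sketch, so per-fixture timings are not skewed by concurrency.
  for (const fixture of fixtures) {
    const started = Date.now();
    try {
      const response = await callApi(fixture);
      const actual = response.badges.map((b) => b.type);
      const missingBadges = fixture.expectedBadges
        .map((b) => b.type)
        .filter((t) => !actual.includes(t));
      outcomes.push({
        fixtureId: fixture.id,
        passed: missingBadges.length === 0,
        missingBadges,
        durationMs: Date.now() - started,
      });
    } catch (err) {
      // A failed request counts as a failed fixture rather than aborting the run.
      outcomes.push({
        fixtureId: fixture.id,
        passed: false,
        missingBadges: [],
        durationMs: Date.now() - started,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return outcomes;
}
```

Catching errors per fixture means one broken scenario still yields a complete report for the rest.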
Assertion-Based Scoring (Not Numeric)
We don’t score responses 1-10. We use binary assertions:
// Check 1: All expected badges must be present (subset check)
const missingBadges = expectedBadgeTypes.filter((expected) => !actualBadgeTypes.includes(expected));
const hasAllExpected = missingBadges.length === 0;
// Check 2: Mutual exclusion — peakDay and crowd NEVER coexist
const mutualExclusionViolation = actualBadgeTypes.includes('peakDay') && actualBadgeTypes.includes('crowd');
// Check 3: No link syntax bleeding through from scraped data
const linkPatterns = [/\[link\]/i, /http[s]?:\/\//, /summitatsnoqualmie\.com/i];
const hasLinkBleed = linkPatterns.some((p) => p.test(result.message));
This is intentional. Numeric scoring introduces subjectivity and drift. Binary assertions tell you exactly what broke and why.
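One way to bundle the three checks is a function that returns a list of named failures, so a red result says which rule broke. This is a sketch under assumptions: evaluateBanner and the AssertionFailure shape are hypothetical, while the rules and link patterns are the ones shown above.

```typescript
interface BannerResult {
  message: string;
  badges: { type: string }[];
}

interface AssertionFailure {
  check: 'expected-badges' | 'mutual-exclusion' | 'link-bleed';
  detail: string;
}

// Same patterns as above; no global flag, so .test() carries no state between calls.
const LINK_PATTERNS = [/\[link\]/i, /https?:\/\//, /summitatsnoqualmie\.com/i];

function evaluateBanner(
  expectedBadgeTypes: string[],
  result: BannerResult,
): AssertionFailure[] {
  const failures: AssertionFailure[] = [];
  const actual = result.badges.map((b) => b.type);

  // Check 1: every expected badge is present (extra badges are allowed).
  const missing = expectedBadgeTypes.filter((t) => !actual.includes(t));
  if (missing.length > 0) {
    failures.push({ check: 'expected-badges', detail: `missing: ${missing.join(', ')}` });
  }

  // Check 2: peakDay and crowd never coexist.
  if (actual.includes('peakDay') && actual.includes('crowd')) {
    failures.push({ check: 'mutual-exclusion', detail: 'peakDay and crowd both present' });
  }

  // Check 3: no link syntax from scraped data in the message.
  const bleed = LINK_PATTERNS.find((p) => p.test(result.message));
  if (bleed) {
    failures.push({ check: 'link-bleed', detail: `matched ${bleed.source}` });
  }

  return failures;
}
```

An empty array means the fixture passed; anything else is a list of what to fix.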
Data Transparency Analysis
Each badge carries provenance metadata — where the data came from, with a URL and preview:
{
  "type": "powder",
  "tier": "heavy",
  "snowfall": 10,
  "provenance": {
    "reason": "10\" of heavy powder fell in the last 24 hours",
    "source": "Summit Weather API",
    "url": "https://summitatsnoqualmie.com/mountain-report",
    "dataPreview": {
      "label": "Recent Snowfall",
      "rows": [
        { "key": "Last 24 hours", "value": "10\"" },
        { "key": "Last 48 hours", "value": "14\"" }
      ]
    }
  }
}
The evaluation runner tracks transparency metrics across all fixtures: what percentage of badges include source URLs, what percentage include data previews. This catches regressions where prompt changes accidentally strip provenance.
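The two percentages could be computed roughly as follows. This is a sketch: transparencyMetrics is a hypothetical name, and treating an empty rows array as "no preview" and an empty run as 0% are my assumptions.

```typescript
interface Provenance {
  reason?: string;
  source?: string;
  url?: string;
  dataPreview?: { label: string; rows: { key: string; value: string }[] };
}

interface BadgeWithProvenance {
  type: string;
  provenance?: Provenance;
}

interface TransparencyMetrics {
  totalBadges: number;
  withUrlPct: number;
  withPreviewPct: number;
}

function transparencyMetrics(
  badgesPerFixture: BadgeWithProvenance[][],
): TransparencyMetrics {
  let total = 0;
  let withUrl = 0;
  let withPreview = 0;
  for (const badges of badgesPerFixture) {
    for (const badge of badges) {
      total += 1;
      if (badge.provenance?.url) withUrl += 1;
      // Assumption: a preview with zero rows does not count as a preview.
      const rows = badge.provenance?.dataPreview?.rows ?? [];
      if (rows.length > 0) withPreview += 1;
    }
  }
  // Assumption: report 0% rather than NaN when no badges were produced.
  const pct = (n: number) => (total === 0 ? 0 : Math.round((n / total) * 100));
  return { totalBadges: total, withUrlPct: pct(withUrl), withPreviewPct: pct(withPreview) };
}
```

Tracking these as run-level numbers makes a provenance regression visible even when every individual badge assertion still passes.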
Golden Prompt Versioning
The system prompt that generates banners has gone through three major versions, each stored as a golden prompt file:
- v1-baseline — Free-form generation, high variability
- v2-structured — JSON output with badge array, more consistent
- v3-badge-focused — Decision tree format, deterministic badge logic moved out of LLM
The v3 prompt is the current production version. It uses a decision tree that the LLM follows:
Q1: Is there highway restriction? → ADD highway badge
Q2: Is there fresh snow? → ADD powder badge (light/fresh/heavy/legendary tiers)
Q3: Is it a peak day? → ADD peakDay badge
Q4: Are crowds expected? → ADD crowd badge
Q5: Is visibility poor? → ADD visibility badge
By moving badge selection to deterministic code and keeping only the message generation in the LLM, we reduced evaluation failures from ~15% (v1) to 0% (v3) across all 31 fixtures.
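Two rules that the deterministic side has to enforce are mutual exclusion and safety-first ordering. The sketch below shows one way to apply them to a list of candidate badges. normalizeBadges is a hypothetical helper, and letting peakDay win over crowd is my assumption; the rule as stated only says the two never coexist.

```typescript
interface Badge {
  type: string;
  tier?: string;
}

function normalizeBadges(candidates: Badge[]): Badge[] {
  // Mutual exclusion: peakDay and crowd never coexist.
  // Assumption: peakDay takes precedence and the crowd badge is dropped.
  const hasPeakDay = candidates.some((b) => b.type === 'peakDay');
  const filtered = hasPeakDay
    ? candidates.filter((b) => b.type !== 'crowd')
    : candidates.slice();

  // Safety-first ordering: highway badges lead; everything else keeps
  // its original relative order.
  const highway = filtered.filter((b) => b.type === 'highway');
  const rest = filtered.filter((b) => b.type !== 'highway');
  return highway.concat(rest);
}
```

Because this is plain code, it can be covered by ordinary unit tests, leaving the fixtures to check that the rules survive the full production path.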
Contract Tests: Catching External API Breakage
Fixtures test our code. Contract tests test everyone else’s code.
We validate 4 external APIs nightly:
describe('API Contract Tests - OpenAI', () => {
  it('validates GPT-5 integration (Responses API)', async () => {
    const response = await gpt5Service.generateText('What is 2+2?', {
      maxTokens: 200,
    });
    expect(response).toHaveProperty('content');
    expect(response.model).toContain('gpt-5');
    expect(response.usage?.promptTokens).toBeGreaterThan(0);
  });

  it('validates model availability', async () => {
    const models = await openai.models.list();
    const availableIds = models.data.map((m) => m.id);
    expect(availableIds).toContain('gpt-5');
    expect(availableIds).toContain('gpt-4o-mini');
  });
});
Each external API gets its own contract test suite:
| API | What We Validate | Cost/Run |
|---|---|---|
| OpenAI (GPT-5, GPT-4o-mini) | Response structure, model availability, pricing, API routing (Responses vs Chat Completions) | ~$0.005 |
| Google Gemini | Flash integration, service configuration, response structure | Free |
| Scryfall | Card data schema (name, mana_cost, prices), search endpoint structure | Free |
| Summit at Snoqualmie | Schedule integration (both our library AND the raw upstream API), area status values, 30-day date range | Free |
The Summit tests are particularly interesting — we test both our transformation library (does our code work?) and the raw upstream API (has Summit changed their data format?):
// Test OUR library
const schedule = await fetchSummitSkiSchedule();
expect(schedule[firstDate]).toHaveProperty('status');
expect(['Open', 'Closed', 'TBD']).toContain(firstArea.status);
// Test Summit's raw API directly
const response = await fetch(SUMMIT_UPSTREAM_URL);
const data = await response.json();
expect(data[0].data[0].data).toHaveProperty('status');
expect(['OPEN', 'CLOSED', 'TBD']).toContain(data[0].data[0].data.status);
This dual-layer approach means we know whether a bug is in Summit’s API or in our transformation code.
Total contract test cost: ~$0.01 per run. Cheap insurance.
Evaluation Reports: Diffable Over Time
Every evaluation run produces a timestamped markdown report:
evaluation-reports/
├── banner-api-eval-2026-01-10T07-09-39.md
├── banner-api-eval-2026-01-13T05-15-51.md
├── banner-api-eval-2026-01-21T20-55-57.md
├── banner-api-eval-2026-02-01T21-57-47.md
├── banner-api-eval-2026-02-07T05-31-14.md ← Latest: 31/31, 100%, 22.6s
└── ... (60+ reports accumulated)
Each report contains per-fixture results: badge accuracy (pass/fail), mutual exclusion validation, link syntax check, the actual badges generated, the LLM message, and timing data. Because they’re markdown, I can git diff between runs to see exactly what changed.
The latest run shows:
- 31 fixtures tested, 100% success rate
- 22.6s total (~730ms per fixture average)
- Cost: ~$0.06 per full run
This is the heartbeat of the evaluation system. I run it after prompt changes, after model updates, and before releases.
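A report writer that produces diff-friendly output could look roughly like this. It is a sketch, not the actual report format: reportFileName, renderReport, and the column layout are hypothetical, and I am assuming the timestamps in the file names are UTC.

```typescript
interface ReportRow {
  fixtureId: string;
  passed: boolean;
  badges: string[];
  durationMs: number;
}

function reportFileName(runAt: Date): string {
  // 2026-02-07T05:31:14.000Z -> 2026-02-07T05-31-14 (assumes UTC timestamps)
  const stamp = runAt.toISOString().slice(0, 19).replace(/:/g, '-');
  return `banner-api-eval-${stamp}.md`;
}

function renderReport(rows: ReportRow[]): string {
  const passed = rows.filter((r) => r.passed).length;
  const totalMs = rows.reduce((sum, r) => sum + r.durationMs, 0);
  const lines = [
    '# Banner API Evaluation',
    '',
    `Passed: ${passed}/${rows.length}`,
    `Total time: ${(totalMs / 1000).toFixed(1)}s`,
    '',
    '| Fixture | Result | Badges | Time (ms) |',
    '|---|---|---|---|',
  ];
  // Sort by fixture id so two reports line up row for row under git diff.
  const sorted = rows.slice().sort((a, b) => a.fixtureId.localeCompare(b.fixtureId));
  for (const r of sorted) {
    const badges = r.badges.join(', ') || 'none';
    lines.push(`| ${r.fixtureId} | ${r.passed ? 'PASS' : 'FAIL'} | ${badges} | ${r.durationMs} |`);
  }
  return lines.join('\n') + '\n';
}
```

A stable row order matters more than it looks: without it, every diff is dominated by reordering noise instead of behavior changes.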
Multiple Evaluation Modules
The banner evaluator is the most mature, but the framework supports multiple domains:
| Module | Command | What It Tests |
|---|---|---|
| eval:banner | nx run llm-evaluations:eval:banner | Banner generation across 31 fixtures |
| eval:cribbage | nx run llm-evaluations:eval:cribbage | Multi-model cribbage strategy comparison |
| eval:cribbage-production | nx run llm-evaluations:eval:cribbage-production | Production discard quality validation |
| eval:function-calling | nx run llm-evaluations:eval:function-calling | Tool selection model comparison |
| eval:daily-briefing | nx run llm-evaluations:eval:daily-briefing | Ski conditions narrative quality |
Each module follows the same pattern: load fixtures → call production API → validate assertions → generate report. The cribbage evaluator adds expert-validated constraint checking (specific cards that must/must not be discarded), while function-calling adds multi-model comparison across different providers.
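The cribbage constraint check can be illustrated with a small helper. This is only a sketch of the idea: checkDiscardConstraints, the DiscardConstraints shape, and the card notation (such as '5H' for the five of hearts) are all hypothetical.

```typescript
interface DiscardConstraints {
  mustDiscard?: string[];
  mustNotDiscard?: string[];
}

// Returns human-readable violations; an empty list means the discard is acceptable.
function checkDiscardConstraints(
  discards: string[],
  constraints: DiscardConstraints,
): string[] {
  const violations: string[] = [];
  for (const card of constraints.mustDiscard ?? []) {
    if (!discards.includes(card)) {
      violations.push(`expected ${card} to be discarded`);
    }
  }
  for (const card of constraints.mustNotDiscard ?? []) {
    if (discards.includes(card)) {
      violations.push(`${card} must not be discarded`);
    }
  }
  return violations;
}
```

This keeps the same binary-assertion style as the banner checks: the expert encodes what is clearly wrong, and any other discard passes.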
The Full Testing Ecosystem
Here’s how all the layers work together in practice:
| Layer | Runs When | Blocks? | Cost | What It Catches |
|---|---|---|---|---|
| Unit tests | Every commit (pre-push hook) | ✅ Yes | Free | Logic bugs, type errors, regressions |
| Contract tests | Nightly CI + on-demand | ⚠️ CI only | ~$0.01 | External API changes, model deprecations, pricing changes |
| Banner evals (31 fixtures) | After prompt changes, before releases | ❌ Advisory | ~$0.06 | Badge miscalculation, prompt drift, link bleed, mutual exclusion violations |
| Golden prompts | Nightly (gated) | ❌ Advisory | ~$0.02 | Prompt version drift, model behavior changes |
| E2E tests | Weekly + pre-deploy | ⚠️ Manual | Free | Full user journey regressions |
The key insight: these layers are not redundant. Unit tests catch code bugs. Contract tests catch other people’s code bugs. Evaluations catch behavior bugs that only emerge when real data flows through the whole system.
Lessons Learned Building This
1. Test the Production Path, Not the LLM
Our biggest improvement came from switching eval:banner from calling the LLM SDK directly to calling the production API endpoint. The SDK tests were passing while production was broken — because the badge calculator had a bug that only manifested with the real API’s data transformation.
2. Fixtures > Generated Data
We tried generating test scenarios with another LLM. They were plausible but missed the edge cases that matter in production: morning reports that are still stale from the night before, highway data where “none” can mean either “no restriction” or “no data”, and Summit reports containing raw HTML links that bleed into the LLM’s output.
Every fixture we have was inspired by a real production scenario.
3. Binary Assertions > Numeric Scoring
We experimented with AI judge scoring (1-10 quality ratings). The scores were noisy and hard to act on. Did a drop from 8.2 to 7.8 mean something broke, or just that the LLM rephrased slightly? Binary assertions are boring, but when one turns red, you know exactly what to fix.
4. Reports Should Be Diffable
Storing evaluation reports as timestamped markdown files in the repo was initially a lazy choice. It turned out to be the most useful tool for understanding drift: git diff evaluation-reports/banner-api-eval-2026-01-21*.md evaluation-reports/banner-api-eval-2026-02-07*.md shows exactly which fixture changed behavior and how.
5. Cost Tracking Matters
At ~$0.06 per banner eval run, costs are trivial. But we built cost tracking and budget circuit breakers into the evaluation infrastructure early. When we added the cribbage evaluator (which runs multiple models per fixture), per-run costs climbed to ~$0.30. Circuit breakers prevent runaway evaluation loops from draining API budgets.
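A budget circuit breaker can be very small. The class below is a minimal sketch of the idea, not the implementation in the framework: BudgetCircuitBreaker, tryCharge, and the dollar figures in the usage note are hypothetical.

```typescript
class BudgetCircuitBreaker {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Records the charge and returns true if it fits in the remaining budget.
  // Returns false and records nothing if it would exceed the limit.
  tryCharge(costUsd: number): boolean {
    if (this.spentUsd + costUsd > this.limitUsd) {
      return false;
    }
    this.spentUsd += costUsd;
    return true;
  }

  get spent(): number {
    return this.spentUsd;
  }

  get remaining(): number {
    return this.limitUsd - this.spentUsd;
  }
}
```

A runner would call tryCharge with an estimated cost before each paid request and stop the loop on the first false, for example with a $1.00 limit per run, so a retry bug ends the run instead of the budget.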
What’s Next
- Cosmos DB persistence: We’ve built the infrastructure for storing evaluation results in Cosmos DB with drift detection queries (query:drift --modelId=gpt-5-mini --days=7). The next step is wiring this into automated alerting.
- Cross-model evaluation: Running the same 31 banner fixtures against different models (GPT-5 vs Gemini Flash vs GPT-4o-mini) to quantify cost/quality tradeoffs.
- Fixture generation from production logs: Mining real production requests to generate new fixtures covering scenarios we haven’t thought of yet.
Try It
The evaluation framework is open source in our portfolio monorepo:
- Fixtures: apps/evaluation/llm-evaluations/src/fixtures/
- Evaluation runner: apps/evaluation/llm-evaluations/src/evaluations/
- Contract tests: apps/testing/contract/api-contracts/
- Reports: evaluation-reports/
- Golden prompts: apps/evaluation/llm-evaluations/src/golden-prompts/
The Summit AI dashboard shows the banner this system validates.
This is Part 2 of a series on testing AI applications. Part 1 covers the two-tier testing strategy for LLM function calling. This post covers the evaluation infrastructure that makes that strategy practical.