Portfolio Blog - Nick Maassel

AI Dev Sessions From Anywhere: Copilot CLI on iPhone and iPad via Tailscale

Nick Maassel — Wed, 25 Feb 2026 00:00:00 GMT

## The Setup

I've been building out a homelab K3s cluster: a Beelink Mini PC as the control plane, an Ubuntu tower as an agent, and a Razer Blade for GPU workloads. Everything is connected over [Tailscale](https://tailscale.com), which gives every device a stable private IP regardless of network.

Recently I wanted to be able to run [GitHub Copilot CLI](https://github.com/github/copilot) sessions without being at my desk. It turns out the combination of Tailscale and a good SSH client makes this completely viable from a phone.

## What You Need

- A Linux server (homelab, VPS, Raspberry Pi, anything you can SSH into)
- [GitHub Copilot CLI](https://github.com/github/copilot) installed on it (`sudo npm install -g @github/copilot`)
- [Tailscale](https://tailscale.com) on both the server and your phone
- An SSH client app: [Termius](https://termius.com) (iOS/Android) or [Blink Shell](https://blink.sh) (iOS)

## Why Tailscale

Without Tailscale you'd need to expose SSH to the public internet or set up and maintain your own VPN. Tailscale avoids both: it creates a private mesh network where your phone and your server get stable `100.x.x.x` IPs that just work, even when switching between WiFi and cellular. No port forwarding, no dynamic DNS.

Install it on your server:

```bash
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
```

Install the Tailscale app on your phone and sign in with the same account. Both devices appear in your Tailscale admin panel automatically.

## Installing Copilot CLI on the Server

```bash
# Node.js 22+ required
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs

# Install Copilot CLI
sudo npm install -g @github/copilot

# Authenticate (follow the device flow)
copilot auth login
```

## The SSH Client

**On iPad**: [Blink Shell](https://blink.sh) is the gold standard. It's a real terminal emulator with Mosh support (great for flaky connections) and feels native. Worth the subscription.

**On iPhone**: [Termius](https://termius.com) works better on smaller screens. The free tier is sufficient. It adds a toolbar with Ctrl, Tab, Esc, and arrow keys above the keyboard, which makes terminal work actually usable on mobile.

Both apps support Tailscale IPs directly: just add a new host with your server's `100.x.x.x` address.

## The Workflow

Once connected:

```bash
cd ~/homelab
copilot
```

You're in a full Copilot CLI session. You can ask questions about your codebase, run commands, edit files, deploy to K3s: everything you'd do at a desk, from your couch or on the go.

For quick sessions on iPhone, Termius handles it well. For longer work sessions where you want to actually write code, the iPad with a keyboard is surprisingly capable.

## Automating the Setup

I added Copilot CLI installation to my [Ansible](https://www.ansible.com) dev-workstation playbook so every new machine in the homelab gets it automatically:

```yaml
- name: Install GitHub Copilot CLI
  tags: copilot-cli
  community.general.npm:
    name: '@github/copilot'
    global: true
    state: present
  become: true
```

Run it against any host with:

```bash
ansible-playbook playbooks/dev-workstation.yml -l <host> --tags copilot-cli
```

## Why This Matters

The barrier to picking up a dev task used to be "am I at my desk?" Now it's just "do I have my phone?" For quick infrastructure fixes, reviewing a spec, or iterating on a K3s manifest, a Copilot CLI session from my iPhone is fast enough to be genuinely useful.

Building a Fixture-Driven LLM Evaluation Framework

Nick Maassel — Tue, 10 Feb 2026 00:00:00 GMT

## From Strategy to Infrastructure

In a [previous post](/blog/testing-production-ai-apps), I covered the _conceptual_ two-tier strategy for testing LLM applications: deterministic validation (Tier 1) blocking CI/CD, and AI judge quality assessment (Tier 2) running advisory. That post answered "what should you test?"; this one answers **"what does the evaluation infrastructure actually look like?"**

This is the story of building a fixture-driven evaluation framework that tests AI-generated ski resort banners across 31 scenarios, validates 4 external API contracts nightly, and produces markdown reports I can diff over time. No watch mode. No visual regression. Just fixtures, assertions, and reports.

## The Problem: Banner Generation Is Deceptively Complex

Our [Summit AI dashboard](https://summit.maassel.dev) displays a dynamic banner at the top of the page: a short message with contextual badges (powder alert, highway chains, peak day warning, crowd level, etc.). The banner is generated by an LLM, but the badge _selection_ is deterministic: a badge calculator service examines real data from 5 sources and decides which badges to attach.

The challenge: **you can't unit test this in isolation.** The banner quality depends on:

1. Which data sources are available (some go down seasonally)
2. The specific combination of weather + schedule + highway + crowd data
3. The LLM's ability to summarize conditions into ~200 characters
4. Badge mutual exclusion rules (e.g., `peakDay` and `crowd` can never coexist)
5. Safety-first ordering (highway restrictions must lead)

This needs **scenario-level testing**: complete, realistic data snapshots that exercise the entire production code path.

## Fixture Anatomy: A Complete World Snapshot

Each fixture is a self-contained JSON file representing one ski day scenario.
Here's a simplified view of the powder day fixture:

```json
{
  "id": "powder-day",
  "name": "Powder Day - Fresh Snow, All Areas Open",
  "description": "Epic powder day: 10\" overnight, clear skies, full operations",
  "expectedBadges": [{ "type": "powder", "tier": "legendary" }],
  "dataPrep": {
    "sources": {
      "summitReport": {
        "date": "2026-01-15",
        "reportText": "POWDER ALERT! 10 inches of fresh snow overnight. All lifts spinning...",
        "hasReport": true
      },
      "summitWeather": {
        "temperature": 28,
        "conditions": "Partly Cloudy",
        "snowfall24h": 10,
        "snowfall48h": 14,
        "baseDepth": 72
      },
      "summitSchedule": {
        "liftsOpen": 14,
        "liftsTotal": 14,
        "trailsOpen": 65,
        "trailsTotal": 65,
        "areas": [
          { "name": "Summit West", "status": "Open", "hours": "9:00 AM - 9:30 PM" },
          { "name": "Alpental", "status": "Open", "hours": "9:00 AM - 4:00 PM" }
        ]
      },
      "noaaForecast": {
        "daily": [{ "date": "2026-01-15", "conditions": "Partly Cloudy", "tempHigh": 32, "snowfall": "0" }]
      }
    }
  },
  "wsdotCondition": {
    "passName": "Snoqualmie Pass",
    "roadCondition": "Wet",
    "restriction": { "type": "none", "details": "" }
  },
  "peakDayResult": { "isPeakDay": false, "crowdLevel": "moderate" },
  "highlightExpectations": {
    "requiredCategories": ["conditions", "operations"],
    "requiredKeywords": ["10", "open"],
    "minHighlights": 3,
    "maxHighlights": 4
  }
}
```

This fixture encodes **everything**: the mountain report text the LLM will see, the weather data that drives badge calculation, the schedule data, highway conditions, crowd predictions, and (critically) **what we expect the output to contain**.

We have 31 of these.
They cover:

| Category | Fixtures | Examples |
| --- | --- | --- |
| **Core scenarios** | 6 | Powder day, holiday weekend, chain control, poor conditions, midweek, all-sources-fail |
| **Data source coverage** | 7 | Missing report, missing weather, missing forecast, highway-only, combined failures |
| **Badge edge cases** | 4 | Legendary powder, multiple highways, improving visibility, Saturday peak |
| **Highway parsing** | 2 | Traction advised, clear roads |
| **Seasonal edge cases** | 4 | Early season, spring conditions, wind hold, incoming storm |
| **Time-of-day** | 4 | Late night closed, early morning planning, evening session, stale report |
| **Robustness** | 4 | Peak/crowd conflict, malformed weather, zero snowfall, link syntax bleed |

## The Evaluation Runner

When you run `nx run llm-evaluations:eval:banner`, the runner does this:

1. **Loads all 31 fixtures** from JSON files
2. **Calls the actual production API** at `localhost:3001/api/summit/banner?fixture=`
3. **Validates each response** against the fixture's expectations
4. **Generates a markdown report** saved to `evaluation-reports/`

The key design decision: **we test through the real API, not the LLM directly.** This ensures the deterministic badge calculator, the data transformation layer, _and_ the LLM prompt all get exercised together. If we mocked the badge calculator, we'd miss the most common class of bugs.

```typescript
async function callBannerAPI(fixture: BannerFixture, apiUrl: string) {
  // The fixture id tells the API which canned data snapshot to load.
  const response = await fetch(`${apiUrl}/api/summit/banner?fixture=${fixture.id}`);
  if (!response.ok) {
    // Fail loudly so an HTTP error is not mistaken for an empty banner.
    throw new Error(`Banner API returned ${response.status} for fixture ${fixture.id}`);
  }
  const data = await response.json();
  return { message: data.message, badges: data.badges };
}
```

### Assertion-Based Scoring (Not Numeric)

We don't score responses 1-10.
We use **binary assertions**:

```typescript
// Check 1: All expected badges must be present (subset check)
const missingBadges = expectedBadgeTypes.filter((expected) => !actualBadgeTypes.includes(expected));
const hasAllExpected = missingBadges.length === 0;

// Check 2: Mutual exclusion — peakDay and crowd NEVER coexist
const mutualExclusionViolation = actualBadgeTypes.includes('peakDay') && actualBadgeTypes.includes('crowd');

// Check 3: No link syntax bleeding through from scraped data
const linkPatterns = [/\[link\]/i, /http[s]?:\/\//, /summitatsnoqualmie\.com/i];
const hasLinkBleed = linkPatterns.some((p) => p.test(result.message));
```

This is intentional. Numeric scoring introduces subjectivity and drift. Binary assertions tell you exactly what broke and why.

### Data Transparency Analysis

Each badge carries provenance metadata: where the data came from, with a URL and preview.

```json
{
  "type": "powder",
  "tier": "heavy",
  "snowfall": 10,
  "provenance": {
    "reason": "10\" of heavy powder fell in the last 24 hours",
    "source": "Summit Weather API",
    "url": "https://summitatsnoqualmie.com/mountain-report",
    "dataPreview": {
      "label": "Recent Snowfall",
      "rows": [
        { "key": "Last 24 hours", "value": "10\"" },
        { "key": "Last 48 hours", "value": "14\"" }
      ]
    }
  }
}
```

The evaluation runner tracks transparency metrics across all fixtures: what percentage of badges include source URLs, what percentage include data previews. This catches regressions where prompt changes accidentally strip provenance.

## Golden Prompt Versioning

The system prompt that generates banners has gone through three major versions, each stored as a golden prompt file:

- **v1-baseline**: Free-form generation, high variability
- **v2-structured**: JSON output with badge array, more consistent
- **v3-badge-focused**: Decision tree format, deterministic badge logic moved out of LLM

The v3 prompt is the current production version.
It uses a decision tree that the LLM follows:

```text
Q1: Is there highway restriction? → ADD highway badge
Q2: Is there fresh snow? → ADD powder badge (light/fresh/heavy/legendary tiers)
Q3: Is it a peak day? → ADD peakDay badge
Q4: Are crowds expected? → ADD crowd badge
Q5: Is visibility poor? → ADD visibility badge
```

By moving badge _selection_ to deterministic code and keeping only the _message generation_ in the LLM, we reduced evaluation failures from ~15% (v1) to 0% (v3) across all 31 fixtures.

## Contract Tests: Catching External API Breakage

Fixtures test our code. Contract tests test **everyone else's code**. We validate 4 external APIs nightly:

```typescript
describe('API Contract Tests - OpenAI', () => {
  it('validates GPT-5 integration (Responses API)', async () => {
    const response = await gpt5Service.generateText('What is 2+2?', {
      maxTokens: 200,
    });
    expect(response).toHaveProperty('content');
    expect(response.model).toContain('gpt-5');
    expect(response.usage?.promptTokens).toBeGreaterThan(0);
  });

  it('validates model availability', async () => {
    const models = await openai.models.list();
    const availableIds = models.data.map((m) => m.id);
    expect(availableIds).toContain('gpt-5');
    expect(availableIds).toContain('gpt-4o-mini');
  });
});
```

Each external API gets its own contract test suite:

| API | What We Validate | Cost/Run |
| --- | --- | --- |
| **OpenAI (GPT-5, GPT-4o-mini)** | Response structure, model availability, pricing, API routing (Responses vs Chat Completions) | ~$0.005 |
| **Google Gemini** | Flash integration, service configuration, response structure | Free |
| **Scryfall** | Card data schema (`name`, `mana_cost`, `prices`), search endpoint structure | Free |
| **Summit at Snoqualmie** | Schedule integration (both our library AND the raw upstream API), area status values, 30-day date range | Free |

The Summit tests are particularly interesting: we test **both** our transformation library (does our code work?) and the raw upstream API (has Summit changed their data format?):

```typescript
// Test OUR library
const schedule = await fetchSummitSkiSchedule();
expect(schedule[firstDate]).toHaveProperty('status');
expect(['Open', 'Closed', 'TBD']).toContain(firstArea.status);

// Test Summit's raw API directly
const response = await fetch(SUMMIT_UPSTREAM_URL);
const data = await response.json();
expect(data[0].data[0].data).toHaveProperty('status');
expect(['OPEN', 'CLOSED', 'TBD']).toContain(data[0].data[0].data.status);
```

This dual-layer approach means we know whether a bug is in Summit's API or in our transformation code.

**Total contract test cost: ~$0.01 per run.** Cheap insurance.

## Evaluation Reports: Diffable Over Time

Every evaluation run produces a timestamped markdown report:

```text
evaluation-reports/
├── banner-api-eval-2026-01-10T07-09-39.md
├── banner-api-eval-2026-01-13T05-15-51.md
├── banner-api-eval-2026-01-21T20-55-57.md
├── banner-api-eval-2026-02-01T21-57-47.md
├── banner-api-eval-2026-02-07T05-31-14.md  ← Latest: 31/31, 100%, 22.6s
└── ... (60+ reports accumulated)
```

Each report contains per-fixture results: badge accuracy (pass/fail), mutual exclusion validation, link syntax check, the actual badges generated, the LLM message, and timing data. Because they're markdown, I can `git diff` between runs to see exactly what changed.

The latest run shows:

- **31 fixtures tested, 100% success rate**
- **22.6s total (~730ms per fixture average)**
- **Cost: ~$0.06 per full run**

This is the heartbeat of the evaluation system. I run it after prompt changes, after model updates, and before releases.
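
To make the path from assertions to a diffable report concrete, here is a minimal sketch that runs the three binary checks for one fixture, renders one markdown table row, and builds a report file name in the timestamp format shown above. The type names, function names, and row layout are assumptions for illustration, not the framework's actual code.

```typescript
// Hypothetical shapes for illustration; the real fixture and API types may differ.
interface BannerResult {
  message: string;
  badges: { type: string }[];
}

interface AssertionOutcome {
  fixtureId: string;
  missingBadges: string[];
  mutualExclusionViolation: boolean;
  hasLinkBleed: boolean;
  passed: boolean;
}

const LINK_PATTERNS = [/\[link\]/i, /https?:\/\//, /summitatsnoqualmie\.com/i];

// Run the three binary checks described earlier against one API response.
function assertBanner(fixtureId: string, expectedBadgeTypes: string[], result: BannerResult): AssertionOutcome {
  const actual = result.badges.map((b) => b.type);
  const missingBadges = expectedBadgeTypes.filter((t) => !actual.includes(t));
  const mutualExclusionViolation = actual.includes('peakDay') && actual.includes('crowd');
  const hasLinkBleed = LINK_PATTERNS.some((p) => p.test(result.message));
  const passed = missingBadges.length === 0 && !mutualExclusionViolation && !hasLinkBleed;
  return { fixtureId, missingBadges, mutualExclusionViolation, hasLinkBleed, passed };
}

// One markdown table row per fixture keeps git diffs between runs line-oriented.
function toReportRow(o: AssertionOutcome): string {
  const mark = (ok: boolean) => (ok ? 'pass' : 'FAIL');
  return `| ${o.fixtureId} | ${mark(o.missingBadges.length === 0)} | ${mark(!o.mutualExclusionViolation)} | ${mark(!o.hasLinkBleed)} |`;
}

// Filesystem-safe UTC timestamp: colons in the ISO string become hyphens.
function reportFileName(now: Date): string {
  const stamp = now.toISOString().slice(0, 19).replace(/:/g, '-');
  return `banner-api-eval-${stamp}.md`;
}
```

Because every check is a boolean, a changed cell in a row points directly at the assertion that regressed.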
## Multiple Evaluation Modules

The banner evaluator is the most mature, but the framework supports multiple domains:

| Module | Command | What It Tests |
| --- | --- | --- |
| `eval:banner` | `nx run llm-evaluations:eval:banner` | Banner generation across 31 fixtures |
| `eval:cribbage` | `nx run llm-evaluations:eval:cribbage` | Multi-model cribbage strategy comparison |
| `eval:cribbage-production` | `nx run llm-evaluations:eval:cribbage-production` | Production discard quality validation |
| `eval:function-calling` | `nx run llm-evaluations:eval:function-calling` | Tool selection model comparison |
| `eval:daily-briefing` | `nx run llm-evaluations:eval:daily-briefing` | Ski conditions narrative quality |

Each module follows the same pattern: load fixtures → call production API → validate assertions → generate report. The cribbage evaluator adds expert-validated constraint checking (specific cards that must/must not be discarded), while function-calling adds multi-model comparison across different providers.

## The Full Testing Ecosystem

Here's how all the layers work together in practice:

| Layer | Runs When | Blocks? | Cost | What It Catches |
| --- | --- | --- | --- | --- |
| **Unit tests** | Every commit (pre-push hook) | ✅ Yes | Free | Logic bugs, type errors, regressions |
| **Contract tests** | Nightly CI + on-demand | ⚠️ CI only | ~$0.01 | External API changes, model deprecations, pricing changes |
| **Banner evals** (31 fixtures) | After prompt changes, before releases | ❌ Advisory | ~$0.06 | Badge miscalculation, prompt drift, link bleed, mutual exclusion violations |
| **Golden prompts** | Nightly (gated) | ❌ Advisory | ~$0.02 | Prompt version drift, model behavior changes |
| **E2E tests** | Weekly + pre-deploy | ⚠️ Manual | Free | Full user journey regressions |

The key insight: **these layers are not redundant.** Unit tests catch code bugs. Contract tests catch other people's code bugs. Evaluations catch _behavior_ bugs that only emerge when real data flows through the whole system.

## Lessons Learned Building This

### 1. Test the Production Path, Not the LLM

Our biggest improvement came from switching `eval:banner` from calling the LLM SDK directly to calling the production API endpoint. The SDK tests were passing while production was broken, because the badge calculator had a bug that only manifested with the real API's data transformation.

### 2. Fixtures > Generated Data

We tried generating test scenarios with another LLM. They were plausible but missed edge cases that matter in production: stale morning reports from the night before, highway data that says "none" when it means "no restriction" vs "no data", the Summit report containing raw HTML links that bleed into the LLM's output. Every fixture we have was inspired by a real production scenario.

### 3. Binary Assertions > Numeric Scoring

We experimented with AI judge scoring (1-10 quality ratings). The scores were noisy and hard to act on.
Did a drop from 8.2 to 7.8 mean something broke, or just that the LLM rephrased slightly? Binary assertions are boring, but when one turns red, you know exactly what to fix.

### 4. Reports Should Be Diffable

Storing evaluation reports as timestamped markdown files in the repo was initially a lazy choice. It turned out to be the most useful tool for understanding drift: `git diff evaluation-reports/banner-api-eval-2026-01-21*.md evaluation-reports/banner-api-eval-2026-02-07*.md` shows exactly which fixture changed behavior and how.

### 5. Cost Tracking Matters

At ~$0.06 per banner eval run, costs are trivial. But we built cost tracking and budget circuit breakers into the evaluation infrastructure early. When we added the cribbage evaluator (which runs multiple models per fixture), per-run costs climbed to ~$0.30. Circuit breakers prevent runaway evaluation loops from draining API budgets.

## What's Next

- **Cosmos DB persistence**: We've built the infrastructure for storing evaluation results in Cosmos DB with drift detection queries (`query:drift --modelId=gpt-5-mini --days=7`). Next step is wiring this into automated alerting.
- **Cross-model evaluation**: Running the same 31 banner fixtures against different models (GPT-5 vs Gemini Flash vs GPT-4o-mini) to quantify cost/quality tradeoffs.
- **Fixture generation from production logs**: Mining real production requests to generate new fixtures covering scenarios we haven't thought of yet.
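
For readers wondering what the budget circuit breaker from Lesson 5 might look like, here is a minimal sketch. The class name, method names, and dollar figures are hypothetical and chosen only for illustration; the repo's actual implementation may work differently.

```typescript
// Hypothetical budget guard; names and limits are illustrative, not taken from the repo.
class BudgetCircuitBreaker {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  /** Throw before making a call that would push total spend past the limit. */
  guard(estimatedCostUsd: number): void {
    if (this.spentUsd + estimatedCostUsd > this.limitUsd) {
      throw new Error(
        `Budget exceeded: spent $${this.spentUsd}, next call ~$${estimatedCostUsd}, limit $${this.limitUsd}`,
      );
    }
  }

  /** Record the actual cost of a completed call. */
  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }

  get spent(): number {
    return this.spentUsd;
  }
}

// Usage: check before each model call, record after it returns.
const breaker = new BudgetCircuitBreaker(0.5);
breaker.guard(0.25);
breaker.record(0.25);
```

Checking before the call rather than after means a runaway loop stops at the limit instead of one call past it.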
## Try It

The evaluation framework is open source in our [portfolio monorepo](https://github.com/nsmaassel/nx-portfolio-monorepo):

- **Fixtures**: `apps/evaluation/llm-evaluations/src/fixtures/`
- **Evaluation runner**: `apps/evaluation/llm-evaluations/src/evaluations/`
- **Contract tests**: `apps/testing/contract/api-contracts/`
- **Reports**: `evaluation-reports/`
- **Golden prompts**: `apps/evaluation/llm-evaluations/src/golden-prompts/`

The [Summit AI dashboard](https://summit.maassel.dev) shows the banner this system validates.

---

**This is Part 2 of a series on testing AI applications.** Part 1 covers the [two-tier testing strategy](/blog/testing-production-ai-apps) for LLM function calling. This post covers the evaluation infrastructure that makes that strategy practical.

Building a 6-Model Review Panel with GitHub Copilot Custom Agents

Nick Maassel — Mon, 09 Feb 2026 00:00:00 GMT

**TL;DR:** I built a system that sends any design decision to 6 different AI models simultaneously (Gemini 3 Pro, GPT-5.2, Claude Sonnet 4.5, GPT-5.2-Codex, Claude Opus 4.6, and Grok Code Fast 1) and synthesizes their feedback into a single report. Each model reviews through a different lens (mobile-first, info architecture, cognitive load, engineering feasibility, deep reasoning, fast gut-check), and an orchestrator agent merges their perspectives. It costs ~14 premium requests per review, takes about 30 seconds, and has already surfaced blind spots I'd have missed with any single model. Here's how I built it, what I learned, and how it compares to multi-agent approaches in other tools.

---

## The Problem: One Model, One Perspective

I was working on a precipitation timeline feature for a ski conditions dashboard. The weather data comes from NOAA's gridpoint API as hourly time-series arrays, and I needed to decide how to bucket those hours into meaningful periods for skiers.

Should I stick with two periods (overnight/daytime)? Split into three (overnight/morning/evening)? Four periods with a 2pm boundary for twilight pass holders? Add pass-aware dynamic periods?

Each option had trade-offs across UX, engineering complexity, data accuracy, and user segmentation. I realized I was going back and forth in my own head, and what I really wanted was a **panel of reviewers**, each bringing a different perspective to the same question.

That's when I discovered that VS Code's custom agents now support model selection and subagent orchestration.

## What Are Custom Agents?
Since the January 2026 VS Code release, you can create `.agent.md` files that define specialized AI agents with:

- **Custom system prompts**: tell the agent what lens to use
- **Model selection**: pin to a specific model (or provide a fallback chain)
- **Tool access**: control what the agent can do (search, fetch, read files, invoke other agents)
- **Subagent restrictions**: an orchestrator agent can specify exactly which other agents it's allowed to call

Agents live in two places:

- **Workspace-level:** `.github/agents/` (shared with your team via git)
- **User-level:** `~\AppData\Roaming\Code - Insiders\User\agents\` (personal, available across all projects)

The key insight: by creating multiple agents pinned to different models, and an orchestrator that dispatches to all of them, you get a **multi-model review panel** that runs in parallel.

## The Architecture

```
┌─────────────────────────────────────────────────┐
│  @multi-review "Should I use 3 or 4 periods?"   │
│                                                 │
│  ┌─────────────┐ ┌─────────────┐ ┌───────────┐  │
│  │review-gemini│ │ review-gpt  │ │review-    │  │
│  │Gemini 3 Pro │ │ GPT-5.2     │ │claude     │  │
│  │mobile-first │ │info arch    │ │cog load   │  │
│  └──────┬──────┘ └──────┬──────┘ └─────┬─────┘  │
│         │               │              │        │
│  ┌──────┴──────┐ ┌──────┴──────┐ ┌─────┴─────┐  │
│  │review-codex │ │ review-opus │ │review-grok│  │
│  │GPT-5.2-Codex│ │Claude Opus  │ │Grok Fast 1│  │
│  │engineering  │ │deep reason  │ │gut check  │  │
│  └──────┬──────┘ └──────┬──────┘ └─────┬─────┘  │
│         │               │              │        │
│         └───────┬───────┴──────────────┘        │
│         ┌───────┴────────┐                      │
│         │  Orchestrator  │                      │
│         │  Synthesis +   │                      │
│         │  Comparison    │                      │
│         └────────────────┘                      │
└─────────────────────────────────────────────────┘
```

Seven `.agent.md` files total: 6 reviewers + 1 orchestrator.

## Building the Reviewer Agents

Each reviewer agent follows the same structure but with a distinct **lens**:

### Example: The Gemini Reviewer

```yaml
---
name: review-gemini
description: Product/UX/technical reviewer powered by Gemini.
model:
  - gemini-3-pro (copilot)
  - gemini-2.5-pro (copilot)
tools: ['search', 'fetch', 'read']
user-invokable: false
---
```

Key decisions in the design:

- **`model` as an array** provides automatic fallback. If Gemini 3 Pro is unavailable, it silently falls back to Gemini 2.5 Pro.
- **`user-invokable: false`** hides the agent from the chat dropdown, so it only runs when the orchestrator calls it as a subagent.
- **`tools`** are intentionally limited. Reviewers can search and read but can't write files or run terminals.

The system prompt gives each model a distinct personality:

| Agent | Model | Lens | What It Asks |
| --- | --- | --- | --- |
| review-gemini | Gemini 3 Pro | Mobile-first, practical | "How does this feel on a phone?" |
| review-gpt | GPT-5.2 | Info architecture, competitive | "What do comparable products do?" |
| review-claude | Claude Sonnet 4.5 | Cognitive load, behavioral psych | "How many decisions are you forcing?" |
| review-codex | GPT-5.2-Codex | Engineering, implementation | "What's the actual code complexity?" |
| review-opus | Claude Opus 4.6 | Deep reasoning, system-level | "What are the second-order effects?" |
| review-grok | Grok Code Fast 1 | Fast gut-check | "Does this really matter?" |

### The Cost Profile

Not all reviews are equal in cost:

| Agent | Model | Premium Multiplier | Role |
| --- | --- | --- | --- |
| review-grok | Grok Code Fast 1 | 0.25x | Cheapest: the quick sanity check |
| review-gemini | Gemini 3 Pro | 1x | Standard |
| review-gpt | GPT-5.2 | 1x | Standard |
| review-claude | Claude Sonnet 4.5 | 1x | Standard |
| review-codex | GPT-5.2-Codex | 1x | Standard |
| review-opus | Claude Opus 4.6 | **10x** | Most expensive: deep reasoning |
| **Total** | | **~14.25x** | Full panel |

For budget-conscious reviews: skip Opus (saves 10x) or run only Gemini + GPT + Claude (3x total).

## The Orchestrator

The orchestrator is the glue. Its frontmatter restricts which agents it can invoke:

```yaml
---
name: multi-review
description: Multi-model review panel.
tools: ['agent', 'search', 'fetch', 'read']
agents: ['review-gemini', 'review-gpt', 'review-claude', 'review-codex', 'review-opus', 'review-grok']
---
```

Its system prompt defines a 3-step workflow:

1. **Parse** the question and formulate a clear review prompt
2. **Dispatch** to all 6 reviewers in parallel
3. **Synthesize** into consensus, divergence, per-reviewer highlights, comparison table, and final recommendation

The synthesis step is the real value. Raw output from 6 models is overwhelming. The orchestrator distills it into: "All 6 agree on X. Gemini and Grok diverge on Y. Opus surfaced a second-order concern about Z that nobody else caught."

## How to Use It

In VS Code chat:

```
@multi-review I'm deciding between 3 time periods (overnight/daytime/evening) and 4 periods (splitting daytime at 2pm for twilight pass holders). Context: ski conditions dashboard, NOAA hourly data, user segments include day pass holders (9am-5pm) and twilight pass holders (2pm-close at 9:30pm).
```

The orchestrator dispatches to all 6 models, collects their reviews, and returns a synthesized report with a comparison table.
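
If you want to budget a panel before running it, the arithmetic from the cost profile is easy to script. This is a small sketch, assuming the multipliers listed in the cost table; `premiumMultipliers` and `panelCost` are illustrative names, not part of any Copilot API.

```typescript
// Multipliers copied from the cost table; treat them as a snapshot, since pricing can change.
const premiumMultipliers: Record<string, number> = {
  'review-grok': 0.25,
  'review-gemini': 1,
  'review-gpt': 1,
  'review-claude': 1,
  'review-codex': 1,
  'review-opus': 10,
};

/** Sum the premium-request multipliers for the reviewers in a panel. */
function panelCost(reviewers: string[]): number {
  return reviewers.reduce((total, name) => {
    const multiplier = premiumMultipliers[name];
    if (multiplier === undefined) {
      throw new Error(`Unknown reviewer: ${name}`);
    }
    return total + multiplier;
  }, 0);
}

const fullPanel = Object.keys(premiumMultipliers);
const withoutOpus = fullPanel.filter((name) => name !== 'review-opus');
const coreTrio = ['review-gemini', 'review-gpt', 'review-claude'];

console.log({
  fullPanel: panelCost(fullPanel),
  withoutOpus: panelCost(withoutOpus),
  coreTrio: panelCost(coreTrio),
});
```

Keeping the multipliers in one map makes it cheap to re-run the numbers when a model is swapped or repriced.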
## How This Compares to Other Tools

The multi-agent landscape has exploded in early 2026. Here's how the major tools approach it differently:

### GitHub Copilot (VS Code): Declarative Agent Orchestration

**Approach:** `.agent.md` files with YAML frontmatter. Agents can invoke subagents. Parallel execution since January 2026.

**Strengths:**

- Broadest model selection (GPT, Claude, Gemini, Grok; 10+ models)
- Declarative config: no code required to define agents
- User-level agents work across all workspaces
- Agent Skills (Anthropic's open standard) for reusable capabilities
- Agent Sessions view consolidates local + background + cloud agents

**Limitations:**

- No file system isolation between parallel agents (unlike Cursor's worktrees)
- Subagent context is isolated from parent, so agents can't share intermediate state

### Claude Code: Agent Teams with Deep Context

**Approach:** Agent teams via the Agent SDK. Peer agents coordinate toward shared goals rather than leader-follower hierarchy.

**Strengths:**

- 1M token context window with Opus 4.6 (!)
- Context compaction for long-running sessions
- Adaptive thinking: model decides when to use extended reasoning
- Can take over any subagent mid-execution (Shift+Up/Down)
- CLAUDE.md files as persistent project memory

**Limitations:**

- Naturally biased toward Claude models
- Agent team coordination is "research preview" and still maturing
- No declarative multi-model setup like VS Code's `.agent.md`

### Cursor: Parallel Agents with Git Worktree Isolation

**Approach:** Each parallel agent gets its own Git worktree (isolated working directory, shared `.git` object store).

**Strengths:**

- True file system isolation: agents can't conflict
- Up to 8 agents in parallel on a single prompt
- Plan mode: plan with one model, execute with another
- Planner/worker/judge architecture for scaling to hundreds of agents

**Limitations:**

- Requires Git overhead for isolation
- Coordination through shared branches, not shared context
- No declarative agent definition format (yet)

### Copilot CLI: Terminal-Native Agents

**Approach:** Built-in specialized agents (Explore, Task, Plan, Code-review) with automatic delegation.

**Strengths:**

- Agents auto-select based on your prompt, with no manual choice needed
- Same `.agent.md` format as VS Code for custom agents
- Agent Registry (with JetBrains, Zed) for cross-IDE discovery
- Auto-compaction at 95% context usage

**Limitations:**

- Terminal-only: no visual comparison UI
- Model selection more limited than VS Code

### Windsurf: Flow-Aware Cascade

**Approach:** Cascade agent tracks all developer actions (edits, commands, clipboard, terminal) to infer intent.

**Strengths:**

- Implicit intent inference: "continue my work" actually works
- Arena Mode for head-to-head model comparison
- Git worktree isolation (like Cursor)
- Memories system for cross-session context

**Limitations:**

- More opaque agent behavior (hard to debug the "flow" reasoning)
- Tighter vendor coupling than open-standard approaches

### Amazon Q Developer: AWS-Native Agents

**Approach:** Custom agents via configuration files with granular tool/path permissions.

**Strengths:**

- Deep AWS service integration (CloudWatch, Lambda, DynamoDB analysis)
- Granular permission model (read-only vs write-only per path)
- Free tier: 50 agentic chats/month

**Limitations:**

- Heavily AWS-focused and less general-purpose
- Fewer model choices than VS Code or Cursor

## The Comparison Matrix

| Capability | VS Code Copilot | Claude Code | Cursor | Copilot CLI | Windsurf | Amazon Q |
| --- | --- | --- | --- | --- | --- | --- |
| Custom agent definitions | `.agent.md` ✅ | Agent SDK ✅ | Rules ⚠️ | `.agent.md` ✅ | Rules ⚠️ | Config files ✅ |
| Multi-model in same workflow | ✅ (10+ models) | ⚠️ (Claude-focused) | ✅ (~8 models) | ✅ | ✅ | ⚠️ (fewer) |
| Sub-agent orchestration | ✅ (parallel) | ✅ (agent teams) | ✅ (worktrees) | ✅ (auto-delegate) | ⚠️ (multi-cascade) | ⚠️ |
| File system isolation | ❌ | ❌ | ✅ (worktrees) | N/A | ✅ (worktrees) | ❌ |
| Declarative agent config | ✅ | Partial | ❌ | ✅ | ❌ | ✅ |
| Cross-IDE portability | ✅ (ACP + Registry) | ❌ | ❌ | ✅ (ACP) | ❌ | ❌ |
| Context window | Standard | 1M tokens | Standard | Standard | Standard | Standard |

## What I've Learned

### 1. Different Models Have Genuinely Different Blind Spots

This isn't just "get more opinions for confidence." Models trained differently actually surface different concerns. In my precipitation timeline review:

- **Gemini** focused on mobile scrolling behavior and touch targets
- **GPT** referenced how Ski Utah and OpenSnow handle time bucketing
- **Claude Sonnet** flagged cognitive overload from too many time periods
- **Codex** pointed out the code complexity of dynamic period boundaries
- **Opus** identified a second-order effect: changing periods would affect historical comparisons
- **Grok** said "3 periods is fine, ship it, stop overthinking"

Grok's directness was surprisingly valuable.
Sometimes the most useful review is the one that says "this isn't worth the complexity." ### 2. The Orchestrator Synthesis Is The Key Feature Raw output from 6 models is ~3,000-5,000 words. Nobody reads that. The orchestrator's job — identifying consensus, surfacing divergence, building comparison tables — is what makes the system usable. Without synthesis, it's just noise. ### 3. Cost Management Matters At 13.25x premium requests per full review, you won't run this on every commit message. I use it for: - Architecture decisions (which database? which API pattern?) - UX decisions with multiple valid approaches - Reviewing my own specs before implementation For quick checks, I skip Opus and Codex (keeps it at 3.25x). ### 4. `user-invokable: false` Is Essential for Clean UX Without this flag, all 6 reviewer agents would appear in your chat dropdown alongside your regular agents. Setting `user-invokable: false` keeps them hidden — they only activate when the orchestrator calls them. This is the difference between a usable system and a cluttered mess. ### 5. Fallback Chains Handle Model Deprecations Gracefully GitHub deprecates model versions regularly (next batch: Feb 17, 2026). The array syntax for `model` ensures your agents keep working: ```yaml model: - gemini-3-pro (copilot) # primary - gemini-2.5-pro (copilot) # fallback ``` When Gemini 2.5 Pro gets deprecated on Feb 17, the agent is already set to prefer Gemini 3 Pro. Zero downtime. ## Setting This Up Yourself ### Step 1: Create Your Agent Directory For user-level agents (available across all workspaces): - **macOS/Linux:** `~/.vscode-insiders/data/User/agents/` (or `~/.vscode/data/User/agents/` for stable) - **Windows:** `%APPDATA%\Code - Insiders\User\agents\` For workspace-level agents (shared via git): `.github/agents/` ### Step 2: Create Reviewer Agents Create one `.agent.md` file per model. Minimum viable agent: ```yaml --- name: review-mymodel description: Technical reviewer on ModelName. 
model: model-name (copilot) tools: ['search', 'fetch', 'read'] user-invokable: false --- You are a technical reviewer. Rate options as Strong/Moderate/Weak. Give a clear recommendation with reasoning. ``` ### Step 3: Create the Orchestrator ```yaml --- name: multi-review description: Multi-model review orchestrator. tools: ['agent', 'search', 'fetch', 'read'] agents: ['review-mymodel1', 'review-mymodel2', 'review-mymodel3'] --- Dispatch the user's question to all reviewers in parallel. Synthesize into: Consensus, Divergence, Comparison Table, Recommendation. ``` ### Step 4: Use It ``` @multi-review [your question with context] ``` ## What's Next The multi-agent ecosystem is moving fast. A few things I'm watching: - **Agent Client Protocol (ACP)** — open standard from GitHub + JetBrains + Zed for cross-IDE agent portability. Your `.agent.md` files could work in JetBrains IDEs without modification. - **Agent Skills** — Anthropic's open standard for reusable agent capabilities. Think npm packages but for agent behaviors. - **Extended context windows** — Claude Code's 1M token context enables agent sessions that span days. This changes what's possible for long-running autonomous agents. - **Planner/worker/judge patterns** — Cursor's research on scaling to hundreds of parallel agents suggests a future where "run 100 agents on this codebase" is a normal workflow. The role of the developer is shifting from "person who writes code" to "person who orchestrates AI agents that write code." Multi-model review panels are an early, practical example of that shift — and the tools to build them are available today. --- _Built with GitHub Copilot in VS Code Insiders, running Claude Opus 4.6. Agent files available at [github.com/nickmaassel](https://github.com/nickmaassel)._

Building Summit AI: A Real-Time Ski Schedule & Weather App

Nick Maassel — Tue, 23 Dec 2025 00:00:00 GMT

## Introduction As a frequent visitor to Summit at Snoqualmie, I found myself constantly checking their website to see which base areas were open, what the weather looked like, and whether it was worth making the drive from Seattle. After too many times juggling multiple browser tabs, I decided to build **Summit AI** — a single-page app that consolidates all this information in one beautifully designed interface. 🚀 **Live App:** [summit-ai.maassel.dev](https://summit-ai.maassel.dev) ## Key Features ### 1. **Real-Time Schedule Data** Summit AI scrapes the official Summit at Snoqualmie website daily to provide up-to-date information on: - **5 Base Areas**: Summit West, Summit Central, Silver Fir, Alpental, and Summit East - **Operating Hours**: Exact open/close times for each area - **Status Indicators**: Color-coded badges (Open ✅, Closed ❌, TBD ⏳) - **Special Tags**: "Powder Magnet" for Alpental (gets the most snow!) ![Summit AI Daily View](/images/summit-ai-daily-view.png) ### 2. **Live Weather Integration (NOAA)** Instead of relying on third-party weather APIs with rate limits or paywalls, Summit AI fetches data directly from the **National Weather Service (NOAA)** for Snoqualmie Pass: - **7-Day Forecast**: Temperature, conditions, wind speed - **Snowfall Predictions**: Tracks accumulation for powder alerts - **Powder Alerts**: Automatic banners when ≥3" of snow is forecasted - **Fresh Snow Badges**: 1-3" = "Fresh Snow," 3"+ = "POWDER ALERT" ❄️ The NOAA API is free, reliable, and doesn't require authentication—perfect for a hobby project! ### 3. **Interactive Calendar Navigation** A custom calendar component shows: - **Week/day position indicators**: See where you are in the month - **Snow icons on forecast days**: Visual indicators for expected snowfall - **Quick date jumps**: "Today," "Tomorrow," shortcuts for fast navigation - **Clickable dates**: Jump to any day instantly ### 4. 
**Dual View Modes** **Daily View** — Deep dive into a single date: - All 5 base area statuses and hours - Weather forecast details - Live webcam previews (see below) - Powder alerts with exact accumulation amounts **Weekly View** — At-a-glance 7-day overview: - Compact grid showing all areas across the week - Snow accumulation badges on each day - Color-coded status cells for quick scanning - Click any day to jump to detailed daily view ![Summit AI Weekly View](/images/summit-ai-weekly-view.png) ### 5. **Live Webcam Integration** One of my favorite features: **embedded YouTube webcams** showing live conditions at each base area. **Desktop Experience:** - **5 webcam previews** in the left sidebar (under the calendar) - 2-column grid with labels for each area - **Hover-to-expand**: Mouse over any webcam to see a full-size overlay in the center of the screen **Mobile Experience:** - Horizontal scrollable row of webcam previews - Optimized for touch navigation - Saves vertical space for schedule content The webcams use YouTube's embed API with autoplay disabled (per user preference standards) and are sourced from Summit's official channels. ### 6. **Fully Responsive Design** Summit AI adapts seamlessly from desktop to mobile: **Desktop (≥900px):** - Side-by-side layout: Calendar/webcams on the left, schedule on the right - Sticky positioning keeps calendar visible while scrolling - Hover interactions for webcams and schedule details **Mobile (<900px):** - Stacked vertical layout: Calendar → Webcams → Toggle → Schedule - Touch-friendly buttons and navigation - Optimized font sizes and spacing ![Summit AI Mobile View](/images/summit-ai-mobile-view.png) ### 7. 
**Powder Alert System** The app automatically calculates snowfall based on NOAA forecasts and displays: - **Powder Alert Banner** (≥3" snow in 24hrs): Bright cyan highlight with snowfall amount - **Fresh Snow Banner** (1-3" snow): Green success banner - **Calendar Snow Icons** (❄️): Visual indicators on dates with predicted snowfall This makes it easy to spot the best days to hit the slopes! ## Tech Stack - **Frontend**: React 18 + TypeScript + Vite - **UI Framework**: Material-UI (MUI) v5 - **Styling**: Design tokens for colors, typography, spacing - **Data Sources**: - Summit at Snoqualmie (web scraping via backend API) - NOAA National Weather Service (public API) - YouTube (embedded live webcams) - **Hosting**: Azure Static Web Apps (with CDN) - **Backend**: Express API for schedule scraping and caching ## Design Philosophy I wanted Summit AI to feel modern and polished while being blazingly fast. Key decisions: 1. **Gradient Header**: Teal-to-pink gradient matches ski culture vibes 2. **Status Color Coding**: - Green (Open) = Good to go ✅ - Orange (Closed) = Stay home ❌ - Yellow (TBD) = Check back later ⏳ 3. **Typography Hierarchy**: Clear headings, readable body text, and compact data tables 4. **Micro-interactions**: Smooth transitions, hover effects, and animations 5. **Performance**: React Query for caching, lazy-loaded images, optimized bundle size ## Challenges & Solutions ### Challenge 1: Real-Time Schedule Data **Problem**: Summit's website doesn't have a public API. **Solution**: Built a backend scraper that runs daily (Azure Functions) and caches the schedule data as JSON. The frontend fetches from this cached endpoint. ### Challenge 2: Weather Forecast Parsing **Problem**: NOAA's API returns raw text descriptions like "Snow likely, mainly after 4pm." **Solution**: Implemented regex parsing to extract snowfall amounts ("3 to 5 inches") and normalize them into numeric values for powder alerts. 
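The parsing step described in Challenge 2 could look roughly like the sketch below. This is not the app's actual code: the function and type names are hypothetical, the regexes only cover phrasings like "3 to 5 inches" or "2 inches", and classifying on the upper bound of a range is an assumption. The 1" and 3" thresholds come from the Powder Alert section, with exactly 3" treated as a powder alert.

```typescript
// Hypothetical names; thresholds (1" and 3") are taken from the post.
type SnowBanner = 'POWDER_ALERT' | 'FRESH_SNOW' | 'NONE';

interface SnowEstimate {
  minInches: number;
  maxInches: number;
}

// "3 to 5 inches" style ranges, then single amounts like "2 inches" or "1 inch".
const RANGE = /(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\s+inch(?:es)?/i;
const SINGLE = /(\d+(?:\.\d+)?)\s+inch(?:es)?/i;

function parseSnowfall(forecastText: string): SnowEstimate | null {
  const range = forecastText.match(RANGE);
  if (range) {
    return { minInches: Number(range[1]), maxInches: Number(range[2]) };
  }
  const single = forecastText.match(SINGLE);
  if (single) {
    const inches = Number(single[1]);
    return { minInches: inches, maxInches: inches };
  }
  // No numeric amount in the text, e.g. "Snow likely, mainly after 4pm."
  return null;
}

function classifySnow(estimate: SnowEstimate | null): SnowBanner {
  if (!estimate) return 'NONE';
  // Assumption: use the upper bound of the forecast range.
  if (estimate.maxInches >= 3) return 'POWDER_ALERT';
  if (estimate.maxInches >= 1) return 'FRESH_SNOW';
  return 'NONE';
}
```

Keeping parsing and classification separate means the thresholds can change without touching the regexes, and forecasts without a numeric amount simply produce no banner.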
### Challenge 3: Responsive Webcams **Problem**: YouTube embeds are heavyweight and can slow down the page. **Solution**: Used lazy loading (`loading="lazy"`) and conditional rendering (only load on viewport visibility). Also disabled autoplay until user hovers (desktop) or taps (mobile). ### Challenge 4: Calendar State Management **Problem**: Syncing calendar selection with daily/weekly view navigation. **Solution**: Lifted state to the root `ScheduleView` component and passed callbacks down. Week view calculates Monday-start week dynamically. ## What's Next? **Phase 2 Features** (coming soon): - 🤖 **AI-Powered Insights**: "Best for beginners today: Summit West" (using GPT-4) - 📊 **Historical Data**: "This day last year had 12\" of powder" - 🎿 **Crowd Predictions**: "Expect heavy traffic on weekends" - 🔔 **Push Notifications**: "Powder alert for tomorrow!" ## Lessons Learned 1. **Design Tokens are Worth It**: Centralizing colors, spacing, and typography made theming painless. 2. **NOAA's API is Underrated**: Free, reliable, and well-documented. More devs should use it! 3. **Playwright for Screenshots**: Automated screenshot capture (used for this blog post!) is a game-changer for documentation. 4. **Monorepo Power**: NX made it easy to share types between frontend/backend and run tests across the entire stack. ## Try It Yourself 🔗 **Live App**: [summit-ai.maassel.dev](https://summit-ai.maassel.dev) 💻 **Source Code**: (Private repo, but happy to discuss implementation!) If you're a skier or snowboarder in the Pacific Northwest, give Summit AI a try and let me know what you think! I'm always open to feedback and feature suggestions. --- **Tags**: #React #TypeScript #Vite #MaterialUI #WeatherAPI #WebScraping #Azure #NXMonorepo #ResponsiveDesign --- _Have questions about how I built this? Want to discuss the architecture or design decisions? Feel free to reach out!_

Testing Production AI Apps: Two-Tier Strategy for LLM Function Calling

Nick Maassel — Thu, 11 Dec 2025 00:00:00 GMT

## The Testing Paradox for AI Apps Traditional software testing relies on determinism: given the same input, you get the same output. But AI systems—especially LLMs—are fundamentally non-deterministic. The same prompt can produce different responses every time. So how do you write automated tests for production AI applications? **You can't assert exact outputs, but you can validate behavior.** This post shares the two-tier testing strategy we use for our [Function Calling Demo](/demos/function-calling), which uses OpenAI's GPT models to automatically select and execute backend APIs based on natural language queries. ## The Problem: Non-Determinism Meets TDD Traditional TDD approach (doesn't work for LLMs): ```typescript // ❌ This will be flaky test('should answer weather question', () => { const response = llm.ask("What's the weather in Seattle?"); expect(response).toBe("It's 52°F and rainy in Seattle."); // Fails 90% of the time - LLM phrases it differently }); ``` The LLM might say: - "It's 52°F and rainy in Seattle." - "Seattle is currently experiencing rainy weather at 52 degrees Fahrenheit." - "The weather in Seattle is rainy with a temperature of 52°F." - "Seattle: 52°F, precipitation expected." All correct answers, but none match the assertion. ## Two-Tier Testing Strategy We split testing into two complementary tiers: ### Tier 1: Deterministic Validation (Blocks CI/CD) - ✅ **Tool selection correctness** - Did the LLM choose the right function? - ✅ **Response structure** - Does the API return expected fields? - ✅ **Semantic relevance** - Does the response contain keywords related to the question? - 🚫 **Blocks PRs if failing** - ⚡ **Fast** (~30 seconds for full suite) ### Tier 2: Quality Assessment (Advisory) - 📊 **Response quality** - Is it coherent, helpful, and complete? - 📊 **Model comparison** - Which model performs best (GPT-5 vs GPT-4.1)? 
- 📊 **AI judge grading** - Another LLM evaluates quality
- 💡 **Advisory only** - Doesn't block PRs
- 🕐 **Slower** (~5 minutes for full evaluation)

## Tier 1: Real API Calls with Flexible Validation

Here's a real test from our function calling demo:

```typescript
it('should handle "What\'s the weekend schedule at Alpental?"', async () => {
  // 1. Send real user question to production API
  const response = await fetch(`${API_BASE_URL}/api/agent-chat`, {
    method: 'POST',
    // A JSON body needs an explicit Content-Type so the server's JSON parser picks it up
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: "What's the weekend schedule at Alpental?",
      context: {},
    }),
  });

  expect(response.status).toBe(200);
  const data = await response.json();

  // 2. Validate API response structure
  expect(data).toHaveProperty('requestId');
  expect(data).toHaveProperty('response');
  expect(data).toHaveProperty('toolsUsed');

  // 3. Validate tool selection (function calling behavior)
  const toolNames = data.toolsUsed.map((tool: any) => tool.name);
  expect(toolNames.length).toBeGreaterThan(0);
  expect(
    toolNames.some(
      (name: string) => name.includes('summit_schedule') // Right category of tool
    )
  ).toBe(true);

  // 4. Validate semantic relevance (flexible regex)
  // Any of these keywords prove the response is relevant
  expect(data.response.toLowerCase()).toMatch(/alpental|weekend|saturday|sunday/);
});
```

### What We're Validating

**✅ Tool Selection** (Most Critical)

```typescript
expect(toolNames.some((name) => name.includes('summit_schedule'))).toBe(true);
```

If the LLM chose `weather_api` instead of `summit_schedule`, that's a regression—even if the response sounds plausible.

**✅ Response Structure** (API Contract)

```typescript
expect(data).toHaveProperty('toolsUsed');
```

The API shape must remain stable for frontend consumers.

**✅ Semantic Relevance** (Flexible Keywords)

```typescript
expect(data.response.toLowerCase()).toMatch(/alpental|weekend|saturday|sunday/);
```

We're not checking exact wording—just that the response is _about_ the right topic.
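If the same tool-selection and relevance checks repeat across many Tier 1 tests, they could be pulled into one small pure helper. The sketch below is illustrative only: `checkAgentBehavior` and both interfaces are hypothetical names, and the response shape is assumed from the fields the test above reads (`requestId`, `response`, `toolsUsed[].name`).

```typescript
// Shape assumed from the fields the Tier 1 test reads; not the API's real type.
interface AgentChatResult {
  requestId: string;
  response: string;
  toolsUsed: { name: string }[];
}

interface BehaviorExpectation {
  toolNameIncludes: string; // substring identifying the right tool category
  relevance: RegExp; // alternation of acceptable keywords (no `g` flag)
}

// Returns human-readable failures; an empty array means both checks passed.
function checkAgentBehavior(
  result: AgentChatResult,
  expected: BehaviorExpectation
): string[] {
  const failures: string[] = [];
  const toolNames = result.toolsUsed.map((tool) => tool.name);

  if (!toolNames.some((name) => name.includes(expected.toolNameIncludes))) {
    failures.push(
      `expected a tool matching "${expected.toolNameIncludes}", got [${toolNames.join(', ')}]`
    );
  }
  if (!expected.relevance.test(result.response.toLowerCase())) {
    failures.push(`response does not match ${expected.relevance}`);
  }
  return failures;
}
```

In a test this might read `expect(checkAgentBehavior(data, { toolNameIncludes: 'summit_schedule', relevance: /alpental|weekend/ })).toEqual([])`, so a failing run prints which check broke and which tools were actually called.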
### What We're NOT Validating ❌ **Exact wording** - LLMs rephrase constantly ❌ **Grammar/style** - Subjective and changes with model updates ❌ **Tone** - That's a quality concern (Tier 2) ❌ **Completeness** - That's also Tier 2 ## Why This Works ### 1. **Catches Real Regressions** - System prompt changes that break tool selection - Tool definition changes that confuse the LLM - MCP server connectivity issues - Response formatting bugs ### 2. **Tolerates Non-Determinism** - LLM can phrase answers differently each time - Minor wording variations don't fail tests - Focuses on **behavior** not **exact output** ### 3. **Fast Enough for CI/CD** - Each test ~3-5 seconds (real LLM API call) - Full suite ~30 seconds - Acceptable for pre-push hooks ### 4. **Real Integration Testing** ```typescript // Entire stack is exercised: User Question → Express API route → OpenAI Function Calling → MCP Server → Backend API → Data source → LLM formats response → Returns to user ``` This is **real integration testing**, not mocked unit tests. ## Tier 2: AI Judge for Quality Assessment While Tier 1 blocks regressions, Tier 2 evaluates **quality** using another LLM as a judge: ```typescript // Simplified example - actual implementation is more sophisticated const evaluateResponse = async (question: string, response: string) => { const judgePrompt = ` Evaluate this AI assistant response on a scale of 1-10: Question: ${question} Response: ${response} Criteria: - Accuracy: Does it answer the question correctly? - Completeness: Is all relevant information included? - Clarity: Is the response easy to understand? - Conciseness: Is it appropriately brief? Return JSON: { "score": 8, "reasoning": "..." 
}
  `;

  const judgment = await gpt5.evaluate(judgePrompt);
  return judgment;
};
```

### What Tier 2 Evaluates

**📊 Response Quality**

- Coherence and readability
- Appropriate level of detail
- Helpful and user-friendly

**📊 Model Comparison**

- GPT-5 vs GPT-4.1-nano performance
- Cost vs quality trade-offs
- Which model handles edge cases better

**📊 Regression Detection Over Time**

- Are responses getting worse with model updates?
- Is the system prompt still effective?

### Why Tier 2 Doesn't Block PRs

- **Subjective metrics** - Quality is harder to define than correctness
- **Model updates** - OpenAI can change model behavior without warning
- **Cost concerns** - Running AI judges on every PR is expensive
- **Speed** - Takes 5+ minutes for comprehensive evaluation

Instead, Tier 2 runs:

- **Nightly** - Against production endpoints
- **On-demand** - When investigating quality issues
- **Before releases** - To ensure no quality degradation

## Real-World Example: Weekend Query Bug

We recently fixed a bug where weekend queries returned only Saturday's schedule without mentioning Sunday. Here's how the two-tier approach caught it:

### Tier 1 Test (Caught the Regression)

```typescript
it('should query BOTH days for "next weekend"', async () => {
  const response = await askAgent("What's open next weekend?");

  // Extract dates from tool calls
  const dates = response.toolsUsed.flatMap((tool) => tool.arguments?.date).filter(Boolean);

  // Must query both Saturday AND Sunday.
  // The dates are hardcoded, so this assumes the test pins "today" to a fixed date.
  expect(dates.length).toBe(2);
  expect(dates).toContain('2025-12-20'); // Saturday
  expect(dates).toContain('2025-12-21'); // Sunday
});
```

This test **blocked the PR** until we fixed the system prompt to explicitly instruct the LLM to query both days.

### Tier 2 Evaluation (Assessed User Experience)

```json
{ "question": "What's open next weekend?", "score": 6, "reasoning": "Response mentions Saturday schedule but doesn't explicitly state Sunday hours. User might assume Sunday is closed when it's actually open.
Incomplete information." } ``` The AI judge identified the **user experience problem** even though Tier 1 didn't catch it initially (we added that test after). ## Coverage Strategy Our function calling demo has 100% coverage of UI sample questions: | Sample Question | Test Coverage | | ------------------------------------------ | ------------------ | | "What's open at Summit today?" | ✅ Tier 1 + Tier 2 | | "What's the weekend schedule at Alpental?" | ✅ Tier 1 + Tier 2 | | "Tell me about the Summit West base area" | ✅ Tier 1 + Tier 2 | | "What was open yesterday at the summit?" | ✅ Tier 1 + Tier 2 | | "What's the weather like at Summit?" | ✅ Tier 1 + Tier 2 | **Tier 1 Tests**: 16 tests covering tool selection and semantic relevance **Tier 2 Evaluations**: 20+ golden prompts for quality assessment ## Implementation Details ### Test Infrastructure **Tier 1 Location**: `apps/deployable/api/generative-ai-api/src/tests/` - `agent-chat-llm-integration.spec.ts` - Main demo sample questions - `agent-chat-weekend-queries.spec.ts` - Weekend-specific edge cases - `agent-chat-area-filtering.spec.ts` - Area filtering logic **Tier 2 Location**: `apps/evaluation/llm-evaluations/` - `src/golden-prompts/` - Curated test cases - `src/ai-judge/` - GPT-5 evaluation logic - `src/comparison/` - Multi-model comparison ### Running the Tests **Tier 1 (CI/CD)**: ```bash # Run on every push (pre-push hook) nx test generative-ai-api --testPathPattern=agent-chat # Prerequisites: # - OPENAI_API_KEY environment variable # - API server running (or auto-started by tests) ``` **Tier 2 (On-Demand)**: ```bash # Run full evaluation suite npx nx run llm-evaluations:eval:function-calling # Compare models npx nx run llm-evaluations:compare --models=gpt-5,gpt-4.1-nano ``` ## Lessons Learned ### 1. **Start with Tier 1, Add Tier 2 Later** Get the deterministic tests working first. They provide immediate feedback and catch most bugs. ### 2. 
**Don't Assert Exact Strings**

Use flexible regex patterns with alternation:

```typescript
// ✅ Good: case-insensitive, accepts several phrasings
expect(response).toMatch(/open|available|accessible/i);

// ❌ Bad: breaks as soon as the LLM rephrases
expect(response).toContain('The area is currently open');
```

### 3. **Test Tool Selection First**

The most important validation is: **Did the LLM choose the right function?** If tool selection is correct, response quality issues can be fixed with prompt tuning. If tool selection is wrong, the system is fundamentally broken.

### 4. **Golden Prompts are Gold**

Curate a set of "golden" test cases that represent real user queries. Protect these fiercely—they're your regression suite.

### 5. **AI Judges Need Structure**

Give your AI judge clear criteria and ask for JSON output. Freeform evaluations are hard to aggregate.

### 6. **Cost vs Coverage Trade-Offs**

- Tier 1: Run on every push (~$0.02 per run)
- Tier 2: Run nightly (~$2 per full evaluation)

Budget accordingly.

## Common Pitfalls

### ❌ Over-Reliance on Mocks

```typescript
// This doesn't test the actual LLM behavior
mock(llm).toReturn({ tool: 'summit_schedule', args: {...} });
```

Use real API calls for integration tests. Mock at the boundary (external services), not at the LLM.

### ❌ Exact String Matching

```typescript
expect(response).toBe('Alpental is open 9am-4pm');
// Flaky! LLM might say "Alpental: 9:00 AM - 4:00 PM"
```

### ❌ Testing Too Many Things at Once

```typescript
// Bad: Tests tool selection, response quality, and formatting
expect(response).toBe(EXACT_EXPECTED_OUTPUT);

// Good: Test one thing at a time
// Tool selection (toolsUsed holds objects with a `name`, as in the Tier 1 test)
expect(toolsUsed.some((tool) => tool.name.includes('summit_schedule'))).toBe(true);
expect(response).toMatch(/alpental/i); // Relevance
// Tier 2 handles quality
```

### ❌ Ignoring Model Updates

OpenAI updates models regularly. What worked yesterday might break tomorrow. Monitor Tier 2 trends to catch degradation.

## Industry Comparison: Other Approaches

We're not the only ones solving this problem.
Here are alternative approaches: ### **Prompt Regression Testing** (used by LangChain) - Store prompt templates in version control - Test that template changes don't break known cases - Focus on prompt engineering rather than output validation ### **LLM-as-a-Judge** (used by OpenAI for GPT-4 evals) - Use a stronger model to evaluate weaker models - Constitutional AI approach - Our Tier 2 is inspired by this ### **Assertion-Based Testing** (used by PromptLayer) - Define semantic assertions like "response contains date" - Use NLP to validate claims rather than string matching - More sophisticated than our keyword approach ### **Human-in-the-Loop** (used by Anthropic) - Sample responses sent to humans for rating - Gold standard for quality, but doesn't scale - We reserve this for Tier 2 evaluation of edge cases ## Future Improvements ### 1. **Semantic Similarity Scoring** Instead of keyword matching, use embeddings to measure semantic distance: ```typescript const similarity = cosineSimilarity(embed(response), embed('Expected to mention Alpental and weekend hours')); expect(similarity).toBeGreaterThan(0.8); ``` ### 2. **Automated Golden Prompt Generation** Use LLMs to generate diverse test cases based on existing ones: ```typescript const variants = await generateVariants("What's open at Summit today?", { count: 10, diversity: 'high' }); // "Which areas are operational at Summit right now?" // "Tell me what's currently available at Summit" // etc. ``` ### 3. **Continuous Monitoring** - Track Tier 2 scores over time - Alert when scores drop below baseline - Correlate with model version updates ### 4. **Multi-Modal Testing** Extend to test function calling with images, audio, and video inputs. 
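The semantic similarity idea in item 1 above calls a `cosineSimilarity` helper without defining it. A minimal sketch of that helper follows, assuming embeddings arrive as plain `number[]` vectors of equal length. The `embed` function stays hypothetical; with a real embeddings API it would almost certainly be async and need `await`.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns a value in [-1, 1]; 1 means the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and the same length');
  }

  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  // A zero vector has no direction; treat it as "no similarity".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The 0.8 threshold in the earlier snippet is a starting point rather than a constant: similarity scores vary between embedding models, so it would need calibrating against known-good and known-bad responses.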
## Try It Yourself All the code for this testing strategy is open source in our [portfolio monorepo](https://github.com/nsmaassel/nx-portfolio-monorepo): - **Tier 1 tests**: `apps/deployable/api/generative-ai-api/src/tests/` - **Tier 2 evaluations**: `apps/evaluation/llm-evaluations/` - **Testing docs**: `docs/TESTING_SPECIFICATION.md` The [Function Calling Demo](/demos/function-calling) shows this system in action. ## Conclusion Testing AI applications requires rethinking traditional TDD principles: - ✅ **Do** validate behavior (tool selection, semantic relevance) - ❌ **Don't** assert exact outputs - ✅ **Do** split deterministic (Tier 1) from quality (Tier 2) tests - ❌ **Don't** block CI/CD on subjective quality metrics - ✅ **Do** use real API calls for integration tests - ❌ **Don't** over-rely on mocks This two-tier approach gives us: - **Fast, reliable CI/CD** (Tier 1 blocks regressions) - **Quality insights** (Tier 2 guides improvements) - **Cost control** (Tier 1 is cheap, Tier 2 runs selectively) AI systems are inherently non-deterministic, but that doesn't mean they're untestable. You just need the right strategy. --- **Want to dive deeper?** Check out the [related demo](/demos/function-calling) to see this testing strategy in action, or explore our [testing specification](https://github.com/nsmaassel/nx-portfolio-monorepo/blob/main/docs/TESTING_SPECIFICATION.md) for implementation details. **Next in this series**: Evaluating LLM responses with AI judges (deep dive into Tier 2)

Spec-Driven Development: Augmenting Modern Software Practices with AI

Nick Maassel — Thu, 04 Dec 2025 00:00:00 GMT

## What is Spec-Driven Development? Spec-Driven Development (SDD) combines the rigor of formal specifications with the velocity of AI-assisted development. Rather than writing code first and documentation later, we define clear specifications upfront, then use AI tools to accelerate implementation while maintaining architectural integrity. This approach is especially powerful when paired with tools like Speckit, which provides structured templates for planning, architectural decisions, data models, and implementation tasks. ## The Problem We're Solving Traditional software development cycles often suffer from: - **Unclear requirements**: Vague user stories lead to misaligned implementations - **Scope creep**: Features grow unbounded without clear acceptance criteria - **Rework cycles**: Architectural misunderstandings discovered mid-development - **Poor estimation**: Task sizing is guesswork without detailed planning - **Over-engineering**: Without clear boundaries, developers add unnecessary complexity Meanwhile, AI-assisted development is powerful but needs structure: - Raw AI code generation can be chaotic without direction - AI excels at implementation but needs architecture guidance - Output quality depends on input clarity **Spec-Driven Development is the connective tissue that turns AI's raw power into directed progress.** ## The Spec-Driven Workflow ```plaintext 1. PLAN (Async, Human-led) ↓ Define problem, constraints, user stories Output: Detailed specification document 2. ARCHITECTURE (Async, Human-led) ↓ Review spec, make architectural decisions Identify data models, API contracts, integration points Output: Technical architecture & data model diagrams 3. TASK GENERATION (Async, AI-assisted) ↓ AI generates granular, parallelizable tasks Human refines task breakdown and dependencies Output: Actionable task checklist 4. 
IMPLEMENTATION (Async, AI-accelerated) ↓ Developers implement tasks using AI coding agents Spec ensures consistency and prevents rework Output: Working features, tested code 5. VALIDATION (Async, Human-led) ↓ E2E tests verify spec compliance Acceptance criteria checked off Output: Completed feature ready for review ``` ## Benefits of Spec-Driven Development ### Better Estimation With detailed task breakdowns upfront, estimation becomes much more accurate. You know: - How many tasks there are - Approximate complexity of each task - Dependencies between tasks - Parallelization opportunities ### Reduced Rework Clear architectural decisions prevent "wait, should we do it this way?" mid-implementation. Everyone's aligned on: - Data model structure - API contract design - Component boundaries - Edge cases and error handling ### AI-Friendly Development Specifications provide the "context" that AI tools need to be most effective. AI can: - Generate scaffolding from spec - Create thorough test coverage - Implement well-defined interfaces - Handle tedious implementation details ### Parallel Execution With clear task boundaries and minimal dependencies, teams can work in parallel. Spec-driven development explicitly identifies: - Which tasks are independent (marked [P]) - Which tasks have dependencies - Optimal execution order This is massive for solo developers using AI agents—you can delegate independent tasks to agents while you focus on architecture and validation. ### Better Onboarding New developers (or new AI agents) can onboard faster by reading the spec: - What problem are we solving? - What's the architecture? - What's the data model? - What tasks exist and how do they relate? ## How We Applied It to This Blog **Spec Location**: `/specs/007-portfolio-blog/` This entire blog project was built using the Spec-Driven approach: 1. **Problem Definition** (`plan.md`): Blog as portfolio enhancement, use Astro for performance, support cross-linking to demos 2. 
**Architecture Decisions** (`research.md`, `spec.md`): Astro with NX integration, content collections for Markdown, static site generation 3. **Data Model** (`data-model.md`): Blog Post schema with frontmatter, draft mode, optional demo linking 4. **Task Breakdown** (`tasks.md`): 45 granular tasks grouped into 6 phases, parallelization opportunities marked 5. **Implementation**: E2E tests first, then components, then pages 6. **Validation**: Lighthouse scores, accessibility checks, cross-browser testing The spec made it possible to: - Understand the complete scope upfront - Identify what could run in parallel - Delegate independent tasks to AI - Verify completion against clear criteria ## Tools That Enable Spec-Driven Development We use several tools to make spec-driven development practical: ### 1. **Speckit** - Specification Generation Speckit provides structured templates for: - `plan.md`: Problem definition and user stories - `spec.md`: Feature specification with constitution and requirements - `research.md`: Technical research and tool decisions - `data-model.md`: Entity schemas and validation rules - `tasks.md`: Actionable task breakdown - `quickstart.md`: Getting started guide Each template includes sections that force you to think through the problem completely before coding. ### 2. **NX** - Task-Based Build System NX's task graph makes spec-driven development easier: - Tasks map naturally to spec tasks - Dependencies between NX tasks can reflect spec dependencies - `--affected` flag lets you validate only changed specs - Task caching avoids redundant work ### 3. 
**AI Coding Agents** - Implementation Acceleration With a clear spec, agents can: - Implement tasks autonomously - Generate comprehensive tests - Handle refactoring consistently - Maintain architectural boundaries ## Real Example: This Blog Post Even this meta post follows spec-driven principles: - **Spec**: "Write a blog post explaining spec-driven development and how it's used in this portfolio" - **Acceptance Criteria**: - Explain what SDD is - Show the workflow - List benefits - Demonstrate with real example - Include code examples - Demonstrate cross-linking to portfolio - **Structure**: Outline created upfront, then sections filled in - **Validation**: Does it meet acceptance criteria? ✅ ## When Spec-Driven Development Shines Spec-driven development is most valuable for: - **Medium-to-large features** (small tasks don't need extensive specs) - **Architectural decisions** that impact many components - **Team projects** where alignment is critical - **AI-assisted development** where structure prevents chaos - **Personal projects using agent automation** (like this portfolio) For tiny bug fixes, spec-driven is overkill. But for anything interesting? Spec first. ## When It's Less Useful Spec-driven development isn't always the answer: - **Spike/exploration work**: When you're figuring out if something's possible, specs come after - **Hot fixes in production**: Urgent bugs need quick fixes, not 2 hours of spec writing - **Well-established patterns**: If you've done this exact thing before, the spec can be minimal Good engineers know when to spec and when to just build. ## Getting Started with Spec-Driven Development 1. **Define the problem**: What are we building? Why? For whom? 2. **Research options**: What tools/libraries exist? What are the trade-offs? 3. **Design the architecture**: How will components fit together? 4. **Model the data**: What entities exist? How do they relate? 5. **Break into tasks**: What specific work needs to happen? In what order? 6. 
**Execute tasks**: Implement one task at a time, with clear acceptance criteria 7. **Validate**: Does the result match the spec? ## Conclusion Spec-Driven Development isn't about bureaucracy—it's about clarity. A good spec is a contract between you and future-you, between teammates, and between humans and AI agents. When combined with AI-assisted development, specs become force multipliers. They give AI the structure it needs to be most effective, while keeping humans in control of architecture and strategy. This blog itself is proof: built faster and cleaner using spec-driven practices, with clear tasks that could be parallelized or delegated to agents. Try it on your next project. Start small—just write a plan before you code. See if it saves you rework. Once you experience the clarity, you'll likely come back to it. --- **Next in this series**: Architecture patterns for AI-assisted development (coming soon)