<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Portfolio Blog - Nick Maassel</title><description>Technical blog covering NX monorepos, Azure deployment, AI integration, and full-stack TypeScript development.</description><link>https://blog.maassel.dev/</link><language>en-us</language><lastBuildDate>Tue, 17 Mar 2026 22:59:41 GMT</lastBuildDate><item><title>Building Summit AI: A Real-Time Ski Schedule &amp; Weather App</title><link>https://blog.maassel.dev/posts/building-summit-ai/</link><guid isPermaLink="true">https://blog.maassel.dev/posts/building-summit-ai/</guid><description>How I built Summit AI, a modern web app that shows real-time ski area schedules, live webcams, and NOAA weather forecasts with a responsive design and powder alerts.</description><pubDate>Tue, 23 Dec 2025 00:00:00 GMT</pubDate><content:encoded>## Introduction

As a frequent visitor to Summit at Snoqualmie, I found myself constantly checking their website to see which base areas were open, what the weather looked like, and whether it was worth making the drive from Seattle. After too many times juggling multiple browser tabs, I decided to build **Summit AI** — a single-page app that consolidates all this information in one beautifully designed interface.

🚀 **Live App:** [summit-ai.maassel.dev](https://summit-ai.maassel.dev)

## Key Features

### 1. **Real-Time Schedule Data**

Summit AI scrapes the official Summit at Snoqualmie website daily to provide up-to-date information on:

- **5 Base Areas**: Summit West, Summit Central, Silver Fir, Alpental, and Summit East
- **Operating Hours**: Exact open/close times for each area
- **Status Indicators**: Color-coded badges (Open ✅, Closed ❌, TBD ⏳)
- **Special Tags**: &quot;Powder Magnet&quot; for Alpental (gets the most snow!)

![Summit AI Daily View](/images/summit-ai-daily-view.png)

### 2. **Live Weather Integration (NOAA)**

Instead of relying on third-party weather APIs with rate limits or paywalls, Summit AI fetches data directly from the **National Weather Service (NOAA)** for Snoqualmie Pass:

- **7-Day Forecast**: Temperature, conditions, wind speed
- **Snowfall Predictions**: Tracks accumulation for powder alerts
- **Powder Alerts**: Automatic banners when ≥3&quot; of snow is forecasted
- **Fresh Snow Badges**: 1-3&quot; = &quot;Fresh Snow,&quot; 3&quot;+ = &quot;POWDER ALERT&quot; ❄️

The NOAA API is free, reliable, and doesn&apos;t require authentication—perfect for a hobby project!
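
The two-step NWS lookup can be sketched like this; the coordinates and User-Agent string are illustrative placeholders, and the field names follow the public api.weather.gov schema (not necessarily how Summit AI structures its client):

```typescript
// Sketch of the NWS two-step lookup. Coordinates for Snoqualmie Pass are
// approximate and the User-Agent is a placeholder; field names follow the
// public api.weather.gov response schema.
const NWS_BASE = "https://api.weather.gov";

// Step 1: resolve a lat/lon to its gridpoint metadata.
function pointsUrl(lat: number, lon: number): string {
  return `${NWS_BASE}/points/${lat},${lon}`;
}

interface ForecastPeriod {
  name: string;          // e.g. "Tuesday Night"
  temperature: number;
  windSpeed: string;
  shortForecast: string; // e.g. "Snow Likely"
}

// Step 2: follow properties.forecast from the points response, then read
// properties.periods from the forecast response.
async function fetchForecast(lat: number, lon: number) {
  const opts = { headers: { "User-Agent": "summit-ai-demo (placeholder contact)" } };
  const point = await (await fetch(pointsUrl(lat, lon), opts)).json();
  const forecast = await (await fetch(point.properties.forecast, opts)).json();
  return forecast.properties.periods as ForecastPeriod[];
}
```

The NWS asks clients to identify themselves via the User-Agent header, which is why it appears even in this sketch.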

### 3. **Interactive Calendar Navigation**

A custom calendar component shows:

- **Week/day position indicators**: See where you are in the month
- **Snow icons on forecast days**: Visual indicators for expected snowfall
- **Quick date jumps**: &quot;Today&quot; and &quot;Tomorrow&quot; shortcuts for fast navigation

- **Clickable dates**: Jump to any day instantly

### 4. **Dual View Modes**

**Daily View** — Deep dive into a single date:

- All 5 base area statuses and hours
- Weather forecast details
- Live webcam previews (see below)
- Powder alerts with exact accumulation amounts

**Weekly View** — At-a-glance 7-day overview:

- Compact grid showing all areas across the week
- Snow accumulation badges on each day
- Color-coded status cells for quick scanning
- Click any day to jump to detailed daily view

![Summit AI Weekly View](/images/summit-ai-weekly-view.png)

### 5. **Live Webcam Integration**

One of my favorite features: **embedded YouTube webcams** showing live conditions at each base area.

**Desktop Experience:**

- **5 webcam previews** in the left sidebar (under the calendar)
- 2-column grid with labels for each area
- **Hover-to-expand**: Mouse over any webcam to see a full-size overlay in the center of the screen

**Mobile Experience:**

- Horizontal scrollable row of webcam previews
- Optimized for touch navigation
- Saves vertical space for schedule content

The webcams use YouTube&apos;s embed API with autoplay disabled by default (in line with browser autoplay policies) and are sourced from Summit&apos;s official channels.

### 6. **Fully Responsive Design**

Summit AI adapts seamlessly from desktop to mobile:

**Desktop (≥900px):**

- Side-by-side layout: Calendar/webcams on the left, schedule on the right
- Sticky positioning keeps calendar visible while scrolling
- Hover interactions for webcams and schedule details

**Mobile (&lt;900px):**

- Stacked vertical layout: Calendar → Webcams → Toggle → Schedule
- Touch-friendly buttons and navigation
- Optimized font sizes and spacing

![Summit AI Mobile View](/images/summit-ai-mobile-view.png)

### 7. **Powder Alert System**

The app automatically calculates snowfall based on NOAA forecasts and displays:

- **Powder Alert Banner** (≥3&quot; snow in 24hrs): Bright cyan highlight with snowfall amount
- **Fresh Snow Banner** (1-3&quot; snow): Green success banner
- **Calendar Snow Icons** (❄️): Visual indicators on dates with predicted snowfall

This makes it easy to spot the best days to hit the slopes!
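
A minimal sketch of that banner logic, using only the thresholds quoted above (nothing else about the real implementation is assumed):

```typescript
// Hypothetical sketch of the banner thresholds described in this section:
// 3"+ forecast = "POWDER ALERT", 1-3" = "Fresh Snow", otherwise no banner.
type SnowBadge = "POWDER ALERT" | "Fresh Snow" | null;

function snowBadge(inches: number): SnowBadge {
  if (inches >= 3) return "POWDER ALERT"; // bright cyan banner
  if (inches >= 1) return "Fresh Snow";   // green success banner
  return null;                            // no banner shown
}
```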

## Tech Stack

- **Frontend**: React 18 + TypeScript + Vite
- **UI Framework**: Material-UI (MUI) v5
- **Styling**: Design tokens for colors, typography, spacing
- **Data Sources**:
  - Summit at Snoqualmie (web scraping via backend API)
  - NOAA National Weather Service (public API)
  - YouTube (embedded live webcams)
- **Hosting**: Azure Static Web Apps (with CDN)
- **Backend**: Express API for schedule scraping and caching

## Design Philosophy

I wanted Summit AI to feel modern and polished while being blazingly fast. Key decisions:

1. **Gradient Header**: Teal-to-pink gradient matches ski culture vibes
2. **Status Color Coding**:
   - Green (Open) = Good to go ✅
   - Orange (Closed) = Stay home ❌
   - Yellow (TBD) = Check back later ⏳
3. **Typography Hierarchy**: Clear headings, readable body text, and compact data tables
4. **Micro-interactions**: Smooth transitions, hover effects, and animations
5. **Performance**: React Query for caching, lazy-loaded images, optimized bundle size

## Challenges &amp; Solutions

### Challenge 1: Real-Time Schedule Data

**Problem**: Summit&apos;s website doesn&apos;t have a public API.  
**Solution**: Built a backend scraper that runs daily (Azure Functions) and caches the schedule data as JSON. The frontend fetches from this cached endpoint.

### Challenge 2: Weather Forecast Parsing

**Problem**: NOAA&apos;s API returns raw text descriptions like &quot;Snow likely, mainly after 4pm.&quot;  
**Solution**: Implemented regex parsing to extract snowfall amounts (&quot;3 to 5 inches&quot;) and normalize them into numeric values for powder alerts.
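
A hedged sketch of that parsing step; the app&apos;s actual patterns may differ, but the two shapes NOAA commonly emits can be handled like this:

```typescript
// Illustrative snowfall extraction (not the app's exact regexes).
// Handles "3 to 5 inches" as a midpoint and "around 2 inches" as-is.
function parseSnowfallInches(text: string): number {
  const range = text.match(/(\d+) to (\d+) inches/i);
  if (range) return (Number(range[1]) + Number(range[2])) / 2; // midpoint of the range
  const single = text.match(/(?:around |about )?(\d+) inch(?:es)?/i);
  if (single) return Number(single[1]);
  return 0; // no measurable snowfall mentioned
}
```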

### Challenge 3: Responsive Webcams

**Problem**: YouTube embeds are heavyweight and can slow down the page.  
**Solution**: Used lazy loading (`loading=&quot;lazy&quot;`) and conditional rendering (only load on viewport visibility). Also disabled autoplay until user hovers (desktop) or taps (mobile).
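
One way to sketch the autoplay gating (video IDs and this helper are placeholders, not Summit&apos;s actual feeds or code):

```typescript
// Illustrative helper for the webcam iframes: autoplay stays off until the
// user hovers (desktop) or taps (mobile). The videoId is a placeholder.
function webcamEmbedUrl(videoId: string, userActivated: boolean): string {
  const autoplay = userActivated ? "1" : "0";
  // The iframe itself is rendered with loading="lazy", so offscreen
  // webcams never fetch until they approach the viewport.
  return `https://www.youtube.com/embed/${videoId}?autoplay=${autoplay}`;
}
```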

### Challenge 4: Calendar State Management

**Problem**: Syncing calendar selection with daily/weekly view navigation.  
**Solution**: Lifted state to the root `ScheduleView` component and passed callbacks down. The week view computes its Monday-start date range dynamically.
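
The Monday-start calculation can be sketched as a pure helper (an assumption about the app&apos;s internals, shown only to make the idea concrete):

```typescript
// Hypothetical helper: given any date, find the Monday that starts its week.
// JavaScript's getDay() treats Sunday as 0, so Sunday needs a special case.
function mondayOfWeek(d: Date): Date {
  const day = d.getDay();                 // 0 = Sunday ... 6 = Saturday
  const offset = day === 0 ? 6 : day - 1; // days to step back to Monday
  const monday = new Date(d);
  monday.setDate(d.getDate() - offset);
  return monday;
}
```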

## What&apos;s Next?

**Phase 2 Features** (coming soon):

- 🤖 **AI-Powered Insights**: &quot;Best for beginners today: Summit West&quot; (using GPT-4)
- 📊 **Historical Data**: &quot;This day last year had 12\&quot; of powder&quot;
- 🎿 **Crowd Predictions**: &quot;Expect heavy traffic on weekends&quot;
- 🔔 **Push Notifications**: &quot;Powder alert for tomorrow!&quot;

## Lessons Learned

1. **Design Tokens are Worth It**: Centralizing colors, spacing, and typography made theming painless.
2. **NOAA&apos;s API is Underrated**: Free, reliable, and well-documented. More devs should use it!
3. **Playwright for Screenshots**: Automated screenshot capture (used for this blog post!) is a game-changer for documentation.
4. **Monorepo Power**: NX made it easy to share types between frontend/backend and run tests across the entire stack.

## Try It Yourself

🔗 **Live App**: [summit-ai.maassel.dev](https://summit-ai.maassel.dev)  
💻 **Source Code**: (Private repo, but happy to discuss implementation!)

If you&apos;re a skier or snowboarder in the Pacific Northwest, give Summit AI a try and let me know what you think! I&apos;m always open to feedback and feature suggestions.

---

**Tags**: #React #TypeScript #Vite #MaterialUI #WeatherAPI #WebScraping #Azure #NXMonorepo #ResponsiveDesign

---

_Have questions about how I built this? Want to discuss the architecture or design decisions? Feel free to reach out!_</content:encoded><category>react</category><category>typescript</category><category>vite</category><category>material-ui</category><category>weather-api</category><category>web-scraping</category><author>Nick Maassel</author></item><item><title>Testing Production AI Apps: Two-Tier Strategy for LLM Function Calling</title><link>https://blog.maassel.dev/posts/testing-production-ai-apps/</link><guid isPermaLink="true">https://blog.maassel.dev/posts/testing-production-ai-apps/</guid><description>How to build reliable automated tests for non-deterministic AI systems using a two-tier approach: deterministic validation for CI/CD and AI judges for quality assessment.</description><pubDate>Thu, 11 Dec 2025 00:00:00 GMT</pubDate><content:encoded>## The Testing Paradox for AI Apps

Traditional software testing relies on determinism: given the same input, you get the same output. But AI systems—especially LLMs—are fundamentally non-deterministic. The same prompt can produce different responses every time.

So how do you write automated tests for production AI applications?

**You can&apos;t assert exact outputs, but you can validate behavior.**

This post shares the two-tier testing strategy we use for our [Function Calling Demo](/demos/function-calling), which uses OpenAI&apos;s GPT models to automatically select and execute backend APIs based on natural language queries.

## The Problem: Non-Determinism Meets TDD

Traditional TDD approach (doesn&apos;t work for LLMs):

```typescript
// ❌ This will be flaky
test(&apos;should answer weather question&apos;, () =&gt; {
  const response = llm.ask(&quot;What&apos;s the weather in Seattle?&quot;);
  expect(response).toBe(&quot;It&apos;s 52°F and rainy in Seattle.&quot;);
  // Fails 90% of the time - LLM phrases it differently
});
```

The LLM might say:

- &quot;It&apos;s 52°F and rainy in Seattle.&quot;
- &quot;Seattle is currently experiencing rainy weather at 52 degrees Fahrenheit.&quot;
- &quot;The weather in Seattle is rainy with a temperature of 52°F.&quot;
- &quot;Seattle: 52°F, precipitation expected.&quot;

All correct answers, but none match the assertion.

## Two-Tier Testing Strategy

We split testing into two complementary tiers:

### Tier 1: Deterministic Validation (Blocks CI/CD)

- ✅ **Tool selection correctness** - Did the LLM choose the right function?
- ✅ **Response structure** - Does the API return expected fields?
- ✅ **Semantic relevance** - Does the response contain keywords related to the question?
- 🚫 **Blocks PRs if failing**
- ⚡ **Fast** (~30 seconds for full suite)

### Tier 2: Quality Assessment (Advisory)

- 📊 **Response quality** - Is it coherent, helpful, and complete?
- 📊 **Model comparison** - Which model performs best (GPT-5 vs GPT-4.1)?
- 📊 **AI judge grading** - Another LLM evaluates quality
- 💡 **Advisory only** - Doesn&apos;t block PRs
- 🕐 **Slower** (~5 minutes for full evaluation)

## Tier 1: Real API Calls with Flexible Validation

Here&apos;s a real test from our function calling demo:

```typescript
it(&apos;should handle &quot;What\&apos;s the weekend schedule at Alpental?&quot;&apos;, async () =&gt; {
  // 1. Send real user question to production API
  const response = await fetch(`${API_BASE_URL}/api/agent-chat`, {
    method: &apos;POST&apos;,
    headers: { &apos;Content-Type&apos;: &apos;application/json&apos; }, // required for Express&apos;s JSON body parser
    body: JSON.stringify({
      message: &quot;What&apos;s the weekend schedule at Alpental?&quot;,
      context: {},
    }),
  });

  expect(response.status).toBe(200);
  const data = await response.json();

  // 2. Validate API response structure
  expect(data).toHaveProperty(&apos;requestId&apos;);
  expect(data).toHaveProperty(&apos;response&apos;);
  expect(data).toHaveProperty(&apos;toolsUsed&apos;);

  // 3. Validate tool selection (function calling behavior)
  const toolNames = data.toolsUsed.map((tool: any) =&gt; tool.name);
  expect(toolNames.length).toBeGreaterThan(0);
  expect(
    toolNames.some(
      (name: string) =&gt; name.includes(&apos;summit_schedule&apos;) // Right category of tool
    )
  ).toBe(true);

  // 4. Validate semantic relevance (flexible regex)
  expect(data.response.toLowerCase()).toMatch(/alpental|weekend|saturday|sunday/);
  // Any of these keywords prove the response is relevant
});
```

### What We&apos;re Validating

**✅ Tool Selection** (Most Critical)

```typescript
expect(toolNames.some((name) =&gt; name.includes(&apos;summit_schedule&apos;))).toBe(true);
```

If the LLM chose `weather_api` instead of `summit_schedule`, that&apos;s a regression—even if the response sounds plausible.

**✅ Response Structure** (API Contract)

```typescript
expect(data).toHaveProperty(&apos;toolsUsed&apos;);
```

The API shape must remain stable for frontend consumers.

**✅ Semantic Relevance** (Flexible Keywords)

```typescript
expect(data.response.toLowerCase()).toMatch(/alpental|weekend|saturday|sunday/);
```

We&apos;re not checking exact wording—just that the response is _about_ the right topic.

### What We&apos;re NOT Validating

❌ **Exact wording** - LLMs rephrase constantly  
❌ **Grammar/style** - Subjective and changes with model updates  
❌ **Tone** - That&apos;s a quality concern (Tier 2)  
❌ **Completeness** - That&apos;s also Tier 2

## Why This Works

### 1. **Catches Real Regressions**

- System prompt changes that break tool selection
- Tool definition changes that confuse the LLM
- MCP server connectivity issues
- Response formatting bugs

### 2. **Tolerates Non-Determinism**

- LLM can phrase answers differently each time
- Minor wording variations don&apos;t fail tests
- Focuses on **behavior** not **exact output**

### 3. **Fast Enough for CI/CD**

- Each test ~3-5 seconds (real LLM API call)
- Full suite ~30 seconds
- Acceptable for pre-push hooks

### 4. **Real Integration Testing**

```plaintext
Entire stack is exercised:
User Question
  → Express API route
    → OpenAI Function Calling
      → MCP Server
        → Backend API
          → Data source
      → LLM formats response
    → Returns to user
```

This is **real integration testing**, not mocked unit tests.

## Tier 2: AI Judge for Quality Assessment

While Tier 1 blocks regressions, Tier 2 evaluates **quality** using another LLM as a judge:

```typescript
// Simplified example - actual implementation is more sophisticated
const evaluateResponse = async (question: string, response: string) =&gt; {
  const judgePrompt = `
    Evaluate this AI assistant response on a scale of 1-10:
    
    Question: ${question}
    Response: ${response}
    
    Criteria:
    - Accuracy: Does it answer the question correctly?
    - Completeness: Is all relevant information included?
    - Clarity: Is the response easy to understand?
    - Conciseness: Is it appropriately brief?
    
    Return JSON: { &quot;score&quot;: 8, &quot;reasoning&quot;: &quot;...&quot; }
  `;

  const judgment = await gpt5.evaluate(judgePrompt);
  return judgment;
};
```

### What Tier 2 Evaluates

**📊 Response Quality**

- Coherence and readability
- Appropriate level of detail
- Helpful and user-friendly

**📊 Model Comparison**

- GPT-5 vs GPT-4.1-nano performance
- Cost vs quality trade-offs
- Which model handles edge cases better

**📊 Regression Detection Over Time**

- Are responses getting worse with model updates?
- Is the system prompt still effective?

### Why Tier 2 Doesn&apos;t Block PRs

- **Subjective metrics** - Quality is harder to define than correctness
- **Model updates** - OpenAI can change model behavior without warning
- **Cost concerns** - Running AI judges on every PR is expensive
- **Speed** - Takes 5+ minutes for comprehensive evaluation

Instead, Tier 2 runs:

- **Nightly** - Against production endpoints
- **On-demand** - When investigating quality issues
- **Before releases** - To ensure no quality degradation

## Real-World Example: Weekend Query Bug

We recently fixed a bug where weekend queries returned only Saturday&apos;s schedule without mentioning Sunday. Here&apos;s how the two-tier approach caught it:

### Tier 1 Test (Caught the Regression)

```typescript
it(&apos;should query BOTH days for &quot;next weekend&quot;&apos;, async () =&gt; {
  const response = await askAgent(&quot;What&apos;s open next weekend?&quot;);

  // Extract dates from tool calls
  const dates = response.toolsUsed.flatMap((tool) =&gt; tool.arguments?.date).filter(Boolean);

  // Must query both Saturday AND Sunday
  expect(dates.length).toBe(2);
  expect(dates).toContain(&apos;2025-12-20&apos;); // Saturday
  expect(dates).toContain(&apos;2025-12-21&apos;); // Sunday
});
```

This test **blocked the PR** until we fixed the system prompt to explicitly instruct the LLM to query both days.

### Tier 2 Evaluation (Assessed User Experience)

```json
{
  &quot;question&quot;: &quot;What&apos;s open next weekend?&quot;,
  &quot;score&quot;: 6,
  &quot;reasoning&quot;: &quot;Response mentions Saturday schedule but doesn&apos;t explicitly state Sunday hours. User might assume Sunday is closed when it&apos;s actually open. Incomplete information.&quot;
}
```

The AI judge identified the **user experience problem** even though Tier 1 didn&apos;t catch it initially (we added that test after).

## Coverage Strategy

Our function calling demo has 100% coverage of UI sample questions:

| Sample Question                            | Test Coverage      |
| ------------------------------------------ | ------------------ |
| &quot;What&apos;s open at Summit today?&quot;             | ✅ Tier 1 + Tier 2 |
| &quot;What&apos;s the weekend schedule at Alpental?&quot; | ✅ Tier 1 + Tier 2 |
| &quot;Tell me about the Summit West base area&quot;  | ✅ Tier 1 + Tier 2 |
| &quot;What was open yesterday at the summit?&quot;   | ✅ Tier 1 + Tier 2 |
| &quot;What&apos;s the weather like at Summit?&quot;       | ✅ Tier 1 + Tier 2 |

**Tier 1 Tests**: 16 tests covering tool selection and semantic relevance  
**Tier 2 Evaluations**: 20+ golden prompts for quality assessment

## Implementation Details

### Test Infrastructure

**Tier 1 Location**: `apps/deployable/api/generative-ai-api/src/tests/`

- `agent-chat-llm-integration.spec.ts` - Main demo sample questions
- `agent-chat-weekend-queries.spec.ts` - Weekend-specific edge cases
- `agent-chat-area-filtering.spec.ts` - Area filtering logic

**Tier 2 Location**: `apps/evaluation/llm-evaluations/`

- `src/golden-prompts/` - Curated test cases
- `src/ai-judge/` - GPT-5 evaluation logic
- `src/comparison/` - Multi-model comparison

### Running the Tests

**Tier 1 (CI/CD)**:

```bash
# Run on every push (pre-push hook)
nx test generative-ai-api --testPathPattern=agent-chat

# Prerequisites:
# - OPENAI_API_KEY environment variable
# - API server running (or auto-started by tests)
```

**Tier 2 (On-Demand)**:

```bash
# Run full evaluation suite
npx nx run llm-evaluations:eval:function-calling

# Compare models
npx nx run llm-evaluations:compare --models=gpt-5,gpt-4.1-nano
```

## Lessons Learned

### 1. **Start with Tier 1, Add Tier 2 Later**

Get the deterministic tests working first. They provide immediate feedback and catch most bugs.

### 2. **Don&apos;t Assert Exact Strings**

Use flexible regex patterns with alternation:

```typescript
// ✅ Good
expect(response).toMatch(/open|available|accessible/);

// ❌ Bad
expect(response).toContain(&apos;The area is currently open&apos;);
```

### 3. **Test Tool Selection First**

The most important validation is: **Did the LLM choose the right function?**

If tool selection is correct, response quality issues can be fixed with prompt tuning. If tool selection is wrong, the system is fundamentally broken.

### 4. **Golden Prompts are Gold**

Curate a set of &quot;golden&quot; test cases that represent real user queries. Protect these fiercely—they&apos;re your regression suite.

### 5. **AI Judges Need Structure**

Give your AI judge clear criteria and ask for JSON output. Freeform evaluations are hard to aggregate.
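
One hedged way to enforce that structure is to validate the judge&apos;s output before aggregating; the field names mirror the `judgePrompt` example earlier, and everything else here is illustrative:

```typescript
// Validate judge output before trusting it (sketch; field names follow the
// judgePrompt example above). Malformed or out-of-range output is rejected
// rather than silently averaged into the results.
interface Judgment {
  score: number;    // 1-10 per the rubric
  reasoning: string;
}

function parseJudgment(raw: string): Judgment | null {
  try {
    const parsed = JSON.parse(raw);
    const score = Number(parsed.score);
    if (!(score >= 1)) return null;  // rejects NaN and low scores
    if (!(10 >= score)) return null; // rejects scores above the rubric
    if (typeof parsed.reasoning !== "string") return null;
    return { score, reasoning: parsed.reasoning };
  } catch {
    return null; // judge returned freeform text instead of JSON
  }
}
```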

### 6. **Cost vs Coverage Trade-Offs**

- Tier 1: Run on every commit (~$0.02 per run)
- Tier 2: Run nightly (~$2 per full evaluation)

Budget accordingly.

## Common Pitfalls

### ❌ Over-Reliance on Mocks

```typescript
// This doesn&apos;t test the actual LLM behavior
mock(llm).toReturn({ tool: &apos;summit_schedule&apos;, args: {...} });
```

Use real API calls for integration tests. Mock at the boundary (external services), not at the LLM.

### ❌ Exact String Matching

```typescript
expect(response).toBe(&apos;Alpental is open 9am-4pm&apos;);
// Flaky! LLM might say &quot;Alpental: 9:00 AM - 4:00 PM&quot;
```

### ❌ Testing Too Many Things at Once

```typescript
// Bad: Tests tool selection, response quality, and formatting
expect(response).toBe(EXACT_EXPECTED_OUTPUT);

// Good: Test one thing at a time
expect(toolsUsed).toContain(&apos;summit_schedule&apos;); // Tool selection
expect(response).toMatch(/alpental/i); // Relevance
// Tier 2 handles quality
```

### ❌ Ignoring Model Updates

OpenAI updates models regularly. What worked yesterday might break tomorrow. Monitor Tier 2 trends to catch degradation.

## Industry Comparison: Other Approaches

We&apos;re not the only ones solving this problem. Here are alternative approaches:

### **Prompt Regression Testing** (used by LangChain)

- Store prompt templates in version control
- Test that template changes don&apos;t break known cases
- Focus on prompt engineering rather than output validation

### **LLM-as-a-Judge** (used by OpenAI for GPT-4 evals)

- Use a stronger model to evaluate weaker models
- Constitutional AI approach
- Our Tier 2 is inspired by this

### **Assertion-Based Testing** (used by PromptLayer)

- Define semantic assertions like &quot;response contains date&quot;
- Use NLP to validate claims rather than string matching
- More sophisticated than our keyword approach

### **Human-in-the-Loop** (used by Anthropic)

- Sample responses sent to humans for rating
- Gold standard for quality, but doesn&apos;t scale
- We reserve this for Tier 2 evaluation of edge cases

## Future Improvements

### 1. **Semantic Similarity Scoring**

Instead of keyword matching, use embeddings to measure semantic distance:

```typescript
const similarity = cosineSimilarity(
  embed(response),
  embed(&apos;Expected to mention Alpental and weekend hours&apos;)
);
expect(similarity).toBeGreaterThan(0.8);
```
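
The `cosineSimilarity` call in that sketch is standard math; a minimal implementation over plain number arrays might look like this (the `embed()` call would come from an embeddings API and is not shown):

```typescript
// Plain cosine similarity over embedding vectors: dot product divided by
// the product of the vector norms. Assumes non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [i, x] of a.entries()) {
    dot += x * b[i];
    normA += x * x;
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```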

### 2. **Automated Golden Prompt Generation**

Use LLMs to generate diverse test cases based on existing ones:

```typescript
const variants = await generateVariants(&quot;What&apos;s open at Summit today?&quot;, { count: 10, diversity: &apos;high&apos; });
// &quot;Which areas are operational at Summit right now?&quot;
// &quot;Tell me what&apos;s currently available at Summit&quot;
// etc.
```

### 3. **Continuous Monitoring**

- Track Tier 2 scores over time
- Alert when scores drop below baseline
- Correlate with model version updates

### 4. **Multi-Modal Testing**

Extend to test function calling with images, audio, and video inputs.

## Try It Yourself

All the code for this testing strategy is open source in our [portfolio monorepo](https://github.com/nsmaassel/nx-portfolio-monorepo):

- **Tier 1 tests**: `apps/deployable/api/generative-ai-api/src/tests/`
- **Tier 2 evaluations**: `apps/evaluation/llm-evaluations/`
- **Testing docs**: `docs/TESTING_SPECIFICATION.md`

The [Function Calling Demo](/demos/function-calling) shows this system in action.

## Conclusion

Testing AI applications requires rethinking traditional TDD principles:

- ✅ **Do** validate behavior (tool selection, semantic relevance)
- ❌ **Don&apos;t** assert exact outputs
- ✅ **Do** split deterministic (Tier 1) from quality (Tier 2) tests
- ❌ **Don&apos;t** block CI/CD on subjective quality metrics
- ✅ **Do** use real API calls for integration tests
- ❌ **Don&apos;t** over-rely on mocks

This two-tier approach gives us:

- **Fast, reliable CI/CD** (Tier 1 blocks regressions)
- **Quality insights** (Tier 2 guides improvements)
- **Cost control** (Tier 1 is cheap, Tier 2 runs selectively)

AI systems are inherently non-deterministic, but that doesn&apos;t mean they&apos;re untestable. You just need the right strategy.

---

**Want to dive deeper?** Check out the [related demo](/demos/function-calling) to see this testing strategy in action, or explore our [testing specification](https://github.com/nsmaassel/nx-portfolio-monorepo/blob/main/docs/TESTING_SPECIFICATION.md) for implementation details.

**Next in this series**: Evaluating LLM responses with AI judges (deep dive into Tier 2)</content:encoded><category>ai-testing</category><category>llm</category><category>testing</category><category>function-calling</category><category>quality-assurance</category><author>Nick Maassel</author></item><item><title>Spec-Driven Development: Augmenting Modern Software Practices with AI</title><link>https://blog.maassel.dev/posts/spec-driven-development/</link><guid isPermaLink="true">https://blog.maassel.dev/posts/spec-driven-development/</guid><description>How to combine traditional best practices like TDD with AI-assisted development tools to improve estimation, reduce over-engineering, and set better expectations.</description><pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate><content:encoded>## What is Spec-Driven Development?

Spec-Driven Development (SDD) combines the rigor of formal specifications with the velocity of AI-assisted development. Rather than writing code first and documentation later, we define clear specifications upfront, then use AI tools to accelerate implementation while maintaining architectural integrity.

This approach is especially powerful when paired with tools like Speckit, which provides structured templates for planning, architectural decisions, data models, and implementation tasks.

## The Problem We&apos;re Solving

Traditional software development cycles often suffer from:

- **Unclear requirements**: Vague user stories lead to misaligned implementations
- **Scope creep**: Features grow unbounded without clear acceptance criteria
- **Rework cycles**: Architectural misunderstandings discovered mid-development
- **Poor estimation**: Task sizing is guesswork without detailed planning
- **Over-engineering**: Without clear boundaries, developers add unnecessary complexity

Meanwhile, AI-assisted development is powerful but needs structure:

- Raw AI code generation can be chaotic without direction
- AI excels at implementation but needs architecture guidance
- Output quality depends on input clarity

**Spec-Driven Development is the connective tissue that turns AI&apos;s raw power into directed progress.**

## The Spec-Driven Workflow

```plaintext
1. PLAN (Async, Human-led)
   ↓
   Define problem, constraints, user stories
   Output: Detailed specification document

2. ARCHITECTURE (Async, Human-led)
   ↓
   Review spec, make architectural decisions
   Identify data models, API contracts, integration points
   Output: Technical architecture &amp; data model diagrams

3. TASK GENERATION (Async, AI-assisted)
   ↓
   AI generates granular, parallelizable tasks
   Human refines task breakdown and dependencies
   Output: Actionable task checklist

4. IMPLEMENTATION (Async, AI-accelerated)
   ↓
   Developers implement tasks using AI coding agents
   Spec ensures consistency and prevents rework
   Output: Working features, tested code

5. VALIDATION (Async, Human-led)
   ↓
   E2E tests verify spec compliance
   Acceptance criteria checked off
   Output: Completed feature ready for review
```

## Benefits of Spec-Driven Development

### Better Estimation

With detailed task breakdowns upfront, estimation becomes much more accurate. You know:

- How many tasks there are
- Approximate complexity of each task
- Dependencies between tasks
- Parallelization opportunities

### Reduced Rework

Clear architectural decisions prevent &quot;wait, should we do it this way?&quot; mid-implementation. Everyone&apos;s aligned on:

- Data model structure
- API contract design
- Component boundaries
- Edge cases and error handling

### AI-Friendly Development

Specifications provide the &quot;context&quot; that AI tools need to be most effective. AI can:

- Generate scaffolding from spec
- Create thorough test coverage
- Implement well-defined interfaces
- Handle tedious implementation details

### Parallel Execution

With clear task boundaries and minimal dependencies, teams can work in parallel. Spec-driven development explicitly identifies:

- Which tasks are independent (marked [P])
- Which tasks have dependencies
- Optimal execution order

This is massive for solo developers using AI agents—you can delegate independent tasks to agents while you focus on architecture and validation.
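
As a hypothetical illustration (not the actual `tasks.md` from this repo), the `[P]` markers might look like:

```markdown
## Phase 2: Components

- [ ] T012 [P] Create PostCard component (no dependencies)
- [ ] T013 [P] Create TagList component (no dependencies)
- [ ] T014 Wire PostCard and TagList into the index page (depends on T012, T013)
```

T012 and T013 can be handed to two agents simultaneously; T014 waits for both.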

### Better Onboarding

New developers (or new AI agents) can onboard faster by reading the spec:

- What problem are we solving?
- What&apos;s the architecture?
- What&apos;s the data model?
- What tasks exist and how do they relate?

## How We Applied It to This Blog

**Spec Location**: `/specs/007-portfolio-blog/`

This entire blog project was built using the Spec-Driven approach:

1. **Problem Definition** (`plan.md`): Blog as portfolio enhancement, use Astro for performance, support cross-linking to demos
2. **Architecture Decisions** (`research.md`, `spec.md`): Astro with NX integration, content collections for Markdown, static site generation
3. **Data Model** (`data-model.md`): Blog Post schema with frontmatter, draft mode, optional demo linking
4. **Task Breakdown** (`tasks.md`): 45 granular tasks grouped into 6 phases, parallelization opportunities marked
5. **Implementation**: E2E tests first, then components, then pages
6. **Validation**: Lighthouse scores, accessibility checks, cross-browser testing

The spec made it possible to:

- Understand the complete scope upfront
- Identify what could run in parallel
- Delegate independent tasks to AI
- Verify completion against clear criteria

## Tools That Enable Spec-Driven Development

We use several tools to make spec-driven development practical:

### 1. **Speckit** - Specification Generation

Speckit provides structured templates for:

- `plan.md`: Problem definition and user stories
- `spec.md`: Feature specification with constitution and requirements
- `research.md`: Technical research and tool decisions
- `data-model.md`: Entity schemas and validation rules
- `tasks.md`: Actionable task breakdown
- `quickstart.md`: Getting started guide

Each template includes sections that force you to think through the problem completely before coding.

### 2. **NX** - Task-Based Build System

NX&apos;s task graph makes spec-driven development easier:

- Tasks map naturally to spec tasks
- Dependencies between NX tasks can reflect spec dependencies
- `--affected` flag lets you validate only changed specs
- Task caching avoids redundant work

### 3. **AI Coding Agents** - Implementation Acceleration

With a clear spec, agents can:

- Implement tasks autonomously
- Generate comprehensive tests
- Handle refactoring consistently
- Maintain architectural boundaries

## Real Example: This Blog Post

Even this meta post follows spec-driven principles:

- **Spec**: &quot;Write a blog post explaining spec-driven development and how it&apos;s used in this portfolio&quot;
- **Acceptance Criteria**:
  - Explain what SDD is
  - Show the workflow
  - List benefits
  - Demonstrate with real example
  - Include code examples
  - Demonstrate cross-linking to portfolio
- **Structure**: Outline created upfront, then sections filled in
- **Validation**: Does it meet acceptance criteria? ✅

## When Spec-Driven Development Shines

Spec-driven development is most valuable for:

- **Medium-to-large features** (small tasks don&apos;t need extensive specs)
- **Architectural decisions** that impact many components
- **Team projects** where alignment is critical
- **AI-assisted development** where structure prevents chaos
- **Personal projects using agent automation** (like this portfolio)

For tiny bug fixes, spec-driven is overkill. But for anything interesting? Spec first.

## When It&apos;s Less Useful

Spec-driven development isn&apos;t always the answer:

- **Spike/exploration work**: When you&apos;re figuring out if something&apos;s possible, specs come after
- **Hot fixes in production**: Urgent bugs need quick fixes, not 2 hours of spec writing
- **Well-established patterns**: If you&apos;ve done this exact thing before, the spec can be minimal

Good engineers know when to spec and when to just build.

## Getting Started with Spec-Driven Development

1. **Define the problem**: What are we building? Why? For whom?
2. **Research options**: What tools/libraries exist? What are the trade-offs?
3. **Design the architecture**: How will components fit together?
4. **Model the data**: What entities exist? How do they relate?
5. **Break into tasks**: What specific work needs to happen? In what order?
6. **Execute tasks**: Implement one task at a time, with clear acceptance criteria
7. **Validate**: Does the result match the spec?

## Conclusion

Spec-Driven Development isn&apos;t about bureaucracy—it&apos;s about clarity. A good spec is a contract between you and future-you, between teammates, and between humans and AI agents.

When combined with AI-assisted development, specs become force multipliers. They give AI the structure it needs to be most effective, while keeping humans in control of architecture and strategy.

This blog itself is proof: built faster and cleaner using spec-driven practices, with clear tasks that could be parallelized or delegated to agents.

Try it on your next project. Start small—just write a plan before you code. See if it saves you rework. Once you experience the clarity, you&apos;ll likely come back to it.

---

**Next in this series**: Architecture patterns for AI-assisted development (coming soon)</content:encoded><category>spec-driven-development</category><category>ai-development</category><category>best-practices</category><category>speckit</category><author>Nick Maassel</author></item></channel></rss>