The Testing Paradox for AI Apps

Traditional software testing relies on determinism: given the same input, you get the same output. But AI systems—especially LLMs—are fundamentally non-deterministic. The same prompt can produce different responses every time.

So how do you write automated tests for production AI applications?

You can’t assert exact outputs, but you can validate behavior.

This post shares the two-tier testing strategy we use for our Function Calling Demo, which uses OpenAI’s GPT models to automatically select and execute backend APIs based on natural language queries.

The Problem: Non-Determinism Meets TDD

Traditional TDD approach (doesn’t work for LLMs):

// ❌ This will be flaky
test('should answer weather question', async () => {
  const response = await llm.ask("What's the weather in Seattle?");
  expect(response).toBe("It's 52°F and rainy in Seattle.");
  // Fails most of the time - the LLM phrases it differently
});

The LLM might say:

  • “It’s 52°F and rainy in Seattle.”
  • “Seattle is currently experiencing rainy weather at 52 degrees Fahrenheit.”
  • “The weather in Seattle is rainy with a temperature of 52°F.”
  • “Seattle: 52°F, precipitation expected.”

All correct answers, but none match the assertion.

Two-Tier Testing Strategy

We split testing into two complementary tiers:

Tier 1: Deterministic Validation (Blocks CI/CD)

  • Tool selection correctness - Did the LLM choose the right function?
  • Response structure - Does the API return expected fields?
  • Semantic relevance - Does the response contain keywords related to the question?
  • 🚫 Blocks PRs if failing
  • Fast (~30 seconds for full suite)

Tier 2: Quality Assessment (Advisory)

  • 📊 Response quality - Is it coherent, helpful, and complete?
  • 📊 Model comparison - Which model performs best (GPT-5 vs GPT-4.1-nano)?
  • 📊 AI judge grading - Another LLM evaluates quality
  • 💡 Advisory only - Doesn’t block PRs
  • 🕐 Slower (~5 minutes for full evaluation)

Tier 1: Real API Calls with Flexible Validation

Here’s a real test from our function calling demo:

it('should handle "What\'s the weekend schedule at Alpental?"', async () => {
  // 1. Send real user question to production API
  const response = await fetch(`${API_BASE_URL}/api/agent-chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: "What's the weekend schedule at Alpental?",
      context: {},
    }),
  });

  expect(response.status).toBe(200);
  const data = await response.json();

  // 2. Validate API response structure
  expect(data).toHaveProperty('requestId');
  expect(data).toHaveProperty('response');
  expect(data).toHaveProperty('toolsUsed');

  // 3. Validate tool selection (function calling behavior)
  const toolNames = data.toolsUsed.map((tool: any) => tool.name);
  expect(toolNames.length).toBeGreaterThan(0);
  expect(
    toolNames.some(
      (name: string) => name.includes('summit_schedule') // Right category of tool
    )
  ).toBe(true);

  // 4. Validate semantic relevance (flexible regex)
  expect(data.response.toLowerCase()).toMatch(/alpental|weekend|saturday|sunday/);
  // Any of these keywords prove the response is relevant
});

What We’re Validating

✅ Tool Selection (Most Critical)

expect(toolNames.some((name) => name.includes('summit_schedule'))).toBe(true);

If the LLM chose weather_api instead of summit_schedule, that’s a regression—even if the response sounds plausible.

✅ Response Structure (API Contract)

expect(data).toHaveProperty('toolsUsed');

The API shape must remain stable for frontend consumers.

✅ Semantic Relevance (Flexible Keywords)

expect(data.response.toLowerCase()).toMatch(/alpental|weekend|saturday|sunday/);

We’re not checking exact wording—just that the response is about the right topic.

What We’re NOT Validating

  • Exact wording - LLMs rephrase constantly
  • Grammar/style - Subjective and changes with model updates
  • Tone - That’s a quality concern (Tier 2)
  • Completeness - That’s also Tier 2

Why This Works

1. Catches Real Regressions

  • System prompt changes that break tool selection
  • Tool definition changes that confuse the LLM
  • MCP server connectivity issues
  • Response formatting bugs

2. Tolerates Non-Determinism

  • LLM can phrase answers differently each time
  • Minor wording variations don’t fail tests
  • Focuses on behavior, not exact output

3. Fast Enough for CI/CD

  • Each test ~3-5 seconds (real LLM API call)
  • Full suite ~30 seconds
  • Acceptable for pre-push hooks

4. Real Integration Testing

// Entire stack is exercised:
User Question
  → Express API route
  → OpenAI Function Calling
  → MCP Server
  → Backend API
  → Data source
  → LLM formats response
  → Returns to user

This is real integration testing, not mocked unit tests.

Tier 2: AI Judge for Quality Assessment

While Tier 1 blocks regressions, Tier 2 evaluates quality using another LLM as a judge:

// Simplified example - actual implementation is more sophisticated
const evaluateResponse = async (question: string, response: string) => {
  const judgePrompt = `
    Evaluate this AI assistant response on a scale of 1-10:
    
    Question: ${question}
    Response: ${response}
    
    Criteria:
    - Accuracy: Does it answer the question correctly?
    - Completeness: Is all relevant information included?
    - Clarity: Is the response easy to understand?
    - Conciseness: Is it appropriately brief?
    
    Return JSON: { "score": 8, "reasoning": "..." }
  `;

  const judgment = await gpt5.evaluate(judgePrompt);
  return judgment;
};
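
To make individual judgments useful over time, the evaluation run loops the judge over every golden prompt and aggregates the scores. Here’s a minimal sketch, assuming the evaluateResponse helper above plus an askAgent helper and a goldenPrompts list like the ones referenced elsewhere in this post:

// Sketch only - askAgent and goldenPrompts stand in for the real evaluation harness
interface JudgeResult {
  score: number;     // 1-10, as requested in the judge prompt
  reasoning: string;
}

const runEvaluationSuite = async (
  goldenPrompts: { question: string }[]
): Promise<{ averageScore: number; results: JudgeResult[] }> => {
  const results: JudgeResult[] = [];

  for (const { question } of goldenPrompts) {
    const answer = await askAgent(question);                          // real agent call
    const judgment = await evaluateResponse(question, answer.response);
    results.push(judgment as JudgeResult);
  }

  const averageScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  return { averageScore, results };
};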

What Tier 2 Evaluates

📊 Response Quality

  • Coherence and readability
  • Appropriate level of detail
  • Helpful and user-friendly

📊 Model Comparison

  • GPT-5 vs GPT-4.1-nano performance
  • Cost vs quality trade-offs
  • Which model handles edge cases better

📊 Regression Detection Over Time

  • Are responses getting worse with model updates?
  • Is the system prompt still effective?

Why Tier 2 Doesn’t Block PRs

  • Subjective metrics - Quality is harder to define than correctness
  • Model updates - OpenAI can change model behavior without warning
  • Cost concerns - Running AI judges on every PR is expensive
  • Speed - Takes 5+ minutes for comprehensive evaluation

Instead, Tier 2 runs:

  • Nightly - Against production endpoints
  • On-demand - When investigating quality issues
  • Before releases - To ensure no quality degradation

Real-World Example: Weekend Query Bug

We recently fixed a bug where weekend queries returned only Saturday’s schedule without mentioning Sunday. Here’s how the two-tier approach caught it:

Tier 1 Test (Caught the Regression)

it('should query BOTH days for "next weekend"', async () => {
  const response = await askAgent("What's open next weekend?");

  // Extract dates from tool calls
  const dates = response.toolsUsed.flatMap((tool) => tool.arguments?.date).filter(Boolean);

  // Must query both Saturday AND Sunday
  expect(dates.length).toBe(2);
  expect(dates).toContain('2025-12-20'); // Saturday
  expect(dates).toContain('2025-12-21'); // Sunday
});

This test blocked the PR until we fixed the system prompt to explicitly instruct the LLM to query both days.
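
The exact wording lives in the repo’s system prompt, but the fix amounted to an explicit rule along these lines (an illustrative sketch, not the verbatim production prompt):

// Illustrative excerpt - the production system prompt is longer and phrased differently
const WEEKEND_RULE = `
When the user asks about "the weekend" or "next weekend", call the schedule
tool once for Saturday's date AND once for Sunday's date. Never answer a
weekend question from a single day's data.
`;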

Tier 2 Evaluation (Assessed User Experience)

{
  "question": "What's open next weekend?",
  "score": 6,
  "reasoning": "Response mentions Saturday schedule but doesn't explicitly state Sunday hours. User might assume Sunday is closed when it's actually open. Incomplete information."
}

The AI judge flagged the user-experience problem even though Tier 1 didn’t catch it initially (we added the Tier 1 test afterward).

Coverage Strategy

Our function calling demo has 100% coverage of UI sample questions:

Sample Question                                   | Test Coverage
“What’s open at Summit today?”                    | ✅ Tier 1 + Tier 2
“What’s the weekend schedule at Alpental?”        | ✅ Tier 1 + Tier 2
“Tell me about the Summit West base area”         | ✅ Tier 1 + Tier 2
“What was open yesterday at the summit?”          | ✅ Tier 1 + Tier 2
“What’s the weather like at Summit?”              | ✅ Tier 1 + Tier 2

Tier 1 Tests: 16 tests covering tool selection and semantic relevance
Tier 2 Evaluations: 20+ golden prompts for quality assessment

Implementation Details

Test Infrastructure

Tier 1 Location: apps/deployable/api/generative-ai-api/src/tests/

  • agent-chat-llm-integration.spec.ts - Main demo sample questions
  • agent-chat-weekend-queries.spec.ts - Weekend-specific edge cases
  • agent-chat-area-filtering.spec.ts - Area filtering logic

Tier 2 Location: apps/evaluation/llm-evaluations/

  • src/golden-prompts/ - Curated test cases
  • src/ai-judge/ - GPT-5 evaluation logic
  • src/comparison/ - Multi-model comparison

Running the Tests

Tier 1 (CI/CD):

# Run on every push (pre-push hook)
nx test generative-ai-api --testPathPattern=agent-chat

# Prerequisites:
# - OPENAI_API_KEY environment variable
# - API server running (or auto-started by tests)

Tier 2 (On-Demand):

# Run full evaluation suite
npx nx run llm-evaluations:eval:function-calling

# Compare models
npx nx run llm-evaluations:compare --models=gpt-5,gpt-4.1-nano

Lessons Learned

1. Start with Tier 1, Add Tier 2 Later

Get the deterministic tests working first. They provide immediate feedback and catch most bugs.

2. Don’t Assert Exact Strings

Use flexible regex patterns with alternation:

// ✅ Good
expect(response).toMatch(/open|available|accessible/);

// ❌ Bad
expect(response).toContain('The area is currently open');

3. Test Tool Selection First

The most important validation is: Did the LLM choose the right function?

If tool selection is correct, response quality issues can be fixed with prompt tuning. If tool selection is wrong, the system is fundamentally broken.

4. Golden Prompts are Gold

Curate a set of “golden” test cases that represent real user queries. Protect these fiercely—they’re your regression suite.
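
In practice a golden prompt is just data you check into the repo. A minimal shape might look like this (the field names are illustrative assumptions, not the exact schema in src/golden-prompts/):

// Hypothetical golden prompt entry - adapt the fields to your own suite
interface GoldenPrompt {
  id: string;                // stable identifier, e.g. 'weekend-schedule-alpental'
  question: string;          // the exact user query to replay
  expectedTools: string[];   // tool-name substrings the LLM must select (Tier 1)
  relevancePattern: RegExp;  // flexible keywords for the semantic check (Tier 1)
}

const goldenPrompts: GoldenPrompt[] = [
  {
    id: 'weekend-schedule-alpental',
    question: "What's the weekend schedule at Alpental?",
    expectedTools: ['summit_schedule'],
    relevancePattern: /alpental|weekend|saturday|sunday/,
  },
];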

5. AI Judges Need Structure

Give your AI judge clear criteria and ask for JSON output. Freeform evaluations are hard to aggregate.
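
One way to enforce that structure is to validate the judge’s reply against a schema before aggregating it. Here’s a sketch using zod (an assumption; the actual evaluation code may validate differently):

import { z } from 'zod';

// The judge's reply must parse as JSON and satisfy this schema
const JudgeSchema = z.object({
  score: z.number().min(1).max(10),
  reasoning: z.string().min(1),
});

const parseJudgment = (raw: string) => {
  // Throws - and fails the evaluation run - if the judge returned freeform prose
  return JudgeSchema.parse(JSON.parse(raw));
};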

6. Cost vs Coverage Trade-Offs

  • Tier 1: Run on every commit (~$0.02 per run)
  • Tier 2: Run nightly (~$2 per full evaluation)
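
For example, at roughly 50 pushes a day, Tier 1 costs about 50 × $0.02 = $1 per day, while a nightly Tier 2 run adds about $2 per day (those volumes are illustrative, not measurements).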

Budget accordingly.

Common Pitfalls

❌ Over-Reliance on Mocks

// This doesn't test the actual LLM behavior
mock(llm).toReturn({ tool: 'summit_schedule', args: {...} });

Use real API calls for integration tests. Mock at the boundary (external services), not at the LLM.
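
For example, stub the data source behind the tools so the fixture stays stable while the real LLM still does the function calling. A sketch, assuming a Jest-style setup and a hypothetical data-source module:

import * as dataSource from './data-source'; // hypothetical module behind the MCP server

// ✅ Mock at the boundary: deterministic backend data, real LLM behavior
jest.spyOn(dataSource, 'getSchedule').mockResolvedValue({
  area: 'Alpental',
  date: '2025-12-20',
  hours: '9:00 AM - 4:00 PM',
});

// The test still exercises real OpenAI function calling against this stable fixture.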

❌ Exact String Matching

expect(response).toBe('Alpental is open 9am-4pm');
// Flaky! LLM might say "Alpental: 9:00 AM - 4:00 PM"

❌ Testing Too Many Things at Once

// Bad: Tests tool selection, response quality, and formatting
expect(response).toBe(EXACT_EXPECTED_OUTPUT);

// Good: Test one thing at a time
expect(toolsUsed).toContain('summit_schedule'); // Tool selection
expect(response).toMatch(/alpental/i); // Relevance
// Tier 2 handles quality

❌ Ignoring Model Updates

OpenAI updates models regularly. What worked yesterday might break tomorrow. Monitor Tier 2 trends to catch degradation.

Industry Comparison: Other Approaches

We’re not the only ones solving this problem. Here are alternative approaches:

Prompt Regression Testing (used by LangChain)

  • Store prompt templates in version control
  • Test that template changes don’t break known cases
  • Focus on prompt engineering rather than output validation

LLM-as-a-Judge (used by OpenAI for GPT-4 evals)

  • Use a stronger model to evaluate weaker models
  • Constitutional AI approach
  • Our Tier 2 is inspired by this

Assertion-Based Testing (used by PromptLayer)

  • Define semantic assertions like “response contains date”
  • Use NLP to validate claims rather than string matching
  • More sophisticated than our keyword approach

Human-in-the-Loop (used by Anthropic)

  • Sample responses sent to humans for rating
  • Gold standard for quality, but doesn’t scale
  • We reserve this for Tier 2 evaluation of edge cases

Future Improvements

1. Semantic Similarity Scoring

Instead of keyword matching, use embeddings to measure semantic distance:

// embed() and cosineSimilarity() are placeholder helpers (an embeddings API plus a vector-math utility)
const similarity = cosineSimilarity(embed(response), embed('Expected to mention Alpental and weekend hours'));
expect(similarity).toBeGreaterThan(0.8);

2. Automated Golden Prompt Generation

Use LLMs to generate diverse test cases based on existing ones:

const variants = await generateVariants("What's open at Summit today?", { count: 10, diversity: 'high' });
// "Which areas are operational at Summit right now?"
// "Tell me what's currently available at Summit"
// etc.

3. Continuous Monitoring

  • Track Tier 2 scores over time
  • Alert when scores drop below baseline (see the sketch after this list)
  • Correlate with model version updates
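
A minimal sketch of that baseline check (the numbers and storage are assumptions; a real pipeline would persist scores from each nightly run):

// Illustrative baseline alert fed by the nightly Tier 2 average score
const BASELINE_SCORE = 7.5;    // rolling average from previous runs (assumed)
const ALERT_MARGIN = 0.5;      // how far below baseline before we alert

const checkForRegression = (nightlyAverage: number): void => {
  if (nightlyAverage < BASELINE_SCORE - ALERT_MARGIN) {
    // In a real pipeline this would open an issue or page someone; here we just fail loudly
    throw new Error(
      `Tier 2 quality dropped: ${nightlyAverage.toFixed(1)} vs baseline ${BASELINE_SCORE}`
    );
  }
};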

4. Multi-Modal Testing

Extend to test function calling with images, audio, and video inputs.

Try It Yourself

All the code for this testing strategy is open source in our portfolio monorepo:

  • Tier 1 tests: apps/deployable/api/generative-ai-api/src/tests/
  • Tier 2 evaluations: apps/evaluation/llm-evaluations/
  • Testing docs: docs/TESTING_SPECIFICATION.md

The Function Calling Demo shows this system in action.

Conclusion

Testing AI applications requires rethinking traditional TDD principles:

  • Do validate behavior (tool selection, semantic relevance)
  • Don’t assert exact outputs
  • Do split deterministic (Tier 1) from quality (Tier 2) tests
  • Don’t block CI/CD on subjective quality metrics
  • Do use real API calls for integration tests
  • Don’t over-rely on mocks

This two-tier approach gives us:

  • Fast, reliable CI/CD (Tier 1 blocks regressions)
  • Quality insights (Tier 2 guides improvements)
  • Cost control (Tier 1 is cheap, Tier 2 runs selectively)

AI systems are inherently non-deterministic, but that doesn’t mean they’re untestable. You just need the right strategy.


Want to dive deeper? Check out the related demo to see this testing strategy in action, or explore our testing specification for implementation details.

Next in this series: Evaluating LLM responses with AI judges (deep dive into Tier 2)