Building a 6-Model Review Panel with GitHub Copilot Custom Agents

TL;DR: I built a system that sends any design decision to 6 different AI models simultaneously (Gemini 3 Pro, GPT-5.2, Claude Sonnet 4.5, GPT-5.2-Codex, Claude Opus 4.6, and Grok Code Fast 1) and synthesizes their feedback into a single report. Each model reviews through a different lens (mobile-first, info architecture, cognitive load, engineering feasibility, deep reasoning, fast gut-check), and an orchestrator agent merges their perspectives. It costs ~14 premium requests per review, takes about 30 seconds, and has already surfaced blind spots I’d have missed with any single model.

Here’s how I built it, what I learned, and how it compares to multi-agent approaches in other tools.

The Problem: One Model, One Perspective

I was working on a precipitation timeline feature for a ski conditions dashboard. The weather data comes from NOAA’s gridpoint API as hourly time-series arrays, and I needed to decide how to bucket those hours into meaningful periods for skiers. Should I stick with two periods (overnight/daytime)? Split into three (overnight/daytime/evening)? Use four periods, splitting daytime at 2pm for twilight pass holders? Add pass-aware dynamic periods?
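To make the bucketing question concrete, here is a minimal Python sketch of summing hourly values into named periods. It assumes the NOAA series has already been flattened into (local timestamp, amount) pairs. The boundary hours, period names, and function names are illustrative assumptions rather than the dashboard's actual code, and overnight is simplified to start at midnight.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical boundaries as (local start hour, label); not the author's real values.
THREE_PERIODS = [(0, "overnight"), (9, "daytime"), (17, "evening")]
FOUR_PERIODS = [(0, "overnight"), (9, "morning"), (14, "afternoon"), (17, "evening")]


def period_for(hour, periods):
    """Return the label of the last period whose start hour is <= hour.

    Assumes `periods` is sorted by start hour and begins at hour 0.
    """
    label = periods[0][1]
    for start, name in periods:
        if hour >= start:
            label = name
    return label


def bucket_precip(hourly, periods):
    """Sum (timestamp, amount) pairs into totals keyed by (date, period label)."""
    totals = defaultdict(float)
    for ts, amount in hourly:
        totals[(ts.date(), period_for(ts.hour, periods))] += amount
    return dict(totals)


# Made-up sample values, purely for illustration.
sample = [
    (datetime(2026, 2, 1, 3), 1.0),
    (datetime(2026, 2, 1, 10), 2.0),
    (datetime(2026, 2, 1, 15), 0.5),
    (datetime(2026, 2, 1, 20), 4.0),
]
three = bucket_precip(sample, THREE_PERIODS)
four = bucket_precip(sample, FOUR_PERIODS)
```

With this shape, comparing the options only means swapping the boundary table passed to `bucket_precip`.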

Each option had trade-offs across UX, engineering complexity, data accuracy, and user segmentation. I realized I was going back and forth in my own head, and what I really wanted was a panel of reviewers — each bringing a different perspective to the same question.

That’s when I discovered that VS Code’s custom agents now support model selection and subagent orchestration.

What Are Custom Agents?

Since the January 2026 VS Code release, you can create .agent.md files that define specialized AI agents with:

Custom system prompts — tell the agent what lens to use
Model selection — pin to a specific model (or provide a fallback chain)
Tool access — control what the agent can do (search, fetch, read files, invoke other agents)
Subagent restrictions — an orchestrator agent can specify exactly which other agents it’s allowed to call

Agents live in two places:

Workspace-level: .github/agents/ (shared with your team via git)
User-level: ~\AppData\Roaming\Code - Insiders\User\agents\ on Windows with VS Code Insiders (personal, available across all projects)

The key insight: by creating multiple agents pinned to different models, and an orchestrator that dispatches to all of them, you get a multi-model review panel that runs in parallel.

The Architecture

┌────────────────────────────────────────────────────┐
│  @multi-review "Should I use 3 or 4 periods?"      │
│                                                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │review-gemini│  │ review-gpt  │  │review-claude│ │
│  │Gemini 3 Pro │  │  GPT-5.2    │  │ Sonnet 4.5  │ │
│  │mobile-first │  │ info arch   │  │  cog load   │ │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘ │
│         │                │                │        │
│  ┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐ │
│  │review-codex │  │ review-opus │  │ review-grok │ │
│  │GPT-5.2-Codex│  │ Claude Opus │  │ Grok Fast 1 │ │
│  │ engineering │  │ deep reason │  │  gut check  │ │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘ │
│         │                │                │        │
│         └────────────────┼────────────────┘        │
│                  ┌───────┴───────┐                 │
│                  │ Orchestrator  │                 │
│                  │  Synthesis +  │                 │
│                  │  Comparison   │                 │
│                  └───────────────┘                 │
└────────────────────────────────────────────────────┘

Seven .agent.md files total: 6 reviewers + 1 orchestrator.

Building the Reviewer Agents

Each reviewer agent follows the same structure but with a distinct lens:

Example: The Gemini Reviewer

---
name: review-gemini
description: Product/UX/technical reviewer powered by Gemini.
model:
  - gemini-3-pro (copilot)
  - gemini-2.5-pro (copilot)
tools: ['search', 'fetch', 'read']
user-invokable: false
---

Key decisions in the design:

model as an array provides automatic fallback. If Gemini 3 Pro is unavailable, it silently falls back to Gemini 2.5 Pro.
user-invokable: false hides the agent from the chat dropdown — it only runs when the orchestrator calls it as a subagent.
tools are intentionally limited. Reviewers can search and read but can’t write files or run terminals.
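The fallback behaviour described above amounts to "use the first listed model that is available". The sketch below illustrates only that rule. It is not how VS Code implements it, and `resolve_model` and the availability sets are hypothetical names introduced for the example.

```python
def resolve_model(preferences, available):
    """Return the first preferred model that is currently available, else None."""
    for model in preferences:
        if model in available:
            return model
    return None


# Same order as the model array in the frontmatter above.
gemini_chain = ["gemini-3-pro (copilot)", "gemini-2.5-pro (copilot)"]

# Both models available: the primary wins.
primary_up = resolve_model(
    gemini_chain, {"gemini-3-pro (copilot)", "gemini-2.5-pro (copilot)"}
)

# Primary unavailable: the chain falls through to the second entry.
primary_down = resolve_model(gemini_chain, {"gemini-2.5-pro (copilot)"})

# Nothing in the chain available: there is no model to use.
none_left = resolve_model(gemini_chain, set())
```

Order matters here: the list is a preference ranking, so the model you actually want belongs first.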

The system prompt gives each model a distinct personality:

Agent	Model	Lens	What It Asks
review-gemini	Gemini 3 Pro	Mobile-first, practical	“How does this feel on a phone?”
review-gpt	GPT-5.2	Info architecture, competitive	“What do comparable products do?”
review-claude	Claude Sonnet 4.5	Cognitive load, behavioral psych	“How many decisions are you forcing?”
review-codex	GPT-5.2-Codex	Engineering, implementation	“What’s the actual code complexity?”
review-opus	Claude Opus 4.6	Deep reasoning, system-level	“What are the second-order effects?”
review-grok	Grok Code Fast 1	Fast gut-check	“Does this really matter?”

The Cost Profile

Not all reviews are equal in cost:

Agent	Model	Premium Multiplier	Role
review-grok	Grok Code Fast 1	0.25x	Cheapest — the quick sanity check
review-gemini	Gemini 3 Pro	1x	Standard
review-gpt	GPT-5.2	1x	Standard
review-claude	Claude Sonnet 4.5	1x	Standard
review-codex	GPT-5.2-Codex	1x	Standard
review-opus	Claude Opus 4.6	10x	Most expensive — deep reasoning
Total		14.25x	Full panel

For budget-conscious reviews: skip Opus (saves 10x) or run only Gemini + GPT + Claude (3x total).
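The cost of any panel subset is the sum of the multipliers in the table. Here is a small sketch of that arithmetic, assuming the multipliers as listed and counting only the reviewer calls (any cost for the orchestrator's own request is left out). The `panel_cost` helper is a hypothetical name.

```python
# Premium multipliers copied from the cost table above.
MULTIPLIERS = {
    "review-grok": 0.25,
    "review-gemini": 1.0,
    "review-gpt": 1.0,
    "review-claude": 1.0,
    "review-codex": 1.0,
    "review-opus": 10.0,
}


def panel_cost(agents):
    """Sum the premium multipliers for the given reviewer names."""
    return sum(MULTIPLIERS[name] for name in agents)


full_panel = panel_cost(MULTIPLIERS)
without_opus = panel_cost(name for name in MULTIPLIERS if name != "review-opus")
core_three = panel_cost(["review-gemini", "review-gpt", "review-claude"])

for label, cost in [
    ("full panel", full_panel),
    ("without Opus", without_opus),
    ("Gemini + GPT + Claude", core_three),
]:
    print(f"{label}: {cost}x")
```

Keeping the multipliers in one table makes it easy to re-check the budget when GitHub changes a model's rate.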

The Orchestrator

The orchestrator is the glue. Its frontmatter restricts which agents it can invoke:

---
name: multi-review
description: Multi-model review panel.
tools: ['agent', 'search', 'fetch', 'read']
agents: ['review-gemini', 'review-gpt', 'review-claude', 'review-codex', 'review-opus', 'review-grok']
---

Its system prompt defines a 3-step workflow:

Parse the question and formulate a clear review prompt
Dispatch to all 6 reviewers in parallel
Synthesize into consensus, divergence, per-reviewer highlights, comparison table, and final recommendation

The synthesis step is the real value. Raw output from 6 models is overwhelming. The orchestrator distills it into: “All 6 agree on X. Gemini and Grok diverge on Y. Opus surfaced a second-order concern about Z that nobody else caught.”
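The three steps map onto a standard fan-out/fan-in pattern. The sketch below shows that pattern with Python's asyncio. It is an analogy for what the orchestrator prompt asks for, not how Copilot dispatches subagents, and `call_reviewer` is a hypothetical stub standing in for a real model call.

```python
import asyncio

REVIEWERS = [
    "review-gemini",
    "review-gpt",
    "review-claude",
    "review-codex",
    "review-opus",
    "review-grok",
]


async def call_reviewer(name, prompt):
    """Hypothetical stand-in for a subagent call; returns a canned review."""
    await asyncio.sleep(0)  # placeholder for real model latency
    return {"reviewer": name, "review": f"{name} reviewed: {prompt}"}


async def run_panel(question):
    # Step 1: turn the raw question into a clear review prompt.
    prompt = f"Review this design decision: {question.strip()}"
    # Step 2: dispatch to every reviewer concurrently; gather keeps input order.
    reviews = await asyncio.gather(
        *(call_reviewer(name, prompt) for name in REVIEWERS)
    )
    # Step 3 (synthesis) would consume this structure.
    return {"prompt": prompt, "reviews": list(reviews)}


report = asyncio.run(run_panel("Should I use 3 or 4 periods?"))
```

Because the calls run concurrently, wall-clock time is bounded by the slowest reviewer rather than the sum of all six.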

How to Use It

In VS Code chat:

@multi-review I'm deciding between 3 time periods (overnight/daytime/evening)
and 4 periods (splitting daytime at 2pm for twilight pass holders).
Context: ski conditions dashboard, NOAA hourly data,
user segments include day pass holders (9am-5pm)
and twilight pass holders (2pm-close at 9:30pm).

The orchestrator dispatches to all 6 models, collects their reviews, and returns a synthesized report with a comparison table.

How This Compares to Other Tools

The multi-agent landscape has exploded in early 2026. Here’s how the major tools approach it differently:

GitHub Copilot (VS Code) — Declarative Agent Orchestration

Approach: .agent.md files with YAML frontmatter. Agents can invoke subagents. Parallel execution since January 2026.

Strengths:

Broadest model selection (GPT, Claude, Gemini, Grok — 10+ models)
Declarative config — no code required to define agents
User-level agents work across all workspaces
Agent Skills (Anthropic’s open standard) for reusable capabilities
Agent Sessions view consolidates local + background + cloud agents

Limitations:

No file system isolation between parallel agents (unlike Cursor’s worktrees)
Subagent context is isolated from parent — can’t share intermediate state

Claude Code — Agent Teams with Deep Context

Approach: Agent teams via the Agent SDK. Peer agents coordinate toward shared goals rather than leader-follower hierarchy.

Strengths:

1M token context window with Opus 4.6 (!)
Context compaction for long-running sessions
Adaptive thinking — model decides when to use extended reasoning
Can take over any subagent mid-execution (Shift+Up/Down)
CLAUDE.md files as persistent project memory

Limitations:

Naturally biased toward Claude models
Agent team coordination is “research preview” — still maturing
No declarative multi-model setup like VS Code’s .agent.md

Cursor — Parallel Agents with Git Worktree Isolation

Approach: Each parallel agent gets its own Git worktree (isolated working directory, shared .git object store).

Strengths:

True file system isolation — agents can’t conflict
Up to 8 agents in parallel on a single prompt
Plan mode: plan with one model, execute with another
Planner/worker/judge architecture for scaling to hundreds of agents

Limitations:

Requires Git overhead for isolation
Coordination through shared branches, not shared context
No declarative agent definition format (yet)

Copilot CLI — Terminal-Native Agents

Approach: Built-in specialized agents (Explore, Task, Plan, Code-review) with automatic delegation.

Strengths:

Agents auto-select based on your prompt — no manual choice needed
Same .agent.md format as VS Code for custom agents
Agent Registry (with JetBrains, Zed) for cross-IDE discovery
Auto-compaction at 95% context usage

Limitations:

Terminal-only — no visual comparison UI
Model selection more limited than VS Code

Windsurf — Flow-Aware Cascade

Approach: Cascade agent tracks all developer actions (edits, commands, clipboard, terminal) to infer intent.

Strengths:

Implicit intent inference — “continue my work” actually works
Arena Mode for head-to-head model comparison
Git worktree isolation (like Cursor)
Memories system for cross-session context

Limitations:

More opaque agent behavior (hard to debug the “flow” reasoning)
Tighter vendor coupling than open-standard approaches

Amazon Q Developer — AWS-Native Agents

Approach: Custom agents via configuration files with granular tool/path permissions.

Strengths:

Deep AWS service integration (CloudWatch, Lambda, DynamoDB analysis)
Granular permission model (read-only vs write-only per path)
Free tier: 50 agentic chats/month

Limitations:

Heavily AWS-focused — less general-purpose
Fewer model choices than VS Code or Cursor

The Comparison Matrix

Capability	VS Code Copilot	Claude Code	Cursor	Copilot CLI	Windsurf	Amazon Q
Custom agent definitions	`.agent.md` ✅	Agent SDK ✅	Rules ⚠️	`.agent.md` ✅	Rules ⚠️	Config files ✅
Multi-model in same workflow	✅ (10+ models)	⚠️ (Claude-focused)	✅ (~8 models)	✅	✅	⚠️ (fewer)
Sub-agent orchestration	✅ (parallel)	✅ (agent teams)	✅ (worktrees)	✅ (auto-delegate)	⚠️ (multi-cascade)	⚠️
File system isolation	❌	❌	✅ (worktrees)	N/A	✅ (worktrees)	❌
Declarative agent config	✅	Partial	❌	✅	❌	✅
Cross-IDE portability	✅ (ACP + Registry)	❌	❌	✅ (ACP)	❌	❌
Context window	Standard	1M tokens	Standard	Standard	Standard	Standard

What I’ve Learned

1. Different Models Have Genuinely Different Blind Spots

This isn’t just “get more opinions for confidence.” Models trained differently actually surface different concerns. In my precipitation timeline review:

Gemini focused on mobile scrolling behavior and touch targets
GPT referenced how Ski Utah and OpenSnow handle time bucketing
Claude Sonnet flagged cognitive overload from too many time periods
Codex pointed out the code complexity of dynamic period boundaries
Opus identified a second-order effect: changing periods would affect historical comparisons
Grok said “3 periods is fine, ship it, stop overthinking”

Grok’s directness was surprisingly valuable. Sometimes the most useful review is the one that says “this isn’t worth the complexity.”

2. The Orchestrator Synthesis Is The Key Feature

Raw output from 6 models is ~3,000-5,000 words. Nobody reads that. The orchestrator’s job — identifying consensus, surfacing divergence, building comparison tables — is what makes the system usable. Without synthesis, it’s just noise.
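In the real system the synthesis is done by the orchestrator model in natural language, but the core idea (find the majority view, then name who disagrees) can be sketched mechanically. The example below assumes each review has already been reduced to a single pick. The `synthesize` function and the "A"/"B" picks are hypothetical and do not reflect the actual review results.

```python
from collections import Counter


def synthesize(recommendations):
    """Split reviewer picks into a majority view and the reviewers who dissent."""
    counts = Counter(recommendations.values())
    consensus, votes = counts.most_common(1)[0]
    dissent = {
        reviewer: pick
        for reviewer, pick in recommendations.items()
        if pick != consensus
    }
    return {
        "consensus": consensus,
        "votes": votes,
        "unanimous": not dissent,
        "dissent": dissent,
    }


# Hypothetical picks for illustration only.
picks = {
    "review-gemini": "A",
    "review-gpt": "A",
    "review-claude": "A",
    "review-codex": "A",
    "review-opus": "B",
    "review-grok": "A",
}
summary = synthesize(picks)
```

A tally like this captures consensus and divergence, but not the reasoning behind a dissent, which is the part the orchestrator's written synthesis is for.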

3. Cost Management Matters

At 14.25x premium requests per full review, you won’t run this on every commit message. I use it for:

Architecture decisions (which database? which API pattern?)
UX decisions with multiple valid approaches
Reviewing my own specs before implementation

For quick checks, I skip Opus and Codex (keeps it at 3.25x).

4. `user-invokable: false` Is Essential for Clean UX

Without this flag, all 6 reviewer agents would appear in your chat dropdown alongside your regular agents. Setting user-invokable: false keeps them hidden — they only activate when the orchestrator calls them. This is the difference between a usable system and a cluttered mess.

5. Fallback Chains Handle Model Deprecations Gracefully

GitHub deprecates model versions regularly (next batch: Feb 17, 2026). The array syntax for model ensures your agents keep working:

model:
  - gemini-3-pro (copilot) # primary
  - gemini-2.5-pro (copilot) # fallback

When Gemini 2.5 Pro gets deprecated on Feb 17, the agent is already set to prefer Gemini 3 Pro. Zero downtime.

Setting This Up Yourself

Step 1: Create Your Agent Directory

For user-level agents (available across all workspaces):

macOS: ~/Library/Application Support/Code - Insiders/User/agents/
Linux: ~/.config/Code - Insiders/User/agents/ (for stable VS Code, use Code in place of Code - Insiders)
Windows: %APPDATA%\Code - Insiders\User\agents\

For workspace-level agents (shared via git): .github/agents/

Step 2: Create Reviewer Agents

Create one .agent.md file per model. Minimum viable agent:

---
name: review-mymodel
description: Technical reviewer on ModelName.
model: model-name (copilot)
tools: ['search', 'fetch', 'read']
user-invokable: false
---
You are a technical reviewer. Rate options as Strong/Moderate/Weak.
Give a clear recommendation with reasoning.
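If you are creating several reviewers, generating the files from one template keeps them consistent. This is a minimal sketch that writes one file per reviewer using the frontmatter fields shown above. The `write_reviewers` helper, the lens wording, and the second placeholder entry are assumptions for illustration, and the example writes to a temporary directory rather than a real agents folder.

```python
import tempfile
from pathlib import Path

# Mirrors the minimum viable agent above, with a {lens} slot added to the prompt.
TEMPLATE = """---
name: {name}
description: Technical reviewer on {model}.
model: {model} (copilot)
tools: ['search', 'fetch', 'read']
user-invokable: false
---
You are a technical reviewer focused on {lens}.
Rate options as Strong/Moderate/Weak.
Give a clear recommendation with reasoning.
"""

# (agent name, model identifier, lens). The second entry is a placeholder.
REVIEWERS = [
    ("review-gemini", "gemini-3-pro", "mobile-first usability"),
    ("review-mymodel", "model-name", "engineering feasibility"),
]


def write_reviewers(target_dir, reviewers):
    """Write one <name>.agent.md file per reviewer and return the paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, model, lens in reviewers:
        path = target / f"{name}.agent.md"
        path.write_text(
            TEMPLATE.format(name=name, model=model, lens=lens), encoding="utf-8"
        )
        paths.append(path)
    return paths


written = write_reviewers(tempfile.mkdtemp(), REVIEWERS)
```

Point `write_reviewers` at `.github/agents/` (or your user-level folder) once the generated files look right.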

Step 3: Create the Orchestrator

---
name: multi-review
description: Multi-model review orchestrator.
tools: ['agent', 'search', 'fetch', 'read']
agents: ['review-mymodel1', 'review-mymodel2', 'review-mymodel3']
---

Dispatch the user's question to all reviewers in parallel.
Synthesize into: Consensus, Divergence, Comparison Table, Recommendation.
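A typo in the orchestrator's agents list is an easy mistake to make, so a quick consistency check can help. The sketch below assumes the single-line, single-quoted frontmatter style used in this post and uses plain regexes instead of a YAML parser, so it will not handle other YAML layouts. All function names are hypothetical.

```python
import re


def frontmatter(text):
    """Return the text between the first pair of '---' markers ('' if absent)."""
    parts = text.split("---", 2)
    return parts[1] if len(parts) >= 3 else ""


def agent_name(text):
    """Extract the value of the `name:` field from an agent file."""
    match = re.search(r"^name:\s*(\S+)", frontmatter(text), re.MULTILINE)
    return match.group(1) if match else None


def listed_agents(text):
    """Extract the names inside a single-line `agents: [...]` field."""
    match = re.search(r"^agents:\s*\[(.*)\]", frontmatter(text), re.MULTILINE)
    return re.findall(r"'([^']+)'", match.group(1)) if match else []


def missing_reviewers(orchestrator_text, reviewer_texts):
    """Return agents the orchestrator lists that no reviewer file defines."""
    defined = {agent_name(text) for text in reviewer_texts}
    return [name for name in listed_agents(orchestrator_text) if name not in defined]


orchestrator = (
    "---\n"
    "name: multi-review\n"
    "tools: ['agent', 'search', 'fetch', 'read']\n"
    "agents: ['review-mymodel1', 'review-mymodel2', 'review-mymodel3']\n"
    "---\n"
    "Dispatch the user's question to all reviewers in parallel.\n"
)
reviewers = [
    "---\nname: review-mymodel1\nuser-invokable: false\n---\nYou are a reviewer.\n",
    "---\nname: review-mymodel2\nuser-invokable: false\n---\nYou are a reviewer.\n",
]
missing = missing_reviewers(orchestrator, reviewers)
```

Here `missing` flags `review-mymodel3`, the one listed agent without a matching file.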

Step 4: Use It

@multi-review [your question with context]

What’s Next

The multi-agent ecosystem is moving fast. A few things I’m watching:

Agent Client Protocol (ACP) — open standard from GitHub + JetBrains + Zed for cross-IDE agent portability. Your .agent.md files could work in JetBrains IDEs without modification.
Agent Skills — Anthropic’s open standard for reusable agent capabilities. Think npm packages but for agent behaviors.
Extended context windows — Claude Code’s 1M token context enables agent sessions that span days. This changes what’s possible for long-running autonomous agents.
Planner/worker/judge patterns — Cursor’s research on scaling to hundreds of parallel agents suggests a future where “run 100 agents on this codebase” is a normal workflow.

The role of the developer is shifting from “person who writes code” to “person who orchestrates AI agents that write code.” Multi-model review panels are an early, practical example of that shift — and the tools to build them are available today.

Built with GitHub Copilot in VS Code Insiders, running Claude Opus 4.6. Agent files available at github.com/nickmaassel.
