TL;DR: I built a system that sends any design decision to 6 different AI models simultaneously (Gemini 3 Pro, GPT-5.2, Claude Sonnet 4.5, GPT-5.2-Codex, Claude Opus 4.6, and Grok Code Fast 1) and synthesizes their feedback into a single report. Each model reviews through a different lens (mobile-first, info architecture, cognitive load, engineering feasibility, deep reasoning, fast gut-check), and an orchestrator agent merges their perspectives. It costs ~14 premium requests per review, takes about 30 seconds, and has already surfaced blind spots I’d have missed with any single model.
Here’s how I built it, what I learned, and how it compares to multi-agent approaches in other tools.
The Problem: One Model, One Perspective
I was working on a precipitation timeline feature for a ski conditions dashboard. The weather data comes from NOAA’s gridpoint API as hourly time-series arrays, and I needed to decide how to bucket those hours into meaningful periods for skiers. Should I stick with two periods (overnight/daytime)? Split into three (overnight/morning/evening)? Four periods with a 2pm boundary for twilight pass holders? Add pass-aware dynamic periods?
Each option had trade-offs across UX, engineering complexity, data accuracy, and user segmentation. I realized I was going back and forth in my own head, and what I really wanted was a panel of reviewers — each bringing a different perspective to the same question.
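To make the options concrete, here is a minimal Python sketch of the bucketing step. The boundary hours, the period names in the four-period variant, and the (timestamp, amount) input shape are illustrative assumptions; this is not the dashboard's actual code or NOAA's response format.

```python
from datetime import datetime

# Hypothetical period boundaries: (start hour inclusive, end hour exclusive).
THREE_PERIODS = {"overnight": (0, 9), "daytime": (9, 17), "evening": (17, 24)}
FOUR_PERIODS = {
    "overnight": (0, 9),
    "morning": (9, 14),
    "afternoon": (14, 17),  # the 2pm split for twilight pass holders
    "evening": (17, 24),
}

def bucket_precip(hourly, periods):
    """Sum hourly precipitation amounts into the named periods."""
    totals = {name: 0.0 for name in periods}
    for timestamp, amount in hourly:
        hour = datetime.fromisoformat(timestamp).hour
        for name, (start, end) in periods.items():
            if start <= hour < end:
                totals[name] += amount
                break
    return totals
```

Calling bucket_precip with the same hourly list and each period table shows how one dataset reads under the competing schemes, which is the comparison the review panel was asked to judge.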
That’s when I discovered that VS Code’s custom agents now support model selection and subagent orchestration.
What Are Custom Agents?
Since the January 2026 VS Code release, you can create .agent.md files that define specialized AI agents with:
- Custom system prompts — tell the agent what lens to use
- Model selection — pin to a specific model (or provide a fallback chain)
- Tool access — control what the agent can do (search, fetch, read files, invoke other agents)
- Subagent restrictions — an orchestrator agent can specify exactly which other agents it’s allowed to call
Agents live in two places:
- Workspace-level: .github/agents/ (shared with your team via git)
- User-level: ~\AppData\Roaming\Code - Insiders\User\agents\ (personal, available across all projects)
The key insight: by creating multiple agents pinned to different models, and an orchestrator that dispatches to all of them, you get a multi-model review panel that runs in parallel.
The Architecture
┌─────────────────────────────────────────────────┐
│ @multi-review "Should I use 3 or 4 periods?" │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────┐ │
│ │review-gemini│ │ review-gpt │ │review- │ │
│ │Gemini 3 Pro │ │ GPT-5.2 │ │claude │ │
│ │mobile-first │ │info arch │ │cog load │ │
│ └──────┬──────┘ └──────┬──────┘ └─────┬─────┘ │
│ │ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ ┌─────┴─────┐ │
│ │review-codex │ │ review-opus │ │review-grok│ │
│ │GPT-5.2-Codex│ │Claude Opus │ │Grok Fast 1│ │
│ │engineering │ │deep reason │ │gut check │ │
│ └──────┬──────┘ └──────┬──────┘ └─────┬─────┘ │
│ │ │ │ │
│ └────────┬───────┘───────────────┘ │
│ ┌───────┴────────┐ │
│ │ Orchestrator │ │
│ │ Synthesis + │ │
│ │ Comparison │ │
│ └────────────────┘ │
└─────────────────────────────────────────────────┘
Seven .agent.md files total: 6 reviewers + 1 orchestrator.
Building the Reviewer Agents
Each reviewer agent follows the same structure but with a distinct lens:
Example: The Gemini Reviewer
---
name: review-gemini
description: Product/UX/technical reviewer powered by Gemini.
model:
- gemini-3-pro (copilot)
- gemini-2.5-pro (copilot)
tools: ['search', 'fetch', 'read']
user-invokable: false
---
Key decisions in the design:
- model as an array provides automatic fallback. If Gemini 3 Pro is unavailable, it silently falls back to Gemini 2.5 Pro.
- user-invokable: false hides the agent from the chat dropdown; it only runs when the orchestrator calls it as a subagent.
- tools are intentionally limited. Reviewers can search and read but can’t write files or run terminals.
The system prompt gives each model a distinct personality:
| Agent | Model | Lens | What It Asks |
|---|---|---|---|
| review-gemini | Gemini 3 Pro | Mobile-first, practical | “How does this feel on a phone?” |
| review-gpt | GPT-5.2 | Info architecture, competitive | “What do comparable products do?” |
| review-claude | Claude Sonnet 4.5 | Cognitive load, behavioral psych | “How many decisions are you forcing?” |
| review-codex | GPT-5.2-Codex | Engineering, implementation | “What’s the actual code complexity?” |
| review-opus | Claude Opus 4.6 | Deep reasoning, system-level | “What are the second-order effects?” |
| review-grok | Grok Code Fast 1 | Fast gut-check | “Does this really matter?” |
The Cost Profile
Not all reviews are equal in cost:
| Agent | Model | Premium Multiplier | Role |
|---|---|---|---|
| review-grok | Grok Code Fast 1 | 0.25x | Cheapest — the quick sanity check |
| review-gemini | Gemini 3 Pro | 1x | Standard |
| review-gpt | GPT-5.2 | 1x | Standard |
| review-claude | Claude Sonnet 4.5 | 1x | Standard |
| review-codex | GPT-5.2-Codex | 1x | Standard |
| review-opus | Claude Opus 4.6 | 10x | Most expensive — deep reasoning |
| Total | | 14.25x | Full panel |
For budget-conscious reviews: skip Opus (saves 10x) or run only Gemini + GPT + Claude (3x total).
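The budget arithmetic is easy to check with a few lines of Python. This is a small sketch that copies the multipliers from the table; it assumes cost is simply the sum of the reviewers' multipliers and ignores anything the orchestrator itself consumes.

```python
# Premium-request multipliers copied from the cost table.
MULTIPLIERS = {
    "review-grok": 0.25,
    "review-gemini": 1.0,
    "review-gpt": 1.0,
    "review-claude": 1.0,
    "review-codex": 1.0,
    "review-opus": 10.0,
}

def panel_cost(reviewers):
    """Return the summed premium multiplier for the chosen reviewers."""
    return sum(MULTIPLIERS[name] for name in reviewers)

full_panel = panel_cost(MULTIPLIERS)
without_opus = panel_cost(name for name in MULTIPLIERS if name != "review-opus")
core_three = panel_cost(["review-gemini", "review-gpt", "review-claude"])
```

Swapping reviewer names in and out of the list passed to panel_cost gives the cost of any reduced panel before you run it.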
The Orchestrator
The orchestrator is the glue. Its frontmatter restricts which agents it can invoke:
---
name: multi-review
description: Multi-model review panel.
tools: ['agent', 'search', 'fetch', 'read']
agents: ['review-gemini', 'review-gpt', 'review-claude', 'review-codex', 'review-opus', 'review-grok']
---
Its system prompt defines a 3-step workflow:
1. Parse the question and formulate a clear review prompt
2. Dispatch to all 6 reviewers in parallel
3. Synthesize into consensus, divergence, per-reviewer highlights, comparison table, and final recommendation
The synthesis step is the real value. Raw output from 6 models is overwhelming. The orchestrator distills it into: “All 6 agree on X. Gemini and Grok diverge on Y. Opus surfaced a second-order concern about Z that nobody else caught.”
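The parse, dispatch, and synthesize flow can be sketched outside VS Code in plain Python. Everything here is a stand-in: run_reviewer is a hypothetical stub that returns canned answers instead of calling a model, and the synthesis is reduced to vote counting. It only illustrates the shape of the workflow, not how the orchestrator agent is actually implemented.

```python
import asyncio
from collections import Counter

async def run_reviewer(name, prompt, canned_answers):
    """Hypothetical stand-in for one reviewer subagent call."""
    await asyncio.sleep(0)  # a real reviewer would await a model response here
    return {"reviewer": name, "recommendation": canned_answers[name]}

async def multi_review(question, canned_answers):
    # Step 1: turn the raw question into a clear review prompt.
    prompt = f"Review this design decision: {question.strip()}"
    # Step 2: dispatch to every reviewer concurrently.
    reviews = await asyncio.gather(
        *(run_reviewer(name, prompt, canned_answers) for name in canned_answers)
    )
    # Step 3: synthesize consensus and divergence from the recommendations.
    votes = Counter(review["recommendation"] for review in reviews)
    top_choice, top_count = votes.most_common(1)[0]
    return {
        "consensus": top_choice if top_count == len(reviews) else None,
        "majority": top_choice,
        "divergent": [r["reviewer"] for r in reviews if r["recommendation"] != top_choice],
    }
```

Running asyncio.run(multi_review(question, answers)) with a dict of placeholder answers returns the consensus (or None), the majority choice, and the names of the reviewers who disagreed.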
How to Use It
In VS Code chat:
@multi-review I'm deciding between 3 time periods (overnight/daytime/evening)
and 4 periods (splitting daytime at 2pm for twilight pass holders).
Context: ski conditions dashboard, NOAA hourly data,
user segments include day pass holders (9am-5pm)
and twilight pass holders (2pm-close at 9:30pm).
The orchestrator dispatches to all 6 models, collects their reviews, and returns a synthesized report with a comparison table.
How This Compares to Other Tools
The multi-agent landscape has exploded in early 2026. Here’s how the major tools approach it differently:
GitHub Copilot (VS Code) — Declarative Agent Orchestration
Approach: .agent.md files with YAML frontmatter. Agents can invoke subagents. Parallel execution since January 2026.
Strengths:
- Broadest model selection (GPT, Claude, Gemini, Grok — 10+ models)
- Declarative config — no code required to define agents
- User-level agents work across all workspaces
- Agent Skills (Anthropic’s open standard) for reusable capabilities
- Agent Sessions view consolidates local + background + cloud agents
Limitations:
- No file system isolation between parallel agents (unlike Cursor’s worktrees)
- Subagent context is isolated from parent — can’t share intermediate state
Claude Code — Agent Teams with Deep Context
Approach: Agent teams via the Agent SDK. Peer agents coordinate toward shared goals rather than leader-follower hierarchy.
Strengths:
- 1M token context window with Opus 4.6 (!)
- Context compaction for long-running sessions
- Adaptive thinking — model decides when to use extended reasoning
- Can take over any subagent mid-execution (Shift+Up/Down)
- CLAUDE.md files as persistent project memory
Limitations:
- Naturally biased toward Claude models
- Agent team coordination is “research preview” — still maturing
- No declarative multi-model setup like VS Code’s .agent.md
Cursor — Parallel Agents with Git Worktree Isolation
Approach: Each parallel agent gets its own Git worktree (isolated working directory, shared .git object store).
Strengths:
- True file system isolation — agents can’t conflict
- Up to 8 agents in parallel on a single prompt
- Plan mode: plan with one model, execute with another
- Planner/worker/judge architecture for scaling to hundreds of agents
Limitations:
- Requires Git overhead for isolation
- Coordination through shared branches, not shared context
- No declarative agent definition format (yet)
Copilot CLI — Terminal-Native Agents
Approach: Built-in specialized agents (Explore, Task, Plan, Code-review) with automatic delegation.
Strengths:
- Agents auto-select based on your prompt — no manual choice needed
- Same .agent.md format as VS Code for custom agents
- Agent Registry (with JetBrains, Zed) for cross-IDE discovery
- Auto-compaction at 95% context usage
Limitations:
- Terminal-only — no visual comparison UI
- Model selection more limited than VS Code
Windsurf — Flow-Aware Cascade
Approach: Cascade agent tracks all developer actions (edits, commands, clipboard, terminal) to infer intent.
Strengths:
- Implicit intent inference — “continue my work” actually works
- Arena Mode for head-to-head model comparison
- Git worktree isolation (like Cursor)
- Memories system for cross-session context
Limitations:
- More opaque agent behavior (hard to debug the “flow” reasoning)
- Tighter vendor coupling than open-standard approaches
Amazon Q Developer — AWS-Native Agents
Approach: Custom agents via configuration files with granular tool/path permissions.
Strengths:
- Deep AWS service integration (CloudWatch, Lambda, DynamoDB analysis)
- Granular permission model (read-only vs write-only per path)
- Free tier: 50 agentic chats/month
Limitations:
- Heavily AWS-focused — less general-purpose
- Fewer model choices than VS Code or Cursor
The Comparison Matrix
| Capability | VS Code Copilot | Claude Code | Cursor | Copilot CLI | Windsurf | Amazon Q |
|---|---|---|---|---|---|---|
| Custom agent definitions | .agent.md ✅ | Agent SDK ✅ | Rules ⚠️ | .agent.md ✅ | Rules ⚠️ | Config files ✅ |
| Multi-model in same workflow | ✅ (10+ models) | ⚠️ (Claude-focused) | ✅ (~8 models) | ✅ | ✅ | ⚠️ (fewer) |
| Sub-agent orchestration | ✅ (parallel) | ✅ (agent teams) | ✅ (worktrees) | ✅ (auto-delegate) | ⚠️ (multi-cascade) | ⚠️ |
| File system isolation | ❌ | ❌ | ✅ (worktrees) | N/A | ✅ (worktrees) | ❌ |
| Declarative agent config | ✅ | Partial | ❌ | ✅ | ❌ | ✅ |
| Cross-IDE portability | ✅ (ACP + Registry) | ❌ | ❌ | ✅ (ACP) | ❌ | ❌ |
| Context window | Standard | 1M tokens | Standard | Standard | Standard | Standard |
What I’ve Learned
1. Different Models Have Genuinely Different Blind Spots
This isn’t just “get more opinions for confidence.” Models trained differently actually surface different concerns. In my precipitation timeline review:
- Gemini focused on mobile scrolling behavior and touch targets
- GPT referenced how Ski Utah and OpenSnow handle time bucketing
- Claude Sonnet flagged cognitive overload from too many time periods
- Codex pointed out the code complexity of dynamic period boundaries
- Opus identified a second-order effect: changing periods would affect historical comparisons
- Grok said “3 periods is fine, ship it, stop overthinking”
Grok’s directness was surprisingly valuable. Sometimes the most useful review is the one that says “this isn’t worth the complexity.”
2. The Orchestrator Synthesis Is The Key Feature
Raw output from 6 models is ~3,000-5,000 words. Nobody reads that. The orchestrator’s job — identifying consensus, surfacing divergence, building comparison tables — is what makes the system usable. Without synthesis, it’s just noise.
3. Cost Management Matters
At 14.25x premium requests per full review (the sum of the multipliers in the cost table), you won’t run this on every commit message. I use it for:
- Architecture decisions (which database? which API pattern?)
- UX decisions with multiple valid approaches
- Reviewing my own specs before implementation
For quick checks, I skip Opus and Codex (keeps it at 3.25x).
4. user-invokable: false Is Essential for Clean UX
Without this flag, all 6 reviewer agents would appear in your chat dropdown alongside your regular agents. Setting user-invokable: false keeps them hidden — they only activate when the orchestrator calls them. This is the difference between a usable system and a cluttered mess.
5. Fallback Chains Handle Model Deprecations Gracefully
GitHub deprecates model versions regularly (next batch: Feb 17, 2026). The array syntax for model ensures your agents keep working:
model:
- gemini-3-pro (copilot) # primary
- gemini-2.5-pro (copilot) # fallback
When Gemini 2.5 Pro gets deprecated on Feb 17, the agent is already set to prefer Gemini 3 Pro. Zero downtime.
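One way to picture the fallback behavior is a first-available lookup. The sketch below is an assumption about the semantics (try each listed model in order, use the first one that is available); it is not VS Code's actual resolution code, and resolve_model is a hypothetical helper.

```python
def resolve_model(preferred, available):
    """Return the first preferred model that is currently available, else None."""
    for model in preferred:
        if model in available:
            return model
    return None

# The chain from the reviewer frontmatter: primary first, fallback second.
chain = ["gemini-3-pro (copilot)", "gemini-2.5-pro (copilot)"]
```

Under this reading, removing the fallback from the available set leaves the primary untouched, and removing the primary moves the agent to the fallback.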
Setting This Up Yourself
Step 1: Create Your Agent Directory
For user-level agents (available across all workspaces):
- macOS/Linux: ~/.vscode-insiders/data/User/agents/ (or ~/.vscode/data/User/agents/ for stable)
- Windows: %APPDATA%\Code - Insiders\User\agents\
For workspace-level agents (shared via git): .github/agents/
Step 2: Create Reviewer Agents
Create one .agent.md file per model. Minimum viable agent:
---
name: review-mymodel
description: Technical reviewer on ModelName.
model: model-name (copilot)
tools: ['search', 'fetch', 'read']
user-invokable: false
---
You are a technical reviewer. Rate options as Strong/Moderate/Weak.
Give a clear recommendation with reasoning.
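If you are creating several reviewers, a short script can stamp out the files from one template. This is a convenience sketch, not part of the original setup: render_reviewer and write_reviewers are hypothetical helpers, the name.agent.md file naming is an assumption, and the frontmatter fields mirror the example above. Pass in whatever model identifiers your Copilot plan exposes.

```python
from pathlib import Path

def render_reviewer(name, description, models, lens):
    """Build the text of one reviewer .agent.md file."""
    model_lines = "\n".join(f"  - {model}" for model in models)
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "model:\n"
        f"{model_lines}\n"
        "tools: ['search', 'fetch', 'read']\n"
        "user-invokable: false\n"
        "---\n"
        f"You are a technical reviewer. Review through this lens: {lens}.\n"
        "Rate options as Strong/Moderate/Weak.\n"
        "Give a clear recommendation with reasoning.\n"
    )

def write_reviewers(directory, specs):
    """Write one <name>.agent.md file per spec and return the paths."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for spec in specs:
        path = target / f"{spec['name']}.agent.md"
        path.write_text(render_reviewer(**spec), encoding="utf-8")
        paths.append(path)
    return paths
```

Each spec is a dict with name, description, models, and lens keys; pointing write_reviewers at .github/agents/ would produce workspace-level agents.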
Step 3: Create the Orchestrator
---
name: multi-review
description: Multi-model review orchestrator.
tools: ['agent', 'search', 'fetch', 'read']
agents: ['review-mymodel1', 'review-mymodel2', 'review-mymodel3']
---
Dispatch the user's question to all reviewers in parallel.
Synthesize into: Consensus, Divergence, Comparison Table, Recommendation.
Step 4: Use It
@multi-review [your question with context]
What’s Next
The multi-agent ecosystem is moving fast. A few things I’m watching:
- Agent Client Protocol (ACP): open standard from GitHub + JetBrains + Zed for cross-IDE agent portability. Your .agent.md files could work in JetBrains IDEs without modification.
- Agent Skills: Anthropic’s open standard for reusable agent capabilities. Think npm packages but for agent behaviors.
- Extended context windows: Claude Code’s 1M token context enables agent sessions that span days. This changes what’s possible for long-running autonomous agents.
- Planner/worker/judge patterns: Cursor’s research on scaling to hundreds of parallel agents suggests a future where “run 100 agents on this codebase” is a normal workflow.
The role of the developer is shifting from “person who writes code” to “person who orchestrates AI agents that write code.” Multi-model review panels are an early, practical example of that shift — and the tools to build them are available today.
Built with GitHub Copilot in VS Code Insiders, running Claude Opus 4.6. Agent files available at github.com/nickmaassel.