Building an Evaluation Harness for VSCode Copilot Chat
Because your prompts deserve better than vibe-checks
You built a VSCode Copilot Chat prompt that extracts key information from server error logs. You paste a log entry, it returns structured data: error type, severity, and affected component. It works great for some logs, but fails on others. You tweak the prompt to fix the failing cases - now the ones that worked are broken. Without systematic testing, you’re playing whack-a-mole: every fix introduces new problems, and you can’t tell if you’re making progress or just moving issues around. Sound familiar?
While VSCode Copilot Chat is marketed for code interaction, it turns out that with Agent mode, custom tools, and reusable prompts, you can build agentic workflows with almost no code. The batteries are included - file operations, web access, and MCP tools - making it well-suited for rapid prototyping of AI workflows that may or may not interact with your codebase. But there’s a catch: while you can test your custom MCP servers and tools, there’s no way to evaluate the prompts that orchestrate them. You’re stuck with “it seems to work fine” - yet evals are essential for systematically improving AI solutions.
This is a real problem. Without evaluation, you can’t measure performance, catch regressions, or improve reliably. You’re left choosing between vibe checks and building a fully-fledged custom solution that could take weeks - effort that could be completely wasted if the workflow turns out to be less useful than expected. In this post, we’ll build a scrappy evaluation harness for VSCode Copilot Chat prompts that hopefully gets us 80% of the value with 20% of the effort. The code is available at github.com/bepuca/copilot-chat-eval.
The Problem
To build an evaluation harness for VSCode Copilot Chat prompts, we need to solve a specific challenge: how do we programmatically run the same prompt with different inputs and capture the results?
Let’s be concrete. Say you have a prompt that parses server logs. You want to:
- Feed it 20 different log entries from your test dataset
- Capture each extraction response
- Check if those extractions are correct
- Track performance over time
But VSCode Chat is designed for interactive use, not automation. There’s no obvious API to send a message and get a response programmatically. And here’s the crucial constraint: LLMs are sensitive to context, so our evaluation must mimic manual use exactly. If we test with a different context or invocation method, we’re not actually testing what users experience - and we don’t know what context differences might exist.
For simplicity, we’ll focus on single-turn interactions: one prompt, one response. Multi-turn conversations add complexity we could tackle later.
Our requirements are straightforward:
- Define a reusable prompt that accepts parameters
- Create a dataset of test inputs
- Run each input through Copilot Chat automatically
- Capture and save the responses for analysis
If we can do this, we can finally measure our prompts’ performance.
Exploring Our Options
Since VSCode doesn’t expose chat functionality through an API, we need to get creative. The primary way to programmatically interact with VSCode is by building an extension. After digging through the documentation, I found three potential approaches:
1. Build a Chat Extension: Create a chat participant that users invoke with `@participant`. This won’t work because:
   - It changes the user’s workflow (they’d have to type `@eval /myprompt` instead of just `/myprompt`).
   - Not supported in Agent mode (at least for now).
   - Different invocation = different LLM context = invalid evaluation. Off the table, then.
2. Use the Language Model API: Call Copilot’s models directly from our extension. Promising, but:
   - No custom system prompts allowed (might not match chat’s behavior).
   - Unclear if Agent mode and tools work through this API.
   - If we can’t replicate chat features, we’re not really testing the same thing.
3. Automate VSCode commands: Use the same commands that keybindings trigger to programmatically control the chat. A bit of a workaround, but:
   - Could reproduce exact user interactions.
   - Relies on somewhat undocumented behavior that might break.
   - No guarantees it’ll work or keep working.
Each of these options has significant drawbacks that make evaluation difficult.
With VSCode Copilot Chat going open source, the community might eventually build proper evaluation tools. But that could take months - and often something shipped today is better than perfection later.
Let’s take a practical approach and focus on what’s realistically achievable.
Testing the Language Model API
Option 2 (Language Model API) seems most promising - if it works, we get clean programmatic access. But there’s a critical question: does it use the same system prompt as the chat interface? If not, we’re evaluating something different from what users experience.
To find out, we need to see what prompts VSCode actually sends. Following an approach similar to Hamel’s, we’ll use mitmproxy to intercept VSCode’s API calls and examine the system prompts.
Setting Up mitmproxy
To set up mitmproxy to intercept VSCode’s API calls:
- Install: `uv tool install mitmproxy`, or follow the official instructions.
- Start the browser-based UI: `mitmweb`.
- Configure VSCode to use the proxy: add `"http.proxy": "http://127.0.0.1:8080"` to `settings.json` (see the fragment just after this list), or set the “Http: Proxy” setting in the UI.
- Trust the certificate so requests succeed: follow mitmproxy’s instructions for your OS.
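For reference, the proxy configuration from the list above is just a single entry in `settings.json`:

{
    "http.proxy": "http://127.0.0.1:8080"
}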
mitmproxy security note:
- `mitmproxy` creates a local user-specific CA certificate on first use. Keep that private key secure, since anyone who obtains it could decrypt your traffic.
- These instructions run the proxy on localhost, so all intercepted data stays on your machine, but configure only VSCode (not your entire system) to use the proxy so other applications aren’t inadvertently routed through it.
- Once you’re finished inspecting calls, remember to remove or disable the proxy and untrust the certificate, especially if you’re handling sensitive code or secrets in the editor.
What Chat Sends
With mitmproxy running, we can trigger a chat message and search for calls to `https://api.enterprise.githubcopilot.com/chat/completions`. Here’s the request structure:
{
"messages": [
{
"role": "system",
"content": "You are an AI programming assistant.\nWhen...",
"copilot_cache_control": {
"type": "ephemeral"
}
},
{
"role": "user",
"content": "<environment_info>\nThe user's ...",
"copilot_cache_control": {
"type": "ephemeral"
}
},
{
"role": "user",
"content": "<context>\nThe current date is...",
"copilot_cache_control": {
"type": "ephemeral"
}
}
],
"model": "claude-sonnet-4",
"temperature": 0,
"top_p": 1,
"max_tokens": 16000,
"tools": [
// many and irrelevant at this point
]
}
The raw JSON has escaped characters, but when formatted for readability, the key parts are shown below. Note that the details of the Chat system prompt are not critical; we just need to see whether it matches the Language Model API’s.
Key insights for evaluation:
- Temperature is 0 - good for reproducibility
- Environment info is injected - OS details make context vary between users
- Date is injected - same prompt on different days = different context
Perfect reproducibility is impossible since VSCode injects dynamic context. But that’s OK - we’re building a practical tool, not a perfect one. Some signal beats no signal.
What the Language Model API Sends
To see if the Language Model API uses the same system prompt as chat, I built a minimal VSCode extension that calls the API directly. The key part:
const [model] = await vscode.lm.selectChatModels({
    vendor: 'copilot',
    family: 'claude-sonnet-4'
});
const messages = [
    vscode.LanguageModelChatMessage.User('What is the meaning of life?')
];
const request = model.sendRequest(messages, {}, token);
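For completeness, the streamed reply can be read from the response; a minimal sketch continuing the snippet above (the `text` property is an async iterable of string fragments):

// Continuing the snippet above, inside the same async function:
// accumulate the streamed text fragments of the model's reply.
let reply = '';
for await (const fragment of (await request).text) {
    reply += fragment;
}
console.log(reply);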
When I ran this extension and checked mitmproxy, here’s what the Language Model API sends:
{
"messages": [
{
"role": "system",
"content": "Follow Microsoft content policies...",
},
{
"role": "user",
"content": "What is the meaning of life?"
}
],
"model": "claude-sonnet-4",
"temperature": 0.1,
"top_p": 1,
"max_tokens": 16000,
"n": 1,
"stream": true
}
Once again, formatting the system prompt for readability:
The verdict: completely different system prompts.
Chat gets a comprehensive system prompt with detailed instructions about tools, file editing, and workspace context. The Language Model API gets a minimal prompt focused on basic content policies.
Key differences:
- System prompt: Chat has ~200 lines of instructions, API has ~10 lines
- Context injection: Chat auto-adds environment info, API gives you full control
- Temperature: Chat uses 0, API defaults to 0.1
Bottom line: The Language Model API evaluates a different system than what users experience in chat. Our best remaining option now is option 3 - automating VSCode commands to control the actual chat interface.
Hacking VSCode Commands
Since the Language Model API won’t work, I needed to find a way to control VSCode’s chat interface programmatically. Most VSCode actions can be triggered via commands. While not all of them are explicitly documented, the rest can be found in `keybindings.json`.
I extracted all chat commands from VSCode’s keybindings with a simple grep:
grep -o 'workbench\.action\.chat\.[^"]*' keybindings.json | sort -u
This found 50+ commands:
After some experimentation, these are the key ones for automation:
- `workbench.action.chat.newChat` - Start a fresh conversation
- `workbench.action.chat.attachFile` - Attach the prompt file
- `workbench.action.chat.open` - Focus the chat and send a message
- `workbench.action.chat.export` - Export the chat to JSON
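As an alternative to grepping keybindings.json, the available command IDs can also be listed at runtime from inside an extension. A minimal sketch using the public `vscode.commands.getCommands` API (the prefix filter is just illustrative):

// Inside an async extension function; assumes the usual `import * as vscode from 'vscode'`.
// getCommands(true) filters out internal (underscore-prefixed) commands.
const allCommands = await vscode.commands.getCommands(true);
const chatCommands = allCommands.filter(id => id.startsWith('workbench.action.chat.'));
console.log(chatCommands.join('\n'));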
Building the Evaluation Loop
The automation flow is straightforward. For each record in the evaluation dataset:
- Start a new chat
- Attach the prompt file
- Send the test input
- Wait for completion
- Export and save results
Here’s the core loop:
const promptUri = vscode.window.activeTextEditor?.document.uri;
const resultsFile = await initResultsFile(root, promptUri);
for (const rec of records) {
    if (typeof rec.input !== 'string') continue;
    await vscode.commands.executeCommand('workbench.action.chat.newChat');
    await vscode.commands.executeCommand('workbench.action.chat.attachFile', promptUri);
    await vscode.commands.executeCommand('workbench.action.chat.open', rec.input);
    await sleep(rec.waitMs); // let the chat finish, as we cannot query the status
    await vscode.commands.executeCommand('workbench.action.chat.export');
    await sleep(500); // give VS Code time to write the file
    const exportDir = path.dirname(promptUri.fsPath); // default destination
    await collectAndAppendChatExport(exportDir, resultsFile);
}
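The `sleep`, `initResultsFile`, and `collectAndAppendChatExport` helpers referenced above are not shown; the real implementations live in the repo and may differ. Here is a rough sketch of how the first and last could look, assuming the newest .json file in the export directory is the export we just triggered and that the results file already holds a JSON array:

import * as fs from 'fs/promises';
import * as path from 'path';

// Sketch only - hypothetical versions of the helpers used in the loop above.

// Wait a fixed amount of time, since chat completion cannot be queried.
function sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// Append the chat export that VSCode just wrote to the results file.
// Assumes the newest .json in `exportDir` is the export we triggered and that
// `resultsFile` already contains a JSON array (e.g. created by initResultsFile).
async function collectAndAppendChatExport(exportDir: string, resultsFile: string): Promise<void> {
    const names = (await fs.readdir(exportDir)).filter(name => name.endsWith('.json'));
    if (names.length === 0) { return; }

    const stats = await Promise.all(
        names.map(async name => {
            const fullPath = path.join(exportDir, name);
            const { mtimeMs } = await fs.stat(fullPath);
            return { fullPath, mtimeMs };
        })
    );
    const newest = stats.sort((a, b) => b.mtimeMs - a.mtimeMs)[0];

    const exported = JSON.parse(await fs.readFile(newest.fullPath, 'utf8'));
    const results = JSON.parse(await fs.readFile(resultsFile, 'utf8'));
    results.push(exported);
    await fs.writeFile(resultsFile, JSON.stringify(results, null, 2));

    // Remove the export so the newest-file heuristic stays valid on the next iteration.
    await fs.unlink(newest.fullPath);
}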
Example: Country Capitals
Let’s test this with a simple prompt that takes a country name and returns its capital:
capital.prompt.md
---
mode: agent
tools: []
---
The user provides a country and you should answer with only the capital of that country.
The dataset format is simple - each test case needs:
- `input`: The message to send to the prompt
- `waitMs`: How long to wait for the response (we can’t detect completion)
dataset.json
[
{"input": "France", "waitMs": 4000, "capital": "Paris"},
{"input": "Japan", "waitMs": 4000, "capital": "Tokyo"},
{"input": "Spain", "waitMs": 4000, "capital": "Madrid"}
]
Running the extension processes each test case and exports the results to `.github/evals/<prompt>/<datetime>.json`. Then we can evaluate the responses:
eval_capital.py
import json
import sys
from pathlib import Path

dataset = json.loads(Path(sys.argv[1]).read_text())
results = json.loads(Path(sys.argv[2]).read_text())

correct = 0
for record, result_chat in zip(dataset, results):
    answer = result_chat["requests"][0]["response"][0]["value"]
    correct += 1 * (record["capital"] in answer)

accuracy = correct / len(dataset) * 100
print(f"Accuracy: {accuracy:.2f}% ({correct}/{len(dataset)})")
$ python eval_capital.py dataset.json .github/evals/capital/20250625-0840.json
Accuracy: 100.00% (3/3)
Seeing it in Action
The evaluation harness works! We can now systematically test our prompts and track performance over time.
While our example uses simple string matching, you can make evaluation as sophisticated as needed - LLM-as-judge for complex outputs or multi-dimensional scoring, for instance. The key is starting simple and iterating.
Conclusion
Yes, we can evaluate VSCode Copilot Chat workflows - with significant limitations. Our approach has rough edges:
- Sequential execution - Evaluations run one record at a time, no parallelization
- Fixed wait times - We must guess how long each prompt takes since there’s no way to query completion status, leading to either wasted time or incomplete responses
- Manual save dialog - Users must press Enter for each evaluation run since the export command doesn’t accept a file path
And it only works for a subset of prompts:
- Read-only - No side effects like file modifications or API calls. Running these in evaluation could cause real damage or spam external services.
- Stateless - Don’t depend on current workspace changes or git state. Reproducing “review my current changes” would require setting up different workspace states for each test case.
- Single-turn - One input, one output. Multi-turn conversations require simulating user responses, which adds significant complexity.
- Time-insensitive - VSCode injects the current date into prompts. If your workflow depends on “today’s date,” results will vary between evaluation runs.
Despite these constraints, many useful workflows still fit within them: generating code snippets, writing documentation, analyzing error messages, or converting data formats. And crucially, having imperfect evaluation beats having none at all.
Where this helps most: early experimentation. When you’re testing whether a prompt idea even works, this approach provides basic feedback without building a full custom solution.
The workflow becomes:
- Build your prompt in VSCode using familiar tools.
- Create a test dataset with representative inputs.
- Build an evaluation script to derive metrics.
- Run the evaluation harness to get systematic feedback.
- Iterate based on concrete results rather than guesswork.
- Make data-driven decisions about next steps.
This combination - fast workflow development in VSCode Copilot Chat + actual performance measurement - lets you quickly validate whether a workflow delivers real value and how reliable it is. With both pieces of data, you can make informed decisions about whether to invest in a custom solution.
Further work
This extension is a proof of concept that works by creatively leveraging VSCode’s existing mechanisms. It may be brittle and break as the platform evolves, especially at the current pace of AI tooling development. However, having a working blueprint makes iteration easier than starting from scratch.
With VSCode Chat going open source, there may be opportunities to build more robust evaluation tools with official support.
The capital cities example was deliberately simple. Agent mode prompts can reference tools via hashtags (e.g., `#search_repositories`, `#file_search`), and these tool-using, agentic workflows are where the possibilities expand significantly. The evaluation harness captures outputs regardless of which tools were invoked, making it just as applicable to complex workflows as simple ones.
References
- Repo with the code: bepuca/copilot-chat-eval
- VSCode Copilot Chat
- Agent mode
- Custom tools
- Reusable prompts
- Model Context Protocol (MCP)
- Your AI product needs Evals - Hamel’s Blog
- Hypothesis Driven Development for AI Products
- VSCode Language Model API
- Building a VSCode extension
- Chat Extension API
- VSCode Chat Copilot going open source
- Ugly Code and Dumb Things - Armin Ronacher
- F*** You, Show Me The Prompt - Hamel’s Blog
- mitmproxy
- VSCode commands
- Copilot Chat Eval Repository