[WIP] test: eval testing setup W-18964528 #102
What does this PR do?
Adds a basic setup for eval testing using vitest-evals: https://github.com/getsentry/vitest-evals
The setup intends to cover real-world user flows in natural language and assert that:
Regarding 4:
Here's the code that manages the "current context" (open file, locations, etc.) in VS Code Copilot:
https://github.com/microsoft/vscode-copilot-chat/blob/main/src/extension/prompts/node/panel/currentEditor.tsx
The goal is to inject some of the context that we assume would be available in an IDE, to mimic a real-world scenario. Examples:
- Inject an SFDX project path as the open project in the context: this is part of the system prompt that we can set dynamically. Think of a setup where a Dreamhouse project is created via testkit -> we pass the project dir as an env var and run the eval test.
- Current open dir/file: similar to the previous example, just inject a file path in the system prompt, for cases like "deploy this file", "run this Apex test class", etc. (see the sketch below).
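A minimal sketch of what that context injection could look like. The env var names (`SF_EVAL_PROJECT_DIR`, `SF_EVAL_OPEN_FILE`) and the `buildIdeContext` helper are illustrative, not part of this PR:

```ts
import path from "node:path";

// Hypothetical helper: builds the "IDE context" portion of the system prompt
// from env vars set by the test harness (e.g. a Dreamhouse project created
// via testkit). Variable names are placeholders.
export function buildIdeContext(): string {
  const projectDir = process.env.SF_EVAL_PROJECT_DIR; // injected by the eval setup
  const openFile = process.env.SF_EVAL_OPEN_FILE; // e.g. an Apex class path

  const lines: string[] = [];
  if (projectDir) {
    lines.push(`The user has the Salesforce DX project at ${projectDir} open in their IDE.`);
  }
  if (openFile) {
    lines.push(`The currently open file is ${path.resolve(projectDir ?? ".", openFile)}.`);
  }
  return lines.join("\n");
}
```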
Setup:
We use the Vercel AI SDK to set up a model that supports function calling and an MCP client.
User flows are written as tests; the runner sets the flow as the user prompt (along with our tools) and returns a processed response (the tools are called via the MCP client), which is then scored by another LLM call.
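A rough sketch of that runner wiring, assuming the AI SDK 4.x APIs (`generateText`, `experimental_createMCPClient`, the stdio transport) and a placeholder path to our MCP server entry point:

```ts
import { generateText, experimental_createMCPClient } from "ai";
import { Experimental_StdioMCPTransport } from "ai/mcp-stdio";
import { openai } from "@ai-sdk/openai";

// Hypothetical runner: connects an MCP client to our server over stdio,
// hands its tools to a function-calling model, and returns the final text
// plus the tool calls so the eval can inspect both.
export async function runTask(userPrompt: string, systemPrompt: string) {
  const mcpClient = await experimental_createMCPClient({
    transport: new Experimental_StdioMCPTransport({
      command: "node",
      args: ["bin/run.js"], // placeholder: path to the MCP server entry point
    }),
  });

  try {
    const tools = await mcpClient.tools();

    const { text, toolCalls } = await generateText({
      model: openai("gpt-4o"), // any function-calling model works here
      system: systemPrompt,
      prompt: userPrompt,
      tools,
      maxSteps: 10, // allow multi-step tool calling before the final answer
    });

    return { text, toolCalls };
  } finally {
    await mcpClient.close();
  }
}
```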
Example running the sf-query-org.eval.ts eval test: this uses a pre-seeded Dreamhouse project, runs the user flow, and then asserts that the response includes the expected record names in the org.
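For reference, a hypothetical reconstruction of what such an eval could look like, assuming vitest-evals' `describeEval` API and an LLM-based `Factuality` scorer from autoevals; `runTask` and `buildIdeContext` are the illustrative helpers sketched above, and the `expected` value is a placeholder rather than real org data:

```ts
import { describeEval } from "vitest-evals";
import { Factuality } from "autoevals";

import { runTask } from "./run-task"; // hypothetical module from the sketch above
import { buildIdeContext } from "./ide-context"; // hypothetical module from the sketch above

describeEval("sf-query-org", {
  data: async () => [
    {
      input: "List the names of all properties in my org",
      // Placeholder: the real test would list the record names seeded into
      // the Dreamhouse org.
      expected: "The response lists the property records seeded in the org.",
    },
  ],
  // The task returns the model's final answer, produced after the MCP tools
  // have been called.
  task: async (input) => {
    const { text } = await runTask(input, buildIdeContext());
    return text;
  },
  // Factuality is itself an LLM call that judges the output against expected.
  scorers: [Factuality],
  threshold: 0.6,
});
```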
Example with the invalid expected response (which has made-up names) uncommented:
Credits:
The implementation is based on the Sentry and Cloudflare MCPs:
What issues does this PR fix or reference?
@W-18964528@