[WIP] test: eval testing setup W-18964528 #102

Draft · wants to merge 1 commit into main
Conversation

cristiand391 (Member) commented on Jul 10, 2025

What does this PR do?

Adds a basic setup for eval testing using vitest-evals:
https://github.com/getsentry/vitest-evals

The setup aims to cover real-world user flows expressed in natural language and to assert that:

  1. our MCP tools are properly selected for the task (a bad tool description leads to the tool not being used)
  2. tool responses are evaluated and match the expected values/tone/instructions
  3. new tools don't collide with existing ones (two or more tools with very similar descriptions could confuse the LLM)
  4. the LLM properly connects the "IDE context" with what our tools might expect (contextual data like the open project dir, the current open file, etc.)

Regarding 4:
Here's the code that manages the "current context" (open file, locations, etc) in vscode copilot:
https://github.com/microsoft/vscode-copilot-chat/blob/main/src/extension/prompts/node/panel/currentEditor.tsx

The goal is to inject some of the context that we assume would be available in an IDE, to mimic a real-world scenario. Examples:

inject an sfdx project path as the open project in the context:

The current open project dir is "${process.env.SF_EVAL_PROMPT_PROJECT_DIR}"

This is the part of the system prompt that we can set dynamically: think of a setup where a dreamhouse project is created via testkit, then we pass the project dir as an env var and run the eval test (see the sketch below).
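
A minimal sketch of that flow, assuming testkit's `gitClone` project option (the env var name `SF_EVAL_PROMPT_PROJECT_DIR` is the one proposed in this PR; the dreamhouse repo URL is an assumption):

```ts
import { TestSession } from '@salesforce/cli-plugins-testkit';

// Clone a dreamhouse project into an isolated test session.
const session = await TestSession.create({
  project: {
    gitClone: 'https://github.com/trailheadapps/dreamhouse-lwc',
  },
});

// Expose the project dir so the eval runner can inject it into the system prompt.
process.env.SF_EVAL_PROMPT_PROJECT_DIR = session.project.dir;
```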

current open file

Similar to the previous example, we just inject a file path into the system prompt, like:

Current open file: /path/to/apex/class
Selected location: L10-L30

This covers cases like "deploy this file", "run this apex test class", etc.
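
Putting those pieces together, here's a rough sketch of how the IDE-context portion of the system prompt could be assembled (`buildSystemPrompt` and its options are hypothetical names for illustration):

```ts
// Hypothetical helper that mimics the IDE context vscode-copilot-chat provides.
function buildSystemPrompt(ctx: { projectDir?: string; openFile?: string; selection?: string }): string {
  const lines: string[] = [];
  if (ctx.projectDir) lines.push(`The current open project dir is "${ctx.projectDir}"`);
  if (ctx.openFile) lines.push(`Current open file: ${ctx.openFile}`);
  if (ctx.selection) lines.push(`Selected location: ${ctx.selection}`);
  return lines.join('\n');
}

// Mimic an IDE session: dreamhouse project open, an Apex class focused.
const systemPrompt = buildSystemPrompt({
  projectDir: process.env.SF_EVAL_PROMPT_PROJECT_DIR,
  openFile: '/path/to/apex/class',
  selection: 'L10-L30',
});
```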

Setup:

We use the Vercel AI SDK to set up a model that supports function calling, plus an MCP client.
User flows are written as tests; the runner sets each flow as the user prompt (along with our tools) and returns a processed response (tools get called through the MCP client), which is then scored by another LLM call.
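
A rough sketch of that wiring with the AI SDK's experimental MCP support (the model choice, the stdio command pointing at our MCP server, and `maxSteps` are assumptions):

```ts
import { generateText, experimental_createMCPClient } from 'ai';
import { Experimental_StdioMCPTransport } from 'ai/mcp-stdio';
import { openai } from '@ai-sdk/openai';

// Connect an MCP client to our server over stdio.
const mcpClient = await experimental_createMCPClient({
  transport: new Experimental_StdioMCPTransport({
    command: 'node',
    args: ['bin/run.js'], // assumed entry point for the MCP server
  }),
});

// Hand the MCP tools to a function-calling model and run the user flow.
const { text } = await generateText({
  model: openai('gpt-4o'),
  system: `The current open project dir is "${process.env.SF_EVAL_PROMPT_PROJECT_DIR}"`,
  prompt: 'List the name of the Property__c records in my org, ordered in ascending order by their name.',
  tools: await mcpClient.tools(),
  maxSteps: 10, // allow several tool-call rounds before the final answer
});

await mcpClient.close();
```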

Example running the sf-query-org.eval.ts eval test:

This uses a pre-seeded dreamhouse project and tests:

List the name of the Property__c records in my org, ordered in ascending order by their name.

It then asserts that the response includes the expected record names from the org.
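
A hedged sketch of what `sf-query-org.eval.ts` could look like (`runTask` is a hypothetical helper standing in for the model + MCP client wiring above; the `Factuality` scorer and the threshold are assumptions):

```ts
import { describeEval } from 'vitest-evals';
import { Factuality } from 'autoevals';

// Hypothetical helper wrapping the generateText + MCP client setup shown earlier.
declare function runTask(input: string): Promise<string>;

describeEval('sf-query-org', {
  data: async () => [
    {
      input:
        'List the name of the Property__c records in my org, ordered in ascending order by their name.',
      // Expected record names from the pre-seeded dreamhouse org (elided here).
      expected: '...',
    },
  ],
  task: async (input) => runTask(input),
  scorers: [Factuality],
  threshold: 0.6, // assumed passing score
});
```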

[Screenshot: passing eval run, 2025-07-10 16:08]

Example with the invalid expected response (made-up names) uncommented:
[Screenshot: failing eval run, 2025-07-10 16:21]

Credits:

The implementation is based on the Sentry and Cloudflare MCPs.

What issues does this PR fix or reference?

@W-18964528@

cristiand391 changed the title from "[WIP] test: eval testing setup" to "[WIP] test: eval testing setup W-18964528" on Jul 11, 2025