[WIP] test: eval testing setup W-18964528 #102
What does this PR do?
Adds a basic setup for eval testing using vitest-evals: https://github.com/getsentry/vitest-evals
The setup intends to cover real-world user flows in natural language and assert that:
Regarding 4:
Here's the code that manages the "current context" (open file, locations, etc.) in VS Code Copilot:
https://github.com/microsoft/vscode-copilot-chat/blob/main/src/extension/prompts/node/panel/currentEditor.tsx
The goal is to inject some of the context that we assume would be available in an IDE, to mimic a real-world scenario. Examples:
- Inject an SFDX project path as the open project in the context: this is part of the system prompt that we can set dynamically. Think of a setup where a Dreamhouse project is created via testkit -> we pass the project dir as an env var and run the eval test.
- Current open dir/file: similar to the previous example, just inject a file path in the system prompt, for cases like "deploy this file", "run this Apex test class", etc. (see the sketch below).
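A minimal sketch of what that context injection could look like. The env var names (`SF_EVAL_PROJECT_DIR`, `SF_EVAL_OPEN_FILE`) and the `buildIdeContext` helper are illustrative, not part of this PR:

```ts
import path from "node:path";

// Hypothetical helper: builds the "IDE context" portion of the system prompt
// from env vars set by the test harness (e.g. a Dreamhouse project created
// via testkit). Variable names are placeholders.
export function buildIdeContext(): string {
  const projectDir = process.env.SF_EVAL_PROJECT_DIR; // injected by the eval setup
  const openFile = process.env.SF_EVAL_OPEN_FILE; // e.g. an Apex class path

  const lines: string[] = [];
  if (projectDir) {
    lines.push(`The user has the Salesforce DX project at ${projectDir} open in their IDE.`);
  }
  if (openFile) {
    lines.push(`The currently open file is ${path.resolve(projectDir ?? ".", openFile)}.`);
  }
  return lines.join("\n");
}
```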
Setup:
We use the Vercel AI SDK to set up a model that supports function calling and an MCP client.
User flows are written as tests; the runner sets the flow as the user prompt (along with our tools) and returns a processed response (the tools are called via the MCP client), which is then scored by another LLM call.
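A rough sketch of that runner wiring, assuming the AI SDK 4.x APIs (`generateText`, `experimental_createMCPClient`, the stdio transport) and a placeholder path to our MCP server entry point:

```ts
import { generateText, experimental_createMCPClient } from "ai";
import { Experimental_StdioMCPTransport } from "ai/mcp-stdio";
import { openai } from "@ai-sdk/openai";

// Hypothetical runner: connects an MCP client to our server over stdio,
// hands its tools to a function-calling model, and returns the final text
// plus the tool calls so the eval can inspect both.
export async function runTask(userPrompt: string, systemPrompt: string) {
  const mcpClient = await experimental_createMCPClient({
    transport: new Experimental_StdioMCPTransport({
      command: "node",
      args: ["bin/run.js"], // placeholder: path to the MCP server entry point
    }),
  });

  try {
    const tools = await mcpClient.tools();

    const { text, toolCalls } = await generateText({
      model: openai("gpt-4o"), // any function-calling model works here
      system: systemPrompt,
      prompt: userPrompt,
      tools,
      maxSteps: 10, // allow multi-step tool calling before the final answer
    });

    return { text, toolCalls };
  } finally {
    await mcpClient.close();
  }
}
```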
Example running the sf-query-org.eval.ts eval test: this uses a pre-seeded Dreamhouse project, runs the user flow, and then asserts that the response includes the expected record names in the org.
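For reference, a hypothetical reconstruction of what such an eval could look like, assuming vitest-evals' `describeEval` API and an LLM-based `Factuality` scorer from autoevals; `runTask` and `buildIdeContext` are the illustrative helpers sketched above, and the `expected` value is a placeholder rather than real org data:

```ts
import { describeEval } from "vitest-evals";
import { Factuality } from "autoevals";

import { runTask } from "./run-task"; // hypothetical module from the sketch above
import { buildIdeContext } from "./ide-context"; // hypothetical module from the sketch above

describeEval("sf-query-org", {
  data: async () => [
    {
      input: "List the names of all properties in my org",
      // Placeholder: the real test would list the record names seeded into
      // the Dreamhouse org.
      expected: "The response lists the property records seeded in the org.",
    },
  ],
  // The task returns the model's final answer, produced after the MCP tools
  // have been called.
  task: async (input) => {
    const { text } = await runTask(input, buildIdeContext());
    return text;
  },
  // Factuality is itself an LLM call that judges the output against expected.
  scorers: [Factuality],
  threshold: 0.6,
});
```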
Example with the invalid expected response (which has made-up names) uncommented:
Credits:
The implementation is based on the Sentry and Cloudflare MCPs:
What issues does this PR fix or reference?
@W-18964528@