EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database-specific tasks. It currently provides a comprehensive set of tools and modules for evaluating models on NL2SQL tasks, including the ability to run and score DQL, DML, and DDL queries across multiple supported databases. Its modular, plug-and-play architecture allows you to seamlessly integrate custom components while leveraging a robust evaluation pipeline, result storage, scoring strategies, and dashboarding capabilities.
Follow the steps below to run EvalBench on your local VM.
Note: EvalBench requires Python 3.10 or higher.
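You can quickly confirm this before proceeding (the interpreter may be named differently on your system, e.g. `python3.10`):

```sh
# Should report Python 3.10 or higher.
python3 --version
```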
Clone the EvalBench repository from GitHub:
```sh
git clone git@github.com:GoogleCloudPlatform/evalbench.git
```
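If you don't have SSH keys set up for GitHub, cloning over HTTPS works as well (this is the standard HTTPS form of the same repository URL):

```sh
# HTTPS clone; no SSH key required.
git clone https://github.com/GoogleCloudPlatform/evalbench.git
```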
Navigate to the repository directory and create a virtual environment:
```sh
cd evalbench
python3 -m venv venv
source venv/bin/activate
```
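Optionally, confirm that the virtual environment is active and resolves to the expected interpreter (output paths will vary by machine):

```sh
# Both should point inside the venv/ directory created above.
which python
python --version
```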
Install the required Python dependencies:
```sh
pip install -r requirements.txt
```
Due to a proto conflict between google-cloud packages, you may need to force-reinstall googleapis-common-protos:
```sh
pip install --force-reinstall googleapis-common-protos==1.64.0
```
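To verify that the pinned version took effect (a quick sanity check; the exact version may change as requirements.txt evolves):

```sh
# Should print the version pinned above.
pip show googleapis-common-protos | grep -i '^version'
```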
If gcloud is not already installed, follow the steps in the gcloud installation guide.
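You can check whether gcloud is already available on your PATH with:

```sh
# Prints the installed Google Cloud CLI components and versions.
gcloud --version
```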
Then, authenticate using the Google Cloud CLI:
```sh
gcloud auth application-default login
```
This step sets up the necessary credentials for accessing Vertex AI resources in your GCP project.
You can set the GCP project ID and region globally using:
```sh
export EVAL_GCP_PROJECT_ID=your_project_id_here
export EVAL_GCP_PROJECT_REGION=your_region_here
```
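As an optional sanity check, you can confirm that both the credentials and the environment variables are in place (the token command prints an error if Application Default Credentials are not configured):

```sh
# Succeeds quietly if ADC is set up; prints an error otherwise.
gcloud auth application-default print-access-token > /dev/null && echo "ADC OK"

# The project and region EvalBench will use.
echo "Project: $EVAL_GCP_PROJECT_ID"
echo "Region:  $EVAL_GCP_PROJECT_REGION"
```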
For a quick start, let's run NL2SQL on some SQLite DQL queries.
- First, read through sqlite/run_dql.yaml to see the configuration settings the run will use (you can view it with the command below).
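From the repository root, you can view it directly (assuming the relative path referenced above):

```sh
# Print the example DQL run configuration.
cat sqlite/run_dql.yaml
```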
Now, configure your evaluation by setting the EVAL_CONFIG environment variable. For example, to run a configuration using the db_blog dataset on SQLite:
```sh
export EVAL_CONFIG=datasets/bat/example_run_config.yaml
```
Start the evaluation process using the provided shell script:
```sh
./evalbench/run.sh
```
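Putting the quick start together, a minimal end-to-end session looks roughly like this (project, region, and config values are placeholders; substitute your own):

```sh
# One-time setup: credentials and project settings.
gcloud auth application-default login
export EVAL_GCP_PROJECT_ID=your_project_id_here
export EVAL_GCP_PROJECT_REGION=your_region_here

# Point EvalBench at a run configuration and start the evaluation.
export EVAL_CONFIG=datasets/bat/example_run_config.yaml
./evalbench/run.sh
```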
EvalBench's architecture is built around a modular design that supports diverse evaluation needs:
- Modular and Plug-and-Play: Easily integrate custom scoring modules, data processors, and dashboard components.
- Flexible Evaluation Pipeline: Seamlessly run DQL, DML, and DDL tasks while using a consistent base pipeline.
- Result Storage and Reporting: Store results in various formats (e.g., CSV, BigQuery) and visualize performance with built-in dashboards.
- Customizability: Configure and extend EvalBench to measure the performance of GenAI workflows tailored to your specific requirements.
EvalBench lets you quickly create experiments and A/B test improvements (available when the BigQuery reporting mode is set in the run config).

This includes the ability to measure and quantify improvements on specific databases or dialects.

You can also dig deeper into the exact details of improvements and regressions, including highlighted changes, how they impacted the score, and, if the LLM rater is used, an LLM-annotated explanation of the scoring changes.

A complete guide to EvalBench's available functionality can be found in the run-config documentation.
Please explore the repository to learn more about customizing your evaluation workflows, integrating new metrics, and leveraging the full potential of EvalBench.
For additional documentation, examples, and support, please refer to the EvalBench documentation. Enjoy evaluating your GenAI models!