CLEAR (Comprehensive LLM Error Analysis and Reporting) is an interactive, open-source package for LLM-based error analysis. It helps surface meaningful, recurring issues in model outputs by combining automated evaluation with powerful visualization tools.
The workflow consists of two main phases:
- **Analysis**: Generates textual feedback for each instance, identifies system-level error categories from these critiques, and quantifies their frequencies.
- **Interactive Dashboard**: An intuitive dashboard provides a comprehensive view of model behavior. Users can:
  - Explore aggregate visualizations of identified issues
  - Apply dynamic filters to focus on specific error types or score ranges
  - Drill down into individual examples that illustrate specific failure patterns
CLEAR makes it easier to diagnose model shortcomings and prioritize targeted improvements.
You can run CLEAR as a full pipeline, or reuse specific stages (generation, evaluation, or just the UI).
Requires Python 3.10+ and the necessary credentials for a supported provider.
Install from source:

```bash
git clone https://github.com/IBM/CLEAR.git
cd CLEAR
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .
```

Or install the released package from PyPI:

```bash
pip install clear-eval
```
### Set provider type and credentials

CLEAR requires a supported LLM provider and credentials to run the analysis. See the supported providers and their required environment variables below.
⚠️ Using a private proxy or OpenAI deployment? You must configure your model names explicitly (see the example below). Otherwise, default model names are used automatically for supported providers.
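For example, explicit model names can be passed through the Python API described later in this README. This is a sketch only: the deployment names are placeholders, while `gen_model_name` and `eval_model_name` are the parameters documented in the arguments table further down.

```python
from clear_eval.analysis_runner import run_clear_eval_analysis

# Explicit model names for a private proxy / custom OpenAI deployment.
# Replace the placeholder identifiers with the names exposed by your deployment.
run_clear_eval_analysis(
    provider="openai",
    gen_model_name="my-deployment-gpt-4o",   # generator whose outputs are analyzed
    eval_model_name="my-deployment-gpt-4o",  # judge model used for evaluation
    data_path="my_data.csv",
    output_dir="results/my_run/",
)
```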
The sample dataset is a small subset of the GSM8K math problems. To run on the sample data with the default configuration, you simply have to set your provider and run:

```bash
run-clear-eval-analysis --provider=openai  # or rits, watsonx
```
This will:
- Run the full CLEAR pipeline
- Save results under `results/gsm8k/sample_output/`
Launch the dashboard:

```bash
run-clear-eval-dashboard
```

Or set the port with:

```bash
run-clear-eval-dashboard --port <port>
```
Then:
- Upload the generated ZIP file from `results/gsm8k/sample_output/`
- Explore issues, scores, filters, and drill into examples
Run the dashboard:

```bash
run-clear-eval-dashboard
```

Then load the pre-generated sample output: you can manually upload the sample .zip file located at
`<your-env>/site-packages/clear_eval/sample_data/gsm8k/analysis_results_gsm8k_default.zip`,
or download it directly from the GitHub repo.
CLEAR takes a CSV file as input, with each row representing a single instance to be evaluated.
| Column | Used When | Description |
|---|---|---|
| `id` | Always | Unique identifier for the instance |
| `model_input` | Always | Prompt provided to the generation model |
| `response` | Using pre-generated responses | Pre-generated model response (ignored if generation is enabled) |
| `ground_truth` | Performing reference-based analysis | Ground-truth answer for evaluation (optional) |
| others | `--input_columns` is used | Additional input columns to show in the dashboard (e.g. `question`) |
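For illustration, here is one way such a CSV might be assembled with pandas. This is a sketch only: the row content and the extra `question` column are invented for the example.

```python
import pandas as pd

# Illustrative CLEAR input file: `id` and `model_input` are always required;
# `response` matters only when generation is disabled, and `ground_truth`
# only when reference-based analysis is enabled.
rows = [
    {
        "id": "ex-1",
        "model_input": "Answer the following math problem: Natalia sold 48 clips in April "
                       "and half as many in May. How many clips did she sell in total?",
        "response": "72",       # used when perform_generation is False
        "ground_truth": "72",   # used when is_reference_based is True
        "question": "How many clips did Natalia sell in total?",  # surfaced via --input_columns
    },
]

pd.DataFrame(rows).to_csv("my_data.csv", index=False)
```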
CLEAR can be run via the CLI or Python API.
Each stage has its own entry point:
```bash
run-clear-eval-analysis --config_path path/to/config.yaml   # run the full pipeline
run-clear-eval-generation --config_path path/to/config.yaml # run generation only
run-clear-eval-evaluation --config_path path/to/config.yaml # assume generated responses are given; run evaluation only
```
- If `--config_path` is specified, all parameters are taken from the config unless explicitly overridden.
- CLI flags passed directly override the corresponding config values.
```python
from clear_eval.analysis_runner import run_clear_eval_analysis, run_clear_eval_generation, run_clear_eval_evaluation

run_clear_eval_analysis(
    config_path="configs/sample_run_config.yaml"
)
```
You may also pass overrides instead of using a config file:
```python
from clear_eval.analysis_runner import run_clear_eval_analysis

run_clear_eval_analysis(
    run_name="my_data",
    provider="openai",
    data_path="my_data.csv",
    gen_model_name="gpt-3.5-turbo",
    eval_model_name="gpt-4",
    output_dir="results/gsm8k/",
    perform_generation=False,
    input_columns=["question"]
)
```
```bash
run-clear-eval-dashboard
```

Upload the ZIP file generated in your `--output-dir` when prompted.
Arguments can be provided via:

- A YAML config file (`--config_path`)
- CLI flags
- Python function parameters (when using the API)
> ⚠️ **Boolean arguments** (`perform_generation`, `is_reference_based`, `resume_enabled`)
> These must be set explicitly to `true` or `false` in YAML, CLI, or Python.
> On the CLI, use `--flag True` or `--flag False` (case-insensitive).
> ⚠️ **Naming convention**
> Parameter names use `snake_case` in YAML and Python, but `--kebab-case` on the CLI.
> For example:
> - YAML: `perform_generation: true`
> - Python: `perform_generation=True`
> - CLI: `--perform-generation True`
| Argument | Description | Default |
|---|---|---|
| `--config_path` | Path to a YAML config file (all values loaded unless overridden by CLI args) | |
| `--run_name` | Unique run name (used in result file names) | |
| `--data_path` | Path to input CSV file | |
| `--output_dir` | Output directory to write results | |
| `--provider` | Model provider: `openai`, `watsonx`, `rits` | |
| `--eval_model_name` | Name of the judge model (e.g. `gpt-4o`) | |
| `--gen_model_name` | Name of the generator model to evaluate. If not running generation, the generator name to display. | |
| `--perform_generation` | Whether to generate responses or use the existing `response` column | `True` |
| `--is_reference_based` | Use reference-based evaluation (requires `ground_truth` column in input) | `False` |
| `--resume_enabled` | Whether to reuse intermediate outputs from previous runs stored in `output_dir` | `True` |
| `--evaluation_criteria` | Custom criteria dictionary for scoring individual records: `{"criteria_name1": "criteria_desc1", ...}`. Supported via YAML config and Python. | `None` |
| `--input_columns` | Comma-separated list of additional input fields (other than `model_input`) to appear in the results and dashboard (e.g. `question`) | `None` |
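Since `evaluation_criteria` is supported via a YAML config or the Python API, a call with custom criteria might look like the sketch below. The criteria names and descriptions are invented for illustration.

```python
from clear_eval.analysis_runner import run_clear_eval_analysis

# Custom per-record scoring criteria, passed as {"criteria_name": "criteria_description", ...}.
# Both criteria below are examples; define whatever matters for your task.
run_clear_eval_analysis(
    run_name="my_data",
    provider="openai",
    data_path="my_data.csv",
    output_dir="results/my_data/",
    perform_generation=False,
    evaluation_criteria={
        "correctness": "Does the response reach the correct final answer?",
        "reasoning_clarity": "Are the intermediate steps easy to follow?",
    },
)
```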
Depending on your selected `--provider`:

| Provider | Required Environment Variables |
|---|---|
| `openai` | `OPENAI_API_KEY`, [`OPENAI_API_BASE` if using a proxy] |
| `watsonx` | `WATSONX_APIKEY`, `WATSONX_URL`, `WATSONX_SPACE_ID` or `PROJECT_ID` |
| `rits` | `RITS_API_KEY` |
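If you prefer setting credentials from Python rather than exporting them in the shell, here is a minimal sketch for the `openai` provider. The key values are placeholders, and `OPENAI_API_BASE` is only needed behind a proxy.

```python
import os

# Credentials must be in place before the analysis starts.
os.environ["OPENAI_API_KEY"] = "<your-api-key>"                      # placeholder
# os.environ["OPENAI_API_BASE"] = "https://my-proxy.example.com/v1"  # only if using a proxy

from clear_eval.analysis_runner import run_clear_eval_analysis

run_clear_eval_analysis(config_path="configs/sample_run_config.yaml")
```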