
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models


Background 🧠

Large Language Models (LLMs) excel at numerous tasks but often struggle with confidence calibration — the alignment between their predicted confidence and actual accuracy. Models can be overconfident in incorrect answers or underconfident in correct ones, which poses significant challenges for critical applications.

Calibration in LLMs

Calibration errors occur when a model's predicted confidence diverges from its actual accuracy, potentially misleading users. For instance, a model might assert 80% confidence in a fact but be incorrect. By measuring the Expected Calibration Error (ECE), this project quantifies the discrepancy between a model's predicted confidence and its actual correctness.

Figure (calibration histogram): example of overconfidence in LLMs. The model predicts "Geoffrey Hinton" with 93% confidence for the wrong answer to a factual question.
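To make the ECE measurement above concrete, here is a minimal sketch of the standard equal-width-bin computation; the function name, binning choices, and example values are illustrative and not taken from this repository's code.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """ECE: weighted average over bins of |accuracy - mean confidence|."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)  # (lo, hi] bins
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap  # weight by fraction of samples in bin
        return ece

    # A 93%-confident wrong answer (as in the figure above) adds a large gap.
    print(expected_calibration_error([0.93, 0.80, 0.60], [0, 1, 1]))

Lower ECE means the model's stated confidence tracks its actual accuracy more closely.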

Supported Models

  • OpenAI Models:

    • gpt-4-turbo
    • gpt-4o-mini
    • gpt-4o
  • Groq Models:

    • llama-3.1-8b-instant
    • llama3-8b-8192
    • gemma2-9b-it

Note: The supported models list can be extended with minor code changes; a hypothetical sketch of such an extension follows.
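For illustration only, this is what a provider client with a model allowlist might look like; the actual interface lives in llms.py and may differ, and SUPPORTED_MODELS / query_model are invented names.

    # Hypothetical sketch -- not necessarily this repository's interface.
    from openai import OpenAI

    SUPPORTED_MODELS = {"gpt-4-turbo", "gpt-4o-mini", "gpt-4o"}  # extend here

    def query_model(model_name: str, prompt: str) -> str:
        """Send a single prompt to an OpenAI chat model and return its reply."""
        if model_name not in SUPPORTED_MODELS:
            raise ValueError(f"Unsupported model: {model_name}")
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

Adding a Groq model would follow the same pattern with the Groq client and GROQ_API_KEY.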

Installation 🛠️

Follow these steps to set up the project:

  1. Clone the Repository:

    git clone https://github.com/prateekchhikara/llms-calibration
    cd llms-calibration
  2. Create and Activate a Virtual Environment (Optional but Recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install Dependencies:

    pip install -r requirements.txt
  4. Set Up Environment Variables (export them so the Python scripts can read them; see the sketch below):

    export OPENAI_API_KEY="your_openai_api_key"
    export GROQ_API_KEY="your_groq_api_key"
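The README does not show how the code consumes these keys; a common Python pattern, assumed here, is to read them from the process environment at startup.

    import os

    # Assumption: the evaluation code looks these up before making API calls.
    openai_key = os.environ.get("OPENAI_API_KEY")
    groq_key = os.environ.get("GROQ_API_KEY")
    if not (openai_key or groq_key):
        raise RuntimeError("Set OPENAI_API_KEY and/or GROQ_API_KEY before running.")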

Usage 🚀

Basic Evaluation

Run an evaluation for a specific model using the following command:

python main.py \
    --model_name "gpt-4o-mini" \
    --start_index 0 \
    --end_index 20 \
    --results_dir "results/" \
    --approach "normal"
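For reference, a plausible argparse setup matching these flags is sketched below; main.py's actual definitions (defaults, help strings, allowed values for --approach) may differ.

    import argparse

    # Illustrative reconstruction of main.py's CLI; details are assumed.
    parser = argparse.ArgumentParser(description="Run a calibration evaluation.")
    parser.add_argument("--model_name", required=True, help="e.g. gpt-4o-mini")
    parser.add_argument("--start_index", type=int, default=0, help="first question index")
    parser.add_argument("--end_index", type=int, default=20, help="end of question range")
    parser.add_argument("--results_dir", default="results/", help="output directory")
    parser.add_argument("--approach", default="normal", help="prompting approach")
    args = parser.parse_args()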

Batch Processing

For processing multiple batches, use the provided shell script:

  1. Make the Script Executable:

    chmod +x run.sh
  2. Execute the Script:

    ./run.sh
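If you prefer Python over shell for batching, the loop below sketches the same idea: split the question range into fixed-size index windows and invoke main.py once per window. The batch size and total question count are assumptions; run.sh's actual contents may differ.

    import subprocess

    BATCH_SIZE = 20        # assumed window size
    TOTAL_QUESTIONS = 100  # assumption: replace with your dataset's size

    for start in range(0, TOTAL_QUESTIONS, BATCH_SIZE):
        subprocess.run(
            [
                "python", "main.py",
                "--model_name", "gpt-4o-mini",
                "--start_index", str(start),
                "--end_index", str(start + BATCH_SIZE),
                "--results_dir", "results/",
                "--approach", "normal",
            ],
            check=True,  # stop if a batch fails
        )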

Project Structure 📁

├── main.py                 # Main evaluation script
├── llms.py                 # LLM client implementations
├── utils.py                # Utility functions
├── prompts.py              # Evaluation prompts
├── run.sh                  # Batch processing script
├── generate_figures.ipynb  # Visualization notebook
└── README.md               # Project documentation

License 📄

This project is licensed under the MIT License.

Contact 📧

For questions and feedback, please contact me directly at [email protected].
