Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Large Language Models (LLMs) excel at numerous tasks but often struggle with confidence calibration — the alignment between their predicted confidence and actual accuracy. Models can be overconfident in incorrect answers or underconfident in correct ones, which poses significant challenges for critical applications.
Calibration errors occur when a model's predicted confidence diverges from its actual accuracy, potentially misleading users. For instance, a model might assert 80% confidence in a fact but be incorrect. By measuring the Expected Calibration Error (ECE), this project quantifies the discrepancy between a model's predicted confidence and its actual correctness.
Figure: Example of overconfidence in LLMs. The model predicts "Geoffrey Hinton" with 93% confidence as the wrong answer to a factual question.
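For reference, ECE bins predictions by confidence and averages the gap between each bin's accuracy and its mean confidence, weighted by bin size. Below is a minimal sketch of that computation, not the repository's exact implementation; it assumes you already have per-question confidences in [0, 1] and 0/1 correctness flags:

```python
# Minimal ECE sketch (assumed inputs: confidences in [0, 1], 0/1 correctness flags).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin fraction) * |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        # Bin i covers (edges[i], edges[i + 1]]; confidence 0.0 goes into the first bin.
        in_bin = (confidences > edges[i]) & (confidences <= edges[i + 1])
        if i == 0:
            in_bin |= confidences == 0.0
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# A model that is right half the time but always reports 90% confidence:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4
```

In this toy example, the gap between 90% stated confidence and 50% actual accuracy yields an ECE of roughly 0.4, exactly the kind of overconfidence the project measures.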
The following models are currently supported:

- OpenAI Models: `gpt-4-turbo`, `gpt-4o-mini`, `gpt-4o`
- Groq Models: `llama-3.1-8b-instant`, `llama3-8b-8192`, `gemma2-9b-it`
Note: The list of supported models can be extended with minor modifications to the code; one possible shape of that extension is sketched below.
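As a rough illustration only, the sketch below shows one way a model-to-provider dispatch could be organized using the official `openai` and `groq` Python SDKs. The model sets and the `get_completion` helper are hypothetical; the project's actual client code lives in `llms.py` and may differ:

```python
# Hypothetical dispatch sketch; the project's real client code is in llms.py.
from openai import OpenAI
from groq import Groq

OPENAI_MODELS = {"gpt-4-turbo", "gpt-4o-mini", "gpt-4o"}
GROQ_MODELS = {"llama-3.1-8b-instant", "llama3-8b-8192", "gemma2-9b-it"}

def get_completion(model_name: str, prompt: str) -> str:
    # Both SDKs expose the same OpenAI-style chat-completions interface,
    # so adding a new model is mostly a matter of extending the sets above.
    client = OpenAI() if model_name in OPENAI_MODELS else Groq()
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```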
Follow these steps to set up the project:
- Clone the Repository:

  ```bash
  git clone https://github.com/prateekchhikara/llms-calibration
  cd llms-calibration
  ```

- Create and Activate a Virtual Environment (Optional but Recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set Up Environment Variables (a quick way to confirm they are set is shown right after this list):

  ```bash
  OPENAI_API_KEY="your_openai_api_key"
  GROQ_API_KEY="your_groq_api_key"
  ```
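Before running anything, you can verify that both keys are visible to Python; this check assumes the scripts read them from the environment (e.g. via `os.environ`):

```python
# Assumes the API keys are exported as environment variables.
import os

for key in ("OPENAI_API_KEY", "GROQ_API_KEY"):
    print(key, "is set" if os.getenv(key) else "is MISSING")
```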
Run an evaluation for a specific model using the following command:
```bash
python main.py \
    --model_name "gpt-4o-mini" \
    --start_index 0 \
    --end_index 20 \
    --results_dir "results/" \
    --approach "normal"
```

For processing multiple batches, use the provided shell script (a sketch of the batching pattern it follows appears after these steps):
- Make the Script Executable:

  ```bash
  chmod +x run.sh
  ```

- Execute the Script:

  ```bash
  ./run.sh
  ```
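For intuition, the sketch below spells out the batching pattern that `run.sh` presumably follows: sweep the dataset in fixed-size index windows and invoke `main.py` once per window. The batch size and dataset length are hypothetical placeholders, not values taken from the repository:

```python
# Hypothetical batch driver illustrating the start/end-index sweep; run.sh is the
# supported way to do this, and its actual parameters may differ.
import subprocess

BATCH_SIZE = 20       # assumed window size
TOTAL_EXAMPLES = 100  # assumed dataset length

for start in range(0, TOTAL_EXAMPLES, BATCH_SIZE):
    end = min(start + BATCH_SIZE, TOTAL_EXAMPLES)
    subprocess.run(
        [
            "python", "main.py",
            "--model_name", "gpt-4o-mini",
            "--start_index", str(start),
            "--end_index", str(end),
            "--results_dir", "results/",
            "--approach", "normal",
        ],
        check=True,
    )
```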
The repository is organized as follows:

```
├── main.py                  # Main evaluation script
├── llms.py                  # LLM client implementations
├── utils.py                 # Utility functions
├── prompts.py               # Evaluation prompts
├── run.sh                   # Batch processing script
├── generate_figures.ipynb   # Visualization notebook
└── README.md                # Project documentation
```
This project is licensed under the MIT License.
For questions and feedback, please contact me directly at [email protected].