Trustworthy-Engineered-Autonomy-Lab/cot-confidence-swebench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

SWE-Bench Lite Evaluation Script

A simple script to test how well AI models can fix software bugs by generating git patches.

What This Script Does

This script tests AI models on real software bugs from the SWE-Bench Lite dataset. It:

  • Takes bug descriptions from real projects
  • Asks an AI model to generate a fix (as a git patch)
  • Tests whether the fix can actually be applied to the code
  • Reports how well the model performed
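In outline, each instance flows through a prompt-build, model-call, patch-check loop. A minimal sketch of the first and last pieces (function names here are illustrative, not the script's actual API, though `repo` and `problem_statement` are real SWE-Bench Lite fields):

```python
# Illustrative sketch; function names are hypothetical, but "repo" and
# "problem_statement" are real SWE-Bench Lite instance fields.

def build_prompt(instance):
    # Turn one SWE-Bench Lite instance into a model prompt.
    return (
        f"Repository: {instance['repo']}\n"
        f"Issue:\n{instance['problem_statement']}\n\n"
        "Respond with a unified diff (git patch) that fixes the issue."
    )

def record_result(instance_id, patch_applied, seconds):
    # One record per instance; records like this end up in
    # swebench_results.json.
    return {
        "instance_id": instance_id,
        "patch_applied": patch_applied,
        "processing_time": seconds,
    }
```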

Quick Start

1. Install Requirements

pip install requests datasets tqdm

2. Install and Start Ollama

# Install Ollama (if you haven't already)
# Download from: https://ollama.ai/

# Start Ollama service
ollama serve

# Pull a coding model
ollama pull qwen2.5-coder:7b
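The script drives the model through Ollama's HTTP API. If you want to exercise the endpoint directly from Python, a minimal non-streaming call to Ollama's documented `/api/generate` route looks roughly like this (the function name is just for illustration):

```python
import json
from urllib.request import Request, urlopen

def ollama_generate(prompt, model="qwen2.5-coder:7b",
                    base_url="http://localhost:11434"):
    # POST to Ollama's /api/generate endpoint; stream=False returns the
    # whole completion in one JSON object with a "response" field.
    req = Request(
        f"{base_url}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```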

3. Run the Evaluation

# Test with 3 bug fixes
python3 script.py --instances 3 --batch-size 1

# Test with 10 bug fixes
python3 script.py --instances 10 --batch-size 2

# Use a different model
python3 script.py --instances 5 --batch-size 1 --model codellama:13b

Understanding the Results

What You'll See

The script will show:

  • Progress bars for each bug being processed
  • Whether each patch was successfully applied
  • A summary at the end with success rates

Output Files

  • swebench_results.json: Detailed results for each bug
  • swebench_eval.log: Log file with technical details
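Once a run finishes, `swebench_results.json` can be post-processed with a few lines of Python. A sketch, assuming each record carries a boolean `patch_applied` field (check one record in your own results file, since the exact field names may differ between script versions):

```python
import json

def summarize(path="swebench_results.json"):
    # Count how many patches applied; "patch_applied" is an assumed
    # field name -- inspect one record in your results file first.
    with open(path) as f:
        results = json.load(f)
    applied = sum(1 for r in results if r.get("patch_applied"))
    return applied, len(results)
```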

Example Summary

==================================================
EVALUATION SUMMARY
==================================================
Total instances processed: 3
Patches successfully applied: 1 (33.3%)
Patches failed to apply: 2 (66.7%)
Average processing time: 39.58 seconds
==================================================

Common Issues and Solutions

"Cannot connect to Ollama"

  • Make sure Ollama is running: ollama serve
  • Verify the service is responding: curl http://localhost:11434/api/tags
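The same health check can be scripted. This mirrors the `curl` call above using only the standard library (the function name is illustrative):

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_up(base_url="http://localhost:11434", timeout=3):
    # Hit Ollama's /api/tags endpoint; any connection error or non-200
    # response means the service is not usable.
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```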

"Patch failed to apply"

  • This usually means the model generated an incomplete or malformed patch
  • Try using a better model or adjusting the prompt
  • Check the log file for specific error details
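To see for yourself whether a given patch would apply, you can dry-run it with `git apply --check`, which validates the patch against the working tree without modifying anything. A minimal wrapper (the function name is illustrative):

```python
import subprocess

def patch_applies(repo_dir, patch_text):
    # `git apply --check` exits 0 if the patch would apply cleanly and
    # nonzero otherwise; stderr carries git's specific complaint.
    proc = subprocess.run(
        ["git", "apply", "--check"],
        input=patch_text, text=True, cwd=repo_dir, capture_output=True,
    )
    return proc.returncode == 0
```

When it returns False, rerun without `capture_output` (or print `proc.stderr`) to see which hunk git rejected.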

Custom Models

# Use a different Ollama model
python3 script.py --model deepseek-coder:6.7b

# Use a different Ollama server
python3 script.py --ollama-url http://192.168.1.100:11434

Batch Processing

# Process more bugs at once (faster but uses more memory)
python3 script.py --instances 50 --batch-size 10

# Process one at a time (slower but more reliable)
python3 script.py --instances 10 --batch-size 1

Custom Output

# Save results to a different file
python3 script.py --output my_results.json

# Use a different working directory
python3 script.py --work-dir ./my_work_folder
