Trustworthy-Engineered-Autonomy-Lab/cot-confidence-swebench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

SWE-Bench Lite Evaluation Script

A simple script to test how well AI models can fix software bugs by generating git patches.

What This Script Does

This script tests AI models on real software bugs from the SWE-Bench Lite dataset. It:

  • Takes bug descriptions from real projects
  • Asks an AI model to generate a fix (as a git patch)
  • Tests whether the fix can actually be applied to the code
  • Reports how well the model performed
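In outline, each instance flows through a prompt-build, model-call, patch-check loop. A minimal sketch of the first and last pieces (function names here are illustrative, not the script's actual API, though `repo` and `problem_statement` are real SWE-Bench Lite fields):

```python
# Illustrative sketch; function names are hypothetical, but "repo" and
# "problem_statement" are real SWE-Bench Lite instance fields.

def build_prompt(instance):
    # Turn one SWE-Bench Lite instance into a model prompt.
    return (
        f"Repository: {instance['repo']}\n"
        f"Issue:\n{instance['problem_statement']}\n\n"
        "Respond with a unified diff (git patch) that fixes the issue."
    )

def record_result(instance_id, patch_applied, seconds):
    # One record per instance; records like this end up in
    # swebench_results.json.
    return {
        "instance_id": instance_id,
        "patch_applied": patch_applied,
        "processing_time": seconds,
    }
```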

Quick Start

1. Install Requirements

pip install requests datasets tqdm

2. Install and Start Ollama

# Install Ollama (if you haven't already)
# Download from: https://ollama.ai/

# Start Ollama service
ollama serve

# Pull a coding model
ollama pull qwen2.5-coder:7b
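The script drives the model through Ollama's HTTP API. If you want to exercise the endpoint directly from Python, a minimal non-streaming call to Ollama's documented `/api/generate` route looks roughly like this (the function name is just for illustration):

```python
import json
from urllib.request import Request, urlopen

def ollama_generate(prompt, model="qwen2.5-coder:7b",
                    base_url="http://localhost:11434"):
    # POST to Ollama's /api/generate endpoint; stream=False returns the
    # whole completion in one JSON object with a "response" field.
    req = Request(
        f"{base_url}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```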

3. Run the Evaluation

# Test with 3 bug fixes
python3 script.py --instances 3 --batch-size 1

# Test with 10 bug fixes
python3 script.py --instances 10 --batch-size 2

# Use a different model
python3 script.py --instances 5 --batch-size 1 --model codellama:13b

Understanding the Results

What You'll See

The script will show:

  • Progress bars for each bug being processed
  • Whether each patch was successfully applied
  • A summary at the end with success rates

Output Files

  • swebench_results.json: Detailed results for each bug
  • swebench_eval.log: Log file with technical details
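Once a run finishes, `swebench_results.json` can be post-processed with a few lines of Python. A sketch, assuming each record carries a boolean `patch_applied` field (check one record in your own results file, since the exact field names may differ between script versions):

```python
import json

def summarize(path="swebench_results.json"):
    # Count how many patches applied; "patch_applied" is an assumed
    # field name -- inspect one record in your results file first.
    with open(path) as f:
        results = json.load(f)
    applied = sum(1 for r in results if r.get("patch_applied"))
    return applied, len(results)
```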

Example Summary

==================================================
EVALUATION SUMMARY
==================================================
Total instances processed: 3
Patches successfully applied: 1 (33.3%)
Patches failed to apply: 2 (66.7%)
Average processing time: 39.58 seconds
==================================================

Common Issues and Solutions

"Cannot connect to Ollama"

  • Make sure Ollama is running: ollama serve
  • Verify the service is responding: curl http://localhost:11434/api/tags
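The same health check can be scripted. This mirrors the `curl` call above using only the standard library (the function name is illustrative):

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_up(base_url="http://localhost:11434", timeout=3):
    # Hit Ollama's /api/tags endpoint; any connection error or non-200
    # response means the service is not usable.
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```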

"Patch failed to apply"

  • This usually means the model generated an incomplete or malformed patch
  • Try using a better model or adjusting the prompt
  • Check the log file for specific error details
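To see for yourself whether a given patch would apply, you can dry-run it with `git apply --check`, which validates the patch against the working tree without modifying anything. A minimal wrapper (the function name is illustrative):

```python
import subprocess

def patch_applies(repo_dir, patch_text):
    # `git apply --check` exits 0 if the patch would apply cleanly and
    # nonzero otherwise; stderr carries git's specific complaint.
    proc = subprocess.run(
        ["git", "apply", "--check"],
        input=patch_text, text=True, cwd=repo_dir, capture_output=True,
    )
    return proc.returncode == 0
```

When it returns False, rerun without `capture_output` (or print `proc.stderr`) to see which hunk git rejected.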

Custom Models

# Use a different Ollama model
python3 script.py --model deepseek-coder:6.7b

# Use a different Ollama server
python3 script.py --ollama-url http://192.168.1.100:11434

Batch Processing

# Process more bugs at once (faster but uses more memory)
python3 script.py --instances 50 --batch-size 10

# Process one at a time (slower but more reliable)
python3 script.py --instances 10 --batch-size 1

Custom Output

# Save results to a different file
python3 script.py --output my_results.json

# Use a different working directory
python3 script.py --work-dir ./my_work_folder
