AI-powered model auditing agent with multi-agent debate for robust evaluation of machine learning models.
```bash
uv sync
uv run python main.py --model resnet50 --dataset CIFAR10 --weights path/to/weights.pth
```

Or with pip:

```bash
pip install -e .
python main.py --model resnet50 --dataset CIFAR10 --weights path/to/weights.pth
```
```bash
uv sync --extra medical  # or: pip install -e ".[medical]"
```
```bash
python main.py --model resnet50 --dataset CIFAR10 --weights models/model.pth

# ISIC skin lesion classification
python main.py --model siim-isic --dataset isic --weights models/isic/model.pth

# HAM10000 dataset
python main.py --model deepderm --dataset ham10000 --weights models/ham10000.pth
```
- `--subset N`: Use N samples for faster evaluation
- `--no-debate`: Disable multi-agent debate
- `--single-agent`: Use a single agent instead of multi-agent debate
- `--device`: Specify the device (`cpu`, `cuda`, `mps`)
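These flags can be combined on one command line. As a rough sketch of how a parser for them might be wired up with `argparse` (this is an illustrative reconstruction inferred from the flag list above, not the project's actual `main.py`):

```python
import argparse

# Hypothetical parser mirroring the CLI flags documented above;
# the real main.py may define defaults and choices differently.
parser = argparse.ArgumentParser(description="Model auditing agent")
parser.add_argument("--model", required=True)
parser.add_argument("--dataset", required=True)
parser.add_argument("--weights", required=True)
parser.add_argument("--subset", type=int, default=None,
                    help="Use N samples for faster evaluation")
parser.add_argument("--no-debate", action="store_true",
                    help="Disable multi-agent debate")
parser.add_argument("--single-agent", action="store_true",
                    help="Use a single agent instead of multi-agent debate")
parser.add_argument("--device", choices=["cpu", "cuda", "mps"], default="cpu",
                    help="Specify the device")

# Example: a quick, debate-free audit on a 100-sample subset
args = parser.parse_args(
    ["--model", "resnet50", "--dataset", "CIFAR10",
     "--weights", "models/model.pth", "--subset", "100", "--no-debate"]
)
print(args.subset, args.no_debate, args.device)  # 100 True cpu
```

Note that argparse converts `--no-debate` to the attribute `args.no_debate`.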
Set your API keys:

```bash
export ANTHROPIC_API_KEY="your-key"
export OPENAI_API_KEY="your-key"  # if using non-Anthropic models
```
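The agent reads these keys from the environment at runtime. A minimal sketch of how such a lookup typically works (`get_api_key` is an illustrative helper, not necessarily this repo's actual code):

```python
import os

def get_api_key(provider: str) -> str:
    """Look up the API key for a provider, failing loudly if it is unset.
    Hypothetical helper for illustration; main.py may handle this differently."""
    var = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}[provider]
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running main.py")
    return key

# Demonstration only: seed a placeholder value if the variable is unset
os.environ.setdefault("ANTHROPIC_API_KEY", "your-key")
print(get_api_key("anthropic"))
```

Failing early with a clear message beats letting an API client raise a cryptic authentication error mid-audit.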
- `main.py` - Interactive model auditor with multi-agent debate
- `testbench.py` - Automated evaluation script
- `utils/agent.py` - Multi-agent conversation system
- `architectures/` - Custom model architectures
- `prompts/` - System prompts for different evaluation phases
- `models/` - Pre-trained model weights
- `results/` - Evaluation results and conversation logs