A benchmark suite for evaluating machine learning models on antibody biophysical property prediction.
This repository provides:
- Models for predicting antibody developability properties
- Standardized evaluation framework with consistent metrics
- Pre-computed features from various computational tools
- Benchmark dataset (GDPa1) with measured biophysical properties
Each model is an isolated Pixi project with its own dependencies and lockfile, ensuring reproducibility.
- Install Pixi (if not already installed):
```bash
curl -fsSL https://pixi.sh/install.sh | bash
```

- Clone the repository:

```bash
git clone <repository-url>
cd abdev-benchmark
```

Each model follows a standard train/predict workflow. For example, with TAP Linear:
```bash
cd models/tap_linear
pixi install

# Train the model
pixi run python -m tap_linear train \
  --data ../../data/GDPa1_v1.2_20250814.csv \
  --run-dir ./runs/my_run

# Generate predictions
pixi run python -m tap_linear predict \
  --data ../../data/GDPa1_v1.2_20250814.csv \
  --run-dir ./runs/my_run \
  --out-dir ./outputs/train

# Predict on heldout data
pixi run python -m tap_linear predict \
  --data ../../data/heldout-set-sequences.csv \
  --run-dir ./runs/my_run \
  --out-dir ./outputs/heldout
```

All models implement the same BaseModel interface with train() and predict() methods.
Note: Models train on ALL provided data. The orchestrator handles data splitting for cross-validation.
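For intuition, the division of labor looks roughly like this (an illustrative sketch, not the orchestrator's actual code: the real run uses the predefined GDPa1 folds rather than a random `KFold`, and `YourModel` stands in for any BaseModel implementation):

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("data/GDPa1_v1.2_20250814.csv")
for fold, (train_idx, test_idx) in enumerate(
    KFold(n_splits=5, shuffle=True, random_state=42).split(df)
):
    run_dir = Path(f"runs/fold_{fold}")
    model = YourModel()                                # any BaseModel implementation
    model.train(df.iloc[train_idx], run_dir)           # model fits ALL rows it receives
    preds = model.predict(df.iloc[test_idx], run_dir)  # score the held-out fold
```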
To train, predict, and evaluate all models:
```bash
pixi run all
```

This orchestrator will:
- Automatically discover all models (directories with a `pixi.toml` in `models/`)
- Install dependencies for each model
- Train models with 5-fold cross-validation on GDPa1
- Generate predictions for both CV and heldout test sets
- Evaluate predictions and compute metrics (Spearman, Top 10% Recall)
- Display summary tables with results
- Save artifacts to `outputs/models/`, `outputs/predictions/`, and `outputs/evaluation/`
After running all models, you'll see a summary table like this:
These results use the predefined folds in the GDPa1 training set; metrics are reported on the corresponding held-out test folds.
Spearman ρ (Cross-validation fold average)
| Model | AC-SINS_pH7.4 | HIC | PR_CHO | Titer | Tm2 |
|---|---|---|---|---|---|
| esm2_tap_ridge | 0.480 | 0.420 | 0.413 | 0.221 | 0.265 |
| ablang2_elastic_net | 0.509 | 0.461 | 0.362 | 0.356 | 0.101 |
| moe_baseline | 0.424 | 0.656 | 0.353 | 0.184 | 0.107 |
| esm2_tap_rf | 0.339 | 0.310 | 0.327 | 0.223 | 0.303 |
| esm2_ridge | 0.420 | 0.416 | 0.420 | 0.180 | -0.098 |
| deepsp_ridge | 0.348 | 0.531 | 0.257 | 0.114 | 0.073 |
| esm2_tap_xgb | 0.304 | 0.262 | 0.256 | 0.147 | 0.328 |
| piggen | 0.388 | 0.346 | 0.424 | 0.238 | -0.119 |
| onehot_ridge | 0.230 | 0.233 | 0.204 | 0.193 | -0.006 |
| tap_single_features | 0.327 | 0.231 | 0.074 | 0.126 | — |
| tap_linear | 0.294 | 0.222 | 0.136 | 0.113 | -0.115 |
| aggrescan3d | — | 0.404 | 0.112 | — | — |
| saprot_vh | — | — | 0.289 | — | 0.162 |
| antifold | — | — | — | 0.194 | 0.084 |
| deepviscosity | — | 0.176 | — | — | — |
| random_predictor | -0.026 | 0.002 | -0.081 | 0.068 | -0.000 |
Options:
```bash
pixi run all                     # Full workflow (train + predict + eval)
pixi run fast-only               # Subset of models that train and evaluate quickly
pixi run all-skip-train          # Skip training (use existing models)
pixi run all-skip-eval           # Skip evaluation step
python run_all_models.py --help  # See all options
```

Note: Some models are compute heavy and run hyperparameter sweeps as part of their training process (e.g. `moe_baseline`). For experimentation, it may be advantageous to create a new config covering only the models of interest (an example sketch follows below).
You can customize behavior via config files in `configs/`:

```bash
python run_all_models.py --config configs/custom.toml
```
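For example, a custom config restricting a run to a fast subset of models might look like the sketch below. The key names here are hypothetical; mirror an existing file in `configs/` for the actual schema:

```toml
# Hypothetical keys - check an existing config in configs/ before copying.
models = ["tap_linear", "random_predictor"]  # subset of models/ to run
skip_train = false                           # set true to reuse existing runs
```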
A subset of models has been evaluated on the heldout test set, with the following results:

Spearman ρ (Heldout test set)
| Model | AC-SINS_pH7.4 | HIC | PR_CHO | Titer | Tm2 |
|---|---|---|---|---|---|
| ablang2_elastic_net | 0.220 | 0.356 | 0.159 | 0.283 | -0.095 |
| esm2_tap_ridge | 0.084 | 0.407 | 0.160 | -0.041 | 0.205 |
| moe_baseline | 0.103 | 0.495 | 0.081 | 0.110 | -0.140 |
| esm2_tap_xgb | 0.089 | 0.197 | 0.053 | 0.056 | 0.102 |
| esm2_ridge | 0.066 | 0.403 | 0.024 | 0.045 | -0.058 |
| tap_linear | 0.032 | 0.348 | 0.136 | 0.063 | -0.107 |
| piggen | 0.061 | 0.406 | 0.005 | -0.067 | -0.027 |
| deepsp_ridge | -0.028 | 0.404 | -0.042 | 0.129 | -0.111 |
| esm2_tap_rf | 0.068 | 0.339 | -0.003 | -0.092 | 0.012 |
| onehot_ridge | -0.114 | 0.273 | -0.157 | 0.010 | -0.115 |
| random_predictor | -0.029 | -0.191 | 0.065 | 0.131 | -0.277 |
| tap_single_features | -0.161 | 0.050 | -0.074 | -0.020 | — |
| aggrescan3d | — | 0.535 | 0.006 | — | — |
| antifold | — | — | — | 0.134 | -0.016 |
| deepviscosity | — | — | — | — | — |
| saprot_vh | — | — | — | — | — |
```
abdev-benchmark/
├── models/                 # Models (each is a Pixi project)
│   └── random_predictor/   # E.g. random model (performance floor)
├── libs/
│   └── abdev_core/         # Shared utilities, base classes, and evaluation
├── configs/                # Configuration files for orchestrator
├── data/                   # Benchmark datasets and precomputed features
├── outputs/                # Generated outputs (models, predictions, evaluation)
│   ├── models/             # Trained model artifacts
│   ├── predictions/        # Generated predictions
│   └── evaluation/         # Evaluation metrics
└── pixi.toml               # Root environment with orchestrator dependencies
```
| Model | Description | Trains Model | Data Source |
|---|---|---|---|
| moe_baseline | Ridge/MLP on MOE molecular descriptors | Yes | MOE features |
| ablang2_elastic_net | ElasticNet on AbLang2 paired embeddings | Yes | Sequences (AbLang2 model) |
| esm2_tap_ridge | Ridge on ESM2-PCA + TAP + subtypes | Yes | Sequences (ESM2 model) + TAP features |
| esm2_tap_rf | Random Forest on ESM2-PCA + TAP + subtypes | Yes | Sequences (ESM2 model) + TAP features |
| esm2_tap_xgb | XGBoost on ESM2-PCA + TAP + subtypes | Yes | Sequences (ESM2 model) + TAP features |
| esm2_ridge | Ridge regression on ESM2 embeddings | Yes | Sequences (ESM2 model) |
| deepsp_ridge | Ridge regression on DeepSP spatial features computed on-the-fly | Yes | Sequences (DeepSP model) |
| tap_linear | Ridge regression on TAP descriptors | Yes | TAP features |
| piggen | Ridge regression on p-IgGen embeddings | Yes | Sequences (p-IgGen model) |
| tap_single_features | Individual TAP features as predictors | No | TAP features |
| aggrescan3d | Aggregation propensity from structure | No | Tamarind |
| antifold | Antibody stability predictions | No | Tamarind (with AntiBodyBuilder3 predicted structures) |
| saprot_vh | Protein language model features | No | Tamarind |
| deepviscosity | Viscosity predictions | No | Tamarind |
| random_predictor | Random predictions (baseline floor) | No | None |
All models implement the BaseModel interface with standardized train() and predict() commands. See individual model READMEs for details.
| Baseline | Extra info |
|---|---|
| Tamarind models | The models above were run on Tamarind.bio, using either VH/VL sequences or predicted structures as input |
| AntiBodyBuilder3 predicted structures | |
| MOE predicted structures | MOE's antibody modeler selects the best-matching framework in the PDB (by %ID) and the most sequence-similar PDB template for each CDR. It constructs a chimeric template from this combination (filtering templates that cause issues such as clashes), applies the mutations with exhaustive sidechain packing, and energy-minimizes the model with Amber19, using a protocol designed to maximize reproducibility and preserve the experimental backbone coordinates. |
The benchmark evaluates predictions for 5 biophysical properties:
- HIC: Hydrophobic Interaction Chromatography retention time (lower is better)
- Tm2: Second melting temperature in °C (higher is better)
- Titer: Expression titer in mg/L (higher is better)
- PR_CHO: Polyreactivity CHO (lower is better)
- AC-SINS_pH7.4: Self-interaction at pH 7.4 (lower is better)
For each property:
- Spearman correlation: Rank correlation between predicted and true values
- Top 10% recall: Fraction of true top 10% captured in predicted top 10%
For cross-validation datasets, metrics are averaged across 5 folds.
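As a minimal sketch (not the benchmark's own implementation, which lives in `libs/abdev_core`), the two metrics could be computed like this; the `lower_is_better` flag and the sample values are assumptions, handling properties where smaller measured values are desirable:

```python
import numpy as np
from scipy.stats import spearmanr

def top_decile_recall(y_true, y_pred, lower_is_better=False):
    """Fraction of the true top-10% antibodies recovered in the predicted top 10%."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if lower_is_better:
        # Flip sign so "best" is always the largest value
        y_true, y_pred = -y_true, -y_pred
    n_top = max(1, int(np.ceil(0.10 * len(y_true))))
    true_top = set(np.argsort(y_true)[-n_top:])   # indices of the best true values
    pred_top = set(np.argsort(y_pred)[-n_top:])   # indices of the best predicted values
    return len(true_top & pred_top) / n_top

# Hypothetical Tm2 values (°C): higher is better, so no flipping needed
y_true = np.array([63.1, 70.2, 68.4, 75.0, 66.3, 71.8, 69.0, 64.2, 72.5, 67.1])
y_pred = np.array([64.0, 69.5, 70.1, 73.2, 65.8, 72.0, 68.3, 63.5, 71.9, 66.4])

rho, _ = spearmanr(y_true, y_pred)          # rank correlation
recall = top_decile_recall(y_true, y_pred)  # fraction of true top 10% recovered
print(f"Spearman rho = {rho:.3f}, top 10% recall = {recall:.2f}")
```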
Prediction CSVs must contain:
- `antibody_name` (required)
- One or more property columns (HIC, Tm2, Titer, PR_CHO, AC-SINS_pH7.4)
See data/schema/README.md for detailed format specifications.
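For illustration, a minimal prediction file could look like this (antibody names and values are hypothetical):

```
antibody_name,HIC,Tm2
Ab-001,9.87,71.2
Ab-002,10.43,68.9
Ab-003,8.95,73.5
```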
Prediction format validation is handled automatically by the orchestrator using abdev_core.validate_prediction_format().
All models must implement the BaseModel interface with train() and predict() methods.
- Create the directory structure:

```bash
mkdir -p models/your_model/src/your_model
```

- Create `pixi.toml` with dependencies:

```toml
[workspace]
name = "your-model"
version = "0.1.0"
channels = ["conda-forge"]
platforms = ["linux-64", "osx-64", "osx-arm64"]

[dependencies]
python = "3.11.*"
pandas = ">=2.0"
typer = ">=0.9"

[pypi-dependencies]
abdev-core = { path = "../../libs/abdev_core", editable = true }
your-model = { path = ".", editable = true }
```

- Create `pyproject.toml` for package metadata.

- Implement `src/your_model/model.py`:

```python
from pathlib import Path

import pandas as pd

from abdev_core import BaseModel


class YourModel(BaseModel):
    def train(self, df: pd.DataFrame, run_dir: Path, *, seed: int = 42) -> None:
        """Train model on ALL provided data and save artifacts to run_dir."""
        # Train on ALL samples in df (no internal CV)
        # Your training logic here
        pass

    def predict(self, df: pd.DataFrame, run_dir: Path) -> pd.DataFrame:
        """Generate predictions for ALL provided samples.

        Returns:
            DataFrame with predictions. Orchestrator handles saving to file.
        """
        # Predict on ALL samples in df
        # Your prediction logic here
        # Return DataFrame (don't save to disk - orchestrator handles I/O)
        return df_with_predictions
```

- Create `src/your_model/run.py`:

```python
from abdev_core import create_cli_app

from .model import YourModel

app = create_cli_app(YourModel, "Your Model")

if __name__ == "__main__":
    app()
```

- Create `src/your_model/__main__.py`:

```python
from .run import app

if __name__ == "__main__":
    app()
```

- Add a `README.md` documenting your approach.

- Test your model:

```bash
# From repository root
python tests/test_model_contract.py --model your_model

# Or test train/predict manually
cd models/your_model
pixi install
pixi run python -m your_model train --data ../../data/GDPa1_v1.2_20250814.csv --run-dir ./test_run
pixi run python -m your_model predict --data ../../data/GDPa1_v1.2_20250814.csv --run-dir ./test_run --out-dir ./test_out
```
See models/random_predictor/ for a complete minimal example.
Validate that all models implement the train/predict contract correctly:
```bash
# Install dev environment dependencies (includes pytest)
pixi install -e dev

# Test all models
pixi run -e dev test-contract

# Or run with options
pixi run -e dev python tests/test_model_contract.py --model tap_linear  # Test a specific model
pixi run -e dev python tests/test_model_contract.py --skip-train        # Skip training step
pixi run -e dev python tests/test_model_contract.py --help              # See all options
```

This test script validates:
- Train command executes successfully and creates artifacts
- Predict command works on both training and heldout data
- Output predictions follow the required CSV format
- All required columns are present
Note: The test script uses pixi run to activate each model's environment, matching how the orchestrator runs models.
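The mechanism is roughly as follows (an illustrative sketch, not the test script's actual code):

```python
import subprocess

# Run a model's documented CLI inside its own Pixi environment,
# matching how the orchestrator and test script invoke models.
subprocess.run(
    ["pixi", "run", "python", "-m", "tap_linear", "train",
     "--data", "../../data/GDPa1_v1.2_20250814.csv",
     "--run-dir", "./runs/contract_test"],
    cwd="models/tap_linear",
    check=True,  # raise CalledProcessError if the train command fails
)
```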
If you use this benchmark, please cite:
[Citation information to be added]
- Tamarind.bio: Computed features for Aggrescan3D, AntiFold, BALM_Paired, DeepSP, DeepViscosity, Saprot, TEMPRO, TAP
- Nels Thorsteinsen: MOE structure predictions
- Contributors to individual model methods (see model READMEs)
This project is licensed under the MIT License - see the LICENSE file for details.
Note: Datasets and individual model implementations may have their own licenses and terms of use. Please refer to the specific documentation in each model directory and the data/ directory for details.