A Snakemake-based pipeline for benchmarking sample-level prediction methods from single-cell data, with a focus on Multiple Instance Learning (MultiMIL) and other machine learning approaches.
The pipeline compares the following methods for sample-level prediction from single-cell data:
- MultiMIL, classification and (ordinal) regression
- Random Forest (RF)
- Multi-class Regression (MR)
- Feed-forward Neural Network (NN)
Performance is compared across the following input representations:
- pseudobulk
- cell type-aware pseudobulk
- cell type frequency vectors
- cell embeddings
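To make the first three representations listed above concrete, here is a minimal sketch on toy data using only the Python standard library. It is not the pipeline's implementation: the function names, the toy `cells` records, and the choice of the mean as the aggregation (rather than, say, summed counts) are all assumptions made for illustration. Cell embeddings are not covered because they depend on an upstream embedding model.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one record per cell with a sample ID,
# a cell-type label, and a two-gene expression vector.
cells = [
    {"sample": "s1", "cell_type": "T", "expr": [1.0, 0.0]},
    {"sample": "s1", "cell_type": "T", "expr": [3.0, 2.0]},
    {"sample": "s1", "cell_type": "B", "expr": [0.0, 4.0]},
    {"sample": "s2", "cell_type": "B", "expr": [2.0, 2.0]},
]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pseudobulk(cells):
    """One vector per sample: mean expression over all of its cells (assumed aggregation)."""
    by_sample = defaultdict(list)
    for cell in cells:
        by_sample[cell["sample"]].append(cell["expr"])
    return {sample: mean_vector(vecs) for sample, vecs in by_sample.items()}


def cell_type_pseudobulk(cells, cell_types, n_genes):
    """One vector per sample: per-cell-type means concatenated in a fixed order.

    Cell types absent from a sample are filled with zeros (an assumption).
    """
    grouped = defaultdict(list)
    for cell in cells:
        grouped[(cell["sample"], cell["cell_type"])].append(cell["expr"])
    result = {}
    for sample in sorted({cell["sample"] for cell in cells}):
        row = []
        for cell_type in cell_types:
            vecs = grouped.get((sample, cell_type))
            row.extend(mean_vector(vecs) if vecs else [0.0] * n_genes)
        result[sample] = row
    return result


def frequency_vector(cells, cell_types):
    """One vector per sample: fraction of its cells belonging to each cell type."""
    counts = defaultdict(Counter)
    for cell in cells:
        counts[cell["sample"]][cell["cell_type"]] += 1
    return {
        sample: [count[ct] / sum(count.values()) for ct in cell_types]
        for sample, count in counts.items()
    }
```

Each function returns one fixed-length vector per sample, which is what lets a standard classifier or regressor be trained on samples rather than on individual cells.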
- Modular Design: Easy to add new methods and datasets
- Reproducible: Snakemake ensures reproducible workflows
- Comprehensive Evaluation: Multiple metrics and visualization options
- Conda Environment: Isolated environment management
The repository includes ready-to-use example data and configuration files.
- Clone and set up:
git clone https://github.com/theislab/sample-prediction-pipeline.git
cd sample-prediction-pipeline
mamba env create -f envs/sample_prediction_pipeline.yaml
mamba activate sample_prediction_pipeline
- Run the example:
snakemake --cores 1
This will run all methods on the example dataset with minimal test parameters and produce results in data/reports/.
For detailed installation instructions, configuration options, and advanced usage, see the Complete Guide.
- multimil: MultiMIL for sample-level classification prediction
- multimil_reg: MultiMIL for sample-level regression prediction
- pb_rf: Random Forest on pseudobulk
- pb_nn: Neural Network on pseudobulk
- pb_mr: Multi-class Regression on pseudobulk
- ct_pb_rf: Random Forest on cell type-aware pseudobulk
- ct_pb_nn: Neural Network on cell type-aware pseudobulk
- ct_pb_mr: Multi-class Regression on cell type-aware pseudobulk
- freq_rf: Random Forest on frequency data
- freq_nn: Neural Network on frequency data
- freq_mr: Multi-class Regression on frequency data
- gex_rf: Random Forest on cell embeddings
- gex_nn: Neural Network on cell embeddings
- gex_mr: Multi-class Regression on cell embeddings
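The method names above follow a regular pattern: apart from the two MultiMIL variants, each name is an input-representation prefix (`pb`, `ct_pb`, `freq`, `gex`) joined to a model suffix (`rf`, `nn`, `mr`). The sketch below decodes a name into its two parts. The helper `describe_method` and the lookup tables are hypothetical and not part of the pipeline; they only restate the list.

```python
# Lookup tables transcribed from the method list; the helper itself is illustrative.
REPRESENTATIONS = {
    "pb": "pseudobulk",
    "ct_pb": "cell type-aware pseudobulk",
    "freq": "frequency data",
    "gex": "cell embeddings",
}
MODELS = {
    "rf": "Random Forest",
    "nn": "Neural Network",
    "mr": "Multi-class Regression",
}
MULTIMIL_VARIANTS = {
    "multimil": "MultiMIL (classification)",
    "multimil_reg": "MultiMIL (regression)",
}


def describe_method(name):
    """Return (model, input representation) for a method name.

    MultiMIL variants have no separate representation prefix, so the
    second element is None for them.
    """
    if name in MULTIMIL_VARIANTS:
        return (MULTIMIL_VARIANTS[name], None)
    # Split on the last underscore so that "ct_pb_nn" yields ("ct_pb", "nn").
    prefix, _, suffix = name.rpartition("_")
    if prefix not in REPRESENTATIONS or suffix not in MODELS:
        raise ValueError(f"unknown method name: {name!r}")
    return (MODELS[suffix], REPRESENTATIONS[prefix])
```

For example, `describe_method("ct_pb_nn")` returns `("Neural Network", "cell type-aware pseudobulk")`.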
data/
├── reports/
│   ├── {task}/
│   │   ├── {method}/
│   │   │   └── {hash}/
│   │   │       └── accuracy.tsv
│   │   └── best_{method}.txt
│   ├── methods.tsv
│   ├── best.tsv
│   └── {task}_accuracy.png
└── tasks.tsv
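Given the layout above, the per-run `accuracy.tsv` files can be gathered into a single list for ad hoc inspection. The following is a minimal sketch, not part of the pipeline: `collect_accuracy` is a hypothetical helper, and it assumes each `accuracy.tsv` is tab-separated with a header row. The columns inside those files are not documented here, so the code passes them through unchanged.

```python
import csv
from pathlib import Path


def collect_accuracy(reports_dir):
    """Gather rows from every {task}/{method}/{hash}/accuracy.tsv under reports_dir.

    Each returned dict holds the task, method, and hash taken from the path,
    followed by whatever columns the file's header defines (assumed TSV with header).
    """
    rows = []
    for path in sorted(Path(reports_dir).glob("*/*/*/accuracy.tsv")):
        # The three directories above the file encode task, method, and run hash.
        task, method, run_hash = path.parts[-4:-1]
        with path.open(newline="") as handle:
            for record in csv.DictReader(handle, delimiter="\t"):
                rows.append(
                    {"task": task, "method": method, "hash": run_hash, **record}
                )
    return rows
```

After a run, `collect_accuracy("data/reports")` would return one dict per row across all tasks, methods, and parameter hashes. The pipeline already writes its own summaries (`methods.tsv`, `best.tsv`), so this is only useful for custom comparisons.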
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the BSD-3-Clause License; see the LICENSE file for details.
For questions and support, please open an issue on GitHub.