Sample Prediction Pipeline

A Snakemake-based pipeline for benchmarking sample-level prediction methods from single-cell data, with a focus on Multiple Instance Learning (MultiMIL) and other machine learning approaches.

Overview

The pipeline benchmarks the following methods for sample-level prediction from single-cell data:

MultiMIL for classification and (ordinal) regression
Random Forest (RF)
Multi-class Regression (MR)
Feed-forward Neural Network (NN)

Performance is compared across four input representations:

pseudobulk
cell type-aware pseudobulk
frequency vectors
cell embeddings
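
To make the first three representations concrete, here is a minimal, dependency-free sketch of how each could be derived from per-cell data. It is an illustration, not the pipeline's code: the toy data, the function names, the use of the mean as the aggregation, and the reading of "frequency vector" as per-sample cell-type proportions are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each cell is (sample_id, cell_type, expression_vector).
cells = [
    ("s1", "T", [1.0, 0.0]),
    ("s1", "T", [3.0, 2.0]),
    ("s1", "B", [0.0, 4.0]),
    ("s2", "B", [2.0, 2.0]),
]


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pseudobulk(cells):
    """One mean expression vector per sample (mean aggregation is an assumption)."""
    by_sample = defaultdict(list)
    for sample, _, expr in cells:
        by_sample[sample].append(expr)
    return {s: mean_vectors(v) for s, v in by_sample.items()}


def cell_type_pseudobulk(cells):
    """One mean expression vector per (sample, cell type) pair."""
    by_key = defaultdict(list)
    for sample, ctype, expr in cells:
        by_key[(sample, ctype)].append(expr)
    return {k: mean_vectors(v) for k, v in by_key.items()}


def frequency_vector(cells, cell_types):
    """Per-sample fraction of each cell type, in the order given by cell_types."""
    counts = defaultdict(Counter)
    for sample, ctype, _ in cells:
        counts[sample][ctype] += 1
    return {
        s: [c[t] / sum(c.values()) for t in cell_types]
        for s, c in counts.items()
    }
```

Each function returns one entry per sample (or per sample and cell type), which is the shape a sample-level classifier or regressor would consume.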

Features

Modular Design: Easy to add new methods and datasets
Reproducible: Snakemake ensures reproducible workflows
Comprehensive Evaluation: Multiple metrics and visualization options
Conda Environment: Isolated environment management

Quick Start

The repository includes ready-to-use example data and configuration files.

Clone and setup:

git clone https://github.com/theislab/sample-prediction-pipeline.git
cd sample-prediction-pipeline
mamba env create -f envs/sample_prediction_pipeline.yaml
mamba activate sample_prediction_pipeline

Run the example:

snakemake --cores 1

This will run all methods on the example dataset with minimal test parameters and produce results in data/reports/.

Documentation

For detailed installation instructions, configuration options, and advanced usage, see the Complete Guide.

Supported Methods

MultiMIL

multimil: MultiMIL for sample-level classification prediction
multimil_reg: MultiMIL for sample-level regression prediction

Pseudobulk Methods

pb_rf: Random Forest on pseudobulk
pb_nn: Neural Network on pseudobulk
pb_mr: Multi-class Regression on pseudobulk

Cell Type Methods

ct_pb_rf: Random Forest on cell type-aware pseudobulk
ct_pb_nn: Neural Network on cell type-aware pseudobulk
ct_pb_mr: Multi-class Regression on cell type-aware pseudobulk

Frequency-based Methods

freq_rf: Random Forest on frequency data
freq_nn: Neural Network on frequency data
freq_mr: Multi-class Regression on frequency data

Cell Embedding Methods

gex_rf: Random Forest on cell embeddings
gex_nn: Neural Network on cell embeddings
gex_mr: Multi-class Regression on cell embeddings
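
The non-MultiMIL identifiers above follow a visible pattern: an input-representation prefix (pb, ct_pb, freq, gex) followed by a model suffix (rf, nn, mr). The sketch below splits an identifier along that pattern, which can be handy when grouping results. The helper name split_method and the lookup tables are hypothetical and not part of the pipeline.

```python
# Lookup tables built from the method lists above; the dict names are hypothetical.
INPUTS = {
    "pb": "pseudobulk",
    "ct_pb": "cell type-aware pseudobulk",
    "freq": "frequency data",
    "gex": "cell embeddings",
}
MODELS = {
    "rf": "Random Forest",
    "nn": "Neural Network",
    "mr": "Multi-class Regression",
}


def split_method(name):
    """Split a method identifier into (input_prefix, model_suffix).

    MultiMIL identifiers do not follow the prefix/suffix pattern, so they are
    returned as (None, name).
    """
    if name in ("multimil", "multimil_reg"):
        return (None, name)
    # Split on the last underscore so that "ct_pb_rf" yields ("ct_pb", "rf").
    prefix, _, suffix = name.rpartition("_")
    if prefix not in INPUTS or suffix not in MODELS:
        raise ValueError(f"unrecognized method identifier: {name!r}")
    return (prefix, suffix)
```

For example, split_method("ct_pb_rf") yields the pair ("ct_pb", "rf"), which can be mapped to readable labels through INPUTS and MODELS.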

Output Structure

data/
├── reports/
│   ├── {task}/
│   │   ├── {method}/
│   │   │   └── {hash}/
│   │   │       └── accuracy.tsv
│   │   └── best_{method}.txt
│   ├── methods.tsv
│   ├── best.tsv
│   └── {task}_accuracy.png
└── tasks.tsv
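
Given the layout above, per-run results can be located by walking the reports directory. The following is a minimal sketch using only the standard library; the function name find_accuracy_files is hypothetical, and the contents of accuracy.tsv are not parsed because the column layout is not documented here.

```python
from pathlib import Path


def find_accuracy_files(reports_dir):
    """Return (task, method, hash, path) for every accuracy.tsv found at
    reports_dir/{task}/{method}/{hash}/accuracy.tsv."""
    found = []
    for path in sorted(Path(reports_dir).glob("*/*/*/accuracy.tsv")):
        # The first three path components below reports_dir are task, method, hash.
        task, method, run_hash = path.relative_to(reports_dir).parts[:3]
        found.append((task, method, run_hash, path))
    return found
```

Calling find_accuracy_files("data/reports") after a run lists one tuple per method configuration; top-level files such as methods.tsv and best.tsv are skipped because they do not match the three-level pattern.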

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

This project is licensed under the BSD-3-Clause License - see the LICENSE file for details.

Support

For questions and support, please open an issue on GitHub.
