Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking. Sourcepredict addresses this problem with machine learning classification on dimensionally reduced datasets.
With conda (recommended)
$ conda install -c conda-forge -c maxibor sourcepredict
With pip
$ pip install sourcepredict
- Sink taxonomic count file (see example file and documentation)
- Source taxonomic count file (see example file and documentation)
- Source label file (see example file and documentation)
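Before running Sourcepredict, it can help to confirm that every sample in the source count file has an entry in the label file. The sketch below is only an illustration: it assumes that the source file is a CSV whose first column holds the taxon identifier and whose remaining columns are samples, and that the label file lists one sample name per row in its first column. The helper name `unlabeled_samples` and the toy data are hypothetical; check the documentation for the exact file layout.

```python
import csv
import io


def unlabeled_samples(sources_csv, labels_csv):
    """Return source sample names that have no entry in the label file.

    Both arguments are CSV text. The layout is an assumption, not taken
    from the Sourcepredict documentation.
    """
    header = next(csv.reader(io.StringIO(sources_csv)))
    samples = header[1:]  # assumed: first column holds the taxon identifier
    rows = csv.reader(io.StringIO(labels_csv))
    next(rows)  # skip the header row of the label file
    labeled = {row[0] for row in rows if row}
    return [name for name in samples if name not in labeled]


# Toy data, made up for illustration only.
sources_csv = "TAXID,sample_a,sample_b,sample_c\n562,10,0,3\n1301,4,7,0\n"
labels_csv = ",labels\nsample_a,Homo_sapiens\nsample_b,Soil\n"

print(unlabeled_samples(sources_csv, labels_csv))  # ['sample_c']
```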
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Step 1: Checking for unknown proportion
== Sample: ERR1915662 ==
Adding unknown
Normalizing (GMPR)
Computing Bray-Curtis distance
Performing MDS embedding in 2 dimensions
KNN machine learning
Training KNN classifier on 2 cores...
-> Testing Accuracy: 1.0
----------------------
- Sample: ERR1915662
known:98.61%
unknown:1.39%
Step 2: Checking for source proportion
Computing weighted_unifrac distance on species rank
TSNE embedding in 2 dimensions
KNN machine learning
Performing 5 fold cross validation on 2 cores...
Trained KNN classifier with 10 neighbors
-> Testing Accuracy: 0.99
----------------------
- Sample: ERR1915662
Canis_familiaris:96.1%
Homo_sapiens:2.47%
Soil:1.43%
Sourcepredict result written to dog_test_sample.sourcepredict.csv
Sourcepredict outputs the predicted source contribution to each sink sample, and the embedding of all samples in the lower dimensional space. See documentation for details.
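To give a feel for the idea behind the run shown above, here is a minimal pure-Python sketch of two of the ingredients named in the log: a Bray-Curtis distance and a K-nearest-neighbors vote. It is a deliberate simplification and not Sourcepredict's implementation: it skips normalization and the embedding step, and reads the source proportions directly from the labels of the nearest sources. The function names and the toy counts are hypothetical.

```python
from collections import Counter


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def knn_proportions(sink, sources, labels, k=3):
    """Share of each label among the k sources nearest to the sink."""
    ranked = sorted(zip(sources, labels),
                    key=lambda pair: bray_curtis(sink, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return {label: count / k for label, count in votes.items()}


# Toy counts for three hypothetical taxa; values are made up for illustration.
sources = [
    [90, 5, 5], [80, 10, 10], [85, 10, 5],  # dog-like profiles
    [5, 90, 5], [10, 80, 10],               # human-like profiles
    [5, 5, 90],                             # soil-like profile
]
labels = ["Canis_familiaris"] * 3 + ["Homo_sapiens"] * 2 + ["Soil"]
sink = [88, 7, 5]

print(knn_proportions(sink, sources, labels, k=3))
```

With these toy counts, the three nearest sources are all dog-like, so the whole vote goes to `Canis_familiaris`. Sourcepredict itself reports finer-grained proportions because it works on an embedding of real count data.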
Depending on the normalization method (-n), the embedding method (-me), the number of CPUs available for parallel processing (-t), and the data, the runtime should range from a few seconds to a few minutes per sink sample.
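When comparing runtimes across settings, it can be convenient to assemble the command line programmatically. The sketch below only builds the argument list and does not run anything. The flags `-s`, `-l`, `-n`, `-me` and `-t` are the ones mentioned in this README; the helper `build_command` is hypothetical, and the values `GMPR` and `TSNE` are taken from the example log, so whether the CLI accepts them spelled exactly this way is an assumption to check against the documentation.

```python
import shlex


def build_command(sink, sources, labels,
                  normalization=None, embedding=None, threads=None):
    """Assemble a sourcepredict command line as a list of arguments."""
    cmd = ["sourcepredict", "-s", sources, "-l", labels]
    if normalization is not None:
        cmd += ["-n", normalization]   # normalization method
    if embedding is not None:
        cmd += ["-me", embedding]      # embedding method
    if threads is not None:
        cmd += ["-t", str(threads)]    # CPUs for parallel processing
    cmd.append(sink)                   # sink file goes last, as in the example run
    return cmd


cmd = build_command("dog_example.csv", "sp_sources.csv", "sp_labels.csv",
                    normalization="GMPR", embedding="TSNE", threads=2)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` and timed with `time.perf_counter()` to compare settings on your own data.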
The documentation of Sourcepredict is available here: sourcepredict.readthedocs.io
- The sources were obtained with a simple Nextflow pipeline, with Kraken2 using the MiniKraken2_v2_8GB.
See the documentation for more information on how to build a custom source file.
- The example source file is here: modern_gut_microbiomes_sources.csv
- The example label file is here: modern_gut_microbiomes_labels.csv
- Homo sapiens gut microbiome (1, 2, 3, 4, 5, 6)
- Canis familiaris gut microbiome (1)
- Soil microbiome (1, 2, 3)
If you wish to contribute to Sourcepredict, you are welcome and encouraged to do so by opening an issue or creating a pull request. All contributions will be made under the GPLv3 license. More information can be found on the contributing page.
Sourcepredict has been published in JOSS.
@article{Borry2019Sourcepredict,
journal = {Journal of Open Source Software},
doi = {10.21105/joss.01540},
issn = {2475-9066},
number = {41},
publisher = {The Open Journal},
title = {Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification},
url = {http://dx.doi.org/10.21105/joss.01540},
volume = {4},
author = {Borry, Maxime},
pages = {1540},
date = {2019-09-04},
year = {2019},
month = {9},
day = {4}
}
