GuaCAMOLE

GC-aware species abundance estimation from metagenomic data

Overview

GuaCAMOLE is a tool for estimating species abundance from metagenomic data, considering GC content. This package implements various scripts to generate reference distributions, correct for GC bias, and estimate species abundances.

Features

Create GC reference distributions from Kraken2 databases.
Estimate species abundances using GC-aware correction.
Generate detailed plots for data visualization.

Installation

You can install the package directly from a folder:

pip install .

Or from GitHub:

pip install git+https://github.com/Cibiv/GuaCAMOLE.git

To install the quadratic programming solver:

pip install qpsolvers['cvxopt']

Usage

1. Create Reference Distribution

GuaCAMOLE relies on the library.fna files downloaded by Kraken2 to generate the reference distributions. The Kraken2 database needs to be downloaded with the --no-masking option with the kraken2-build command!

To create a GC reference distribution from a Kraken2 database run the following command with the parameters of your sample:

create-reference-dist --lib_path path/to/kraken_db --read_len 150 --fragment_len 400 --ncores 20

Specifying the fragment_len parameters for paired-end sequencing samples can help with the accuracy of GuaCAMOLE, especially if the fragments are a lot longer than the reads. If specified, GuaCAMOLE will use the mean GC content of the read pair as an approximation to the GC content of the fragment. If not specified just single reads.

For GuaCAMOLE to run it also requires a Bracken database with the correct read length to exist in the Kraken database folder.

2. Run GuaCAMOLE for Species Abundance Estimation

To estimate species abundances from your data, run:

guacamole --kraken_report path/to/report --kraken_file path/to/kraken --kraken_db path/to/kraken_db --read_len 150 --output result.txt --read_files path/to/reads_1.fastq path/to/reads_2.fastq

Command-line Options

--kraken_report: Kraken2 report file (required)
--kraken_file: Kraken2 file with classifications (required)
--kraken_db: Path to the Kraken2 database (required)
--read_len: Read length (required)
--output: Output file name (required)
--read_files: Path to input read files (required)
--threshold: Minimum number of reads found for a species to estimate its abundance, default=500
--length_correction, genome size correction (taxonomic vs. read abundance), default=False
--plot, True if detailed plots should be generated
--fp_cycles, Number of iterations for false positive removal, default=4
--reg_weight, Determines how strong the regularization should be [between 0 and 1]', default=0.01
--fragment_len, length of the fragment if paired end and known', default=None
--fasta, True if reads are in fasta format, false if fastq', default=False

Output

The Output is the same as the tab-delimited Bracken output file. Three additional columns are added:

Bracken_estimate, the abundance estimate from Bracken (if length_correction=True they are genome length corrected the same as the GuaCAMOLE estimates)
GuaCAMOLE_estimate, the abundances estimated using the abundance parameter from the GuaCAMOLE algorithm
GuaCMAOLE_est_eff, the abundances computed using the estimated efficiencies by GuaCAMOLE (this does also include esimates for the taxa that were labelled as false positives by GuaCAMOLE)
GC_content, the GC content of the taxon's genome

Demo

To try out GuaCAMOLE you can download the fastq files of one of the replicates of the sample A0 sequenced for a publication by Tourlousse et al. (2021) (for more details see their and our manuscripts). To download the data install sra-tools and run:

fasterq-dump SRR12996245

You need to have a kraken2 database installed with the --no-masking flag set and classify the reads using Kraken2. You can do this yourself or use the pre-classified files in the demo data folder, then you can skip all the next steps and directly run guacamole. The SRR12996245.kraken file which contains the read classifications can be downloaded under the following link:

https://drive.google.com/file/d/1gxz7UFqoq-a25Gh0LsLTnDRjafTGvWSE/view?usp=sharing

SKIP THE NEXT 4 STEPS IF YOU'RE USING THE FILES PROVIDED IN THE demo_data FOLDER To download the Kraken2 database use:

kraken2-build --standard --db demo_db --threads 24 --no-masking

Now you can classify the reads using

kraken2 --db demo_db --threads 24 --report SRR12996245_report.txt --paired SRR12996245_1.fastq SRR12996245_2.fastq > SRR12996245.kraken

Then build a Bracken database for the read length of the sequencing data which is 150 bp:

bracken-build -d demo_db -t 24 -l 150

Now build a GuaCAMOLE database for the corresponding read and fragment length (should take around an hour):

create-reference-dist --lib_path demo_db --read_len 150 --ncores 20 --fragment_len 400

Now run GuaCAMOLE using (should take around 10 minutes):

guacamole --kraken_report SRR12996245_report.txt --kraken_file SRR12996245.kraken --read_len 150 --output SRR12996245.guac --fragment_len 400 --length_correction True --kraken_db demo_db/ --threshold 500 --plot True --read_files SRR12996245_1.fastq SRR12996245_2.fastq

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
build/lib		build/lib
demo_data		demo_data
guacamole.egg-info		guacamole.egg-info
guacamole		guacamole
.Rhistory		.Rhistory
COPYING		COPYING
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GuaCAMOLE

Overview

Features

Installation

Usage

1. Create Reference Distribution

2. Run GuaCAMOLE for Species Abundance Estimation

Command-line Options

Output

Demo

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

Cibiv/GuaCAMOLE

Folders and files

Latest commit

History

Repository files navigation

GuaCAMOLE

Overview

Features

Installation

Usage

1. Create Reference Distribution

2. Run GuaCAMOLE for Species Abundance Estimation

Command-line Options

Output

Demo

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages