GC-aware species abundance estimation from metagenomic data
GuaCAMOLE is a tool for estimating species abundance from metagenomic data, considering GC content. This package implements various scripts to generate reference distributions, correct for GC bias, and estimate species abundances.
- Create GC reference distributions from Kraken2 databases.
- Estimate species abundances using GC-aware correction.
- Generate detailed plots for data visualization.
You can install the package directly from a folder:
pip install .
Or from GitHub:
pip install git+https://github.com/Cibiv/GuaCAMOLE.git
To install the quadratic programming solver:
pip install qpsolvers['cvxopt']
GuaCAMOLE relies on the library.fna files downloaded by Kraken2 to generate the reference distributions. The Kraken2 database needs to be downloaded with the --no-masking
option with the kraken2-build
command!
To create a GC reference distribution from a Kraken2 database run the following command with the parameters of your sample:
create-reference-dist --lib_path path/to/kraken_db --read_len 150 --fragment_len 400 --ncores 20
Specifying the fragment_len
parameters for paired-end sequencing samples can help with the accuracy of GuaCAMOLE, especially if the fragments are a lot longer than the reads. If specified, GuaCAMOLE will use the mean GC content of the read pair as an approximation to the GC content of the fragment. If not specified just single reads.
For GuaCAMOLE to run it also requires a Bracken database with the correct read length to exist in the Kraken database folder.
To estimate species abundances from your data, run:
guacamole --kraken_report path/to/report --kraken_file path/to/kraken --kraken_db path/to/kraken_db --read_len 150 --output result.txt --read_files path/to/reads_1.fastq path/to/reads_2.fastq
--kraken_report
: Kraken2 report file (required)--kraken_file
: Kraken2 file with classifications (required)--kraken_db
: Path to the Kraken2 database (required)--read_len
: Read length (required)--output
: Output file name (required)--read_files
: Path to input read files (required)--threshold
: Minimum number of reads found for a species to estimate its abundance, default=500--length_correction
, genome size correction (taxonomic vs. read abundance), default=False--plot
, True if detailed plots should be generated--fp_cycles
, Number of iterations for false positive removal, default=4--reg_weight
, Determines how strong the regularization should be [between 0 and 1]', default=0.01--fragment_len
, length of the fragment if paired end and known', default=None--fasta
, True if reads are in fasta format, false if fastq', default=False
The Output is the same as the tab-delimited Bracken output file. Three additional columns are added:
Bracken_estimate
, the abundance estimate from Bracken (iflength_correction=True
they are genome length corrected the same as the GuaCAMOLE estimates)GuaCAMOLE_estimate
, the abundances estimated using the abundance parameter from the GuaCAMOLE algorithmGuaCMAOLE_est_eff
, the abundances computed using the estimated efficiencies by GuaCAMOLE (this does also include esimates for the taxa that were labelled as false positives by GuaCAMOLE)GC_content
, the GC content of the taxon's genome
To try out GuaCAMOLE you can download the fastq files of one of the replicates of the sample A0
sequenced for a publication by Tourlousse et al. (2021) (for more details see their and our manuscripts). To download the data install sra-tools
and run:
fasterq-dump SRR12996245
You need to have a kraken2 database installed with the --no-masking
flag set and classify the reads using Kraken2. You can do this yourself or use the pre-classified files in the demo data folder, then you can skip all the next steps and directly run guacamole. The SRR12996245.kraken
file which contains the read classifications can be downloaded under the following link:
https://drive.google.com/file/d/1gxz7UFqoq-a25Gh0LsLTnDRjafTGvWSE/view?usp=sharing
SKIP THE NEXT 4 STEPS IF YOU'RE USING THE FILES PROVIDED IN THE demo_data
FOLDER
To download the Kraken2 database use:
kraken2-build --standard --db demo_db --threads 24 --no-masking
Now you can classify the reads using
kraken2 --db demo_db --threads 24 --report SRR12996245_report.txt --paired SRR12996245_1.fastq SRR12996245_2.fastq > SRR12996245.kraken
Then build a Bracken database for the read length of the sequencing data which is 150 bp:
bracken-build -d demo_db -t 24 -l 150
Now build a GuaCAMOLE database for the corresponding read and fragment length (should take around an hour):
create-reference-dist --lib_path demo_db --read_len 150 --ncores 20 --fragment_len 400
Now run GuaCAMOLE using (should take around 10 minutes):
guacamole --kraken_report SRR12996245_report.txt --kraken_file SRR12996245.kraken --read_len 150 --output SRR12996245.guac --fragment_len 400 --length_correction True --kraken_db demo_db/ --threshold 500 --plot True --read_files SRR12996245_1.fastq SRR12996245_2.fastq