Citation:
Johnston HR, Chopra P, Wingo TS, Patel V, Epstein MP, Mulle JG, Warren ST, Zwick ME, Cutler DJ. PEMapper and PECaller provide a simplified approach to whole-genome sequencing. Proc Natl Acad Sci U S A 2017:114(10):E1923-E1932 DOI: 10.1073/pnas.1618065114 PMID: 28223510
This project includes two generations of tools:
Version 1 (Original): pecaller, pemapper, snp_to_vcf
- Original implementation (2014)
- Uses text-based pileup format
- Kept for informational and compatibility purposes.
Version 2 (Current/Recommended): pecaller2, pemapper2, snp_to_vcf2
- Current implementation (2017-2018)
- Uses optimized binary pileup format for improved performance
- Includes
mpileup_to_pileup2converter for format transformation - Both versions use the same genome index created by
index_genome_whole - This is the primary version described in this documentation
All new projects should use Version 2 tools.
Install using make and further instructions are in Build.md.
During the build process, you will see warnings about unused parameters, which is fine.
# Standard system-wide install
sudo make install
# User-specific install
make install PREFIX=~/.local
# Custom location
make install PREFIX=/opt/mytoolsThe PECaller2 pipeline processes sequencing data through the following stages:
Create a searchable index of your reference genome:
FASTA files → index_genome_whole → reference.sdx + reference.seq + reference.idx
Convert sequencing data to binary pileup format:
Option A - Direct mapping from FASTQ:
FASTQ files → pemapper2 → sample.indel.txt + sample.pileup2
Option B - Convert from existing alignments:
CRAM/BAM → samtools mpileup → mpileup_to_pileup2 → sample.indel.txt + sample.pileup2
Note: For large studies, group samples into batches (B1, B2, etc.) of ~100 samples each.
Call variants from pileup data:
*.indel.txt + *.pileup2 + reference.sdx + regions.bed → pecaller2 → *.base + *.snp + *.dist + *.piles
Identify high-quality variant sites:
*.snp + reference.sdx → make_snplist_formerge.pl → good.bed + bad.bed
Then split by chromosome: good.bed → chr[1-22,X,Y,M].good.bed
Combine variant calls across all batches:
*.base + chr?.good.bed + reference.sdx → pecall_merger → merged.chr?.snp
Integrate indels with SNP calls:
*.indel.txt + merged.chr?.snp + reference.sdx → merge_indel_snp.pl → final.chr?.snp
This project provides two genome indexing programs:
index_genome_whole (Recommended):
- Designed for large genomes (human, mouse, etc.)
- Fixed 16-mer indexing depth optimized for performance
- Uses compressed storage (gzipped files) to save disk space
- No genome size limitations
- Uses ~108 Gb to encode human genome.
index_genome (For development and experimental use):
- Configurable indexing depth and genome size limits
- Memory-efficient for smaller genomes
- Uncompressed output files
- Intended for experimental use
- Download the genome fa files.
For example, to download all of mm10 to the current directory from UCSC:
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/ ./- Merge all of the fa files into one large fa file.
Use merge_dir_fa.pl to merge the files into a single fa file with chromosomes
sorted in a natural order (e.g., 1-19, M, X, and Y) with unmapped chromosome
after chrY.
merge_dir_fa.pl -d ../mm10_2015-01-25/ -c '1-19,M,X,Y' -o mm10- Index the merged fa file.
Using index_genome_whole (recommended):
Run interactively and provide:
- Output destination (Screen or Disk)
- Input FASTA file path
- Output basename for index files
- Whether DNA will be bisulfite converted (y/n)
Using index_genome (for smaller genomes):
Run interactively or redirect input from a file.
For example, index_genome < in.cmds where in.cmds contains:
d
1000
mm10.fa
mm10
n
Using pemapper2 (recommended):
- Single-ended or paired-ended mapping of FASTQ files is supported
- Outputs binary pileup2 format for optimal performance with
pecaller2 - Automatically generates indel.txt files for insertion/deletion tracking
Legacy pemapper support:
- Available for compatibility with existing workflows
- Outputs standard pileup format for use with original
pecaller
File naming conventions:
- Paired-end files should use
_1_vs_2_or_R1_vs_R2_naming
Using pecaller2 (recommended):
- Place all pileup2 files into a single directory and launch
pecaller2 - Supports BED file regions to restrict calling to specific genomic intervals
- Important: Chromosome order in BED file must match exactly the order in the genome
.sdxfile - Outputs:
.base,.snp,.dist, and.pilesfiles for comprehensive variant analysis
Legacy pecaller support:
- Available for compatibility with text-based pileup inputs
- Use for workflows requiring the original format
Step 1: Quality Control & Site Selection
- Run
make_snplist_formerge.plon SNP files to identify high-quality variant sites - This creates "good.bed" and "bad.bed" files for quality-based filtering
- Split good.bed by chromosome for parallel processing:
chr[1-22,X,Y,M].good.bed
Step 2: Merge Base Files
- Use
pecall_mergerto combine.basefiles across samples at high-quality sites - Input:
.basefiles + chromosome-specific.good.bedfiles + genome.sdxfile - Output: Merged SNP files per chromosome
Combining SNPs and Indels:
- Both
pecaller2andpecallercall each base independently - Insertions are stored separately in
.indel.txtfiles during mapping/calling - Use
merge_indel_snp.plto integrate indels into final SNP files - This step sorts variants and places all indels into the final output
For compatibility with standard genomics tools:
Using snp_to_vcf2 (recommended):
- Converts the optimized SNP format to standard VCF format
- Supports both SNPs and indels in VCF output
- Improved handling of complex variants and metadata
Legacy snp_to_vcf support:
- Available for compatibility with Version 1 workflows
- Converts original text-based SNP format to VCF
snp_tran_counter.plandsnp_tran_silent_rep.plgive transition to transversion counts for variants obsered. The latter provides counts specific to different kinds of classes of variants (i.e., replacement sites) and expects the annotation to come from the Bystro genomic annotator.