A reproducible pipeline for detecting, extracting, and annotating small extrachromosomal circular DNA (eccDNA) from high‑throughput sequencing data.
- Overview
- Prerequisites
- Installation
- Data Acquisition
- Alignment and Indexing
- eccDNA Detection
- Post‑processing
- Annotation
- Results
This pipeline uses Circle‑Map to detect eccDNA (100–400 bp) from paired‑end sequencing reads, cleans and re‑aligns candidates, and annotates genomic features. It is modular, transparent, and version‑controlled for full reproducibility. It is designed to be run on a Linux system with a conda package manager.
-
Clone the Repository
git clone https://github.com/JasonHunter95/microDNA-Detection-and-Analysis.git cd microDNA-Detection-and-Analysis
-
Create and Activate a Conda Environment
conda env create -f environment.yml conda activate circlemap-env
To run the pipeline, you need to have paired-end sequencing data in FASTQ format. You can either download your own data or use the example dataset provided in this repository.
Raw sequencing data can be fetched from the SRA database. The example dataset is from the NCBI SRA database, specifically the SRR413984 accession number.
fasterq-dump SRR413984 --split-files -O data/fastqfiles/
This will download the paired-end reads and save them in the data/fastqfiles/
directory. The files will be named SRR413984_1.fastq
and SRR413984_2.fastq
.
The example dataset is a human sample, and the reference genome used in this example is the human genome (GRCh37). The reference genome file can be downloaded to the data/human_genome/whole
directory as GCF_000001405.13_GRCh37_genomic.NC_000001.10.fna
via executing the following shell script:
Please note that this may take some time to download, as the reference genome is quite large.
bash src/utils/shell_scripts/get_ref_genome.sh
You can also use the following script to extract individual chromosomes from the genome downloaded using samtools faidx. This will create a folder for each chromosome in the genome, and subsequently saves an index file for each. Once again, this may take a few minutes to run. (This is unnecessary if you are using the whole genome, but is useful if you want to run the pipeline on individual chromosomes.)
We also need to download the Gencode annotation file for the human genome. The Gencode annotation file is available in GTF format, and we will convert it to BED format using the gtf2bed
command, and sort the BED file using bedtools sort
.
The following commands can be used to accomplish this:
curl -L -o data/gtfs/human_genome/gencode.v19.annotation.gtf.gz \
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
gunzip data/gtfs/human_genome/gencode.v19.annotation.gtf.gz
gtf2bed < data/gtfs/human_genome/gencode.v19.annotation.gtf > data/beds/whole_genomes/human_genome/gencode.v19.annotation.bed
This will create a BED file with the Gencode annotation in the data/beds/whole_genomes/human_genome/
directory.
bedtools sort -i data/beds/whole_genomes/human_genome/gencode.v19.annotation.bed > data/beds/whole_genomes/human_genome/gencode.v19.annotation.sorted.bed
The first step in the pipeline is to align the paired-end reads to the reference genome. This is done in two steps, first by indexing the reference genome and then aligning the reads to the indexed reference genome.
bash src/utils/shell_scripts/index_all_chr.sh
To index the entire reference genome, you can use the following command:
bwa-mem2 index data/human_genome/whole/GCF_000001405.13_GRCh37_genomic.fna
This will create an index file for the reference genome, which is used by BWA to align the reads to the reference genome. Alternatively, you can run the following command to index the reference genome using Samtools, which is a more common method:
samtools faidx data/human_genome/whole/GCF_000001405.13_GRCh37_genomic.fna
Both commands will create an index file for the reference genome, but the samtools faidx
command is more efficient and faster for large genomes.
This can be done for each chromosome as well with the following command:
for i in {1..22} X Y M; do
data/human_genome/
samtools faidx data/human_genome/chr$i/chr$i.fna
done
Align the reads to the reference genome using BWA, and then sort the output BAM file using Samtools. The file is sent to the data/ folder. The -t
option specifies the threadcount to use for BWA. I only used 4 threads for this because it was done locally on my mac. This can be increased if you are running on a more powerful machine.
bwa-mem2 mem -t 6 data/human_genome/chr1/chr1.fna \
data/fastqfiles/SRR413984_1.fastq data/fastqfiles/SRR413984_2.fastq | \
samtools sort -@ 6 -m 1G -o data/bams/chr1/SRR413984_chr1.sorted.bam
If you want to run this for an individual chromosome, the following script can be used:
bash src/utils/shell_scripts/bwa_align_and_sort_by_chr.sh <chr_name> <sample_prefix>
Now, index the sorted BAM file (here just chr1) using Samtools:
samtools index data/bams/chr1/SRR413984_chr1.sorted.bam
Next, in order to prepare the BAM file for Circle-Map, we need to query sort the BAM file. This is done using the samtools sort -n
command. The -n
option sorts the reads by name, which is required for Circle-Map.
samtools sort -n -o data/bams/chr1/SRR413984_chr1.querysorted.bam data/bams/chr1/SRR413984_chr1.sorted.bam
Or for individual chromosomes, you can run the following script:
bash src/utils/shell_scripts/querysort_bam_by_chr.sh <chr_name> <sample_prefix>
The next step is to detect eccDNA using the Circle-Map tool. The Circle-Map pipeline consists of two main steps: ReadExtractor and Realign. The ReadExtractor step extracts the reads that are most statistically likely to be circular, and the Realign step re-aligns the extracted reads to the reference genome.
Circle-Map ReadExtractor -i data/bams/chr1/SRR413984_chr1.querysorted.bam -o data/bams/chr1/SRR413984_chr1.candidates.bam
Then sort and index the candidates.bam file using Samtools:
samtools sort -o data/bams/chr1/SRR413984_chr1.candidates.sorted.bam data/bams/chr1/SRR413984_chr1.candidates.bam
samtools index data/bams/chr1/SRR413984_chr1.candidates.sorted.bam
Heres a script to do this for individual chromosomes:
bash src/utils/shell_scripts/sort_and_index_candidates_by_chr.sh chr< replace_with_your_chr_number > < sample_id >
Now we can run the Realign step. This step takes the queryname sorted BAM file and the original sorted BAM file as input, and outputs a BED file with the eccDNA coordinates.
Circle-Map Realign \
-i data/bams/chr1/SRR413984_chr1.candidates.sorted.bam \
-qbam data/bams/chr1/SRR413984_chr1.querysorted.bam \
-sbam data/bams/chr1/SRR413984_chr1.sorted.bam \
-fasta data/human_genome/chr1/chr1.fna \
-o data/beds/chr1/SRR413984_chr1.eccdna.bed \
--split 2 \
--threads 8
Run the following script to clean the output files:
python3 src/utils/clean_bed.py \
-i data/beds/chr1/SRR413984_chr1.eccdna.bed \
-o data/beds/chr1/SRR413984_chr1.eccdna.cleaned.bed
-
Annotate the eccDNA using the Gencode annotation file.
We also need to sort our data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.bed to give closest the proper inputs:
bedtools sort -i data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.bed > data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.sorted.bed
-
Convert the eccDNA BED file to UCSC format. This is necessary for the
bedtools intersect
command to work. The Gencode annotation useschr1
,chr2
, etc. as chromosome names, while the CircleMap output usesNC_000001.10
,NC_000002.11
, etc. as chromosome names.sed 's/^NC_000001.10/chr1/' data/beds/chr1/SRR413984_chr1.eccdna.cleaned.bed > data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.bed
-
Intersect the eccDNA BED file with the Gencode annotation BED file using
bedtools intersect
. This will create a new BED file with the eccDNA coordinates and their corresponding gene annotations.bedtools closest \ -a data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.sorted.bed \ -b data/beds/chr1/gencode_chr1_only_genes.bed \ -d > data/beds/chr1/SRR413984_chr1.eccdna.closest_genes.bed
-
Generate a clean output of the eccDNA with gene annotations using the
annotate_eccdna_genes.py
script. This will create a new TSV file with the eccDNA coordinates and their corresponding gene annotations.python3 src/utils/annotate_eccdna_closest.py
The final output file data/SRR413969.eccdna.annotated.tsv
contains the eccDNA coordinates and their corresponding gene annotations. The file is tab-separated and contains the following columns:
ecc_id
: eccDNA IDchrom
: Chromosome namestart
: Start position of the eccDNAend
: End position of the eccDNAgene_id
: Gene ID of the corresponding genegene_name
: Gene name of the corresponding genegene_type
: Gene type of the corresponding gene
To get a guick summary of the eccDNA to see which genes are present, you can run the following command:
cut -f7 data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv | sort | uniq -c | sort -nr
This provides counts on the number of eccDNA associated with each gene type. Then you can extract each gene type and save it to a separate file as follows:
awk '$7 == "protein_coding"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_protein_coding_genes.tsv
awk '$7 == "lncRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_lncRNA_genes.tsv
awk '$7 == "miRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_miRNA_genes.tsv
awk '$7 == "snRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_snRNA_genes.tsv
awk '$7 == "snoRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_snoRNA_genes.tsv
awk '$7 == "rRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_rRNA_genes.tsv
awk '$7 == "pseudogene"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_pseudogene_genes.tsv
awk '$7 == "miRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_miRNA_genes.tsv
awk '$7 == "tRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_tRNA_genes.tsv