microDNA-Detection-and-Analysis

A reproducible pipeline for detecting, extracting, and annotating small extrachromosomal circular DNA (eccDNA) from high‑throughput sequencing data.

Overview

This pipeline uses Circle‑Map to detect eccDNA (100–400 bp) from paired‑end sequencing reads, cleans and re‑aligns candidates, and annotates genomic features. It is modular, transparent, and version‑controlled for full reproducibility. It is designed to be run on a Linux system with a conda package manager.

Prerequisites

Conda (miniconda or anaconda)
Python (3.6 or later)
BWA
Samtools
Bedtools
gtf2bed

Installation

Clone the Repository

git clone https://github.com/JasonHunter95/microDNA-Detection-and-Analysis.git
cd microDNA-Detection-and-Analysis

Create and Activate a Conda Environment

conda env create -f environment.yml
conda activate circlemap-env

Data Acquisition

To run the pipeline, you need to have paired-end sequencing data in FASTQ format. You can either download your own data or use the example dataset provided in this repository.

Raw sequencing data can be fetched from the SRA database. The example dataset is from the NCBI SRA database, specifically the SRR413984 accession number.

fasterq-dump SRR413984 --split-files -O data/fastqfiles/

This will download the paired-end reads and save them in the data/fastqfiles/ directory. The files will be named SRR413984_1.fastq and SRR413984_2.fastq.

The example dataset is a human sample, and the reference genome used in this example is the human genome (GRCh37). The reference genome file can be downloaded to the data/human_genome/whole directory as GCF_000001405.13_GRCh37_genomic.NC_000001.10.fna via executing the following shell script:

Please note that this may take some time to download, as the reference genome is quite large.

bash src/utils/shell_scripts/get_ref_genome.sh

You can also use the following script to extract individual chromosomes from the genome downloaded using samtools faidx. This will create a folder for each chromosome in the genome, and subsequently saves an index file for each. Once again, this may take a few minutes to run. (This is unnecessary if you are using the whole genome, but is useful if you want to run the pipeline on individual chromosomes.)

We also need to download the Gencode annotation file for the human genome. The Gencode annotation file is available in GTF format, and we will convert it to BED format using the gtf2bed command, and sort the BED file using bedtools sort. The following commands can be used to accomplish this:

curl -L -o data/gtfs/human_genome/gencode.v19.annotation.gtf.gz \
    https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
gunzip data/gtfs/human_genome/gencode.v19.annotation.gtf.gz

gtf2bed < data/gtfs/human_genome/gencode.v19.annotation.gtf > data/beds/whole_genomes/human_genome/gencode.v19.annotation.bed

This will create a BED file with the Gencode annotation in the data/beds/whole_genomes/human_genome/ directory.

bedtools sort -i data/beds/whole_genomes/human_genome/gencode.v19.annotation.bed > data/beds/whole_genomes/human_genome/gencode.v19.annotation.sorted.bed

Alignment and Indexing

The first step in the pipeline is to align the paired-end reads to the reference genome. This is done in two steps, first by indexing the reference genome and then aligning the reads to the indexed reference genome.

bash src/utils/shell_scripts/index_all_chr.sh

To index the entire reference genome, you can use the following command:

bwa-mem2 index data/human_genome/whole/GCF_000001405.13_GRCh37_genomic.fna

This will create an index file for the reference genome, which is used by BWA to align the reads to the reference genome. Alternatively, you can run the following command to index the reference genome using Samtools, which is a more common method:

samtools faidx data/human_genome/whole/GCF_000001405.13_GRCh37_genomic.fna

Both commands will create an index file for the reference genome, but the samtools faidx command is more efficient and faster for large genomes.

This can be done for each chromosome as well with the following command:

for i in {1..22} X Y M; do
data/human_genome/
    samtools faidx data/human_genome/chr$i/chr$i.fna
done

Align the reads to the reference genome using BWA, and then sort the output BAM file using Samtools. The file is sent to the data/ folder. The -t option specifies the threadcount to use for BWA. I only used 4 threads for this because it was done locally on my mac. This can be increased if you are running on a more powerful machine.

bwa-mem2 mem -t 6 data/human_genome/chr1/chr1.fna \
    data/fastqfiles/SRR413984_1.fastq data/fastqfiles/SRR413984_2.fastq | \
    samtools sort -@ 6 -m 1G -o data/bams/chr1/SRR413984_chr1.sorted.bam

If you want to run this for an individual chromosome, the following script can be used:

bash src/utils/shell_scripts/bwa_align_and_sort_by_chr.sh <chr_name> <sample_prefix>

Now, index the sorted BAM file (here just chr1) using Samtools:

samtools index data/bams/chr1/SRR413984_chr1.sorted.bam

Next, in order to prepare the BAM file for Circle-Map, we need to query sort the BAM file. This is done using the samtools sort -n command. The -n option sorts the reads by name, which is required for Circle-Map.

samtools sort -n -o data/bams/chr1/SRR413984_chr1.querysorted.bam data/bams/chr1/SRR413984_chr1.sorted.bam

Or for individual chromosomes, you can run the following script:

bash src/utils/shell_scripts/querysort_bam_by_chr.sh <chr_name> <sample_prefix>

eccDNA Detection

The next step is to detect eccDNA using the Circle-Map tool. The Circle-Map pipeline consists of two main steps: ReadExtractor and Realign. The ReadExtractor step extracts the reads that are most statistically likely to be circular, and the Realign step re-aligns the extracted reads to the reference genome.

Circle-Map ReadExtractor -i data/bams/chr1/SRR413984_chr1.querysorted.bam -o data/bams/chr1/SRR413984_chr1.candidates.bam

Then sort and index the candidates.bam file using Samtools:

samtools sort -o data/bams/chr1/SRR413984_chr1.candidates.sorted.bam data/bams/chr1/SRR413984_chr1.candidates.bam
samtools index data/bams/chr1/SRR413984_chr1.candidates.sorted.bam

Heres a script to do this for individual chromosomes:

bash src/utils/shell_scripts/sort_and_index_candidates_by_chr.sh chr< replace_with_your_chr_number > < sample_id >

Now we can run the Realign step. This step takes the queryname sorted BAM file and the original sorted BAM file as input, and outputs a BED file with the eccDNA coordinates.

Circle-Map Realign \
  -i data/bams/chr1/SRR413984_chr1.candidates.sorted.bam \
  -qbam data/bams/chr1/SRR413984_chr1.querysorted.bam \
  -sbam data/bams/chr1/SRR413984_chr1.sorted.bam \
  -fasta data/human_genome/chr1/chr1.fna \
  -o data/beds/chr1/SRR413984_chr1.eccdna.bed \
  --split 2 \
  --threads 8

Post-processing

Run the following script to clean the output files:

python3 src/utils/clean_bed.py \
  -i data/beds/chr1/SRR413984_chr1.eccdna.bed \
  -o data/beds/chr1/SRR413984_chr1.eccdna.cleaned.bed

Annotation

Annotate the eccDNA using the Gencode annotation file.

We also need to sort our data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.bed to give closest the proper inputs:
```
  bedtools sort -i data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.bed > data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.sorted.bed
```
Convert the eccDNA BED file to UCSC format. This is necessary for the bedtools intersect command to work. The Gencode annotation uses chr1, chr2, etc. as chromosome names, while the CircleMap output uses NC_000001.10, NC_000002.11, etc. as chromosome names.
```
  sed 's/^NC_000001.10/chr1/' data/beds/chr1/SRR413984_chr1.eccdna.cleaned.bed > data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.bed
```

Intersect the eccDNA BED file with the Gencode annotation BED file using bedtools intersect. This will create a new BED file with the eccDNA coordinates and their corresponding gene annotations.

  bedtools closest \
    -a data/beds/chr1/SRR413984_chr1.eccdna.cleaned.ucsc.sorted.bed \
    -b data/beds/chr1/gencode_chr1_only_genes.bed \       
    -d > data/beds/chr1/SRR413984_chr1.eccdna.closest_genes.bed

Generate a clean output of the eccDNA with gene annotations using the annotate_eccdna_genes.py script. This will create a new TSV file with the eccDNA coordinates and their corresponding gene annotations.
```
  python3 src/utils/annotate_eccdna_closest.py
```

Results

The final output file data/SRR413969.eccdna.annotated.tsv contains the eccDNA coordinates and their corresponding gene annotations. The file is tab-separated and contains the following columns:

ecc_id: eccDNA ID
chrom: Chromosome name
start: Start position of the eccDNA
end: End position of the eccDNA
gene_id: Gene ID of the corresponding gene
gene_name: Gene name of the corresponding gene
gene_type: Gene type of the corresponding gene

To get a guick summary of the eccDNA to see which genes are present, you can run the following command:

cut -f7 data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv | sort | uniq -c | sort -nr

This provides counts on the number of eccDNA associated with each gene type. Then you can extract each gene type and save it to a separate file as follows:

awk '$7 == "protein_coding"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_protein_coding_genes.tsv
awk '$7 == "lncRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_lncRNA_genes.tsv
awk '$7 == "miRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_miRNA_genes.tsv
awk '$7 == "snRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_snRNA_genes.tsv
awk '$7 == "snoRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_snoRNA_genes.tsv
awk '$7 == "rRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_rRNA_genes.tsv
awk '$7 == "pseudogene"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_pseudogene_genes.tsv
awk '$7 == "miRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_miRNA_genes.tsv
awk '$7 == "tRNA"' data/results/chr1/SRR413984_chr1_intersected.eccdna.annotated.tsv > data/results/chr1/SRR413984_chr1_tRNA_genes.tsv

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
figures		figures
src/utils		src/utils
useful_commands		useful_commands
.gitignore		.gitignore
README.md		README.md
circles_igv_session_chr1.xml		circles_igv_session_chr1.xml
commands.txt		commands.txt
environment.yml		environment.yml
fijiREADME.md		fijiREADME.md
igv_session_chr1.xml		igv_session_chr1.xml
requirements.txt		requirements.txt
windowsREADME.md		windowsREADME.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

microDNA-Detection-and-Analysis

Table of Contents

Overview

Prerequisites

Installation

Data Acquisition

Alignment and Indexing

eccDNA Detection

Post-processing

Annotation

Results

About

Uh oh!

Releases

Packages

Uh oh!

Languages

JasonHunter95/microDNA-Detection-and-Analysis

Folders and files

Latest commit

History

Repository files navigation

microDNA-Detection-and-Analysis

Table of Contents

Overview

Prerequisites

Installation

Data Acquisition

Alignment and Indexing

eccDNA Detection

Post-processing

Annotation

Results

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages