-
Notifications
You must be signed in to change notification settings - Fork 4
Assembly
Most sequencing machines are unable to produce long reads, and even when they do they don't produce a single read per chromosome. Moreover, the sequencers currently able to produce long reads (Oxford Nanopore and PacBio) usually have a higher error-rate (Nanopore), a higher cost-per-base (PacBio) and a lower throughput if compared with the more commonly available Illumina sequencers.
The sequencing of small genomes usually involves the whole genome shotgun approach, that will produce a stack of short reads from the genome.
Programs able to perform a de novo genome assembly are then used to create a consensus, usually becoming a set of contiguous sequences (known as contigs).
A popular compression tool used in Linux servers is called gzip. Unlike other compression systems, it's only able to compress a single file at the time, but this has some advantages, as we'll see.
Go to the phage directory we found yesterday. If you don't remember where it is:
find ~ -name "phage" -type d
Now that you found it:
cd ~/learn_bash/phage/
let's check the size of some files:
ls -lh *f
now let's compress the very same files (they will get a .gz suffix):
gzip *f
ls -lh *f.gz
To decompress them:
gunzip *.gz
To start, create a directory called denovo, placed inside your home directory:
mkdir ~/denovo
Now we want to locate input reads. Usual extensions from reads from a machine are .fastq or .fq, but being usually large files, they are compressed with gzip:
find ~/learn_bash -name "*.fastq.gz" | sort
You should find at least 6 files (3 pairs):
/home/ubuntu/learn_bash/phage/reads/sample1_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample1_R2.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample2_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample2_R2.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample3_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample3_R2.fastq.gzThese are paired reads, meaning that R1 and R2 files are coming from the same library and contain the forward and reverse read, respectively.
Now try to:
- Decompress them
- Have a look at their structure (using
less -S) - Count the lines
- Re-compress them
We can use one of the many programs to calculate statistics about FASTQ files:
seqkit stats ~/learn_bash/phage/reads/*fastq.gz
seqfu count ~/learn_bash/phage/reads/*fastq.gz
In our server there is a version of SPAdes pre-installed. To learn more we can check the online documentation, but also try to see if the program comes with some help (it is usually the case). Try typing spades.py, and hit enter, without any parameter.
- Is the help page printed in the standard output or in the standard error?
- Save the help screen in a file called "spades.txt" in your freshly created denovo directory.
As you could see from the manual, to assemble paired ends the (minimal) syntax is:
spades.py -1 {first_pair_fastq} -2 {second_pair_fastq} -o {new_output_directory}For example:
spades.py -t 4 -1 ~/learn_bash/phage/reads/sample2_R1.fastq.gz \
-2 ~/learn_bash/phage/reads/sample2_R2.fastq.gz -o ~/assembly- What file have been created in the output directory? Ar there FASTA files?
- How many sequences are there in contigs.fasta?
- How long are they [the size is printed in the sequence header]?
All the samples have been preassembled in the /data directory, so we can run a program that will check the statistics for all the assemblies:
seqkit stats /data/assembly-*/contigs.fasta
· Bioinformatics at the Command Line - Andrea Telatin, 2017-2020
Menu