diff --git a/README.md b/README.md index aef41e9..7872d42 100644 --- a/README.md +++ b/README.md @@ -1,48 +1,6 @@ -# RNAseq - -## Instructions for analysis - -The goal of this analysis is to perform a differential expression analysis at gene level (mRNA isoforms are *not* taken into account). - -The following instructions are short in purpose: additional information or tools can be deduced, or retrieve from public databases. - -1. Download 4 sequence datasets deposited to the EBI [ENA](http://www.ebi.ac.uk/ena): - - `ERR990557` - - `ERR990558` - - `ERR990559` - - `ERR990560` - -These datasets have been generated in a work published in Embo Reports : - -Molla-Herman A, Vallés AM, Ganem-Elbaz C, Antoniewski C, Huynh J-R. tRNA processing defects induce replication stress and Chk2-dependent disruption of piRNA transcription. EMBO J. 2015;34: 3009–3027. doi:10.15252/embj.201591006 - -2. Extract fastq files -3. for each file, select 8,000,000 (8 millions) of sequence reads and generate the following sample files: - - `ERR990557s.fastq` - - `ERR990558s.fastq` - - `ERR990559s.fastq` - - `ERR990560s.fastq` - -4. Align these read datasets to the reference genome by any appropriate mean, and generate a sorted bam alignment file. -5. Count reads aligning to genome's genes by any appropriate mean -6. Perform a statistical differential expression analysis and report using any appropriate figure(s)/graph(s) -7. select a list of genes likely to be differentially expressed with a p-adj value < 0.01 -8. Code a simple script that parse the table of differential expressions (from *6.*) and return the genes with a p-adj value < 0.01 for rejection of H0 (non differential expression) - -## Reporting - -Each analyst will report her/is analysis by any mean s/he feels appropriate (pdf, text, markedown, jpg, URL, etc.). - -The only constraint is that analysis outputs will be deposited in a personal [fork](https://help.github.com/articles/fork-a-repo/) of this repository in a *new* directory named analysis.01, analysis.02, etc. (see analysis.00 for an example). Keep track of the analysis.xx directories already existing and chose another name for your directory. - -Final submission of the results will be made through a [pull request](https://help.github.com/articles/creating-a-pull-request/) from the analyst to the [original repo](https://github.com/drosofff/RNAseq.git). - -Please do not neglect the reporting and follow the requested process, it is a part of the analysis. +# analysis.01 + +## RNA-seq analysis + +You can find my report in the html document (rna_seq) in the same directory. +Best regards, L. Bellenger \ No newline at end of file diff --git a/rna_seq.html b/rna_seq.html new file mode 100644 index 0000000..ecaac9f --- /dev/null +++ b/rna_seq.html @@ -0,0 +1,412 @@ + + + + + + + + + + + + + + + +RNAseq analysis + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + +
+
+
+
+
+ +
+ + + + + + + +
+

Data describing

+

The data came from a work published in Embo Reports :

+

Molla-Herman A, Vallés AM, Ganem-Elbaz C, Antoniewski C, Huynh J-R. tRNA processing defects induce replication stress and Chk2-dependent disruption of piRNA transcription. EMBO J. 2015;34: 3009–3027. doi:10.15252/embj.201591006

+

There is 4 differents datasets :

+
+

ERR990557
+ERR990558
+ERR990559
+ERR990560

+
+
+
+

File manipulations

+

Once the fastq files of these datasets are obtained, we subsample them thanks to seqtk in order to perform the analysis on 8,000,000 reads.

+
seqtk sample -s100 8000000 ERR990557.fastq > ERR990557s.fastq  
+...  
+
+

Then several tools are used for mapping reads to the genome of reference and generate a bam file :

+
bowtie2-build genome/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna genome/droso
+bowtie2-align -p 5 -U ERR990557s.fastq -x genome/droso > ERR990557s.bowtie  
+samtools view -h -Sb ERR990559s.bowtie -o ERR990557s.bam  
+samtools sort ERR990559s.bam ERR990557s.sorted   
+...  
+
+

the different results of bowtie2 for the subsampled datasets are presented below:

+
# ERR990557  
+8000000 reads; of these:  
+  8000000 (100.00%) were unpaired; of these:  
+    1829226 (22.87%) aligned 0 times  
+    5189931 (64.87%) aligned exactly 1 time  
+    980843 (12.26%) aligned >1 times  
+77.13% overall alignment rate  
+
# ERR990558  
+8000000 reads; of these:  
+  8000000 (100.00%) were unpaired; of these:  
+    1118967 (13.99%) aligned 0 times  
+    5842674 (73.03%) aligned exactly 1 time  
+    1038359 (12.98%) aligned >1 times  
+86.01% overall alignment rate  
+
# ERR990559  
+8000000 reads; of these:  
+  8000000 (100.00%) were unpaired; of these:  
+    1288699 (16.11%) aligned 0 times  
+    5017495 (62.72%) aligned exactly 1 time  
+    1693806 (21.17%) aligned >1 times  
+83.89% overall alignment rate  
+
# ERR990560 
+8000000 reads; of these:  
+  8000000 (100.00%) were unpaired; of these: 
+    878293 (10.98%) aligned 0 times  
+    5702321 (71.28%) aligned exactly 1 time  
+    1419386 (17.74%) aligned >1 times  
+89.02% overall alignment rate  
+

For each dataset, HTseq-count is used in order to count the number of reads mapping genes.

+
htseq-count -f bam -i gene ERR990557s.sorted.bam genome/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff > ERR990557s_exon_gene.bam.count.txt  
+...
+
+
+

Differential analysis

+

Now, the purpose is to visualize if there are differentially expressed genes between two conditions.

+
+

ERR990557/ERR990558 (Condition A) VS ERR990559/ERR990560 (Condition B)

+
+

For this analysis, we are going to use count.table that contains 17733 rows corresponding to genes, and 4 columns to samples. The value contained in the table represent for each gene the number of times a read mapped this gene. It’s corresponding to the output of HTseq.

+
## Load the files content in an R data.frame
+count.table <- read.table(file="C:/Users/belle/Downloads/count_table.txt", sep="\t", header=TRUE, row.names =1)
+head(count.table)
+
          ERR990557 ERR990558 ERR990559 ERR990560
+gene0          1173       791      9120      9483
+gene1             5         5       160        72
+gene10          117       126        69       113
+gene100         142       183       107       113
+gene1000          0         0         1         2
+gene10000         0         0         0         0
+

Some genes were not detected at all in these samples, they have been discard, now we work on 2936 detected genes.

+

Thanks to DESeq2 R package, the analysis can be performed as described below :
+The nbinomWaldTest can be used to test for differential expression. The output is a data.frame which contains nominal p-values, as well as FDR values (correction for multiple tests computed with the Benjamini–Hochberg procedure).

+
## Install the library if needed then load it
+if(!require("DESeq2")){
+  source("http://bioconductor.org/biocLite.R")
+  biocLite("DESeq2")
+}
+library("DESeq2")
+
+condition <- factor(c("A","A","B","B"))
+dds0 <- DESeq2::DESeqDataSetFromMatrix(count.table, DataFrame(condition), ~ condition)
+dds.norm <- estimateSizeFactors(dds0)
+dds.disp <- estimateDispersions(dds.norm)
+alpha <- 0.01
+wald.test <- nbinomWaldTest(dds.disp)
+res.DESeq2 <- results(wald.test, alpha=alpha, pAdjustMethod="BH")
+res.DESeq2 <- res.DESeq2[order(res.DESeq2$padj),]
+
+

Volcano plot

+

This representation allows to highlight the relevant genes.On the volcano plot, genes that exceeding the thresholds of biological and statistical significance (lines) are flagged.

+

+
+
+

Looking at the results with a MA plot

+

One popular diagram in DNA chip analysis is the M versus A plot (MA plot) between two conditions \(A\) and \(B\). Points will be colored red if the adjusted p value is less than 0.0001. Points which fall out of the window are plotted as open triangles pointing either up or down.

+

+
+
+

Hierachical clustering

+

In order to be sure that significative genes are well differentially expressed between the two conditions, we will perform a hierachical clustering using the heatmap.2() function from the gplots library.

+

We selected genes based on FDR (1%), then retrieve the normalized counts for gene of interest and perform the hierarchical clustering with a distance based on Pearson-correlation coefficient and average linkage clustering as agglomeration criteria.

+
## We select gene names based on FDR (1%)
+gene.kept <- rownames(res.DESeq2)[res.DESeq2$padj <= alpha & !is.na(res.DESeq2$padj)]
+
+## we retreive the count for gene of interest
+head(count.table[gene.kept, ])
+
          ERR990557 ERR990558 ERR990559 ERR990560
+gene463          43        55      3998      3108
+gene10723        11        16      1667      1565
+gene11362        22        19      1348      1135
+gene1799          7        18       724      1326
+gene2848          3        13       543      1175
+gene629         136       191      1518      1911
+

+

It corresponds to 1600 genes and we see clearly that our two conditions stand out well.

+
+
+
+

Conclusion

+

There is a differential expression between the two conditions. The aim now would be the functionnal annotation to understand why these genes are co-expressed, are the involve in the same biological process ? Are they co-located? We can use several tools and databases in order to understand better the relationships between those genes like gprofiler, Gene Onthology, String db, etc…

+
+ + + +
+
+ +
+ + + + + + + +