diff --git a/README.md b/README.md index aef41e9..7872d42 100644 --- a/README.md +++ b/README.md @@ -1,48 +1,6 @@ -# RNAseq - -## Instructions for analysis - -The goal of this analysis is to perform a differential expression analysis at gene level (mRNA isoforms are *not* taken into account). - -The following instructions are short in purpose: additional information or tools can be deduced, or retrieve from public databases. - -1. Download 4 sequence datasets deposited to the EBI [ENA](http://www.ebi.ac.uk/ena): - - `ERR990557` - - `ERR990558` - - `ERR990559` - - `ERR990560` - -These datasets have been generated in a work published in Embo Reports : - -Molla-Herman A, Vallés AM, Ganem-Elbaz C, Antoniewski C, Huynh J-R. tRNA processing defects induce replication stress and Chk2-dependent disruption of piRNA transcription. EMBO J. 2015;34: 3009–3027. doi:10.15252/embj.201591006 - -2. Extract fastq files -3. for each file, select 8,000,000 (8 millions) of sequence reads and generate the following sample files: - - `ERR990557s.fastq` - - `ERR990558s.fastq` - - `ERR990559s.fastq` - - `ERR990560s.fastq` - -4. Align these read datasets to the reference genome by any appropriate mean, and generate a sorted bam alignment file. -5. Count reads aligning to genome's genes by any appropriate mean -6. Perform a statistical differential expression analysis and report using any appropriate figure(s)/graph(s) -7. select a list of genes likely to be differentially expressed with a p-adj value < 0.01 -8. Code a simple script that parse the table of differential expressions (from *6.*) and return the genes with a p-adj value < 0.01 for rejection of H0 (non differential expression) - -## Reporting - -Each analyst will report her/is analysis by any mean s/he feels appropriate (pdf, text, markedown, jpg, URL, etc.). - -The only constraint is that analysis outputs will be deposited in a personal [fork](https://help.github.com/articles/fork-a-repo/) of this repository in a *new* directory named analysis.01, analysis.02, etc. (see analysis.00 for an example). Keep track of the analysis.xx directories already existing and chose another name for your directory. - -Final submission of the results will be made through a [pull request](https://help.github.com/articles/creating-a-pull-request/) from the analyst to the [original repo](https://github.com/drosofff/RNAseq.git). - -Please do not neglect the reporting and follow the requested process, it is a part of the analysis. +# analysis.01 + +## RNA-seq analysis + +You can find my report in the html document (rna_seq) in the same directory. +Best regards, L. Bellenger \ No newline at end of file diff --git a/rna_seq.html b/rna_seq.html new file mode 100644 index 0000000..ecaac9f --- /dev/null +++ b/rna_seq.html @@ -0,0 +1,412 @@ + + + + +
+ + + + + + + + + + +The data came from a work published in Embo Reports :
+Molla-Herman A, Vallés AM, Ganem-Elbaz C, Antoniewski C, Huynh J-R. tRNA processing defects induce replication stress and Chk2-dependent disruption of piRNA transcription. EMBO J. 2015;34: 3009–3027. doi:10.15252/embj.201591006
+There is 4 differents datasets :
+++ERR990557
+
+ERR990558
+ERR990559
+ERR990560
Once the fastq files of these datasets are obtained, we subsample them thanks to seqtk in order to perform the analysis on 8,000,000 reads.
seqtk sample -s100 8000000 ERR990557.fastq > ERR990557s.fastq
+...
+
+Then several tools are used for mapping reads to the genome of reference and generate a bam file :
+bowtie2-build genome/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna genome/droso
+bowtie2-align -p 5 -U ERR990557s.fastq -x genome/droso > ERR990557s.bowtie
+samtools view -h -Sb ERR990559s.bowtie -o ERR990557s.bam
+samtools sort ERR990559s.bam ERR990557s.sorted
+...
+
+the different results of bowtie2 for the subsampled datasets are presented below:
+# ERR990557
+8000000 reads; of these:
+ 8000000 (100.00%) were unpaired; of these:
+ 1829226 (22.87%) aligned 0 times
+ 5189931 (64.87%) aligned exactly 1 time
+ 980843 (12.26%) aligned >1 times
+77.13% overall alignment rate
+# ERR990558
+8000000 reads; of these:
+ 8000000 (100.00%) were unpaired; of these:
+ 1118967 (13.99%) aligned 0 times
+ 5842674 (73.03%) aligned exactly 1 time
+ 1038359 (12.98%) aligned >1 times
+86.01% overall alignment rate
+# ERR990559
+8000000 reads; of these:
+ 8000000 (100.00%) were unpaired; of these:
+ 1288699 (16.11%) aligned 0 times
+ 5017495 (62.72%) aligned exactly 1 time
+ 1693806 (21.17%) aligned >1 times
+83.89% overall alignment rate
+# ERR990560
+8000000 reads; of these:
+ 8000000 (100.00%) were unpaired; of these:
+ 878293 (10.98%) aligned 0 times
+ 5702321 (71.28%) aligned exactly 1 time
+ 1419386 (17.74%) aligned >1 times
+89.02% overall alignment rate
+For each dataset, HTseq-count is used in order to count the number of reads mapping genes.
+htseq-count -f bam -i gene ERR990557s.sorted.bam genome/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff > ERR990557s_exon_gene.bam.count.txt
+...
+Now, the purpose is to visualize if there are differentially expressed genes between two conditions.
+++ERR990557/ERR990558 (Condition A) VS ERR990559/ERR990560 (Condition B)
+
For this analysis, we are going to use count.table that contains 17733 rows corresponding to genes, and 4 columns to samples. The value contained in the table represent for each gene the number of times a read mapped this gene. It’s corresponding to the output of HTseq.
## Load the files content in an R data.frame
+count.table <- read.table(file="C:/Users/belle/Downloads/count_table.txt", sep="\t", header=TRUE, row.names =1)
+head(count.table) ERR990557 ERR990558 ERR990559 ERR990560
+gene0 1173 791 9120 9483
+gene1 5 5 160 72
+gene10 117 126 69 113
+gene100 142 183 107 113
+gene1000 0 0 1 2
+gene10000 0 0 0 0
+Some genes were not detected at all in these samples, they have been discard, now we work on 2936 detected genes.
+Thanks to DESeq2 R package, the analysis can be performed as described below :
+The nbinomWaldTest can be used to test for differential expression. The output is a data.frame which contains nominal p-values, as well as FDR values (correction for multiple tests computed with the Benjamini–Hochberg procedure).
## Install the library if needed then load it
+if(!require("DESeq2")){
+ source("http://bioconductor.org/biocLite.R")
+ biocLite("DESeq2")
+}
+library("DESeq2")
+
+condition <- factor(c("A","A","B","B"))
+dds0 <- DESeq2::DESeqDataSetFromMatrix(count.table, DataFrame(condition), ~ condition)
+dds.norm <- estimateSizeFactors(dds0)
+dds.disp <- estimateDispersions(dds.norm)
+alpha <- 0.01
+wald.test <- nbinomWaldTest(dds.disp)
+res.DESeq2 <- results(wald.test, alpha=alpha, pAdjustMethod="BH")
+res.DESeq2 <- res.DESeq2[order(res.DESeq2$padj),]This representation allows to highlight the relevant genes.On the volcano plot, genes that exceeding the thresholds of biological and statistical significance (lines) are flagged.
+One popular diagram in DNA chip analysis is the M versus A plot (MA plot) between two conditions \(A\) and \(B\). Points will be colored red if the adjusted p value is less than 0.0001. Points which fall out of the window are plotted as open triangles pointing either up or down.
+In order to be sure that significative genes are well differentially expressed between the two conditions, we will perform a hierachical clustering using the heatmap.2() function from the gplots library.
We selected genes based on FDR (1%), then retrieve the normalized counts for gene of interest and perform the hierarchical clustering with a distance based on Pearson-correlation coefficient and average linkage clustering as agglomeration criteria.
+## We select gene names based on FDR (1%)
+gene.kept <- rownames(res.DESeq2)[res.DESeq2$padj <= alpha & !is.na(res.DESeq2$padj)]
+
+## we retreive the count for gene of interest
+head(count.table[gene.kept, ]) ERR990557 ERR990558 ERR990559 ERR990560
+gene463 43 55 3998 3108
+gene10723 11 16 1667 1565
+gene11362 22 19 1348 1135
+gene1799 7 18 724 1326
+gene2848 3 13 543 1175
+gene629 136 191 1518 1911
+It corresponds to 1600 genes and we see clearly that our two conditions stand out well.
+There is a differential expression between the two conditions. The aim now would be the functionnal annotation to understand why these genes are co-expressed, are the involve in the same biological process ? Are they co-located? We can use several tools and databases in order to understand better the relationships between those genes like gprofiler, Gene Onthology, String db, etc…
+