This repository houses updated scripts for the Sheynkman Lab Long Read Proteogenomics (LRP) pipeline. It is very actively being modified; if you are using it as a guide, please contact Emily Watts ([email protected]) for assistance, as there may be updates that have not yet been pushed to this repository.
At this time, this pipeline can be run with multiple biological replicates for two conditions. It is not yet set up to run with multiple biological replicates for more than two conditions.
I have not updated any of the mass spectrometry-dependent modules yet. We have been focused on the RNA-seq modules for the past ~2 years.
In most use cases, I skip the following modules: 15_accession_mapping, 15_MS_file_convert, 16_Metamorpheus, 17_novel_peptides, 17_peptide_analysis, and 17_protein_group_compare.
If you are in the Sheynkman Lab, my most recent LRP run can be found here. It contains the correct file paths for the Docker images and programs stored on Rivanna.
Set up the file structure in your working directory by cloning this repository; this layout makes the pipeline straightforward to run. The generic scripts in this repository assume that your directory is organized in this manner and that you run everything from your working directory (i.e., don't change directories at each step).
If you are in the Sheynkman Lab, do not use a 00_input_data folder. Instead, use the raw files in the directory /project/sheynkman/raw_data/ and direct the scripts to these files (this saves space in our project storage).
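For Sheynkman Lab members, a minimal sketch of pointing a script at the shared raw data rather than a local 00_input_data folder might look like the following; the variable name and project subdirectory are hypothetical, and each script in 00_scripts/ defines its own input paths.

# Hypothetical example; match the input variable actually used in the script you are editing
FLNC_BAM=/project/sheynkman/raw_data/<your_project>/raw_reads.flnc.bam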
module load git-lfs/2.10.0
git clone https://github.com/efwatts/LRP_Troubleshooting.git
cd LRP_Troubleshooting
Each module lists the required modules and either includes a .yml file to create the needed environment (eventually all modules will have these) or instructs you on how to create the environment.
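For modules that ship a .yml file, the environment can typically be created and activated with conda (available through the miniforge module loaded below); the file and environment names here are placeholders for whichever module you are setting up.

# Placeholder names; use the .yml file and environment name provided by the module
conda env create -f <module_environment>.yml
conda activate <module_environment>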
If you have Kinnex data, which is now the standard, these are the files you will need. If you have older PacBio data, you will need to run a few earlier steps in the Iso-Seq pipeline to be ready for the LRP pipeline.
- raw_reads.flnc.bam from your PacBio data
- from GENCODE (see the download sketch after this list):
  - gencode_gtf - Comprehensive gene annotation (regions: CHR): gencode.v46.annotation.gtf
  - gencode_transcript_fasta - Protein-coding transcript sequences (regions: CHR): gencode.v46_pc_transcripts.fa
  - gencode_translation_fasta - Protein-coding transcript translation sequences (regions: CHR): gencode.v46_pc_translations.fa
  - genome_fasta - Genome sequence, primary assembly (GRCh38) (regions: PRI): GRCh38.primary_assembly.genome.fa
- Human_Hexamer.tsv reference file
- Human_logitModel.RData reference file
- Optional: kallisto.tsv from your data
- Optional (for Modules 15-17): MS search files (.raw)
- Optional (for Modules 16-17): UniProt reviewed .fasta from the UniProt database
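A minimal sketch for downloading the GENCODE references, assuming release 46 and the standard GENCODE FTP layout at EBI (adjust the release number to match your annotation; the unzipped file names may differ slightly from the names used elsewhere in this README):

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.pc_transcripts.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.pc_translations.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v46.*.gz GRCh38.primary_assembly.genome.fa.gz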
I typically use this README.md file to keep track of the modules I have run and the order I have run them in.
I also use it to make notes on the particular LRP run I am working on.
If you clone this repository, you can erase the contents of this file and use it for your own notes.
A typical README.md file in my working directory looks like this:
This repository is being used to apply the Long Read Proteogenomics pipeline to my project.
Include important information about your dataset here.
You can also include information about where your data is stored and how to access it.
Be sure to include the GENCODE version you are using.
module load git-lfs/2.10.0
git clone https://github.com/efwatts/LRP_Troubleshooting.git
mv LRP_Troubleshooting project_name
cd project_name
I typically load these modules at the beginning of a run so that my environment is ready for minor data manipulation outside of the LRP pipeline.
module load gcc/11.4.0 openmpi/4.1.4 python/3.11.4 miniforge/24.3.0-py3.11 samtools/1.17 R/4.5.0
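Each step below is submitted as a SLURM batch job. Standard SLURM commands can be used to monitor a run; the job ID below is a placeholder.

squeue -u $USER   # list your queued and running jobs
sacct -j <jobid> --format=JobID,JobName,State,Elapsed   # check the state of a submitted or finished job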
sbatch 00_scripts/01_isoseq.sh
sbatch 00_scripts/01_make_reference_tables.sh
sbatch 00_scripts/02_sqanti.sh
sbatch 00_scripts/02_make_gencode_database.sh
sbatch 00_scripts/03_filter_sqanti.sh
sbatch 00_scripts/04_cpat.sh
sbatch 00_scripts/04_transcriptome_summary.sh
This is the point in the pipeline where the two conditions are split.
sbatch 00_scripts/05_orf_calling.sh
sbatch 00_scripts/06_refine_orf_database.sh
sbatch 00_scripts/07_make_cds_gtf.sh
sbatch 00_scripts/08_rename_cds_to_exon.sh
sbatch 00_scripts/09_sqanti_protein.sh
sbatch 00_scripts/10_5p_utr.sh
sbatch 00_scripts/11_protein_classification.sh
sbatch 00_scripts/12_protein_gene_rename.sh
sbatch 00_scripts/13_protein_filter.sh
sbatch 00_scripts/14_protein_hybrid_database.sh
sbatch 00_scripts/17_track_visualization.sh
sbatch 00_scripts/18_suppa.sh
Make gene counts for the edgeR analysis.
python 00_scripts/01_isoseq_gene_counts.py 01_isoseq/collapse/merged.collapsed.flnc_count.txt 01_isoseq/gene_level_counts.txt
Now run 19_LRP_summary/edgeR.R to get the edgeR results required for the next script.
Rscript 19_LRP_summary/edgeR.R
Now run the LRP summary script to get the final results.
sbatch 00_scripts/19_LRP_summary.sh
The DTE, DGE, and DTU analyses are in an R script that I typically run in RStudio, but you can also run it from the command line:
Rscript 00_scripts/20_DTE_DGE_DTU.R