
Snakemake workflow: rna-seq-star-deseq2 (using the pcluster-slurm executor), v0.1.7

This is my adaptation of the original forked repo's rna-seq-star-deseq2 workflow, modified to run on AWS ParallelCluster via the pcluster-slurm Snakemake executor.

This workflow performs a differential gene expression analysis with STAR and DESeq2.

Prerequisites

Have an AWS ParallelCluster (using Slurm as the scheduler) running:

  • From scratch: use AWS ParallelCluster directly (less recommended).

  • Pre-configured: use daylily-ephemeral-cluster (recommended).

Conda

  • If you have installed daylily-ephemeral-cluster, conda should already be activated once you log into the head node.
  • If you roll your own, you'll need to install and activate Miniconda. You can do so with bash bin/install_miniconda, which should work on most Linux-like systems; a sketch of the equivalent manual steps follows.
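For reference, a minimal sketch of a manual Miniconda install (assumes an x86_64 Linux host; bin/install_miniconda automates roughly these steps):

# Download and run the official Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"

# Make conda available in the current shell
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate base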

Usage

Clone Repo (it includes sample data)

git clone git@github.com:Daylily-Informatics/rna-seq-star-deseq2.git
cd rna-seq-star-deseq2

Build The Snakemake (v9.*) Conda Env

conda create -n snakemake -c conda-forge snakemake==9.5.1 snakedeploy tabulate yaml
conda activate snakemake
pip install snakemake-executor-plugin-pcluster-slurm==0.0.31

conda activate snakemake
snakemake --version
# 9.5.1 

Run Test Data Workflow

You are advised to run the following in a tmux or screen session.

Prepare Cache and TMPDIR

conda activate snakemake

# Set your cache dir for resources reused across jobs; snakemake uses this when the `--cache` flag is set.

mkdir -p /fsx/resources/environments/containers/ubuntu/rnaseq_cache/
export SNAKEMAKE_OUTPUT_CACHE=/fsx/resources/environments/containers/ubuntu/rnaseq_cache/
export TMPDIR=/fsx/scratch/
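A few optional sanity checks before launching (paths assume the daylily FSx layout):

df -h /fsx                       # confirm the FSx filesystem is mounted
mkdir -p "$TMPDIR"               # ensure the scratch dir exists
echo "$SNAKEMAKE_OUTPUT_CACHE"   # confirm the cache env var is set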

Prepare units.tsv

cp config/units.tsv.template config/units.tsv
[[ "$(uname)" == "Darwin" ]] && sed -i "" "s|REGSUB_PWD|$PWD|g" config/units.tsv || sed -i "s|REGSUB_PWD|$PWD|g" config/units.tsv

Build Conda Env Caches

This can take ~1 hour.

# These partitions are specific to my AWS ParallelCluster; if you specify none, you will get an error along the lines of <could not find appropriate nodes>.

snakemake --use-conda --use-singularity   \
--singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
--singularity-args "  -B /tmp:/tmp -B /fsx:/fsx  -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
--conda-prefix /fsx/resources/environments/containers/ubuntu/ \
--executor pcluster-slurm \
--default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=36900 tmpdir=/fsx/scratch \
--cache -p \
--verbose -k \
--max-threads 20000 \
--restart-times 2 \
--cores 20000 -j 14 -n   \
--conda-create-envs-only
  • There seems to be a bug that requires you to run with --conda-create-envs-only first; once all envs are built, run the full command.
  • Another bug in how snakemake detects the maximum allowed threads per job limits threads to the nproc of your head node (see the check below). Setting --max-threads 20000 --cores 20000 crudely works around this.
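To see the per-job thread ceiling snakemake would otherwise infer, check the head node's core count:

nproc   # head-node core count; compute nodes in i128/i192 have far more cores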

Prepare To Run The Command

  • Remove the -n flag to run outside dry-run mode.
  • -j sets the max number of jobs Slurm will allow to be active at one time.
  • Watch your running nodes/jobs with squeue, as in the example below (other cluster/queue helper commands work, but not reliably, and are not supported).
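For example, to refresh the queue view for your user every 30 seconds:

watch -n 30 squeue -u $USER
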
What Partitions Are Available?

Use sinfo to learn about your cluster (note: sinfo reports on all potential and active compute nodes; read the docs to determine which are active, which are not-yet-requested spot instances, etc.). Below is what the daylily AWS ParallelCluster looks like.

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
i8*          up   infinite     12  idle~ i8-dy-gb64-[1-12]
i64          up   infinite     16  idle~ i64-dy-gb256-[1-8],i64-dy-gb512-[1-8]
i128         up   infinite     28  idle~ i128-dy-gb256-[1-8],i128-dy-gb512-[1-10],i128-dy-gb1024-[1-10]
i192         up   infinite     30  idle~ i192-dy-gb384-[1-10],i192-dy-gb768-[1-10],i192-dy-gb1536-[1-10]
  • Use the strings in the PARTITION column, e.g. i192, in the slurm_partition= resource passed to snakemake.
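To list just the partition names (a trailing * marks the default partition):

sinfo -h -o "%P"
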
Budgets, and the --comment sbatch flag

daylily makes extensive use of cost allocation tags with AWS ParallelCluster in the daylily omics analysis framework ($3 30x WGS analysis) to track AWS cluster usage costs in real time and to impose limits where appropriate (by user and project). It does this by overriding the --comment flag to carry project/budget tags, which are applied to ephemeral AWS resources and thus enable cost tracking and controls.

  • To change the --comment flag in v0.0.8 of the pcluster-slurm plugin, set the comment value in the environment variable SMK_SLURM_COMMENT (RandD is the default), as in the example below.
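For example, to tag sbatch submissions with a project budget code (project_alpha is a hypothetical tag, not something defined by this repo):

export SMK_SLURM_COMMENT=project_alpha   # hypothetical budget tag; default is RandD
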
Run The Command
snakemake --use-conda --use-singularity   \
--singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
--singularity-args "  -B /tmp:/tmp -B /fsx:/fsx  -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
--conda-prefix /fsx/resources/environments/containers/ubuntu/ \
--executor pcluster-slurm \
--default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=36900 tmpdir=/fsx/scratch \
--cache -p \
--verbose -k \
--restart-times 2 \
--max-threads 20000 \
--cores 20000 -j 14 
  • You can watch progress with watch squeue.

Run w/Your Data

  • Update config/units.tsv (holds sample data locations and other details) and config/samples.tsv (holds sample annotations).
  • Edit config/config.yaml to change aspects of the pipeline.
  • Run the snakemake command above with -n, tweak -j as needed, and if all looks good, run without -n (see the sketch below).
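A minimal sketch of that sequence, where ... stands for the full flag set shown in Run The Command above:

snakemake ... -j 14 -n   # dry run: review the planned jobs
snakemake ... -j 14      # real run once the plan looks right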
