This is my adaptation of the original forked repo's rna-seq-star-deseq2 workflow, modified to run on AWS ParallelCluster via the pcluster-slurm snakemake executor.
This workflow performs a differential gene expression analysis with STAR and DESeq2.
- Follow the setup instructions found here: https://github.com/Daylily-Informatics/daylily-ephemeral-cluster.
- If you have installed daylily-ephemeral-cluster, conda should already be activated once you log into the headnode.
- If you roll your own, you'll need to install miniconda and activate it. You may do so as follows:
```bash
bash bin/install_miniconda
```

This should work on most Linux-y systems.
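If you'd rather not use the repo's installer, a minimal manual install along these lines should work (the URL is Miniconda's standard Linux x86_64 installer; adjust for your architecture):

```bash
# Download and install Miniconda non-interactively into $HOME/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p "$HOME/miniconda3"

# Activate for this shell and enable `conda activate` in future shells
source "$HOME/miniconda3/bin/activate"
conda init bash
```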
```bash
git clone [email protected]:Daylily-Informatics/rna-seq-star-deseq2.git
cd rna-seq-star-deseq2
```

```bash
conda create -n snakemake -c conda-forge snakemake==9.5.1 snakedeploy tabulate yaml
conda activate snakemake
pip install snakemake-executor-plugin-pcluster-slurm==0.0.31
```

```bash
conda activate snakemake
snakemake --version
# 9.5.1
```
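As a quick sanity check that the executor plugin landed in the same environment (optional; this just prints pip metadata):

```bash
# Should print the plugin name and version (0.0.31)
pip show snakemake-executor-plugin-pcluster-slurm | head -2
```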
You are advised to run the following in a `tmux` or `screen` session.
```bash
conda activate snakemake

# Set your cache dir for saving resources reusable across other jobs;
# snakemake uses this when the `--cache` flag is set.
mkdir -p /fsx/resources/environments/containers/ubuntu/rnaseq_cache/
export SNAKEMAKE_OUTPUT_CACHE=/fsx/resources/environments/containers/ubuntu/rnaseq_cache/
export TMPDIR=/fsx/scratch/

cp config/units.tsv.template config/units.tsv
[[ "$(uname)" == "Darwin" ]] && sed -i "" "s|REGSUB_PWD|$PWD|g" config/units.tsv || sed -i "s|REGSUB_PWD|$PWD|g" config/units.tsv
```
This can take ~1 hr.
```bash
# I set partitions relevant to my AWS ParallelCluster; if you specify nothing,
# you will get an error along the lines of "could not find appropriate nodes".
snakemake --use-conda --use-singularity \
  --singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
  --singularity-args " -B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
  --conda-prefix /fsx/resources/environments/containers/ubuntu/ \
  --executor pcluster-slurm \
  --default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=36900 tmpdir=/fsx/scratch \
  --cache -p \
  --verbose -k \
  --max-threads 20000 \
  --restart-times 2 \
  --cores 20000 -j 14 -n \
  --conda-create-envs-only
```
- There seems to be a bug that requires you to run with `--conda-create-envs-only` first; once all envs are built, run the full command.
- Another bug in how snakemake detects the max allowed threads per job limits threads to the `nproc` of your head node. Setting `--max-threads 20000 --cores 20000` crudely gets around this.
- Remove the `-n` flag to run outside of dry-run mode. `-j` sets the max jobs slurm will allow active at one time.
- Watch your running nodes/jobs using `squeue` (other `q*`-style cluster commands work, but not reliably, and are not supported).
- Use `sinfo` to learn about your cluster (note: `sinfo` reports on all potential and active compute nodes; read the docs to interpret which are active, which are not-yet-requested spot instances, etc.). Below is what the daylily AWS ParallelCluster looks like.
```text
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
i8*          up  infinite    12 idle~ i8-dy-gb64-[1-12]
i64          up  infinite    16 idle~ i64-dy-gb256-[1-8],i64-dy-gb512-[1-8]
i128         up  infinite    28 idle~ i128-dy-gb256-[1-8],i128-dy-gb512-[1-10],i128-dy-gb1024-[1-10]
i192         up  infinite    30 idle~ i192-dy-gb384-[1-10],i192-dy-gb768-[1-10],i192-dy-gb1536-[1-10]
```
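The `~` suffix on `idle~` is slurm's power-saving notation for cloud nodes that are defined but not currently provisioned. To see only nodes that have actually been spun up, a state filter helps (standard `sinfo` flags):

```bash
# One line per node, restricted to allocated or mixed (partially allocated) states
sinfo -N -t alloc,mix
```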
- Use the strings in the `PARTITION` column, i.e. `i192`, in the `slurm_partition=` config passed to snakemake.
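If a single rule needs a different partition than the defaults, snakemake's `--set-resources` flag can override resources per rule. A minimal sketch, assuming a hypothetical rule named `align` (substitute a real rule name from this workflow):

```bash
# Dry-run: route only the (hypothetical) `align` rule to i192, leaving all
# other rules on the default partitions set via --default-resources.
snakemake --executor pcluster-slurm \
  --default-resources slurm_partition=i128,i192 \
  --set-resources align:slurm_partition=i192 \
  --cores 20000 -j 14 -n
```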
`daylily` makes extensive use of cost allocation tags with AWS ParallelCluster (as in the daylily omics analysis framework's $3 30x WGS analysis) to track AWS cluster usage costs in realtime and to impose limits where appropriate (by user and project). It does this by overriding the `--comment` flag to hold project/budget tags, which are applied to the ephemeral AWS resources, thus enabling cost tracking/controls.
- To change the `--comment` flag in v`0.0.8` of the pcluster-slurm plugin, set the comment value in the env var `SMK_SLURM_COMMENT` (`RandD` is the default).
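For example, to tag the jobs from this run with a project/budget code before launching (the value `project-abc` is a placeholder; use whatever tag your cost-allocation setup expects):

```bash
# The pcluster-slurm plugin reads this env var and passes it through as the
# slurm --comment value on submitted jobs.
export SMK_SLURM_COMMENT=project-abc
```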
```bash
snakemake --use-conda --use-singularity \
  --singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
  --singularity-args " -B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
  --conda-prefix /fsx/resources/environments/containers/ubuntu/ \
  --executor pcluster-slurm \
  --default-resources slurm_partition=i128,i192 runtime=86400 mem_mb=36900 tmpdir=/fsx/scratch \
  --cache -p \
  --verbose -k \
  --restart-times 2 \
  --max-threads 20000 \
  --cores 20000 -j 14
```
- You can watch progress with `watch squeue`.
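If the default `squeue` columns are too terse, a richer view using standard `squeue` output format fields looks like this (the refresh interval and fields are just a suggestion):

```bash
# Refresh every 10s; columns: job id, partition, name, state, elapsed, node count, nodelist/reason
watch -n 10 'squeue -o "%.10i %.12P %.30j %.8T %.10M %.6D %R"'
```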
- Update `config/units.tsv` (holds sample data locations and other details) and `config/samples.tsv` (holds sample annotations).
- Edit `config/config.yaml` to change aspects of the pipeline.
- Run the `snakemake` command above with `-n`, tweak `-j` as needed, and if all looks good, run without `-n`.
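Before committing to a full run, a quick lint plus dry-run catches most config mistakes; both flags are standard snakemake (the `--lint` output is advisory only):

```bash
snakemake --lint        # static checks on the workflow and config
snakemake -n --cores 1  # dry-run: print the jobs that would execute
```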