simplifying code and adding resources.yml #10
Merged
49 commits
9464d2b simplifying code and adding resources.yml (antgonza)
a6ffcc5 export ENVIRONMEN sooner in the script (antgonza)
b371545 add ENVIRONMENT in qp-pacbio yml (antgonza)
a2077b0 afterok -> afterany (antgonza)
77fc60e CONDA_ENVIRONMENT (antgonza)
6e05385 fix tests (antgonza)
766bec5 CONDA_ENVIRONMENT (antgonza)
b6032b8 -J me_ (antgonza)
ffdae88 adding missing params for merge (antgonza)
43a931f data (antgonza)
5cf7de8 mv data to qp_pacbio (antgonza)
d11ba2a find_base_path (antgonza)
8e6267a --ignore=qp_pacbio/data (antgonza)
0071e1c "results": result_fp, (antgonza)
80373a0 results -> result_fp (antgonza)
50a46cf add completed (antgonza)
9a44417 output -> out_dir (antgonza)
fb42f44 SLURM_ARRAY_JOB_ID->SLURM_ARRAY_TASK_ID (antgonza)
bdb82df rm extra hifiasm_meta (antgonza)
0cd35e7 validate failed_steps (antgonza)
a58a1f2 rm shopt (antgonza)
d2b227c add file check FILES=(*.fa) (antgonza)
741e242 save small LCGs (antgonza)
52d76b2 update databases (antgonza)
6af6c17 forgot 1 update (antgonza)
f98217e rm extra / (antgonza)
af437dd update minimap2 woltka command (antgonza)
b9b5c76 nprocs -> 16 (antgonza)
e58bb16 biom_merge_pacbio (antgonza)
9098888 woltka & biom (antgonza)
15bbe41 micov (antgonza)
a2775aa pip https -> git (antgonza)
c6193a5 90 ->150 (antgonza)
23fdf69 fix test (antgonza)
9d7fcb8 readd lcg_folder (antgonza)
e82c2f3 add finish_qp_pacbio to woltka (antgonza)
2286019 missing new line (antgonza)
813d03b _small_LCGs -> _small_LCG (antgonza)
4a1fd74 09 -> 11 and default_params_set (antgonza)
20be910 default params should be a dict (antgonza)
75344ca fixes after more testing (antgonza)
f694914 fix tests (antgonza)
b203403 rm > (antgonza)
5c9b92d rm extras from if (antgonza)
5969f77 improve folder (antgonza)
5b0e933 improvements for clarity (antgonza)
d8b67aa reorganize README.rst [skip ci] (antgonza)
fb9aa8d add missing ** [skip ci] (antgonza)
dddd009 add missing ** [skip ci] (antgonza)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,13 +4,15 @@ build-backend = "setuptools.build_meta" | |
|
|
||
| [tool.setuptools] | ||
| packages = ["qp_pacbio"] | ||
| include-package-data = true | ||
|
|
||
| [tool.setuptools.package-data] | ||
| "qp_pacbio" = ["data/*"] | ||
|
|
||
| [project] | ||
| name = "qp_pacbio" | ||
| # version strings must comply with PEP 440: | ||
| # https://peps.python.org/pep-0440/ | ||
| - version = "2025.09" | ||
| + version = "2025.11" | ||
| authors = [{ name = "Qiita Development Team", email = "[email protected]" }] | ||
| description = "Qiita Plugin: PacBio Processing" | ||
| readme = "README.rst" | ||
|
|
@@ -39,11 +41,15 @@ dependencies = [ | |
| 'pytest-cov', | ||
| 'numpy', | ||
| 'Jinja2', | ||
| 'PyYAML', | ||
| 'micov', | ||
| "qiita-files@https://github.com/qiita-spots/qiita-files/archive/master.zip", | ||
| "qiita_client@https://github.com/qiita-spots/qiita_client/archive/master.zip", | ||
| "woltka@git+https://github.com/qiyunzhu/woltka.git#egg=woltka", | ||
| ] | ||
|
|
||
| [project.scripts] | ||
| configure_qp_pacbio = "qp_pacbio.scripts:config" | ||
| start_qp_pacbio = "qp_pacbio.scripts:execute" | ||
| finish_qp_pacbio = "qp_pacbio.scripts:finish_qp_pacbio" | ||
| biom_merge_pacbio = "qp_pacbio.scripts:biom_merge" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| PacBio processing: | ||
| step-1: | ||
| node_count: 1 | ||
| nprocs: 16 | ||
| wall_time_limit: 1-00:00:00 | ||
| mem_in_gb: 200 | ||
| max_tasks: 16 | ||
| step-2: | ||
| node_count: 1 | ||
| nprocs: 1 | ||
| wall_time_limit: 00:10:00 | ||
| mem_in_gb: 2 | ||
| max_tasks: 16 | ||
| step-3: | ||
| node_count: 1 | ||
| nprocs: 8 | ||
| wall_time_limit: 01:00:00 | ||
| mem_in_gb: 10 | ||
| max_tasks: 16 | ||
| step-4: | ||
| node_count: 1 | ||
| nprocs: 8 | ||
| wall_time_limit: 01:00:00 | ||
| mem_in_gb: 6 | ||
| max_tasks: 16 | ||
| step-5: | ||
| node_count: 1 | ||
| nprocs: 8 | ||
| wall_time_limit: 00:30:00 | ||
| mem_in_gb: 2 | ||
| max_tasks: 16 | ||
| step-6: | ||
| node_count: 1 | ||
| nprocs: 8 | ||
| wall_time_limit: 00:30:00 | ||
| mem_in_gb: 2 | ||
| max_tasks: 16 | ||
| step-7: | ||
| node_count: 1 | ||
| nprocs: 8 | ||
| wall_time_limit: 01:00:00 | ||
| mem_in_gb: 50 | ||
| max_tasks: 16 | ||
| finish: | ||
| node_count: 1 | ||
| nprocs: 1 | ||
| wall_time_limit: 00:10:00 | ||
| mem_in_gb: 10 | ||
| Woltka v0.1.7, minimap2: | ||
| minimap2: | ||
| node_count: 1 | ||
| nprocs: 16 | ||
| wall_time_limit: 10:00:00 | ||
| mem_in_gb: 60 | ||
| max_tasks: 16 | ||
| merge: | ||
| node_count: 1 | ||
| nprocs: 16 | ||
| wall_time_limit: 1-00:00:00 | ||
| mem_in_gb: 120 |
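
The `wall_time_limit` values above use two Slurm time layouts, `HH:MM:SS` and `D-HH:MM:SS`. The following is a minimal sketch, not code from this pull request, of how such strings could be validated before they are rendered into `#SBATCH --time`. The function name `parse_wall_time` and the `step_limits` dictionary are hypothetical; only the two layouts seen in the file are handled. One thing that may be worth verifying: under YAML 1.1 rules, which PyYAML follows, an unquoted scalar such as `10:00:00` can resolve to a base-60 integer instead of a string, while `00:10:00` and `1-00:00:00` stay strings. How the plugin loads this file is not shown here, so the sketch simply rejects non-string input.

```python
import re
from datetime import timedelta

# Matches only the two layouts used in the resources file above:
# "HH:MM:SS" and "D-HH:MM:SS". Other Slurm layouts are not handled.
_SLURM_TIME = re.compile(
    r"^(?:(?P<days>\d+)-)?(?P<hours>\d+):(?P<minutes>\d{1,2}):(?P<seconds>\d{1,2})$"
)


def parse_wall_time(value):
    """Convert a Slurm-style time limit string into a timedelta."""
    if not isinstance(value, str):
        # A YAML loader may already have turned the value into a number.
        raise TypeError(f"expected a string, got {type(value).__name__}: {value!r}")
    match = _SLURM_TIME.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported time limit: {value!r}")
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("hours")),
        minutes=int(match.group("minutes")),
        seconds=int(match.group("seconds")),
    )


# Hypothetical subset of the limits listed above, keyed by step name.
step_limits = {"step-1": "1-00:00:00", "step-2": "00:10:00", "minimap2": "10:00:00"}
for step, limit in step_limits.items():
    print(step, parse_wall_time(limit))
```

Quoting the time values in the YAML file would be one way to keep them as strings regardless of the loader.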
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| #!/bin/bash | ||
| #SBATCH -J {{job_name}} | ||
| #SBATCH -p qiita | ||
| #SBATCH -N {{node_count}} | ||
| #SBATCH -n {{nprocs}} | ||
| #SBATCH --time {{wall_time_limit}} | ||
| #SBATCH --mem {{mem_in_gb}}G | ||
| #SBATCH -o {{output}}/step-2/logs/%x-%A_%a.out | ||
| #SBATCH -e {{output}}/step-2/logs/%x-%A_%a.err | ||
| #SBATCH --array {{array_params}} | ||
|
|
||
| source ~/.bashrc | ||
| set -e | ||
| {{conda_environment}} | ||
| cd {{output}}/step-1 | ||
|
|
||
| step=${SLURM_ARRAY_TASK_ID} | ||
| input=$(head -n $step {{output}}/sample_list.txt | tail -n 1) | ||
| sample_name=`echo $input | awk '{print $1}'` | ||
| filename=`echo $input | awk '{print $2}'` | ||
| fn=`basename ${filename}` | ||
|
|
||
| # updating the GUI when task 1 runs | ||
| if [[ "$step" == "1" ]]; then | ||
| python -c "from qp_pacbio.util import client_connect; qclient = client_connect('{{url}}'); qclient.update_job_step('{{qjid}}', 'Running step 2: ${SLURM_ARRAY_JOB_ID}')" | ||
| fi | ||
|
|
||
| cat ${sample_name}.p_ctg.gfa | awk '$1=="S" && ($2 ~ /.c$/) {printf ">%s\n%s\n", $2, $3} ' > ../step-2/${sample_name}_circ.fa | ||
| seqkit split --by-id ../step-2/${sample_name}_circ.fa -O ../step-2/${sample_name}_split | ||
|
|
||
| ### get all contigs for each sample | ||
| cat ${sample_name}.p_ctg.gfa | awk '$1=="S" {printf ">%s\n%s\n", $2, $3} ' > ../step-2/${sample_name}_all_contigs.fa | ||
|
|
||
| cd ../step-2/${sample_name}_split | ||
| # making a copy of the small_LCG before they are removed | ||
| mkdir -p {{output}}/step-2/${sample_name}_small_LCG | ||
| find . -maxdepth 1 -type f -size -512k -print0 | xargs -0 -r cp -t ../${sample_name}_small_LCG | ||
| ### remove small circular genomes | ||
| find . -type f -size -512k -exec rm -f {} + | ||
|
|
||
| # the removal above can leave no files behind, so | ||
| # check that some files remain before continuing | ||
| # | ||
| # extract fasta id for all the genomes in the split folder | ||
| FILES=(*.fa) | ||
| if [ -f $FILES ]; then | ||
| for f in *.fa; do | ||
| k=${f##*/} | ||
| n=${f%.*} | ||
| grep -E "^>" $f >> circular_id.txt | ||
| done | ||
| sed -i 's/>//' circular_id.txt | ||
| seqkit grep -v -f circular_id.txt ../${sample_name}_all_contigs.fa > ../${sample_name}_noLCG.fa | ||
| else | ||
| cp ../${sample_name}_all_contigs.fa ../${sample_name}_noLCG.fa | ||
| fi | ||
|
|
||
| lcg_folder={{result_fp}}/${sample_name}/LCG/ | ||
| mkdir -p ${lcg_folder} | ||
| FILES=({{output}}/step-2/${sample_name}_split/*.fa) | ||
| if [ -f $FILES ]; then | ||
| for f in `ls {{output}}/step-2/${sample_name}_split/*.fa`; do | ||
| sn=`basename ${f/_circ/}`; | ||
| sn=${sn/part_/}; | ||
| cat $f | gzip > ${lcg_folder}/${sn/.fa/.fna}.gz; | ||
| done | ||
| fi | ||
|
|
||
| mkdir -p {{result_fp}}/${sample_name}/ | ||
| if [ -f {{output}}/step-2/${sample_name}_noLCG.fa ]; then | ||
| cat {{output}}/step-2/${sample_name}_noLCG.fa | gzip > {{result_fp}}/${sample_name}/${sample_name}.noLCG.fna.gz | ||
| fi | ||
|
|
||
| touch {{output}}/step-2/completed_${SLURM_ARRAY_TASK_ID}.log | ||
| # if the files don't exist, it means that this step didn't generate any | ||
| # inputs for the next step; thus generating all the completed files | ||
| if [[ ! -f "$FILES" && ! -f "{{output}}/step-2/${sample_name}_noLCG.fa" ]]; then | ||
| touch {{output}}/step-3/completed_${SLURM_ARRAY_TASK_ID}.log | ||
| touch {{output}}/step-4/completed_${SLURM_ARRAY_TASK_ID}.log | ||
| touch {{output}}/step-5/completed_${SLURM_ARRAY_TASK_ID}.log | ||
| touch {{output}}/step-6/completed_${SLURM_ARRAY_TASK_ID}.log | ||
| touch {{output}}/step-7/completed_${SLURM_ARRAY_TASK_ID}.log | ||
| fi | ||
|
|
||
| # saving small LCGs; note that these are not processed downstream, so they | ||
| # are not relevant to the "completed" files | ||
| small_lcg_folder={{result_fp}}/${sample_name}/small_LCG/ | ||
| mkdir -p ${small_lcg_folder} | ||
| FILES=({{output}}/step-2/${sample_name}_small_LCG/*.fa) | ||
| if [ -f $FILES ]; then | ||
| for f in `ls {{output}}/step-2/${sample_name}_small_LCG/*.fa`; do | ||
| sn=`basename ${f/_circ/}`; | ||
| sn=${sn/part_/}; | ||
| cat $f | gzip > ${small_lcg_folder}/${sn/.fa/.fna}.gz; | ||
| done | ||
| fi | ||
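
The two `awk` one-liners in the script above turn GFA segment (`S`) records into FASTA, once for contigs whose name matches `/.c$/` and once for all contigs. The following is a rough Python equivalent, offered only as a sketch to make that filtering explicit. The function names and the example contig names are hypothetical, and it splits on tabs (the GFA delimiter) where `awk` would split on any whitespace.

```python
def gfa_segments(gfa_text):
    """Yield (name, sequence) for every S (segment) record in GFA text."""
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "S":
            yield fields[1], fields[2]


def is_circular(name):
    # Same test as awk's `$2 ~ /.c$/`: any character followed by a final "c".
    return len(name) >= 2 and name.endswith("c")


def to_fasta(records):
    """Format (name, sequence) pairs the way the awk printf does."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Hypothetical three-line GFA: two segments and one link record.
example_gfa = (
    "S\ts0.ctg000001c\tACGTACGT\tLN:i:8\n"
    "S\ts1.ctg000002l\tGGCC\tLN:i:4\n"
    "L\ts0.ctg000001c\t+\ts1.ctg000002l\t-\t0M\n"
)

all_contigs = to_fasta(gfa_segments(example_gfa))
circular = to_fasta(
    (name, seq) for name, seq in gfa_segments(example_gfa) if is_circular(name)
)
print(circular, end="")
```

Here `circular` corresponds to `${sample_name}_circ.fa` and `all_contigs` to `${sample_name}_all_contigs.fa` in the script.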
So small_LCG is defined by file size < 512 KB? That is probably important to note in the documentation for the PacBio workflow.
I was expecting small_LCG to be defined by total genome size
@jianshu93, can you comment?
By file size for now. This can be optimized; file sizes are proportional to total genome size.
Approximately 515,000 bases (about half a million), because one character takes approximately one byte.
I'm adding this to the readme.
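
The relationship between the `-size -512k` test and genome length discussed in this thread can be worked through numerically. The sketch below rests on two assumptions that are not stated in the pull request: that GNU `find` rounds file sizes up to whole KiB before comparing, and that each split file holds one FASTA record wrapped at 60 bases per line with a header of roughly 20 bytes. The function names are hypothetical.

```python
import math

KIB = 1024


def find_size_less_than_k(size_bytes, n_kib):
    """Mimic GNU find's `-size -Nk`: sizes are rounded up to whole KiB first."""
    return math.ceil(size_bytes / KIB) < n_kib


def approx_bases(size_bytes, header_bytes=20, line_width=60):
    """Estimate bases in a single-record FASTA file of the given size.

    header_bytes and line_width are assumptions; adjust them to match the
    actual output of the splitting tool.
    """
    body = max(size_bytes - header_bytes, 0)
    # Each full sequence line stores line_width bases plus one newline byte.
    return body * line_width // (line_width + 1)


largest_small = 511 * KIB  # largest size that still rounds to fewer than 512 KiB
print(find_size_less_than_k(largest_small, 512))      # still counted as small
print(find_size_less_than_k(largest_small + 1, 512))  # one byte more is not
print(approx_bases(largest_small))                    # rough upper bound in bases
```

Under these assumptions the cutoff lands a little under 515,000 bases, which is consistent with the figure quoted above; a different line width or header length would shift it slightly.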