Pre-Commit Procedure
Running the linting procedure before committing prevents unnecessary CI/CD failures, and running the testing procedure pre-commit is also necessary because tests marked slow will not run in CI/CD. Both must be run before committing; a sketch combining the commands follows the lists below.
- Linting
>black viral_seq
>ruff check viral_seq --fix
>mypy -p viral_seq
- Testing
>cd /tmp
>python3 -m pytest --pyargs viral_seq
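Where convenient, the four commands above can be chained. A minimal sketch (a hypothetical helper, not part of the project) that runs them in order and stops at the first failure, assuming the tools are on PATH and the script is launched from the repository root:
import subprocess
import sys

# Pre-commit checks from the lists above; the pytest step runs from /tmp
# so that the installed package, not the source tree, is imported.
CHECKS = [
    (["black", "viral_seq"], None),
    (["ruff", "check", "viral_seq", "--fix"], None),
    (["mypy", "-p", "viral_seq"], None),
    ([sys.executable, "-m", "pytest", "--pyargs", "viral_seq"], "/tmp"),
]

for cmd, cwd in CHECKS:
    print("running:", " ".join(cmd))
    result = subprocess.run(cmd, cwd=cwd)
    if result.returncode != 0:
        sys.exit(result.returncode)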
Running the Workflow
When running the workflow for the first time, skip to step 2.
1. Uninstall viral_seq with
python3 -m pip uninstall viral_seq
2. Ensure large files have been pulled down with
git lfs pull
from the root directory (for git-lfs installation instructions see https://git-lfs.com/)
3. Install viral_seq with
python3 -m pip install .
from the root directory
4. It is advised to create and run the workflow from a fresh working directory to keep artifacts from different runs isolated; a sketch of this appears after the run commands below
5. Run the workflow with the following commands, replacing [relative_path] as appropriate for your working directory:
Using stored cache:
>python3 [relative_path]/viral_seq/run_workflow.py
Pulling down the cache at runtime:
>python3 [relative_path]/viral_seq/run_workflow.py --cache 3
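To keep runs isolated as advised in step 4, the invocation can be wrapped in a small script that creates a fresh directory per run. A minimal sketch, assuming a hypothetical checkout location (~/viral_seq) and an arbitrary working-directory name:
import subprocess
import sys
from pathlib import Path

# Assumed checkout location; adjust to wherever the repository lives.
workflow = Path("~/viral_seq/viral_seq/run_workflow.py").expanduser()

# One fresh directory per run keeps artifacts isolated (name is arbitrary).
workdir = Path("run_cache_stored")
workdir.mkdir(exist_ok=True)

# Stored-cache invocation from above; append "--cache", "3" to the command
# list to pull the cache down at runtime instead.
subprocess.run([sys.executable, str(workflow)], cwd=workdir, check=True)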
Workflow Testing
As the full workflow is not automatically tested, it should be occasionally tested locally following the above procedure, but with the --debug flag for viral_seq/run_workflow.py, which runs the entire workflow with assertions on the generated data; these assertions are not designed to be performant. It is important to test both workflow options, as they require different assertions.
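For example, both cache variants might be exercised as follows (the flag combinations are an assumption based on the run commands above):
>python3 [relative_path]/viral_seq/run_workflow.py --debug
>python3 [relative_path]/viral_seq/run_workflow.py --debug --cache 3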
At the time of writing it can take longer than an hour to run the full workflow that supports the paper. However, generation of the phylogenetic heatmaps of viral family representation in training and test datasets, and the corresponding relative entropy calculation for those distributions, can be done quickly with an incantation like the one below. This will error out, but will produce the heatmap and a printout of the relative entropy before it does:
> python ../viral_seq/run_workflow.py --cache 0 --features 0 --feature-selection skip -tr Mollentze_Training_Shuffled.csv -ts Mollentze_Holdout_Shuffled.csv
(and similarly for other training and test datasets)
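For context, the relative entropy here is the Kullback-Leibler divergence D(P||Q) = sum_i p_i log(p_i / q_i) between the training (P) and test (Q) viral family distributions. A minimal sketch with made-up family counts (the real distributions come from the datasets above):
import numpy as np
from scipy.stats import entropy

# Made-up per-family counts for illustration only; the workflow derives
# these from the training and test CSVs.
train_counts = np.array([30, 12, 8, 5])
test_counts = np.array([10, 6, 3, 1])

# scipy normalizes both inputs to probability vectors and returns
# sum(p * log(p / q)) in nats.
print(entropy(train_counts, test_counts))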
The ROC AUC violin plot in the LDRD manuscript, which compares relative ML model performances for human, primate, and mammal targets (and with vs. without shuffling/rebalancing the data), can be regenerated by running the six pertinent LDRD workflows and then running the post-processing code.
For example, after installing the project locally and confirming that the test suite is passing, six subfolders for the different conditions might be created, and the workflow incantations initiated in those directories as follows (10 random seeds were combined for the manuscript results; a driver sketch follows the commands):
# (1) at subdirectory human_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "human" -c "extract" -n 2 -cp 10
# (2) at subdirectory human_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Human_Shuffled.csv -ts Relabeled_Test_Human_Shuffled.csv -tc "human" -c "extract" -n 2 -cp 10
# (3) at subdirectory primate_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "primate" -c "extract" -n 2 -cp 10
# (4) at subdirectory primate_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Primate_Shuffled.csv -ts Relabeled_Test_Primate_Shuffled.csv -tc "primate" -c "extract" -n 2 -cp 10
# (5) at subdirectory mammal_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "mammal" -c "extract" -n 2 -cp 10
# (6) at subdirectory mammal_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Mammal_Shuffled.csv -ts Relabeled_Test_Mammal_Shuffled.csv -tc "mammal" -c "extract" -n 2 -cp 10
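The six invocations can also be scripted. A hypothetical driver (not part of the project) that creates each subdirectory and launches the corresponding run from it, assuming the same relative layout as the commands above:
import subprocess
import sys
from pathlib import Path

# The six manuscript conditions, taken from the commands above.
CONDITIONS = {
    "human_fixed": ("Relabeled_Train.csv", "Relabeled_Test.csv", "human"),
    "human_shuffled": ("Relabeled_Train_Human_Shuffled.csv",
                       "Relabeled_Test_Human_Shuffled.csv", "human"),
    "primate_fixed": ("Relabeled_Train.csv", "Relabeled_Test.csv", "primate"),
    "primate_shuffled": ("Relabeled_Train_Primate_Shuffled.csv",
                         "Relabeled_Test_Primate_Shuffled.csv", "primate"),
    "mammal_fixed": ("Relabeled_Train.csv", "Relabeled_Test.csv", "mammal"),
    "mammal_shuffled": ("Relabeled_Train_Mammal_Shuffled.csv",
                        "Relabeled_Test_Mammal_Shuffled.csv", "mammal"),
}

for subdir, (train, test, target) in CONDITIONS.items():
    Path(subdir).mkdir(exist_ok=True)
    subprocess.run(
        [sys.executable, "../viral_seq/run_workflow.py",
         "-tr", train, "-ts", test, "-tc", target,
         "-c", "extract", "-n", "2", "-cp", "10"],
        cwd=subdir,
        check=True,  # raise immediately on a nonzero exit code
    )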
After verifying that each workflow has completed normally (i.e., exit code of 0 and no error message), the post-processing code can be started to generate the violin plot and its associated raw data:
from viral_seq.data.make_target_comparison_plot import plot_target_comparison

plot_target_comparison(
    human_fixed_predictions="human_fixed/data_calculated/predictions",
    human_shuffled_predictions="human_shuffled/data_calculated/predictions",
    primate_fixed_predictions="primate_fixed/data_calculated/predictions",
    primate_shuffled_predictions="primate_shuffled/data_calculated/predictions",
    mammal_fixed_predictions="mammal_fixed/data_calculated/predictions",
    mammal_shuffled_predictions="mammal_shuffled/data_calculated/predictions",
)
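As an optional sanity check before plotting (not part of the project API), one might confirm that each predictions path from the calls above actually exists:
from pathlib import Path

subdirs = ["human_fixed", "human_shuffled", "primate_fixed",
           "primate_shuffled", "mammal_fixed", "mammal_shuffled"]
missing = [p for p in (Path(s, "data_calculated", "predictions") for s in subdirs)
           if not p.exists()]
assert not missing, f"missing prediction outputs: {missing}"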
At the time of writing we are bound to the copyleft GPL-3.0 license because we leverage taxonomy-ranks in the small corner of our workflow that deals with phylogenetic heatmaps, and taxonomy-ranks itself depends on the GPL-3.0 licensed ete project. Given the minor role these libraries play in our workflow, we'd appreciate help in finding a more liberally-licensed alternative so that we can avoid copyleft requirements in the future.