Pre-Commit Procedure
Running the linting procedure before committing prevents unnecessary CI/CD failures, and running the testing procedure pre-commit is also necessary because tests marked slow will not run in CI/CD. Both must be run before committing; a sketch combining the commands follows the lists below.
- Linting
>black viral_seq
>ruff check viral_seq --fix
>mypy -p viral_seq
- Testing
>cd /tmp
>python3 -m pytest --pyargs viral_seq
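Where convenient, the four commands above can be chained. A minimal sketch (a hypothetical helper, not part of the project) that runs them in order and stops at the first failure, assuming the tools are on PATH and the script is launched from the repository root:
import subprocess
import sys

# Pre-commit checks from the lists above; the pytest step runs from /tmp
# so that the installed package, not the source tree, is imported.
CHECKS = [
    (["black", "viral_seq"], None),
    (["ruff", "check", "viral_seq", "--fix"], None),
    (["mypy", "-p", "viral_seq"], None),
    ([sys.executable, "-m", "pytest", "--pyargs", "viral_seq"], "/tmp"),
]

for cmd, cwd in CHECKS:
    print("running:", " ".join(cmd))
    result = subprocess.run(cmd, cwd=cwd)
    if result.returncode != 0:
        sys.exit(result.returncode)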
Running the Workflow
When running the workflow for the first time, skip to step 2.
1. Uninstall viral_seq with
python3 -m pip uninstall viral_seq
2. Ensure large files have been pulled down with
git lfs pull
from the root directory (for git-lfs installation instructions see https://git-lfs.com/)
3. Install viral_seq with
python3 -m pip install .
from the root directory
4. It is advised to create and run the workflow from a fresh working directory to keep artifacts from different runs isolated; a sketch of this appears after the run commands below
5. Run the workflow with the following commands, replacing [relative_path] as appropriate for your working directory:
Using stored cache:
>python3 [relative_path]/viral_seq/run_workflow.py
Pulling down the cache at runtime:
>python3 [relative_path]/viral_seq/run_workflow.py --cache 3
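To keep runs isolated as advised in step 4, the invocation can be wrapped in a small script that creates a fresh directory per run. A minimal sketch, assuming a hypothetical checkout location (~/viral_seq) and an arbitrary working-directory name:
import subprocess
import sys
from pathlib import Path

# Assumed checkout location; adjust to wherever the repository lives.
workflow = Path("~/viral_seq/viral_seq/run_workflow.py").expanduser()

# One fresh directory per run keeps artifacts isolated (name is arbitrary).
workdir = Path("run_cache_stored")
workdir.mkdir(exist_ok=True)

# Stored-cache invocation from above; append "--cache", "3" to the command
# list to pull the cache down at runtime instead.
subprocess.run([sys.executable, str(workflow)], cwd=workdir, check=True)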
Workflow Testing
As the full workflow is not automatically tested, it should be occasionally tested locally following the above procedure, but with the --debug flag for viral_seq/run_workflow.py, which runs the entire workflow with assertions on the generated data; these assertions are not designed to be performant. It is important to test both workflow options, as they require different assertions.
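For example, both cache variants might be exercised as follows (the flag combinations are an assumption based on the run commands above):
>python3 [relative_path]/viral_seq/run_workflow.py --debug
>python3 [relative_path]/viral_seq/run_workflow.py --debug --cache 3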
At the time of writing it can take longer than an hour to run the full workflow that supports the paper. However, generation of the phylogenetic heatmaps of viral family representation in training and test datasets, and the corresponding relative entropy calculation for those distributions, can be done quickly with an incantation like the one below. This will error out, but will produce the heatmap and a printout of the relative entropy before it does:
> python ../viral_seq/run_workflow.py --cache 0 --features 0 --feature-selection skip -tr Mollentze_Training_Shuffled.csv -ts Mollentze_Holdout_Shuffled.csv
(and similarly for other training and test datasets)
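For context, the relative entropy here is the Kullback-Leibler divergence D(P||Q) = sum_i p_i log(p_i / q_i) between the training (P) and test (Q) viral family distributions. A minimal sketch with made-up family counts (the real distributions come from the datasets above):
import numpy as np
from scipy.stats import entropy

# Made-up per-family counts for illustration only; the workflow derives
# these from the training and test CSVs.
train_counts = np.array([30, 12, 8, 5])
test_counts = np.array([10, 6, 3, 1])

# scipy normalizes both inputs to probability vectors and returns
# sum(p * log(p / q)) in nats.
print(entropy(train_counts, test_counts))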
The ROC AUC violin plot in the LDRD manuscript, which compares relative ML model performances for human, primate, and mammal targets (and with vs. without shuffling/rebalancing the data), can be regenerated by running the six pertinent LDRD workflows and then running the post-processing code.
For example, after installing the project locally and confirming that the test suite is passing, six subfolders for the different conditions might be created, and the workflow incantations initiated in those directories as follows (10 random seeds were combined for the manuscript results; a driver sketch follows the commands):
# (1) at subdirectory human_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "human" -c "extract" -n 2 -cp 10
# (2) at subdirectory human_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Human_Shuffled.csv -ts Relabeled_Test_Human_Shuffled.csv -tc "human" -c "extract" -n 2 -cp 10
# (3) at subdirectory primate_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "primate" -c "extract" -n 2 -cp 10
# (4) at subdirectory primate_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Primate_Shuffled.csv -ts Relabeled_Test_Primate_Shuffled.csv -tc "primate" -c "extract" -n 2 -cp 10
# (5) at subdirectory mammal_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "mammal" -c "extract" -n 2 -cp 10
# (6) at subdirectory mammal_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Mammal_Shuffled.csv -ts Relabeled_Test_Mammal_Shuffled.csv -tc "mammal" -c "extract" -n 2 -cp 10
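The six invocations can also be scripted. A hypothetical driver (not part of the project) that creates each subdirectory and launches the corresponding run from it, assuming the same relative layout as the commands above:
import subprocess
import sys
from pathlib import Path

# The six manuscript conditions, taken from the commands above.
CONDITIONS = {
    "human_fixed": ("Relabeled_Train.csv", "Relabeled_Test.csv", "human"),
    "human_shuffled": ("Relabeled_Train_Human_Shuffled.csv",
                       "Relabeled_Test_Human_Shuffled.csv", "human"),
    "primate_fixed": ("Relabeled_Train.csv", "Relabeled_Test.csv", "primate"),
    "primate_shuffled": ("Relabeled_Train_Primate_Shuffled.csv",
                         "Relabeled_Test_Primate_Shuffled.csv", "primate"),
    "mammal_fixed": ("Relabeled_Train.csv", "Relabeled_Test.csv", "mammal"),
    "mammal_shuffled": ("Relabeled_Train_Mammal_Shuffled.csv",
                        "Relabeled_Test_Mammal_Shuffled.csv", "mammal"),
}

for subdir, (train, test, target) in CONDITIONS.items():
    Path(subdir).mkdir(exist_ok=True)
    subprocess.run(
        [sys.executable, "../viral_seq/run_workflow.py",
         "-tr", train, "-ts", test, "-tc", target,
         "-c", "extract", "-n", "2", "-cp", "10"],
        cwd=subdir,
        check=True,  # raise immediately on a nonzero exit code
    )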
After verifying that each workflow has completed normally (i.e., exit code of 0 and no error message), the post-processing code can be started to generate the violin plot and its associated raw data:
from viral_seq.data.make_target_comparison_plot import plot_target_comparison

plot_target_comparison(
    human_fixed_predictions="human_fixed/data_calculated/predictions",
    human_shuffled_predictions="human_shuffled/data_calculated/predictions",
    primate_fixed_predictions="primate_fixed/data_calculated/predictions",
    primate_shuffled_predictions="primate_shuffled/data_calculated/predictions",
    mammal_fixed_predictions="mammal_fixed/data_calculated/predictions",
    mammal_shuffled_predictions="mammal_shuffled/data_calculated/predictions",
)
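As an optional sanity check before plotting (not part of the project API), one might confirm that each predictions path from the calls above actually exists:
from pathlib import Path

subdirs = ["human_fixed", "human_shuffled", "primate_fixed",
           "primate_shuffled", "mammal_fixed", "mammal_shuffled"]
missing = [p for p in (Path(s, "data_calculated", "predictions") for s in subdirs)
           if not p.exists()]
assert not missing, f"missing prediction outputs: {missing}"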
At the time of writing we are bound to the copyleft GPL-3.0 license because we leverage taxonomy-ranks in the small corner of our workflow that deals with phylogenetic heatmaps, and taxonomy-ranks itself depends on the GPL-3.0 licensed ete project. Given the minor role these libraries play in our workflow, we'd appreciate help in finding a more liberally-licensed alternative so that we can avoid copyleft requirements in the future.