Skip to content

lanl/ldrd_virus_work

Repository files navigation

ldrd_virus_work (LANL copyright assertion ID: O# (O4909))

LDRD DR Computational Work Repo

Pre-Commit Procedure Linting pre-commit procedure prevents unnecessary CI/CD failures, but testing procedure is necessary as tests marked slow will not run in CI/CD. These must be run in pre-commit.

  • Linting
>black viral_seq
>ruff check viral_seq --fix
>mypy -p viral_seq
  • Testing
>cd /tmp
>python3 -m pytest --pyargs viral_seq

Running the Workflow When running workflow for the first time, skip to step 2.

  1. Uninstall viral_seq with python3 -m pip uninstall viral_seq
  2. Ensure large files have been pulled down with git lfs pull from root directory (for git-lfs installation instructions see https://git-lfs.com/)
  3. Install viral_seq with python3 -m pip install . from root directory
  4. It is advised to create and run the workflow from a fresh working directory to keep artifacts from different runs isolated
  5. Run the workflow with the following commands, replacing [relative_path] as appropriate for your working directory:

Using stored cache:

>python3 [relative_path]/viral_seq/run_workflow.py

Pulling down the cache at runtime:

>python3 [relative_path]/viral_seq/run_workflow.py --cache 3

Workflow Testing As the full workflow is not automatically tested; it should be occasionally tested locally following the above procedure, but with the --debug flag for viral_seq/run_workflow.py which will run the entire workflow with assertions on generated data which are not designed to be performative. It is pertinent to test both workflow options as they require different assertions.

Generating Heatmaps and Calculating Relative Entropy Quickly

At the time of writing it can take longer than an hour to run the full workflow that supports the paper. However, generation of the phylogenetic heatmaps of viral family representation in training and test datasets, and the corresponding relative entropy calculation for those distributions, can be done quickly with an incantation like the one below. This will error out, but will produce the heatmap and a printout of the relative entropy before it does:

> python ../viral_seq/run_workflow.py --cache 0 --features 0 --feature-selection skip -tr Mollentze_Training_Shuffled.csv -ts Mollentze_Holdout_Shuffled.csv

(and similarly for other training and test datasets)

Producing the Violin Plot for LDRD manuscript (and related data)

The ROC AUC violin plot in the LDRD manuscript, which compares relative ML model performances for human, primate, and mammal targets (and with vs. without shuffling/rebalancing the data), can be regenerated by running the six pertinent LDRD workflows and then running the post-processing code.

For example, after installing the project locally and confirming that the testsuite is passing, six subfolders for the different conditions might be created, and the workflow incantations initiated in those directories as follows (10 random seeds were combined for the manuscript results):

# (1) at subdirectory human_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "human" -c "extract" -n 2 -cp 10
# (2) at subdirectory human_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Human_Shuffled.csv -ts Relabeled_Test_Human_Shuffled.csv -tc "human" -c "extract" -n 2 -cp 10
# (3) at subdirectory primate_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "primate" -c "extract" -n 2 -cp 10
# (4) at subdirectory primate_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Primate_Shuffled.csv -ts Relabeled_Test_Primate_Shuffled.csv -tc "primate" -c "extract" -n 2 -cp 10
# (5) at subdirectory mammal_fixed:
python ../viral_seq/run_workflow.py -tr Relabeled_Train.csv -ts Relabeled_Test.csv -tc "mammal" -c "extract" -n 2 -cp 10
# (6) at subdirectory mammal_shuffled:
python ../viral_seq/run_workflow.py -tr Relabeled_Train_Mammal_Shuffled.csv -ts Relabeled_Test_Mammal_Shuffled.csv -tc "mammal" -c "extract" -n 2 -cp 10

After verifying that each workflow has completed normally (i.e., exit code of 0, no error message), the post-processing code can be started to generate the violin plot and its associated raw data:

from viral_seq.data.make_target_comparison_plot import plot_target_comparison

plot_target_comparison(
    human_fixed_predictions="human_fixed/data_calculated/predictions",
    human_shuffled_predictions="human_shuffled/data_calculated/predictions",
    primate_fixed_predictions="primate_fixed/data_calculated/predictions",
    primate_shuffled_predictions="primate_shuffled/data_calculated/predictions",
    mammal_fixed_predictions="mammal_fixed/data_calculated/predictions",
    mammal_shuffled_predictions="mammal_shuffled/data_calculated/predictions",
)   

About Licensing

At the time of writing we are currently bound to the copyleft GPL-3.0 license because we leverage taxonomy-ranks in the small corner of our workflow that deals with phylogenetic heatmaps, and taxonomy-ranks itself depends on the GPL-3.0 licensed ete project. Given the minor role these libraries play in our workflow, we'd appreciate help in finding a more liberally-licensed alternative so that we can avoid copyleft requirements in the future.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages