DVC repository containing a pipeline to run MLIP training on two different energy and force fidelities, as well as on their difference.
Configurations are organized using Hydra. You can choose between an sGDML model or a GPR model based on GPyTorch.
Clone the repository and run
pip install -e .
inside it. Note that the GPR_MLIP package has to be installed.
Before running experiments, the data has to be stored appropriately. To run the experiments with the datasets used in the paper, create a directory named datasets in the same directory as this repository. Inside the datasets directory, include the following:
- For the rMD17 dataset:
  - Download and extract the .zip file from rMD17 on Figshare.
  - Place the extracted rmd17 directory directly inside the datasets directory, i.e., datasets/rmd17.
- For the WS22 dataset:
  - Download the .npz files for the individual molecules from WS22 on Zenodo.
  - Create a directory WS22 inside the datasets directory and place all .npz files there, i.e., datasets/WS22/*.npz.
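As a quick sanity check of the layout described above, a minimal sketch along the following lines could be used. The helper name check_dataset_layout is hypothetical and not part of the repository; it only checks that datasets/rmd17 exists and lists the .npz files found in datasets/WS22.

```python
from pathlib import Path


def check_dataset_layout(datasets_dir):
    """Report whether the expected rMD17 and WS22 locations exist.

    Hypothetical helper for illustration; not part of the repository.
    """
    datasets_dir = Path(datasets_dir)
    ws22_dir = datasets_dir / "WS22"
    # Only glob when the directory exists, so a missing WS22 folder
    # simply yields an empty list.
    ws22_files = (
        sorted(p.name for p in ws22_dir.glob("*.npz")) if ws22_dir.is_dir() else []
    )
    return {
        "rmd17_present": (datasets_dir / "rmd17").is_dir(),
        "ws22_npz_files": ws22_files,
    }
```

When called from inside the repository, the argument would be Path.cwd().parent / "datasets", since the datasets directory sits next to the repository.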
If your datasets are stored in different locations, or if you want to use other datasets, you can specify the dataset paths by overriding the corresponding Hydra configurations.
Inside the repository run
dvc exp run
The pipeline consists of six stages:
- calc_new_data: Calculate energies and forces with XTB as the low-fidelity method.
- evaluate_data: Calculate the MAD (mean absolute deviation) for both fidelities and create plots showing the differences between the energy values of the two fidelities.
- prepare_data: Randomly shuffle and split the data into train, test, and active learning sets.
- train: Train the model on the training data.
- predict: Predict energies and uncertainties for the test and active learning sets.
- evaluate: Generate extended reliability diagrams on the active learning set and calculate errors on the test set.
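The prepare_data stage can be pictured with the following sketch. It is not the repository's implementation: the function name, the n_train and n_test parameters, and the choice to put all remaining samples into the active learning set are assumptions made for illustration.

```python
import numpy as np


def shuffle_and_split(n_samples, n_train, n_test, seed=0):
    """Return disjoint index arrays for train, test, and active learning sets."""
    if n_train + n_test > n_samples:
        raise ValueError("n_train + n_test exceeds the number of samples")
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_samples)  # random shuffle of all indices
    train_idx = permutation[:n_train]
    test_idx = permutation[n_train:n_train + n_test]
    # Assumption: every sample not used for train or test goes to the
    # active learning set.
    active_idx = permutation[n_train + n_test:]
    return train_idx, test_idx, active_idx
```

Splitting index arrays instead of the data itself lets the same split be applied to the energies and forces of both fidelities.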
With the model configuration, you can choose between different models.
You can either select an sGDML model or a GPR model based on GPyTorch, which can be:
- trained on the single high-fidelity data,
- trained separately on both fidelities, or
- trained on the difference between the fidelities.
It is also possible to use a model that predicts the high-fidelity energy by adding the difference between the means of the training energies of the two fidelities as a bias to the low-fidelity energy value.
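The bias model just described reduces to a single shift of the low-fidelity energies. A minimal sketch, with hypothetical function and argument names that do not come from the repository:

```python
import numpy as np


def bias_corrected_prediction(e_low_train, e_high_train, e_low_new):
    """Shift low-fidelity energies by the mean training-set energy difference."""
    # Bias = mean(high-fidelity training energies) - mean(low-fidelity training energies)
    bias = np.mean(e_high_train) - np.mean(e_low_train)
    return np.asarray(e_low_new, dtype=float) + bias
```

Because the bias is a single constant, this model can only correct a uniform energy offset between the two fidelities.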
Depending on the selected option, different uncertainties are calculated.
The GPR models based on GPyTorch are only trained on energies.
The hyperparameters are loaded from the gpr_models directory; currently, the models are not trained within the repository.
Three different model options are available:
dvc exp run -S model=single_fidelity_gpr
Here, the GPR standard deviation is used as the uncertainty.
dvc exp run -S model=delta_gpr
Here, the GPR standard deviation is also used as the uncertainty.
dvc exp run -S model=separate_fidelities_gpr
Here, the uncertainty is calculated as the difference between the prediction of the low-fidelity energy and the actual low-fidelity energy.
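The uncertainty used by the separate-fidelities option can be sketched as follows. The function name is hypothetical, and taking the absolute value of the difference is an assumption; the text above only speaks of "the difference".

```python
import numpy as np


def low_fidelity_error_proxy(e_low_pred, e_low_ref):
    """Use the low-fidelity prediction error as an uncertainty estimate."""
    e_low_pred = np.asarray(e_low_pred, dtype=float)
    e_low_ref = np.asarray(e_low_ref, dtype=float)
    # Assumption: the magnitude of the difference is used, so that the
    # uncertainty is non-negative.
    return np.abs(e_low_pred - e_low_ref)
```

The idea is that the low-fidelity reference energy is cheap to compute for a new configuration, so the error of the low-fidelity model on that configuration can be evaluated directly.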
sGDML models are trained on the forces but can also predict energies. Uncertainties are returned for energies as well as for force components. Three different model options are available:
dvc exp run -S model=single_fidelity_sgdml
Here, random values are returned as the uncertainty.
dvc exp run -S model=delta_sgdml
Here, the uncertainty is calculated as the difference between the prediction of the low-fidelity energy and the actual low-fidelity energy.
dvc exp run -S model=separate_fidelities_sgdml
Here, the uncertainty is calculated as the difference between the prediction of the low-fidelity energy and the actual low-fidelity energy.
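The model that adds the mean energy difference as a bias to the low-fidelity energy is selected with:
dvc exp run -S model=add_delta
Here, random values are returned as the uncertainty.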