Description:
The formulation-selector tool, aka Regionalization and Formulation Testing & Selection (RaFTS), is under development. For more information, see the Wiki.
As NOAA OWP builds the model-agnostic NextGen framework, the hydrologic modeling community will need to know how to optimally select model formulations and estimate parameter values across ungauged catchments. This problem becomes intractable when considering the unique combinations of current and future model formulations combined with the innumerable possible parameter combinations across the continent. To simplify the model selection problem, we apply an analytical tool that predicts hydrologic formulation performance (Bolotin et al., 2022; Liu et al., 2022) using community-generated data. The Regionalization and Formulation Testing and Selection (RaFTS) tool provides a framework to predict hydrologically relevant response variables from catchment-wide predictor variables using supervised machine learning algorithms. Examples of possible response variables include 1) hydrologic model formulation performance metrics, 2) hydrologic model process sensitivities, and 3) hydrologic signatures. RaFTS is designed as a decision support tool so that, as the hydrologic modeling community generates more results and datasets, better decisions can be made about where each formulation is best suited.
RaFTS provides an approach consisting of four key steps:
- Standardize response variables into the common format that the rest of the RaFTS processing steps expect.
- Acquire user-defined catchment attributes from publicly available databases.
- Train and evaluate supervised machine learning algorithms that predict the response variable(s) of interest based on catchment attribute data.
- Predict response variables across user-defined catchments of interest.
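The train-then-predict idea in steps 3 and 4 can be illustrated with a minimal sketch. This is not the RaFTS implementation: it swaps in a plain k-nearest-neighbors regressor written with the standard library, and the attribute names and all numbers are made up for illustration.

```python
import math

def knn_predict(train_attrs, train_response, query_attrs, k=2):
    """Predict a response as the mean response of the k nearest training catchments."""
    ranked = sorted(
        (math.dist(attrs, query_attrs), resp)
        for attrs, resp in zip(train_attrs, train_response)
    )
    nearest = ranked[:k]
    return sum(resp for _, resp in nearest) / len(nearest)

# Made-up catchment attributes for gaged catchments: (aridity index, mean slope)
train_attrs = [(0.2, 0.10), (0.3, 0.12), (1.5, 0.02), (1.7, 0.03)]
# Made-up response variable: a performance metric (e.g. KGE) of one formulation
train_kge = [0.80, 0.70, 0.40, 0.30]

# Predict the metric for a catchment of interest from its attributes alone
predicted = knn_predict(train_attrs, train_kge, (0.25, 0.11), k=2)
print(f"predicted metric: {predicted:.2f}")
```

In practice the attributes would come from step 2 and the response variables from step 1, and a library such as scikit-learn (listed in the dependencies) would replace the hand-written regressor.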
Presentations:
- Refer to the Litt et al. 2024 AGU poster for more details on how RaFTS predicts hydrologic model formulation metrics.
- Refer to the Litt et al. 2025 AMS oral presentation for an example of how RaFTS may be used to predict Sobol' sensitivity of hydrologic model processes per Mai et al., 2022.
Technology stack:
- Python: The features of the formulation-selector that ingest model results and catchment attributes to predict model performance are written in Python.
- R: The features of the formulation-selector that acquire the catchment attributes feeding the model prediction algorithm (written in Python, as noted above) are written in R to promote compatibility with the NOAA-OWP/hydrofabric.
Status: Preliminary development. CHANGELOG.
- Technology stack: python. The formulation-selection decision support tool is intended to be a standalone analysis, though integration with pre-existing formulation evaluation metrics tools will eventually occur.
Thus far, formulation-selector has been developed in and tested with Python versions 3.11 and 3.12, so these are currently the recommended versions.
You may consider creating a new virtual environment for employing formulation-selector with the following packages:
- pynhd
- dask
- joblib
- netcdf4
- numpy
- pandas
- pyyaml
- scikit_learn
- setuptools
- xarray
- NOAA-OWP/hydrofabric (R package)
  - Note that the arrow package needs arrow::arrow_with_s3() == TRUE. If FALSE, consider downloading arrow via Apache's r-universe.
  - Steps to install hydrofabric: refer to the wiki.
- USGS nhdplusTools (R package)
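A quick way to confirm that an environment matches the recommendations above is sketched below. It is a convenience check, not part of formulation-selector; the helper names are hypothetical. The import names differ from some distribution names (pyyaml imports as yaml, scikit_learn as sklearn, netcdf4 as netCDF4).

```python
import sys
from importlib.util import find_spec

# Python minor versions this README reports as developed and tested
TESTED_VERSIONS = {(3, 11), (3, 12)}

# Import names for the Python packages listed above
REQUIRED_MODULES = [
    "pynhd", "dask", "joblib", "netCDF4", "numpy",
    "pandas", "yaml", "sklearn", "setuptools", "xarray",
]

def is_tested_python(version=None):
    """Return True if the (major, minor) version is one of the tested versions."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in TESTED_VERSIONS

def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]

print("Tested Python version:", is_tested_python())
print("Missing modules:", missing_modules())
```

The R dependencies are not covered by this check and need to be verified from an R session.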
- Install the fs_prep package: pip install /path/to/pkg/fs_prep/fs_prep/.
- Build a yaml config file at /scripts/eval_metrics/name_of_dataset_here/name_of_dataset_schema.yaml (refer to this template).
- Create a script that reads in the data and runs the standardization processing. Example script here.
- Then run the following:
- Then run the following:
cd /path/to/scripts/eval_metrics/name_of_dataset_here/
python proc_name_of_dataset_here_metrics.py "name_of_dataset_here_schema.yaml"
> cd /path/to/pkg/fs_prep/fs_prep
> pip install .
Ingesting raw data describing model metrics (e.g. KGE, NSE) from modeling simulations requires two tasks:
- Create a custom configuration schema as a .yaml file
- Modify a dataset ingest script
We track these tasks inside formulation-selector/scripts/eval_ingest/_name_of_raw_dataset_here_/
The data schema yaml file contains the following fields:
- col_schema: Required column mappings in the evaluation metrics dataset. These describe the column names in the raw data and how they'll map to standardized column names.
  - For metric_mappings, refer to the fs_categories.yaml.
- file_io: The location of the input data and desired save location. Also specifies the save file format.
- formulation_metadata: Descriptive traits of the model formulation that generated the metrics. Some of these are required fields while others are optional.
- references: Optional but very helpful metadata describing where the data came from.
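Before running an ingest script, it can help to confirm that a parsed schema has the top-level sections listed above. The sketch below is a hypothetical helper, not part of fs_prep. It assumes the yaml file has been parsed into a mapping keyed by section name (for example with yaml.safe_load), and it treats references as the only optional section. The nested keys in the example are placeholders, not the real schema fields.

```python
# Top-level sections named in this README; treating the first three as
# required is an assumption made for this sketch.
REQUIRED_SECTIONS = ("col_schema", "file_io", "formulation_metadata")
OPTIONAL_SECTIONS = ("references",)

def check_schema_sections(schema):
    """Return (missing required sections, unrecognized sections)."""
    known = set(REQUIRED_SECTIONS) | set(OPTIONAL_SECTIONS)
    missing = [name for name in REQUIRED_SECTIONS if name not in schema]
    unknown = [name for name in schema if name not in known]
    return missing, unknown

# Hypothetical parsed config; nested keys and values are placeholders only
example_schema = {
    "col_schema": {"placeholder_key": "placeholder_raw_column_name"},
    "file_io": {"placeholder_key": "/path/to/raw/data"},
    "references": {"placeholder_key": "where the data came from"},
}

missing, unknown = check_schema_sections(example_schema)
print("missing:", missing)
print("unknown:", unknown)
```

Here the example deliberately omits formulation_metadata, so the check reports it as missing.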
The ingest script converts the raw data into the desired format. It performs the following tasks:
- Read in the data schema yaml file (standardized)
- Ingest the raw data (standardized)
- Modify the raw data to become wide-format, where columns consist of the gage id and separate columns for each formulation evaluation metric (user-developed munging)
- Call fs_prep.proc_col_schema() to standardize the dataset into a common format (standardized function call)
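The user-developed munging step depends on how each raw dataset is laid out. As one illustration, the sketch below reshapes long-format records (one row per gage and metric) into wide format (one row per gage, one column per metric) using only the standard library. The key names and values are made up, and the call to fs_prep.proc_col_schema() is left out because its arguments are not described here.

```python
def to_wide(records, id_key="gage_id", metric_key="metric", value_key="value"):
    """Pivot long-format records into one row per gage with a column per metric."""
    wide = {}
    for rec in records:
        # Create the row for this gage the first time it is seen
        row = wide.setdefault(rec[id_key], {id_key: rec[id_key]})
        row[rec[metric_key]] = rec[value_key]
    return list(wide.values())

# Made-up long-format input: one record per (gage, metric) pair
long_records = [
    {"gage_id": "A", "metric": "KGE", "value": 0.71},
    {"gage_id": "A", "metric": "NSE", "value": 0.65},
    {"gage_id": "B", "metric": "KGE", "value": 0.42},
    {"gage_id": "B", "metric": "NSE", "value": 0.38},
]

wide_rows = to_wide(long_records)
for row in wide_rows:
    print(row)
```

With pandas, the same reshaping is typically done with DataFrame.pivot.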
cd /path/to/scripts/eval_metrics/name_of_dataset_here/
python proc_name_of_dataset_here_metrics.py "name_of_dataset_here_schema.yaml"
You may also run unit tests on fs_prep:
> cd /path/to/formulation-selector/pkg/fs_prep/fs_prep/tests
> python -m unittest test_proc_eval_metrics.py
To assess code coverage:
python -m coverage run -m unittest
python -m coverage report
If you have questions, concerns, bug reports, etc, please file an issue in this repository's Issue Tracker.
For general instructions on how to contribute, see CONTRIBUTING.
Description:
Attributes from non-standardized datasets may need to be acquired for RaFTS modeling and prediction. The R package proc.attr.hydfab performs the attribute grabbing.
Run flow.install.proc.attr.hydfab.R to install the package. Note that a user may need to modify the section that creates the fs_dir for their custom path to this repo's directory.
The following is an example script that runs the attribute grabber: fs_attrs_grab.
This script grabs attribute data corresponding to locations of interest, and saves those attribute data inside a directory as multiple parquet files. The proc.attr.hydfab::retrieve_attr_exst() function may then efficiently query and retrieve desired data by variable name and comid from those parquet files.
Note that this script was designed to process data that have already been generated by the fs_prep python package, but users may want to grab attributes from additional locations that have not been processed by fs_prep (e.g. attributes from ungaged basins to use for prediction).
- To independently process attributes for locations without running the fs_prep python package beforehand: A user may ignore previously-processed data by setting Retr_Params$datasets as NULL and specifying the path to a file containing gage_ids inside the Retr_Params$loc_id_read list.
- In the context of reading in a processed dataset from fs_prep or reading in a separate file specifying locations of interest, Retr_Params$datasets uses directory names inside input/user_data_std/ (or simply all to process all datasets). The additional file with location ids may also be read in, or ignored entirely. In summary, either source or both may be used, as determined by how the Retr_Params parameter list object is populated.
- The 'independent' data file for processing attributes has been tested for .csv and .parquet file formats. Other formats compatible with arrow::open_dataset() should be possible but have not been tested.
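The either-or-or-both choice described above can be summarized as decision logic. The sketch below is written in Python purely for illustration: it is not the proc.attr.hydfab code (which is R), and the function name, arguments, and return structure are hypothetical. None stands in for R's NULL.

```python
from pathlib import Path

# File formats this README reports as tested for the independent location file
TESTED_SUFFIXES = {".csv", ".parquet"}

def resolve_sources(datasets, loc_id_file, std_dir="input/user_data_std"):
    """Decide which location sources to process.

    datasets: None, "all", or a list of directory names inside std_dir
    loc_id_file: None, or a path to a file of location ids
    """
    if datasets is None:
        dataset_dirs = []  # ignore previously processed fs_prep data
    elif datasets == "all":
        # Use every dataset directory found under std_dir
        dataset_dirs = sorted(p.name for p in Path(std_dir).iterdir() if p.is_dir())
    else:
        dataset_dirs = list(datasets)

    if not dataset_dirs and loc_id_file is None:
        raise ValueError("Provide datasets, a location id file, or both")

    # Flag file formats that have not been tested
    untested = (
        loc_id_file is not None
        and Path(loc_id_file).suffix.lower() not in TESTED_SUFFIXES
    )
    return {
        "datasets": dataset_dirs,
        "loc_id_file": loc_id_file,
        "untested_format": untested,
    }

# Independent processing: no fs_prep datasets, only a file of gage ids
print(resolve_sources(None, "/path/to/gage_ids.csv"))
```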
Bolotin, LA, Haces-Garcia, F, Liao, M, Liu, Q, Frame, J, Ogden, FL (2022). Data-driven Model Selection in the Next Generation Water Resources Modeling Framework. In Deardorff, E., A. Modaresi Rad, et al. (2022). National Water Center Innovators Program - Summer Institute, CUAHSI Technical Report, HydroShare, http://www.hydroshare.org/resource/096e7badabb44c9f8c29751098f83afa
Liu, Q, Bolotin, L, Haces-Garcia, F, Liao, M, Ogden, FL, Frame, JM (2022). Automated Decision Support for Model Selection in the Nextgen National Water Model. Abstract (H45I-1503) presented at 2022 AGU Fall Meeting, 12-16 Dec.