Best practice for associating units of measurement with simple metadata in extensions

At our meeting this week, a few of us discussed how best to associate units of measurement with simple attribute metadata, often about devices or protocols (e.g., `emission_lambda` (in nm), `grid_spacing` (in um), `camera_width` (in pixels), `pulse_length` (in ms), `injection_volume` (in mL)), where it makes sense to fix the unit to a particular value. I'm documenting this discussion here. It would also be good to collect input from others.

Any changes we make on this to nwb-schema would break backward compatibility. But we can provide best practices for extensions and later look at improving the core nwb schema. (tl;dr at bottom)

## Field naming

Currently we have several methods for doing this in NWB:

1. a dataset with a `unit` attribute that is fixed to a particular value (we also have cases where the `unit` attribute has a default, recommended value that can be overridden, but let's consider only the fixed value cases here) 
2. an attribute with the unit described in the API docstring and schema doc

A problem with both methods is that a user who browses or writes data without consulting the docs may guess the units of measurement incorrectly. In general, we recommend using SI base units, but most people don't know that, and for some fields, like `emission_lambda`, which is almost always communicated in nanometers, it is unintuitive to use the SI base unit (meters). This has resulted in incorrectly written data.

One pattern recommended by the LinkML group, which has extensive experience modeling data from different fields, is to put the unit abbreviation in the field name itself: https://linkml.io/linkml/howtos/model-measurements.html#simple-explicit-scalar-pattern (they also have other suggested approaches, but this is the simplest). For example, `emission_lambda_in_nm`, `camera_width_in_px`. We agreed that this approach would be best because it is clear and explicit, at the cost of being a little more verbose. We should also still have NWB Inspector check that the values are reasonable.
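As a rough illustration of the kind of plausibility check mentioned above, here is a minimal sketch in plain Python. It does not use the NWB Inspector API; the function name `check_plausible` and the ranges in `PLAUSIBLE_RANGES` are hypothetical and chosen only for illustration.

```python
# Hypothetical plausibility ranges keyed by field name.
# The bounds are illustrative assumptions, not agreed-upon limits.
PLAUSIBLE_RANGES = {
    "emission_lambda_in_nm": (200.0, 2000.0),
    "pulse_length_in_ms": (0.001, 60_000.0),
}


def check_plausible(field_name, value):
    """Return a warning string if value is outside the registered range, else None."""
    bounds = PLAUSIBLE_RANGES.get(field_name)
    if bounds is None:
        return None  # no range registered for this field
    low, high = bounds
    if low <= value <= high:
        return None
    return (
        f"{field_name}={value!r} is outside the plausible range [{low}, {high}]; "
        "check that the value uses the unit given in the field name."
    )


# A wavelength written in meters instead of nanometers is flagged.
print(check_plausible("emission_lambda_in_nm", 520.0))
print(check_plausible("emission_lambda_in_nm", 5.2e-7))
```

A check like this catches the common mistake of writing a value in the SI base unit when the field name asks for a prefixed unit.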

## Use of non-base SI units

As mentioned above, for some metadata, it is unintuitive to use an SI base unit or other unprefixed unit (e.g., meters, seconds, liters) because it differs from the unit that is widely used to communicate the metadata in the community (e.g., nanometers, milliseconds, microliters). I propose that we recommend that extension writers use the units that are already widely used. When it is not clear what is widely used, we should poll the community and pick one. Using a fixed unit is better than letting people enter one, because they are unlikely to enter it in a standard form (across current dandisets, the set of all entered `grid_spacing.unit` values is `{"microns", "micrometers", "millimeters", "meters", "microns per pixel", "mm"}`).
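To show what free-text units cost downstream, here is a small sketch of the lookup table a reader would need just to bring the `grid_spacing.unit` strings listed above onto one scale. The function name `grid_spacing_in_um` is hypothetical, and treating "microns per pixel" as micrometers is my assumption about what the writer meant.

```python
# Unit strings observed for grid_spacing.unit, mapped to the factor that
# converts the stored value to micrometers.
TO_MICROMETERS = {
    "microns": 1.0,
    "micrometers": 1.0,
    "microns per pixel": 1.0,  # assumption: a per-pixel spacing in micrometers
    "mm": 1_000.0,
    "millimeters": 1_000.0,
    "meters": 1_000_000.0,
}


def grid_spacing_in_um(value, unit):
    """Convert a grid spacing with a free-text unit string to micrometers."""
    key = unit.strip().lower()
    if key not in TO_MICROMETERS:
        raise ValueError(f"Unrecognized unit string: {unit!r}")
    return value * TO_MICROMETERS[key]


print(grid_spacing_in_um(2.0, "mm"))
print(grid_spacing_in_um(1.5, "microns"))
```

With a fixed unit in the field name (e.g., `grid_spacing_in_um`), none of this mapping is needed, and any new spelling of a unit cannot silently break a reader.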

## Abbreviation in field names

I propose that unit abbreviations should come from CMIXF-12, with the modification that because `/` and `^` are not allowed in Python and MATLAB variable names (e.g., for W/m^2), `/` should be replaced with `_` and `^` with nothing (e.g., `intensity_in_W_m2`). The true unit abbreviation must be written in the docs. I think the modified abbreviation would make sense in context, and confused users could consult the docs. (See also usage of CMIXF-12 in [BIDS](https://bids-specification.readthedocs.io/en/stable/common-principles.html#units) and [relevant discussion and links](https://github.com/NeurodataWithoutBorders/nwb-schema/pull/446).)
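The substitution rule above is simple enough to sketch in a few lines of Python. The helper names `unit_to_suffix` and `field_name_with_unit` are hypothetical. The sketch handles only the two characters named in the proposal (`/` and `^`) and rejects anything else that would not form a valid Python identifier, since the proposal does not say how to handle other characters.

```python
def unit_to_suffix(unit):
    """Apply the proposed modification: '/' becomes '_' and '^' is dropped."""
    return unit.replace("/", "_").replace("^", "")


def field_name_with_unit(base_name, unit):
    """Build '<base_name>_in_<unit suffix>' and check it is a valid identifier."""
    name = f"{base_name}_in_{unit_to_suffix(unit)}"
    # str.isidentifier checks Python's rules only; MATLAB has its own
    # (e.g., a maximum name length) that are not checked here.
    if not name.isidentifier():
        raise ValueError(f"{name!r} is not a valid identifier; unit {unit!r} needs manual handling")
    return name


print(field_name_with_unit("intensity", "W/m^2"))
print(field_name_with_unit("titer", "vg/mL"))
print(field_name_with_unit("emission_lambda", "nm"))
```

Note that the mapping is lossy: `W_m2` could in principle be read as something other than W/m^2, which is why the true abbreviation still has to appear in the docs.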

## Data format

Should users use a dataset (option 1 above) or an attribute (option 2 above)? Attributes are generally preferred for small metadata that are more properties than measurements, especially scalar values, so I think option 2 is best, but I don't think we settled on this. 

An exception is `DynamicTable` columns that hold these metadata, for example, when parameters of a stimulus like `pulse_length_in_ms` change across trials/epochs. Columns are datasets. I propose that having an attribute named `unit` with a fixed value on the dataset is optional but recommended.
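To make the two cases concrete, here is a sketch of what the specs might look like, written as plain Python dictionaries rather than with any spec-building API. The keys (`name`, `dtype`, `doc`, `value`, `neurodata_type_inc`, `attributes`) are meant to mirror the NWB schema language as I understand it and should be checked against the schema language docs. The helper `fixed_unit` is hypothetical, and whether an extension can attach a `unit` attribute to a column this way is an assumption.

```python
# Option 2: a scalar attribute with the unit abbreviation in its name.
laser_power_attribute = {
    "name": "laser_power_in_mW",
    "dtype": "float",
    "doc": "Laser power, in milliwatts (mW).",
}

# DynamicTable column (a dataset) with an optional, fixed `unit` attribute.
pulse_length_column = {
    "name": "pulse_length_in_ms",
    "neurodata_type_inc": "VectorData",
    "dtype": "float",
    "doc": "Pulse length for each trial, in milliseconds (ms).",
    "attributes": [
        {
            "name": "unit",
            "dtype": "text",
            "value": "ms",  # fixed value, not a default
            "doc": "Unit of measurement, fixed to 'ms'.",
        },
    ],
}


def fixed_unit(spec):
    """Return the fixed value of the spec's `unit` attribute, or None if absent."""
    for attr in spec.get("attributes", []):
        if attr["name"] == "unit":
            return attr.get("value")
    return None


# The fixed unit and the field-name suffix should agree.
unit = fixed_unit(pulse_length_column)
print(unit, pulse_length_column["name"].endswith(f"_in_{unit}"))
```

Keeping the `unit` attribute on columns is redundant with the name, but it lets generic tools that only look for `unit` keep working.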

Note: This discussion is related to, but distinct from, the discussion on using a `MeasurementData` type that has attributes for `unit`, `conversion`, `offset`, and `resolution` (see https://github.com/NeurodataWithoutBorders/nwb-schema/pull/493). That type is designed for measured data from a data acquisition system. The best practices suggested here are mostly for small metadata (usually scalar properties of a device or protocol) where there is no conversion factor, offset, or resolution. 

## Summary

To summarize, I propose that for extensions, the best practice is that fields that represent metadata whose units should be a fixed value should be schematized as **attributes** where the **unit abbreviation is in the field name**, e.g., 
- `emission_lambda_in_nm` for `OpticalChannel`
- `grid_spacing_in_um` for `ImagingPlane`
- `camera_width_in_px` for a behavioral video
- `pulse_length_in_ms` for stimulation
- `injection_volume_in_mL` for viruses
- `laser_power_in_mW` for optogenetics
- `core_diameter_in_um` for optic fibers
- `ap_in_mm` for brain coordinates
- `theta_filter_phase_in_deg` for a closed loop feedback protocol
- `titer_in_vg_mL` (vg/mL; in context, it should make sense, and the unit will be well described in the docs)
- `intensity_in_W_m2` (W/m^2) for a light source
- `diameter_in_um` for spiral scanning stimulation

that the **unit is the one widely used** by the community, and the abbreviation follows a **modified CMIXF-12 convention** as described above.

cc @oruebel @bendichter @CodyCBakerPhD @alessandratrapani
