Best practice for associating units of measurement with simple metadata in extensions #569

@rly

A few of us discussed at our meeting this week how best to associate units of measurement with simple attribute metadata, often about devices or protocols (e.g., emission_lambda (in nm), grid_spacing (in um), camera_width (in pixels), pulse_length (in ms), injection_volume (in mL)), where it makes sense to fix the unit to a particular value. I'm documenting this discussion here. It would also be good to collect input from others.

Any changes we make on this to nwb-schema would break backward compatibility. But we can provide best practices for extensions and later look at improving the core nwb schema. (tl;dr at bottom)

Field naming

Currently we have several methods for doing this in NWB:

  1. a dataset with a unit attribute that is fixed to a particular value (we also have cases where the unit attribute has a default, recommended value that can be overridden, but let's consider only the fixed value cases here)
  2. an attribute with the unit described in the API docstring and schema doc

A problem with these methods is that a user who browses or writes data naively, without looking at the docs, may guess the units of measurement incorrectly. In general, we recommend using SI base units, but most people don't know that. For some fields, like emission_lambda, which is almost always communicated in nanometers, it is also unintuitive to use the SI base unit (meters). This has resulted in incorrectly written data.

One pattern recommended by the LinkML group, which has extensive experience modeling data from different fields, is to put the unit abbreviation in the field name itself: https://linkml.io/linkml/howtos/model-measurements.html#simple-explicit-scalar-pattern (they also have other suggested approaches, but this is the simplest). For example, emission_lambda_in_nm, camera_width_in_px. We agreed that this approach would be best because it is clear and explicit, at the cost of being a little more verbose. We should also still have NWB Inspector check that the values are reasonable.
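As a rough illustration of the kind of reasonableness check mentioned above, here is a minimal sketch in plain Python. The function name and the numeric bounds are hypothetical assumptions for illustration; this is not the NWB Inspector API, and the ranges are not taken from any schema.

```python
# Hypothetical plausibility ranges keyed by field name. The bounds are
# illustrative assumptions, not values from NWB Inspector or nwb-schema.
PLAUSIBLE_RANGES = {
    "emission_lambda_in_nm": (200.0, 2000.0),
    "camera_width_in_px": (1.0, 100_000.0),
}


def check_plausible(field_name, value):
    """Return a warning string if value is outside the assumed range, else None."""
    bounds = PLAUSIBLE_RANGES.get(field_name)
    if bounds is None:
        return None  # no rule defined for this field
    low, high = bounds
    if low <= value <= high:
        return None
    return (
        f"{field_name}={value!r} is outside the expected range "
        f"[{low}, {high}]; check the unit."
    )


# 5.2e-07 is what a 520 nm wavelength looks like when written in meters.
print(check_plausible("emission_lambda_in_nm", 5.2e-07))
print(check_plausible("emission_lambda_in_nm", 520.0))
```

With the unit in the field name, a check like this can be keyed on the name alone and does not need to parse a separate unit string.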

Use of non-base SI units

As mentioned above, for some metadata, it is unintuitive to use an SI base unit or other unprefixed unit (e.g., meters, seconds, liters) because it differs from the unit that is widely used to communicate the metadata in the community (e.g., nanometers, milliseconds, microliters). I propose that we recommend that extension writers use the units that are already widely used. When it is not clear what is widely used, we should try to poll the community and just pick one. Using a fixed value is better than allowing people to enter a value because they are unlikely to enter the value in a standard form (across current dandisets, the set of all entered grid_spacing.unit values is {"microns", "micrometers", "millimeters", "meters", "microns per pixel", "mm"}).
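To make the cost of free-text units concrete, here is a small sketch of the cleanup a reader would need when units are user-entered. The observed strings come from the list above; the mapping to abbreviations and the function name are assumptions made for illustration only.

```python
# Spellings reported above for grid_spacing.unit across current dandisets.
OBSERVED_UNITS = {
    "microns", "micrometers", "millimeters", "meters", "microns per pixel", "mm",
}

# Hypothetical normalization table. None marks an entry that is not a plain
# length unit and cannot be mapped without guessing.
CANONICAL = {
    "microns": "um",
    "micrometers": "um",
    "millimeters": "mm",
    "mm": "mm",
    "meters": "m",
    "microns per pixel": None,
}


def normalize_unit(raw):
    """Map a free-text unit string to an abbreviation, or None if ambiguous."""
    key = raw.strip().lower()
    if key not in CANONICAL:
        raise KeyError(f"unrecognized unit string: {raw!r}")
    return CANONICAL[key]


# Six spellings collapse to three units plus one ambiguous entry.
distinct = {normalize_unit(u) for u in OBSERVED_UNITS}
print(len(OBSERVED_UNITS), len(distinct))
```

A fixed unit makes this table unnecessary, since every file then records the value on the same scale.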

Abbreviation in field names

I propose that unit abbreviations should come from CMIXF-12, with one modification: because / and ^ are not allowed in Python and MATLAB variable names, / should be replaced with _ and ^ should be dropped (e.g., W/m^2 becomes intensity_in_W_m2). The true unit abbreviation must be written in the docs. I think the shortened form would make sense in context, and confused users could consult the docs. (See also usage of CMIXF-12 in BIDS and relevant discussion and links.)
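The substitution rule above can be sketched in a few lines of Python. The helper names are hypothetical; the sketch handles only the two characters named in the proposal and raises an error for any result that is still not a valid Python identifier.

```python
import keyword


def unit_to_suffix(unit):
    """Apply the proposed rule: '/' becomes '_' and '^' is dropped."""
    return unit.replace("/", "_").replace("^", "")


def field_name_with_unit(base_name, unit):
    """Build '<base_name>_in_<unit suffix>' and verify it is a usable identifier."""
    name = f"{base_name}_in_{unit_to_suffix(unit)}"
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"{name!r} is not a valid Python identifier")
    return name


print(field_name_with_unit("intensity", "W/m^2"))
print(field_name_with_unit("titer", "vg/mL"))
print(field_name_with_unit("emission_lambda", "nm"))
```

Because the suffix loses information (W_m2 could be read more than one way), the docs still need to state the true abbreviation, as proposed above.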

Data format

Should users use a dataset (option 1 above) or an attribute (option 2 above)? Attributes are generally preferred for small metadata that are more properties than measurements, especially scalar values, so I think option 2 is best, but I don't think we settled on this.

An exception is DynamicTable columns that hold these metadata, for example, when parameters of a stimulus like pulse_length_in_ms change across trials/epochs. Columns are datasets. I propose that having an attribute named "unit" with a fixed value on the dataset is optional but recommended.
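For illustration, the two cases could look roughly like this, written as plain Python dicts that mirror the keys of the NWB specification language (name, dtype, doc, attributes, and value for a fixed attribute value). The type name MyOpticalChannel, the dtypes, and the doc strings are hypothetical, and the helper function is only there to read the sketch back.

```python
# Scalar metadata as an attribute, with the unit in the field name.
# "MyOpticalChannel" is a hypothetical extension type.
optical_channel_ext = {
    "neurodata_type_def": "MyOpticalChannel",
    "neurodata_type_inc": "NWBContainer",
    "doc": "An optical channel with explicitly named units.",
    "attributes": [
        {
            "name": "emission_lambda_in_nm",
            "dtype": "float32",
            "doc": "Emission wavelength, in nanometers (nm).",
        },
    ],
}

# A DynamicTable column is a dataset, so it can additionally carry a
# "unit" attribute whose value is fixed in the schema.
pulse_length_column = {
    "name": "pulse_length_in_ms",
    "neurodata_type_inc": "VectorData",
    "dtype": "float32",
    "doc": "Pulse length for each trial, in milliseconds (ms).",
    "attributes": [
        {
            "name": "unit",
            "dtype": "text",
            "value": "ms",  # fixed value; cannot be overridden when writing
            "doc": "Unit of measurement, fixed to 'ms'.",
        },
    ],
}


def fixed_unit(dataset_spec):
    """Return the fixed value of the 'unit' attribute, or None if absent."""
    for attr in dataset_spec.get("attributes", []):
        if attr["name"] == "unit":
            return attr.get("value")
    return None


print(fixed_unit(pulse_length_column))
```

In the column case the unit appears both in the name and in the fixed attribute, so the two must be kept in agreement by the schema author.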

Note: This discussion is related to, but distinct from, the discussion on using a MeasurementData type that has attributes for unit, conversion, offset, and resolution (see #493). That type is designed for measured data from a data acquisition system. The best practices suggested here are mostly for small metadata (usually scalar properties of a device or protocol) where there is no conversion factor, offset, or resolution.

Summary

To summarize, I propose that for extensions, the best practice is that fields that represent metadata whose units should be a fixed value should be schematized as attributes where the unit abbreviation is in the field name, e.g.,

  • emission_lambda_in_nm for OpticalChannel
  • grid_spacing_in_um for ImagingPlane
  • camera_width_in_px for a behavioral video
  • pulse_length_in_ms for stimulation
  • injection_volume_in_mL for viruses
  • laser_power_in_mW for optogenetics
  • core_diameter_in_um for optic fibers
  • ap_in_mm for brain coordinates
  • theta_filter_phase_in_deg for a closed loop feedback protocol
  • titer_in_vg_mL (vg/mL; in context, it should make sense, and the unit will be well described in the docs)
  • intensity_in_W_m2 (W/m^2) for a light source
  • diameter_in_um for spiral scanning stimulation

that the unit is the one widely used by the community, and the abbreviation follows a modified CMIXF-12 convention as described above.

cc @oruebel @bendichter @CodyCBakerPhD @alessandratrapani

Labels: category: proposal, priority: low, topic: docs
