Skip to content

Conversation

@Ariana-B
Copy link
Contributor

@Ariana-B Ariana-B commented Sep 16, 2025

Reason for this pull request

See #2059

Proposed changes

  • Add 'optional' keyword to measurement definitions to enable optional or conditionally optional measurements

  • Rework check_dataset_consistent logic

  • Closes #xxxx

  • Tests added / passed

  • Pull Request Title will make sense in ODC Release Notes


📚 Documentation preview 📚: https://opendatacube--2183.org.readthedocs.build/en/2183/

if opt == "maybe":
cond_measurements.add(name)

if len(req_measurements) != 0 and dataset.measurements is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think dataset.measurements can be None, so the old code (and this) looks wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will be an empty dict at worst.


if len(req_measurements) != 0 and dataset.measurements is None:
return False, "The dataset does not define any of the required measurements."
if not req_measurements.issubset(dataset.measurements.keys()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Morning coffee still hasn't kicked in, but would len(req_measurements) > 0 and not req.measurements.issubset(..) eliminate the need for the if-statement preceding this if-statement?

doc["product"]["name"] = ga_s1_full_product.name

doc2ds = Doc2Dataset(index)
_ds, _err = doc2ds(doc, uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something has changed in Ruff 0.13, but it now considers a leading underscore in the variable name to signal that the variable isn't used (which is consistent with ESLint).

So rename _ds to ds since that is used, and just put a _ instead of _err for the unused component of the tuple. (And same fix for the other used/unused variables below.)

if opt == "maybe":
cond_measurements.add(name)

if len(req_measurements) != 0 and dataset.measurements is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will be an empty dict at worst.

items:
type: string
optional:
enum: ["yes", "no", "maybe"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would optional: maybe mean? Surely a measurement is optional or it isn't?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK as per discussion optional:maybe is intended to cover the case: These are optional BUT AT LEAST ONE (of the "maybe" measurements) must be present.

"maybe" is a very unclear way of expressing this. Maybe "alternate"/"alt"? Or "oneof"?

I think I'd rather not support this use case at all than have an option this unclear.

(i.e. Caitlin's use case can be handled with "optional: yes", the additional logic is not really required.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, "maybe" is a placeholder because I recognise it's really not ideal. "oneof" may be an option, I'd welcome other suggestions as well. I'm not sure how the requirements as described in the original issue can be covered with "optional: yes" though.
@caitlinadams what are your thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried to understand Caitlin's use-case, but my immediate reflection to "oneof" is that it sounds like an option that requires an additional parameter so it knows among which choices there should be one of.

Or was the implicit context that OneOf was supposed to end up above the enum, so it's:

OneOf: 
  - enum["yes", "no"]
  - property1
  - property2

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give some more context, there are two ways I can see us using optional measurements for SAR Backscatter:

Option 1: All polarisations are handled in one product (HH, HV, VV, VH). In this case, I would suggest the following:

  • One of (HH, VV) must be present
  • VH and HV have optional=yes

Option 2: Vertical transmitted polarisations are handled in one product (VV, VH) and horizontal transmitted polarisations are handled in another product (HH, HV). In this case, I would suggest the following:

  • In the vertical product, VV has optional=no, VH has optional=yes
  • In the horizontal product, HH has optional=no, HV has optional=yes

(also, I'm wondering if the language should focus on "required" rather than "optional"? I think it's more intuitive to talk about whether bands are required, rather than whether they are optional, in terms of how the logic might work when it comes to checking if a dataset is valid for indexing).

I think either option should work for Sentinel-1 and NISAR, but I haven't considered other systems. If possible, I would like to see if we can support option 1, since we can always choose option 2 if option 1 is supported. I suspect that option 2 is a lot easier, and also possibly more intuitive for users and data managers, so in the event we can't find a way to support option 1, I don't see a major issue with only supporting optional=yes/no only. @mpaget -- I know you're also interested in optional measurements for SAR, would you mind weighing in on how you'd imagined supporting this for CSIRO's SAR products? Were you thinking option 1 or option 2, or something entirely different?

Regarding option 1, I don't know if there's going to be a clear way to do this using per-band flags. Especially if you try and extend the logic to a case where perhaps there are multiple sets of "one of" bands. I don't think we have an existing scenario that meets this case, but it's worth thinking about as a hypothetical -- what if a satellite always captures red, green, and blue (required), but then captures one of near-infrared or far-infrared, as well as one of two pixel quality bands (let's imagine there are different on-board processing algorithms for pixel quality depending on if you're over the polar regions or not). We can represent whether a dataset is valid using the following logic:
valid_dataset = required_list AND one_of_list_infrared AND one_of_list_pq
valid_dataset = [red AND green AND blue] AND [near-infrared OR far-infrared] AND [pq-polar OR pq-nonpolar]

Under the above scenario, even if we implemented a flag that suggests one-of a set of measurements is required, I think you'd need some other flag to indicate that there are multiple categories that might have one-of requirements. Is this something that it's worth supporting? Do we need to go to a different way of specifying dataset/product level metadata to effectively allow us to specify lists of required and optional bands?

In terms of thinking about our current model, we're just looking for a better word than "maybe" to indicate the "one-of" category. I think "oneof" is ok, but I suspect it will lead to further confusion later if we ever need to support multiple "one-of" categories. Possibly the concept of sticking the "one-of" under the "optional" flag is making the issue more challenging, and we should consider whether there's a way to specify the explicit rules for a product that allows relationships between bands, instead of focussing on band-level properties.

Copy link
Contributor

@SpacemanPaul SpacemanPaul Oct 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi all, I was overseas on leave when this fantastic discussion happened, so adding my thoughts now. I think we basically have a solution here, we just need to pull it all together.

@mpaget 's contribution was the most thorough (and also pretty close to hitting the right solutions) so I'll follow his structure:

  1. Validation and indexing.

a) new flags on product measurements:

I lean towards:
i) a new optional flag "yes" or "no" - (the backwards compatible default), AND
ii) a "requires" field. The requires field is only applicable for optional bands and says that this band can only present if the required band(s) are also present.

More complex logic like "atLeastOneOf" seems unnecessarily difficult to understand and implement - and doesn't actually support any additional cases.

b) Enforcing len(dataset.measurements) >=1:

I think enforcing this rule at indexing time only makes sense if len(product.measurements) >= 1

But a product (with datasets) with no measurements could have the following use cases:

i) An historic use case: telemetry products/datasets - no raster data, no geospatial metadata
ii) A potential future use case: vector products/datasets - geospatial metadata, but no raster data.

  1. dc.load() behaviour.

I lean towards:
a) measurements=None should return all of the required/non-optional measurements. If all measurements are optional, an error should be raised asking user to explicitly specify measurements. This seems to me to be less likely to surprise users.
b) Requesting optional measurements where they don't exist would return a NaN array.

Are we converging on a mutually agreeable solution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SpacemanPaul I'm not sure I understand all details, could you please write out the complete schema addition/change you suggest? Is this schema addition/change sufficient to express what is required to let @mpaget collapse all products into a single novasar_scd product? And is @caitlinadams's use-case a subset of @mpaget's requirements?

I have never used dc.load(), but my guess would be that users want a datacube with all data of a particular polarity. Loading 4x (?) as much data as desired and getting NaN arrays for the data with the wrong polarity doesn't sound like it's necessary what an end-user would expect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding to this discussion, @SpacemanPaul!

I'm well aligned with Matt at this stage in terms of the indexing approach, and think the general model of "if all measurements are optional, then there must be at least one measurement present to be indexed" is the right way to go for the SAR use case. It gives us plenty of flexibility and meets both Matt's and my use cases. I agree that the "atLeastOneOf" logic doesn't help us and is overly complicated.

To your point, @SpacemanPaul, of a product with datasets but no measurements, would it be possible to write a condition for len(product.measurements) == 0, along with a condition for len(product.measurements) >= 1? Or does that not make a lot of sense?

For dc.load(), I think the suggestion of throwing an error if all measurements are optional sounds reasonable, so long as the error message specifies that the user should supply at least one of the available optional measurements (listing them explicitly).

Handling the NaNs if a user makes a request that doesn't align well with the data availability is going to be pretty challenging, I think... I suspect that the only real way to get around this is user education, and building very clear documentation about the behaviour of dc.load() for the given product and measurement combinations (along with building product defintions that align well with how users might interact with the data). Peter is right in articulating that getting a whole bunch of NaN arrays back is likely to be counter-intuitive and unexpected for most users, especially if they're used to working with optical data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for len(product.measurements) >= 1 would work if all the measurements are optional, but if there are any required measurements defined, you'd need to check that the dataset contains at least one optional measurement to achieve that 'one of' behaviour. This is sufficient for the SAR use case where we know that either VV or HH must always be present, but wouldn't be extensible to products that don't have the same requirements, e.g. a product where all optional measurements are fully optional. I don't see any way around including some sort of grouping logic if we want to account for such use cases.

With regards to products with no measurements, I imagine the len(product.measurements) == 0 condition would simply be to check that the dataset does not include any measurements either.

Copy link
Contributor

@SpacemanPaul SpacemanPaul Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had further conversations with Ariana, and this is my current thinking:

  1. Specifying measurement rules in the product document.

Current behaviour: All measurements listed in the product are required to be present in all datasets belonging to that product.

Proposed behaviour:

a) A product measurement MAY contain an optional field, which may be "yes" or "no"

"no" (or the optional field being missing) means that the measurement is required to be present in all datasets belonging to that product.

All datasets must have at least one measurement, even if all measurements are optional.

Note that this is sufficient to support all the use cases discussed so far, but is not sufficient to properly enforce all of the business rules discussed so far.

b) An optional product measurement (optional: yes) MAY also contain a "requires" field, which (if supplied) should contain the canonical name of another optional measurement defined for the product.

E.g.

measurements:
     - name: hh
       optional: yes
       ...
     - name: hv
       optional: yes
       requires: hh
       ...

In this example, both hh and hv are optional, but hv can only be present if hh is present as well.

This prevents illegal combinations of measurements but it only allows enforcement rules like "we must have at least one of hh and vv" in simple cases where all measurements are optional. i.e. it fully supports both the specific cases raised by @mpaget and @caitlinadams - but would fall short if we had similar cases that also had an additional required band (e.g. some sort of qa/pq band, or a telemetry band like azimuth_angle)

c) An optional product measurement (optional: yes) MAY contain a "requirement_group" field, which is an arbitrary string.

At least one measurement in any defined requirement_group must be present. E.g.:

measurements:
     - name: hh
       optional: yes
       requirement_group: same_polarisation
       ...
     - name: hv
       optional: yes
       requires: hh
       ...
     - name: vv
       optional: yes
       requirement_group: same_polarisation
       ...
     - name: vh
       optional: yes
       requires: vv
       ...

This says:

  • vh requires vv, and
  • hv requires hh, and
  • at least one of vv and hh

This could also be achieved with Ariana's suggestion of atLeastOneOf above. But as the measurements section is a list of measurement definitions, it would need to go in another (possibly new) section of the product document. I therefore prefer the approach of specifying requirement groups at the individual measurement level.

Either approach also raises the following question: What does having both requires and requirement_group mean? Is this simply invalid, or does it provide a way of specifying e.g. "If superband is supplied, must also have at least one of subband-a and subband-b?

We have no clear requirement for the latter, so suggest the former for now.

  1. dc.load() behaviour

a) Currently: If a list of measurements is supplied, the xr.Dataset contains all requested measurements.

Proposed: No change.

b) Currently: measurements=None causes all defined measurements to be returned in the xr.Dataset.

Proposed: measurements=None causes all required measurements to be returned in the xr.Dataset. If there are NO required measurements, an error is raised - user MUST explicitly specify which measurements they want.

c) Currently: A load request that has no matching datasets returns an empty Dataset (no bands)

Proposed:

  • A load request that has no matching datasets returns an empty Dataset (no bands)
  • A load request that has no matching datasets with the requested bands returns an empty Dataset (no bands)

That much is clear I think. I'm less sure about the case where some requested bands are available in the matching datasets, but some requested bands are not available in the matching datasets. Do we return all requested bands if we have data for any of them? Or do we suppress bands for which no data is available? The former seems preferable to me, both in terms of ease of implementation and principle of least surprise.

d) Should we be indexing which optional bands are available, so this can be checked at the database level? (IMO yes, but not necessarily in this PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants