-
Notifications
You must be signed in to change notification settings - Fork 185
Support for optional measurements #2183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
| if opt == "maybe": | ||
| cond_measurements.add(name) | ||
|
|
||
| if len(req_measurements) != 0 and dataset.measurements is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think dataset.measurements can be None, so the old code (and this) looks wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will be an empty dict at worst.
|
|
||
| if len(req_measurements) != 0 and dataset.measurements is None: | ||
| return False, "The dataset does not define any of the required measurements." | ||
| if not req_measurements.issubset(dataset.measurements.keys()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Morning coffee still hasn't kicked in, but would len(req_measurements) > 0 and not req.measurements.issubset(..) eliminate the need for the if-statement preceding this if-statement?
| doc["product"]["name"] = ga_s1_full_product.name | ||
|
|
||
| doc2ds = Doc2Dataset(index) | ||
| _ds, _err = doc2ds(doc, uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something has changed in Ruff 0.13, but it now considers a leading underscore in the variable name to signal that the variable isn't used (which is consistent with ESLint).
So rename _ds to ds since that is used, and just put a _ instead of _err for the unused component of the tuple. (And same fix for the other used/unused variables below.)
| if opt == "maybe": | ||
| cond_measurements.add(name) | ||
|
|
||
| if len(req_measurements) != 0 and dataset.measurements is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will be an empty dict at worst.
| items: | ||
| type: string | ||
| optional: | ||
| enum: ["yes", "no", "maybe"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would optional: maybe mean? Surely a measurement is optional or it isn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK as per discussion optional:maybe is intended to cover the case: These are optional BUT AT LEAST ONE (of the "maybe" measurements) must be present.
"maybe" is a very unclear way of expressing this. Maybe "alternate"/"alt"? Or "oneof"?
I think I'd rather not support this use case at all than have an option this unclear.
(i.e. Caitlin's use case can be handled with "optional: yes", the additional logic is not really required.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, "maybe" is a placeholder because I recognise it's really not ideal. "oneof" may be an option, I'd welcome other suggestions as well. I'm not sure how the requirements as described in the original issue can be covered with "optional: yes" though.
@caitlinadams what are your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tried to understand Caitlin's use-case, but my immediate reflection to "oneof" is that it sounds like an option that requires an additional parameter so it knows among which choices there should be one of.
Or was the implicit context that OneOf was supposed to end up above the enum, so it's:
OneOf:
- enum["yes", "no"]
- property1
- property2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To give some more context, there are two ways I can see us using optional measurements for SAR Backscatter:
Option 1: All polarisations are handled in one product (HH, HV, VV, VH). In this case, I would suggest the following:
- One of (HH, VV) must be present
- VH and HV have optional=yes
Option 2: Vertical transmitted polarisations are handled in one product (VV, VH) and horizontal transmitted polarisations are handled in another product (HH, HV). In this case, I would suggest the following:
- In the vertical product, VV has optional=no, VH has optional=yes
- In the horizontal product, HH has optional=no, HV has optional=yes
(also, I'm wondering if the language should focus on "required" rather than "optional"? I think it's more intuitive to talk about whether bands are required, rather than whether they are optional, in terms of how the logic might work when it comes to checking if a dataset is valid for indexing).
I think either option should work for Sentinel-1 and NISAR, but I haven't considered other systems. If possible, I would like to see if we can support option 1, since we can always choose option 2 if option 1 is supported. I suspect that option 2 is a lot easier, and also possibly more intuitive for users and data managers, so in the event we can't find a way to support option 1, I don't see a major issue with only supporting optional=yes/no only. @mpaget -- I know you're also interested in optional measurements for SAR, would you mind weighing in on how you'd imagined supporting this for CSIRO's SAR products? Were you thinking option 1 or option 2, or something entirely different?
Regarding option 1, I don't know if there's going to be a clear way to do this using per-band flags. Especially if you try and extend the logic to a case where perhaps there are multiple sets of "one of" bands. I don't think we have an existing scenario that meets this case, but it's worth thinking about as a hypothetical -- what if a satellite always captures red, green, and blue (required), but then captures one of near-infrared or far-infrared, as well as one of two pixel quality bands (let's imagine there are different on-board processing algorithms for pixel quality depending on if you're over the polar regions or not). We can represent whether a dataset is valid using the following logic:
valid_dataset = required_list AND one_of_list_infrared AND one_of_list_pq
valid_dataset = [red AND green AND blue] AND [near-infrared OR far-infrared] AND [pq-polar OR pq-nonpolar]
Under the above scenario, even if we implemented a flag that suggests one-of a set of measurements is required, I think you'd need some other flag to indicate that there are multiple categories that might have one-of requirements. Is this something that it's worth supporting? Do we need to go to a different way of specifying dataset/product level metadata to effectively allow us to specify lists of required and optional bands?
In terms of thinking about our current model, we're just looking for a better word than "maybe" to indicate the "one-of" category. I think "oneof" is ok, but I suspect it will lead to further confusion later if we ever need to support multiple "one-of" categories. Possibly the concept of sticking the "one-of" under the "optional" flag is making the issue more challenging, and we should consider whether there's a way to specify the explicit rules for a product that allows relationships between bands, instead of focussing on band-level properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi all, I was overseas on leave when this fantastic discussion happened, so adding my thoughts now. I think we basically have a solution here, we just need to pull it all together.
@mpaget 's contribution was the most thorough (and also pretty close to hitting the right solutions) so I'll follow his structure:
- Validation and indexing.
a) new flags on product measurements:
I lean towards:
i) a new optional flag "yes" or "no" - (the backwards compatible default), AND
ii) a "requires" field. The requires field is only applicable for optional bands and says that this band can only present if the required band(s) are also present.
More complex logic like "atLeastOneOf" seems unnecessarily difficult to understand and implement - and doesn't actually support any additional cases.
b) Enforcing len(dataset.measurements) >=1:
I think enforcing this rule at indexing time only makes sense if len(product.measurements) >= 1
But a product (with datasets) with no measurements could have the following use cases:
i) An historic use case: telemetry products/datasets - no raster data, no geospatial metadata
ii) A potential future use case: vector products/datasets - geospatial metadata, but no raster data.
- dc.load() behaviour.
I lean towards:
a) measurements=None should return all of the required/non-optional measurements. If all measurements are optional, an error should be raised asking user to explicitly specify measurements. This seems to me to be less likely to surprise users.
b) Requesting optional measurements where they don't exist would return a NaN array.
Are we converging on a mutually agreeable solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SpacemanPaul I'm not sure I understand all details, could you please write out the complete schema addition/change you suggest? Is this schema addition/change sufficient to express what is required to let @mpaget collapse all products into a single novasar_scd product? And is @caitlinadams's use-case a subset of @mpaget's requirements?
I have never used dc.load(), but my guess would be that users want a datacube with all data of a particular polarity. Loading 4x (?) as much data as desired and getting NaN arrays for the data with the wrong polarity doesn't sound like it's necessary what an end-user would expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding to this discussion, @SpacemanPaul!
I'm well aligned with Matt at this stage in terms of the indexing approach, and think the general model of "if all measurements are optional, then there must be at least one measurement present to be indexed" is the right way to go for the SAR use case. It gives us plenty of flexibility and meets both Matt's and my use cases. I agree that the "atLeastOneOf" logic doesn't help us and is overly complicated.
To your point, @SpacemanPaul, of a product with datasets but no measurements, would it be possible to write a condition for len(product.measurements) == 0, along with a condition for len(product.measurements) >= 1? Or does that not make a lot of sense?
For dc.load(), I think the suggestion of throwing an error if all measurements are optional sounds reasonable, so long as the error message specifies that the user should supply at least one of the available optional measurements (listing them explicitly).
Handling the NaNs if a user makes a request that doesn't align well with the data availability is going to be pretty challenging, I think... I suspect that the only real way to get around this is user education, and building very clear documentation about the behaviour of dc.load() for the given product and measurement combinations (along with building product defintions that align well with how users might interact with the data). Peter is right in articulating that getting a whole bunch of NaN arrays back is likely to be counter-intuitive and unexpected for most users, especially if they're used to working with optical data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking for len(product.measurements) >= 1 would work if all the measurements are optional, but if there are any required measurements defined, you'd need to check that the dataset contains at least one optional measurement to achieve that 'one of' behaviour. This is sufficient for the SAR use case where we know that either VV or HH must always be present, but wouldn't be extensible to products that don't have the same requirements, e.g. a product where all optional measurements are fully optional. I don't see any way around including some sort of grouping logic if we want to account for such use cases.
With regards to products with no measurements, I imagine the len(product.measurements) == 0 condition would simply be to check that the dataset does not include any measurements either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had further conversations with Ariana, and this is my current thinking:
- Specifying measurement rules in the product document.
Current behaviour: All measurements listed in the product are required to be present in all datasets belonging to that product.
Proposed behaviour:
a) A product measurement MAY contain an optional field, which may be "yes" or "no"
"no" (or the optional field being missing) means that the measurement is required to be present in all datasets belonging to that product.
All datasets must have at least one measurement, even if all measurements are optional.
Note that this is sufficient to support all the use cases discussed so far, but is not sufficient to properly enforce all of the business rules discussed so far.
b) An optional product measurement (optional: yes) MAY also contain a "requires" field, which (if supplied) should contain the canonical name of another optional measurement defined for the product.
E.g.
measurements:
- name: hh
optional: yes
...
- name: hv
optional: yes
requires: hh
...
In this example, both hh and hv are optional, but hv can only be present if hh is present as well.
This prevents illegal combinations of measurements but it only allows enforcement rules like "we must have at least one of hh and vv" in simple cases where all measurements are optional. i.e. it fully supports both the specific cases raised by @mpaget and @caitlinadams - but would fall short if we had similar cases that also had an additional required band (e.g. some sort of qa/pq band, or a telemetry band like azimuth_angle)
c) An optional product measurement (optional: yes) MAY contain a "requirement_group" field, which is an arbitrary string.
At least one measurement in any defined requirement_group must be present. E.g.:
measurements:
- name: hh
optional: yes
requirement_group: same_polarisation
...
- name: hv
optional: yes
requires: hh
...
- name: vv
optional: yes
requirement_group: same_polarisation
...
- name: vh
optional: yes
requires: vv
...
This says:
vhrequiresvv, andhvrequireshh, and- at least one of
vvandhh
This could also be achieved with Ariana's suggestion of atLeastOneOf above. But as the measurements section is a list of measurement definitions, it would need to go in another (possibly new) section of the product document. I therefore prefer the approach of specifying requirement groups at the individual measurement level.
Either approach also raises the following question: What does having both requires and requirement_group mean? Is this simply invalid, or does it provide a way of specifying e.g. "If superband is supplied, must also have at least one of subband-a and subband-b?
We have no clear requirement for the latter, so suggest the former for now.
- dc.load() behaviour
a) Currently: If a list of measurements is supplied, the xr.Dataset contains all requested measurements.
Proposed: No change.
b) Currently: measurements=None causes all defined measurements to be returned in the xr.Dataset.
Proposed: measurements=None causes all required measurements to be returned in the xr.Dataset. If there are NO required measurements, an error is raised - user MUST explicitly specify which measurements they want.
c) Currently: A load request that has no matching datasets returns an empty Dataset (no bands)
Proposed:
- A load request that has no matching datasets returns an empty
Dataset(no bands) - A load request that has no matching datasets with the requested bands returns an empty
Dataset(no bands)
That much is clear I think. I'm less sure about the case where some requested bands are available in the matching datasets, but some requested bands are not available in the matching datasets. Do we return all requested bands if we have data for any of them? Or do we suppress bands for which no data is available? The former seems preferable to me, both in terms of ease of implementation and principle of least surprise.
d) Should we be indexing which optional bands are available, so this can be checked at the database level? (IMO yes, but not necessarily in this PR)
Reason for this pull request
See #2059
Proposed changes
Add 'optional' keyword to measurement definitions to enable optional or conditionally optional measurements
Rework
check_dataset_consistentlogicCloses #xxxx
Tests added / passed
Pull Request Title will make sense in ODC Release Notes
📚 Documentation preview 📚: https://opendatacube--2183.org.readthedocs.build/en/2183/