-
Notifications
You must be signed in to change notification settings - Fork 187
Support for optional measurements #2183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -116,27 +116,35 @@ def check_dataset_consistent(dataset: Dataset) -> tuple[bool, str | None]: | |
| """ | ||
| :return: (Is consistent, [error message|None]) | ||
| """ | ||
| product_measurements = set(dataset.product.measurements.keys()) | ||
| product_measurements = dataset.product.measurements | ||
|
|
||
| if len(product_measurements) == 0: | ||
| return True, None | ||
|
|
||
| if dataset.measurements is None: | ||
| return False, "No measurements defined for a dataset" | ||
|
|
||
| # It the type expects measurements, ensure our dataset contains them all. | ||
| if not product_measurements.issubset(dataset.measurements.keys()): | ||
| # Exclude 3D measurements since it's just a mapping to 2D measurements | ||
| not_measured = { | ||
| m | ||
| for m in product_measurements - set(dataset.measurements.keys()) | ||
| if "extra_dim" not in dataset.product.measurements.get(m, []) | ||
| } | ||
|
|
||
| if not_measured: | ||
| msg = "The dataset is not specifying all of the measurements in this product.\n" | ||
| msg += "Missing fields are;\n" + str(not_measured) | ||
| return False, msg | ||
| req_measurements, cond_measurements = set(), set() | ||
| for name, measurement in product_measurements.items(): | ||
| if ( | ||
| "extra_dim" in measurement | ||
| ): # Exclude 3D measurements since it's just a mapping to 2D measurements | ||
| continue | ||
| if (opt := measurement.get("optional", "no")) == "no": | ||
| req_measurements.add(name) | ||
| if opt == "maybe": | ||
| cond_measurements.add(name) | ||
|
|
||
| if len(req_measurements) != 0 and dataset.measurements is None: | ||
| return False, "The dataset does not define any of the required measurements." | ||
| if not req_measurements.issubset(dataset.measurements.keys()): | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Morning coffee still hasn't kicked in, but would |
||
| not_measured = req_measurements - set(dataset.measurements.keys()) | ||
| return ( | ||
| False, | ||
| f"The dataset does not specify the following required measurements: {', '.join(not_measured)}", | ||
| ) | ||
| if not any(m in dataset.measurements for m in cond_measurements): | ||
| return ( | ||
| False, | ||
| f"The dataset must define at least one of the following measurements: {', '.join(cond_measurements)}", | ||
| ) | ||
|
|
||
| return True, None | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -145,6 +145,8 @@ definitions: | |
| description: Captures information about extra dimensions, e.g. `[y, x, wavelength]` | ||
| items: | ||
| type: string | ||
| optional: | ||
| enum: ["yes", "no", "maybe"] | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What would
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK as per discussion "maybe" is a very unclear way of expressing this. Maybe "alternate"/"alt"? Or "oneof"? I think I'd rather not support this use case at all than have an option this unclear. (i.e. Caitlin's use case can be handled with "optional: yes", the additional logic is not really required.)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, "maybe" is a placeholder because I recognise it's really not ideal. "oneof" may be an option, I'd welcome other suggestions as well. I'm not sure how the requirements as described in the original issue can be covered with "optional: yes" though.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I haven't tried to understand Caitlin's use-case, but my immediate reflection to "oneof" is that it sounds like an option that requires an additional parameter so it knows among which choices there should be one of. Or was the implicit context that ?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To give some more context, there are two ways I can see us using optional measurements for SAR Backscatter: Option 1: All polarisations are handled in one product (HH, HV, VV, VH). In this case, I would suggest the following:
Option 2: Vertical transmitted polarisations are handled in one product (VV, VH) and horizontal transmitted polarisations are handled in another product (HH, HV). In this case, I would suggest the following:
(also, I'm wondering if the language should focus on "required" rather than "optional"? I think it's more intuitive to talk about whether bands are required, rather than whether they are optional, in terms of how the logic might work when it comes to checking if a dataset is valid for indexing). I think either option should work for Sentinel-1 and NISAR, but I haven't considered other systems. If possible, I would like to see if we can support option 1, since we can always choose option 2 if option 1 is supported. I suspect that option 2 is a lot easier, and also possibly more intuitive for users and data managers, so in the event we can't find a way to support option 1, I don't see a major issue with only supporting optional=yes/no only. @mpaget -- I know you're also interested in optional measurements for SAR, would you mind weighing in on how you'd imagined supporting this for CSIRO's SAR products? Were you thinking option 1 or option 2, or something entirely different? Regarding option 1, I don't know if there's going to be a clear way to do this using per-band flags. Especially if you try and extend the logic to a case where perhaps there are multiple sets of "one of" bands. I don't think we have an existing scenario that meets this case, but it's worth thinking about as a hypothetical -- what if a satellite always captures red, green, and blue (required), but then captures one of near-infrared or far-infrared, as well as one of two pixel quality bands (let's imagine there are different on-board processing algorithms for pixel quality depending on if you're over the polar regions or not). We can represent whether a dataset is valid using the following logic: Under the above scenario, even if we implemented a flag that suggests one-of a set of measurements is required, I think you'd need some other flag to indicate that there are multiple categories that might have one-of requirements. Is this something that it's worth supporting? Do we need to go to a different way of specifying dataset/product level metadata to effectively allow us to specify lists of required and optional bands? In terms of thinking about our current model, we're just looking for a better word than "maybe" to indicate the "one-of" category. I think "oneof" is ok, but I suspect it will lead to further confusion later if we ever need to support multiple "one-of" categories. Possibly the concept of sticking the "one-of" under the "optional" flag is making the issue more challenging, and we should consider whether there's a way to specify the explicit rules for a product that allows relationships between bands, instead of focussing on band-level properties.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @SpacemanPaul I'm not sure I understand all details, could you please write out the complete schema addition/change you suggest? Is this schema addition/change sufficient to express what is required to let @mpaget collapse all products into a single novasar_scd product? And is @caitlinadams's use-case a subset of @mpaget's requirements? I have never used dc.load(), but my guess would be that users want a datacube with all data of a particular polarity. Loading 4x (?) as much data as desired and getting NaN arrays for the data with the wrong polarity doesn't sound like it's necessary what an end-user would expect.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for adding to this discussion, @SpacemanPaul! I'm well aligned with Matt at this stage in terms of the indexing approach, and think the general model of "if all measurements are optional, then there must be at least one measurement present to be indexed" is the right way to go for the SAR use case. It gives us plenty of flexibility and meets both Matt's and my use cases. I agree that the "atLeastOneOf" logic doesn't help us and is overly complicated. To your point, @SpacemanPaul, of a product with datasets but no measurements, would it be possible to write a condition for For Handling the NaNs if a user makes a request that doesn't align well with the data availability is going to be pretty challenging, I think... I suspect that the only real way to get around this is user education, and building very clear documentation about the behaviour of
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Checking for With regards to products with no measurements, I imagine the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've had further conversations with Ariana, and this is my current thinking:
Current behaviour: All measurements listed in the product are required to be present in all datasets belonging to that product. Proposed behaviour: a) A product measurement MAY contain an
All datasets must have at least one measurement, even if all measurements are optional. Note that this is sufficient to support all the use cases discussed so far, but is not sufficient to properly enforce all of the business rules discussed so far. b) An optional product measurement ( E.g. In this example, both This prevents illegal combinations of measurements but it only allows enforcement rules like "we must have at least one of c) An optional product measurement ( At least one measurement in any defined This says:
This could also be achieved with Ariana's suggestion of Either approach also raises the following question: What does having both We have no clear requirement for the latter, so suggest the former for now.
a) Currently: If a list of measurements is supplied, the Proposed: No change. b) Currently: Proposed: c) Currently: A load request that has no matching datasets returns an empty Proposed:
That much is clear I think. I'm less sure about the case where some requested bands are available in the matching datasets, but some requested bands are not available in the matching datasets. Do we return all requested bands if we have data for any of them? Or do we suppress bands for which no data is available? The former seems preferable to me, both in terms of ease of implementation and principle of least surprise. d) Should we be indexing which optional bands are available, so this can be checked at the database level? (IMO yes, but not necessarily in this PR)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking at 1b and 1c, I feel that there needs to be robust checking so typos are caught and get rejected by a failure when trying to add the product. This is not inherently difficult, but it's always a bit tedious to implement the dull error checking/reporting parts of the functionality. I might be dense but I don't understand the second half of 2c, can you give an example for each of the two possibilities? |
||
|
|
||
| required: | ||
| - name | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| name: ga_s1_full | ||
| description: > | ||
| Radiometric Terrain Corrected (RTC) Backscatter for Sentinel-1 generated from the COMPASS-ISCE3 pipeline. | ||
| This product contains scenes with all combinations of measurements (VV, VH, HH, HV). | ||
| This product uses the gamma_0 linear backscatter measurement convention. | ||
| metadata_type: eo3_s1_ard | ||
|
|
||
| metadata: | ||
| product: | ||
| name: ga_s1_full | ||
|
|
||
| measurements: | ||
| - name: VV | ||
| aliases: | ||
| - vv | ||
| dtype: float32 | ||
| nodata: NaN | ||
| units: '1' | ||
| optional: maybe | ||
| - name: VH | ||
| aliases: | ||
| - vh | ||
| dtype: float32 | ||
| nodata: NaN | ||
| units: '1' | ||
| optional: yes | ||
| - name: HH | ||
| aliases: | ||
| - hh | ||
| dtype: float32 | ||
| nodata: NaN | ||
| units: '1' | ||
| optional: maybe | ||
| - name: HV | ||
| aliases: | ||
| - hv | ||
| dtype: float32 | ||
| nodata: NaN | ||
| units: '1' | ||
| optional: yes | ||
| - name: mask | ||
| dtype: uint8 | ||
| nodata: 255 | ||
| units: '1' | ||
| flags_definition: | ||
| mask: | ||
| bits: [0,1,2,3,4,5,6,7] | ||
| values: | ||
| 1: shadow | ||
| 2: layover | ||
| 3: shadow and layover | ||
| description: shadow layover data mask |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -218,6 +218,34 @@ def test_dataset_add_not_eo3(index: Index, ls8_eo3_product, eo3_wo_dataset_doc) | |
| assert isinstance(_err, BadMatch) | ||
|
|
||
|
|
||
| def test_dataset_opt_measurements( | ||
| index: Index, s1_dataset_doc, ga_s1_full_product | ||
| ) -> None: | ||
| doc, uri = s1_dataset_doc | ||
| doc["product"]["name"] = ga_s1_full_product.name | ||
|
|
||
| doc2ds = Doc2Dataset(index) | ||
| _ds, _err = doc2ds(doc, uri) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think something has changed in Ruff 0.13, but it now considers a leading underscore in the variable name to signal that the variable isn't used (which is consistent with ESLint). So rename |
||
| assert _ds is not None | ||
| # remove optional measurement | ||
| doc["measurements"].pop("VH") | ||
| _ds, _err = doc2ds(doc, uri) | ||
| assert _ds is not None | ||
| # remove conditional measurement | ||
| vv = doc["measurements"].pop("VV") | ||
| _ds, _err = doc2ds(doc, uri) | ||
| assert "The dataset must define at least one of the following measurements" in _err | ||
| assert "VV" in _err # order of these two isn't consistent | ||
| assert "HH" in _err | ||
| # remove required measurement | ||
| doc["measurements"]["VV"] = vv | ||
| doc["measurements"].pop("mask") | ||
| _ds, _err = doc2ds(doc, uri) | ||
| assert ( | ||
| _err == "The dataset does not specify the following required measurements: mask" | ||
| ) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("datacube_env_name", ("datacube", "datacube3")) | ||
| def test_dataset_eo3_no_schema( | ||
| dataset_add_configs, index_empty, clirunner, caplog | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think
dataset.measurementscan beNone, so the old code (and this) looks wrong.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will be an empty dict at worst.