Skip to content

Conversation

max-sixty
Copy link
Collaborator

@max-sixty max-sixty commented Sep 9, 2025

as discussed in #3891

Done with lots of help from Claude Code:


This change modifies xarray's default behavior to preserve attributes (attrs) across all operations, including computational, binary, and data manipulation functions. Previously, attributes were dropped by default unless keep_attrs=True was explicitly set. This new default aligns xarray with common scientific workflows where metadata preservation is crucial.

The keep_attrs option now defaults to True for most operations. For binary operations, attributes are preserved from the left-hand operand.

This is a breaking change

@github-actions github-actions bot added the topic-DataTree Related to the implementation of a DataTree class label Sep 9, 2025
@keewis
Copy link
Collaborator

keewis commented Sep 9, 2025

This is indeed a breaking change, which might require a deprecation cycle. Not sure, though, we could also decide that the benefits outweigh the disruption.

For binary operations, attributes are preserved from the left-hand operand.

should binary operations support combine_attrs strategy names / a function? apply_ufunc already supports that for keep_attrs (added in #5041), so it wouldn't be anything new for functions using it.

(doesn't have to be this PR, though)

@max-sixty
Copy link
Collaborator Author

This is indeed a breaking change, which might require a deprecation cycle. Not sure, though, we could also decide that the benefits outweigh the disruption.

it's difficult to do a deprecation cycle here — we would need to do something like "give a warning anytime we run an operation that drops an attrs" — could be quite noisy. some discussion on this in #3891. not impossible though...

I think it would probably be OK to start propagating more attrs by default as a breaking change. There's no easy way to roll this out incrementally, and I doubt too many users are relying upon metadata disappearing when they do xarray operations, given the somewhat inconsistent state of the current rules.

BREAKING CHANGE: Change keep_attrs default from False to True

This changes the default behavior of xarray operations to preserve
attributes by default, which better aligns with user expectations
and scientific workflows where metadata preservation is critical.

Migration guide:
- To restore previous behavior globally: xr.set_options(keep_attrs=False)
- To restore for specific operations: use keep_attrs=False parameter
- Alternative: use .drop_attrs() method after operations

Closes pydata#3891, pydata#4510, pydata#9920
max-sixty added a commit to max-sixty/xarray that referenced this pull request Sep 10, 2025
Integrated coordinate preservation feature from main with
our keep_attrs changes.
…alse

The merge incorrectly preserved coordinate attributes even when
keep_attrs=False. Now coordinates have their attrs cleared when
keep_attrs=False, consistent with data variables.
- When keep_attrs=True: restore attrs from original coords (func may have dropped them)
- When keep_attrs=False: clear all attrs
- More efficient than previous implementation
Group attribute operations by keep_attrs value for cleaner,
more readable code with identical functionality.
Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks on the right track, though Claude seems to be inserting a lot of specific comments about "now" that will get stale very quickly and should be removed.

max-sixty and others added 4 commits September 12, 2025 14:36
Per Stefan's review, remove 'now' from comments that describe behavior
changes, as these become stale over time. Replace with timeless
descriptions that simply state the current behavior.
- Remove PR-specific comment from variable.py that only makes sense in context
- Remove redundant comment from test_computation.py
- Clarify comment in test_dataarray.py about argmin preserving attrs
@max-sixty
Copy link
Collaborator Author

Generally looks on the right track, though Claude seems to be inserting a lot of specific comments about "now" that will get stale very quickly and should be removed.

yes agree, thanks for checking

now removed!


is everyone OK to move forward? please raise if not!

@max-sixty max-sixty added plan to merge Final call for comments and removed topic-DataTree Related to the implementation of a DataTree class labels Sep 12, 2025
@shoyer
Copy link
Member

shoyer commented Sep 12, 2025

I wonder if a better default would be combine_attrs="drop_conflicts" from xarray.merge? Currently keep_attrs=True means combine_attrs="override".

"drop_conflicts" would resolve @Illviljan's concerns about binary arithmetic with mismatched units, which was likely part of the original motivation for the default keep_attrs=False.

The main limitation of drop_conflicts is that it assumes that every attribute value can be cheaply compared for equality, which may not be true (e.g., if someone puts a dask.array in an attrs). But I suppose we can have a back-up policy (either "drop" or "override") for rare cases where __eq__ does not return a boolean.

@shoyer shoyer added needs discussion and removed plan to merge Final call for comments labels Sep 13, 2025
Instead of only preserving the left operand's attributes, binary operations
now combine attributes from both operands using the drop_conflicts strategy:
- Matching attributes (same key, same value) are kept
- Conflicting attributes (same key, different values) are dropped
- Non-conflicting attributes from both operands are preserved

This provides more intuitive behavior when combining data with partially
overlapping metadata.
@github-actions github-actions bot added the topic-DataTree Related to the implementation of a DataTree class label Sep 13, 2025
max-sixty and others added 2 commits September 13, 2025 16:28
For backward compatibility, when one operand has no attributes (None),
keep the left operand's attributes instead of merging. This maintains
the existing test expectations while still providing drop_conflicts
behavior when both operands have attributes.
When one operand has no attributes (either None or empty dict), the
result should have no attributes. This maintains backward compatibility
and fixes test_1d_math failures.

The issue was compounded by a side effect in the attrs property getter
that mutates _attrs from None to {} when accessed.
max-sixty added a commit to max-sixty/xarray that referenced this pull request Sep 14, 2025
When merging attrs with drop_conflicts, objects whose __eq__ returns
non-bool values (e.g., numpy arrays) would raise ValueError. Now these
are caught and treated as non-equivalent, causing the attribute to be
dropped rather than raising an error.

Fixes regression from pydata#10726
Use attrs property directly instead of checking _attrs, since the
property normalizes None to {}. This simplifies the logic while
maintaining the same behavior.
Variable uses None for no attrs, Dataset uses {} for no attrs.
Updated comments to make this distinction clear.
Previously we were dropping all attrs if either operand had no attrs.
Now we properly merge attrs and only drop conflicting ones, which is
what drop_conflicts should do:
- If one has {"a": 1} and other has {}, result is {"a": 1}
- If one has {"a": 1} and other has {"a": 2}, result is {}
- If one has {"a": 1} and other has {"b": 2}, result is {"a": 1, "b": 2}

Updated tests to reflect this correct behavior.
Variable constructor already normalizes empty dict to None internally,
so the explicit conversion is redundant.
@max-sixty
Copy link
Collaborator Author

max-sixty commented Sep 14, 2025

The main limitation of drop_conflicts is that it assumes that every attribute value can be cheaply compared for equality, which may not be true (e.g., if someone puts a dask.array in an attrs). But I suppose we can have a back-up policy (either "drop" or "override") for rare cases where __eq__ does not return a boolean.

great, #10741 for that change!

edit: and merged

max-sixty added a commit that referenced this pull request Sep 16, 2025
* Fix drop_conflicts to handle non-bool returns from __eq__

When merging attrs with drop_conflicts, objects whose __eq__ returns
non-bool values (e.g., numpy arrays) would raise ValueError. Now these
are caught and treated as non-equivalent, causing the attribute to be
dropped rather than raising an error.

Fixes regression from #10726

* Add changelog entry

* Suppress DeprecationWarning for ambiguous array truth values

The CI was failing because NumPy raises a DeprecationWarning before
ValueError when evaluating ambiguous truth values. Since we're properly
handling the ValueError, we suppress the warning in this specific context.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move DeprecationWarning suppression to equivalent function

Rather than suppressing the warning in merge_attrs, handle it at the
source in equivalent() where the 'or' operation happens. This is cleaner
and includes a clear comment about when the suppression can be removed
(when minimum numpy version >= 2.0).

The warning only occurs in numpy < 2.0; numpy 2.0+ raises ValueError
directly, which we already handle properly in merge_attrs.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move DeprecationWarning suppression from library to test

The warning only occurs in our specific test case with custom objects that
return numpy arrays from __eq__. This is a very edge case scenario, so it's
more appropriate to suppress the warning in the test rather than in the
library code.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Simplify drop_conflicts implementation

Based on code review, simplified the logic to:
- Process attributes in a single loop instead of nested operations
- Eliminate intermediate dictionary creation
- Make the control flow clearer and more efficient

* Update whats-new.rst

Co-authored-by: Justus Magin <[email protected]>

* Update whats-new.rst

Co-authored-by: Justus Magin <[email protected]>

* Implement equivalent_attrs function as suggested by @keewis

- Create dedicated equivalent_attrs function that wraps equivalent()
- Returns False when ValueError is raised (e.g., for numpy arrays)
- Simplifies merge_attrs logic by encapsulating error handling
- Makes future behavior changes easier

* Add pathological test cases suggested by @shoyer

- Test NumPy object arrays with nested arrays
- Test xarray.Dataset objects as attributes
- Test Pandas Series as attributes
- Update equivalent_attrs to catch TypeError in addition to ValueError
- Ensure equivalent_attrs always returns a boolean (not Dataset/array-like)

These cases all properly get dropped when they can't be reliably compared.

* Use truthiness in equivalent_attrs for consistency with Python

- Choose truthiness over strict boolean checking for consistency with
  Python's standard 'if a == b:' behavior
- Accept edge cases with non-standard __eq__ methods as a deliberate tradeoff
- Add comprehensive tests documenting the behavior and edge cases
- Add clear documentation explaining the design choice and its implications

This approach is more permissive and works better with numpy scalars and
similar types, while still catching truly ambiguous cases (ValueError/TypeError).

* Implement @shoyer's stricter equivalent() function

- Rewrite equivalent() to reject non-boolean comparison results
- Accept numpy bool scalars (np.bool_) but reject other non-bool types
- Simplify equivalent_attrs() since equivalent() now handles non-bool cases
- Update tests to reflect stricter behavior with non-standard __eq__ methods

This makes comparisons more predictable by rejecting ambiguous cases like
Dataset comparisons, custom objects with weird __eq__, etc. The tradeoff is
being less permissive than Python's standard 'if a == b:' behavior.

* Address @shoyer's review comments

- Fix comment to say 'nested numpy arrays' instead of 'numpy arrays'
- Use del instead of pop() to avoid creating unused return value
- Split long test methods into smaller, focused test methods
- Move all imports (warnings, pandas) to top of file

Co-authored-by: Claude <[email protected]>

* Address @shoyer's review comments

- Split long test methods into smaller, focused test cases:
  - test_merge_attrs_drop_conflicts_ambiguous_array_returns split to add test_merge_attrs_drop_conflicts_all_true_array
  - test_merge_attrs_drop_conflicts_non_boolean_eq_returns split into test_merge_attrs_drop_conflicts_eq_returns_string and test_merge_attrs_drop_conflicts_eq_returns_number
- Ran formatter which removed a blank line

Co-authored-by: Claude <[email protected]>

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Magin <[email protected]>
Co-authored-by: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs discussion topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keep attrs by default? (keep_attrs)
4 participants