Skip to content

Conversation

max-sixty
Copy link
Collaborator

@max-sixty max-sixty commented Sep 14, 2025

From Claude:


When merging attributes with combine_attrs="drop_conflicts", objects whose __eq__ method returns non-bool values (e.g., numpy arrays) would raise ValueError due to ambiguous truth values. This PR fixes the issue by catching the ValueError and treating such comparisons as non-equivalent, causing the attribute to be dropped.

This fixes a regression introduced in #10726 where the merge logic started using direct equality comparisons that can fail for certain attribute types.

What's changed:

  • Added try/except around equivalent() call in merge_attrs for the drop_conflicts case
  • Added comprehensive test coverage for various non-bool __eq__ scenarios

Test coverage includes:

  • Objects returning single-element numpy arrays (unambiguous)
  • Objects returning empty numpy arrays (ambiguous)
  • Objects returning multi-element numpy arrays (ambiguous)
  • Regular numpy arrays as attribute values

max-sixty and others added 9 commits September 13, 2025 17:17
When merging attrs with drop_conflicts, objects whose __eq__ returns
non-bool values (e.g., numpy arrays) would raise ValueError. Now these
are caught and treated as non-equivalent, causing the attribute to be
dropped rather than raising an error.

Fixes regression from pydata#10726
The CI was failing because NumPy raises a DeprecationWarning before
ValueError when evaluating ambiguous truth values. Since we're properly
handling the ValueError, we suppress the warning in this specific context.
Rather than suppressing the warning in merge_attrs, handle it at the
source in equivalent() where the 'or' operation happens. This is cleaner
and includes a clear comment about when the suppression can be removed
(when minimum numpy version >= 2.0).

The warning only occurs in numpy < 2.0; numpy 2.0+ raises ValueError
directly, which we already handle properly in merge_attrs.
The warning only occurs in our specific test case with custom objects that
return numpy arrays from __eq__. This is a very edge case scenario, so it's
more appropriate to suppress the warning in the test rather than in the
library code.
Based on code review, simplified the logic to:
- Process attributes in a single loop instead of nested operations
- Eliminate intermediate dictionary creation
- Make the control flow clearer and more efficient
max-sixty and others added 2 commits September 14, 2025 08:08
Co-authored-by: Justus Magin <[email protected]>
Co-authored-by: Justus Magin <[email protected]>
@max-sixty max-sixty added the plan to merge Final call for comments label Sep 14, 2025
- Create dedicated equivalent_attrs function that wraps equivalent()
- Returns False when ValueError is raised (e.g., for numpy arrays)
- Simplifies merge_attrs logic by encapsulating error handling
- Makes future behavior changes easier
@shoyer
Copy link
Member

shoyer commented Sep 15, 2025

Let me give a few give a few concrete examples of pathological inputs to be sure we handle for attribute equivalence:

  1. NumPy object arrays, e.g.,
x = np.array([None], dtype=object)
x[0] = np.arange(3)
y = np.array([None], dtype=object)
y[0] = np.arange(10, 13)
  1. xarray.Dataset objects (which actually raises TypeError!):
x = xarray.Dataset({'foo': 1})
y = xarray.Dataset({'bar': 1})
  1. Pandas series:
x = pd.Series([1, 2])
y = pd.Series([3, 4])

I think we should re-write equivalent to make it a little safer to use. My suggestion:

def equivalent(first: object, second: object) -> bool:
    if first is second:
        return True

    if isinstance(first, np.ndarray) and isinstance(second, np.ndarray):
        return np.array_equal(first, second, equal_nan=True)

    if isinstance(first, list) and isinstance(second, list):
        return list_equiv(first, second)

    result = first == second
    if not isinstance(result, bool):  # unhandled array-like values
        return False

    return result or (pd.isna(first) and pd.isna(second))  # valid for scalars


def equivalent_attrs(first: object, second: object) -> bool:
    try:
        return equivalent(first, second)
    except ValueError, TypeError:
        return False

Potentially we could make this is a little more aggressive by returning False if first and second have different types.

- Test NumPy object arrays with nested arrays
- Test xarray.Dataset objects as attributes
- Test Pandas Series as attributes
- Update equivalent_attrs to catch TypeError in addition to ValueError
- Ensure equivalent_attrs always returns a boolean (not Dataset/array-like)

These cases all properly get dropped when they can't be reliably compared.
- Choose truthiness over strict boolean checking for consistency with
  Python's standard 'if a == b:' behavior
- Accept edge cases with non-standard __eq__ methods as a deliberate tradeoff
- Add comprehensive tests documenting the behavior and edge cases
- Add clear documentation explaining the design choice and its implications

This approach is more permissive and works better with numpy scalars and
similar types, while still catching truly ambiguous cases (ValueError/TypeError).
@max-sixty max-sixty changed the title Fix drop_conflicts to handle non-bool returns from __eq__ Adjust drop_conflicts to handle non-bool returns from __eq__ Sep 15, 2025
@max-sixty max-sixty changed the title Adjust drop_conflicts to handle non-bool returns from __eq__ Allow drop_conflicts to handle non-bool returns from __eq__ Sep 15, 2025
@max-sixty max-sixty changed the title Allow drop_conflicts to handle non-bool returns from __eq__ Allow drop_conflicts to handle non-bool returns from __eq__ Sep 15, 2025
- Rewrite equivalent() to reject non-boolean comparison results
- Accept numpy bool scalars (np.bool_) but reject other non-bool types
- Simplify equivalent_attrs() since equivalent() now handles non-bool cases
- Update tests to reflect stricter behavior with non-standard __eq__ methods

This makes comparisons more predictable by rejecting ambiguous cases like
Dataset comparisons, custom objects with weird __eq__, etc. The tradeoff is
being less permissive than Python's standard 'if a == b:' behavior.
@max-sixty
Copy link
Collaborator Author

implemented that approach @shoyer . so quite small & simple code changes; lots of tests...

- Fix comment to say 'nested numpy arrays' instead of 'numpy arrays'
- Use del instead of pop() to avoid creating unused return value
- Split long test methods into smaller, focused test methods
- Move all imports (warnings, pandas) to top of file

Co-authored-by: Claude <[email protected]>
Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @max-sixty (+ Claude!)

- Split long test methods into smaller, focused test cases:
  - test_merge_attrs_drop_conflicts_ambiguous_array_returns split to add test_merge_attrs_drop_conflicts_all_true_array
  - test_merge_attrs_drop_conflicts_non_boolean_eq_returns split into test_merge_attrs_drop_conflicts_eq_returns_string and test_merge_attrs_drop_conflicts_eq_returns_number
- Ran formatter which removed a blank line

Co-authored-by: Claude <[email protected]>
@max-sixty
Copy link
Collaborator Author

Thanks @max-sixty (+ Claude!)

yw!

was an interesting case study — I give it a 6/10.

one wrinkle is that our standards are higher than most! and these models are great at doing median code, but even with me overseeing them, it still took a few iterations to get to the standard we wanted...

@max-sixty max-sixty merged commit 8694818 into pydata:main Sep 16, 2025
37 checks passed
@max-sixty max-sixty deleted the merge-attrs-drop-eq branch September 16, 2025 04:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants