-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Allow drop_conflicts
to handle non-bool returns from __eq__
#10741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When merging attrs with drop_conflicts, objects whose __eq__ returns non-bool values (e.g., numpy arrays) would raise ValueError. Now these are caught and treated as non-equivalent, causing the attribute to be dropped rather than raising an error. Fixes regression from pydata#10726
The CI was failing because NumPy raises a DeprecationWarning before ValueError when evaluating ambiguous truth values. Since we're properly handling the ValueError, we suppress the warning in this specific context.
for more information, see https://pre-commit.ci
Rather than suppressing the warning in merge_attrs, handle it at the source in equivalent() where the 'or' operation happens. This is cleaner and includes a clear comment about when the suppression can be removed (when minimum numpy version >= 2.0). The warning only occurs in numpy < 2.0; numpy 2.0+ raises ValueError directly, which we already handle properly in merge_attrs.
for more information, see https://pre-commit.ci
The warning only occurs in our specific test case with custom objects that return numpy arrays from __eq__. This is a very edge case scenario, so it's more appropriate to suppress the warning in the test rather than in the library code.
for more information, see https://pre-commit.ci
Based on code review, simplified the logic to: - Process attributes in a single loop instead of nested operations - Eliminate intermediate dictionary creation - Make the control flow clearer and more efficient
Co-authored-by: Justus Magin <[email protected]>
Co-authored-by: Justus Magin <[email protected]>
- Create dedicated equivalent_attrs function that wraps equivalent() - Returns False when ValueError is raised (e.g., for numpy arrays) - Simplifies merge_attrs logic by encapsulating error handling - Makes future behavior changes easier
Let me give a few give a few concrete examples of pathological inputs to be sure we handle for attribute equivalence:
I think we should re-write def equivalent(first: object, second: object) -> bool:
if first is second:
return True
if isinstance(first, np.ndarray) and isinstance(second, np.ndarray):
return np.array_equal(first, second, equal_nan=True)
if isinstance(first, list) and isinstance(second, list):
return list_equiv(first, second)
result = first == second
if not isinstance(result, bool): # unhandled array-like values
return False
return result or (pd.isna(first) and pd.isna(second)) # valid for scalars
def equivalent_attrs(first: object, second: object) -> bool:
try:
return equivalent(first, second)
except ValueError, TypeError:
return False Potentially we could make this is a little more aggressive by returning |
- Test NumPy object arrays with nested arrays - Test xarray.Dataset objects as attributes - Test Pandas Series as attributes - Update equivalent_attrs to catch TypeError in addition to ValueError - Ensure equivalent_attrs always returns a boolean (not Dataset/array-like) These cases all properly get dropped when they can't be reliably compared.
- Choose truthiness over strict boolean checking for consistency with Python's standard 'if a == b:' behavior - Accept edge cases with non-standard __eq__ methods as a deliberate tradeoff - Add comprehensive tests documenting the behavior and edge cases - Add clear documentation explaining the design choice and its implications This approach is more permissive and works better with numpy scalars and similar types, while still catching truly ambiguous cases (ValueError/TypeError).
drop_conflicts
to handle non-bool returns from __eq__
- Rewrite equivalent() to reject non-boolean comparison results - Accept numpy bool scalars (np.bool_) but reject other non-bool types - Simplify equivalent_attrs() since equivalent() now handles non-bool cases - Update tests to reflect stricter behavior with non-standard __eq__ methods This makes comparisons more predictable by rejecting ambiguous cases like Dataset comparisons, custom objects with weird __eq__, etc. The tradeoff is being less permissive than Python's standard 'if a == b:' behavior.
42e5c9d
to
5557193
Compare
implemented that approach @shoyer . so quite small & simple code changes; lots of tests... |
- Fix comment to say 'nested numpy arrays' instead of 'numpy arrays' - Use del instead of pop() to avoid creating unused return value - Split long test methods into smaller, focused test methods - Move all imports (warnings, pandas) to top of file Co-authored-by: Claude <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @max-sixty (+ Claude!)
- Split long test methods into smaller, focused test cases: - test_merge_attrs_drop_conflicts_ambiguous_array_returns split to add test_merge_attrs_drop_conflicts_all_true_array - test_merge_attrs_drop_conflicts_non_boolean_eq_returns split into test_merge_attrs_drop_conflicts_eq_returns_string and test_merge_attrs_drop_conflicts_eq_returns_number - Ran formatter which removed a blank line Co-authored-by: Claude <[email protected]>
yw! was an interesting case study — I give it a 6/10. one wrinkle is that our standards are higher than most! and these models are great at doing median code, but even with me overseeing them, it still took a few iterations to get to the standard we wanted... |
From Claude:
When merging attributes with
combine_attrs="drop_conflicts"
, objects whose__eq__
method returns non-bool values (e.g., numpy arrays) would raiseValueError
due to ambiguous truth values. This PR fixes the issue by catching theValueError
and treating such comparisons as non-equivalent, causing the attribute to be dropped.This fixes a regression introduced in #10726 where the merge logic started using direct equality comparisons that can fail for certain attribute types.
What's changed:
equivalent()
call inmerge_attrs
for thedrop_conflicts
case__eq__
scenariosTest coverage includes: