@shoyer (Member) commented Aug 11, 2025

This PR includes several related improvements to netCDF reading/writing:

  1. It adds support for reading and writing in-memory netCDF4 files using netCDF4-Python, including with xarray.DataTree.
  2. It adds extensive tests for reading and writing in-memory and file-like data, across all netCDF backends.
  3. Xarray objects opened from file-like objects with engine='h5netcdf' can now be pickled, as long as the underlying file-like object also supports pickling.
  4. Closing Xarray objects opened from file-like objects with engine='scipy' no longer closes the underlying file, consistent with the h5netcdf backend.
  5. I've added extensive type annotations throughout xarray.backends.file_manager and xarray.backends.locks.
  6. There is a new PickleableFileManager class, which is used for wrapping file-like objects that do not natively support pickling (e.g., netCDF4.Dataset and h5netcdf.File) in cases where a global cache is not desirable (e.g., for netCDF files opened from bytes in memory, or from existing file objects).
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
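
The `PickleableFileManager` idea (point 6) can be sketched with the standard library alone. The class name comes from this PR, but everything else below — the `opener` argument, the `acquire`/`close` methods, the reopen-on-unpickle strategy — is my illustrative assumption, not xarray's actual implementation:

```python
import io
import pickle


class PickleableFileManager:
    """Hypothetical sketch: wrap an unpicklable file handle so the wrapper
    can be pickled, by remembering how to reopen the file instead of
    serializing the live handle itself."""

    def __init__(self, opener, *args, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs
        self._file = opener(*args, **kwargs)

    def acquire(self):
        return self._file

    def close(self):
        self._file.close()

    def __getstate__(self):
        # Drop the live handle; keep only what is needed to reopen it.
        return {"opener": self._opener, "args": self._args, "kwargs": self._kwargs}

    def __setstate__(self, state):
        self.__init__(state["opener"], *state["args"], **state["kwargs"])


# io.StringIO stands in for an unpicklable handle like netCDF4.Dataset
# or h5netcdf.File; pickling round-trips by reopening, not copying state.
manager = PickleableFileManager(io.StringIO, "hello")
clone = pickle.loads(pickle.dumps(manager))
```

Note the trade-off this sketch makes explicit: any state held by the handle (file position, unflushed writes) is lost across pickling, which is why such a wrapper suits read-oriented backends.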

@github-actions bot added the topic-backends, topic-zarr, topic-DataTree, and io labels Aug 11, 2025
@shoyer (Member, Author) commented Aug 11, 2025

  • It refactors the internal structure of DataTree.to_netcdf() and DataTree.to_zarr() to use lower level interfaces, rather than calling Dataset methods. This allows for properly supporting compute=False (and likely various other improvements).

I am thinking I might try to split this into a separate PR, because it's unrelated to the netCDF4 in-memory changes.

This PR includes a handful of significant changes:

1. It refactors the internal structure of `DataTree.to_netcdf()` and
   `DataTree.to_zarr()` to use lower level interfaces, rather than
   calling `Dataset` methods. This allows for properly supporting
   `compute=False` (and likely various other improvements).
2. Reading and writing in-memory data with netCDF4-python is now
   supported, including DataTree.
3. The `engine` argument in `DataTree.to_netcdf()` is now set
   consistently with `Dataset.to_netcdf()`, preferring `netcdf4` to
   `h5netcdf`.
4. Calling `Dataset.to_netcdf()` without a target now always returns a
   `memoryview` object, *including* in the case where `engine='scipy'`
   is used (which currently returns `bytes`). This is a breaking change,
   rather than merely issuing a warning as is done in pydata#10571. I believe
   it makes sense to make this a breaking change because (1)
   it offers significant performance benefits, (2) the default behavior
   without specifying an engine will already change (because `netcdf4`
   is preferred to the `scipy` backend) and (3) restoring previous
   behavior is easy (by wrapping the memoryview with `bytes()`).
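
The compatibility story for point 4 is a one-line change on the caller's side. A stdlib-only sketch (the payload below is a fake placeholder, not real netCDF bytes):

```python
# Placeholder standing in for the serialized file returned by to_netcdf().
payload = b"\x89HDF\r\n...fake netCDF bytes..."
buf = memoryview(payload)

# memoryview slices share the underlying buffer instead of copying it,
# which is where the performance benefit over `bytes` comes from.
header = buf[:4]

# Restoring the previous `bytes` behavior is a single wrapping call.
data = bytes(buf)
```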

mypy
@github-actions bot added the topic-DataTree label Sep 9, 2025
@kmuehlbauer (Contributor) left a comment

🚀 A big win! Thank you so much, Stephan!

@jsignell jsignell linked an issue Sep 11, 2025 that may be closed by this pull request
@shoyer added the plan to merge (Final call for comments) label Sep 13, 2025
shoyer and others added 16 commits September 15, 2025 18:02
responsible for 1400 / 1440 warnings; I don't think there's much we can do about them
* check that weighted ops propagate attrs on coords

* propagate attrs on coords in `map` if keep_attrs

* directly check that `map` propagates attrs on coords

* whats-new
* Add pep-723 style script to bug report issue template

* make suggestion rather than requirement

* Update .github/ISSUE_TEMPLATE/bugreport.yml

Co-authored-by: Maximilian Roos <[email protected]>

* use complete

* Update .github/ISSUE_TEMPLATE/bugreport.yml

Co-authored-by: Maximilian Roos <[email protected]>

* Update .github/ISSUE_TEMPLATE/bugreport.yml

Co-authored-by: Maximilian Roos <[email protected]>

---------

Co-authored-by: Maximilian Roos <[email protected]>
* Allow `mean` with time dtypes

Closes pydata#5897
Closes pydata#6995
Closes pydata#10217

* Allow mean() to work with datetime dtypes

- Changed numeric_only from True to False in mean() aggregations
- Added Dataset.mean() override to filter out string variables while preserving datetime/timedelta types
- Only filters data variables, preserves all coordinates
- Added comprehensive tests for datetime mean with edge cases including NaT handling
- Tests cover Dataset, DataArray, groupby, and mixed type scenarios

This enables mean() to work with datetime64 and timedelta64 types as requested in PR pydata#10227 while preventing errors from string variables.

* Skip cftime test when cftime not available

Add pytest.mark.skipif decorator to test_mean_preserves_non_string_object_arrays
to skip the test in minimal dependency environments where cftime is not installed.

* Fix string/datetime handling in mean() aggregations

- Revert numeric_only back to True for mean() to prevent strings from being included
- Add datetime64/timedelta64 types to numeric_only checks in Dataset.reduce() and flox path
- Also check for object arrays containing datetime-like objects (cftime dates)
- This allows mean() to work with datetime types while excluding strings that would cause errors
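
The reason `mean()` is well defined for datetimes at all is that averaging can be done on offsets from a reference time. A minimal stdlib sketch of that trick (xarray's actual implementation operates on `datetime64` arrays, not Python objects):

```python
from datetime import datetime, timedelta


def datetime_mean(values):
    """Mean of datetimes via offsets from the first value, so only
    timedeltas (which support arithmetic) are ever summed."""
    base = values[0]
    total = sum((v - base for v in values), timedelta())
    return base + total / len(values)


avg = datetime_mean([datetime(2020, 1, 1), datetime(2020, 1, 3)])
```

The same offsets-from-a-reference idea is what makes the operation safe for `timedelta64` and cftime objects while remaining meaningless for strings, which is exactly the distinction the `numeric_only` checks above draw.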

* Trigger CI workflow

* Format groupby.py numeric_only condition

Auto-formatted the multi-line condition for better readability

* Apply formatting changes to dataset.py and test_groupby.py

- Auto-formatted multi-line conditions for better readability

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix numeric_only parameter in groupby mean for non-flox path

The generated mean() methods for DatasetGroupBy and ResampleGroupBy
were incorrectly passing numeric_only=False in the non-flox path,
causing string variables to fail during reduction. Changed to
numeric_only=True to match the flox path behavior.

This fixes test_groupby_dataset_reduce_ellipsis failures.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Simplify dtype checking with _is_numeric_aggregatable_dtype helper

Created a canonical helper function to check if a dtype can be used in
numeric aggregations. This replaces complex repeated conditionals in
dataset.py and groupby.py with a single, well-documented function.

The helper checks for:
- Numeric types (int, float, complex)
- Boolean type
- Datetime types (datetime64, timedelta64)
- Object arrays containing datetime-like objects (e.g., cftime)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Address review feedback from @dcherian

- Document datetime mean behavior in generate_aggregations.py
- Simplify test assertion using isnull().item() instead of pd.notna
- Remove redundant test_mean_dataarray_datetime test
- Add comprehensive tests for linked issues:
  - Issue pydata#5897: Test mean with cftime objects
  - Issue pydata#6995: Test groupby_bins with datetime64 mean
  - Issue pydata#10217: Test groupby_bins mean on time series data

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Fix test_mean_with_cftime_objects for non-dask environments

The test was unconditionally using dask operations which caused failures
in the "all-but-dask" CI environment. Now the dask-specific tests are
only run when dask is available.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Use standard @requires_dask decorator for dask-specific tests

Split test_mean_with_cftime_objects into two tests:
- Base test runs without dask
- Dask-specific test uses @requires_dask decorator

This follows xarray's standard pattern for dependency-specific tests
and is cleaner than using if has_dask conditionals.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Add testing guidelines and use proper decorators

- Created xarray/tests/CLAUDE.md with comprehensive guidelines for
  handling optional dependencies in tests
- Updated cftime tests to use @requires_cftime decorator instead of
  pytest.importorskip, following xarray's standard patterns
- This ensures consistent handling across CI environments

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix CLAUDE.md blackdoc formatting

Ensure all Python code blocks have complete, valid syntax for blackdoc.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Simplify cftime check in _is_numeric_aggregatable_dtype

As suggested by @dcherian in review comment r2337966239, directly use
_contains_cftime_datetimes(var._data) instead of the more complex check.
This is cleaner since _contains_cftime_datetimes already handles the
object dtype check internally.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Streamline CLAUDE.md and move import to top of file

- Made CLAUDE.md more concise by removing verbose decorator listings
- Moved _is_numeric_aggregatable_dtype import to top of groupby.py
  as suggested by @dcherian in review comment r2337968851

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Address review comments: clarify numeric_only scope and merge tests

- Move datetime/timedelta note to global _NUMERIC_ONLY_NOTES constant
  (addresses comment r2337970143)
- Merge test_mean_preserves_non_string_object_arrays into
  test_mean_with_cftime_objects (addresses comment r2337128692)
- Both changes address reviewer feedback about simplification

* Clarify that @requires decorators should be used instead of skipif patterns

Update testing guidelines to explicitly discourage pytest.mark.skipif
in parametrize, recommending splitting tests or using @requires decorators
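
A decorator in the spirit of xarray's `@requires_dask` / `@requires_cftime` test helpers can be built on the standard library alone. This is an illustrative sketch, not xarray's actual helper:

```python
import importlib.util
import unittest


def requires(modname):
    """Skip the decorated test unless `modname` is importable."""
    available = importlib.util.find_spec(modname) is not None
    return unittest.skipUnless(available, f"requires {modname}")


@requires("json")  # stdlib module, so this test runs normally
def test_runs():
    return "ran"


@requires("definitely_missing_module_xyz")  # raises SkipTest when called
def test_skipped():
    return "ran"
```

Checking `find_spec` rather than importing keeps collection cheap, and putting the check in a named decorator (instead of inline `skipif` conditions) is what makes the pattern easy to apply consistently across CI environments.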

* Fix CLAUDE.md blackdoc formatting

Co-authored-by: Claude <[email protected]>

---------

Co-authored-by: Maximilian Roos <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude <[email protected]>
* TYP: explicit `DTypeLike | None`

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix bugreport.yml

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix to_netcdf(compute=False) with Dask distributed

Fixes pydata#10725

* Silence incorrect mypy error
* cleanup bug report template

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Add type annotations to satisfy mypy's check_untyped_defs requirement
- Fix combine_nested type signature to accept None for concat_dim parameter
- Add minimal type hints where mypy cannot infer types
- No behavioral changes, all tests pass

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <[email protected]>
* Update mypy test requirement to 1.17.1

* Fix mypy 1.17.1 compatibility issues

- Add list-item to type ignore for df.columns assignment
- Remove unnecessary attr-defined type ignores that mypy 1.17.1 now understands

* Restore type ignores needed by mypy 1.17.1 in CI environment

---------

Co-authored-by: Maximilian Roos <[email protected]>
Co-authored-by: Maximilian Roos <[email protected]>
@github-actions bot added the topic-indexing, topic-groupby, Automation, and topic-NamedArray labels Sep 16, 2025
@shoyer shoyer enabled auto-merge (squash) September 16, 2025 02:06
@shoyer shoyer merged commit d6d7692 into pydata:main Sep 16, 2025
36 checks passed
Successfully merging this pull request may close these issues.

Regression: "h5py objects cannot be pickled" with cloudpickle
10 participants