@shoyer (Member) commented Aug 11, 2025

This PR includes several related improvements to netCDF reading/writing:

  1. It adds support for reading and writing in-memory netCDF4 files using netCDF4-Python, including with xarray.DataTree.
  2. It adds extensive tests for reading and writing in-memory and file-like data, across all netCDF backends.
  3. Xarray objects opened from file-like objects with engine='h5netcdf' can now be pickled, as long as the underlying file-like object also supports pickling.
  4. Closing Xarray objects opened from file-like objects with engine='scipy' no longer closes the underlying file, consistent with the h5netcdf backend.
  5. I've added extensive type annotations throughout xarray.backends.file_manager and xarray.backends.locks.
  6. There is a new PickleableFileManager class, which is used for wrapping file-like objects that do not natively support pickling (e.g., netCDF4.Dataset and h5netcdf.File) in cases where a global cache is not desirable (e.g., for netCDF files opened from bytes in memory, or from existing file objects).
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
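
The `PickleableFileManager` idea (point 6) can be sketched with the standard library alone. The class name comes from this PR, but everything else below — the `opener` argument, the `acquire`/`close` methods, the reopen-on-unpickle strategy — is my illustrative assumption, not xarray's actual implementation:

```python
import io
import pickle


class PickleableFileManager:
    """Hypothetical sketch: wrap an unpicklable file handle so the wrapper
    can be pickled, by remembering how to reopen the file instead of
    serializing the live handle itself."""

    def __init__(self, opener, *args, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs
        self._file = opener(*args, **kwargs)

    def acquire(self):
        return self._file

    def close(self):
        self._file.close()

    def __getstate__(self):
        # Drop the live handle; keep only what is needed to reopen it.
        return {"opener": self._opener, "args": self._args, "kwargs": self._kwargs}

    def __setstate__(self, state):
        self.__init__(state["opener"], *state["args"], **state["kwargs"])


# io.StringIO stands in for an unpicklable handle like netCDF4.Dataset
# or h5netcdf.File; pickling round-trips by reopening, not copying state.
manager = PickleableFileManager(io.StringIO, "hello")
clone = pickle.loads(pickle.dumps(manager))
```

Note the trade-off this sketch makes explicit: any state held by the handle (file position, unflushed writes) is lost across pickling, which is why such a wrapper suits read-oriented backends.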

@github-actions bot added the topic-backends, topic-zarr, topic-DataTree, and io labels Aug 11, 2025
@shoyer (Member, Author) commented Aug 11, 2025

  • It refactors the internal structure of DataTree.to_netcdf() and DataTree.to_zarr() to use lower level interfaces, rather than calling Dataset methods. This allows for properly supporting compute=False (and likely various other improvements).

I am thinking I might try to split this into a separate PR, because it's unrelated to the netCDF4 in-memory changes.

This PR includes a handful of significant changes:

1. It refactors the internal structure of `DataTree.to_netcdf()` and
   `DataTree.to_zarr()` to use lower level interfaces, rather than
   calling `Dataset` methods. This allows for properly supporting
   `compute=False` (and likely various other improvements).
2. Reading and writing in-memory data with netCDF4-python is now
   supported, including DataTree.
3. The `engine` argument in `DataTree.to_netcdf()` is now set
   consistently with `Dataset.to_netcdf()`, preferring `netcdf4` to
   `h5netcdf`.
4. Calling `Dataset.to_netcdf()` without a target now always returns a
   `memoryview` object, *including* in the case where `engine='scipy'`
   is used (which currently returns `bytes`). This is a breaking change,
   rather than merely issuing a warning as is done in pydata#10571. I believe
   it makes sense to make this a breaking change because (1)
   it offers significant performance benefits, (2) the default behavior
   without specifying an engine will already change (because `netcdf4`
   is preferred to the `scipy` backend) and (3) restoring previous
   behavior is easy (by wrapping the memoryview with `bytes()`).
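
The compatibility story for point 4 is a one-line change on the caller's side. A stdlib-only sketch (the payload below is a fake placeholder, not real netCDF bytes):

```python
# Placeholder standing in for the serialized file returned by to_netcdf().
payload = b"\x89HDF\r\n...fake netCDF bytes..."
buf = memoryview(payload)

# memoryview slices share the underlying buffer instead of copying it,
# which is where the performance benefit over `bytes` comes from.
header = buf[:4]

# Restoring the previous `bytes` behavior is a single wrapping call.
data = bytes(buf)
```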

mypy
@github-actions bot added the topic-DataTree label Sep 9, 2025
@kmuehlbauer (Contributor) left a comment

🚀 A big win! Thank you so much, Stephan!

@jsignell jsignell linked an issue Sep 11, 2025 that may be closed by this pull request
@shoyer added the plan to merge (Final call for comments) label Sep 13, 2025
shoyer and others added 16 commits September 15, 2025 18:02
responsible for 1400 / 1440 warnings; I don't think there's much we can do about them
* check that weighted ops propagate attrs on coords

* propagate attrs on coords in `map` if keep_attrs

* directly check that `map` propagates attrs on coords

* whats-new
* Add pep-723 style script to bug report issue template

* make suggestion rather than requirement

* Update .github/ISSUE_TEMPLATE/bugreport.yml

Co-authored-by: Maximilian Roos <[email protected]>

* use complete

* Update .github/ISSUE_TEMPLATE/bugreport.yml

Co-authored-by: Maximilian Roos <[email protected]>

* Update .github/ISSUE_TEMPLATE/bugreport.yml

Co-authored-by: Maximilian Roos <[email protected]>

---------

Co-authored-by: Maximilian Roos <[email protected]>
* Allow `mean` with time dtypes

Closes pydata#5897
Closes pydata#6995
Closes pydata#10217

* Allow mean() to work with datetime dtypes

- Changed numeric_only from True to False in mean() aggregations
- Added Dataset.mean() override to filter out string variables while preserving datetime/timedelta types
- Only filters data variables, preserves all coordinates
- Added comprehensive tests for datetime mean with edge cases including NaT handling
- Tests cover Dataset, DataArray, groupby, and mixed type scenarios

This enables mean() to work with datetime64 and timedelta64 types as requested in PR pydata#10227 while preventing errors from string variables.

* Skip cftime test when cftime not available

Add pytest.mark.skipif decorator to test_mean_preserves_non_string_object_arrays
to skip the test in minimal dependency environments where cftime is not installed.

* Fix string/datetime handling in mean() aggregations

- Revert numeric_only back to True for mean() to prevent strings from being included
- Add datetime64/timedelta64 types to numeric_only checks in Dataset.reduce() and flox path
- Also check for object arrays containing datetime-like objects (cftime dates)
- This allows mean() to work with datetime types while excluding strings that would cause errors
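
The reason `mean()` is well defined for datetimes at all is that averaging can be done on offsets from a reference time. A minimal stdlib sketch of that trick (xarray's actual implementation operates on `datetime64` arrays, not Python objects):

```python
from datetime import datetime, timedelta


def datetime_mean(values):
    """Mean of datetimes via offsets from the first value, so only
    timedeltas (which support arithmetic) are ever summed."""
    base = values[0]
    total = sum((v - base for v in values), timedelta())
    return base + total / len(values)


avg = datetime_mean([datetime(2020, 1, 1), datetime(2020, 1, 3)])
```

The same offsets-from-a-reference idea is what makes the operation safe for `timedelta64` and cftime objects while remaining meaningless for strings, which is exactly the distinction the `numeric_only` checks above draw.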

* Trigger CI workflow

* Format groupby.py numeric_only condition

Auto-formatted the multi-line condition for better readability

* Apply formatting changes to dataset.py and test_groupby.py

- Auto-formatted multi-line conditions for better readability

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix numeric_only parameter in groupby mean for non-flox path

The generated mean() methods for DatasetGroupBy and ResampleGroupBy
were incorrectly passing numeric_only=False in the non-flox path,
causing string variables to fail during reduction. Changed to
numeric_only=True to match the flox path behavior.

This fixes test_groupby_dataset_reduce_ellipsis failures.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Simplify dtype checking with _is_numeric_aggregatable_dtype helper

Created a canonical helper function to check if a dtype can be used in
numeric aggregations. This replaces complex repeated conditionals in
dataset.py and groupby.py with a single, well-documented function.

The helper checks for:
- Numeric types (int, float, complex)
- Boolean type
- Datetime types (datetime64, timedelta64)
- Object arrays containing datetime-like objects (e.g., cftime)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Address review feedback from @dcherian

- Document datetime mean behavior in generate_aggregations.py
- Simplify test assertion using isnull().item() instead of pd.notna
- Remove redundant test_mean_dataarray_datetime test
- Add comprehensive tests for linked issues:
  - Issue pydata#5897: Test mean with cftime objects
  - Issue pydata#6995: Test groupby_bins with datetime64 mean
  - Issue pydata#10217: Test groupby_bins mean on time series data

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Fix test_mean_with_cftime_objects for non-dask environments

The test was unconditionally using dask operations which caused failures
in the "all-but-dask" CI environment. Now the dask-specific tests are
only run when dask is available.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Use standard @requires_dask decorator for dask-specific tests

Split test_mean_with_cftime_objects into two tests:
- Base test runs without dask
- Dask-specific test uses @requires_dask decorator

This follows xarray's standard pattern for dependency-specific tests
and is cleaner than using if has_dask conditionals.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Add testing guidelines and use proper decorators

- Created xarray/tests/CLAUDE.md with comprehensive guidelines for
  handling optional dependencies in tests
- Updated cftime tests to use @requires_cftime decorator instead of
  pytest.importorskip, following xarray's standard patterns
- This ensures consistent handling across CI environments

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix CLAUDE.md blackdoc formatting

Ensure all Python code blocks have complete, valid syntax for blackdoc.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Simplify cftime check in _is_numeric_aggregatable_dtype

As suggested by @dcherian in review comment r2337966239, directly use
_contains_cftime_datetimes(var._data) instead of the more complex check.
This is cleaner since _contains_cftime_datetimes already handles the
object dtype check internally.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Streamline CLAUDE.md and move import to top of file

- Made CLAUDE.md more concise by removing verbose decorator listings
- Moved _is_numeric_aggregatable_dtype import to top of groupby.py
  as suggested by @dcherian in review comment r2337968851

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Address review comments: clarify numeric_only scope and merge tests

- Move datetime/timedelta note to global _NUMERIC_ONLY_NOTES constant
  (addresses comment r2337970143)
- Merge test_mean_preserves_non_string_object_arrays into
  test_mean_with_cftime_objects (addresses comment r2337128692)
- Both changes address reviewer feedback about simplification

* Clarify that @requires decorators should be used instead of skipif patterns

Update testing guidelines to explicitly discourage pytest.mark.skipif
in parametrize, recommending splitting tests or using @requires decorators
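
A decorator in the spirit of xarray's `@requires_dask` / `@requires_cftime` test helpers can be built on the standard library alone. This is an illustrative sketch, not xarray's actual helper:

```python
import importlib.util
import unittest


def requires(modname):
    """Skip the decorated test unless `modname` is importable."""
    available = importlib.util.find_spec(modname) is not None
    return unittest.skipUnless(available, f"requires {modname}")


@requires("json")  # stdlib module, so this test runs normally
def test_runs():
    return "ran"


@requires("definitely_missing_module_xyz")  # raises SkipTest when called
def test_skipped():
    return "ran"
```

Checking `find_spec` rather than importing keeps collection cheap, and putting the check in a named decorator (instead of inline `skipif` conditions) is what makes the pattern easy to apply consistently across CI environments.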

* Fix CLAUDE.md blackdoc formatting

Co-authored-by: Claude <[email protected]>

---------

Co-authored-by: Maximilian Roos <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude <[email protected]>
* TYP: explicit `DTypeLike | None`

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix bugreport.yml

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix to_netcdf(compute=False) with Dask distributed

Fixes pydata#10725

* Silence incorrect mypy error
* cleanup bug report template

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Add type annotations to satisfy mypy's check_untyped_defs requirement
- Fix combine_nested type signature to accept None for concat_dim parameter
- Add minimal type hints where mypy cannot infer types
- No behavioral changes, all tests pass

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <[email protected]>
* Update mypy test requirement to 1.17.1

* Fix mypy 1.17.1 compatibility issues

- Add list-item to type ignore for df.columns assignment
- Remove unnecessary attr-defined type ignores that mypy 1.17.1 now understands

* Restore type ignores needed by mypy 1.17.1 in CI environment

---------

Co-authored-by: Maximilian Roos <[email protected]>
Co-authored-by: Maximilian Roos <[email protected]>
@github-actions bot added the topic-indexing, topic-groupby, Automation, and topic-NamedArray labels Sep 16, 2025
@shoyer shoyer enabled auto-merge (squash) September 16, 2025 02:06
@shoyer shoyer merged commit d6d7692 into pydata:main Sep 16, 2025
36 checks passed
Successfully merging this pull request may close these issues.

Regression: "h5py objects cannot be pickled" with cloudpickle
10 participants