NetCDF IO cleanup, including in-memory reads/writes with netCDF4 #10624
Merged
Conversation
I am thinking I might try to split this into a separate PR, because it's unrelated to the netCDF4 in-memory changes.
This PR includes a handful of significant changes:

1. It refactors the internal structure of `DataTree.to_netcdf()` and `DataTree.to_zarr()` to use lower-level interfaces rather than calling `Dataset` methods. This allows for properly supporting `compute=False` (and likely various other improvements).
2. Reading and writing in-memory data with netCDF4-python is now supported, including for DataTree.
3. The `engine` argument in `DataTree.to_netcdf()` is now set consistently with `Dataset.to_netcdf()`, preferring `netcdf4` to `h5netcdf`.
4. Calling `Dataset.to_netcdf()` without a target now always returns a `memoryview` object, *including* in the case where `engine='scipy'` is used (which currently returns `bytes`). This is a breaking change, rather than merely issuing a warning as is done in pydata#10571. I believe it makes sense to make this a breaking change because (1) it offers significant performance benefits, (2) the default behavior without specifying an engine will already change (because `netcdf4` is preferred to the `scipy` backend), and (3) restoring the previous behavior is easy (by wrapping the memoryview with `bytes()`).
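A minimal, pure-Python sketch of adapting to the new `memoryview` return type. The buffer below is a stand-in for actual `to_netcdf()` output so the sketch runs without xarray; with xarray installed you would write `payload = ds.to_netcdf()` instead.

```python
# Dataset.to_netcdf() with no target now returns a memoryview for all engines
# (previously bytes when engine='scipy'). The buffer here is a stand-in for the
# real netCDF payload.
payload = memoryview(b"\x89HDF\r\n\x1a\n")  # stand-in for to_netcdf() output

# Old code that expected bytes keeps working after a one-line wrap:
data = bytes(payload)
assert isinstance(data, bytes)

# memoryview allows zero-copy slicing, which is where the performance win comes from:
header = payload[:4]
assert header.tobytes() == b"\x89HDF"
```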
kmuehlbauer approved these changes (Sep 9, 2025)
🚀 A big win! Thank you so much, Stephan!
Co-authored-by: Kai Mühlbauer <[email protected]>
responsible for 1400 / 1440 warnings; I don't think there's much we can do about them
* check that weighted ops propagate attrs on coords
* propagate attrs on coords in `map` if keep_attrs
* directly check that `map` propagates attrs on coords
* whats-new
* Add PEP 723-style script to the bug report issue template
* make suggestion rather than requirement
* Update .github/ISSUE_TEMPLATE/bugreport.yml (three commits, co-authored by Maximilian Roos)
* use complete

Co-authored-by: Maximilian Roos <[email protected]>
* Allow `mean` with time dtypes. Closes pydata#5897, pydata#6995, pydata#10217.
* Allow mean() to work with datetime dtypes:
  - Changed numeric_only from True to False in mean() aggregations
  - Added a Dataset.mean() override to filter out string variables while preserving datetime/timedelta types (only filters data variables, preserves all coordinates)
  - Added comprehensive tests for datetime mean with edge cases, including NaT handling; tests cover Dataset, DataArray, groupby, and mixed-type scenarios
  This enables mean() to work with datetime64 and timedelta64 types as requested in PR pydata#10227 while preventing errors from string variables.
* Skip cftime test when cftime not available: add a pytest.mark.skipif decorator to test_mean_preserves_non_string_object_arrays so it is skipped in minimal-dependency environments where cftime is not installed.
* Fix string/datetime handling in mean() aggregations:
  - Revert numeric_only back to True for mean() to prevent strings from being included
  - Add datetime64/timedelta64 types to the numeric_only checks in Dataset.reduce() and the flox path
  - Also check for object arrays containing datetime-like objects (cftime dates)
  - This allows mean() to work with datetime types while excluding strings that would cause errors
* Trigger CI workflow
* Format groupby.py numeric_only condition: auto-formatted the multi-line condition for better readability
* Apply formatting changes to dataset.py and test_groupby.py (auto-formatted multi-line conditions) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix the numeric_only parameter in groupby mean for the non-flox path: the generated mean() methods for DatasetGroupBy and ResampleGroupBy were incorrectly passing numeric_only=False in the non-flox path, causing string variables to fail during reduction. Changed to numeric_only=True to match the flox path behavior. This fixes test_groupby_dataset_reduce_ellipsis failures.
* Simplify dtype checking with an _is_numeric_aggregatable_dtype helper: a canonical, well-documented function that checks whether a dtype can be used in numeric aggregations, replacing complex repeated conditionals in dataset.py and groupby.py. The helper checks for numeric types (int, float, complex), booleans, datetime types (datetime64, timedelta64), and object arrays containing datetime-like objects (e.g., cftime).
* Address review feedback from @dcherian:
  - Document datetime mean behavior in generate_aggregations.py
  - Simplify a test assertion using isnull().item() instead of pd.notna
  - Remove the redundant test_mean_dataarray_datetime test
  - Add tests for the linked issues: pydata#5897 (mean with cftime objects), pydata#6995 (groupby_bins with datetime64 mean), pydata#10217 (groupby_bins mean on time-series data)
* Fix test_mean_with_cftime_objects for non-dask environments: the test unconditionally used dask operations, which caused failures in the "all-but-dask" CI environment; the dask-specific tests now run only when dask is available.
* Use the standard @requires_dask decorator for dask-specific tests: split test_mean_with_cftime_objects into a base test that runs without dask and a dask-specific test marked @requires_dask, following xarray's standard pattern for dependency-specific tests.
* Add testing guidelines and use proper decorators: created xarray/tests/CLAUDE.md with guidelines for handling optional dependencies in tests, and updated the cftime tests to use the @requires_cftime decorator instead of pytest.importorskip.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix CLAUDE.md blackdoc formatting: ensure all Python code blocks have complete, valid syntax for blackdoc.
* Simplify the cftime check in _is_numeric_aggregatable_dtype: as suggested by @dcherian in review comment r2337966239, directly use _contains_cftime_datetimes(var._data), which already handles the object-dtype check internally.
* Streamline CLAUDE.md and move an import to the top of file: made CLAUDE.md more concise by removing verbose decorator listings, and moved the _is_numeric_aggregatable_dtype import to the top of groupby.py (review comment r2337968851).
* Address review comments: move the datetime/timedelta note to the global _NUMERIC_ONLY_NOTES constant (comment r2337970143) and merge test_mean_preserves_non_string_object_arrays into test_mean_with_cftime_objects (comment r2337128692).
* Clarify that @requires decorators should be used instead of skipif patterns: the testing guidelines now explicitly discourage pytest.mark.skipif in parametrize, recommending splitting tests or using @requires decorators.

Co-authored-by: Maximilian Roos <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude <[email protected]>
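The datetime-mean semantics these commits enable can be illustrated with plain NumPy (a sketch of the semantics only, not xarray's implementation): datetime64 values are integer offsets from an epoch, so a mean is the epoch plus the mean of the offsets.

```python
import numpy as np

# datetime64 values are integer offsets from an epoch, so the mean of a
# datetime array is epoch + mean(offsets); timedelta64 averages directly.
times = np.array(["2020-01-01", "2020-01-03", "2020-01-05"], dtype="datetime64[D]")
epoch = times.min()
mean_time = epoch + (times - epoch).mean()
assert mean_time == np.datetime64("2020-01-03")

# NaT handling (as exercised by the new tests) follows the same pattern after
# masking out the missing values:
with_nat = np.array(["2020-01-01", "NaT", "2020-01-03"], dtype="datetime64[D]")
valid = with_nat[~np.isnat(with_nat)]
assert valid.size == 2
```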
* TYP: explicit `DTypeLike | None`
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix bugreport.yml
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude <[email protected]>
This reverts commit 2b52b95.
* Fix to_netcdf(compute=False) with Dask distributed (fixes pydata#10725)
* Silence incorrect mypy error
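The delayed-write pattern that this fix targets looks roughly like the sketch below (assuming xarray, dask, and a netCDF backend are installed; under a `dask.distributed` scheduler, the final `compute()` step is what previously failed):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# A chunked (dask-backed) dataset; compute=False defers the actual write.
ds = xr.Dataset({"x": ("t", np.arange(4))}).chunk({"t": 2})
path = os.path.join(tempfile.mkdtemp(), "out.nc")

delayed = ds.to_netcdf(path, compute=False)  # returns a dask Delayed object
# ... schedule other work, or hand `delayed` to a distributed client ...
delayed.compute()  # the write happens here
assert os.path.exists(path)
```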
* cleanup bug report template
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Add type annotations to satisfy mypy's check_untyped_defs requirement
- Fix the combine_nested type signature to accept None for the concat_dim parameter
- Add minimal type hints where mypy cannot infer types
- No behavioral changes; all tests pass

Co-authored-by: Claude <[email protected]>
* Update mypy test requirement to 1.17.1
* Fix mypy 1.17.1 compatibility issues:
  - Add list-item to the type ignore for the df.columns assignment
  - Remove unnecessary attr-defined type ignores that mypy 1.17.1 now understands
* Restore type ignores needed by mypy 1.17.1 in the CI environment

Co-authored-by: Maximilian Roos <[email protected]>
Labels:
- Automation: Github bots, testing workflows, release automation
- io
- plan to merge: final call for comments
- topic-backends
- topic-DataTree: related to the implementation of a DataTree class
- topic-groupby
- topic-indexing
- topic-NamedArray: lightweight version of Variable
- topic-typing
This PR includes several related improvements to netCDF reading/writing:
- Files opened with `engine='h5netcdf'` can now be pickled, as long as the underlying file-like object also supports pickle.
- `engine='scipy'` no longer closes the underlying file, consistent with the h5netcdf backend.
- Clean-up of `xarray.backends.file_manager` and `xarray.backends.locks`.
- A new `PickleableFileManager` class, which is used for wrapping file-like objects that do not natively support pickling (e.g., `netCDF4.Dataset` and `h5netcdf.File`) in cases where a global cache is not desirable (e.g., for netCDF files opened from bytes in memory, or from existing file objects).
- Updates to `whats-new.rst`.