Implement async support for open_datatree #10742
base: main
This looks great! Would it be possible to make the sync path reuse the async methods internally? This would help reduce duplication, increase test coverage, and speed up sync workflows.
Thanks for the suggestion @shoyer! I explored implementing sync-to-async reuse using a universal coroutine runner. The main challenge is handling environments where an event loop is already running (such as Jupyter notebooks): there, asyncio.run() fails, so the runner has to spawn background threads. However, this approach raises some design concerns:
The tradeoff is between code deduplication on one side and user control and predictable behavior on the other. Other major Python libraries (like httpx, requests-async) often keep separate sync/async implementations for similar reasons. What's your take on the threading tradeoff vs. the deduplication benefits? CC @TomNicholas
I'm pretty sure Zarr v3 uses async internally to implement sync methods. It may be worth taking a look at how Zarr does things, especially given the strong overlap in the contributor communities. Launching a few threads is not particularly resource-intensive, so I'm not worried about that. Thread safety is a potential concern, but we do already take care to ensure that Xarray is thread safe internally, especially for IO backends. I think we can safely say that the vast majority of Xarray users are not familiar with async programming models, so I think they could really benefit from having this work by default. This is quite different from the user base for the web programming libraries you mention.
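The general pattern of implementing sync methods on top of async ones can be sketched as follows. This is an illustration of the idea under stated assumptions, not Zarr's or Xarray's actual code: `sync`, `_get_loop`, `_load_node`, `_load_all`, and `load_all_sync` are all hypothetical names. A single event loop runs forever on a daemon thread, and sync callers submit coroutines to it.

```python
import asyncio
import threading
from typing import Any, Coroutine, List, Optional, TypeVar

T = TypeVar("T")

_loop: Optional[asyncio.AbstractEventLoop] = None
_lock = threading.Lock()


def _get_loop() -> asyncio.AbstractEventLoop:
    """Lazily start one shared event loop on a daemon thread."""
    global _loop
    with _lock:
        if _loop is None:
            loop = asyncio.new_event_loop()
            threading.Thread(
                target=loop.run_forever, name="io-loop", daemon=True
            ).start()
            _loop = loop
    return _loop


def sync(coro: Coroutine[Any, Any, T]) -> T:
    """Submit `coro` to the shared loop and block until it completes."""
    future = asyncio.run_coroutine_threadsafe(coro, _get_loop())
    return future.result()


async def _load_node(path: str) -> str:
    # Stand-in for an async I/O call on one tree node.
    await asyncio.sleep(0.01)
    return path.upper()


async def _load_all(paths: List[str]) -> List[str]:
    # The async path: all nodes are loaded concurrently.
    return await asyncio.gather(*(_load_node(p) for p in paths))


def load_all_sync(paths: List[str]) -> List[str]:
    # The sync path reuses the async implementation.
    return sync(_load_all(paths))
```

Because the loop lives on its own thread, `sync` also works when the caller already has a running event loop, as in a notebook. Calling `sync` from a coroutine running on the shared loop itself would deadlock, so a real implementation would need to guard against that.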
@shoyer did you see #10622? I raised that issue to discuss the general problem of how these libraries interact with each other when it comes to concurrency.
Yes, Zarr manages its own thread pool.
OK, let's try to reach some initial resolution about the async strategy for Xarray over in #10622 first!
open_dataset creates default indexes sequentially, causing significant latency in cloud high-latency stores #10579 and #12
whats-new.rst
api.rst