obstore delete_dir #3310
Conversation
Codecov Report ❌ Patch coverage is …
Additional details and impacted files:

```
@@           Coverage Diff            @@
##            main    #3310     +/-  ##
==========================================
+ Coverage   60.72%   60.74%   +0.01%
==========================================
  Files          78       78
  Lines        9408     9417       +9
==========================================
+ Hits         5713     5720       +7
- Misses       3695     3697       +2
```
cc @kylebarron
src/zarr/storage/_obstore.py (Outdated)
if prefix != "" and not prefix.endswith("/"): | ||
prefix += "/" | ||
|
||
keys = [(k,) async for k in self.list_prefix(prefix)] |
If you're not using the async iterator from the list, you could also just call `metas = await obs.list(self.store, prefix=prefix).collect_async()` and then extract the `path` from each dict. That might give a tiny bit less async overhead.
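A minimal sketch of that suggestion, using the zarr-internal `concurrent_map` helper and `config` object quoted later in this thread (illustrative only, not the exact diff):

```python
# Sketch: collect all ObjectMeta dicts eagerly, then pull out each "path"
# and delete concurrently. Helper names are the ones quoted elsewhere in
# this thread, not necessarily the merged code.
metas = await obs.list(self.store, prefix=prefix).collect_async()
keys = [(meta["path"],) for meta in metas]
await concurrent_map(keys, self.delete, limit=config.get("async.concurrency"))
```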
thanks for the suggestion, seems good to me.
if prefix != "" and not prefix.endswith("/"): | ||
prefix += "/" | ||
|
||
metas = await obs.list(self.store, prefix).collect_async() |
By the way, you could potentially make this faster by not needing to wait until the `list` fully finishes to start deleting files. You could do a concurrent map over each batch returned from the `list`. You have the tradeoff between needing to wait for the `list` to fully finish before starting deletes vs needing to run `concurrent_map` in batches instead of all at once. But note you can also customize the `chunk_size` of `list`. I don't know in practice which approach is better.
My initial motivation for this approach was that listing upfront is certainly less complex than interleaving list/delete (granted, the difference is small), while being similarly performant, in theory.
I ran a basic benchmark (courtesy of Claude) with this kind of impl and got these results (one run each; delete1 is this PR's eager `delete_dir`, delete2 is the batched `delete_dir2` below):

- `async.concurrency == 1000`: delete1 (6.9 s) vs. delete2 (10.45 s)
- `async.concurrency == 100`: delete1 (8.18 s) vs. delete2 (15.67 s)
- `async.concurrency == 20`: delete1 (32.37 s) vs. delete2 (30.61 s)
```python
async def delete_dir2(self, prefix: str) -> None:
    # docstring inherited
    import obstore as obs

    self._check_writable()
    if prefix != "" and not prefix.endswith("/"):
        prefix += "/"

    limit = config.get("async.concurrency")
    async for chunk in obs.list(self.store, prefix, chunk_size=1000):
        keys = [(x["path"],) for x in chunk]
        await concurrent_map(keys, self.delete, limit=limit)
```
```python
import asyncio
import time

import obstore as obs
import zarr
import zarr.storage

zarr.config.set({"async.concurrency": 1000})

TEST_BUCKET = "my-test-bucket"
remote = obs.store.S3Store(bucket=TEST_BUCKET)
store = zarr.storage.ObjectStore(remote)


async def fill_bucket(remote_store):
    """Fill bucket with 10000 files prefixed like 'c/0', 'c/1', etc."""
    print("Filling bucket with 10000 files...")

    async def put_file(i):
        key = f"c/{i}"
        data = f"test data for file {i}".encode("utf-8")
        await obs.put_async(remote_store, key, data)

    # Create tasks for concurrent uploads
    tasks = []
    for i in range(10000):
        tasks.append(put_file(i))
        if len(tasks) >= 100:  # Process in batches of 100
            await asyncio.gather(*tasks)
            print(f"Created {i + 1} files...")
            tasks = []

    # Process remaining tasks
    if tasks:
        await asyncio.gather(*tasks)
    print("Finished filling bucket with 10000 files")


async def test1():
    await fill_bucket(remote)
    start_time = time.time()
    await store.delete_dir("c")
    end_time = time.time()
    print(f"delete_dir took {end_time - start_time:.2f} seconds")


async def test2():
    await fill_bucket(remote)
    start_time = time.time()
    await store.delete_dir2("c")
    end_time = time.time()
    print(f"delete_dir2 took {end_time - start_time:.2f} seconds")


async def main():
    print("Running test1 (delete_dir)...")
    await test1()
    print("\nRunning test2 (delete_dir2)...")
    await test2()


if __name__ == "__main__":
    asyncio.run(main())
```
Assuming the alternate impl looks about right, I'd be biased towards the current impl.
These results were a bit surprising. I reran after setting `chunk_size=10_000` in the second impl, and both impls then had similar perf (~5 s). So it seems like calling `concurrent_map` as few times as possible is desirable. Though in effect, this strategy just converges to fetching all the keys upfront, it seems.
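Concretely, the rerun only changes the `chunk_size` argument in the `delete_dir2` loop shown above (a sketch of the assumed one-line change):

```python
# Same loop as delete_dir2 above, but with a larger chunk_size so that
# concurrent_map is invoked far fewer times per listing.
async for chunk in obs.list(self.store, prefix, chunk_size=10_000):
    keys = [(x["path"],) for x in chunk]
    await concurrent_map(keys, self.delete, limit=limit)
```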
Closes #3309
This approach will eagerly load all keys staged for deletion. I think this is the simplest way to go about this, though let me know if another approach involving batching is more desirable.
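A minimal sketch of that eager approach, assembled from the fragments quoted in the review above (`concurrent_map` and `config` are the zarr-internal helpers referenced in the thread; illustrative, not necessarily the exact merged diff):

```python
async def delete_dir(self, prefix: str) -> None:
    # docstring inherited
    import obstore as obs

    self._check_writable()
    if prefix != "" and not prefix.endswith("/"):
        prefix += "/"

    # Eagerly list every key under the prefix, then delete them concurrently.
    metas = await obs.list(self.store, prefix).collect_async()
    keys = [(meta["path"],) for meta in metas]
    await concurrent_map(keys, self.delete, limit=config.get("async.concurrency"))
```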
TODO:
- docs/user-guide/*.rst
- changes/