Conversation

@aladinor
Contributor

@aladinor aladinor commented Sep 14, 2025

@github-actions github-actions bot added the topic-backends, topic-zarr, and io labels Sep 14, 2025
@aladinor aladinor changed the title from "Async dtreec" to "Implement async support for open_datatre" Sep 14, 2025
@aladinor aladinor changed the title from "Implement async support for open_datatre" to "Implement async support for open_datatree" Sep 14, 2025
@shoyer
Member

shoyer commented Sep 14, 2025

This looks great! Would it be possible to make the sync path reuse the async methods internally? This would help reduce duplication, increase test coverage and speed up sync workflows.

@aladinor
Contributor Author

aladinor commented Sep 15, 2025

Thanks for the suggestion @shoyer! I explored implementing sync-to-async reuse using a universal coroutine runner. The main challenge is handling environments where an event loop is already running (such as Jupyter notebooks): there, asyncio.run() fails with "cannot be called from a running event loop", so the runner has to spawn a background thread instead.
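
For reference, the runner I prototyped looks roughly like this (a minimal sketch; _run_coroutine is a hypothetical helper name, not code from this PR):

    import asyncio
    import concurrent.futures

    def _run_coroutine(coro):
        # Hypothetical helper: run `coro` to completion from synchronous code.
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No event loop is running (plain scripts): asyncio.run() is enough.
            return asyncio.run(coro)
        # A loop is already running (e.g. Jupyter), so asyncio.run() would raise
        # "cannot be called from a running event loop"; run the coroutine on a
        # background thread with its own event loop instead.
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
            return executor.submit(asyncio.run, coro).result()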

However, this approach raises some design concerns:

  • Threading implications: The sync API would internally spawn threads in Jupyter environments, which conflicts with xarray's general avoidance of hidden threading. This can make debugging harder, affect resource management, and surprise users who expect predictable sync behavior.
  • Maintenance burden: We'd need to maintain the threading utility, handle edge cases across different environments, and ensure thread safety.
  • User experience: Some users prefer explicit control over when async/threading is used, especially in performance-critical applications.
  • Alternative benefits: The current approach still provides the main wins - users get significant performance improvements by explicitly choosing open_datatree_async(), and testing the async path covers the core logic.

The tradeoff is between code deduplication vs. user control and predictable behavior. Other major Python libraries (like httpx, requests-async) often keep separate sync/async implementations for similar reasons.

What's your take on the threading tradeoff vs. the deduplication benefits?

CC @TomNicholas

@shoyer
Member

shoyer commented Sep 15, 2025

I'm pretty sure Zarr v3 uses async internally to implement sync methods. It may be worth taking a look at how Zarr does things, especially given the strong overlap in the contributor communities.

Launching a few threads is not particularly resource-intensive, so I'm not worried about that. Thread safety is a potential concern, but we do already take care to ensure that Xarray is thread safe internally, especially for IO backends.

I think we can safely say that the vast majority of Xarray users are not familiar with async programming models, so I think they could really benefit from having this work by default. This is quite different from the user base for the web programming libraries you mention.

@TomNicholas
Member

@shoyer did you see #10622? I raised that issue to discuss the general problem of how these libraries interact with each other when it comes to concurrency.

I'm pretty sure Zarr v3 uses async internally to implement sync methods. It may be worth taking a look at how Zarr does things, especially given the strong overlap in the contributor communities.

Yes, zarr manages its own threadpool.
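
For anyone following along, the pattern is roughly a single long-lived background thread that owns an event loop, with a sync wrapper submitting coroutines to it and blocking on the result (a simplified sketch of the idea, not zarr's actual code):

    import asyncio
    import threading

    # One background event loop shared by all sync wrappers.
    _loop = asyncio.new_event_loop()
    _thread = threading.Thread(target=_loop.run_forever, daemon=True)
    _thread.start()

    def sync(coro):
        # Submit the coroutine to the background loop and block until it finishes.
        future = asyncio.run_coroutine_threadsafe(coro, _loop)
        return future.result()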

@shoyer
Member

shoyer commented Sep 15, 2025

OK, let's try to reach some initial resolution about the async strategy for Xarray over in #10622 first!

Changes:
- Refactor open_datatree() to use zarr_sync() with async implementation
  for concurrent dataset and index creation across groups
- Add _open_datatree_from_stores_async() helper that opens datasets and
  creates indexes concurrently using asyncio.gather with a semaphore
  to limit concurrency (avoids deadlocks with stores like Icechunk)
- Add open_datatree_async() method for explicit async API
- Remove duplicate _maybe_create_default_indexes_async from zarr.py,
  now imports from api.py (single source of truth)

This significantly improves performance when opening DataTrees from
high-latency storage backends (e.g., ~2 seconds vs sequential loading).
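
The semaphore-limited pattern described above looks roughly like this (a sketch with simplified names; the real helper also wires up indexes and store options):

    import asyncio

    async def _open_datatree_from_stores_async(stores, open_one, limit=10):
        # Open every group concurrently, but let at most `limit` opens run at
        # once; `open_one` stands in for the per-group dataset opener.
        semaphore = asyncio.Semaphore(limit)

        async def _open(path, store):
            async with semaphore:
                return path, await open_one(store)

        pairs = await asyncio.gather(
            *(_open(path, store) for path, store in stores.items())
        )
        return dict(pairs)
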
Remove the asyncio.Semaphore that was limiting concurrency to 10
concurrent operations. Investigation showed:

- Zarr already has built-in concurrency control (async.concurrency=10)
- The semaphore only applied to asyncio.to_thread() calls, not zarr I/O
- Removing it improves performance by ~30-40% (~2s -> ~1.2-1.4s)

The semaphore was defensive code for a problem that doesn't exist -
zarr and icechunk handle their own concurrency limits internally.
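
For context, zarr v3 throttles its own concurrent I/O via its config; the built-in default corresponds roughly to:

    import zarr

    # zarr v3's cap on concurrent async I/O (the default, shown explicitly here).
    zarr.config.set({"async.concurrency": 10})
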
@aladinor
Contributor Author

Hey @TomNicholas and @shoyer,

I've updated the async DataTree implementation based on our previous discussions. Key changes:

User-facing API remains synchronous, no await needed: users just call the normal sync API:
dt = xr.open_datatree("s3://bucket/data.zarr", engine="zarr")

How it works internally:

  • The zarr backend's open_datatree() now uses zarr.core.sync.sync() (aliased as zarr_sync) to execute async code from the sync context
  • Internally, _open_datatree_from_stores_async() opens all groups and creates indexes concurrently using asyncio.gather()
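
Concretely, the backend entry point boils down to something like this (a sketch with simplified names; _collect_group_stores is a stand-in for the actual store-opening logic):

    from zarr.core.sync import sync as zarr_sync

    def open_datatree(self, filename_or_obj, **kwargs):
        # Sync entry point: gather one store per zarr group, then block on the
        # async helper, which opens all groups concurrently via asyncio.gather().
        stores = self._collect_group_stores(filename_or_obj, **kwargs)
        return zarr_sync(self._open_datatree_from_stores_async(stores, **kwargs))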

Please let me know your thoughts on this.

The async implementation uses zarr.core.sync which only exists in
zarr v3. Add a conditional check using _zarr_v3() to:
- Use async path with zarr_sync() for zarr v3 (concurrent loading)
- Fall back to sequential loading for zarr v2

This fixes CI failures on min-versions environment which uses zarr v2.
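
The dispatch ends up looking roughly like this (sketch; _open_groups_async and _open_one_group are placeholder names for the concurrent and sequential paths):

    def _open_groups(stores):
        # zarr.core.sync only exists in zarr v3, so gate the concurrent path
        # on xarray's existing _zarr_v3() version check.
        if _zarr_v3():
            from zarr.core.sync import sync as zarr_sync
            return zarr_sync(_open_groups_async(stores))
        # zarr v2: fall back to opening each group sequentially.
        return {path: _open_one_group(store) for path, store in stores.items()}
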
@TomNicholas
Member

OK, let's try to reach some initial resolution about the async strategy for Xarray over in #10622 first!

My understanding of that issue is that people thought it should be zarr's responsibility to offer an API that xarray could use (e.g. open_many_groups_async). But OTOH @aladinor's implementation looks great, and it's all internal, so shall we just get this merged?

@aladinor have you benchmarked this at scale? Creating a graph like this one would be really interesting.

- Add helper methods _build_group_members and _create_stores_from_members
  to reduce code duplication between sync and async store opening
- Use zarr_sync() to run async index creation in _datatree_from_backend_datatree
  for zarr engine, making open_datatree fully async behind the scenes
- Fix missing chunks validation and source encoding in open_datatree_async
- Add tests for chunks validation, source encoding, and chunks parameter
- Add type annotations to nested async functions in _datatree_from_backend_datatree
  to fix mypy annotation-unchecked notes breaking pytest-mypy-plugins tests
- Use os.path.join and os.path.normpath in test_async_source_encoding
  for cross-platform compatibility on Windows
Add type annotations to _maybe_create_default_indexes_async and its
nested functions (load_var, create_index, _create) to satisfy mypy's
annotation-unchecked checks. Also add Variable and Hashable imports
to the TYPE_CHECKING block.

This fixes pytest-mypy-plugins tests that were failing due to mypy
emitting annotation-unchecked notes for untyped nested functions.
- Remove open_datatree_async() from api.py (public API)
- Remove open_datatree_async() from zarr.py (backend method)
- Keep internal async optimization in _datatree_from_backend_datatree()
- Use _zarr_v3() for proper zarr version check instead of ImportError
- Update tests to only test internal async functionality
- Add test to verify sync open_datatree uses async internally for zarr v3

The async optimization is now internal only - users call the sync
open_datatree() which automatically uses async index creation for
zarr v3 backends.
@aladinor aladinor requested a review from keewis January 16, 2026 15:54
Collaborator

@keewis keewis left a comment


I'm not an expert on the zarr backend, but this mostly looks good to me.

This is not done for the sync versions either, so I don't think this has to be done in this PR, but logically I think xr.open_datatree(...) == xr.DataTree.from_dict(xr.open_groups(...)), so it might make sense to have _open_datatree_async call _open_groups_as_dict_async?

aladinor and others added 3 commits January 16, 2026 11:23
Benchmarking showed async index creation provides no measurable benefit
since it's CPU-bound work. Simplified to sync loop per reviewer feedback.
@aladinor
Contributor Author

@keewis, you're right that open_datatree ≈ DataTree.from_dict(open_groups_as_dict(...)) and refactoring to share code would reduce duplication. Since this affects both sync and async paths, I'll address it in a follow-up PR to keep this one focused. I've noted the differences (semaphore, index creation) that need to be unified.

Development

Successfully merging this pull request may close these issues.

open_dataset creates default indexes sequentially, causing significant latency in cloud high-latency stores