Support glob patterns in open_datatree(group=...) for selective group loading #11302
aladinor wants to merge 16 commits into pydata:main
Add _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter to common.py for detecting and applying glob patterns to group paths.
Use _resolve_group_and_filter in open_groups_as_dict to support glob patterns in the group parameter for selective group loading.
Update docstrings for the group kwarg in open_datatree and open_groups to describe glob metacharacter behavior.
Add integration tests for netCDF4, h5netcdf, and zarr backends, plus unit tests for _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter covering *, ?, and [] metacharacters.
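A minimal sketch of the detect-then-dispatch step these commits describe. The function names below are hypothetical stand-ins, not the PR's private helpers, and the `(re-root group, filter pattern)` return shape is an assumption made for illustration:

```python
from __future__ import annotations

# The three metacharacters named in the PR: *, ? and [
GLOB_METACHARACTERS = frozenset("*?[")


def is_glob_pattern(group: str | None) -> bool:
    """Return True when ``group`` contains any glob metacharacter."""
    return group is not None and any(ch in GLOB_METACHARACTERS for ch in group)


def resolve_group(group: str | None) -> tuple[str | None, str | None]:
    """Split ``group`` into (group to re-root at, pattern to filter by).

    Hypothetical return shape: exactly one of the two is set, or neither
    when ``group`` is None.
    """
    if is_glob_pattern(group):
        return None, group  # filter mode: keep the root, select by pattern
    return group, None  # plain mode: re-root at ``group`` (or open everything)
```

Under these assumptions, `resolve_group("VCP-34")` returns `("VCP-34", None)` and `resolve_group("*/sweep_0")` returns `(None, "*/sweep_0")`, so a backend could branch on which element is set.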
e892524 to 5fb46e1
@aladinor Thanks, that's a great feature. I'd use it instantly. There might be some pitfalls if group names contain one or more of the glob metacharacters. Will this be handled, too?
XRef: h5py/h5py#2059 for discussion of adding globbing in h5py
@kmuehlbauer, thanks for taking the time to check this out.
This seems to be a strange way to name a group, but yes. It will work via the same character-class escape that …
For example, if we have something like this:
    paths = ['/my_nifty_group_with_a_star_*_01',
             '/my_nifty_group_with_a_star_*_11',
             '/my_nifty_group_with_a_star_*_12']
We can use this pattern to get those groups
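As an illustration of the character-class escape, here is a small sketch using only the standard library's `fnmatch`. The patterns and the extra `/plain_01` path are assumptions chosen for the example, and the PR's own filter may differ in detail:

```python
from fnmatch import fnmatchcase

paths = [
    "/my_nifty_group_with_a_star_*_01",
    "/my_nifty_group_with_a_star_*_11",
    "/my_nifty_group_with_a_star_*_12",
    "/plain_01",  # added here only to show that non-matching names drop out
]

# ``[*]`` is a one-character class containing only ``*``, so it matches a
# literal star; the trailing ``?`` still acts as a single-character wildcard.
pattern = "/my_nifty_group_with_a_star_[*]_1?"
selected = [p for p in paths if fnmatchcase(p, pattern)]

# Any name that contains a literal star anywhere.
with_star = [p for p in paths if fnmatchcase(p, "*[*]*")]
```

Here `selected` holds the `_11` and `_12` groups, and `with_star` holds the three starred names but not `/plain_01`.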
Add coverage for group names containing literal ``*`` / ``?`` / ``[``. These are reachable with ``[*]`` / ``[?]`` / ``[[]`` character-class escaping (inherited from ``fnmatch`` / ``PurePath.match`` semantics).
New tests:
- ``test_open_datatree_glob_char_class_escape_literal_metachar`` on ``NetCDFIOBase`` and ``TestZarrDatatreeIO``: end-to-end verification that groups with literal metacharacters in their names can be targeted across all supported backends.
- ``test_filter_group_paths_literal_metachar_via_char_class`` on ``TestGlobPatternUtilities``: unit-level check of the filter.
Explain that matching follows ``fnmatch`` / :py:meth:`pathlib.PurePath.match` semantics and that literal ``*`` / ``?`` / ``[`` in group names can be targeted via character-class escapes (``[*]``, ``[?]``, ``[[]``), with a short example. Applied to both :py:func:`open_datatree` and :py:func:`open_groups` for consistency.
Add ``/plain_01`` to the zarr ``test_open_datatree_glob_char_class_escape_literal_metachar`` fixture so it matches the NetCDF version and confirms plain (no-metachar) group names are excluded when the pattern targets literal-metachar names.
Windows forbids ``*`` and ``?`` in filesystem directory/file names, and zarr stores each group as an on-disk directory. That makes writing the fixture impossible before the test can exercise the filter. NetCDF4/H5 store groups inside the HDF5 container so they are unaffected. Skip the zarr variant on Windows with a clear reason; the NetCDF variants still cover the escape behavior on all platforms.
The previous commit skipped the zarr variant on Windows because the filesystem rejects ``*`` and ``?`` in directory names. Using ``zarr.storage.MemoryStore`` side-steps the filesystem entirely, so the test now runs on every platform and still exercises the escape logic. This is also a more realistic target for the feature on Windows — users who hit group names with glob metacharacters are likely reading from cloud/icechunk stores (dict-keyed like ``MemoryStore``), not an on-disk zarr directory tree.
``open_datatree``'s static signature doesn't list zarr store objects (``MemoryStore`` etc.) among its accepted first-argument types, but the zarr backend handles them correctly at runtime. Apply a narrow ``# type: ignore[arg-type]`` on the three test calls rather than widening the public signature.
@aladinor Thanks for adding the glob escapes. Is this ready from your side?
Yep, it is ready to merge @kmuehlbauer
kmuehlbauer left a comment
This is looking good to me. Can't say much wrt typing, though.
@pydata/xarray Another set of eyes would be much appreciated here. If there are no concerns, I'd move on and merge early next week. Thanks!
Summary
When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.

Use cases
- xr.open_datatree("radar.nc", group="*/sweep_0"): load only the lowest elevation sweep from each volume scan
- xr.open_datatree("cmip.zarr", group="*/historical/tas"): load only temperature across all models

Changes
- Add _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py
- … DataTree.match() (PurePosixPath.match)
- Root (/) and all ancestors of matched nodes are always included to form a valid tree

Behavior summary
group value:
- None
- "VCP-34" (no glob chars)
- "*/sweep_0" (glob chars)

- … open_datatree(group=...) for selective group loading #11196
- whats-new.rst
- api.rst

Test plan
- Unit tests for _is_glob_pattern, _filter_group_paths, _resolve_group_and_filter with *, ?, []
- … open_groups API
- test_backends_datatree.py suite passes (228 passed, 0 failures)
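
The filtering rule described under Changes (match with PurePosixPath.match semantics, then always keep the root and every ancestor of a matched node) can be sketched roughly as follows. The function name and the choice to match on the path relative to the root are assumptions for illustration, not the PR's actual implementation:

```python
from pathlib import PurePosixPath


def filter_group_paths(paths: list[str], pattern: str) -> list[str]:
    """Keep paths matching ``pattern``, plus the root and all ancestors.

    Hypothetical helper: matching is done on the path relative to the
    root, which is an assumption about how the pattern is applied.
    """
    keep = {"/"}  # the root is always part of the resulting tree
    for path in paths:
        if path == "/":
            continue
        relative = PurePosixPath(path.lstrip("/"))
        if relative.match(pattern):
            node = PurePosixPath(path)
            keep.add(str(node))
            # Ancestors are needed so the matched node stays reachable.
            keep.update(str(parent) for parent in node.parents)
    return sorted(keep)


all_groups = ["/", "/vcp1", "/vcp1/sweep_0", "/vcp1/sweep_1", "/vcp2", "/vcp2/sweep_0"]
kept = filter_group_paths(all_groups, "*/sweep_0")
```

In this sketch `/vcp1` and `/vcp2` do not match `*/sweep_0` themselves; they are kept only because they are ancestors of matched nodes, while `/vcp1/sweep_1` is dropped.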