
Support buffer IO for .h5ad #800

Closed
flying-sheep wants to merge 13 commits into master from support-buffers

Conversation

@flying-sheep (Member) commented Aug 8, 2022:

So far we only support reading/writing from paths; this PR adds tests and type hints for binary buffers / file-like objects.

  • TODO: tests

@codecov (bot) commented Aug 8, 2022:

Codecov Report

Merging #800 (ef3e1e9) into master (cc3ba6f) will decrease coverage by 0.05%.
The diff coverage is 86.95%.

@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
- Coverage   83.21%   83.15%   -0.06%     
==========================================
  Files          34       34              
  Lines        4450     4458       +8     
==========================================
+ Hits         3703     3707       +4     
- Misses        747      751       +4     
Impacted Files Coverage Δ
anndata/_io/h5ad.py 90.82% <81.25%> (-1.16%) ⬇️
anndata/_core/anndata.py 83.45% <100.00%> (+0.04%) ⬆️
anndata/_core/merge.py 93.73% <0.00%> (-0.28%) ⬇️

@flying-sheep flying-sheep marked this pull request as ready for review August 8, 2022 12:29
@flying-sheep flying-sheep changed the title Suport io from buffers for .h5ad Formalize support for buffer IO for .h5ad Aug 10, 2022
Comment on lines +85 to +90
elif not (
adata.isbacked
and (
not isinstance(file, (str, Path)) or Path(adata.filename) == Path(file)
)
):
@flying-sheep (Member, Author):

does this make sense?

Comment on lines +138 to +143
def _maybe_get_file_name(file) -> Path | None:
if isinstance(file, (str, Path)):
return Path(file)
if hasattr(file, "name"):
return Path(file.name)
return None
@flying-sheep (Member, Author):

best effort to set filename. If we can’t, we can’t.

@ivirshup (Member):

General question about the PR: why buffers specifically, and not h5py.File / h5py.Group or zarr stores directly?

@flying-sheep flying-sheep changed the title Formalize support for buffer IO for .h5ad Support for buffer IO for .h5ad Sep 9, 2022
@flying-sheep flying-sheep changed the title Support for buffer IO for .h5ad Support buffer IO for .h5ad Sep 9, 2022
@flying-sheep (Member, Author) commented Sep 9, 2022:

That’s also a good idea, but this PR is about the most generic serialization format: a byte stream.

We can add the others at a later date.

See quiltdata/quilt#2974 for motivation to get this in quickly.

@ivirshup (Member):

Currently, this works:

import anndata as ad
from anndata.experimental import read_elem, write_elem

import h5py

from io import BytesIO

adata = ad.read_h5ad("/Users/isaac/data/pbmc3k_raw.h5ad")

# h5py accepts file-like objects, so an in-memory buffer works as a backing store
bio = BytesIO()
with h5py.File(bio, "w") as f:
    write_elem(f, "/", adata)

# reopen the same buffer and deserialize the whole tree back into an AnnData
with h5py.File(bio, "r") as f:
    from_bytes = read_elem(f["/"])

I don't think I want to directly support file-like objects as an input type to read_h5ad. It's more to support on our side, and the docs in h5py for this feature are full of "use at your own risk" warnings. See also h5py/h5py#1698

@ivirshup (Member):

What is the use case here? I'm not too familiar with Quilt.

@flying-sheep (Member, Author):

Check out their promo material: https://quiltdata.com/

Basically it’s a catalog to browse and manage in-house or public data. The data is stored in versioned “packages” (file trees on S3) together with searchable/queryable metadata. Metadata can be validated using schemas.

But if read_elem/write_elem work, it’s true, they can just use that.

@flying-sheep flying-sheep deleted the support-buffers branch September 14, 2022 09:51
@ivirshup (Member):

For quilt:

I saw that, but the specifics were a little unclear. It looks like it’s S3-only?

I think there are some similarities to things I’d like to do with zarr and OME, but I’m not quite sure yet.

For the specific use case: are you expecting the data to always be delivered as streams? If so, I think pickling would have less overhead from chunking the arrays. If the data is in the cloud, zarr may be a nicer choice of storage format (plus it defaults to more modern compression libraries).

@flying-sheep (Member, Author):

I agree about zarr; hdf5 is a legacy thing. I also don’t know if their (de)serialization supports S3 file hierarchies as opposed to individual files.

Regarding pickling: I avoid that format for anything but caches because of its lack of stability.

@ivirshup (Member):

One can have backwards-compatible pickled objects. I believe both numpy and pandas do this.

I don’t think it should be too hard for us, as anndata is basically builtin Python types plus those. I’d be up for a PR implementing this.

Of course, you could then have references to items which don’t have compatibility guarantees.
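For context, the usual pattern for backwards-compatible pickles is a versioned `__getstate__`/`__setstate__` pair. A toy sketch follows; the class name and fields are invented for illustration and are not anndata's actual design:

```python
import pickle


class Container:
    """Hypothetical versioned-pickle pattern, not anndata's implementation."""

    STATE_VERSION = 2

    def __init__(self, X=None, layers=None):
        self.X = X
        self.layers = {} if layers is None else layers

    def __getstate__(self):
        # Always embed a version marker so future readers can migrate.
        return {"version": self.STATE_VERSION, "X": self.X, "layers": self.layers}

    def __setstate__(self, state):
        if state.get("version", 1) < 2:
            # v1 pickles predate `layers`; fill the new field with a default.
            state = {**state, "layers": {}}
        self.X = state["X"]
        self.layers = state["layers"]


# Round trip with the current format:
obj = pickle.loads(pickle.dumps(Container(X=[1, 2], layers={"counts": [1, 1]})))
assert obj.layers == {"counts": [1, 1]}

# Simulate loading a pickle written by an older release (v1 state, no `layers`):
old = Container.__new__(Container)
old.__setstate__({"X": [1, 2]})
assert old.layers == {}
```

This only protects attributes the class itself controls; references to third-party objects embedded in the state still carry those objects' own pickle guarantees, which is the caveat about external references.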

@flying-sheep (Member, Author):

Yeah, that’s the problem, right? It’s easy for something to sneak in and break everything. Recovery is near impossible.
