
Support buffer IO for .h5ad #800

Closed
flying-sheep wants to merge 13 commits into master from support-buffers

Conversation

@flying-sheep (Member) commented Aug 8, 2022:

So far we only support reading/writing from paths; this PR adds tests and type hints for binary buffers / file-like objects.

  • TODO: tests

@codecov (bot) commented Aug 8, 2022:

Codecov Report

Merging #800 (ef3e1e9) into master (cc3ba6f) will decrease coverage by 0.05%.
The diff coverage is 86.95%.

@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
- Coverage   83.21%   83.15%   -0.06%     
==========================================
  Files          34       34              
  Lines        4450     4458       +8     
==========================================
+ Hits         3703     3707       +4     
- Misses        747      751       +4     
Impacted Files Coverage Δ
anndata/_io/h5ad.py 90.82% <81.25%> (-1.16%) ⬇️
anndata/_core/anndata.py 83.45% <100.00%> (+0.04%) ⬆️
anndata/_core/merge.py 93.73% <0.00%> (-0.28%) ⬇️

@flying-sheep flying-sheep marked this pull request as ready for review August 8, 2022 12:29
@flying-sheep flying-sheep changed the title Suport io from buffers for .h5ad Formalize support for buffer IO for .h5ad Aug 10, 2022
Comment on lines +85 to +90
elif not (
adata.isbacked
and (
not isinstance(file, (str, Path)) or Path(adata.filename) == Path(file)
)
):
@flying-sheep (Member, Author):

does this make sense?

Comment on lines +138 to +143
def _maybe_get_file_name(file) -> Path | None:
if isinstance(file, (str, Path)):
return Path(file)
if hasattr(file, "name"):
return Path(file.name)
return None
@flying-sheep (Member, Author):

best effort to set filename. If we can’t, we can’t.

@ivirshup (Member):

General question about the PR: why buffers specifically, and not h5py.File / h5py.Group or zarr stores directly?

@flying-sheep flying-sheep changed the title Formalize support for buffer IO for .h5ad Support for buffer IO for .h5ad Sep 9, 2022
@flying-sheep flying-sheep changed the title Support for buffer IO for .h5ad Support buffer IO for .h5ad Sep 9, 2022
@flying-sheep (Member, Author) commented Sep 9, 2022:

That’s also a good idea, but this PR is about the most generic serialization format: a byte stream.

We can add the others at a later date.

See quiltdata/quilt#2974 for motivation to get this in quickly.

@ivirshup (Member):

Currently, this works:

import anndata as ad
from anndata.experimental import read_elem, write_elem

import h5py

from io import BytesIO

adata = ad.read_h5ad("/Users/isaac/data/pbmc3k_raw.h5ad")

# h5py accepts file-like objects, so an in-memory buffer works as a backing store
bio = BytesIO()
with h5py.File(bio, "w") as f:
    write_elem(f, "/", adata)

# reopen the same buffer and deserialize the whole tree back into an AnnData
with h5py.File(bio, "r") as f:
    from_bytes = read_elem(f["/"])

I don't think I want to directly support file-like objects as an input type to read_h5ad. It's more to support on our side, and the docs in h5py for this feature are full of "use at your own risk" warnings. See also h5py/h5py#1698

@ivirshup (Member):

What is the use case here? I'm not too familiar with Quilt.

@flying-sheep (Member, Author):

Check out their promo material: https://quiltdata.com/

Basically it’s a catalog to browse and manage in-house or public data. The data is stored in versioned “packages” (file trees on S3) together with searchable/queryable metadata. Metadata can be validated using schemas.

But if read_elem/write_elem work, it’s true, they can just use that.

@flying-sheep flying-sheep deleted the support-buffers branch September 14, 2022 09:51
@ivirshup (Member):

For quilt:

I saw that, but the specifics were a little unclear. It looks like it’s S3-only?

I think there are some similarities to things I’d like to do with zarr and OME, but I’m not quite sure yet.

For the specific use case: are you expecting the data to always be delivered as streams? If so, I think pickling would have less overhead from chunking the arrays. If the data is in the cloud, zarr may be a nicer choice of storage format (plus it defaults to more modern compression libraries).

@flying-sheep (Member, Author):

I agree about zarr; hdf5 is a legacy thing. I also don’t know if their (de)serialization supports S3 file hierarchies as opposed to individual files.

Regarding pickling: I avoid that format for anything but caches because of its lack of stability.

@ivirshup (Member):

One can have backwards-compatible pickled objects. I believe both numpy and pandas do this.

I don’t think it should be too hard for us, as anndata is basically builtin Python types plus those. I’d be up for a PR implementing this.

Of course, you could then have references to items which don’t have compatibility guarantees.
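For context, the usual pattern for backwards-compatible pickles is a versioned `__getstate__`/`__setstate__` pair. A toy sketch follows; the class name and fields are invented for illustration and are not anndata's actual design:

```python
import pickle


class Container:
    """Hypothetical versioned-pickle pattern, not anndata's implementation."""

    STATE_VERSION = 2

    def __init__(self, X=None, layers=None):
        self.X = X
        self.layers = {} if layers is None else layers

    def __getstate__(self):
        # Always embed a version marker so future readers can migrate.
        return {"version": self.STATE_VERSION, "X": self.X, "layers": self.layers}

    def __setstate__(self, state):
        if state.get("version", 1) < 2:
            # v1 pickles predate `layers`; fill the new field with a default.
            state = {**state, "layers": {}}
        self.X = state["X"]
        self.layers = state["layers"]


# Round trip with the current format:
obj = pickle.loads(pickle.dumps(Container(X=[1, 2], layers={"counts": [1, 1]})))
assert obj.layers == {"counts": [1, 1]}

# Simulate loading a pickle written by an older release (v1 state, no `layers`):
old = Container.__new__(Container)
old.__setstate__({"X": [1, 2]})
assert old.layers == {}
```

This only protects attributes the class itself controls; references to third-party objects embedded in the state still carry those objects' own pickle guarantees, which is the caveat about external references.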

@flying-sheep (Member, Author):

Yeah, that’s the problem, right? It’s easy for something to sneak in and break everything. Recovery is near impossible.
