Support buffer IO for .h5ad #800
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
- Coverage   83.21%   83.15%   -0.06%
==========================================
  Files          34       34
  Lines        4450     4458       +8
==========================================
+ Hits         3703     3707       +4
- Misses        747      751       +4
elif not (
    adata.isbacked
    and (
        not isinstance(file, (str, Path)) or Path(adata.filename) == Path(file)
    )
):
does this make sense?
def _maybe_get_file_name(file) -> Path | None:
    if isinstance(file, (str, Path)):
        return Path(file)
    if hasattr(file, "name"):
        return Path(file.name)
    return None
best effort to set filename. If we can’t, we can’t.
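A quick illustration of how this helper resolves a name for the different input types it might see (my own sketch, using the helper from the diff above; the file names are placeholders):

from io import BytesIO
from pathlib import Path

_maybe_get_file_name("out.h5ad")        # Path("out.h5ad"): plain string
_maybe_get_file_name(Path("out.h5ad"))  # Path("out.h5ad"): already a Path
_maybe_get_file_name(BytesIO())         # None: an in-memory buffer has no .name
with open("out.h5ad", "wb") as f:
    _maybe_get_file_name(f)             # Path("out.h5ad"): open file objects expose .name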
General question about the PR: why buffers specifically and not …?
That’s also a good idea, but this PR is about the most generic serialization format: a byte stream. We can add the others at a later date. See quiltdata/quilt#2974 for motivation to get this in quickly.
Currently, this works:

import anndata as ad
from anndata.experimental import read_elem, write_elem
import h5py
from io import BytesIO

adata = ad.read_h5ad("/Users/isaac/data/pbmc3k_raw.h5ad")

bio = BytesIO()
with h5py.File(bio, "w") as f:
    write_elem(f, "/", adata)
with h5py.File(bio, "r") as f:
    from_bytes = read_elem(f["/"])

I don't think I want to directly support file-like objects as an input type to …
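As a small follow-up to the snippet above (my own addition, using only the objects already defined there): the buffer holds a complete HDF5 file, so its raw bytes can be pulled out and re-read elsewhere.

raw = bio.getvalue()                  # raw bytes of a complete HDF5 file, ready to ship around
with h5py.File(BytesIO(raw), "r") as f:
    roundtripped = read_elem(f["/"])  # rebuild the AnnData from the byte payload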
What is the use case here? I'm not too familiar with Quilt.
Check out their promo material: https://quiltdata.com/ Basically it’s a catalog to browse and manage in-house or public data. The data is stored in versioned “packages” (file trees on S3) together with searchable/queryable metadata. Metadata can be validated using schemas. But if …
For quilt: I saw that, but the specifics were a little unclear. It looks like it's S3-only? I think there are some similarities to things I'd like to do with zarr and OME, but I'm not quite sure yet. For the specific use case, are you expecting the data to always be delivered as streams? If so, I think pickling would have less overhead from chunking the arrays. If the data is on the cloud, …
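(For reference, a minimal sketch of the pickle round-trip being weighed here against the h5py/BytesIO approach above; it is not part of this PR, and it assumes the AnnData object pickles cleanly, which it generally does since it is composed of numpy/pandas containers. The input path is a placeholder.)

import pickle

import anndata as ad

adata = ad.read_h5ad("pbmc3k_raw.h5ad")  # placeholder path

payload = pickle.dumps(adata, protocol=pickle.HIGHEST_PROTOCOL)  # single byte blob, no HDF5 chunking
restored = pickle.loads(payload)                                 # round-trip back to an AnnData object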
I agree about zarr; hdf5 is a legacy thing. I also don’t know whether their (de)serialization supports S3 file hierarchies as opposed to individual files. Regarding pickling: I avoid that format for anything but caches because of its lack of stability.
One can have backwards-compatible pickled objects. I believe both numpy and pandas do this. I don't think it would be too hard for us, since anndata is basically built-in Python types plus those. I'd be up for a PR implementing this. Of course, you could then have references to items which don't have compatibility guarantees.
yeah, that’s the problem, right? It’s easy for something to sneak in and break everything, and recovery is near impossible.
So far we only support reading/writing from paths; this PR adds tests and type hints for binary buffers / file-like objects.
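A hedged sketch of the usage this PR is aiming at (it assumes read_h5ad and AnnData.write_h5ad accept binary file-like objects after this change; the file name is a placeholder):

from io import BytesIO

import anndata as ad

adata = ad.read_h5ad("pbmc3k_raw.h5ad")  # placeholder path

buf = BytesIO()
adata.write_h5ad(buf)             # write straight into an in-memory buffer
buf.seek(0)                       # rewind before reading back
roundtripped = ad.read_h5ad(buf)  # read from the same buffer instead of a path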