Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Implement read, seek, and tell for SOMAVFSFilebuf #3543

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Jan 10, 2025

Issue and/or context:

[sc-61381]

Although h5py does support ingesting S3 files directly, as already noted in our documentation, anndata uses the default driver which means it expects local files with os.PathLike URIs. This is what motivates us to implement the _FSPathWrapper so that anndata can pass the file to h5ad and parse it as a file-like object. However, in order to actually be a file-like object, we need to implement the read and seek methods. Otherwise, when make_fid attempts to open the file, it errors out with FileNotFoundError because the object does not exist as a file-like object even though an object does exist at the given URI.

This regression was introduced during the C++-ification work when tiledb.VFS() (which fully implements all file I/O methods required by h5py) and replaced with clib.SOMAVFS and clib.SOMAVFSFilebuf.

Changes:

Implement read, seek, and tell for SOMAVFSFilebuf.

Notes for Reviewer:

Proof that the original bug is fixed for 1.15:

(py311) vivian@mangonada:~/TileDB-SOMA$ cat ~/tiledb-bugs/s3-fails.py
#!/usr/bin/env python3

import tempfile
from pathlib import Path
import pathlib

import tiledbsoma as soma
import tiledbsoma.pytiledbsoma as clib
import anndata as ad
import tiledbsoma.io
import tiledb

from anndata._core import file_backing
from unittest import mock

tiledbsoma.show_package_versions()

vfs = tiledb.VFS()
S3_BUCKET = "s3://tiledb-johnkerl/s/a/stack-small/"

# # List the files in the S3 bucket, verifying env has access to the bucket
h5ad_files = vfs.ls(S3_BUCKET)
print(f"Discovered {len(h5ad_files)} files in {S3_BUCKET}")

# # Ingest the first file
h5ad_file = h5ad_files[0]
print(f"Ingesting {h5ad_file}")

print("...starting ingestion")
experiment_uri: str = tiledbsoma.io.from_h5ad(
    input_path=h5ad_file,
    experiment_uri=tempfile.mkdtemp(prefix=f"soma-{Path(h5ad_file).stem}-"),
    measurement_name="RNA",
)

print(f"Experiment URI: {experiment_uri}")
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print("...obs data:")
    print(exp.obs.read().concat().to_pandas())
(py311) vivian@mangonada:~/TileDB-SOMA$ python ~/tiledb-bugs/s3-fails~/tiledb-bugs/s3-fails.py
tiledbsoma.__version__              1.15.0rc0.post225.dev28053676759
TileDB core version (libtiledbsoma) 2.27.0
python version                      3.11.11.final.0
OS version                          Linux 5.15.167.4-microsoft-standard-WSL2
Discovered 4 files in s3://tiledb-johnkerl/s/a/stack-small/
Ingesting s3://tiledb-johnkerl/s/a/stack-small/stack1.h5ad
...starting ingestion
Experiment URI: /tmp/soma-stack1-r3wgnyn4
...obs data:
   soma_joinid obs_id cell_type  is_primary_data
0            0   AAAT    B cell             True
1            1   ACTG    T cell             True
2            2   AGAG    B cell             True

No new tests were added because this only affects files on S3. However, this bug is further motivation to introduce testing with cloud storage.

Copy link

codecov bot commented Jan 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.30%. Comparing base (4d536c7) to head (fd39016).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3543      +/-   ##
==========================================
+ Coverage   86.27%   86.30%   +0.03%     
==========================================
  Files          55       55              
  Lines        6369     6369              
==========================================
+ Hits         5495     5497       +2     
+ Misses        874      872       -2     
Flag Coverage Δ
python 86.30% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 86.30% <ø> (+0.03%) ⬆️
libtiledbsoma ∅ <ø> (∅)

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥
Thank you @nguyenv !

@nguyenv nguyenv merged commit f832dfc into main Jan 10, 2025
12 checks passed
@nguyenv nguyenv deleted the viviannguyen/sc-61381/python-from-h5ad-fails-with-filenotfounderror branch January 10, 2025 15:38
johnkerl pushed a commit that referenced this pull request Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants