Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate fsspec to enable accessing WFDB files from cloud URIs #523

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

briangow
Copy link
Contributor

@briangow briangow commented Jan 6, 2025

As mentioned in #517, we want to be able to read WFDB files from within cloud environments using WFDB-Python. This PR enables using the fsspec library ( https://filesystem-spec.readthedocs.io/en/latest/ ) to read WFDB files from cloud URIs. It replaces the standard Python open with fsspec.open . Also, it adds logic to differentiate between loading a file from a cloud URI or from a PhysioNet Database.

In the initial commit, access has only been added for rdheader. We can expand this across all relevant WFDB functions once the approach has been agreed upon.

I've tested this with a local .hea file, a file read from a PhysioNet Database (using pn_dir), and a file from a Datastore in the Azure AI / ML Studio.

@briangow briangow requested a review from bemoody January 6, 2025 19:22
@briangow
Copy link
Contributor Author

briangow commented Jan 6, 2025

@bemoody , could you suggest how to install fsspec in the test environment. Trying apt-get install -y --no-install-recommends \ python3-fsspec \ doesn't work since python3-fsspec isn't recognized. When trying to use pip instead to install fsspec I get the pip: not found error currently seen in the tests (test-deb10-i386).

@briangow
Copy link
Contributor Author

briangow commented Jan 9, 2025

@bemoody , I've updated this per our discussion:

  • The tests are now being run on Debian 11, where python3-fsspec is able to be installed
  • I removed changes to _stream_header so the pn_dir approach is no longer being changed at all
  • I updated logic in rdheader so that each of the 3 cases (cloud, pn_dir / PhysioNet servers, local) are handled separately. This is being done so that the appropriate path separators are used and pn_dir gets sent to _stream_header.

@bemoody
Copy link
Collaborator

bemoody commented Jan 17, 2025

I think the general idea makes sense: allow record_name to be either a (partial) path or a URL.

In determining whether something is a URL, we should require it to begin with <protocol>:// (i.e., include the double slash in CLOUD_PROTOCOLS.)

This code currently doesn't work because in line 1835 you are opening the file in binary mode ("rb"), whereas rdheader wants text ("r").

(Note that, completely independent of this pull request, the use of errors="ignore" in line 1855 is totally wrong. But it is probably best to keep behavior consistent between local and cloud files.)

@briangow
Copy link
Contributor Author

Thanks @bemoody !

In determining whether something is a URL, we should require it to begin with :// (i.e., include the double slash in CLOUD_PROTOCOLS.)

Makes sense, I've updated this.

This code currently doesn't work because in line 1835 you are opening the file in binary mode ("rb"), whereas rdheader wants text ("r").

Strange, I had already made this change on my local copy but it didn't make it to the remote.

@briangow
Copy link
Contributor Author

briangow commented Jan 17, 2025

@bemoody , I'll continue with this PR by updating the following in a similar manner:

  • wfdb.io.rdrecord
  • wfdb.io.rdsamp
  • wfdb.io.rdann

Do you think we should also integrate fsspec into the write functions / wrsamp ? If so, how should we deal with the soundfile writes?

sf.write(d_signal)
Similarly, we'll also need to figure this out for reading compressed files:
sf.seek(start_samp + sample_offset)

Please let me know if there are other areas where I should be integrating fsspec.

@bemoody
Copy link
Collaborator

bemoody commented Jan 17, 2025

Similarly, we'll also need to figure this out for reading compressed files:

Currently, to read a signal file, we open it using open(filename, "rb"), right? And then the resulting file object is passed to the SoundFile constructor. Hopefully the same should work using an fsspec file object.

Do you think we should also integrate fsspec into the write functions / wrsamp ?

Yes, that would be ideal, but as a separate pull request. I haven't tried, but I assume you can use "w"/"wb" mode with fsspec.open if you have appropriate credentials to write to the specified location.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants