Integrate `fsspec` to enable accessing WFDB files from cloud URIs #523

briangow · 2025-01-06T19:22:32Z

As mentioned in #517, we want to be able to read WFDB files from within cloud environments using WFDB-Python. This PR uses the fsspec library (https://filesystem-spec.readthedocs.io/en/latest/) to read WFDB files from cloud URIs. It replaces the standard Python open with fsspec.open. It also adds logic to distinguish between loading a file from a cloud URI and loading one from a PhysioNet database.

In the initial commit, access has only been added for rdheader. We can expand this across all relevant WFDB functions once the approach has been agreed upon.

I've tested this with a local .hea file, a file read from a PhysioNet Database (using pn_dir), and a file from a Datastore in the Azure AI / ML Studio.

briangow · 2025-01-06T20:25:56Z

@bemoody, could you suggest how to install fsspec in the test environment? Trying apt-get install -y --no-install-recommends python3-fsspec doesn't work since python3-fsspec isn't recognized. When I try to install fsspec with pip instead, I get the pip: not found error currently seen in the tests (test-deb10-i386).

briangow · 2025-01-09T22:14:05Z

@bemoody , I've updated this per our discussion:

The tests now run on Debian 11, where python3-fsspec can be installed.
I removed the changes to _stream_header, so the pn_dir approach is no longer changed at all.
I updated the logic in rdheader so that each of the 3 cases (cloud, pn_dir / PhysioNet servers, local) is handled separately. This ensures that the appropriate path separators are used and that pn_dir gets sent to _stream_header.
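
For readers following along, here is a rough sketch of how the three cases might be separated. It is not the code in this PR: header_location is a hypothetical helper, and the CLOUD_PROTOCOLS values are assumed for illustration. It only shows the separator handling described above (no os.path functions on URIs, abspath for local files only).

```python
import os
import posixpath

# Assumed values for illustration; the PR's actual CLOUD_PROTOCOLS may differ.
CLOUD_PROTOCOLS = ("s3://", "gs://", "azureml://")


def header_location(record_name, pn_dir=None):
    """Hypothetical helper: return (case, path) for the record's .hea file."""
    if record_name.startswith(CLOUD_PROTOCOLS):
        # URIs always use forward slashes, so no os.path functions are applied.
        return "cloud", record_name + ".hea"
    if pn_dir is not None:
        # PhysioNet locations are URL paths, so join them with posixpath.
        return "pn_dir", posixpath.join(pn_dir, record_name + ".hea")
    # Only local files get abspath and the platform's path separator.
    return "local", os.path.abspath(record_name + ".hea")


print(header_location("s3://bucket/db/100"))
print(header_location("100", pn_dir="mitdb/1.0.0"))
print(header_location("100")[0])
```

In this sketch the cloud check runs first, so a URI is never passed through abspath, which would otherwise rewrite it relative to the current directory.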

bemoody · 2025-01-17T18:49:31Z

I think the general idea makes sense: allow record_name to be either a (partial) path or a URL.

In determining whether something is a URL, we should require it to begin with <protocol>:// (i.e., include the double slash in CLOUD_PROTOCOLS.)
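
A small sketch of why the double slash matters, using made-up protocol lists (both tuples below are assumptions, not the PR's actual values). With bare prefixes, an ordinary local directory whose name happens to start with a protocol name is mistaken for a URL.

```python
# Hypothetical protocol lists, for illustration only.
BARE_PROTOCOLS = ("s3", "gs", "az")
CLOUD_PROTOCOLS = ("s3://", "gs://", "az://")


def is_cloud_uri(record_name, protocols=CLOUD_PROTOCOLS):
    """Return True if record_name starts with one of the given prefixes."""
    return record_name.startswith(protocols)


local_record = "s3_recordings/100"  # a local relative path, not a URL

print(is_cloud_uri("s3://bucket/100"))             # real URI, matched
print(is_cloud_uri(local_record))                  # local path, not matched
print(is_cloud_uri(local_record, BARE_PROTOCOLS))  # false positive without "://"
```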

This code currently doesn't work because in line 1835 you are opening the file in binary mode ("rb"), whereas rdheader wants text ("r").

(Note that, completely independent of this pull request, the use of errors="ignore" in line 1855 is totally wrong. But it is probably best to keep behavior consistent between local and cloud files.)
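
To make the two points above concrete, here is a minimal standard-library sketch (the file contents and the utf-8 encoding are made up for illustration, and it uses the built-in open rather than fsspec). Binary mode yields bytes where header parsing expects str, and errors="ignore" silently drops bytes that cannot be decoded.

```python
import os
import tempfile

# Write a tiny header-like file containing one invalid UTF-8 byte (0xff).
with tempfile.NamedTemporaryFile("wb", suffix=".hea", delete=False) as tmp:
    tmp.write(b"100 2 360 650000\n# note: caf\xff\n")
    path = tmp.name

try:
    # Binary mode: every line is a bytes object.
    with open(path, "rb") as f:
        binary_lines = f.readlines()
    # Text mode: every line is a str; undecodable bytes are discarded.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        text_lines = f.readlines()
finally:
    os.remove(path)

print(type(binary_lines[0]).__name__)  # bytes
print(type(text_lines[0]).__name__)    # str
print(repr(text_lines[1]))             # the 0xff byte is gone, with no error raised
```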

briangow · 2025-01-17T20:28:19Z

Thanks @bemoody !

In determining whether something is a URL, we should require it to begin with <protocol>:// (i.e., include the double slash in CLOUD_PROTOCOLS.)

Makes sense, I've updated this.

This code currently doesn't work because in line 1835 you are opening the file in binary mode ("rb"), whereas rdheader wants text ("r").

Strange, I had already made this change on my local copy but it didn't make it to the remote.

briangow · 2025-01-17T20:55:02Z

@bemoody , I'll continue with this PR by updating the following in a similar manner:

wfdb.io.rdrecord
wfdb.io.rdsamp
wfdb.io.rdann

Do you think we should also integrate fsspec into the write functions / wrsamp? If so, how should we deal with the soundfile writes?

wfdb-python/wfdb/io/_signal.py

Line 2511 in e5c6fe5

sf.write(d_signal)

Similarly, we'll also need to figure this out for reading compressed files:

wfdb-python/wfdb/io/_signal.py

Line 1925 in e5c6fe5

sf.seek(start_samp + sample_offset)

Please let me know if there are other areas where I should be integrating fsspec.

bemoody · 2025-01-17T21:55:11Z

Similarly, we'll also need to figure this out for reading compressed files:

Currently, to read a signal file, we open it using open(filename, "rb"), right? And then the resulting file object is passed to the SoundFile constructor. Hopefully the same should work using an fsspec file object.

Do you think we should also integrate fsspec into the write functions / wrsamp?

Yes, that would be ideal, but as a separate pull request. I haven't tried, but I assume you can use "w"/"wb" mode with fsspec.open if you have appropriate credentials to write to the specified location.

add fsspec to rdheader

fce4d62

briangow requested a review from bemoody January 6, 2025 19:22

briangow added 3 commits January 6, 2025 14:26

downgrade aiohttp for python 3.8 compatibility

2a116b9

add fsspec to run-tests

d72ccd7

reformat for compatibility with black

8930f1c

briangow force-pushed the bg_fsspec branch from 4b4e85c to 8930f1c Compare January 6, 2025 20:02

install pip before calling it during run-tests

de1483c

briangow added 6 commits January 9, 2025 11:34

update tests to run on debian 11

3794f92

dont use fsspec for pn_dir files

dfb7818

move cloud_protocols definition

ccd03cc

reformat per black

91c20b2

dont use local path separator for uri

b3e0bd7

only call abspath for local files

fa27e34

briangow added 2 commits January 17, 2025 15:24

use correct read mode

489dcc4

use double slash for cloud protocol urls

6e2b455

add fsspec to rdrecord

1b6f57b
