
[QST] How to efficiently read data from Google Cloud Storage (gs://)? #1088

Closed
leiterenato opened this issue Sep 2, 2021 · 3 comments · Fixed by #1119
Assignees: rjzamora
Labels: question (Further information is requested)

Comments

@leiterenato

What is your question?

I am using NVTabular to preprocess large Parquet files stored in a Google Cloud Storage bucket (~11 GB per file, the Criteo dataset).
The end goal is a solution that can efficiently preprocess large volumes of data (> 100 TB) by reading Parquet files directly from GCS (the gs:// protocol).

For an initial benchmark I used this script from the NVTabular repository.
On line 226 it creates a Dataset (the universal external-data wrapper) from a list of GCS paths, which instantiates a ParquetDatasetEngine as self.engine.

NVTabular's .fit and .transform methods receive the Dataset instance, and both call Dataset.to_ddf, which returns the result of dask_cudf.read_parquet.
The problem is that each call opens the entire file from GCS with fs.open(path, mode="rb"), and when only parts of the file are needed, the file is opened multiple times.
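
For context, this is roughly how that read path looks from user code (a minimal sketch; the bucket path and part_size below are placeholders, not values from the benchmark script):

import nvtabular as nvt

# Placeholder GCS path and partition size, for illustration only
dataset = nvt.Dataset(['gs://my-bucket/day_0.parquet'], engine='parquet', part_size='1GB')

# to_ddf() hands the paths to dask_cudf.read_parquet, which opens them through fsspec/gcsfs
ddf = dataset.to_ddf()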

As cuDF uses fsspec (and gcsfs for the gs:// protocol), would it be possible to read parts of the file in parallel from GCS instead of opening the entire file?
GCS supports parallel random access to objects in a bucket, and in a quick test it was much faster to use fsspec.filesystem('gs').read_block than plain fsspec.open.

Example:
import fsspec

# Open the GCS filesystem (backed by gcsfs) and read a 1000-byte block
# starting at byte offset 10000, without downloading the whole object.
gcs = fsspec.filesystem('gs')
block = gcs.read_block('gs://criteo-parque/day_0.parquet', offset=10000, length=1000)
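
To take advantage of that parallel random access, several ranged reads of the same object could be issued concurrently. A rough sketch, assuming a thread pool and purely illustrative offsets and lengths:

import fsspec
from concurrent.futures import ThreadPoolExecutor

gcs = fsspec.filesystem('gs')
path = 'gs://criteo-parque/day_0.parquet'

# Illustrative (offset, length) pairs; in practice these would come from the Parquet metadata
ranges = [(0, 1_000_000), (1_000_000, 1_000_000), (2_000_000, 1_000_000)]

# Fetch each byte range in its own thread instead of streaming the whole file
with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
    blocks = list(pool.map(lambda r: gcs.read_block(path, r[0], r[1]), ranges))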

I am using this Dockerfile definition for the tests.

leiterenato added the question (Further information is requested) label Sep 2, 2021
leiterenato changed the title from [QST] to [QST] How to efficiently read data from Google Cloud Storage (gs://)? Sep 2, 2021
benfred assigned rjzamora and unassigned albert17 Sep 7, 2021
@rjzamora
Collaborator

Note that #1119 should get us a lot closer to where we want to be on GCS performance. In the long term, it will be best if cudf can efficiently work with Arrow-based file objects, and in the "medium" term, we would want the fsspec data-transfer optimizations to live in cudf or Dask. However, for now, that PR exposes a nice location for us to explore and test remote data-transfer optimizations.

benfred linked a pull request Oct 4, 2021 that will close this issue
@benfred
Member

benfred commented Oct 4, 2021

Fixed by #1119

benfred closed this as completed Oct 4, 2021
@leiterenato
Author

Thank you very much!
This is great news ...
