What is your question?

I am using NVTabular to preprocess large Parquet files stored in a Google Cloud Storage bucket (~11 GB per file, the Criteo dataset).
The end goal is an efficient way to preprocess large volumes of data (> 100 TB) by reading Parquet files directly from GCS (the gs protocol).
For an initial benchmark I used this script from the NVTabular repository.
In line 226 it creates a Dataset (the universal external-data wrapper) from a list of GCS paths, which instantiates a ParquetDatasetEngine as self.engine.
The NVTabular methods .fit and .transform receive the Dataset instance, and both call Dataset.to_ddf, which returns the result of dask_cudf.read_parquet.
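For context, a minimal sketch of that code path (the bucket path and part_size are placeholders, and the exact keyword arguments may differ between NVTabular versions):

```python
import nvtabular as nvt

# Hypothetical GCS path; "my-bucket" is a placeholder.
paths = ["gs://my-bucket/criteo/day_0.parquet"]

# Dataset wraps the remote files and instantiates a ParquetDatasetEngine internally.
dataset = nvt.Dataset(paths, engine="parquet", part_size="1GB")

# .fit()/.transform() call this under the hood; it returns the result of
# dask_cudf.read_parquet over the same GCS paths.
ddf = dataset.to_ddf()
print(ddf.npartitions)
```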
The problem is that each call opens the entire file from GCS with fs.open(path, mode="rb"), and when only parts of the file are needed, the file ends up being opened multiple times.
Since cuDF uses fsspec (and gcsfs for the gs protocol), would it be possible to read parts of the file in parallel from GCS instead of opening the entire file?
GCS supports parallel random access to objects in a bucket, so I ran a quick test, and using fsspec.filesystem('gs').read_block is much faster than plain fsspec.open.
Example:

```python
import fsspec

fs = fsspec.filesystem('gs')
# Read 1000 bytes starting at offset 10000 without downloading the whole object
fs.read_block('gs://criteo-parque/day_0.parquet', offset=10000, length=1000)
```

I am using this Dockerfile definition for the tests.
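To illustrate the kind of parallel random access GCS allows, here is a rough sketch (the byte ranges are made up; in practice the real row-group ranges would come from the Parquet footer metadata):

```python
from concurrent.futures import ThreadPoolExecutor

import fsspec

fs = fsspec.filesystem('gs')
path = 'gs://criteo-parque/day_0.parquet'

# Hypothetical byte ranges; real ones would be read from the Parquet footer.
ranges = [(0, 1_000_000), (1_000_000, 1_000_000), (2_000_000, 1_000_000)]

# Each read_block call issues an independent ranged GET against GCS, so the
# blocks can be fetched concurrently instead of streaming the whole file.
with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
    blocks = list(pool.map(lambda r: fs.read_block(path, r[0], r[1]), ranges))
```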
Note that #1119 should get us a lot closer to where we want to be on GCS performance. In the long term, it will be best if cudf can efficiently work with Arrow-based file objects, and in the "medium" term we would want the fsspec data-transfer optimizations to live in cudf or Dask. However, for now, that PR exposes a nice location for us to explore and test remote data-transfer optimizations.
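For intuition on the Arrow-style file object point, here is a hand-rolled sketch (not what #1119 or cudf actually does): pyarrow can read a single row group through an fsspec file handle, so roughly only the footer and that row group's byte ranges are transferred via seek()/read() rather than the whole object.

```python
import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem('gs')

# gcsfs serves seek()/read() on this handle with ranged GET requests.
with fs.open('gs://criteo-parque/day_0.parquet', 'rb') as f:
    pf = pq.ParquetFile(f)
    # Only the footer plus (approximately) row group 0's bytes are fetched.
    table = pf.read_row_group(0)
    print(table.num_rows)
```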