What is your question?

I am using NVTabular to preprocess large Parquet files stored in a Google Cloud Storage bucket (~11 GB per file, the Criteo dataset).
The end goal is an efficient way to preprocess large volumes of data (> 100 TB) by reading Parquet files directly from GCS (the gs protocol).
For an initial benchmark I used this script from the NVTabular repository.
In line 226 it creates a Dataset (the universal external-data wrapper) from a list of GCS paths, which instantiates a ParquetDatasetEngine as self.engine.
The NVTabular methods .fit and .transform receive the Dataset instance, and both call Dataset.to_ddf, which returns the result of dask_cudf.read_parquet.
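For context, a minimal sketch of that code path (the bucket path and part_size are placeholders, and the exact keyword arguments may differ between NVTabular versions):

```python
import nvtabular as nvt

# Hypothetical GCS path; "my-bucket" is a placeholder.
paths = ["gs://my-bucket/criteo/day_0.parquet"]

# Dataset wraps the remote files and instantiates a ParquetDatasetEngine internally.
dataset = nvt.Dataset(paths, engine="parquet", part_size="1GB")

# .fit()/.transform() call this under the hood; it returns the result of
# dask_cudf.read_parquet over the same GCS paths.
ddf = dataset.to_ddf()
print(ddf.npartitions)
```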
The problem is that each call opens the entire file from GCS with fs.open(path, mode="rb"), and when only parts of the file are needed, the file ends up being opened multiple times.
Since cuDF uses fsspec (and gcsfs for the gs protocol), would it be possible to read parts of the file in parallel from GCS instead of opening the entire file?
GCS supports parallel random access to objects in a bucket, so I ran a quick test, and using fsspec.filesystem('gs').read_block is much faster than plain fsspec.open.
Example:

```python
import fsspec

fs = fsspec.filesystem('gs')
# Read 1000 bytes starting at offset 10000 without downloading the whole object
fs.read_block('gs://criteo-parque/day_0.parquet', offset=10000, length=1000)
```

I am using this Dockerfile definition for the tests.
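To illustrate the kind of parallel random access GCS allows, here is a rough sketch (the byte ranges are made up; in practice the real row-group ranges would come from the Parquet footer metadata):

```python
from concurrent.futures import ThreadPoolExecutor

import fsspec

fs = fsspec.filesystem('gs')
path = 'gs://criteo-parque/day_0.parquet'

# Hypothetical byte ranges; real ones would be read from the Parquet footer.
ranges = [(0, 1_000_000), (1_000_000, 1_000_000), (2_000_000, 1_000_000)]

# Each read_block call issues an independent ranged GET against GCS, so the
# blocks can be fetched concurrently instead of streaming the whole file.
with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
    blocks = list(pool.map(lambda r: fs.read_block(path, r[0], r[1]), ranges))
```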
Note that #1119 should get us a lot closer to where we want to be on GCS performance. In the long term, it will be best if cudf can efficiently work with Arrow-based file objects, and in the "medium" term we would want the fsspec data-transfer optimizations to live in cudf or Dask. However, for now, that PR exposes a nice location for us to explore and test remote data-transfer optimizations.
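For intuition on the Arrow-style file object point, here is a hand-rolled sketch (not what #1119 or cudf actually does): pyarrow can read a single row group through an fsspec file handle, so roughly only the footer and that row group's byte ranges are transferred via seek()/read() rather than the whole object.

```python
import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem('gs')

# gcsfs serves seek()/read() on this handle with ranged GET requests.
with fs.open('gs://criteo-parque/day_0.parquet', 'rb') as f:
    pf = pq.ParquetFile(f)
    # Only the footer plus (approximately) row group 0's bytes are fetched.
    table = pf.read_row_group(0)
    print(table.num_rows)
```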