[FEA] Profiling duplicate reading of metadata #6004
Labels: cuIO, feature request, libcudf, Python
The row-group-level filtered reading for Parquet introduced by #5843 causes Parquet metadata (which is stored in the file footer) to be read twice when filters are specified. Arrow reads the metadata and uses the user-provided filters to select a subset of the data to read [4]. Information about this subset is then passed to libcudf, which reads the subset [5]. The issue is that the metadata gets read twice: first when Arrow reads it to do the filtering, and again when libcudf reads the data.
This issue was initially raised here [1].
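As a rough illustration of how the duplicate footer read could be counted, here is a minimal sketch using only the Python standard library. It does not call Arrow or libcudf. It builds a synthetic byte buffer laid out like a Parquet file (magic bytes, data, footer, 4-byte little-endian footer length, magic bytes) and wraps it in a file object that counts reads overlapping the footer. The names `build_fake_parquet`, `FooterReadCounter` and `read_footer` are hypothetical, and the two `read_footer` calls are only stand-ins for the Arrow filtering pass and the libcudf read pass.

```python
import io
import struct

MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def build_fake_parquet(data: bytes, footer: bytes) -> bytes:
    """Lay bytes out like a Parquet file (synthetic, not a valid file)."""
    return MAGIC + data + footer + struct.pack("<I", len(footer)) + MAGIC


class FooterReadCounter(io.BytesIO):
    """In-memory file that counts reads overlapping the footer region."""

    def __init__(self, raw: bytes):
        super().__init__(raw)
        (footer_len,) = struct.unpack("<I", raw[-8:-4])
        self.footer_end = len(raw) - 8
        self.footer_start = self.footer_end - footer_len
        self.footer_reads = 0
        self.footer_bytes = 0

    def read(self, size=-1):
        start = self.tell()
        chunk = super().read(size)
        end = start + len(chunk)
        # Number of bytes of this read that fall inside the footer
        overlap = min(end, self.footer_end) - max(start, self.footer_start)
        if overlap > 0:
            self.footer_reads += 1
            self.footer_bytes += overlap
        return chunk


def read_footer(f) -> bytes:
    """Read the 8-byte tail to get the footer length, then the footer."""
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])
    f.seek(-8 - footer_len, io.SEEK_END)
    return f.read(footer_len)


raw = build_fake_parquet(data=b"x" * 1000, footer=b"m" * 64)
f = FooterReadCounter(raw)
read_footer(f)  # stand-in for the Arrow metadata/filtering pass (assumption)
read_footer(f)  # stand-in for the libcudf read pass (assumption)
print(f.footer_reads, f.footer_bytes)
```

With a real dataset, the same idea would apply to whatever file object or I/O layer both readers share. The relative cost would depend on footer size compared with the size of the selected row groups, which is what the profiling is meant to establish.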
What to profile
What to determine
Dataset
to libcudf reader functions
Relevant discussion/code
[1] #5843 (comment)
[2] #5843 (comment)
[3] https://github.com/apache/arrow/blob/2e6009621011d7df43882aa883905b84d1647018/cpp/src/parquet/file_reader.cc#L532
[4] https://github.com/rapidsai/cudf/pull/5843/files#diff-deac873508aaa12ca2e7c0a2c9035230R316
[5] https://github.com/rapidsai/cudf/pull/5843/files#diff-deac873508aaa12ca2e7c0a2c9035230R359-R375
[6] #5843 (comment)
[7] #5843 (comment)