[FEA] Profiling duplicate reading of metadata #6004

Open
4 tasks
calebwin opened this issue Aug 17, 2020 · 3 comments
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@calebwin
Contributor

The row-group-level filtered reading for Parquet introduced by #5843 causes Parquet metadata (stored in the footer of each Parquet file) to be read twice when filters are specified. Arrow reads the metadata and uses the user-provided filters to select the subset of data to read [4]. Information about this subset is then passed to libcudf, which reads the subset [5]. The metadata is therefore read twice: first when Arrow reads it to do the filtering, and again when libcudf reads the data.

This issue was initially raised here [1].

What to profile

  • Perf penalty of reading metadata using Arrow for filtering, in the same vein as [2] but with datasets containing varying numbers of files
  • Perf penalty of parsing the metadata buffer [3] as a fraction of the total time Arrow spends reading metadata
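
A minimal sketch of how the first measurement could be set up, assuming the metadata read and the data read can each be wrapped in a callable that takes a file path. The names `time_call` and `metadata_overhead` are hypothetical helpers for illustration, not part of cuDF or Arrow.

```python
import time
from typing import Callable, Dict, Sequence


def time_call(fn: Callable[[str], object], paths: Sequence[str]) -> float:
    """Return the wall-clock seconds spent calling fn once per path."""
    start = time.perf_counter()
    for path in paths:
        fn(path)
    return time.perf_counter() - start


def metadata_overhead(
    paths: Sequence[str],
    read_metadata: Callable[[str], object],  # assumption: the Arrow-side metadata read
    read_data: Callable[[str], object],      # assumption: the libcudf-side data read
) -> Dict[str, float]:
    """Time both steps over the same files and report the metadata share."""
    metadata_s = time_call(read_metadata, paths)
    data_s = time_call(read_data, paths)
    total_s = metadata_s + data_s
    return {
        "n_files": len(paths),
        "metadata_s": metadata_s,
        "data_s": data_s,
        # Fraction of the combined time spent on the extra metadata read.
        "metadata_fraction": metadata_s / total_s if total_s > 0 else 0.0,
    }
```

One plausible way to use it is to pass `pyarrow.parquet.read_metadata` as `read_metadata` and `cudf.read_parquet` as `read_data`, then repeat the call for datasets of 1, 10, 100, ... files and compare `metadata_fraction` across runs.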

What to determine

  • Determine whether the perf penalty of the additional metadata read by Arrow is significant
  • Determine whether to resolve the duplicate read by passing from Arrow Dataset to the libcudf reader functions either the metadata struct (steps to implement: [6]) or the metadata buffer, which libcudf then parses into a metadata struct (steps to implement: [7])
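
To make the "metadata buffer" option concrete, here is a minimal sketch of pulling the raw footer bytes out of a Parquet file using only the standard library. It relies on the Parquet file layout, in which the file ends with the serialized metadata, a 4-byte little-endian footer length, and the magic bytes `PAR1`. The function name `read_footer_buffer` is hypothetical, and how libcudf would accept such a buffer is not settled in this issue.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"
# Smallest possible file: leading magic + footer length + trailing magic.
MIN_PARQUET_SIZE = 12


def read_footer_buffer(path: str) -> bytes:
    """Return the raw (still Thrift-encoded) metadata bytes of a Parquet file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < MIN_PARQUET_SIZE:
            raise ValueError("file is too small to be a Parquet file")

        # The last 8 bytes are the footer length followed by the magic.
        f.seek(size - 8)
        tail = f.read(8)
        (footer_len,) = struct.unpack("<I", tail[:4])
        if tail[4:] != PARQUET_MAGIC:
            raise ValueError("missing PAR1 magic at end of file")
        if footer_len > size - MIN_PARQUET_SIZE:
            raise ValueError("footer length exceeds file size")

        # The metadata sits immediately before the length field.
        f.seek(size - 8 - footer_len)
        return f.read(footer_len)
```

With this approach the bytes would be read from storage once and parsed by each consumer, so the parse cost is still paid twice. Passing the parsed struct avoids the second parse but requires the two libraries to agree on a shared representation.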

Relevant discussion/code

[1] #5843 (comment)
[2] #5843 (comment)
[3] https://github.com/apache/arrow/blob/2e6009621011d7df43882aa883905b84d1647018/cpp/src/parquet/file_reader.cc#L532
[4] https://github.com/rapidsai/cudf/pull/5843/files#diff-deac873508aaa12ca2e7c0a2c9035230R316
[5] https://github.com/rapidsai/cudf/pull/5843/files#diff-deac873508aaa12ca2e7c0a2c9035230R359-R375
[6] #5843 (comment)
[7] #5843 (comment)

@calebwin calebwin added Needs Triage Need team to review and classify feature request New feature or request labels Aug 17, 2020
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Aug 18, 2020
@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball
Contributor

Hello @wence-, now that #15028 is merged, would you please let me know whether cuDF-python is still reading Parquet row group metadata using pyarrow? Or has that step been completely removed?

@wence-
Contributor

wence- commented Feb 17, 2024

Hello @wence- , now that #15028 is merged, would you please let me know if cuDF-python is still reading parquet row group metadata using pyarrow? Or is that step completely removed?

That change just exposed the libcudf functionality; we haven't migrated to using it from cudf-python (partly due to #15051).

Projects
Status: Todo