[FEA] Parquet writer to include Column Index feature #9268

Closed
revans2 opened this issue Sep 22, 2021 · 8 comments · Fixed by #11302
Labels
cuIO (cuIO issue) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

Comments

@revans2
Contributor

revans2 commented Sep 22, 2021

Is your feature request related to a problem? Please describe.
In Parquet 1.11 a new feature was added for column indexes/page indexes.

https://github.com/apache/parquet-format/blob/master/PageIndex.md

When I grep through the code I do not see support for this feature, but I could have missed it. Spark supports using this feature on the CPU to reduce the total amount of data read from disk, and it would be great to be able to write parquet files that support it too. That way, customers who need to read the data on the CPU can read data written by the GPU as fast as data written by the CPU.

Describe the solution you'd like
Insert the ColumnIndex and OffsetIndex automatically into each parquet file we write (the two structures are sketched below).

Describe alternatives you've considered
We cannot do this without the help of cudf, so there really are no other alternatives.

Additional context
None
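
For reference, the two structures defined in PageIndex.md are roughly as follows. This is a purely illustrative Python mirror of the Thrift definitions (not libcudf code), just to show what the writer would have to serialize:

```python
# Illustrative mirror of the Thrift structures in PageIndex.md; not libcudf types.
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class BoundaryOrder(IntEnum):
    UNORDERED = 0
    ASCENDING = 1
    DESCENDING = 2


@dataclass
class ColumnIndex:
    """Per-page statistics for one column chunk."""
    null_pages: List[bool]          # True for pages that contain only nulls
    min_values: List[bytes]         # lower bound of each page's values
    max_values: List[bytes]         # upper bound of each page's values
    boundary_order: BoundaryOrder   # whether the page bounds are sorted
    null_counts: Optional[List[int]] = None


@dataclass
class PageLocation:
    offset: int                # file offset of the page's first byte
    compressed_page_size: int  # page header plus compressed data, in bytes
    first_row_index: int       # index of the page's first row within the row group


@dataclass
class OffsetIndex:
    """Location of every data page in one column chunk."""
    page_locations: List[PageLocation] = field(default_factory=list)
```

The boundary_order field is what lets a CPU reader binary-search the page bounds when they happen to be sorted, instead of scanning them linearly.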

revans2 added the feature request, Needs Triage, and Spark labels on Sep 22, 2021
@devavret
Contributor

IIUC this one is a lot simpler than the reader. Is there any benefit to implementing this before we do it for the reader?

devavret added the cuIO and libcudf labels on Sep 23, 2021
beckernick removed the Needs Triage label on Sep 24, 2021
@revans2
Contributor Author

revans2 commented Sep 28, 2021

The benefit would be that Spark on the CPU, and other CPU software compatible with the feature, would be able to read the same data much faster because they would not have to read in all of the pages. It would not provide a direct benefit to GPU processing without the read-side changes.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@etseidl
Contributor

etseidl commented Jun 1, 2022

Hi all. This issue is a pretty important one to me. I agree with @revans2 that implementing this in the writer has value for other projects even without reader support. I've been working on this lately, and once I get approval I will submit a draft PR.

@devavret
Contributor

devavret commented Jun 1, 2022

Hi @etseidl, it would be great if you could submit a PR. One blocker that kept us from implementing this is that, without reader support, we wouldn't have any unit tests for it. And even if we did have reader support, I couldn't figure out how to test it properly. Any suggestions on that front are welcome.

@etseidl
Contributor

etseidl commented Jun 1, 2022

Hi @devavret. Yes, testing this will require a lot of work. My initial thought was to write files with a variety of column types, and then read the footer to find the offset indexes. With those, I'd then seek to the start of each page and make sure the page headers were valid. For the column indexes, for a start I'd find the smallest min value and largest max value and make sure those at least match the stats for the column chunk. I don't yet have any thoughts beyond that :(
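
A rough sketch of the chunk-level half of that cross-check is below. It is illustrative only, not a libcudf unit test: it assumes cudf's default column statistics are enabled and only exercises what pyarrow exposes; verifying the per-page ColumnIndex/OffsetIndex themselves would still require decoding the Thrift structures pointed to by the footer.

```python
# Hedged sketch: write a file with cudf, then compare the chunk-level statistics
# that pyarrow exposes against the actual data. The overall min/max of the
# ColumnIndex pages must agree with these chunk statistics, which is the
# cross-check described above; the page-level structures themselves are not
# visible through pyarrow and would need a Thrift-level reader.
import cudf
import pyarrow.parquet as pq

df = cudf.DataFrame({"x": [5, 1, 9, 3], "y": ["b", "a", "d", "c"]})
df.to_parquet("indexed.parquet")

meta = pq.ParquetFile("indexed.parquet").metadata
for rg in range(meta.num_row_groups):
    for ci in range(meta.num_columns):
        chunk = meta.row_group(rg).column(ci)
        stats = chunk.statistics
        assert stats is not None and stats.has_min_max
        name = chunk.path_in_schema
        assert stats.min == df[name].min()
        assert stats.max == df[name].max()
```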

@devavret
Contributor

devavret commented Jun 2, 2022

If you know how to check the validity of page headers, then you could also check whether the stat values match the values in the column index.

Anyway, your plan sounds good.

rapids-bot bot pushed a commit that referenced this issue Jul 4, 2022
Adds some necessary structs to parquet.hpp as well as methods to CompactProtocolReader/Writer to address #9268

I can add tests if necessary once #11177 is merged, or testing can be deferred to be included in a future PR (based on #11171)

Authors:
  - https://github.com/etseidl

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11178
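
For context on the CompactProtocolReader/Writer additions mentioned in the commit above: Thrift's compact protocol serializes the integer fields of these structs as zigzag-mapped ULEB128 varints. The sketch below shows just those two primitives in isolation; it is illustrative and is not the libcudf implementation.

```python
# Illustrative only: the integer-encoding primitives used by Thrift's compact
# protocol, which ColumnIndex/OffsetIndex fields are ultimately written with.
def zigzag64(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one; small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF

def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as a ULEB128 varint, 7 bits per byte."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

# Example: a PageLocation.first_row_index of 300 zigzags to 600 and encodes as b"\xd8\x04".
assert uleb128(zigzag64(300)) == b"\xd8\x04"
```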
rapids-bot bot pushed a commit that referenced this issue Jul 26, 2022
Closes #9268.

The column indexes are actually two different structures: the column index itself, which is essentially per-page min/max statistics, and the offset index, which stores each page's location, compressed size, and first row index. Since the column index contains information already present in the EncColumnChunk structure, I calculate and encode the column index per chunk on the device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file before the footer.

The current implementation does not include truncation of the statistics as recommended.  This will be addressed in a later PR.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - https://github.com/nvdbaranec
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11302