[FEA] Parquet writer to include Column Index feature #9268

Closed
revans2 opened this issue Sep 22, 2021 · 8 comments · Fixed by #11302
Labels
cuIO (cuIO issue) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

Comments

@revans2
Contributor

revans2 commented Sep 22, 2021

Is your feature request related to a problem? Please describe.
In Parquet 1.11 a new feature was added for column indexes/page indexes.

https://github.com/apache/parquet-format/blob/master/PageIndex.md

When I grep through the code I do not see support for this feature, but I could have missed it. Spark supports using this feature on the CPU to reduce the total amount of data read from disk, and it would be great to be able to write parquet files that support it too. That way, customers who need to read the data on the CPU can read data written by the GPU as fast as data written by the CPU.

Describe the solution you'd like
Insert the ColumnIndex and OffsetIndex automatically into each parquet file we write (the two structures are sketched below).

Describe alternatives you've considered
We cannot do this without the help of cudf, so there really are no other alternatives.

Additional context
None
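
For reference, the two structures defined in PageIndex.md are roughly as follows. This is a purely illustrative Python mirror of the Thrift definitions (not libcudf code), just to show what the writer would have to serialize:

```python
# Illustrative mirror of the Thrift structures in PageIndex.md; not libcudf types.
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class BoundaryOrder(IntEnum):
    UNORDERED = 0
    ASCENDING = 1
    DESCENDING = 2


@dataclass
class ColumnIndex:
    """Per-page statistics for one column chunk."""
    null_pages: List[bool]          # True for pages that contain only nulls
    min_values: List[bytes]         # lower bound of each page's values
    max_values: List[bytes]         # upper bound of each page's values
    boundary_order: BoundaryOrder   # whether the page bounds are sorted
    null_counts: Optional[List[int]] = None


@dataclass
class PageLocation:
    offset: int                # file offset of the page's first byte
    compressed_page_size: int  # page header plus compressed data, in bytes
    first_row_index: int       # index of the page's first row within the row group


@dataclass
class OffsetIndex:
    """Location of every data page in one column chunk."""
    page_locations: List[PageLocation] = field(default_factory=list)
```

The boundary_order field is what lets a CPU reader binary-search the page bounds when they happen to be sorted, instead of scanning them linearly.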

revans2 added the feature request, Needs Triage, and Spark labels on Sep 22, 2021
@devavret
Contributor

IIUC this one is a lot simpler than the reader. Is there any benefit to implementing this before we do it for the reader?

devavret added the cuIO and libcudf labels on Sep 23, 2021
beckernick removed the Needs Triage label on Sep 24, 2021
@revans2
Contributor Author

revans2 commented Sep 28, 2021

The benefit would be that Spark on the CPU, and other CPU software compatible with the feature, would be able to read the same data much faster because they would not have to read in all of the pages. It would not provide a direct benefit to GPU processing without the read-side changes.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@etseidl
Contributor

etseidl commented Jun 1, 2022

Hi all. This issue is a pretty important one to me. I agree with @revans2 that implementing this in the writer has value for other projects even without reader support. I've been working on this lately, and once I get approval I will submit a draft PR.

@devavret
Contributor

devavret commented Jun 1, 2022

Hi @etseidl, it would be great if you could submit a PR. One blocker that kept us from implementing this is that, without reader support, we wouldn't have any unit tests for it. And even if we did have reader support, I couldn't figure out how to test it properly. Any suggestions on that front are welcome.

@etseidl
Contributor

etseidl commented Jun 1, 2022

Hi @devavret. Yes, testing this will require a lot of work. My initial thought was to write files with a variety of column types, and then read the footer to find the offset indexes. With those, I'd then seek to the start of each page and make sure the page headers were valid. For the column indexes, for a start I'd find the smallest min value and largest max value and make sure those at least match the stats for the column chunk. I don't yet have any thoughts beyond that :(
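
A rough sketch of the chunk-level half of that cross-check is below. It is illustrative only, not a libcudf unit test: it assumes cudf's default column statistics are enabled and only exercises what pyarrow exposes; verifying the per-page ColumnIndex/OffsetIndex themselves would still require decoding the Thrift structures pointed to by the footer.

```python
# Hedged sketch: write a file with cudf, then compare the chunk-level statistics
# that pyarrow exposes against the actual data. The overall min/max of the
# ColumnIndex pages must agree with these chunk statistics, which is the
# cross-check described above; the page-level structures themselves are not
# visible through pyarrow and would need a Thrift-level reader.
import cudf
import pyarrow.parquet as pq

df = cudf.DataFrame({"x": [5, 1, 9, 3], "y": ["b", "a", "d", "c"]})
df.to_parquet("indexed.parquet")

meta = pq.ParquetFile("indexed.parquet").metadata
for rg in range(meta.num_row_groups):
    for ci in range(meta.num_columns):
        chunk = meta.row_group(rg).column(ci)
        stats = chunk.statistics
        assert stats is not None and stats.has_min_max
        name = chunk.path_in_schema
        assert stats.min == df[name].min()
        assert stats.max == df[name].max()
```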

@devavret
Contributor

devavret commented Jun 2, 2022

If you know how to check the validity of page headers, then you could also check whether the stat values match the values in the column index.

Anyway, your plan sounds good.

rapids-bot bot pushed a commit that referenced this issue Jul 4, 2022
Adds some necessary structs to parquet.hpp as well as methods to CompactProtocolReader/Writer to address #9268

I can add tests if necessary once #11177 is merged, or testing can be deferred to be included in a future PR (based on #11171)

Authors:
  - https://github.com/etseidl

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11178
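
For context on the CompactProtocolReader/Writer additions mentioned in the commit above: Thrift's compact protocol serializes the integer fields of these structs as zigzag-mapped ULEB128 varints. The sketch below shows just those two primitives in isolation; it is illustrative and is not the libcudf implementation.

```python
# Illustrative only: the integer-encoding primitives used by Thrift's compact
# protocol, which ColumnIndex/OffsetIndex fields are ultimately written with.
def zigzag64(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one; small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF

def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as a ULEB128 varint, 7 bits per byte."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

# Example: a PageLocation.first_row_index of 300 zigzags to 600 and encodes as b"\xd8\x04".
assert uleb128(zigzag64(300)) == b"\xd8\x04"
```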
rapids-bot bot pushed a commit that referenced this issue Jul 26, 2022
Closes #9268.

The column indexes are actually two different structures: the column index itself, which is essentially per-page min/max statistics, and the offset index, which stores each page's location, compressed size, and first row index. Since the column index contains information already present in the EncColumnChunk structure, I calculate and encode the column index per chunk on the device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file before the footer.

The current implementation does not include truncation of the statistics as recommended.  This will be addressed in a later PR.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - https://github.com/nvdbaranec
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11302