-
Notifications
You must be signed in to change notification settings - Fork 922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Parquet writer to include Column Index feature #9268
Comments
IIUC this one is a lot simpler than the reader. Is there any benefit to implementing this before we do it for the reader? |
The benefit would be that Spark CPU and other CPU software that is compatible with the feature will be able to read the same data much faster, because they didn't have to read in all of the pages. It would not provide a direct benefit to GPU processing without the read side changes. |
This issue has been labeled |
This issue has been labeled |
Hi all. This issue is a pretty important one to me. I agree with @revans2 that implementing this in the writer has value to other projects even without reader support. I've been working on this lately, and once I get approval will be submitting a draft PR. |
Hi, @etseidl, It would be great if you could submit a PR. One blocker we had due to which we didn't implement this yet is that without reader support, we wouldn't have any unit tests for this. And even if we did have reader support, I couldn't figure out how to properly test it. Any suggestions on that front are welcome. |
Hi @devavret. Yes, testing this will require a lot of work. My initial thought was to write files with a variety of column types, and then read the footer to find the offset indexes. With those, I'd then seek to the start of each page and make sure the page headers were valid. For the column indexes, for a start I'd find the smallest min value and largest max value and make sure those at least match the stats for the column chunk. I don't yet have any thoughts beyond that :( |
If you know how to check the validity of page headers then you could also check if the stat values match the values in the column index. Anyway, your plan sounds good. |
Adds some necessary structs to parquet.hpp as well as methods to CompactProtocolReader/Writer to address #9268 I can add tests if necessary once #11177 is merged, or testing can be deferred to be included in a future PR (based on #11171) Authors: - https://github.com/etseidl Approvers: - Devavret Makkar (https://github.com/devavret) - Yunsong Wang (https://github.com/PointKernel) URL: #11178
Closes #9268. The column indexes are actually two different structures. The column index itself which is essentially per-page min/max statistics, and the offset index which stores each page's location, compressed size, and first row index. Since the column index contains information already in the EncColumnChunk structure, I calculate and encode the column index per chunk on device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file before the footer. The current implementation does not include truncation of the statistics as recommended. This will be addressed in a later PR. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Yunsong Wang (https://github.com/PointKernel) - https://github.com/nvdbaranec - Mike Wilson (https://github.com/hyperbolic2346) URL: #11302
Is your feature request related to a problem? Please describe.
In Parquet 1.11 a new feature was added for column indexes/page indexes.
https://github.com/apache/parquet-format/blob/master/PageIndex.md
When I grep through the code I do not see support for this feature, but I could have missed it. Spark supports using this feature on the CPU to reduce the total amount of data read from disk, and it would be great to be able to write parquet files that support this too. This is so our customers who need to read data using the CPU too can read the data written by the GPU and fast as data written by the CPU.
Describe the solution you'd like
Insert in the ColumnIndex and OffsetIndex automatically for each parquet file we write.
Describe alternatives you've considered
We cannot do this without the help of cudf, so there really are no other alternatives.
Additional context
None
The text was updated successfully, but these errors were encountered: