Implement chunked Parquet reader #11867
Merged
Changes from 168 commits
181 commits
c2e8e87
Fix an issue where using num_rows and skip_rows on a parquet file con…
nvdbaranec f330431
Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt
nvdbaranec eadfd63
Fixed an issue with the tests: input columns cannot have unsanitary …
nvdbaranec c4de038
Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt
nvdbaranec 222c9fe
Copy `parquet_reader_*` into `chunked_parquet_reader_*`
ttnghia f49cfed
Modify `chunked_parquet_reader_options`
ttnghia dd39804
Exploit inheritance to extend the options and options_builder classes
ttnghia 81bc68f
Remove unnecessary variable
ttnghia f8126be
Misc
ttnghia 0e7692c
Add docs
ttnghia 9f9eeb0
PR feedback changes.
nvdbaranec 9b3ea62
Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt
nvdbaranec d2e409a
Fixed some compile errors from merging.
nvdbaranec ed41ac1
Add `chunked_parquet_reader`
ttnghia be782f2
Add empty implementation
ttnghia 7908b66
Merge branch 'branch-22.12' into parquet_reader
ttnghia a7175c8
Add a destructor and `close`
ttnghia 63a7bd6
Update docs
ttnghia 16c12d9
Fix comment
ttnghia cd85385
Construct `chunked_parquet_reader`
ttnghia 5944beb
Add comment
ttnghia 7cfa72a
Rename function and implementing
ttnghia 4696bd3
MISC
ttnghia 99dc786
Bare bones implementation. Many types still not working.
nvdbaranec ad9c399
Merge branch 'branch-22.12' into parquet_reader
ttnghia ecf225d
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 583a7ef
Add test
ttnghia b250d6f
Cleanup
ttnghia e7a9e3e
Modify docs
ttnghia 811354a
Cleanup
ttnghia 12ba72e
Add TODO
ttnghia 45668ff
Add `read_intermediate_data`
ttnghia 1bb8254
Use `read_intermediate_data`
ttnghia b1c44dd
Merge branch 'branch-22.12' into parquet_reader
ttnghia 56715ef
Fix bug
ttnghia a7e7e93
Simplify code
ttnghia 8fe87b1
Implement `file_intermediate_data`
ttnghia 464f4f9
Add `make_output`
ttnghia 56756d6
Implement `read_chunk`
ttnghia 3044ac5
Cleanup
ttnghia ffb8a19
Fix bug when `skip_rows` and `num_rows` are modified inside a called …
ttnghia baf3603
Fix comment
ttnghia 8bdab44
Store preprocess data
ttnghia ec4abfb
Implement `chunked_reader` detail class
ttnghia cb1dea4
Refactoring
ttnghia a8dfd82
Rename structs
ttnghia 7889e5a
Increment `current_read_chunk`
ttnghia 63a6511
Call preprocessing in `read_chunk`
ttnghia c1269d1
Fix `has_next`
ttnghia 95e6c1d
Refactoring
ttnghia bd7b510
Fix errors
ttnghia eb78526
Merge branch 'branch-22.12' into chunked_reader_gpu. Also: work to…
nvdbaranec 66aeaf4
Change param
ttnghia 1d700e3
Rename variables
ttnghia 4af948b
Remove intermediate variables
ttnghia 28cfc6f
Modify tests
ttnghia 8135ed5
First pass of string support.
nvdbaranec fbeabfc
Fix bug
ttnghia df074e0
Remove debug print
ttnghia 5653090
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 3071fd7
Merge branch 'branch-22.12' into parquet_reader
ttnghia 631eff1
Fix tests
ttnghia d1b4e4c
Fix chunk size limit
ttnghia 3f2f8a4
Turn back to do preprocess once
ttnghia 974e7ef
The read limit parameter is now no longer const but truly runtime pa…
ttnghia 0be096b
Add new test file
ttnghia f7018fe
Reverse `parquet_test.cpp`
ttnghia 81097eb
Modify `read` to add exception and preprocess once
ttnghia fcffac8
Rewrite tests
ttnghia 43dd802
Store `decomp_page_data`
ttnghia eeec023
Rewrite tests
ttnghia 14dfd3f
Simple test
ttnghia 66e9f09
Store `raw_page_data`
ttnghia 669b8cf
Cleanup test
ttnghia 001c6c7
Fix empty output
ttnghia f50603a
Add `preprocess_file_and_columns`
ttnghia 66976aa
Misc
ttnghia 0b0040a
Fixed some incorrect logic in preprocess step.
nvdbaranec 467de78
Removed debug stuff. Added some comments.
nvdbaranec d68bf80
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 1888aa3
Merge branch 'branch-22.12' into parquet_reader
ttnghia 5861747
Change function
ttnghia 721c052
Disable debug printing
ttnghia 7cda8c2
Fixed an issue with non-first reads in the chunked reader. Made an a…
nvdbaranec 6bc073d
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 2aef5cc
Fix off-by-one bug
ttnghia 89ca1a6
Merge branch 'branch-22.12' into parquet_reader
ttnghia 5245b9b
Fix an issue related to aliased output pointers in the chunked read c…
nvdbaranec 62283c7
Do not keep reference---copy object instead
ttnghia 44424a2
Optimization: don't do any decoding or page size computation for pag…
nvdbaranec 95df356
Merge branch 'branch-22.12' into chunked_reader_gpu
nvdbaranec 445db9b
Fix build issue for spark-rapids-jni
nvdbaranec db56908
Cleanup: Remove `chunked_parquet_reader_options` and `chunked_parquet…
ttnghia ef5eaee
Move `preprocess_file` into `reader_preprocess.cu`
ttnghia d19260d
Move common implementation into `reader_impl_helpers.*`
ttnghia 6569d62
Cleanup
ttnghia 0a1e2c3
Merge branch 'branch-22.12' into parquet_reader
ttnghia 04e1320
More cleanup
ttnghia 395413d
Rewrite docs for `parquet.hpp` files
ttnghia e3e19e8
Extract functions for `reader` and `chunked_reader`
ttnghia 52339da
Fix issues with string length computation.
nvdbaranec f7e8694
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec 83fa31a
Remove redundant changes
ttnghia 02ccdec
Add simple structs test
ttnghia ee0ffad
Rewrite tests
ttnghia 09a89e4
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 034a5b7
Merge branch 'branch-22.12' into parquet_reader
ttnghia c149d64
Add lists test
ttnghia 9335bb7
MISC
ttnghia 3769fff
Cleanup comments
ttnghia dc9ef5c
Construct output table metadata just once
ttnghia 0366b7a
Construct `_output_columns` just once
ttnghia ea2fe9c
Remove `options` member variable
ttnghia 1804056
Make the chunked_read_limit a soft limit - if we can't find a split, …
nvdbaranec 5671c36
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 3d6e13b
Add tests for structs of lists and lists of structs
ttnghia 826c46f
Fixed an issue in split generation code causing indexing off the end …
nvdbaranec 36c1972
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia d32c602
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec 88ca034
Just reformat
ttnghia d76ee06
Change variable names in tests
ttnghia cf7b786
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec b806323
Optimization: store off global nesting sizes per page so that during…
nvdbaranec 0c2178d
Adding doxygen, refactoring and cleaning up
ttnghia f759103
Merge branch 'branch-22.12' into parquet_reader
ttnghia c2bf7f5
Fixed issues with list, and validity size calculations.
nvdbaranec e7e74c5
More refactoring
ttnghia ee9edbc
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 6fd3e90
Add more tests
ttnghia 9dda980
Add test with empty data
ttnghia bb35f9f
Add tests
ttnghia 59166cf
Rewrite null tests
ttnghia eb6e996
Add more extreme tests
ttnghia 1c56794
Rewrite tests to generate input files just once
ttnghia 4777419
Fix tests with structs of lists
ttnghia aedc37a
Handle nulls for more complex types
ttnghia c00eb3c
Fix another nulls handling bug for strings
ttnghia 4d24f88
Simplify the null purging process
ttnghia b15bb39
Cleanup.
nvdbaranec 7e38a56
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec 321815d
Fleshed out list-of-structs and struct-of-lists tests.
nvdbaranec 8cf95e8
Docs and cleanup.
nvdbaranec af35c4d
Update doxygen
ttnghia fb1bd73
Cleaning up
ttnghia 842f9ea
Add doxygen
ttnghia 34e3777
Clean up `reader_impl.hpp`
ttnghia 40e463c
More cleanup
ttnghia 7252b0d
Cleanup `allocate_nesting_info`
ttnghia 696182c
Reformat
ttnghia 4c353fd
Further cleanup
ttnghia 31590cb
Rename `compute_chunk_read_info` into `preprocess_pages`
ttnghia d91c690
Re-adding an optimization that somehow got nuked during a merge.
nvdbaranec c5e73ce
Optimization: store off global nesting sizes per page so that during…
ttnghia 0b31b95
Merge branch 'branch-22.12' into parquet_reader
ttnghia c8e34ba
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 450e6d5
Merge branch 'branch-22.12' into parquet_reader
ttnghia 4fca0c0
Fix several warnings that show up in the spark-rapids-jni build.
nvdbaranec fd9280c
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia cedcc07
Several changes from PR review.
nvdbaranec 16f78b6
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 73503bb
Fix typo
ttnghia e9905b8
Fix test
ttnghia 4072a80
Address review comments
ttnghia 95a97fb
Small optimization
ttnghia 8390d4b
Implement `column_buffer::empty_like`
ttnghia b4b131f
Merge branch 'branch-22.12' into parquet_reader
ttnghia 5035bf4
Optimize unit tests: Only call `cudf::concatenate` once
ttnghia 912d86c
Remove redundant check
ttnghia bb2e26e
Fix `cudaLaunchKernel` error in `DecodePageData`
ttnghia 36c7ec2
Add assertion to make sure not to decode/parse empty page array
ttnghia 7203c67
Address some review comments
ttnghia 520448b
Merge branch 'branch-22.12' into parquet_reader
ttnghia 96eed8e
Fix `#endif`
ttnghia 6697e3b
Merge branch 'branch-22.12' into parquet_reader
ttnghia 70f4fde
PR review changes. Updated some incorrect/incomplete function docs.
nvdbaranec 4547483
Made the logic in the row_total_size functor much more readable.
nvdbaranec fe7d6d1
Merge branch 'branch-22.12' into parquet_reader
ttnghia db21bc3
Fix the tests
ttnghia 36043d8
Variable renaming for clarity.
nvdbaranec 83f0703
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia 3499bda
Merge branch 'branch-22.12' into parquet_reader
ttnghia
I thought these could not be `const` because of internal state change.
Yeah, this in turn is just calling `_impl->read_chunk()`. Since `_impl` is a `const` pointer, we can make this `const`.
Not sure if `read_chunk` should be `const`, but we can leave it for now.
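For readers following the `const` question in this thread, here is a minimal, self-contained sketch of why a `const` member function can still forward to a state-changing call through an `_impl` pointer. The names `toy_chunked_reader` and `toy_chunked_reader_impl`, the `rows_per_chunk` parameter, and the use of `std::vector<int>` as the data are all hypothetical stand-ins chosen for illustration; this is not the cudf implementation, and it assumes `_impl` is held in a `std::unique_ptr`.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a detail class that owns the mutable read state.
class toy_chunked_reader_impl {
 public:
  toy_chunked_reader_impl(std::vector<int> rows, std::size_t rows_per_chunk)
    : _rows(std::move(rows)),
      // Guard against a zero chunk size so that every read makes progress.
      _rows_per_chunk(rows_per_chunk == 0 ? 1 : rows_per_chunk)
  {
  }

  bool has_next() const { return _pos < _rows.size(); }

  // Non-const: each call advances the internal read position.
  std::vector<int> read_chunk()
  {
    std::size_t const end = std::min(_pos + _rows_per_chunk, _rows.size());
    std::vector<int> out(_rows.begin() + static_cast<std::ptrdiff_t>(_pos),
                         _rows.begin() + static_cast<std::ptrdiff_t>(end));
    _pos = end;
    return out;
  }

 private:
  std::vector<int> _rows;
  std::size_t _rows_per_chunk;
  std::size_t _pos{0};
};

// Hypothetical public wrapper that forwards every call to the detail class.
class toy_chunked_reader {
 public:
  toy_chunked_reader(std::vector<int> rows, std::size_t rows_per_chunk)
    : _impl(std::make_unique<toy_chunked_reader_impl>(std::move(rows), rows_per_chunk))
  {
  }

  bool has_next() const { return _impl->has_next(); }

  // This compiles as const. Inside a const member function the pointer itself
  // is const, but std::unique_ptr::operator-> still yields a pointer to a
  // non-const object, so the non-const read_chunk() of the impl is callable.
  std::vector<int> read_chunk() const { return _impl->read_chunk(); }

 private:
  std::unique_ptr<toy_chunked_reader_impl> _impl;
};
```

With this shape, even a `toy_chunked_reader const` object can be drained in a `while (reader.has_next())` loop. The trade-off is the one raised in the review: the compiler accepts `const` here, but the call still changes observable state, so whether `read_chunk` should be declared `const` is a design decision about logical constness rather than something the language enforces.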