Write string data directly to column_buffer in Parquet reader #13302
Merged: rapids-bot merged 193 commits into rapidsai:branch-23.08 from etseidl:feature/string_cols_v2 on Jun 23, 2023
63a2d88
Rework of level decoding to be considerably more parallel. Previousl…
nvdbaranec 85dfe8a
Merge branch 'branch-23.06' into parquet_level_optimization
nvdbaranec eb37a59
Merge branch 'branch-23.06' into parquet_level_optimization
nvdbaranec 9211bcc
Style formatting.
nvdbaranec 2a2f6b2
checkpoint
etseidl 0cd8481
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl 2b1f7d5
checkpoint
etseidl 6569684
fix is_bounds_page()
etseidl 2f8836b
pass decoders into page_bounds
etseidl db7e2a4
copy over changes from string_cols
etseidl 90e214c
works except skip_rows
etseidl 567a0ab
fix bug with skip_rows
etseidl fb45e8c
debug prints
etseidl 6d89752
fix bug in page_bounds
etseidl 5035703
optimization for countDictEntries
etseidl 595e2e1
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 37f7d46
fix another skip_rows bug, and round robin the countDictEntries calc
etseidl 19396bf
fix for chunked reads
etseidl 3780494
fix bug with setting the offsets for null values...chunked reader
etseidl 4373b8f
fix edge case where skip_rows ends on a page boundary
etseidl 3a39970
move test for long strings
etseidl 743b3f5
more string tweaks
etseidl ad651cf
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 08b68d7
change offsets to size_type
etseidl 269043d
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl b79c9ec
Remove definition and repetition levels from page_data_s struct to de…
nvdbaranec 38792e1
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl 3320cde
fixes after merging
etseidl 897db8c
split out separate decoder for string columns
etseidl 15f4e12
remove test for string hash
etseidl b9399c0
get rid of little used variables
etseidl 57d7aa8
fix a few edge cases
etseidl 655c048
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 72d301a
use char parallel strcpy when avg string len is 32 or higher
etseidl 7768ae5
overlap decode kernels using stream pool
etseidl cbabce2
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 2d42bf3
Squeeze level values into uint16_t instead of uint32_t, shrink deocde…
nvdbaranec 51624c8
refactor to remove string decoding code from page_data.cu
etseidl b3afd25
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl 59bd2d6
finish merge
etseidl 8d81822
Merge branch 'branch-23.06' into parquet_level_optimization
nvdbaranec 077ff39
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl cfbd5e1
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 8986afb
clean up
etseidl 02bf251
Merge branch 'feature/string_cols_v2' of github.com:etseidl/cudf into…
etseidl 6b352a5
clean up
etseidl 6aa1120
add docstrings
etseidl fbd9fc6
more docstrings and clean up
etseidl d85a8e4
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl f4cf521
PR review fixes. Removed unused shuffle_ptr() function. Corrected a…
nvdbaranec 305bf09
test for string col earlier
etseidl 2d406c2
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl ad231f8
Change the level_decode_buf (temp space) to use rmm::mr::get_current_…
nvdbaranec 301cce8
need to call setupLocalPageInfo or bad things happen
etseidl 6db20c1
add todo
etseidl 07b0d73
final fix for restoring decode cache. add some consts.
etseidl 2c8dbb4
more consts
etseidl 00a87aa
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl 8cb430e
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 7b392f6
simplify string col detection
etseidl 6156e66
more consts
etseidl fa0cdfc
cleanup
etseidl 62b61a2
add some TODOs
etseidl ab6d42e
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 3d5c1c8
Use a dynamically sized type for level/repetition data. In almost al…
nvdbaranec ee1a6fc
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl e01404a
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl ecb336e
finish merge
etseidl c917d27
Merge branch 'feature/string_cols_v2' of github.com:etseidl/cudf into…
etseidl c1aebf3
fix string buffer length
etseidl 217e12f
fix for columns that start with null values
etseidl b49ff95
fix for decimal columns
etseidl 8054f10
another fix for null handlng
etseidl 84762b9
one more bug cleaning up nulls
etseidl 86ec00d
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 24fb8f2
PR review feedback.
nvdbaranec e85577d
Merge branch 'branch-23.06' into parquet_level_optimization
nvdbaranec abf4153
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl 32ce89b
minor cleanup
etseidl 859eb43
PR review feedback.
nvdbaranec 809d4e9
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl d994b0c
finish merge
etseidl f12bcc9
Fix a bug where specific usage of skip_rows/num_rows could cause a ra…
nvdbaranec 2d0739e
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl 26d03a3
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 9ceaf18
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 8804007
PR review feedback.
nvdbaranec 8bbbab1
Merge branch 'branch-23.06' into parquet_level_optimization
nvdbaranec 5bbf9a1
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 86e8c2a
Merge remote-tracking branch 'origin/parquet_level_optimization' into…
etseidl 80219c9
Merge remote-tracking branch 'cudf/branch-23.06' into feature/string_…
etseidl dc681ec
Merge branch 'branch-23.06' into feature/string_cols_v2
vuule 9d09842
spelling
etseidl f23f9cf
simplify out_thread0 calc
etseidl 6f73510
Merge branch 'feature/string_cols_v2' of github.com:etseidl/cudf into…
etseidl e80c07b
fix for string col detection
etseidl 9d754fd
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 140749d
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 62befec
Merge remote-tracking branch 'cudf/branch-23.06' into feature/string_…
etseidl 6e89596
finish merge
etseidl f196e60
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl d7db9bb
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl f1669bc
alternate way to do column_buffer
etseidl 2b5a5e0
remove unused constructor
etseidl 30bfe9f
get rid of another unnecessary function
etseidl dea407c
rearrange some
etseidl c4781c0
Merge branch 'rapidsai:branch-23.06' into col_buf_v2
etseidl 7388dec
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 10d00d0
move make_column into policy object
etseidl 22b3d55
reduce diff
etseidl b907a1a
unify interfaces for policy objects
etseidl f1d2b84
Merge branch 'col_buf_v2' into feature/string_cols_v2
etseidl 73b8229
change template param name to string_policy
etseidl e921bd9
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl af1e407
restore null_count_back_copier
etseidl 1e18f1c
fix for page spanning rows
etseidl 681d57d
undo some reformatting of comments
etseidl d8bb072
change make_column to make_string_column
etseidl 2c596f9
move gpuDecodeRleBooleans
etseidl ca84c23
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 06521bd
remove t from docs
etseidl c20214b
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 6625690
CRTP, I think
vuule fcc6ca3
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl 321eb46
Merge branch 'rapidsai:branch-23.06' into feature/string_cols_v2
etseidl eaf457f
Merge branch 'pr/etseidl/13302-1' into feature/string_cols_v2
etseidl d742498
checkpoint CRTP changes
etseidl 3e70821
better fix for initializing _strings
etseidl cb934a8
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl a120a4a
json no longer needs to fully qualify make_column
etseidl 755918b
calculate col_sizes on device to save a round trip for the PageInfo
etseidl e1fd103
calculate offsets with exclusive scan
etseidl 539ef1f
cleanups
etseidl 5070a20
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 723e21d
only call string decode kernel if there are string columns
etseidl 769945e
Merge branch 'feature/string_cols_v2' of github.com:etseidl/cudf into…
etseidl 2bf8c19
offsets can be page local now
etseidl 449ab65
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 15986a9
move create() back to cpp file
etseidl 614c460
remove memory resource from column buffer. instead pass it in when
etseidl 7f1d245
remove if and add CUDF_EXPECTS to allocate_strings_data()
etseidl a0fb80e
get rid of anonymous namespace
etseidl fac2f3f
delete copy constructor
etseidl a03e3ca
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 53b38b8
revert removal of memory resource
etseidl d35236c
Merge branch 'feature/string_cols_v2' of github.com:etseidl/cudf into…
etseidl fc696e7
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 30d1698
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl d9614ff
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 6408f5b
fix some bitrot and add explanation for presence of gpuDecodeStringPa…
etseidl 204d2a2
add const versions of pointer accessors
etseidl f2028d2
document template params
etseidl 2a9b0ff
only need one version of is_string_col
etseidl 2b30f23
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 519ce00
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 7989135
check for string overflow
etseidl f1b3d23
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 1bfece4
more size_type -> size_t
etseidl 6a3400f
implement suggestion from review
etseidl 0799fed
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl eb94a91
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 59bfd1b
use thrust to calculate page string offsets
etseidl 5f3e5af
some cleanup
etseidl 6c3959b
Merge remote-tracking branch 'cudf/branch-23.08' into feature/string_…
etseidl cbd742f
throw std::overflow_error if string column gets too big
etseidl 7aef9e8
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl da830a2
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 967ecf6
only allocate memory for string nesting data if there are string columns
etseidl 8749724
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 7842c00
Merge remote-tracking branch 'cudf/branch-23.08' into feature/string_…
etseidl 0d1ac33
east const for new files
etseidl 9f054dc
Merge branch 'feature/string_cols_v2' of github.com:etseidl/cudf into…
etseidl 774e88f
Merge branch 'branch-23.08' into feature/string_cols_v2
ttnghia 9653fef
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl b09f31c
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 1c84c50
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 98b345f
add new worst-case benchmark for strings
etseidl 3de3554
use stream pool for decode kernels
etseidl a4548e7
move stream pool to impl object
etseidl 6ce50af
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl a2ebf32
Merge branch 'feature/string_cols_v2' of github.com:etseidl/cudf into…
etseidl 572836e
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 47cd9e1
filter on data types in setupLocalPageInfo
etseidl a1304c2
remove experimental decode kernel
etseidl ce2acbe
Revert "move stream pool to impl object"
etseidl 8653b93
finish moving back to static stream pool
etseidl 6ee7b29
add comment for NUM_DECODERS
etseidl a0db39c
call synch on _stream before launching decode kernels
etseidl a42137a
Merge branch 'branch-23.08' into feature/string_cols_v2
etseidl 2d1d556
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl b3ebab5
workaround for nvbench shutdown error
etseidl 19487af
Merge branch 'rapidsai:branch-23.08' into feature/string_cols_v2
etseidl 5b3d070
move page bounds check into setupLocalPageInfo
etseidl
How does this avoid a race condition when two separate kernels visit the same page? Won't one of them erroneously zero out the page's null count after another may have written a valid value to it?
Only one invocation should make it past the filter. That one will zero out the null count and then the back copier will copy it back to the page. @vuule added the logic to make the back copy a no-op if the setup returns early.
Ah, I see. Checking to see if the nesting_info pointer is null.
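The exchange above can be illustrated with a minimal host-side sketch. It is not the cudf code: the struct layouts, the `is_string_page` flag, and the function names `setup_local_page_info` and `decode_page` are simplified, hypothetical stand-ins, and the real logic runs inside CUDA kernels. The sketch only shows the pattern described in the thread: each decoder filters pages in its setup step, only the matching decoder zeroes the null count, and the back copier does nothing when the `nesting_info` pointer was never set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for the reader's per-page structures.
struct nesting_info {
  int32_t null_count;
};

struct page_info {
  bool is_string_page;    // assumed filter criterion: which decoder owns this page
  nesting_info* nesting;  // per-page output state shared by all decoders
};

// Per-invocation working copy. The pointer stays null if setup returns early.
struct page_state {
  nesting_info* nesting_info_ptr = nullptr;
  nesting_info local{};
};

// RAII helper: on scope exit, copy the local null count back to the page.
// If setup bailed out before setting the pointer, this is a no-op.
struct null_count_back_copier {
  page_state* s;
  ~null_count_back_copier()
  {
    if (s->nesting_info_ptr != nullptr) {
      s->nesting_info_ptr->null_count = s->local.null_count;
    }
  }
};

// Returns false when the page does not belong to the calling decoder.
bool setup_local_page_info(page_state* s, page_info* page, bool decoder_handles_strings)
{
  assert(page != nullptr);
  if (page->is_string_page != decoder_handles_strings) { return false; }  // filter
  s->nesting_info_ptr = page->nesting;
  s->local.null_count = 0;  // only the decoder that passes the filter resets the count
  return true;
}

// One decoder invocation over one page; nulls_found stands in for the real decode work.
void decode_page(page_info* page, bool decoder_handles_strings, int32_t nulls_found)
{
  page_state s;
  null_count_back_copier back_copier{&s};  // destroyed before s, so s is still valid
  if (!setup_local_page_info(&s, page, decoder_handles_strings)) { return; }
  s.local.null_count += nulls_found;
}
```

Calling `decode_page` on the same page from both a string decoder and a non-string decoder leaves the page's `null_count` holding only the value written by the decoder that passed the filter; the other call returns early and its back copier leaves the page untouched.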