Write string data directly to column_buffer in Parquet reader #13302
Conversation
…, it was only 1 warp wide. Now it is block-wide. Only integrated into the gpuComputePageSizes() kernel. gpuDecodePages() will be a followup PR.
… feature/string_cols_v2
still not quite happy
…al with a performance issue introduced in gpuDecodePageData by previously changing them to be pointers instead of hardcoded arrays.
… feature/string_cols_v2
This reverts commit a4548e7.
/ok to test
/ok to test
@@ -663,38 +663,19 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodeStringPageData(
page_state_buffers_s* const sb = &state_buffers;
int const page_idx = blockIdx.x;
int const t = threadIdx.x;
[[maybe_unused]] null_count_back_copier _{s, t};
How does this avoid the race condition when two separate kernels visit the same page? Won't one of them erroneously zero out a page that the other may have written a valid value to?
Only one invocation should make it past the filter. That one will zero out the null count and then the back copier will copy it back to the page. @vuule added the logic to make the back copy a no-op if the setup returns early.
Ah, I see. Checking to see if the nesting_info pointer is null.
Ship it.
/merge
oops, missing a cmake review
CMake approval
Description

The current Parquet reader decodes string data into a list of {ptr, length} tuples, which are then used in a gather step by `make_strings_column`. This gather step can be time consuming, especially when there are a large number of string columns. This PR addresses that by changing the decode step to write char and offset data directly to the `column_buffer`, which can then be used directly, bypassing the gather step.

The image below compares the new approach to the old. The green arc at the top (82ms) is `gpuDecodePageData`, and the red arc (252ms) is the time spent in `make_strings_column`. The green arc below (25ms) is `gpuDecodePageData`, the amber arc (22ms) is a new kernel that computes string sizes for each page, and the magenta arc (106ms) is the kernel that decodes string columns.

NVbench shows a good speedup for strings as well. There is a jump in time for the INTEGRAL benchmark, but little to no change for other data types. The INTEGRAL time seems to be affected by extra time spent in `malloc` allocating host memory for a `hostdevice_vector`. This `malloc` always occurs, but for some reason in this branch it takes much longer to return.

This is comparing to @nvdbaranec's branch for #13203.
May address #13024
Depends on #13203

Checklist