Fix writing of Parquet files with many fragments #11869
Conversation
Alternative approaches to consider are either using a single dimension and calculating column and fragment indexes, or using a fixed y dimension and looping over fragments. I added a test that fails without the fix. nvbench and nsys profile don't show any performance degradation with the swap in place.
Codecov Report
Base: 87.40% // Head: 88.12% // Increases project coverage by +0.71%.
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.12 #11869 +/- ##
================================================
+ Coverage 87.40% 88.12% +0.71%
================================================
Files 133 133
Lines 21833 21905 +72
================================================
+ Hits 19084 19304 +220
+ Misses 2749 2601 -148
☔ View full report at Codecov.
A downside to simply swapping x and y is that the number of columns is now limited to 64k. A 1D grid would allow either the number of fragments or the number of columns to exceed the 64k limit, but obviously not both. Maybe a fixed y with looping (as suggested offline by @vuule) is the best fix?
I like this! Got nothing to contribute :)
cpp/src/io/parquet/page_enc.cu (Outdated)

g->col       = &col_desc[column_id];
g->start_row = fragments[column_id][frag_id].start_value_idx;
g->num_rows  = fragments[column_id][frag_id].num_leaf_values;
uint32_t const lane_id = threadIdx.x & 0x1f;
There are some rather innocuous-seeming magic values in this function that are all related to cudf::detail::warp_size. I'll point them out, but I am fine with doing nothing if we feel the current code is better left unchanged.
Suggested change:
- uint32_t const lane_id = threadIdx.x & 0x1f;
+ uint32_t const lane_id = threadIdx.x % cudf::detail::warp_size;
Sounds good (although using the mod operator makes my teeth itch 🤣). Does anyone happen to know if there are constants anywhere for the max threadblock dimensions? Or are those per-card values?
I believe there are no constants for that, and that's why we defined cudf::detail::warp_size. It is a constant for all NVIDIA GPUs as far as I am aware.
These two snippets should compile to roughly the same code: compilers can recognize that an unsigned modulo by a power of two is equivalent to a bitwise AND.
I went ahead and added a constexpr for the warp mask (before I read your reply). There are several other instances of 0x1f sprinkled about in this file that can be replaced later.
Thanks for the link @bdice! Should I get rid of my mask constexpr and just use cudf::detail::warp_size everywhere?
To be a bit more precise here, CUDA does provide warpSize, which is available inside device code, and the getDeviceProperties host function, which returns a struct containing the warp size. However, neither of them is a compile-time constant, so they cannot be used in constant expressions (e.g. for declaring a C-style array or std::array). The warp size is indeed constant across all current compute capabilities. In theory that's not something that is promised, so the technically correct answer is that we can't use a compile-time constant, because someone could in theory run on a new architecture with a different answer. In practice, NVIDIA has no plans to change the warp size AFAIK, and many examples of GPU code (even lots of code written by NVIDIA) define a warp_size constant. Lots of places assume it is in fact a compile-time constant and would have to be rewritten if we ever had cards with a different warp size, so that's a much bigger problem to deal with another day if that ever changes :)
cpp/src/io/parquet/page_enc.cu (Outdated)

uint32_t const column_id                = blockIdx.x;
uint32_t const num_fragments_per_column = fragments.size().second;

uint32_t frag_id = blockIdx.y * 4 + (threadIdx.x >> 5);
Suggested change:
- uint32_t frag_id = blockIdx.y * 4 + (threadIdx.x >> 5);
+ uint32_t frag_id = blockIdx.y * 4 + (threadIdx.x / cudf::detail::warp_size);
A couple of magic values that don't need to be magical -- otherwise LGTM.
Tests seem to be failing on the mimesis stuff now. Should I merge with 22.12 to pull in #11906?
@etseidl Merging the upstream or commenting “rerun tests” should work.
rerun tests
@gpucibot merge
Description
This PR fixes an error that can occur when very small page sizes are used when writing Parquet files. #11551 changed from fixed 5000-row page fragments to a scaled value based on the requested max page size. For small page sizes, the number of fragments to process can exceed 64k. The number of fragments is used as the y dimension when calling gpuInitPageFragments, and when it exceeds 64k the kernel fails to launch, ultimately leading to an invalid memory access.
Checklist