Parquet reader optimization to address V100 regression. #12577
Conversation
Really cool optimization. Just have a few (potential) suggestions.
Tested against Spark integration tests.
rerun tests
Codecov Report: Base: 85.71% // Head: 85.70% // Decreases project coverage by -0.01%.

@@            Coverage Diff             @@
##           branch-23.02   #12577   +/-  ##
================================================
- Coverage        85.71%   85.70%   -0.01%
================================================
  Files              155      155
  Lines            24873    24873
================================================
- Hits             21319    21318       -1
- Misses            3554     3555       +1
The core of the changes looks good. I have minor questions about types and some variable-name suggestions. I can approve after that.
struct PageNestingDecodeInfo {
  // set up prior to decoding
  int32_t max_def_level;
Verifying our intended spellings: should all these types be written as int32_t, or should some be int or cudf::size_type? valid_count seems like it should be cudf::size_type, for example.
How far do we want to take this here? Broadly, cuIO tends to work with native types in the kernels rather than cudf typedefs (mostly for historical reasons, not really philosophical ones). I think my preference would be to keep int32_t for now (although there are some errant int fields worth fixing up) and then maybe file an issue to do a more detailed audit and cleanup.
I'd defer to @vuule on scope and prioritization, but I do think this deserves an issue if we choose not to address it here. We really shouldn't be redefining core types.
I had hoped this PR might be able to make a dent in this broader problem.
It is my understanding that size_type should be used for row counts and for values derived from row counts. I suspect the max_def_level value range is dictated by the Parquet spec, in which case int32_t is the right type to use.
Given that we are nearly at code freeze, I'm against increasing the scope of this PR.
We can open an issue to audit integral types in cuIO. Taking suggestions on how we could make this less tedious 😬
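(For reference, a minimal sketch of the convention being discussed. The typedef matches how cudf defines size_type in cudf/types.hpp; the struct and its field grouping are purely illustrative, not the actual cudf layout.)

  #include <cstdint>

  namespace cudf {
  using size_type = int32_t;  // matches cudf's definition in cudf/types.hpp
  }

  // Illustrative grouping only, not a real cudf struct:
  struct example_decode_state {
    cudf::size_type valid_count;  // derived from row counts, so size_type
    int32_t max_def_level;        // range fixed by the Parquet spec, so int32_t
  };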
Couple of questions, but very nice find and a solid fix.
…Renamed PageNestingDecodeInfo::pndi to PageNestingDecodeInfo::nesting_info for clarity. Improved some variable names.
A few minor comments on naming. Otherwise looks good.
…e nesting decode cache limit.
…est. Cleaned up some more variable names.
/merge
Addresses #12316
Some recent changes caused a performance regression in the Parquet reader benchmarks for lists. The culprit ended up being slightly different code generation for arch 70. In several memory hotspots, the code was reading values from global memory, modifying them, and then storing them back. Previously it had done a better job of loading them once and keeping them in registers, with the L2 cache helping keep things fast. The extra store was causing twice as many L2 accesses in these places, leading to many long-scoreboard stalls.
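A minimal sketch of the pattern being described, using hypothetical kernels rather than the actual cudf code: repeated read-modify-write of a global value versus caching it in a register.

  #include <cstdint>

  __global__ void count_values_slow(int32_t* value_count,
                                    int32_t const* def_levels,
                                    int n,
                                    int32_t max_def_level)
  {
    // Illustrative single-thread loop: each update is a global-memory
    // load + store, producing the doubled L2 traffic described above.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
      for (int i = 0; i < n; i++) {
        if (def_levels[i] == max_def_level) { *value_count += 1; }
      }
    }
  }

  __global__ void count_values_fast(int32_t* value_count,
                                    int32_t const* def_levels,
                                    int n,
                                    int32_t max_def_level)
  {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
      int32_t count = *value_count;  // single global load
      for (int i = 0; i < n; i++) {
        if (def_levels[i] == max_def_level) { count += 1; }  // register-only
      }
      *value_count = count;  // single global store
    }
  }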
Ultimately, the issue is that these values shouldn't be kept in global memory. The initial implementation did it this way because the data is variable in size (based on the depth of column nesting), but in practice we never see more than 2 or 3 levels of nesting. So the solution is to keep a small decode-time struct (PageNestingDecodeInfo) in shared memory for up to N nesting levels, where N is currently 10 (see the sketch below).
This addresses the performance regression and actually gives some performance increases. Some comparisons for LIST benchmarks.
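Below is a minimal sketch of the shared-memory cache idea, under assumed names and fields; the real PageNestingDecodeInfo carries more decode state, and the real kernel presumably handles nesting deeper than the cache limit separately.

  #include <cstdint>

  constexpr int max_cached_nesting_levels = 10;  // "N is currently 10"

  // Illustrative fields only; the real struct has more members.
  struct PageNestingDecodeInfo {
    int32_t max_def_level;  // set up prior to decoding
    int32_t valid_count;    // mutated repeatedly during decoding
  };

  __global__ void decode_page(PageNestingDecodeInfo* global_nesting_info,
                              int num_levels)
  {
    // Per-block cache: the mutable decode state lives in fast shared memory
    // instead of global memory. Assumes num_levels <= max_cached_nesting_levels.
    __shared__ PageNestingDecodeInfo nesting_info[max_cached_nesting_levels];

    // Cooperatively stage the (small) per-level state into shared memory.
    for (int i = threadIdx.x; i < num_levels; i += blockDim.x) {
      nesting_info[i] = global_nesting_info[i];
    }
    __syncthreads();

    // ... decoding reads and writes nesting_info[] here, not global memory ...

    __syncthreads();
    // Write the final per-level state back out once at the end.
    for (int i = threadIdx.x; i < num_levels; i += blockDim.x) {
      global_nesting_info[i] = nesting_info[i];
    }
  }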