[FEA] cuIO Statistics calculation code is redundant #6920
Comments
@devavret assigning to you since you already have the POC implementation. Feel free to hand off if needed.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
Do the recent changes to statistics calculation affect this issue?
Quite possibly. The new statistics code is much simpler and may obviate the need for this.
Addresses #6920. Uses type-dispatched functors to calculate statistics in Parquet and ORC.

Authors:
- Kumar Aatish (https://github.com/kaatish)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)
- Ashwin Srinath (https://github.com/shwina)
- Michael Wang (https://github.com/isVoid)

URL: #8191
@devavret can we close this issue now?
Per @devavret's comment above, closing this. Can reopen a more targeted issue for the few remaining redundancies if needed.
Original issue description:

cuIO has common code for statistics calculation shared between the Parquet writer and the ORC writer. It uses custom logic to perform reductions across chunks of rows, where the chunks are defined by the unit for which the statistics are generated, e.g. pages in the case of Parquet and stripes in the case of ORC.
This can be refactored to use `cub::DeviceSegmentedReduce` with a custom iterator that creates a `statistics_val` from each column element and a custom reduce operator that merges two `statistics_val`s.

We should also consider performing the reduction in the input columns' cudf types rather than in specially mapped output types. Once the reduction is complete, if the format calls for it, we can convert the type while encoding. This would allow us to replace the switch cases written for these dtypes (e.g. here, here, and here) with cudf's type dispatcher. A minimal sketch of the segmented-reduce approach follows.
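The following is a rough sketch of that pattern, under stated assumptions: `statistics_val`, `make_stats`, and `merge_stats` here are simplified hypothetical stand-ins for cuIO's actual statistics types, and a plain `double` column with precomputed chunk offsets (page/stripe boundaries) stands in for the real column data.

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/iterator/transform_iterator.h>
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdint>

// Hypothetical stand-in for cuIO's statistics_val: min/max/count of a chunk.
struct statistics_val {
  double min;
  double max;
  int64_t count;
};

// Functor that builds a statistics_val from a single column element.
struct make_stats {
  __host__ __device__ statistics_val operator()(double v) const
  {
    return statistics_val{v, v, 1};
  }
};

// Reduce operator that merges two statistics_val instances.
struct merge_stats {
  __host__ __device__ statistics_val operator()(statistics_val const& a,
                                                statistics_val const& b) const
  {
    return statistics_val{a.min < b.min ? a.min : b.min,
                          a.max > b.max ? a.max : b.max,
                          a.count + b.count};
  }
};

// d_data: column values; d_offsets: num_segments + 1 chunk boundaries
// (page/stripe boundaries); d_out: one statistics_val per chunk.
void reduce_chunk_stats(double const* d_data,
                        int const* d_offsets,
                        statistics_val* d_out,
                        int num_segments,
                        cudaStream_t stream)
{
  // Transform iterator converts each element into a statistics_val on the fly.
  auto in = thrust::make_transform_iterator(d_data, make_stats{});
  statistics_val identity{DBL_MAX, -DBL_MAX, 0};

  // First call computes required temp storage; second call runs the reduction.
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, in, d_out,
                                     num_segments, d_offsets, d_offsets + 1,
                                     merge_stats{}, identity, stream);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, in, d_out,
                                     num_segments, d_offsets, d_offsets + 1,
                                     merge_stats{}, identity, stream);
  cudaFree(d_temp);
}
```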
This will have the following advantages: it lets us use the `DeviceMin`, `DeviceMax`, and `DeviceSum` operators, which define the min/max/sum operations and, more importantly, the respective identities for all current and future cudf types. A short sketch of this operator-plus-identity pattern follows.
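As a brief illustration of why operator-defined identities matter (names modeled on cudf's device operators; the exact cudf signatures may differ), the identity doubles as the initial value of the reduction, so empty segments and newly added types are handled uniformly:

```cpp
#include <limits>

// Illustrative operator type carrying both the binary operation and its
// identity. Modeled on cudf's DeviceMin; not cudf's exact definition.
struct DeviceMin {
  template <typename T>
  __host__ __device__ T operator()(T const& lhs, T const& rhs) const
  {
    return lhs < rhs ? lhs : rhs;
  }

  // Identity for min: the largest representable value of T, so reducing an
  // empty segment yields a well-defined result for any T.
  template <typename T>
  static constexpr T identity()
  {
    return std::numeric_limits<T>::max();
  }
};

// Usage: the identity seeds a (segmented) reduction, e.g.
// cub::DeviceSegmentedReduce::Reduce(..., DeviceMin{},
//                                    DeviceMin::identity<T>(), stream);
```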
Concerns:

The current kernel `gpuMergeColumnStatistics` is launched only once for the entire table, but with `cub::DeviceSegmentedReduce` we'd have one async launch per column. This can be an issue when the table has a high number of columns.
Profiling for feasibility:

As per some preliminary profiling, I found that the cub kernel performs faster for a single 1 GB column than the existing approach. To predict the effect of launching one kernel per column, I tried launching 64 cub kernels totaling 1 GB of data. The total time loses to the single-column scenario but still beats the current approach (3.1 ms vs 5.7 ms). A sketch of how such a multi-launch comparison can be timed follows.
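For reference, here is one hypothetical way such a comparison can be timed (a sketch, not the actual benchmark behind the numbers above): wrap the batch of back-to-back async launches in CUDA events, then compare the 64-launch total against a single launch over the same data. `launch` is a placeholder for one per-column `cub::DeviceSegmentedReduce` call like the one sketched earlier.

```cpp
#include <cuda_runtime.h>

// Times num_cols back-to-back async invocations of launch(col_index) on
// `stream`. `launch` stands in for one per-column segmented reduction.
template <typename LaunchFn>
float time_multi_launch(int num_cols, cudaStream_t stream, LaunchFn launch)
{
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, stream);
  for (int c = 0; c < num_cols; ++c) {
    launch(c);  // async: one segmented reduction over this column's chunks
  }
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);  // wait for the whole batch before reading time

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;  // compare num_cols == 64 vs num_cols == 1 over the same 1 GB
}
```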