[FEA] cuIO Statistics calculation code is redundant #6920
Comments
@devavret assigning to you since you already have the POC implementation. Feel free to hand off if needed.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
Do the recent changes to statistics calculation affect this issue?
Quite possibly. The new statistics code is much simpler and may obviate the need for this.
Addresses #6920. Uses type-dispatched functors to calculate statistics in Parquet and ORC.

Authors:
- Kumar Aatish (https://github.com/kaatish)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)
- Ashwin Srinath (https://github.com/shwina)
- Michael Wang (https://github.com/isVoid)

URL: #8191
@devavret can we close this issue now?
Per @devavret's comment above, closing this. Can reopen a more targeted issue for the few remaining redundancies if needed.
Original issue description:

cuIO has common code for statistics calculation shared between the Parquet writer and the ORC writer. It uses custom logic to perform reductions across chunks of rows, where the chunks are defined by the unit for which the statistics are generated, e.g. pages in the case of Parquet and stripes in the case of ORC.
This can be refactored to use `cub::DeviceSegmentedReduce` with a custom iterator that creates a `statistics_val` from each column element and a custom reduce operator that merges two `statistics_val`s.

We should also consider performing the reduction in the input columns' cudf types rather than in specially mapped output types. Once the reduction is complete, if the format calls for it, we can convert the type while encoding. This would allow us to replace the switch cases written for these dtypes (e.g. here, here, and here) with cudf's type dispatcher. A minimal sketch of the segmented-reduce approach follows.
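The following is a rough sketch of that pattern, under stated assumptions: `statistics_val`, `make_stats`, and `merge_stats` here are simplified hypothetical stand-ins for cuIO's actual statistics types, and a plain `double` column with precomputed chunk offsets (page/stripe boundaries) stands in for the real column data.

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/iterator/transform_iterator.h>
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdint>

// Hypothetical stand-in for cuIO's statistics_val: min/max/count of a chunk.
struct statistics_val {
  double min;
  double max;
  int64_t count;
};

// Functor that builds a statistics_val from a single column element.
struct make_stats {
  __host__ __device__ statistics_val operator()(double v) const
  {
    return statistics_val{v, v, 1};
  }
};

// Reduce operator that merges two statistics_val instances.
struct merge_stats {
  __host__ __device__ statistics_val operator()(statistics_val const& a,
                                                statistics_val const& b) const
  {
    return statistics_val{a.min < b.min ? a.min : b.min,
                          a.max > b.max ? a.max : b.max,
                          a.count + b.count};
  }
};

// d_data: column values; d_offsets: num_segments + 1 chunk boundaries
// (page/stripe boundaries); d_out: one statistics_val per chunk.
void reduce_chunk_stats(double const* d_data,
                        int const* d_offsets,
                        statistics_val* d_out,
                        int num_segments,
                        cudaStream_t stream)
{
  // Transform iterator converts each element into a statistics_val on the fly.
  auto in = thrust::make_transform_iterator(d_data, make_stats{});
  statistics_val identity{DBL_MAX, -DBL_MAX, 0};

  // First call computes required temp storage; second call runs the reduction.
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, in, d_out,
                                     num_segments, d_offsets, d_offsets + 1,
                                     merge_stats{}, identity, stream);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, in, d_out,
                                     num_segments, d_offsets, d_offsets + 1,
                                     merge_stats{}, identity, stream);
  cudaFree(d_temp);
}
```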
This will have the following advantages: it lets us use the `DeviceMin`, `DeviceMax`, and `DeviceSum` operators, which define the min/max/sum operations and, more importantly, the respective identities for all current and future cudf types. A short sketch of this operator-plus-identity pattern follows.
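As a brief illustration of why operator-defined identities matter (names modeled on cudf's device operators; the exact cudf signatures may differ), the identity doubles as the initial value of the reduction, so empty segments and newly added types are handled uniformly:

```cpp
#include <limits>

// Illustrative operator type carrying both the binary operation and its
// identity. Modeled on cudf's DeviceMin; not cudf's exact definition.
struct DeviceMin {
  template <typename T>
  __host__ __device__ T operator()(T const& lhs, T const& rhs) const
  {
    return lhs < rhs ? lhs : rhs;
  }

  // Identity for min: the largest representable value of T, so reducing an
  // empty segment yields a well-defined result for any T.
  template <typename T>
  static constexpr T identity()
  {
    return std::numeric_limits<T>::max();
  }
};

// Usage: the identity seeds a (segmented) reduction, e.g.
// cub::DeviceSegmentedReduce::Reduce(..., DeviceMin{},
//                                    DeviceMin::identity<T>(), stream);
```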
Concerns:

The current kernel `gpuMergeColumnStatistics` is launched only once for the entire table, but with `cub::DeviceSegmentedReduce` we'd have one async launch per column. This can be an issue when the table has a high number of columns.
Profiling for feasibility:

As per some preliminary profiling, I found that the cub kernel performs faster for a single 1 GB column than the existing approach. To predict the effect of launching one kernel per column, I tried launching 64 cub kernels totaling 1 GB of data. The total time loses to the single-column scenario but still beats the current approach (3.1 ms vs 5.7 ms). A sketch of how such a multi-launch comparison can be timed follows.
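For reference, here is one hypothetical way such a comparison can be timed (a sketch, not the actual benchmark behind the numbers above): wrap the batch of back-to-back async launches in CUDA events, then compare the 64-launch total against a single launch over the same data. `launch` is a placeholder for one per-column `cub::DeviceSegmentedReduce` call like the one sketched earlier.

```cpp
#include <cuda_runtime.h>

// Times num_cols back-to-back async invocations of launch(col_index) on
// `stream`. `launch` stands in for one per-column segmented reduction.
template <typename LaunchFn>
float time_multi_launch(int num_cols, cudaStream_t stream, LaunchFn launch)
{
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, stream);
  for (int c = 0; c < num_cols; ++c) {
    launch(c);  // async: one segmented reduction over this column's chunks
  }
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);  // wait for the whole batch before reading time

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;  // compare num_cols == 64 vs num_cols == 1 over the same 1 GB
}
```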