IO statistics cleanup #8191

kaatish · 2021-05-10T06:27:02Z

Addresses #6920

Use type dispatched functors to calculate statistics in Parquet and ORC.
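
For readers unfamiliar with the pattern, here is a minimal Python analogue of a type-dispatched statistics functor. It is a sketch, not the libcudf implementation: `TypeId`, `ChunkStats`, `type_dispatch`, and `calculate_stats` are hypothetical names chosen for illustration, and the real code is C++/CUDA. The point is only that a single statistics routine can serve every column type once the type lookup is factored out.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum, auto
from typing import Any


class TypeId(Enum):
    """Hypothetical stand-in for a column's runtime type id."""
    INT64 = auto()
    FLOAT64 = auto()
    STRING = auto()
    DURATION_MS = auto()


@dataclass
class ChunkStats:
    """Min/max/null-count statistics for one chunk of a column."""
    minimum: Any = None
    maximum: Any = None
    null_count: int = 0


# One entry per type id: how to turn a stored value into its comparable form.
_CONVERTERS = {
    TypeId.INT64: int,
    TypeId.FLOAT64: float,
    TypeId.STRING: str,
    TypeId.DURATION_MS: lambda ms: timedelta(milliseconds=ms),
}


def type_dispatch(type_id, functor, *args):
    """Resolve the runtime type id once, then call the functor with it."""
    try:
        convert = _CONVERTERS[type_id]
    except KeyError:
        raise TypeError(f"unsupported type id: {type_id!r}") from None
    return functor(convert, *args)


def calculate_stats(convert, values):
    """Format-agnostic functor: the same body serves every dispatched type."""
    stats = ChunkStats()
    for value in values:
        if value is None:  # nulls are counted but never compared
            stats.null_count += 1
            continue
        x = convert(value)
        if stats.minimum is None or x < stats.minimum:
            stats.minimum = x
        if stats.maximum is None or x > stats.maximum:
            stats.maximum = x
    return stats


if __name__ == "__main__":
    print(type_dispatch(TypeId.INT64, calculate_stats, [3, None, 1, 7]))
    print(type_dispatch(TypeId.DURATION_MS, calculate_stats, [1500, 250]))
```

In this sketch, supporting a new type means adding one entry to `_CONVERTERS`; nothing in `calculate_stats` has to change, which is the property the PR is after for Parquet and ORC.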

codecov · 2021-05-10T09:30:35Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@9a85b3b).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.06    #8191   +/-   ##
===============================================
  Coverage                ?   82.89%           
===============================================
  Files                   ?      105           
  Lines                   ?    17875           
  Branches                ?        0           
===============================================
  Hits                    ?    14817           
  Misses                  ?     3058           
  Partials                ?        0

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9a85b3b...698dfad.

devavret

There are many usages of stats_dtype in the Parquet and ORC code that can be replaced with the cudf type, but other than that, we can get rid of stats_dtype. 🥳
I'm also curious to know what is preventing the stats calculation code from being format agnostic.

cpp/src/io/statistics/statistics.cuh

cpp/src/io/statistics/statistics_type_identification.cuh

cpp/src/io/statistics/typed_statistics_chunk.cuh

robertmaynard

It might be worthwhile to move each specialization of detail::GatherColumnStatistics<TYPE> to a separate file (GatherOrcColumnStatistics, ...) so that we don't increase the compile times for the different writer_impl.cu files.

vuule

Partial review, still need to figure out some parts of the PR.
Looking great so far!

cpp/src/io/statistics/statistics_type_identification.cuh

cpp/src/io/statistics/temp_storage_wrapper.cuh

cpp/src/io/orc/stats_enc.cu

cpp/src/io/statistics/typed_statistics_chunk.cuh

vuule · 2021-05-12T21:17:22Z

rerun tests

vuule · 2021-05-24T19:54:17Z

python/cudf/cudf/tests/test_parquet.py

@@ -1782,6 +1782,15 @@ def test_parquet_writer_statistics(tmpdir, pdf):
    if "col_category" in pdf.columns:
        pdf = pdf.drop(columns=["col_category", "col_bool"])

+    timedelta_types = [


@devavret should we add duration types to the pdf fixture?

When we talked, the idea was to add these types to the fixture first and hope no other test fails. If any do, make this a local change to the stats test to unblock this PR and file the breakages separately. I suppose @kaatish is going to reveal the tests that broke.

@devavret Yes, that was my experience. Adding timedelta types to the build_pdf function causes tests to fail.

Will open an issue to add duration coverage, and we can go ahead and merge this one as-is. Objections can be filed until CI passes :)

devavret · 2021-05-24T20:00:48Z

Performance impact has been captured here.

I think there is a sizeable improvement that's being hidden by all the other kernels and file/buffer writing. Try running this through nsys and filtering for the gather stats kernel.

vuule · 2021-05-24T20:46:36Z

rerun tests

kaatish · 2021-05-25T02:58:20Z

> Performance impact has been captured here.
>
> I think there is a sizeable improvement that's being hidden by all the other kernels and file/buffer writing. Try running this through nsys and filtering for the gather stats kernel.

| | Time(%) | Total Time (ns) | Average | Minimum | Maximum | Name |
|---|---|---|---|---|---|---|
| Before | 4.4 | 2,488,925,639 | 2,407,084.8 | 2,041,010 | 3,824,900 | gpuGatherColumnStatistics |
| After | 2.4 | 1,370,010,680 | 1,268,528.4 | 1,027,833 | 7,236,049 | gpu_calculate_group_statistics |
| Before | 0.1 | 56,034,794 | 27,096.1 | 6,687 | 195,135 | gpuMergeColumnStatistics |
| After | 0.0 | 19,072,646 | 8,829.9 | 5,023 | 41,664 | gpu_merge_group_statistics |
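
To make the comparison explicit, the ratios can be worked out from the Total Time column of the table above. This is just arithmetic on the reported numbers; the dictionary keys and the `speedup` helper are names made up for this sketch.

```python
# Total kernel time in nanoseconds, copied from the nsys table above.
BEFORE_NS = {"gather": 2_488_925_639, "merge": 56_034_794}
AFTER_NS = {"gather": 1_370_010_680, "merge": 19_072_646}


def speedup(before_ns: int, after_ns: int) -> float:
    """Return how many times faster the new kernel is in total time."""
    return before_ns / after_ns


if __name__ == "__main__":
    for kernel in BEFORE_NS:
        ratio = speedup(BEFORE_NS[kernel], AFTER_NS[kernel])
        print(f"{kernel}: {ratio:.2f}x")
    # gather: 1.82x
    # merge: 2.94x
```

So by total time the gather kernel is roughly 1.8x faster and the merge kernel roughly 2.9x faster, which matches the drop in the Time(%) column.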

vuule · 2021-05-25T17:16:53Z

rerun tests

python/cudf/cudf/tests/test_parquet.py

isVoid

pytest lgtm

galipremsagar · 2021-05-26T16:54:59Z

rerun tests

kkraus14 · 2021-05-26T20:19:58Z

@gpucibot merge

kaatish added 5 commits May 7, 2021 10:18

Initial commit

5fbff9e

Rename files

001d7b2

Add timestamp and duration type conversion

13a458c

Fix block reduce in gather statistics

ff249dc

Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into io-statistics-cleanup

6ba82e9

kaatish added the 2 - In Progress, code quality, libcudf, cuIO, improvement, and non-breaking labels May 10, 2021

kaatish requested review from devavret and vuule May 10, 2021 06:27

kaatish requested review from a team as code owners May 10, 2021 06:27

kaatish self-assigned this May 10, 2021

github-actions bot added the CMake label May 10, 2021

Style fix

b825eb6

devavret suggested changes May 10, 2021

robertmaynard reviewed May 10, 2021

vuule requested changes May 11, 2021

PR review fixes

1f821a4

robertmaynard approved these changes May 12, 2021

Added documentation and addresses reviews

53b13f4

kaatish requested review from vuule and devavret May 14, 2021 20:44

Added test for timedelta types

16a9868

github-actions bot added the Python label May 24, 2021

kaatish requested a review from vuule May 24, 2021 19:47

vuule reviewed May 24, 2021

vuule approved these changes May 24, 2021

Style fix

c59aa42

kaatish requested a review from a team as a code owner May 24, 2021 20:26

kaatish requested review from shwina and isVoid May 24, 2021 20:26

vuule added the 4 - Needs cuDF (Python) Reviewer label and removed the 0 - Waiting on Author label May 24, 2021

shwina reviewed May 25, 2021

python/cudf/cudf/tests/test_parquet.py

PR review fix

698dfad

kaatish requested a review from shwina May 25, 2021 18:45

shwina approved these changes May 25, 2021

vuule added the 5 - Ready to Merge label and removed the 4 - Needs cuDF (Python) Reviewer label May 25, 2021

isVoid approved these changes May 25, 2021

rapids-bot bot merged commit 24e05a0 into rapidsai:branch-21.06 May 26, 2021

kaatish deleted the io-statistics-cleanup branch May 26, 2021 20:26

devavret mentioned this pull request May 27, 2021

[FEA] cuIO Statistics calculation code is redundant #6920

Closed
