IO statistics cleanup #8191

Merged
merged 13 commits into from
May 26, 2021

@kaatish kaatish commented May 10, 2021

Addresses #6920

Use type-dispatched functors to calculate statistics in Parquet and ORC.
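
As a loose illustration of the idea, the sketch below dispatches on a runtime type tag to a per-type statistics routine, so that one entry point could serve both writers. It is a Python analogy only, not the libcudf C++ implementation: the type ids ("int32", "float64", "string"), the ColumnStats record, and the gather_statistics helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Sequence


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics record."""
    minimum: Any
    maximum: Any
    null_count: int


def _numeric_stats(values: Sequence[Optional[float]]) -> ColumnStats:
    # Nulls (None) are counted but excluded from min/max.
    present = [v for v in values if v is not None]
    return ColumnStats(
        min(present, default=None),
        max(present, default=None),
        len(values) - len(present),
    )


def _string_stats(values: Sequence[Optional[str]]) -> ColumnStats:
    # Strings are ordered by their UTF-8 bytes (an assumption of this sketch).
    present = [v for v in values if v is not None]

    def as_bytes(s: str) -> bytes:
        return s.encode("utf-8")

    return ColumnStats(
        min(present, key=as_bytes, default=None),
        max(present, key=as_bytes, default=None),
        len(values) - len(present),
    )


# Dispatch table: runtime type id -> routine specialised for that type.
_DISPATCH: Dict[str, Callable[[Sequence[Any]], ColumnStats]] = {
    "int32": _numeric_stats,
    "float64": _numeric_stats,
    "string": _string_stats,
}


def gather_statistics(type_id: str, values: Sequence[Any]) -> ColumnStats:
    """Select the routine for type_id and run it on values."""
    try:
        functor = _DISPATCH[type_id]
    except KeyError:
        raise TypeError(f"no statistics functor for type {type_id!r}") from None
    return functor(values)


if __name__ == "__main__":
    print(gather_statistics("int32", [3, None, -1, 7]))
    print(gather_statistics("string", ["b", "a", None]))
```

In this shape the format-specific code only has to encode the resulting record; the per-type logic lives in one place.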

@kaatish kaatish added the 2 - In Progress, code quality, libcudf, cuIO, improvement, and non-breaking labels May 10, 2021
@kaatish kaatish requested review from devavret and vuule May 10, 2021 06:27
@kaatish kaatish requested review from a team as code owners May 10, 2021 06:27
@kaatish kaatish self-assigned this May 10, 2021
@github-actions github-actions bot added the CMake label May 10, 2021

codecov bot commented May 10, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@9a85b3b).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.06    #8191   +/-   ##
===============================================
  Coverage                ?   82.89%           
===============================================
  Files                   ?      105           
  Lines                   ?    17875           
  Branches                ?        0           
===============================================
  Hits                    ?    14817           
  Misses                  ?     3058           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9a85b3b...698dfad.

@devavret devavret left a comment

There are many usages of stats_dtype in the Parquet and ORC code that can be replaced with the cudf type; other than that, we can get rid of stats_dtype. 🥳
I'm also curious to know what is preventing the stats calculation code from being format agnostic.

cpp/src/io/statistics/statistics.cuh (outdated, resolved)
cpp/src/io/statistics/statistics_type_identification.cuh (outdated, resolved)
cpp/src/io/statistics/statistics_type_identification.cuh (outdated, resolved)

@robertmaynard robertmaynard left a comment

It might be worthwhile to move each specialization of detail::GatherColumnStatistics<TYPE> to a separate file (GatherOrcColumnStatistics, ...) so that we don't increase the compile times for the different writer_impl.cu files.

@vuule vuule left a comment

Partial review, still need to figure out some parts of the PR.
Looking great so far!

cpp/src/io/statistics/statistics_type_identification.cuh (outdated, resolved)
cpp/src/io/statistics/statistics_type_identification.cuh (outdated, resolved)
cpp/src/io/statistics/temp_storage_wrapper.cuh (outdated, resolved)
cpp/src/io/orc/stats_enc.cu (resolved)

vuule commented May 12, 2021

rerun tests

@kaatish kaatish requested review from vuule and devavret May 14, 2021 20:44
@github-actions github-actions bot added the Python label May 24, 2021
@kaatish kaatish requested a review from vuule May 24, 2021 19:47
@@ -1782,6 +1782,15 @@ def test_parquet_writer_statistics(tmpdir, pdf):
    if "col_category" in pdf.columns:
        pdf = pdf.drop(columns=["col_category", "col_bool"])

    timedelta_types = [

@devavret should we add duration types to the pdf fixture?

When we talked, the idea was to add these types to the fixture first and hope no other tests fail. If any do, make this a local change to the stats test to unblock this PR and file the breakages separately. I suppose @kaatish is going to reveal the tests that broke.

@devavret Yes, that was my experience. Adding timedelta types to the build_pdf function causes tests to fail.

Will open an issue to add duration coverage, and we can go ahead and merge this one as-is. Objections can be filed until CI passes :)

@devavret

Performance impact has been captured here.

I think there is a sizeable improvement that's being hidden by all the other kernels and file/buffer writing. Try running this through nsys and filtering for the gather stats kernel.

@kaatish kaatish requested a review from a team as a code owner May 24, 2021 20:26
@kaatish kaatish requested review from shwina and isVoid May 24, 2021 20:26

vuule commented May 24, 2021

rerun tests

@vuule vuule added the 4 - Needs cuDF (Python) Reviewer label and removed the 0 - Waiting on Author label May 24, 2021

kaatish commented May 25, 2021

Performance impact has been captured here.

I think there is a sizeable improvement that's being hidden by all the other kernels and file/buffer writing. Try running this through nsys and filtering for the gather stats kernel.

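|        | Time(%) | Total Time (ns) | Average     | Minimum   | Maximum   | Name                           |
|--------|---------|-----------------|-------------|-----------|-----------|--------------------------------|
| Before | 4.4     | 2,488,925,639   | 2,407,084.8 | 2,041,010 | 3,824,900 | gpuGatherColumnStatistics      |
| After  | 2.4     | 1,370,010,680   | 1,268,528.4 | 1,027,833 | 7,236,049 | gpu_calculate_group_statistics |
| Before | 0.1     | 56,034,794      | 27,096.1    | 6,687     | 195,135   | gpuMergeColumnStatistics       |
| After  | 0.0     | 19,072,646      | 8,829.9     | 5,023     | 41,664    | gpu_merge_group_statistics     |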
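
As a rough check on the figures above, the total-time ratios can be computed directly. This is a minimal Python sketch using only the totals reported in the table; the dictionary layout and the helper names speedup and summarize are illustrative and not part of the PR.

```python
from typing import Dict

# Total kernel time in nanoseconds, copied from the nsys summary above.
TOTAL_NS: Dict[str, Dict[str, int]] = {
    "gather": {"before": 2_488_925_639, "after": 1_370_010_680},
    "merge": {"before": 56_034_794, "after": 19_072_646},
}


def speedup(before_ns: int, after_ns: int) -> float:
    """Return the ratio of old total time to new total time."""
    return before_ns / after_ns


def summarize(totals: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Map each kernel name to its total-time speedup."""
    return {
        name: speedup(t["before"], t["after"]) for name, t in totals.items()
    }


if __name__ == "__main__":
    for name, ratio in summarize(TOTAL_NS).items():
        print(f"{name}: {ratio:.2f}x")
```

On these totals the statistics-gathering kernel comes out roughly 1.8x faster and the merge kernel roughly 2.9x faster. Note that the maximum single-launch time reported for the new gather kernel is higher than before.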

vuule commented May 25, 2021

rerun tests

@kaatish kaatish requested a review from shwina May 25, 2021 18:45
@vuule vuule added the 5 - Ready to Merge label and removed the 4 - Needs cuDF (Python) Reviewer label May 25, 2021

@isVoid isVoid left a comment

pytest lgtm

@galipremsagar

rerun tests

@kkraus14

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 24e05a0 into rapidsai:branch-21.06 May 26, 2021
@kaatish kaatish deleted the io-statistics-cleanup branch May 26, 2021 20:26
Labels
5 - Ready to Merge, CMake, cuIO, improvement, libcudf, non-breaking, Python
8 participants