-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IO statistics cleanup #8191
IO statistics cleanup #8191
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-21.06 #8191 +/- ##
===============================================
Coverage ? 82.89%
===============================================
Files ? 105
Lines ? 17875
Branches ? 0
===============================================
Hits ? 14817
Misses ? 3058
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's many usages of stats_dtype
in parquet and orc code that can be replaced with cudf type but other than that, we can get rid of stats_dtype. 🥳
I'm also curious to know what is causing the prevention of stats calculation code to be format agnostic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worthwhile moving each specialization of detail::GatherColumnStatistics< TYPE >
to a separate file ( GatherOrcColumnStatistics, ... ) so that we don't increase the compile times for the different writer_impl.cu
files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review, still need to figure out some parts of the PR.
Looking great so far!
rerun tests |
@@ -1782,6 +1782,15 @@ def test_parquet_writer_statistics(tmpdir, pdf): | |||
if "col_category" in pdf.columns: | |||
pdf = pdf.drop(columns=["col_category", "col_bool"]) | |||
|
|||
timedelta_types = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devavret should we add duration types to the pdf fixture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we talked, the idea was to add these types to the fixture first and hope no other test fails. If they do then make this a local change to the stats test to unblock this PR and file the breakages separately. I suppose @kaatish is going to reveal the tests that broke.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devavret Yes, that was my experience. Adding timedelta types to the build_pdf function causes tests to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will open an issue to add duration coverage, and we can go ahead and merge this one as-is. Objections can be filed until CI passes :)
I think there is sizeable improvement that's being hidden by all the other kernels and file/buffer writing. Try running this through nsys and filtering the gather stats kernel. |
rerun tests |
|
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest lgtm
rerun tests |
@gpucibot merge |
Addresses #6920
Use type dispatched functors to calculate statistics in Parquet and ORC.