-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce cudf library size #7583
Reduce cudf library size #7583
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmake lgtm
The runtime difference in benchmarks is a little interesting. I'd be curious to see results from a few other benchmarks. |
Given the background in #3044 I can run the Dispatcher benchmark, anything else you are interested in? |
Present day Jake is less convinced than past Jake that the dispatcher benchmark will be revelatory. There's no actual work being done in any of the device code in the benchmark that would be impacted. I'd maybe try some big ticket ones like join or sort. |
JOIN_BENCH
SORT_BENCH
TYPE_DISPATCHER_BENCH
|
So those results just look like noise to me. Roughly +/- 5%. I think it's safe to conclude there is no runtime performance impact. |
How can I verify that |
Try a debug build without it? |
The reduction bench results are a bit concerning -- they look much less like noise than the other benchmarks. |
To say more definitively, if you collect 10 iterations of each benchmark, you can use the |
0ae9aa8
to
83248f3
Compare
Here is the comparison of running the reduction benchmark 10 times. https://gist.github.com/robertmaynard/2cfb4ba27afcd7e52b2deccce3b78c29#file-results-txt Given the size, I have also uploaded the benchmark data so people can use |
@robertmaynard compare.py is capable of producing human-readable (human-pastable) output, not just JSON, so you don't have to make us go run it ourselves. :) |
Here is the raw compare results: |
What units are the time values in for the benchmarks? milliseconds, microseconds, nanoseconds? |
Interesting the |
rerun tests |
I think this is non-breaking even with the change to release_assert -- breaks should only be internal. |
|
Oh yeah, this test is going to fail since |
83248f3
to
1943e4e
Compare
@gpucibot merge |
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7583 +/- ##
===============================================
+ Coverage 81.86% 82.47% +0.60%
===============================================
Files 101 101
Lines 16884 17397 +513
===============================================
+ Hits 13822 14348 +526
+ Misses 3062 3049 -13
Continue to review full report at Codecov.
|
By explicitly telling nvcc's fatbin pass to always compress device code we can ensure that our binaries are the smallest possible size. See rapidsai/cudf#7583 for additional context.
By explicitly telling nvcc's fatbin pass to always compress device code we can ensure that our binaries are the smallest possible size. See rapidsai/cudf#7583 for additional context.
By explicitly telling nvcc's fatbin pass to always compress device code we can ensure that our binaries are the smallest possible size. See rapidsai/cudf#7583 for additional context. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Rick Ratzel (https://github.com/rlratzel) - Brad Rees (https://github.com/BradReesWork) URL: #1503
By explicitly telling nvcc's fatbin pass to always compress device code we can ensure that our binaries are the smallest possible size. See rapidsai/cudf#7583 for additional context. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #3702
Decrease library size by using fatbin compression. cuDF PR for context: rapidsai/cudf#7583 Authors: - Dillon Cullinan (https://github.com/dillon-cullinan) Approvers: - Paul Taylor (https://github.com/trxcllnt) - Keith Kraus (https://github.com/kkraus14) URL: #373
This PR combines two major changes that improve cudf's final binary size.
The optimizations are:
__FILE__
,__LINE__
, etc when in release mode.With this PR we take cudf when built for
ALL
archs from1.3GB
to329MB
.Since this has such a dramatic change, I did some analysis on library sizes, compile times, runtime startup times, and
runtime performance.
Library sizes
We see that compression and the removal of the asserts while staying with whole compilation ( no
-rdc
) is best for performance. RDC taking more space makes sense to me as it has a less agressive optimizer and therefore generates moresass/ptx.
Compile times
Compile times are interesting as we see that compression has no negative effect. I expect that the reduced IO ( from smaller binaries ) offsets the time spent compressing.
Runtime starup times for
all archs
:The goal here was to time the tests to see if we had any measurable differences.
The
DISPATCHER_TEST * 12
was selected as the runtime~0.3s
would hopefully highligh any performance differences, and therefore running it 12 times wouldmagnify the difference and wash out any system load impacts.
I think from these rather crude measurements we are safe to assume no massive runtime loading time costs.
Runtime performance for
all archs
:I used the REDUCTION_BENCH as my baseline for detecting any performance changes. The benchmarks executed on a V100 on a shared lab machine.
REDUCTION_BENCH ./compare benchmarks branch_0.19 this_compression_branch
It looks like the compression has a minor cost? I would have thought the removal of the
release_assert
would have improved performance. But maybereduce
isn't using it?