[BUG] Resolve parquet reader performance regression on V100 from #14167 #14415
Comments
I collected some PTX from I wrote a script to reduce nuisance diffs from register counts and code block IDs, and this diff was left: Possibly relevant? NVIDIA/cccl#1001
I did some quick testing on a V100 and the performance hotspot appears to be
I continued testing the diff in #14167 and found that commenting out these two calls to
Since the effect only appears in the However, this observation also suggests that the libcudf benchmark regressions on V100 may NOT have the same root cause as the Spark-RAPIDS NDS regressions on A100 (because NDS does not have list types!!). @mattahrens the observation of performance issues on V100 only for list types makes getting an A100 libcudf repro even more important!
Expanding on my last comment about
Please note that this code path is never reached (verified with
Potentially fixes #14415
Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- David Wendt (https://github.com/davidwendt)
URL: #14706
Describe the bug
As a side effect of #14167 (see 23.10 release), we observed parquet reader benchmarks running about 10-15% slower on DGX V100. We did not observe this effect on DGX A100. However, Spark-RAPIDS reported a 4-5% slowdown in the NDS benchmarking suite, driven by changes in IO-bound benchmarks.
The changes in #14167 are not expected to impact performance at all. The difference in libcudf nvbenchmarks on V100 could come from a change in compiler code generation, and the difference in Spark-RAPIDS NDS on A100 could relate to the multi-threaded PTDS (per-thread default stream) workflow in NDS.
This issue documents the performance data and results of investigations into the root cause.
Steps/Code to reproduce bug
On a DGX V100, you can see the difference using this benchmark command:
On commit b789d4ce3c090a3f25a8657d9a8582a1edb54f12 we see a time of 1.376s.
On commit 2c19bf328ffefb97d17e5ae600197a4ea9ca4445 we see a time of 1.572s.
The difference is driven by longer execution time of the gpuDecodePageKernel, as observed in nsys profiling.
Possibly unrelated background includes the V100-only performance issue in #12577.
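As a quick sanity check on the two timings reported for these commits, the relative slowdown can be computed directly. This is only a minimal sketch: the helper name relative_slowdown_pct is hypothetical and is not part of libcudf or its benchmark tooling.

```python
def relative_slowdown_pct(before_s: float, after_s: float) -> float:
    """Return how much slower `after_s` is than `before_s`, in percent."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return (after_s - before_s) / before_s * 100.0


# Timings reported in this issue for the two commits (seconds).
time_b789d4c = 1.376  # commit b789d4ce...
time_2c19bf3 = 1.572  # commit 2c19bf32...

slowdown = relative_slowdown_pct(time_b789d4c, time_2c19bf3)
print(f"{slowdown:.1f}% slower")  # prints "14.2% slower"
```

The result falls within the 10-15% range quoted in the bug description for DGX V100.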
Nsys profiles:
nsys profiles before and after.zip