[BUG] Performance regression in cuDF after #14679 #14886
@SurajAralihalli would you do a quick check to see if there is a regression in our groupby nvbenchmarks?
These look to be simple sum aggregations on either uint64_t columns or int64_t columns. The actual Spark type is a decimal, which we chunk into 4 components to perform overflow checking, but cuDF is operating on uint64_t/int64_t only here.
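For context on the chunking approach mentioned above, a minimal illustrative sketch of the general idea (this is not the actual spark-rapids code; the Int128/ChunkSums types and sum_in_chunks function are hypothetical): each 128-bit decimal value is split into four 32-bit chunks and each chunk is summed into its own 64-bit accumulator, so the per-chunk sums can later be recombined with carries to check the full result for overflow.

#include <cstdint>
#include <vector>

struct Int128 { uint64_t lo; uint64_t hi; };          // hypothetical 128-bit container

struct ChunkSums { uint64_t c[4] = {0, 0, 0, 0}; };    // 64-bit sum of each 32-bit chunk

ChunkSums sum_in_chunks(std::vector<Int128> const& values)
{
  ChunkSums s;
  for (auto const& v : values) {
    s.c[0] += static_cast<uint32_t>(v.lo);        // bits  0..31
    s.c[1] += static_cast<uint32_t>(v.lo >> 32);  // bits 32..63
    s.c[2] += static_cast<uint32_t>(v.hi);        // bits 64..95
    s.c[3] += static_cast<uint32_t>(v.hi >> 32);  // bits 96..127
  }
  return s;
}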
I don't observe a significant variance in the groupby nvbenchmarks. I'll investigate further.
I have a repro case outside of Spark just using the cudf Java APIs where an aggregation call is taking over 15 seconds. I'll see if I can get this boiled down to a pure C++ repro.
For reference, here are the repro steps I have using just the cudf Java APIs (still working on the C++-only repro). Attached is a Parquet file with the data used in this test, which consists of two columns: a string column and a UINT32 column.

import ai.rapids.cudf.*;

// Read the attached Parquet file, group by the string column (index 0),
// and sum the UINT32 column (index 1).
Table t = Table.readParquet(new java.io.File("testdata.parquet"));
Table.GroupByOperation g = t.groupBy(0);
Table result = g.aggregate(GroupByAggregation.sum().onColumn(1));

The test simply groups by the string column and does a sum aggregation on the UINT32 column. It takes a few milliseconds to run this before #14679 and over 15 seconds afterwards.
Finally got the C++ repro for this. This program runs sub-second before #14679 and takes tens of seconds afterwards:

#include <cudf/groupby.hpp>
#include <cudf/io/parquet.hpp>

#include <memory>
#include <vector>

int main(int argc, char** argv) {
  // Read the attached Parquet file (a string column and a UINT32 column).
  auto read_opts = cudf::io::parquet_reader_options_builder(cudf::io::source_info{"./testdata.parquet"}).build();
  auto read_result = cudf::io::read_parquet(read_opts);
  auto t = read_result.tbl->view();

  // Group by the string column and request a SUM aggregation of the UINT32 column.
  cudf::groupby::groupby grouper(cudf::table_view({t.column(0)}), cudf::null_policy::INCLUDE, cudf::sorted::NO);
  std::vector<cudf::groupby::aggregation_request> requests;
  requests.emplace_back(cudf::groupby::aggregation_request());
  requests[0].values = t.column(1);
  requests[0].aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto result = grouper.aggregate(requests);
  return 0;
}
Thank you @jlowe for sharing this clear repro. Would you please share a bit about the contents of the parquet input file? I pulled together a snapshot of our recent microbenchmarks and they do not show signs of a large regression. There must be a case we are missing.
I didn't closely examine the contents of the parquet file. This is just data captured in the middle of query execution from an NDS run at scale factor 100g, just before a long aggregation call. The original data had 9 grouping columns and 5 aggregation columns, but I was still able to reproduce it with just one grouping column and one aggregation column. I did not try reducing the number of rows. It may not be sensitive to the data at all, but I wanted to post what I had so far.
Also note that the file is attached at this comment: #14886 (comment), so anyone can download the data, compile the test program, and reproduce the results.
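For anyone who wants to inspect the attached file's schema before running the repro, here is a minimal sketch that uses the same cudf Parquet reader as the repro and prints each column's libcudf type id (the file path matches the repro above; nothing below is a claim about the file's actual contents):

#include <cudf/io/parquet.hpp>
#include <cudf/types.hpp>

#include <iostream>

int main() {
  auto opts = cudf::io::parquet_reader_options_builder(cudf::io::source_info{"./testdata.parquet"}).build();
  auto result = cudf::io::read_parquet(opts);
  auto view = result.tbl->view();
  // Print the cudf::type_id of every column in the file.
  for (cudf::size_type i = 0; i < view.num_columns(); ++i) {
    std::cout << "column " << i << ": cudf type id " << static_cast<int32_t>(view.column(i).type().id()) << "\n";
  }
  return 0;
}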
Thanks Jason, the performance regression can be seen in the bench_groupby_nvsum2 benchmarks for unsigned aggregation columns.
This pull request reverts the modifications that PR [14679](#14679) made to the sum/product aggregation target type, ensuring it always produces int64. Those changes led to degraded performance when the aggregation column had an unsigned type. Additional details can be found in issue [14886](#14886).

Authors:
- Suraj Aralihalli (https://github.com/SurajAralihalli)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)
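To make the reverted behavior concrete, here is a standalone, illustrative-only sketch of the type mapping in question (these aliases are hypothetical, not cudf's real target_type machinery): #14679 made sums over unsigned integral columns accumulate into uint64_t, and this revert restores int64_t as the accumulator for all integral inputs.

#include <cstdint>
#include <type_traits>

// Accumulator type for SUM with the revert applied: every integral input sums into int64_t.
template <typename Source>
using sum_target_with_revert =
  std::conditional_t<std::is_integral_v<Source>, int64_t, Source>;

// Accumulator type as changed by #14679: unsigned integral inputs sum into uint64_t instead.
template <typename Source>
using sum_target_with_14679 =
  std::conditional_t<std::is_integral_v<Source>,
                     std::conditional_t<std::is_unsigned_v<Source>, uint64_t, int64_t>,
                     Source>;

static_assert(std::is_same_v<sum_target_with_revert<uint32_t>, int64_t>);
static_assert(std::is_same_v<sum_target_with_14679<uint32_t>, uint64_t>);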
I believe this is safe to close
…operators to detail namespace. (#14962)

This PR does a thorough refactoring of `device_atomics.cuh`.
- I moved all atomic-related functions to `cudf::detail::` (making this an API-breaking change, but most likely a low-impact break).
- I added all missing operators for natively supported types to `atomicAdd`, `atomicMin`, `atomicMax`, etc., as discussed in #10149 and #14907.
  - This should prevent fallback to the `atomicCAS` path for types that are natively supported for those atomic operators, which we suspect is the root cause of the performance regression in #14886.
- I kept `atomicAdd` rather than `cudf::detail::atomic_add` in locations where a native CUDA overload exists, and the same for min/max/CAS operations. Aggregations are the only place where we use the special overloads. We were previously calling the native CUDA function rather than our special overloads in many cases, so I retained the previous behavior. This avoids including the additional headers that implement an unnecessary level of wrapping for natively supported overloads.
- I enabled native 2-byte CAS operations (on `unsigned short int`) that eliminate the do-while loop and extra alignment-checking logic.
  - The CUDA docs don't state this, but some forum posts claim this is only supported by compute capability 7.0+. We now have 7.0 as a lower bound for RAPIDS, so I'm not concerned by this as long as builds/tests pass.
- I improved/cleaned the documentation and moved around some code so that the operators were in a logical order.
- I assessed the existing tests and it looks like all the types are being covered. I'm not sure if there is a good way to enforce that certain types (like `uint64_t`) are passing through native `atomicAdd` calls.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Suraj Aralihalli (https://github.com/SurajAralihalli)

URL: #14962
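As a rough illustration of why the atomicCAS fallback matters (a generic CUDA sketch, not the cudf::detail implementation), compare a native hardware atomic add against the compare-and-swap loop that a generic fallback path uses; under contention the loop may retry many times per update:

// Native overload: a single hardware atomic instruction per update.
__device__ void add_native(unsigned long long* addr, unsigned long long v)
{
  atomicAdd(addr, v);
}

// Generic fallback: retry a compare-and-swap until no other thread modified the
// value in between. Each retry is another round trip through the memory system.
__device__ void add_via_cas_loop(unsigned long long* addr, unsigned long long v)
{
  unsigned long long old = *addr;
  unsigned long long assumed;
  do {
    assumed = old;
    old     = atomicCAS(addr, assumed, assumed + v);
  } while (assumed != old);
}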
We are seeing some really big regressions in our NDS benchmarks after this PR went in: #14679. We verified that reverting this commit fixes the regression.
When running an nsys trace and looking at the results, the time is all spent in the groupby, specifically in this for_each call: