[FEA] Improve cudf::gather scalability as number of columns increases #13509

Open
abellina opened this issue Jun 5, 2023 · 3 comments
Labels: 0 - Backlog, feature request, libcudf, Performance, Spark

abellina commented Jun 5, 2023

As the number of columns passed to cudf::gather increases (with the same gather map), the number of kernels launched grows proportionally and the runtime increases linearly. We are wondering if there are better ways to group or "batch" these calls so we perform fewer kernel invocations that each do more work at once, in hopes of amortizing some of the per-launch cost for tables with many columns or deeply nested schemas.

A very simple example is below. It creates an int32_t column of 10 rows and adds N copies of it as children of a struct column, where N doubles from 2 to 1024:

#include <cudf/copying.hpp>
#include <cudf/table/table.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <nvtx3/nvToolsExt.h>

#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
  rmm::mr::cuda_memory_resource cuda_mr{};
  rmm::mr::pool_memory_resource mr{&cuda_mr};
  rmm::mr::set_current_device_resource(&mr);
  using col_t = cudf::test::fixed_width_column_wrapper<int32_t>;

  auto const values = std::vector<int32_t>{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  for (int num_cols = 2; num_cols <= 1024; num_cols *= 2) {
    // Build a struct column with num_cols identical int32_t children of 10 rows each.
    std::vector<std::unique_ptr<cudf::column>> members(num_cols);
    for (auto i = 0; i < num_cols; ++i) {
      auto wrapper = col_t(values.begin(), values.end());
      members[i] = wrapper.release();
    }
    auto struct_col = cudf::test::structs_column_wrapper(std::move(members));
    auto gather_map = std::vector<cudf::size_type>{1};  // gather 1 row

    std::stringstream msg;
    msg << "gather num_cols=" << num_cols;
    nvtxRangePush(msg.str().c_str());
    auto result = cudf::gather(
      cudf::table_view{{struct_col}},
      cudf::test::fixed_width_column_wrapper<int32_t>(gather_map.begin(), gather_map.end()),
      cudf::out_of_bounds_policy::NULLIFY);
    nvtxRangePop();

    std::cout << "Result: rows: " << result->num_rows() << " cols: " << result->num_columns()
              << std::endl;
  }
  return 0;
}

As the column count increases by 2x, the gather kernel takes 2x longer:

[Screenshot from 2023-06-05 10-18-52]

A similar argument can be made for nested columns, such as arrays of structs (each with array members): the number of calls into the underlying CUB primitives can increase drastically.

I am filing this issue to solicit comments/patches on how we could improve this behavior.


abellina commented Jun 5, 2023

I also believe that improving the performance of gather will help with copy_if (for non-fixed-width types; for fixed-width types it looks like we implement our own scatter kernel). copy_if is another kernel with very similar behavior. I think we can discuss what we want here and then target copy_if as a follow-on, given what we learn with gather.


nvdbaranec commented Jun 5, 2023

If we ignore lists and strings for the moment, I think it would be pretty easy to put together a proof-of-concept for doing batched fixed width gathers as a single kernel invocation (well, maybe 2 - one more for validity).

Strings are probably not too hard of an extension. Lists would definitely be tricky; I'd have to wrap my head around the list gather stuff to remember :)
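
For what it's worth, here is a minimal sketch of what such a batched fixed-width gather could look like as a single kernel launch (this is not how libcudf implements gather today; the kernel name and pointer-array arguments are hypothetical). It assumes every column in the batch shares the same fixed-width type and it ignores validity and the out-of-bounds NULLIFY policy:

#include <cudf/types.hpp>

#include <cstdint>

// Hypothetical proof-of-concept: gather num_cols fixed-width columns of the same
// type with one launch instead of one launch per column. Null masks and the
// out-of-bounds policy are ignored; validity could be a second, similar kernel.
template <typename T>
__global__ void batched_gather_kernel(T const* const* in_cols,            // [num_cols] device pointers
                                      T* const* out_cols,                 // [num_cols] device pointers
                                      cudf::size_type const* gather_map,  // [num_out_rows]
                                      cudf::size_type num_cols,
                                      cudf::size_type num_out_rows)
{
  // Flatten (column, row) into one index space so a single launch covers all columns.
  auto const total = static_cast<std::int64_t>(num_cols) * num_out_rows;
  for (auto i = static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x; i < total;
       i += static_cast<std::int64_t>(blockDim.x) * gridDim.x) {
    auto const col = static_cast<cudf::size_type>(i / num_out_rows);
    auto const row = static_cast<cudf::size_type>(i % num_out_rows);
    out_cols[col][row] = in_cols[col][gather_map[row]];
  }
}

A launch like this pays the kernel-launch overhead once for all fixed-width columns of a given type, which is exactly the cost the repro above shows growing linearly with the column count.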


bdice commented Jun 5, 2023

At one point cudf (possibly before libcudf!) used a stream pool for gather operations. Each gather is independent, so we can launch all the kernels on separate streams and synchronize them with an event on the input stream. I would love to reimplement this approach and see if it can improve the performance. See also #12086.
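
As a rough sketch of that fork/join pattern, using rmm::cuda_stream_pool (gather_one_column below is a hypothetical per-column gather entry point that accepts a stream):

#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <vector>

// Each pool stream first waits on the work already queued on the input stream,
// the per-column gathers then run concurrently on the pool streams, and the
// input stream finally waits on all of them so downstream work sees completed results.
void gather_columns_concurrently(std::size_t num_columns,
                                 rmm::cuda_stream_view input_stream,
                                 rmm::cuda_stream_pool& pool)
{
  // Fork: record an event on the input stream and make every pool stream wait on it.
  cudaEvent_t fork_event{};
  cudaEventCreateWithFlags(&fork_event, cudaEventDisableTiming);
  cudaEventRecord(fork_event, input_stream.value());

  std::vector<rmm::cuda_stream_view> streams;
  streams.reserve(num_columns);
  for (std::size_t i = 0; i < num_columns; ++i) {
    auto stream = pool.get_stream();
    cudaStreamWaitEvent(stream.value(), fork_event, 0);
    // gather_one_column(/* column i, gather map, */ stream);  // hypothetical per-column gather
    streams.push_back(stream);
  }

  // Join: the input stream waits on an event recorded on each pool stream.
  for (auto stream : streams) {
    cudaEvent_t join_event{};
    cudaEventCreateWithFlags(&join_event, cudaEventDisableTiming);
    cudaEventRecord(join_event, stream.value());
    cudaStreamWaitEvent(input_stream.value(), join_event, 0);
    cudaEventDestroy(join_event);  // safe: destruction is deferred until the event completes
  }
  cudaEventDestroy(fork_event);
}

Note that this doesn't reduce the number of kernel launches; it only overlaps them, so the benefit presumably depends on how much of the per-column time is launch latency versus kernel execution.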
