[FEA] Improve cudf::gather scalability as number of columns increases #13509

abellina · 2023-06-05T15:24:45Z

As the number of columns passed to cudf::gather increases (with the same gather map), the number of kernels launched increases proportionally and the runtime grows linearly. We are wondering whether there are better ways to group or "batch" these calls so that we perform fewer kernel invocations that each do more work, in the hope of amortizing some of the cost for tables with many columns or deeply nested schemas.

A very simple example is below. It creates a column of 10 int32_t rows and adds it to a struct N times (where N is a power of two between 2 and 1024):

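#include <cudf/column/column.hpp>
#include <cudf/copying.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <nvtx3/nvToolsExt.h>

#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>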

int main(int argc, char** argv)
{
  rmm::mr::cuda_memory_resource cuda_mr{};
  rmm::mr::pool_memory_resource mr{&cuda_mr};
  rmm::mr::set_current_device_resource(&mr);
  using col_t = cudf::test::fixed_width_column_wrapper<int32_t>;

  auto const values = std::vector<int32_t>{1,2,3,4,5,6,7,8,9,10};
  for (int num_cols = 2; num_cols <= 1024; num_cols *= 2) {
    std::vector<std::unique_ptr<cudf::column>> members(num_cols);
    for (auto i = 0; i < num_cols; ++i) {
      auto wrapper = col_t(values.begin(), values.end());
      members[i] = wrapper.release();
    }
    auto struct_col = cudf::test::structs_column_wrapper(std::move(members));
    auto gather_map = std::vector<cudf::offset_type>{1}; // gather 1 row
    std::stringstream msg;
    msg << "gather num_cols=" << num_cols;  // label the NVTX range with the column count
    nvtxRangePush(msg.str().c_str());
    auto result = cudf::gather(
      cudf::table_view{{struct_col}}, 
      cudf::test::fixed_width_column_wrapper<int32_t>(gather_map.begin(), gather_map.end()),
      cudf::out_of_bounds_policy::NULLIFY);
    nvtxRangePop();
    std::cout << "Result: rows: " << result->num_rows() << " cols: " << result->num_columns() << std::endl;

  }
  return 0;
}

Each time the column count doubles, the gather takes 2x longer.

A similar argument can be made for columns with nested types, such as arrays of structs (each with array members): the number of underlying cub calls can increase drastically.

I am filing this issue to solicit comments/patches to see how we could improve this behavior.


abellina · 2023-06-05T15:28:10Z

I also believe that improving the performance of gather will help with copy_if (for non-fixed-width types; for fixed-width types it looks like we implement our own scatter kernel). copy_if is another kernel with very similar behavior. I think we can discuss what we want here and target copy_if as a follow-on, given what we learn from gather.

nvdbaranec · 2023-06-05T15:31:00Z

If we ignore lists and strings for the moment, I think it would be pretty easy to put together a proof-of-concept for doing batched fixed width gathers as a single kernel invocation (well, maybe 2 - one more for validity).

Strings are probably not too hard an extension. Lists would definitely be tricky. I'd have to wrap my head around the list gather code again to remember :)
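
To make the batching idea above concrete, here is a minimal CPU-only sketch in plain C++. It is not libcudf code and not a proposed implementation: `column`, `gathered_column`, and `batched_gather` are hypothetical names invented for illustration, and the single loop over a flattened (column, row) index stands in for the thread index of one kernel launch. Out-of-range map entries produce nulls, mirroring the NULLIFY policy used in the repro. For brevity the sketch writes validity in the same pass, whereas the comment above suggests a real version might use a second launch for validity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a fixed-width column of int32_t; not a cudf type.
using column = std::vector<std::int32_t>;

// Hypothetical output column: values plus a per-row validity flag.
struct gathered_column {
  std::vector<std::int32_t> data;
  std::vector<bool> valid;
};

// Gathers every column with the same map in one pass. The loop counter `flat`
// plays the role a global thread index would play in a single batched kernel,
// so the number of "launches" stays at one no matter how many columns exist.
std::vector<gathered_column> batched_gather(std::vector<column> const& columns,
                                            std::vector<std::int32_t> const& gather_map)
{
  std::size_t const num_cols = columns.size();
  std::size_t const num_out  = gather_map.size();

  // All columns of a table are assumed to have the same number of rows.
  for (std::size_t c = 1; c < num_cols; ++c) {
    assert(columns[c].size() == columns[0].size());
  }

  gathered_column const empty{std::vector<std::int32_t>(num_out, 0),
                              std::vector<bool>(num_out, false)};
  std::vector<gathered_column> out(num_cols, empty);

  for (std::size_t flat = 0; flat < num_cols * num_out; ++flat) {
    std::size_t const col = flat / num_out;  // which column this element belongs to
    std::size_t const row = flat % num_out;  // which output row within that column
    std::int32_t const src = gather_map[row];
    bool const in_bounds =
      src >= 0 && static_cast<std::size_t>(src) < columns[col].size();
    if (in_bounds) {
      out[col].data[row]  = columns[col][static_cast<std::size_t>(src)];
      out[col].valid[row] = true;
    }
    // Out-of-bounds entries keep valid == false, i.e. they are nullified.
  }
  return out;
}
```

For example, calling `batched_gather` on four copies of the 10-row column from the repro with the map `{1, 10}` yields, for each column, a valid first row holding the value at index 1 and a null second row, because index 10 is out of bounds.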

bdice · 2023-06-05T16:15:40Z

At one point cudf (possibly before libcudf!) used a stream pool for gather operations. Each gather is independent, so we can launch all the kernels on separate streams and synchronize them with an event on the input stream. I would love to reimplement this approach and see if it can improve the performance. See also #12086.

abellina added the labels feature request (New feature or request), Needs Triage (Need team to review and classify), Spark (Functionality that helps Spark RAPIDS), and Performance (Performance related issue) on Jun 5, 2023

GregoryKimball added the labels 0 - Backlog (In queue waiting for assignment) and libcudf (Affects libcudf (C++/CUDA) code.) and removed the label Needs Triage (Need team to review and classify) on Jun 26, 2023

GregoryKimball added this to libcudf Jun 26, 2023

GregoryKimball added this to the Stabilizing large workflows (OOM, spilling, partitioning) milestone Jun 26, 2023

bdice modified the milestones: Stabilizing large workflows (OOM, spilling, partitioning), Enable streams Jun 29, 2023

GregoryKimball mentioned this issue Aug 23, 2023: [FEA] Introduce the pylibcudf API and subpackage #13921 (Closed)

bdice mentioned this issue Aug 29, 2023: Global stream pool #13922 (Merged, 3 tasks)

bdice mentioned this issue Sep 21, 2023: Use stream pool for gather/scatter. #14162 (Draft, 3 tasks)

GregoryKimball removed this from libcudf Oct 26, 2023

vyasr mentioned this issue Feb 27, 2024: [FEA] Expose stream-ordered APIs in pylibcudf #15163 (Open)

ttnghia mentioned this issue Aug 2, 2024: [FEA] Support batch construction of strings columns #16486 (Closed)

GregoryKimball added this to libcudf Oct 23, 2024

GregoryKimball moved this to Needs owner in libcudf Oct 23, 2024
