Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example code for blog on new row comparators #13795

Merged
merged 42 commits into from
Nov 16, 2023

Conversation

divyegala
Copy link
Member

@divyegala divyegala commented Aug 1, 2023

Description

Example code using a few libcudf APIs to demonstrate nested-type usage.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 1, 2023
@karthikeyann
Copy link
Contributor

@divyegala which is a good location for adding code to round tripping data using JSON reader and writer?

@karthikeyann karthikeyann added doc Documentation non-breaking Non-breaking change labels Aug 18, 2023
@karthikeyann
Copy link
Contributor

Added metadata round trip for json reader, writer usages.

@divyegala
Copy link
Member Author

@karthikeyann can you rebase and remove your commits? I removed example.cpp and only have deduplication.cpp in my working branch locally. When I try to merge your commits and run the examples, I get std::bad_alloc() and I'm probably doing a bad merge somewhere. It'll be easier for me to push my commits and for you to apply yours on top of mine

@karthikeyann
Copy link
Contributor

The cmake compilation didn’t work right away.
Added -DCMAKE_CUDA_ARCHITECTURES=native to all cmake commands in cpp/examples/build.sh to compile.

@divyegala divyegala marked this pull request as ready for review September 19, 2023 21:43
@divyegala divyegala requested a review from a team as a code owner September 19, 2023 21:43
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Sep 20, 2023

@divyegala Would you please update this example to use an RMM pool? For the strings example, @davidwendt added this in examples/strings/common.hpp

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 21, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@divyegala
Copy link
Member Author

/ok to test

@github-actions github-actions bot added the ci label Nov 14, 2023
@vyasr
Copy link
Contributor

vyasr commented Nov 14, 2023

/ok to test

cpp/examples/nested_types/CMakeLists.txt Show resolved Hide resolved
cpp/examples/nested_types/deduplication.cpp Show resolved Hide resolved
{
// Get count for each key
auto keys = cudf::table_view{{tbl.column(0)}};
auto val = cudf::make_numeric_column(cudf::data_type{cudf::type_id::INT32}, keys.num_rows());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We need to file an issue that makes this use hash-based aggregations without any "forcing"
  2. I think Vyas is asking, if you don't add this column to "force" hash aggregations, are some of the aggregations hash-based and others sort-based? My understanding of libcudf's behavior is that if any aggregation is sort-based, all the aggregations fall back to using sort-based implementations. Is that true?

cpp/examples/nested_types/deduplication.cpp Show resolved Hide resolved
cpp/examples/nested_types/example.json Outdated Show resolved Hide resolved
@divyegala
Copy link
Member Author

/ok to test

@divyegala
Copy link
Member Author

/ok to test

@divyegala
Copy link
Member Author

/ok to test

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@vyasr
Copy link
Contributor

vyasr commented Nov 16, 2023

/ok to test

@divyegala
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit afd7d18 into rapidsai:branch-23.12 Nov 16, 2023
65 checks passed
rapids-bot bot pushed a commit that referenced this pull request Nov 16, 2023
In #13795, we found out that `nullable()` causes severe perf degradation for the nested-type case when the input is read from file via `cudf::io::read_json`. This is because the JSON reader adds a null mask for columns that don't have NULLs. This change is a no-overhead replacement that checks the actual null count instead of checking if a null mask is present.

This PR also solves a bug in quantile/median groupby where NULLs were being [set](https://github.com/rapidsai/cudf/blob/8deb3dd7573000e7d87f18a9e2bbe39cf2932e10/cpp/src/groupby/sort/group_quantiles.cu#L73) but the null count was not updated.

Authors:
  - Divye Gala (https://github.com/divyegala)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14363
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue doc Documentation libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants