[Review] Add optimized 2x string column renumbering code #1996

Merged: 6 commits into rapidsai:branch-22.02 on Jan 26, 2022

chirayuG-nvidia
Contributor

No description provided.

@chirayuG-nvidia requested review from a team as code owners December 21, 2021 01:04
@chirayuG-nvidia
Contributor Author

CC: @rlratzel @ChuckHastings

@BradReesWork added this to the 22.02 milestone Jan 18, 2022
@BradReesWork added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jan 18, 2022
Collaborator

@ChuckHastings left a comment

This is a great start!

I have a bunch of comments, largely related to making this a bit more consistent with other cugraph software which will ultimately make this easier to maintain.

@@ -16,9 +16,954 @@
#include <cugraph_etl/functions.hpp>

#include <cugraph/utilities/error.hpp>
Collaborator

RAPIDS convention is to group includes from nearest to farthest with a blank line between groups. So these should be reordered so that:

  • All of the cugraph_etl files and files that are part of lib_cugraph_etl/src are listed in the first group
  • All of the cugraph files included next
  • All of the cudf files included next
  • All of the rmm files included next
  • All of the thrust/cub files included next
  • System includes (cuda, stl, etc) last

@@ -0,0 +1,545 @@
/*
Collaborator

This code is copied from cudf. I don't recall the exact reason we had to copy this code instead of using cuco.

It would be helpful to add some documentation here describing this so that we will remember down the road.

Convention within cugraph is to insert comments prefixed with FIXME: that describe this. So something like:

/*
 * FIXME: This file is copied from cudf because XXX
 *     The plan is to migrate to using the cuco version (or the libcudacxx version) once YYY is
 *     completed.  At that point this file can be deleted.
 */

Contributor Author

Added comment.

cpp/libcugraph_etl/include/hash/hashing.cu (outdated thread, resolved)
cpp/libcugraph_etl/include/hash/md5_hash.cu (outdated thread, resolved)
constexpr uint32_t hash_inc_constant = 9999;

typedef struct str_hash_value{
__host__ __device__ str_hash_value() {
Collaborator

Consider using initialization instead of assignment here.

If you defined them below as:

size_type row_{std::numeric_limits<size_type>::max()};
accum_type count_{std::numeric_limits<accum_type>::max()};
int32_t col_{std::numeric_limits<int32_t>::max()};

It might be a little cleaner and you wouldn't need to define the default constructor.

Contributor Author

Switched to initialization. However, an empty default constructor is still required; otherwise thrust::sort complains.
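
The outcome of this thread can be sketched in plain C++ as below. This is an illustration only, not the code merged in the PR: the `size_type` and `accum_type` aliases, the `HOST_DEVICE` macro, and the `sort_by_row` and `check_defaults` helpers are assumptions made for the example, and `std::sort` stands in for `thrust::sort` so the sketch builds without CUDA. It shows that default member initializers still apply when an explicit empty default constructor is kept.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Expands to the CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Assumed aliases for this sketch; the PR defines its own types.
using size_type  = std::int32_t;
using accum_type = std::uint32_t;

struct str_hash_value {
  // Explicit empty default constructor, kept as the author describes.
  // The default member initializers below still run for it.
  HOST_DEVICE str_hash_value() {}

  HOST_DEVICE str_hash_value(size_type row, accum_type count, std::int32_t col)
    : row_{row}, count_{count}, col_{col}
  {
  }

  // Initialization at the point of declaration, as the reviewer suggested.
  size_type row_{std::numeric_limits<size_type>::max()};
  accum_type count_{std::numeric_limits<accum_type>::max()};
  std::int32_t col_{std::numeric_limits<std::int32_t>::max()};
};

// Hypothetical helper: host-side sort by row, standing in for thrust::sort.
inline std::vector<str_hash_value> sort_by_row(std::vector<str_hash_value> values)
{
  std::sort(values.begin(), values.end(), [](str_hash_value const& a, str_hash_value const& b) {
    return a.row_ < b.row_;
  });
  return values;
}

// Hypothetical self-check: a default-constructed value holds the sentinel maxima.
inline void check_defaults()
{
  str_hash_value v;
  assert(v.row_ == std::numeric_limits<size_type>::max());
  assert(v.count_ == std::numeric_limits<accum_type>::max());
  assert(v.col_ == std::numeric_limits<std::int32_t>::max());
}
```

Because default-constructed entries carry the maximum row value, they sort to the end of the range under this comparison.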

cpp/libcugraph_etl/src/renumbering.cu (three outdated threads, resolved)
str_col_view.offsets().data<str_offset_type>()));
}

accum_type *hist_insert_counter = nullptr;
Collaborator

Can we make this a std::vector? The RAII construct is generally cleaner and easier to maintain.

The cugraph code uses raft utilities to support easy transfers between host and device. https://github.com/rapidsai/raft/blob/6a8c7a3bebe85d8fef34951e6f09c93fa733b06f/cpp/include/raft/cudart_utils.h#L251

Collaborator

In order to efficiently use raft utilities, we might need to add the raft::handle_t as a parameter to the renumber_cudf_tables method.

Contributor Author

Changing this to a raft::mr::host::buffer for RAII. I directly write some values to pinned memory so can't use a std::vector.
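
The RAII idea discussed in this thread can be sketched generically as follows. This is not raft's buffer implementation and does not reproduce its API: `host_buffer`, `pinned_alloc`, `pinned_free`, and `pinned_deleter` are hypothetical names for the example, and `std::malloc`/`std::free` stand in for the CUDA pinned-memory calls (`cudaMallocHost`/`cudaFreeHost`) so the sketch builds without CUDA. The point is that the owner releases the allocation automatically while still exposing a raw pointer that can be written to directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <type_traits>

// Hypothetical allocation pair. In CUDA code these would wrap cudaMallocHost and
// cudaFreeHost; malloc/free are used here only so the sketch is self-contained.
inline void* pinned_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void pinned_free(void* ptr) { std::free(ptr); }

// Deleter that routes unique_ptr cleanup through the matching free function.
struct pinned_deleter {
  void operator()(void* ptr) const noexcept { pinned_free(ptr); }
};

// Minimal RAII owner for a raw host allocation of `count` elements.
template <typename T>
class host_buffer {
  static_assert(std::is_trivially_copyable<T>::value,
                "raw storage is only suitable for trivially copyable types");

 public:
  explicit host_buffer(std::size_t count)
    : data_{static_cast<T*>(pinned_alloc(count * sizeof(T)))}, size_{count}
  {
    assert(data_ != nullptr || count == 0);
  }

  T* data() noexcept { return data_.get(); }
  std::size_t size() const noexcept { return size_; }
  T& operator[](std::size_t i) { return data_.get()[i]; }

 private:
  std::unique_ptr<T, pinned_deleter> data_;  // frees the allocation on scope exit
  std::size_t size_;
};
```

A counter such as `hist_insert_counter` could then be declared as `host_buffer<accum_type> counter(1);` and written through `counter.data()`, with no explicit free on any exit path.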

@ChuckHastings
Collaborator

A couple of other notes...

  1. The copyright headers will need to be updated to include 2022 for any files that are new or modified.
  2. In order to pass CI you will need to format the code using clang-format. When you install the cugraph conda environment, clang-format should be installed into it. From the top-level directory (cugraph) you should be able to run: python cpp/scripts/run-clang-format.py -inplace and it will modify any files that don't match our clang-format rules.

@codecov-commenter

codecov-commenter commented Jan 26, 2022

Codecov Report

Merging #1996 (54d6532) into branch-22.02 (f80313e) will increase coverage by 2.59%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #1996      +/-   ##
================================================
+ Coverage         69.90%   72.50%   +2.59%     
================================================
  Files               142      146       +4     
  Lines              8689     9470     +781     
================================================
+ Hits               6074     6866     +792     
+ Misses             2615     2604      -11     
Impacted Files Coverage Δ
python/cugraph/cugraph/__init__.py 100.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_ecg.py 100.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_paths.py 100.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_egonet.py 100.00% <0.00%> (ø)
python/pylibcugraph/pylibcugraph/_version.py 0.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_wjaccard.py 100.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_hungarian.py 100.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_wsorensen.py 100.00% <0.00%> (ø)
python/cugraph/cugraph/tests/dask/mg_context.py 0.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_modularity.py 100.00% <0.00%> (ø)
... and 57 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@chirayuG-nvidia
Contributor Author

@ChuckHastings all the suggested edits are done. Please take a look.

@chirayuG-nvidia changed the title from "[WIP] Add optimized 2x string column renumbering code" to "[Review] Add optimized 2x string column renumbering code" Jan 26, 2022
@BradReesWork
Member

@gpucibot merge

@rapids-bot merged commit e0038f0 into rapidsai:branch-22.02 Jan 26, 2022