
[REVIEW] Renumbering refactor, add multi GPU support #963

Merged: 37 commits from fea_opg_renumbering into rapidsai:branch-0.15 on Jul 23, 2020

Conversation

ChuckHastings (Collaborator) commented Jun 22, 2020

This PR will refactor the renumbering implementation to include:

  • A new high level design using a python NumberMap object which will encapsulate numbering functionality
  • A python/cudf/dask_cudf based implementation that provides a functional single GPU and MG implementation
  • All python algorithms converted to use the new implementation

Scope: this will NOT address a custom C++ implementation. That will be added later as an optimization, if desired. However, this implementation is intended to make a custom C++ implementation easier to add.

Dependent on #1008

ChuckHastings requested a review from a team as a code owner, June 22, 2020 13:52
GPUtester (Contributor):

Please update the changelog in order to start CI tests.

View the gpuCI docs here.


import cugraph

class NumberMap:
Member:

Where does this object live in:

  • single GPU pipeline?
  • multi GPU pipeline?

Collaborator (author):

My goal was that the same object would be used in both pipelines.

afender (Member), Jun 22, 2020:

Yes, but my question was where in each pipeline?
Is this exposed to the user? Does the Di/Graph carry it on a single GPU? If so, do the analytics carry it in multi GPU?

Collaborator (author):

On a single GPU, this object would be internal to the Graph object. It would be generated automatically when from_cudf_edgelist is called, and it would be used by all analytics whenever renumbering/unrenumbering capabilities are required.

If a user decided to access the renumbering features themselves (currently supported in single GPU), the output of the renumbering function would be this object instead of the DataFrame that is currently returned, so that the user can call the to_vertex_id and from_vertex_id methods directly as needed.

I haven't seen a complete description of the multi GPU pipeline. My assumption is that it will be similar to the single GPU pipeline: we would have some overarching python graph object that would create and maintain this object.

Member:

> On a single GPU, this object would be internal to the Graph object. It would be generated automatically when from_cudf_edgelist is called, and it would be used by all analytics whenever renumbering/unrenumbering capabilities are required.

Sounds good.

> If a user decided to access the renumbering features themselves (currently supported in single GPU), the output of the renumbering function would be this object instead of the DataFrame that is currently returned, so that the user can call the to_vertex_id and from_vertex_id methods directly as needed.

Ok. We would need to document this object and add an example in the user doc.

> I haven't seen a complete description of the multi GPU pipeline.

There are GitHub Issues and PRs for OPG PageRank and BFS. Let me know if you have any specific questions.

> My assumption is that it will be similar to the single GPU pipeline: we would have some overarching python graph object that would create and maintain this object.

To some extent. As of now, there is not one global multi GPU graph object: a distributed dask_cudf edge list is accepted, and single GPU Di/Graphs are created locally for each rank.

Member:

@ChuckHastings I lost track of whether you added doc for accessing the renumbered data.
This is motivated by user requests; see for instance #925.
Feel free to resolve and comment with the commit id.

python/cugraph/structure/number_map.py (two outdated review threads, resolved)
This list of 1 or more strings contains the names
of the columns that uniquely identify an external
vertex identifier for destination vertices
"""
Member:

Need to document multi-GPU return behavior:

  • Are internal IDs randomly spread, or are they grouped together so that all edges for the same source (or perhaps destination) are on the same GPU?
  • How does this connect to the next step in the OPG pipeline (coo2csr, #812)? Are they natively compatible, or do we need an intermediate step?

Collaborator (author):

Regarding the ids, the approach I've been using is as follows (see the sketch after this list):

  • External vertex ids are hashed, and the hash value is used to partition vertex ids across the cluster. Each node of the cluster is then responsible for all vertices that hash to it.
  • Each node of the cluster now has all of its vertex information and performs a local renumbering to get a 0..(n-1) numbering of the vertices that hashed to that node.
  • "Somehow" assign global ids. The current prototype computes the number of vertices on each partition and does a prefix sum to create a base vertex identifier for each node. The consequence is an overall 0..(N-1) mapping, with each node getting a contiguous subrange of the ids it created.

Regarding the OPG pipeline, this NumberMap does not entirely implement renumbering; it implements the core capability and will be hidden inside the graph object (although it is externally callable for those who wish to use the renumbering independently). The renumbering implementation itself will use this to create a COO that should be directly usable as input to #812.

afender (Member), Jun 22, 2020:

Sounds good; I think adding doc about this would be helpful.

Perhaps we should also add a note on getting these properties for CSC, since the OPG PageRank pipeline is the first one we are releasing?

SG does (1) renumber, (2) swap src and dst, (3) coo2csr, with (1) at edge-list loading time and (2, 3) at analytics time.
Now looking at MG: are we relying on the assumption that (2) gets done before (1) in order for coo2csr to successfully build the local CSC matrices?

Member:

Can't see the doc for the multi-GPU return behavior; let me know if you added it elsewhere.

Collaborator (author):

Not sure what you're asking for. The typical workflow is outlined in the renumber notebooks. It works like this:

  • Instantiate a NumberMap object
  • Call from_dataframe to populate the number map
  • Call to_internal_vertex_id, add_internal_vertex_id, or from_internal_vertex_id as desired to convert between internal and external vertex ids

Alternatively, you can call NumberMap.renumber on a DataFrame and it will return a fully populated NumberMap and a renumbered DataFrame. Then you can call NumberMap.unrenumber on a DataFrame and it will convert internal vertex ids back into external vertex ids.

If you call it with a cudf.DataFrame, it uses single GPU logic to create and translate everything. If you call it with a dask_cudf.DataFrame, it uses MG logic and generates dask_cudf.DataFrame objects everywhere. The NumberMap object internally keeps track of everything necessary to do this.
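
For illustration, a minimal usage sketch of the workflow described above. The exact column-name arguments and the return order of NumberMap.renumber are assumptions based on this thread, not verified against the merged code:

import cudf
from cugraph.structure.number_map import NumberMap

df = cudf.DataFrame({"src": [100, 201, 302], "dst": [201, 302, 100]})

# One-shot path: renumber the edge list and get the populated map back.
renumbered_df, number_map = NumberMap.renumber(df, "src", "dst")

# Convert internal vertex ids back to the original external ids.
restored = number_map.unrenumber(renumbered_df, "src")

# Step-by-step path: populate the map, then translate columns as needed.
nm = NumberMap()
nm.from_dataframe(df, ["src"], ["dst"])
internal_ids = nm.to_internal_vertex_id(df, ["src"])

Passing a dask_cudf.DataFrame instead of a cudf.DataFrame would exercise the MG code paths with the same calls.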

Collaborator (author):

I further wrapped this in the Graph renumber and unrenumber functions, which are used automatically if you specify renumber=True on the from_cudf_edgelist or from_dask_cudf_edgelist calls.
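
A sketch of that wrapped path, using cugraph's standard edge-list API (the column names and ids are illustrative):

import cudf
import cugraph

gdf = cudf.DataFrame({"src": [100, 201, 302], "dst": [201, 302, 100]})
G = cugraph.Graph()
# With renumber=True, the Graph builds and applies the NumberMap internally.
G.from_cudf_edgelist(gdf, source="src", destination="dst", renumber=True)
# Analytics results are unrenumbered back to the external vertex ids.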

Member:

> Not sure what you're asking for.

Documentation that describes the data distribution behavior of the MG renumbering output in general. Users of the renumbering feature (and MG developers) will wonder, for instance, whether each node gets a contiguous range or whether the ids are scattered.

ChuckHastings changed the title from [WIP] Renumbering to [WIP] Renumbering refactor, add multi GPU support, Jun 30, 2020
ChuckHastings requested a review from a team as a code owner, July 8, 2020 15:01
codecov-commenter commented Jul 8, 2020

Codecov Report

Merging #963 into branch-0.15 will increase coverage by 2.36%.
The diff coverage is 81.63%.


@@               Coverage Diff               @@
##           branch-0.15     #963      +/-   ##
===============================================
+ Coverage        65.24%   67.60%   +2.36%     
===============================================
  Files               58       57       -1     
  Lines             1755     1994     +239     
===============================================
+ Hits              1145     1348     +203     
- Misses             610      646      +36     
Impacted Files Coverage Δ
python/cugraph/community/__init__.py 100.00% <ø> (ø)
python/cugraph/layout/force_atlas2.py 60.00% <33.33%> (-1.54%) ⬇️
python/cugraph/link_prediction/wjaccard.py 70.00% <55.55%> (-3.34%) ⬇️
python/cugraph/dask/common/input_utils.py 96.90% <60.00%> (-2.03%) ⬇️
python/cugraph/centrality/katz_centrality.py 77.77% <71.42%> (-22.23%) ⬇️
python/cugraph/community/ktruss_subgraph.py 85.71% <75.00%> (-0.96%) ⬇️
python/cugraph/cores/k_core.py 84.21% <75.00%> (+8.21%) ⬆️
python/cugraph/structure/graph.py 75.93% <77.89%> (+3.76%) ⬆️
python/cugraph/structure/number_map.py 78.80% <78.80%> (ø)
python/cugraph/traversal/sssp.py 81.81% <84.61%> (+8.48%) ⬆️
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a156bda...6fc4379

ChuckHastings changed the title from [WIP] Renumbering refactor, add multi GPU support to [REVIEW] Renumbering refactor, add multi GPU support, Jul 15, 2020
afender requested a review from Iroy30, July 20, 2020 15:48
afender (Member) left a review comment:

Added some suggestions regarding integration with the rest of the MG pipeline.

python/cugraph/layout/force_atlas2.py (review thread)
python/cugraph/tests/test_renumber.py (two outdated review threads, resolved)
         else:
             self.from_cudf_edgelist(input_df)

-    def from_dask_cudf_edgelist(self, input_ddf):
+    def from_dask_cudf_edgelist(self, input_ddf, renumber=False):
Member:

renumber=False should be True to be consistent with the single GPU from_cudf_edgelist.
Notice that the later degree and pagerank MG tests use both in the same test without specifying this option, which would result in different input. I think these tests pass just by luck, because the input is already renumbered.

Collaborator (author):

Agreed. I set it to False so that, by default, I would not change the existing MG unit tests.

Member:

👍

One more thing I realized is that the doc regarding renumbering is outdated for

  • from_cudf_edgelist
  • from_dask_cudf_edgelist

Collaborator (author):

Updated this to True now that everything is working correctly.

I copied the renumber parameter documentation over to the from_dask_cudf_edgelist in the latest push.

Iroy30 (Contributor) left a review comment:

@ChuckHastings if the input data is int32 in the source and destination, what is the dtype of the renumbered data: is it the same as the input, or always int64?
Combining MG renumbering with MG pagerank makes dask_cudf's sorting go haywire, specifically during search-sorting (https://github.com/rapidsai/cudf/blob/049f93c4387907553d614e69202cc8e0d2ddc793/python/dask_cudf/dask_cudf/sorting.py#L25), because of an int32/int64 discrepancy arising from dask's lazy compute; this could be the cause of the whole dataset being partitioned onto a single GPU/partition.

ChuckHastings (Collaborator, author):

> if the input data is int32 in the source and destination, what is the dtype of the renumbered data [...]

Found the disconnect in the calls (I was using int32 in some places and int64 in others). The latest push fixes this. The output type of renumber is specified as an optional parameter to the NumberMap constructor; we default to int32.
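
A sketch of that dtype control; the constructor parameter name (id_type) is an assumption, only the int32 default comes from this thread:

import numpy as np
from cugraph.structure.number_map import NumberMap

nm64 = NumberMap(id_type=np.int64)  # request 64-bit internal vertex ids
nm32 = NumberMap()                  # default: int32 internal vertex ids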

BradReesWork merged commit 14b7fd5 into rapidsai:branch-0.15, Jul 23, 2020
BradReesWork added this to the 0.15 milestone, Jul 28, 2020
ChuckHastings deleted the fea_opg_renumbering branch, February 10, 2021 16:11