[REVIEW] Renumbering refactor, add multi GPU support #963
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here.
import cugraph

class NumberMap:
Where does this object live in:
- single GPU pipeline?
- multi GPU pipeline?
My goal was that the same object would be used in both pipelines.
Yes, but my question was where in each pipeline?
Is this exposed to the user? Does the Di/Graph carry it on a single GPU? If so, do the analytics carry it in multi GPU?
On a single GPU, this object would be internal to the Graph object. It would be automatically generated when from_cudf_edgelist is called and used by all analytics whenever renumbering/unrenumbering capabilities are required.
If a user decided to access the renumbering features directly (currently supported in single GPU), the output of the renumbering function would be this object instead of the DataFrame that is currently returned, so that the user can call the to_vertex_id and from_vertex_id methods as needed.
I haven't seen a complete description of the multi GPU pipeline. My assumption is that it will be similar to the single GPU pipeline: we would have some overarching Python graph object that would create and maintain this object.
On a single GPU, this object would be internal to the Graph object. It would be automatically generated when from_cudf_edgelist is called and used by all analytics whenever renumbering/unrenumbering capabilities are required.
Sounds good.
If a user decided to access the renumbering features directly (currently supported in single GPU), the output of the renumbering function would be this object instead of the DataFrame that is currently returned, so that the user can call the to_vertex_id and from_vertex_id methods as needed.
Ok. We would need to document this object and add an example in the user doc.
I haven't seen a complete description of the multi GPU pipeline.
There are GitHub Issues and PRs for OPG PageRank and BFS. Let me know if you have any specific questions.
My assumption is that it will be similar to the single GPU pipeline. We would have some overarching python graph object that would create and maintain this object.
To some extent. As of now, there is not one global multi GPU graph object. A distributed dask cudf edge list is accepted and single GPU Di/Graphs are created locally for each rank.
@ChuckHastings I lost track of whether you added documentation for accessing the renumbered data.
This is motivated by user requests, see for instance #925.
Feel free to resolve and comment with the commit id.
This list of 1 or more strings contain the names
of the columns that uniquely identify an external
vertex identifier for destination vertices
"""
Need to document multi-GPU return behavior
- Are internal IDs randomly spread or are they grouped together so that all edges for the same source (or perhaps destination) are on the same GPU?
- How does this connect to the next step in the OPG pipeline (coo2csr, #812)? Are they natively compatible or do we need an intermediate step?
Regarding the ids, the approach I've been using is as follows:
- External vertex ids are hashed and the hash value is used to partition vertex ids across the cluster. Each node of the cluster is then responsible for all vertices that hash to it.
- Each node of the cluster now has all of its vertex information and performs a local renumbering to get a 0..(n-1) numbering of all vertices that hashed to that node.
- "Somehow" assign global ids. The current prototype does this by computing the number of vertices on each partition and doing a prefix sum to create a base vertex identifier for each node. The consequence is that we end up with an overall 0..(N-1) mapping, with each node owning a contiguous subrange of the ids it created (see the sketch below).
Regarding the OPG pipeline, this NumberMap does not implement renumbering end to end; it implements the core capability and will be hidden inside the graph object (although it remains externally callable for those who wish to use the renumbering independently). The renumbering implementation itself will use this to create a COO that should be directly usable as input to #812.
Sounds good, I think adding doc about this would be helpful.
Perhaps we should also add a note on getting these properties for CSC as the OPG Pagerank pipeline is the first one we are releasing?
SG does (1) renumber (2) swap src and dst (3) coo2csr. (1) being at edge list loading time and (2,3) at analytics time.
Now looking at MG, are we relying on the assumption that (2) gets done before (1) in order to get coo2csr successfully building the local CSC matrices?
I can't see the doc for the multi-GPU return behavior; let me know if you added it elsewhere.
Not sure what you're asking for. The typical workflow is outlined in the renumber notebooks. It works like this:
- Instantiate a NumberMap object
- Call from_dataframe to populate the number map
- Call to_internal_vertex_id, add_internal_vertex_id, from_internal_vertex_id as desired to convert between internal and external vertex ids.
Alternatively, you can call NumberMap.renumber on a DataFrame and it will return a fully populated NumberMap and a renumbered DataFrame. Then you can call NumberMap.unrenumber on a DataFrame and it will convert internal vertex ids back into external vertex ids.
If you call it with a cudf.DataFrame, it uses single GPU logic to create and translate everything. If you call it with a dask_cudf.DataFrame, it uses MG logic and generates dask_cudf.DataFrame objects everywhere. The NumberMap object keeps track internally of everything necessary to do this.
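A minimal sketch of that workflow, assuming the method names mentioned in this thread; the import path, argument names, and return order are illustrative rather than the PR's exact API:

import cudf
from cugraph.structure.number_map import NumberMap  # import path may differ

# An edge list with external (here: string) vertex identifiers
df = cudf.DataFrame({
    "src": ["alice", "bob", "carol"],
    "dst": ["bob", "carol", "alice"],
})

# Convenience path: returns a fully populated NumberMap plus a renumbered
# DataFrame (argument names and return order are assumptions)
number_map, renumbered_df = NumberMap.renumber(df, "src", "dst")

# Convert internal vertex ids back to the original external ids
original_df = number_map.unrenumber(renumbered_df, "src")

Passing a dask_cudf.DataFrame instead of a cudf.DataFrame would exercise the MG code path in the same way.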
I further wrapped this in the Graph renumber and unrenumber functions which are automatically used if you specify renumber=True on the from_cudf_edgelist or from_dask_cudf_edgelist calls.
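For example, a sketch of the Graph-level path (column names and data are placeholders):

import cudf
import cugraph

edges = cudf.DataFrame({
    "src": ["alice", "bob", "carol"],
    "dst": ["bob", "carol", "alice"],
})

G = cugraph.Graph()
# Renumbering happens under the hood when renumber=True; analytics translate
# results back to the original external ids. from_dask_cudf_edgelist takes
# the same flag for the MG path.
G.from_cudf_edgelist(edges, source="src", destination="dst", renumber=True)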
Not sure what you're asking for.
Documentation that describes the data distribution behavior of MG renumbering output in general. Users of the renumbering feature (and MG developers) will wonder, for instance, whether each node gets a contiguous range or whether the ids are scattered.
Codecov Report
@@ Coverage Diff @@
## branch-0.15 #963 +/- ##
===============================================
+ Coverage 65.24% 67.60% +2.36%
===============================================
Files 58 57 -1
Lines 1755 1994 +239
===============================================
+ Hits 1145 1348 +203
- Misses 610 646 +36
Continue to review full report at Codecov.
Added some suggestions regarding integration with the rest of the MG pipeline.
python/cugraph/structure/graph.py (outdated)
        else:
            self.from_cudf_edgelist(input_df)

-   def from_dask_cudf_edgelist(self, input_ddf):
+   def from_dask_cudf_edgelist(self, input_ddf, renumber=False):
renumber=False should be True to be consistent with the single GPU from_cudf_edgelist.
Notice that later degree and pagerank MG tests use both in the same test without specifying this option, which would result in different input. I think these tests pass just by luck because the input is already renumbered.
Agreed. I set it to False so that, by default, I would not change the existing MG unit tests.
👍
One more thing I realized is that the doc regarding renumbering is outdated for from_cudf_edgelist and from_dask_cudf_edgelist.
Updated this to True now that everything is working correctly.
I copied the renumber parameter documentation over to the from_dask_cudf_edgelist in the latest push.
@ChuckHastings if the input data is int32 in the source and destination, what is the dtype of the renumbered data? Is it the same as the input, or always int64?
Combining MG renumbering with MG pagerank makes dask_cudf's sorting go haywire, specifically during searchsorted (https://github.com/rapidsai/cudf/blob/049f93c4387907553d614e69202cc8e0d2ddc793/python/dask_cudf/dask_cudf/sorting.py#L25), because of an int32/int64 discrepancy arising from dask's lazy compute, which could be the cause of the whole dataset being partitioned onto a single GPU/partition.
Found the disconnect in the calls (I was using int32 in some places and int64 in others). The latest push fixes this. The output type of renumber is specified as an optional parameter to the NumberMap constructor; we default to int32.
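Something along these lines, where the parameter name is a hypothetical placeholder (check the constructor signature in the PR):

import numpy as np
from cugraph.structure.number_map import NumberMap  # import path may differ

# Request int64 internal ids instead of the int32 default
# (the id_type parameter name is hypothetical)
number_map = NumberMap(id_type=np.int64)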
This PR will refactor the renumbering implementation to include:
Goal is:
This will NOT address a custom C++ implementation. That can be added later as an optimization, if desired; however, this implementation is intended to make it easier to add one.
Dependent on #1008