[REVIEW] Renumbering refactor, add multi GPU support #963
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here.
import cugraph

class NumberMap:
Where does this object live in:
- single GPU pipeline?
- multi GPU pipeline?
My goal was that the same object would be used in both pipelines.
Yes, but my question was where in each pipeline?
Is this exposed to the user? Does the Di/Graph carry it on a single GPU? If so, do the analytics carry it in multi GPU?
On a single GPU, this object would be internal to the Graph object. It would be automatically generated when from_cudf_edgelist is called and used by all analytics whenever renumbering/unrenumbering capabilities are required.
If a user decided to access the renumbering features directly (currently supported in single GPU), the output of the renumbering function would be this object instead of the DataFrame that is currently returned, so that the user can call the to_vertex_id and from_vertex_id methods as needed.
I haven't seen a complete description of the multi GPU pipeline. My assumption is that it will be similar to the single GPU pipeline: we would have some overarching Python graph object that would create and maintain this object.
On a single GPU, this object would be internal to the Graph object. It would be automatically generated when from_cudf_edgelist is called and used by all analytics whenever renumbering/unrenumbering capabilities are required.
Sounds good.
If a user decided to access the renumbering features directly (currently supported in single GPU), the output of the renumbering function would be this object instead of the DataFrame that is currently returned, so that the user can call the to_vertex_id and from_vertex_id methods as needed.
Ok. We would need to document this object and add an example in the user doc.
I haven't seen a complete description of the multi GPU pipeline.
There are GitHub Issues and PRs for OPG PageRank and BFS. Let me know if you have any specific questions.
My assumption is that it will be similar to the single GPU pipeline. We would have some overarching python graph object that would create and maintain this object.
To some extent. As of now, there is not one global multi GPU graph object. A distributed dask cudf edge list is accepted and single GPU Di/Graphs are created locally for each rank.
@ChuckHastings I lost track of whether you added documentation for accessing the renumbered data.
This is motivated by user requests, see for instance #925.
Feel free to resolve and comment with the commit id.
This list of 1 or more strings contain the names
of the columns that uniquely identify an external
vertex identifier for destination vertices
"""
Need to document multi-GPU return behavior
- Are internal IDs randomly spread or are they grouped together so that all edges for the same source (or perhaps destination) are on the same GPU?
- How does this connect to the next step in the OPG pipeline (coo2csr, #812)? Are they natively compatible or do we need an intermediate step?
Regarding the ids, the approach I've been using is as follows:
- External vertex ids are hashed and the hash value is used to partition vertex ids across the cluster. Each node of the cluster is then responsible for all vertices that hash to it.
- Each node of the cluster now has all of its vertex information and performs a local renumbering to get a 0..(n-1) numbering of all vertices that hashed to that node.
- "Somehow" assign global ids. The current prototype does this by computing the number of vertices on each partition and doing a prefix sum to create a base vertex identifier for each node. The consequence is that we end up with an overall 0..(N-1) mapping, with each node owning a contiguous subrange of the ids it created (see the sketch below).
Regarding the OPG pipeline, this NumberMap does not implement renumbering end to end; it implements the core capability and will be hidden inside the graph object (although it remains externally callable for those who wish to use the renumbering independently). The renumbering implementation itself will use this to create a COO that should be directly usable as input to #812.
Sounds good, I think adding doc about this would be helpful.
Perhaps we should also add a note on getting these properties for CSC as the OPG Pagerank pipeline is the first one we are releasing?
SG does (1) renumber (2) swap src and dst (3) coo2csr. (1) being at edge list loading time and (2,3) at analytics time.
Now looking at MG, are we relying on the assumption that (2) gets done before (1) in order to get coo2csr successfully building the local CSC matrices?
I can't see the doc for the multi-GPU return behavior; let me know if you added it elsewhere.
Not sure what you're asking for. The typical workflow is outlined in the renumber notebooks. It works like this:
- Instantiate a NumberMap object
- Call from_dataframe to populate the number map
- Call to_internal_vertex_id, add_internal_vertex_id, from_internal_vertex_id as desired to convert between internal and external vertex ids.
Alternatively, you can call NumberMap.renumber on a DataFrame and it will return a fully populated NumberMap and a renumbered DataFrame. Then you can call NumberMap.unrenumber on a DataFrame and it will convert internal vertex ids back into external vertex ids.
If you call it with a cudf.DataFrame, it uses single GPU logic to create and translate everything. If you call it with a dask_cudf.DataFrame, it uses MG logic and generates dask_cudf.DataFrame objects everywhere. The NumberMap object keeps track internally of everything necessary to do this.
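A minimal sketch of that workflow, assuming the method names mentioned in this thread; the import path, argument names, and return order are illustrative rather than the PR's exact API:

import cudf
from cugraph.structure.number_map import NumberMap  # import path may differ

# An edge list with external (here: string) vertex identifiers
df = cudf.DataFrame({
    "src": ["alice", "bob", "carol"],
    "dst": ["bob", "carol", "alice"],
})

# Convenience path: returns a fully populated NumberMap plus a renumbered
# DataFrame (argument names and return order are assumptions)
number_map, renumbered_df = NumberMap.renumber(df, "src", "dst")

# Convert internal vertex ids back to the original external ids
original_df = number_map.unrenumber(renumbered_df, "src")

Passing a dask_cudf.DataFrame instead of a cudf.DataFrame would exercise the MG code path in the same way.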
I further wrapped this in the Graph renumber and unrenumber functions which are automatically used if you specify renumber=True on the from_cudf_edgelist or from_dask_cudf_edgelist calls.
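For example, a sketch of the Graph-level path (column names and data are placeholders):

import cudf
import cugraph

edges = cudf.DataFrame({
    "src": ["alice", "bob", "carol"],
    "dst": ["bob", "carol", "alice"],
})

G = cugraph.Graph()
# Renumbering happens under the hood when renumber=True; analytics translate
# results back to the original external ids. from_dask_cudf_edgelist takes
# the same flag for the MG path.
G.from_cudf_edgelist(edges, source="src", destination="dst", renumber=True)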
Not sure what you're asking for.
Documentation that describes the data distribution behavior of MG renumbering output in general. Users of the renumbering feature (and MG developers) will wonder, for instance, whether each node gets a contiguous range or whether the ids are scattered.
Codecov Report
@@ Coverage Diff @@
## branch-0.15 #963 +/- ##
===============================================
+ Coverage 65.24% 67.60% +2.36%
===============================================
Files 58 57 -1
Lines 1755 1994 +239
===============================================
+ Hits 1145 1348 +203
- Misses 610 646 +36
Continue to review full report at Codecov.
Added some suggestions regarding integration with the rest of the MG pipeline.
python/cugraph/structure/graph.py (outdated)
        else:
            self.from_cudf_edgelist(input_df)

-   def from_dask_cudf_edgelist(self, input_ddf):
+   def from_dask_cudf_edgelist(self, input_ddf, renumber=False):
renumber=False should be True to be consistent with the single GPU from_cudf_edgelist.
Notice that later degree and pagerank MG tests use both in the same test without specifying this option, which would result in different input. I think these tests pass just by luck because the input is already renumbered.
Agreed. I set it to False so that, by default, I would not change the existing MG unit tests.
👍
One more thing I realized is that the doc regarding renumbering is outdated for from_cudf_edgelist and from_dask_cudf_edgelist.
Updated this to True now that everything is working correctly.
I copied the renumber parameter documentation over to the from_dask_cudf_edgelist in the latest push.
@ChuckHastings if the input data is int32 in the source and destination, what is the dtype of the renumbered data? Is it the same as the input, or always int64?
Combining MG renumbering with MG pagerank makes dask_cudf's sorting go haywire, specifically during searchsorted (https://github.com/rapidsai/cudf/blob/049f93c4387907553d614e69202cc8e0d2ddc793/python/dask_cudf/dask_cudf/sorting.py#L25), because of an int32/int64 discrepancy arising from dask's lazy compute, which could be the cause of the whole dataset being partitioned onto a single GPU/partition.
Found the disconnect in the calls (I was using int32 in some places and int64 in others). The latest push fixes this. The output type of renumber is specified as an optional parameter to the NumberMap constructor; we default to int32.
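Something along these lines, where the parameter name is a hypothetical placeholder (check the constructor signature in the PR):

import numpy as np
from cugraph.structure.number_map import NumberMap  # import path may differ

# Request int64 internal ids instead of the int32 default
# (the id_type parameter name is hypothetical)
number_map = NumberMap(id_type=np.int64)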
This PR will refactor the renumbering implementation to include:
Goal is:
This will NOT address a custom C++ implementation. That can be added later as an optimization, if desired; however, this implementation is intended to make it easier to add one.
Dependent on #1008