Create a page on why we do not support cascading #2842

Merged
merged 8 commits into from
Oct 26, 2022
53 changes: 53 additions & 0 deletions docs/cugraph/source/basics/cugraph_cascading.md
@@ -0,0 +1,53 @@

# Method Cascading and cuGraph

BLUF: cuGraph does not support method cascading

[Method Cascading](https://en.wikipedia.org/wiki/Method_cascading) is a popular and useful programming technique and a great way to make code more readable. Python supports method cascading ... _for the most part_. A number of Python built-in classes do not support cascading because their in-place methods return `None`.
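As a quick illustration of the built-in case, `list.sort()` sorts in place and returns `None`, so nothing can be chained onto it, whereas `str` methods return new objects and chain freely. This is plain Python, not cuGraph code:

```python
# In-place built-in methods return None, so they cannot be chained.
values = [3, 1, 2]
result = values.sort()      # sorts `values` in place
print(result)               # None
print(values)               # [1, 2, 3]

# Methods that return a new object can be chained.
cleaned = "  Hello World ".strip().lower().replace(" ", "_")
print(cleaned)              # hello_world
```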

An example, from cuDF, is a sequence of method calls for loading data and then finding the largest values from a subset of the data (yes there are other ways this could be done):

```
gdf = cudf.from_pandas(df).query('val > 200').nlargest(3, 'val')
```

cuGraph does not support method cascading for two main reasons: (1) the Graph data object follows an object-oriented design that relies on in-place methods, and (2) algorithms operate on graphs, rather than graphs running algorithms.

## Graph Data Objects
cuGraph follows an object-oriented design for the Graph objects. Users create a Graph and can then add data to the object, but every add method call returns `None`.

_Why in-place methods?_<br>
cuGraph focuses on big graph problems, where there are tens of millions to trillions of edges (gigabytes to terabytes of data). At that scale, creating a copy of the data is memory inefficient.

_Why not return `self` rather than `None`?_<br>
It would be simple to modify the methods to return `self` rather than `None`; however, doing so opens the methods to misinterpretation. Consider the following code:

```
# cascade flow - makes sense
G = cugraph.Graph().from_cudf_edgelist(df)

# non-cascaded code can be confusing
G = cugraph.Graph()
G2 = G.from_cudf_edgelist(df)
G3 = G.from_cudf_edgelist(df2)
```
The confusion with the non-cascaded code is that, if the methods returned `self`, `G`, `G2`, and `G3` would all be the same object with the same data. Users could be confused since it is not obvious that changing `G3` would also change `G2` (and even `G`). To prevent this confusion, cuGraph has opted not to return `self`.
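The aliasing problem can be reproduced with a small stand-in class. `ToyGraph` and `from_edgelist` are hypothetical names used only for this sketch; they are not part of cuGraph, and the sketch assumes that a load method simply replaces the stored edges:

```python
class ToyGraph:
    """Hypothetical stand-in for a graph object; not the cuGraph API."""

    def __init__(self):
        self.edges = []

    def from_edgelist(self, edges):
        # Replaces the stored edges in place, then returns self
        # (the behavior cuGraph chose NOT to adopt).
        self.edges = list(edges)
        return self


G = ToyGraph()
G2 = G.from_edgelist([(0, 1), (1, 2)])
G3 = G.from_edgelist([(5, 6)])

print(G is G2 and G2 is G3)   # True: three names, one object
print(G2.edges)               # [(5, 6)]: the second load also changed G2
```

Nothing was copied at any point, yet the three variable names suggest three separate graphs.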

_Why not add a flag "return_self" to the methods?_<br>
```
# cascade flow - makes sense
G = cugraph.Graph().from_cudf_edgelist(df, return_self=True)
```
The fact that a developer would explicitly add a `return_self` flag to the method call indicates that the developer is already aware that the method returns `None`. At that point, it is just as easy for the developer to use a non-cascading workflow.

## Algorithms
Algorithms operate on Graph objects.
```
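cugraph.pagerank(G)    # supported: the algorithm takes the Graph as an argument
# G.pagerank()         # not supported: the Graph object has no algorithm methods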
```
This is also due to cuGraph following an object-oriented model. Data objects just understand data, not operations on the data. As a result, the developer cannot cascade graph creation into an algorithm.

```
# will not work
G = cugraph.Graph().from_cudf_edgelist(df).pagerank()
```
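The same behavior can be sketched without a GPU using stand-ins. `ToyGraph`, `from_edgelist`, and `edge_count` are hypothetical names for illustration only, not cuGraph APIs; the sketch assumes only what is described above, namely that the load method works in place and returns `None`, and that the algorithm is a free function taking the graph as input:

```python
class ToyGraph:
    """Hypothetical stand-in for a graph data object; not the cuGraph API."""

    def __init__(self):
        self.edges = []

    def from_edgelist(self, edges):
        # In-place load that returns None, mirroring the behavior described above.
        self.edges = list(edges)


def edge_count(graph):
    # Hypothetical "algorithm": a free function that takes the graph as input.
    return len(graph.edges)


# Non-cascading workflow: create, load, then pass the graph to the algorithm.
G = ToyGraph()
G.from_edgelist([(0, 1), (1, 2), (2, 0)])
print(edge_count(G))  # 3

# Cascading fails because the load step returns None.
try:
    ToyGraph().from_edgelist([(0, 1)]).edge_count()
    chained_ok = True
except AttributeError:
    chained_ok = False
print(chained_ok)  # False
```

The non-cascading form costs two extra lines but keeps a single, unambiguous reference to the graph.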