diff --git a/notebooks/README.md b/notebooks/README.md index 98e421603ad..90bbd8142d4 100644 --- a/notebooks/README.md +++ b/notebooks/README.md @@ -10,9 +10,11 @@ This repository contains a collection of Jupyter Notebooks that outline how to r | Folder | Notebook | Description | | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | | Centrality | | | -| | [Centrality](centrality/Centrality.ipynb) | Compute and compare multiple (currently 4) centrality scores | -| | [Katz](centrality/Katz.ipynb) | Compute the Katz centrality for every vertex | -| | [Betweenness](centrality/Betweenness.ipynb) | Compute both Edge and Vertex Betweenness centrality | +| | [Centrality](algorithms/centrality/Centrality.ipynb) | Compute and compare multiple (currently 5) centrality scores | +| | [Katz](algorithms/centrality/Katz.ipynb) | Compute the Katz centrality for every vertex | +| | [Betweenness](algorithms/centrality/Betweenness.ipynb) | Compute both Edge and Vertex Betweenness centrality | +| | [Degree](algorithms/centrality/Degree.ipynb) | Compute Degree Centrality for each vertex | +| | [Eigenvector](algorithms/centrality/Eigenvector.ipynb) | Compute Eigenvector centrality for every vertex | | Community | | | | | [Louvain](community/Louvain.ipynb) and Leiden | Identify clusters in a graph using both the Louvain and Leiden algorithms | | | [ECG](community/ECG.ipynb) | Identify clusters in a graph using the Ensemble Clustering for Graph | @@ -51,10 +53,10 @@ Running the example in these notebooks requires: * The latest version of RAPIDS with cuGraph. * Download via Docker, Conda (See [__Getting Started__](https://rapids.ai/start.html)) -* cuGraph is dependent on the latest version of cuDF. Please install all components of RAPIDS -* Python 3.7+ +* cuGraph is dependent on the latest version of cuDF. 
Please install all components of RAPIDS +* Python 3.8+ * A system with an NVIDIA GPU: Pascal architecture or better -* CUDA 11.0+ +* CUDA 11.4+ * NVIDIA driver 450.51+ @@ -73,7 +75,7 @@ Test Hardware ##### Copyright -Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at diff --git a/notebooks/algorithms/README.md b/notebooks/algorithms/README.md new file mode 100644 index 00000000000..14b2929efac --- /dev/null +++ b/notebooks/algorithms/README.md @@ -0,0 +1,69 @@ +# cuGraph Algorithm Notebooks + +As the algorithm notebooks are updated and migrated to this area, they will appear in this README. Until then, they are available [here](../README.md) + +![GraphAnalyticsFigure](../img/GraphAnalyticsFigure.jpg) + +This repository contains a collection of Jupyter Notebooks that outline how to run various cuGraph analytics. The notebooks do not address a complete data science problem. The notebooks are simply examples of how to run the graph analytics. Manipulation of the data before or after the graph analytic is not covered here. 
Extended, more problem-focused notebooks are being created and are available at https://github.com/rapidsai/notebooks-extended + +## Summary + +| Folder | Notebook | Description | +| --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | +| Centrality | | | +| | [Centrality](centrality/Centrality.ipynb) | Compute and compare multiple (currently 5) centrality scores | +| | [Katz](centrality/Katz.ipynb) | Compute the Katz centrality for every vertex | +| | [Betweenness](centrality/Betweenness.ipynb) | Compute both Edge and Vertex Betweenness centrality | +| | [Degree](centrality/Degree.ipynb) | Compute Degree Centrality for each vertex | +| | [Eigenvector](centrality/Eigenvector.ipynb) | Compute Eigenvector centrality for every vertex | + + + +[System Requirements](../README.md#requirements) + +| Author Credit | Date | Update | cuGraph Version | Test Hardware | +| --------------|------------|------------------|-----------------|----------------| +| Brad Rees | 04/19/2021 | created | 0.19 | GV100, CUDA 11.0 +| Don Acosta | 07/05/2022 | tested / updated | 22.08 nightly | DGX Tesla V100 CUDA 11.5 + +### Copyright + +Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
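The centrality notebooks summarized above all follow the same basic flow: load an edge list, build a graph, and call a centrality function. As a rough CPU-only sketch of that flow (using NetworkX, which the notebooks use to cross-check cuGraph results, and its built-in karate club graph rather than the repository's CSV files):

```python
import networkx as nx

# The Zachary Karate Club graph used throughout the centrality notebooks
G = nx.karate_club_graph()

# Compute several of the centrality measures covered above
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
katz = nx.katz_centrality(G)
pagerank = nx.pagerank(G)
eigenvector = nx.eigenvector_centrality(G)

# Rank vertices by degree centrality, most central first
top5 = sorted(degree, key=degree.get, reverse=True)[:5]
print(top5)
```

In the GPU notebooks the same pattern appears as `cudf.read_csv` followed by `cugraph.Graph.from_cudf_edgelist` and the corresponding `cugraph` centrality calls.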
+ + + + + +![RAPIDS](img/rapids_logo.png) diff --git a/notebooks/centrality/Betweenness.ipynb b/notebooks/algorithms/centrality/Betweenness.ipynb similarity index 86% rename from notebooks/centrality/Betweenness.ipynb rename to notebooks/algorithms/centrality/Betweenness.ipynb index 79e5383ed94..8860819b3ad 100644 --- a/notebooks/centrality/Betweenness.ipynb +++ b/notebooks/algorithms/centrality/Betweenness.ipynb @@ -8,16 +8,11 @@ "\n", "In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test database using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n", "\n", - "Notebook Credits\n", - "* Original Authors: Bradley Rees\n", - "* Created: 04/24/2019\n", - "* Last Edit: 08/16/2020\n", - "\n", - "RAPIDS Versions: 0.15 \n", - "\n", - "Test Hardware\n", - "\n", - "* GV100 32G, CUDA 10.2\n" + "| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n", + "| --------------|------------|------------------|-----------------|----------------|\n", + "| Brad Rees | 04/24/2019 | created | 0.15 | GV100, CUDA 11.0\n", + "| Brad Rees | 08/16/2020 | tested / updated | 21.10 nightly | RTX 3090 CUDA 11.4\n", + "| Don Acosta | 07/05/2022 | tested / updated | 22.08 nightly | DGX Tesla V100 CUDA 11.5" ] }, { @@ -79,7 +74,7 @@ "metadata": {}, "source": [ "#### Some notes about vertex IDs...\n", - "* The current version of cuGraph requires that vertex IDs be representable as 32-bit integers, meaning graphs currently can contain at most 2^32 unique vertex IDs. However, this limitation is being actively addressed and a version of cuGraph that accommodates more than 2^32 vertices will be available in the near future.\n", + "\n", "* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. 
If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.\n", " * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (eg. `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).\n", " * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`\n" @@ -95,7 +90,7 @@ "Anthropological Research 33, 452-473 (1977).*\n", "\n", "\n", - "![Karate Club](../img/zachary_black_lines.png)\n", + "\n", "\n", "\n", "Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph (mentioned above) will be used so the starting vertex ID is zero for maximum efficiency. The resulting data will then be auto-unrenumbered, making the entire renumbering process transparent to users.\n" @@ -143,7 +138,7 @@ "outputs": [], "source": [ "# Define the path to the test data \n", - "datafile='../data/karate-data.csv'" + "datafile='../../data/karate-data.csv'" ] }, { @@ -221,33 +216,6 @@ "Let's now look at the results" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Find the most important vertex using the scores\n", - "# This methods should only be used for small graph\n", - "def print_top_scores(_df, txt) :\n", - " m = _df['betweenness_centrality'].max()\n", - " _d = _df.query('betweenness_centrality == @m')\n", - " print(txt)\n", - " print(_d)\n", - " print()\n", - " " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print_top_scores(vertex_bc, \"top vertex centrality scores\")\n", - "print_top_scores(edge_bc, \"top edge centrality scores\")" - ] - }, { "cell_type": "code", "execution_count": null, @@ -342,7 +310,7 @@ "metadata": {}, "source": [ "___\n", - "Copyright (c) 2019-2020, NVIDIA 
CORPORATION.\n", + "Copyright (c) 2019-2022, NVIDIA CORPORATION.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n", "\n", @@ -353,9 +321,9 @@ ], "metadata": { "kernelspec": { - "display_name": "cugraph_dev", + "display_name": "Python 3.8.13 ('cugraph_dev')", "language": "python", - "name": "cugraph_dev" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -367,7 +335,12 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.8.13" + }, + "vscode": { + "interpreter": { + "hash": "cee8a395f2f0c5a5bcf513ae8b620111f4346eff6dc64e1ea99c951b2ec68604" + } } }, "nbformat": 4, diff --git a/notebooks/algorithms/centrality/Centrality.ipynb b/notebooks/algorithms/centrality/Centrality.ipynb new file mode 100644 index 00000000000..4a584a7fb87 --- /dev/null +++ b/notebooks/algorithms/centrality/Centrality.ipynb @@ -0,0 +1,470 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Centrality" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we will compute vertex centrality scores using the various cuGraph algorithms. 
We will then compare the similarities and differences.\n", + "\n", + "| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n", + "| --------------|------------|------------------|-----------------|----------------|\n", + "| Brad Rees | 04/16/2021 | created | 0.19 | GV100, CUDA 11.0\n", + "| Brad Rees | 08/05/2021 | tested / updated | 21.10 nightly | RTX 3090 CUDA 11.4\n", + "| Don Acosta | 07/05/2022 | tested / updated | 22.08 nightly | DGX Tesla V100 CUDA 11.5\n", + " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "Centrality is a measure of how important, or central, a node or edge is within a graph. It is useful for identifying influencers in social networks, key routing nodes in communication/computer network infrastructures, and more.\n", + "\n", + "The seminal paper on centrality is: Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social networks, 1(3), 215-239.\n", + "\n", + "__Degree Centrality__
\n", + "Degree centrality is based on the notion that whoever has the most connections must be important. \n", + "\n", + "
\n", + "\n", + "
\n", + "\n", + "\n", + "___Closeness centrality – coming soon___
\n", + "Closeness is a measure of the shortest path to every other node in the graph. A node that is close to every other node, can reach over other node in the fewest number of hops, means that it has greater influence on the network versus a node that is not close.\n", + "\n", + "__Betweenness Centrality__
\n", + "Betweenness is a measure of the number of shortest paths that cross through a node, or over an edge. A node with high betweenness means that it had a greater influence on the flow of information. \n", + "\n", + "Betweenness centrality of a node 𝑣 is the sum of the fraction of all-pairs shortest paths that pass through 𝑣\n", + "\n", + "
\n", + " \n", + "
\n", + "\n", + "To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively small (under 5,000 nodes) so betweenness on every node will be computed.\n", + "\n", + "__Katz Centrality__
\n", + "Katz is a variant of degree centrality and of eigenvector centrality. \n", + "Katz centrality is a measure of the relative importance of a node within the graph based on measuring the influence across the total number of walks between vertex pairs.\n", + "\n", + "
\n", + " \n", + "
\n", + "\n", + "See:\n", + "* [Katz on Wikipedia](https://en.wikipedia.org/wiki/Katz_centrality) for more details on the algorithm.\n", + "* [Learn more about Katz Centrality](https://www.sci.unich.it/~francesc/teaching/network/katz.html)\n", + "\n", + "__Eigenvector Centrality__
\n", + "Eigenvectors can be thought of as the balancing points of a graph, or center of gravity of a 3D object. High centrality means that more of the graph is balanced around that node.\n", + "The eigenvector centrality for node i is the\n", + "i-th element of the vector x defined by the eigenvector equation.\n", + "\n", + "
\n", + "\n", + "
\n", + "Where M(v) is the adjacency list for the set of vertices(v) and λ is a constant.\n", + "\n", + "See:\n", + "* [Eigenvector Centrality on Wikipedia](https://en.wikipedia.org/wiki/Eigenvector_centrality) for more details on the algorithm.\n", + "* [Learn more about EigenVector Centrality](https://www.sci.unich.it/~francesc/teaching/network/eigenvector.html)\n", + "\n", + "\n", + "__PageRank__
\n", + "PageRank is classified as both a Link Analysis tool and a centrality measure. PageRank is based on the assumption that important nodes point (directed edge) to other important nodes. From a social network perspective, the question is who do you seek for an answer and then who does that person seek. PageRank is good when there is implied importance in the data, for example a citation network, web page linkages, or trust networks. Pagerank also dilutes the importance of nodes which have exceedingly high outward degree which is an improvement over Katz.\n", + "
\n", + "\n", + "
\n", + "\n", + "See:\n", + "* [Pagerank on Wikipedia](https://en.wikipedia.org/wiki/Pagerank) for more details on the algorithm and its commercial use.\n", + "* [Learn more about Pagerank Centrality](https://www.sci.unich.it/~francesc/teaching/network/pagerank.html) which is a centrality measure highly related to the Pagerank algorithm." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Test Data\n", + "We will be using the Zachary Karate club dataset \n", + "*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of\n", + "Anthropological Research 33, 452-473 (1977).*\n", + "\n", + "\n", + "\n", + "\n", + "Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph (mentioned above) will be used so the starting vertex ID is zero for maximum efficiency. The resulting data will then be auto-unrenumbered, making the entire renumbering process transparent to users." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import the cugraph modules\n", + "import cugraph\n", + "import cudf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#import the networkX required modules\n", + "import numpy as np\n", + "import pandas as pd \n", + "from IPython.display import display_html " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Functions\n", + "using underscore variable names to avoid collisions. 
\n", + "non-underscore names are expected to be global names" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compute Centrality\n", + "# the centrality calls are very straightforward with the graph being the primary argument\n", + "# we are using the default argument values for all centrality functions\n", + "\n", + "def compute_centrality(_graph) :\n", + " # Compute Degree Centrality\n", + " _d = cugraph.degree_centrality(_graph)\n", + " \n", + " # Compute the Betweenness Centrality\n", + " _b = cugraph.betweenness_centrality(_graph)\n", + "\n", + " # Compute Katz Centrality\n", + " _k = cugraph.katz_centrality(_graph)\n", + " \n", + " # Compute PageRank Centrality\n", + " _p = cugraph.pagerank(_graph)\n", + "\n", + " # Compute EigenVector Centrality\n", + " _e = cugraph.eigenvector_centrality(_graph, max_iter=1000, tol=1.0e-3)\n", + " \n", + " return _d, _b, _k, _p, _e" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Print function\n", + "def print_centrality(k,dc,bc,katz,pr, ev):\n", + "\n", + " dc_top = dc.sort_values(by='degree_centrality', ascending=False).head(k).to_pandas()\n", + " bc_top = bc.sort_values(by='betweenness_centrality', ascending=False).head(k).to_pandas()\n", + " katz_top = katz.sort_values(by='katz_centrality', ascending=False).head(k).to_pandas()\n", + " pr_top = pr.sort_values(by='pagerank', ascending=False).head(k).to_pandas()\n", + " ev_top = ev.sort_values(by='eigenvector_centrality', ascending=False).head(k).to_pandas()\n", + " \n", + " df1_styler = dc_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Degree').hide(axis='index')\n", + " df2_styler = bc_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Betweenness').hide(axis='index')\n", + " df3_styler = 
katz_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Katz').hide(axis='index')\n", + " df4_styler = pr_top.style.set_table_attributes(\"style='display:inline'\").set_caption('PageRank').hide(axis='index')\n", + " df5_styler = ev_top.style.set_table_attributes(\"style='display:inline'\").set_caption('EigenVector').hide(axis='index')\n", + "\n", + " display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_()+df5_styler._repr_html_(), raw=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Read the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Define the path to the test data \n", + "datafile='../../data/karate-data.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "cuGraph does not do any data reading or writing and is dependent on other tools for that, with cuDF being the preferred solution. \n", + "\n", + "The data file contains an edge list, which represents the connection of a vertex to another. The `source` to `destination` pairs are in what is known as Coordinate Format (COO). In this test case, the data is just two columns. 
However, a third `weight` column is also possible" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "gdf = cudf.read_csv(datafile, delimiter='\\t', names=['src', 'dst'], dtype=['int32', 'int32'] )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "It was that easy to load the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Create a Graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe \n", + "G = cugraph.Graph()\n", + "G.from_cudf_edgelist(gdf, source='src', destination='dst')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Compute Centrality" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "dc, bc, katz, pr, ev = compute_centrality(G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Results\n", + "Typically, analysts just look at the top 10% of results. Basically just those vertices that are the most central or important. \n", + "The karate data has 34 vertices, so let's round a little and look at the top 5 vertices" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "print_centrality(5, dc, bc, katz, pr, ev)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### A Different Dataset\n", + "The Karate dataset is not that large or complex, which makes it a perfect test dataset since it is easy to visually verify results. 
Let's look at a larger dataset with a lot more edges" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Define the path to the test data \n", + "datafile='../../data/netscience.csv'\n", + "\n", + "gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wt'], dtype=['int32', 'int32', 'float'] )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe \n", + "G = cugraph.Graph()\n", + "G.from_cudf_edgelist(gdf, source='src', destination='dst')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "(G.number_of_nodes(), G.number_of_edges())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "dc, bc, katz, pr, ev = compute_centrality(G)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "print_centrality(5, dc, bc, katz, pr, ev)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "We can now see a larger discrepancy between the centrality scores and which nodes rank highest.\n", + "Which centrality measure to use is left to the analyst to decide and does require insight into the different algorithms and graph structure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### And One More Dataset\n", + "Let's look at a Cyber dataset. 
The vertex IDs are IP addresses" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Define the path to the test data \n", + "datafile='../../data/cyber.csv'\n", + "\n", + "gdf = cudf.read_csv(datafile, delimiter=',', names=['idx', 'src', 'dst'], dtype=['int32', 'str', 'str'] )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe \n", + "G = cugraph.Graph()\n", + "G.from_cudf_edgelist(gdf, source='src', destination='dst')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "(G.number_of_nodes(), G.number_of_edges())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "dc, bc, katz, pr, ev = compute_centrality(G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "Here the results of all the measures can be seen side by side." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "print_centrality(5, dc, bc, katz, pr, ev)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "There are differences in how each centrality measure ranks the nodes. In some cases, every algorithm returns similar results, and in others, the results are different. Understanding how the centrality measure is computed and what the edges represent is key to selecting the right centrality metric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "----\n", + "Copyright (c) 2019-2022, NVIDIA CORPORATION.\n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.8.13 ('cugraph_dev')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + }, + "vscode": { + "interpreter": { + "hash": "cee8a395f2f0c5a5bcf513ae8b620111f4346eff6dc64e1ea99c951b2ec68604" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/algorithms/centrality/Degree.ipynb b/notebooks/algorithms/centrality/Degree.ipynb new file mode 100644 index 00000000000..bb3779d9980 --- /dev/null +++ b/notebooks/algorithms/centrality/Degree.ipynb @@ -0,0 +1,323 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Degree Centrality\n", + "\n", + "In this notebook, we will compute the degree centrality for vertices in our test database using cuGraph and NetworkX. 
The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n", + "\n", + "| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n", + "| --------------|------------|------------------|-----------------|----------------|\n", + "| Don Acosta | 07/05/2022 | created | 22.08 nightly | DGX Tesla V100 CUDA 11.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Introduction\n", + "Degree centrality is the simplest measure of relative importance, based on counting the connections of each vertex. Vertices with the most connections are the most central by this measure.\n", + "\n", + "See [Degree Centrality on Wikipedia](https://en.wikipedia.org/wiki/Degree_centrality) for more details on the algorithm.\n", + "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "Degree centrality of a vertex 𝑣 is the sum of the edges incident on that vertex.\n", + "\n", + "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "To compute the Degree centrality scores for a graph in cuGraph we use:
\n", + "__df_v = cugraph.degree_centrality(G)__\n", + " G: cugraph.Graph object\n", + " \n", + "\n", + "Returns:\n", + "\n", + " df: a cudf.DataFrame object with two columns:\n", + " df['vertex']: The vertex identifier for the vertex\n", + " df['degree_centrality']: The degree centrality score for the vertex\n", + "\n", + "\n", + "### _NOTICE_\n", + "cuGraph does not currently support the ‘endpoints’ and ‘weight’ parameters as seen in the corresponding networkX call. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Some notes about vertex IDs...\n", + "* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.\n", + " * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (eg. `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).\n", + " * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Test Data\n", + "We will be using the Zachary Karate club dataset \n", + "*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of\n", + "Anthropological Research 33, 452-473 (1977).*\n", + "\n", + "\n", + "![Karate Club](../../img/zachary_black_lines.png)\n", + "\n", + "\n", + "Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph (mentioned above) will be used so the starting vertex ID is zero for maximum efficiency. 
The resulting data will then be auto-unrenumbered, making the entire renumbering process transparent to users.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Prep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Import needed libraries\n", + "import cugraph\n", + "import cudf" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# NetworkX libraries\n", + "import networkx as nx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Some Prep" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Define the path to the test data \n", + "datafile='../../data/karate-data.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Read in the data - GPU\n", + "cuGraph depends on cuDF for data loading and the initial Dataframe creation.\n", + "\n", + "The data file contains an edge list, which represents the connection of a vertex to another. The `source` to `destination` pairs are in what is known as Coordinate Format (COO). In this test case, the data is just two columns. 
However, a third `weight` column is also possible" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "gdf = cudf.read_csv(datafile, delimiter='\\t', names=['src', 'dst'], dtype=['int32', 'int32'] )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Create a Graph " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe \n", + "G = cugraph.Graph()\n", + "G.from_cudf_edgelist(gdf, source='src', destination='dst')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Call the Degree Centrality algorithm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Call cugraph.degree_centrality \n", + "vertex_bc = cugraph.degree_centrality(G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "_It was that easy!_ \n", + "\n", + "----\n", + "\n", + "Let's now look at the results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Find the most important vertex using the scores\n", + "def print_top_scores(_df, txt) :\n", + " m = _df['degree_centrality'].max()\n", + " _d = _df.query('degree_centrality == @m')\n", + " print(txt)\n", + " print(_d)\n", + " print()\n", + " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "print_top_scores(vertex_bc, \"top degree centrality score\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# let's sort the data and look at the top 5 vertices\n", + "vertex_bc.sort_values(by='degree_centrality', ascending=False).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + 
"---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Now compute using NetworkX" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Read the data; this also creates a NetworkX Graph \n", + "file = open(datafile, 'rb')\n", + "Gnx = nx.read_edgelist(file)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc_nx_vert = nx.degree_centrality(Gnx)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc_nx_sv = sorted(((value, key) for (key,value) in dc_nx_vert.items()), reverse=True)\n", + "dc_nx_sv[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As mentioned, the scores are different but the ranking is the same." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "Copyright (c) 2022, NVIDIA CORPORATION.\n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.\n", + "___" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.8.13 ('cugraph_dev')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + }, + "vscode": { + "interpreter": { + "hash": "cee8a395f2f0c5a5bcf513ae8b620111f4346eff6dc64e1ea99c951b2ec68604" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/algorithms/centrality/Eigenvector.ipynb b/notebooks/algorithms/centrality/Eigenvector.ipynb new file mode 100644 index 00000000000..c7fc3506c89 --- /dev/null +++ b/notebooks/algorithms/centrality/Eigenvector.ipynb @@ -0,0 +1,340 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Eigenvector Centrality\n", + "\n", + "In this notebook, we will compute the Eigenvector centrality for each vertex in our test dataset using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n", + "\n", + "| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n", + "| --------------|------------|------------------|-----------------|----------------|\n", + "| Don Acosta | 07/05/2022 | created | 22.08 nightly | DGX Tesla V100 CUDA 11.5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "Eigenvector centrality computes the centrality for a vertex based on the\n", + "centrality of its neighbors.
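That self-referential definition is usually computed by power iteration. Below is a minimal pure-Python sketch of the idea on a made-up four-vertex graph; the adjacency list, function name, and parameters are illustrative assumptions, not cuGraph's GPU implementation.

```python
import math

# Toy undirected graph as an adjacency list (assumed example data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def eigenvector_centrality_sketch(adj, max_iter=100, tol=1e-6):
    # Start from a uniform score, then repeatedly replace each vertex's
    # score with the sum of its neighbors' scores, rescaling to unit length.
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        prev = x
        x = {v: sum(prev[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(s * s for s in x.values()))
        x = {v: s / norm for v, s in x.items()}
        if sum(abs(x[v] - prev[v]) for v in x) < len(x) * tol:
            break
    return x

scores = eigenvector_centrality_sketch(adj)
print(max(scores, key=scores.get))  # vertex 2: its neighbors are themselves well connected
```

Vertex 2 wins both because it has the most neighbors and because those neighbors are well connected to each other, which is exactly what distinguishes this measure from plain degree.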
The eigenvector centrality of a node measures its influence within a graph by taking into account the vertex's connections to other highly connected vertices.\n", + "\n", + "\n", + "See [Eigenvector Centrality on Wikipedia](https://en.wikipedia.org/wiki/Eigenvector_centrality) for more details on the algorithm.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The eigenvector centrality for node i is the\n", + "i-th element of the vector x defined by the eigenvector equation:\n", + "\n", + "$$x_v = \\frac{1}{\\lambda} \\sum_{u \\in M(v)} x_u$$\n", + "\n", + "where M(v) is the set of neighbors of vertex v and λ is a constant (the leading eigenvalue).\n", + "\n", + "[Learn more about Eigenvector Centrality](https://www.sci.unich.it/~francesc/teaching/network/eigenvector.html)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To compute the Eigenvector centrality scores for a graph in cuGraph we use:
\n", + "__df = cugraph.eigenvector_centrality(G, max_iter=100, tol=1.0e-6, normalized=True)__\n", + "\n", + "    G : cuGraph.Graph or networkx.Graph\n", + "    max_iter : int, optional (default=100)\n", + "        The maximum number of iterations before an answer is returned. This can\n", + "        be used to limit the execution time and do an early exit before the\n", + "        solver reaches the convergence tolerance.\n", + "    tol : float, optional (default=1e-6)\n", + "        Set the tolerance of the approximation; this parameter should be a small\n", + "        magnitude value.\n", + "        The lower the tolerance the better the approximation. If this value is\n", + "        0.0f, cuGraph will use the default value which is 1.0e-6.\n", + "        Setting too small a tolerance can lead to non-convergence due to\n", + "        numerical roundoff. Usually values between 1e-2 and 1e-6 are\n", + "        acceptable.\n", + "    normalized : bool, optional, default=True\n", + "        If True, normalize the resulting eigenvector centrality values\n", + "\n", + "    \n", + "\n", + "Returns:\n", + "\n", + "    df : cudf.DataFrame or Dictionary if using NetworkX\n", + "        GPU data frame containing two cudf.Series of size V: the vertex\n", + "        identifiers and the corresponding eigenvector centrality values.\n", + "        df['vertex'] : cudf.Series\n", + "            Contains the vertex identifiers\n", + "        df['eigenvector_centrality'] : cudf.Series\n", + "            Contains the eigenvector centrality of vertices\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Some notes about vertex IDs...\n", + "* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller.
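The renumber/un-renumber round trip described above can be pictured with a few lines of plain Python. The edge list and mapping names here are made up for illustration; cuGraph performs this on the GPU and keeps the mapping internally.

```python
# Map arbitrary vertex IDs onto a contiguous 0..n-1 range, then back again.
edges = [(10, 42), (42, 7), (7, 10)]  # assumed external vertex IDs

ids = sorted({v for edge in edges for v in edge})
to_internal = {v: i for i, v in enumerate(ids)}       # renumber: 7->0, 10->1, 42->2
to_external = {i: v for v, i in to_internal.items()}  # un-renumber

renumbered = [(to_internal[s], to_internal[d]) for s, d in edges]
print(renumbered)  # [(1, 2), (2, 0), (0, 1)]
```

When the input IDs are already 0..n-1, both dictionaries are the identity mapping, which is why skipping renumbering is a pure speedup in that case.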
If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.\n", + " * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (eg. `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).\n", + " * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Test Data\n", + "We will be using the Zachary Karate club dataset \n", + "*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of\n", + "Anthropological Research 33, 452-473 (1977).*\n", + "\n", + "\n", + "![Karate Club](../../img/zachary_black_lines.png)\n", + "\n", + "\n", + "Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph (mentioned above) will be used so the starting vertex ID is zero for maximum efficiency. 
The resulting data will then be auto-unrenumbered, making the entire renumbering process transparent to users.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prep" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import needed libraries\n", + "import cugraph\n", + "import cudf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# NetworkX libraries\n", + "import networkx as nx\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Some Prep" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define the path to the test data \n", + "datafile='../../data/karate-data.csv'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Method to show the most influential nodes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def topKVertices(eigen, col, k):\n", + "    top = eigen.nlargest(n=k, columns=col)\n", + "    top = top.sort_values(by=col, ascending=False)\n", + "    return top\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Read in the data - GPU\n", + "cuGraph depends on cuDF for data loading and the initial Dataframe creation.\n", + "\n", + "The data file contains an edge list, which represents the connection of a vertex to another. The `source` to `destination` pairs are in what is known as Coordinate Format (COO). In this test case, the data is just two columns.
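As a plain-Python illustration of the COO layout, here is what parsing a tab-separated two-column edge list produces (the sample rows are made up; the notebook itself loads the file with `cudf.read_csv`):

```python
import io

# A tab-separated, two-column edge list: one (source, destination) pair per row.
raw = "1\t2\n1\t3\n2\t3\n"  # assumed sample rows

src, dst = [], []
for line in io.StringIO(raw):
    s, d = line.split("\t")
    src.append(int(s))
    dst.append(int(d))

# COO form: two parallel arrays, where edge i runs from src[i] to dst[i].
print(src, dst)  # [1, 1, 2] [2, 3, 3]
```

A weighted graph simply adds a third parallel array of edge weights.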
However, a third `weight` column is also possible." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gdf = cudf.read_csv(datafile, delimiter='\\t', names=['src', 'dst'], dtype=['int32', 'int32'] )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create a Graph " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe \n", + "G = cugraph.Graph(directed=True)\n", + "G.from_cudf_edgelist(gdf, source='src', destination='dst')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call the Eigenvector Centrality algorithm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Call cugraph.eigenvector_centrality \n", + "k_df = cugraph.eigenvector_centrality(G, max_iter=1000)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_It was that easy!_ \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Now compute using NetworkX" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Read the data; this also creates a NetworkX Graph \n", + "file = open(datafile, 'rb')\n", + "Gnx = nx.read_edgelist(file)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compute and Display the top 5 eigenvector centrality values with their nodes in NetworkX."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nx_evc = nx.eigenvector_centrality(Gnx)\n", + "evc_nx_sv = sorted(((value, key) for (key,value) in nx_evc.items()), reverse=True)\n", + "evc_nx_sv[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Display the top 5 from the earlier cugraph-based computation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "k_df = k_df.rename(columns={\"eigenvector_centrality\": \"cu_eigen\"}, copy=False)\n", + "k_df = k_df.sort_values(\"cu_eigen\").reset_index(drop=True)\n", + "topKVertices(k_df, \"cu_eigen\", 5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As mentioned, the scores are slightly different but the ranking is the same." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "Copyright (c) 2022, NVIDIA CORPORATION.\n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License.\n", + "___" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.8.13 ('cugraph_dev')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + }, + "vscode": { + "interpreter": { + "hash": "cee8a395f2f0c5a5bcf513ae8b620111f4346eff6dc64e1ea99c951b2ec68604" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/centrality/Katz.ipynb b/notebooks/algorithms/centrality/Katz.ipynb similarity index 88% rename from notebooks/centrality/Katz.ipynb rename to notebooks/algorithms/centrality/Katz.ipynb index 6e2396efd47..f3537fe75e7 100755 --- a/notebooks/centrality/Katz.ipynb +++ b/notebooks/algorithms/centrality/Katz.ipynb @@ -8,16 +8,11 @@ "\n", "In this notebook, we will compute the Katz centrality of each vertex in our test dataset using both cuGraph and NetworkX. Additionally, NetworkX also contains a NumPy implementation that will be used.
The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n", "\n", - "Notebook Credits\n", - "* Original Authors: Bradley Rees\n", - "* Created: 10/15/2019\n", - "* Last Edit: 08/16/2020\n", - "\n", - "RAPIDS Versions: 0.14 \n", - "\n", - "Test Hardware\n", - "\n", - "* GV100 32G, CUDA 10.2\n" + "| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n", + "| --------------|------------|------------------|-----------------|----------------|\n", + "| Brad Rees | 10/15/2019 | created | 0.14 | GV100, CUDA 10.2\n", + "| Brad Rees | 08/16/2020 | tested / updated | 0.15.1 nightly | RTX 3090 CUDA 11.4\n", + "| Don Acosta | 07/05/2022 | tested / updated | 22.08 nightly | DGX Tesla V100 CUDA 11.5" ] }, { @@ -75,7 +70,6 @@ "metadata": {}, "source": [ "#### Some notes about vertex IDs...\n", - "* The current version of cuGraph requires that vertex IDs be representable as 32-bit integers, meaning graphs currently can contain at most 2^32 unique vertex IDs. However, this limitation is being actively addressed and a version of cuGraph that accommodates more than 2^32 vertices will be available in the near future.\n", "* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.\n", " * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (eg. 
`G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).\n", "     * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`\n" @@ -91,7 +85,7 @@ "Anthropological Research 33, 452-473 (1977).*\n", "\n", "\n", - "![Karate Club](../img/zachary_black_lines.png)\n", + "![Karate Club](../../img/zachary_black_lines.png)\n", "\n", "\n", "Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph (mentioned above) will be used so the starting vertex ID is zero for maximum efficiency. The resulting data will then be auto-unrenumbered, making the entire renumbering process transparent to users.\n" @@ -101,7 +95,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Prep" + "### Importing needed Libraries" ] }, { @@ -110,7 +104,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Import needed libraries\n", + "# Import rapids libraries\n", "import cugraph\n", "import cudf" ] }, { @@ -121,7 +115,7 @@ "metadata": {}, "outputs": [], "source": [ - "# NetworkX libraries\n", + "# Import NetworkX libraries\n", "import networkx as nx" ] }, { @@ -129,7 +123,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Some Prep" + "### Parameters\n", + "max_iter determines the number of iterations the algorithm will run to seek convergence.\n", + "tol is the maximum difference that indicates convergence.\n", + "The algorithm will fail if it fails to converge within the tolerance (tol) after max_iter number of runs.
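The interplay of `max_iter` and `tol` can be sketched in a few lines of pure Python using the standard Katz fixed-point update, x(v) = beta + alpha * sum of x over v's in-neighbors. The toy graph and the alpha/beta values are assumptions for illustration; cuGraph's solver is a GPU implementation of the same idea.

```python
# Katz centrality by fixed-point iteration on a tiny directed graph (assumed data).
in_neighbors = {0: [1, 2], 1: [2], 2: [], 3: [2]}  # in_neighbors[v] = vertices pointing at v
alpha, beta = 0.1, 1.0        # alpha must stay below 1/lambda_max for convergence
max_iter, tol = 1000, 1e-6

x = {v: 0.0 for v in in_neighbors}
converged = False
for _ in range(max_iter):
    new = {v: beta + alpha * sum(x[u] for u in in_neighbors[v]) for v in in_neighbors}
    # converged when no vertex moved by more than tol in this pass
    if max(abs(new[v] - x[v]) for v in x) < tol:
        converged = True
    x = new
    if converged:
        break

print(converged, round(x[0], 3))  # True 1.21
```

If `alpha` were raised above the reciprocal of the graph's largest eigenvalue, the scores would grow without bound and the loop would exhaust `max_iter` without `converged` ever becoming true, which is the failure mode the notebook text describes.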
This is often due to dataset characteristics.\n" ] }, { @@ -150,7 +147,7 @@ "outputs": [], "source": [ "# Define the path to the test data \n", - "datafile='../data/karate-data.csv'" + "datafile='../../data/karate-data.csv'" ] }, { @@ -247,30 +244,6 @@ "Let's now look at the results" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Find the most important vertex using the scores\n", - "# This methods should only be used for small graph\n", - "def find_top_scores(_df) :\n", - " m = _df['katz_centrality'].max()\n", - " return _df.query('katz_centrality >= @m')\n", - " " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "top_df = find_top_scores(gdf_katz)\n", - "top_df" - ] - }, { "cell_type": "code", "execution_count": null, @@ -364,7 +337,7 @@ "metadata": {}, "source": [ "___\n", - "Copyright (c) 2019-2020, NVIDIA CORPORATION.\n", + "Copyright (c) 2019-2022, NVIDIA CORPORATION.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n", "\n", @@ -375,9 +348,9 @@ ], "metadata": { "kernelspec": { - "display_name": "cugraph_dev", + "display_name": "Python 3.8.13 ('cugraph_dev')", "language": "python", - "name": "cugraph_dev" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -389,7 +362,12 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.8.13" + }, + "vscode": { + "interpreter": { + "hash": "cee8a395f2f0c5a5bcf513ae8b620111f4346eff6dc64e1ea99c951b2ec68604" + } } }, "nbformat": 4, diff --git a/notebooks/algorithms/centrality/README.md b/notebooks/algorithms/centrality/README.md new file mode 100644 index 00000000000..608dd239029 --- /dev/null +++ b/notebooks/algorithms/centrality/README.md @@ -0,0 +1,41 @@ + +# cuGraph Centrality Notebooks + + + +cuGraph Centrality notebooks contain a collection of Jupyter Notebooks that demonstrate algorithms to identify and quantify importance of vertices to the structure of the graph. In the diagram above, the highlighted vertices are highly important and are likely answers to questions like: + +* Which vertices have the highest degree (most direct links) ? +* Which vertices are on the most efficient paths through the graph? +* Which vertices connect the most important vertices to each other? + +But which vertices are most important? The answer depends on which measure/algorithm is run. Manipulation of the data before or after the graph analytic is not covered here. 
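To see how the measures can disagree, here is a small pure-Python comparison of degree and betweenness on a five-vertex path graph; the graph and the brute-force helper are illustrative assumptions, while the notebooks themselves use cuGraph.

```python
from collections import deque
from itertools import combinations

# Path graph 0-1-2-3-4 (assumed example).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def shortest_path(adj, s, t):
    # Breadth-first search; on a path graph the shortest path is unique.
    prev = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in prev:
                prev[w] = u
                queue.append(w)
    path, u = [], t
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

degree = {v: len(adj[v]) for v in adj}
betweenness = {v: 0 for v in adj}
for s, t in combinations(adj, 2):
    for v in shortest_path(adj, s, t)[1:-1]:  # interior vertices only
        betweenness[v] += 1

# Degree ties vertices 1, 2 and 3; betweenness singles out the middle vertex.
print(sorted(v for v in degree if degree[v] == max(degree.values())))  # [1, 2, 3]
print(max(betweenness, key=betweenness.get))  # 2
```

Degree cannot distinguish the three interior vertices, but betweenness ranks vertex 2 highest because every path between the two halves of the graph must cross it.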
Extended, more problem-focused notebooks are being created and are available at https://github.com/rapidsai/notebooks-extended + + ## Summary + + |Algorithm |Notebooks Containing |Description | + | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | + |Degree Centrality| [Centrality](Centrality.ipynb), [Degree](Degree.ipynb) |Measure based on counting direct connections for each vertex| + |Betweenness Centrality| [Centrality](Centrality.ipynb), [Betweenness](Betweenness.ipynb) |Number of shortest paths through the vertex| + |Eigenvector Centrality|[Centrality](Centrality.ipynb), [Eigenvector](Eigenvector.ipynb)|Measure of connectivity to other important vertices (which also have high connectivity), often referred to as the influence measure of a vertex| + |Katz Centrality|[Centrality](Centrality.ipynb), [Katz](Katz.ipynb) |Similar to Eigenvector but with tweaks to better measure weakly connected graphs | + |Pagerank|[Centrality](Centrality.ipynb), [Pagerank](../../link_analysis/Pagerank.ipynb) |Classified as both a link analysis and centrality measure by quantifying incoming links from central vertices. | + + [System Requirements](../../README.md#requirements) + + | Author Credit | Date | Update | cuGraph Version | Test Hardware | + | --------------|------------|------------------|-----------------|----------------| + | Brad Rees | 04/19/2021 | created | 0.19 | GV100, CUDA 11.0 + | Don Acosta | 07/05/2022 | tested / updated | 22.08 nightly | DGX Tesla V100 CUDA 11.5 + + ## Copyright + + Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +![RAPIDS](../../img/rapids_logo.png) diff --git a/notebooks/centrality/Centrality.ipynb b/notebooks/centrality/Centrality.ipynb deleted file mode 100644 index 8272d208cf5..00000000000 --- a/notebooks/centrality/Centrality.ipynb +++ /dev/null @@ -1,761 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Centrality" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this notebook, we will compute vertex centrality scores using the various cuGraph algorithms. We will then compare the similarities and differences.\n", - "\n", - "| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n", - "| --------------|------------|------------------|-----------------|----------------|\n", - "| Brad Rees | 04/16/2021 | created | 0.19 | GV100, CUDA 11.0\n", - "| | 08/05/2021 | tested / updated | 21.10 nightly | RTX 3090 CUDA 11.4\n", - "\n", - " " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Centrality is measure of how important, or central, a node or edge is within a graph. It is useful for identifying influencer in social networks, key routing nodes in communication/computer network infrastructures, \n", - "\n", - "The seminal paper on centrality is: Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social networks, 1(3), 215-239.\n", - "\n", - "\n", - "__Degree centrality__ – _done but needs an API_
\n", - "Degree centrality is based on the notion that whoever has the most connections must be important. \n", - "\n", - "
\n", - " Cd(v) = degree(v)\n", - "
\n", - "\n", - "cuGraph currently does not have a Degree Centrality function call. However, since Degree Centrality is just the degree of a node, we can use _G.degree()_ function.\n", - "Degree Centrality for a Directed graph can be further divided in _indegree centrality_ and _outdegree centrality_ and can be obtained using _G.degrees()_\n", - "\n", - "\n", - "___Closeness centrality – coming soon___
\n", - "Closeness is a measure of the shortest path to every other node in the graph. A node that is close to every other node, can reach over other node in the fewest number of hops, means that it has greater influence on the network versus a node that is not close.\n", - "\n", - "__Betweenness Centrality__
\n", - "Betweenness is a measure of the number of shortest paths that cross through a node, or over an edge. A node with high betweenness means that it had a greater influence on the flow of information. \n", - "\n", - "Betweenness centrality of a node 𝑣 is the sum of the fraction of all-pairs shortest paths that pass through 𝑣\n", - "\n", - "
\n", - " \n", - "
\n", - "\n", - "To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively small (under 5,000 nodes) so betweenness on every node will be computed.\n", - "\n", - "___Eigenvector Centrality - coming soon___
\n", - "Eigenvectors can be thought of as the balancing points of a graph, or center of gravity of a 3D object. High centrality means that more of the graph is balanced around that node.\n", - "\n", - "__Katz Centrality__
\n", - "Katz is a variant of degree centrality and of eigenvector centrality. \n", - "Katz centrality is a measure of the relative importance of a node within the graph based on measuring the influence across the total number of walks between vertex pairs. \n", - "\n", - "
\n", - " \n", - "
\n", - "\n", - "See:\n", - "* [Katz on Wikipedia](https://en.wikipedia.org/wiki/Katz_centrality) for more details on the algorithm.\n", - "* https://www.sci.unich.it/~francesc/teaching/network/katz.html\n", - "\n", - "__PageRank__
\n", - "PageRank is classified as both a Link Analysis tool and a centrality measure. PageRank is based on the assumption that important nodes point (directed edge) to other important nodes. From a social network perspective, the question is who do you seek for an answer and then who does that person seek. PageRank is good when there is implied importance in the data, for example a citation network, web page linkages, or trust networks. \n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Test Data\n", - "We will be using the Zachary Karate club dataset \n", - "*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of\n", - "Anthropological Research 33, 452-473 (1977).*\n", - "\n", - "\n", - "![Karate Club](../img/zachary_black_lines.png)\n", - "\n", - "\n", - "Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph (mentioned above) will be used so the starting vertex ID is zero for maximum efficiency. The resulting data will then be auto-unrenumbered, making the entire renumbering process transparent to users." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# Import the modules\n", - "import cugraph\n", - "import cudf" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd \n", - "from IPython.display import display_html " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Functions\n", - "using underscore variable names to avoid collisions. 
\n", - "non-underscore names are expected to be global names" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "# Compute Centrality\n", - "# the centrality calls are very straightforward with the graph being the primary argument\n", - "# we are using the default argument values for all centrality functions\n", - "\n", - "def compute_centrality(_graph) :\n", - " # Compute Degree Centrality\n", - " _d = _graph.degree()\n", - " \n", - " # Compute the Betweenness Centrality\n", - " _b = cugraph.betweenness_centrality(_graph)\n", - "\n", - " # Compute Katz Centrality\n", - " _k = cugraph.katz_centrality(_graph)\n", - " \n", - " # Compute PageRank Centrality\n", - " _p = cugraph.pagerank(_graph)\n", - " \n", - " return _d, _b, _k, _p" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "# Print function\n", - "# being lazy and requiring that the dataframe names are not changed versus passing them in\n", - "def print_centrality(_n):\n", - " dc_top = dc.sort_values(by='degree', ascending=False).head(_n).to_pandas()\n", - " bc_top = bc.sort_values(by='betweenness_centrality', ascending=False).head(_n).to_pandas()\n", - " katz_top = katz.sort_values(by='katz_centrality', ascending=False).head(_n).to_pandas()\n", - " pr_top = pr.sort_values(by='pagerank', ascending=False).head(_n).to_pandas()\n", - " \n", - " df1_styler = dc_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Degree').hide_index()\n", - " df2_styler = bc_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Betweenness').hide_index()\n", - " df3_styler = katz_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Katz').hide_index()\n", - " df4_styler = pr_top.style.set_table_attributes(\"style='display:inline'\").set_caption('PageRank').hide_index()\n", - "\n", - " 
display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_(), raw=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Read the data" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "# Define the path to the test data \n", - "datafile='../data/karate-data.csv'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "cuGraph does not do any data reading or writing and is dependent on other tools for that, with cuDF being the preferred solution. \n", - "\n", - "The data file contains an edge list, which represents the connection of a vertex to another. The `source` to `destination` pairs is in what is known as Coordinate Format (COO). In this test case, the data is just two columns. However a third, `weight`, column is also possible" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "gdf = cudf.read_csv(datafile, delimiter='\\t', names=['src', 'dst'], dtype=['int32', 'int32'] )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "it was that easy to load data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a Graph" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe \n", - "G = cugraph.Graph()\n", - "G.from_cudf_edgelist(gdf, source='src', destination='dst')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Compute Centrality" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "dc, bc, katz, pr = compute_centrality(G)" - ] - }, - { - 
"cell_type": "markdown", - "metadata": {}, - "source": [ - "### Results\n", - "Typically, analysts just look at the top 10% of results. Basically just those vertices that are the most central or important. \n", - "The karate data has 32 vertices, so let's round a little and look at the top 5 vertices" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Top 5 vertices by Degree: 34, 1, 33, 3, 2; by Betweenness: 1, 34, 33, 3, 32; by Katz: 34, 1, 33, 3, 2; by PageRank: 34, 1, 33, 3, 2
"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "print_centrality(5)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### A Different Dataset\n",
- "The Karate dataset is not that large or complex, which makes it a perfect test dataset since it is easy to visually verify results. Let's look at a larger dataset with a lot more edges."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Define the path to the test data\n",
- "datafile='../data/netscience.csv'\n",
- "\n",
- "gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wt'], dtype=['int32', 'int32', 'float'] )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [],
- "source": [
- "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe\n",
- "G = cugraph.Graph()\n",
- "G.from_cudf_edgelist(gdf, source='src', destination='dst')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(1461, 2742)"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "(G.number_of_nodes(), G.number_of_edges())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [],
- "source": [
- "dc, bc, katz, pr = compute_centrality(G)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- Degree           Betweenness                      Katz                      PageRank
- degree  vertex   betweenness_centrality  vertex   katz_centrality  vertex   pagerank  vertex
- 68       33      0.026572                 78      0.158191         1429     0.004183   78
- 54       34      0.023090                150      0.158191         1430     0.003771   33
- 54       78      0.019135                516      0.158191         1431     0.002800   34
- 42       54      0.018074                281      0.154591          645     0.002387  281
- 40      294      0.017088                216      0.154591         1432     0.002373  294
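Katz centrality, shown in the tables, scores each vertex by summing attenuated contributions from its neighbors, repeatedly applying x_v = beta + alpha * sum of neighbor scores. A minimal fixed-point sketch under assumed toy values (the `alpha`, `beta`, and 4-vertex `neighbors` graph are made up for illustration; cuGraph's implementation additionally normalizes the result and uses a convergence tolerance):

```python
# Hedged sketch of Katz centrality via fixed-point iteration on a toy
# undirected graph. alpha must stay below 1/lambda_max of the adjacency
# matrix for the iteration to converge; 0.1 is safe for this tiny graph.
alpha, beta = 0.1, 1.0
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

x = {v: 1.0 for v in neighbors}
for _ in range(100):
    # Each new score is beta plus alpha times the sum of neighbor scores.
    x = {v: beta + alpha * sum(x[u] for u in neighbors[v]) for v in neighbors}

# Vertex 2 (the highest-degree vertex) ends up with the largest score.
```

Unlike degree, Katz also rewards being adjacent to well-connected vertices, which is why its ranking can diverge from the Degree column.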
"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "print_centrality(5)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now see a larger discrepancy between the centrality scores and which nodes rank highest.\n",
- "Which centrality measure to use is left to the analyst to decide, and deciding requires insight into the different algorithms and the graph structure."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### And One More Dataset\n",
- "Let's look at a Cyber dataset. The vertex IDs are IP addresses."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Define the path to the test data\n",
- "datafile='../data/cyber.csv'\n",
- "\n",
- "gdf = cudf.read_csv(datafile, delimiter=',', names=['idx', 'src', 'dst'], dtype=['int32', 'str', 'str'] )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [],
- "source": [
- "# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe\n",
- "G = cugraph.Graph()\n",
- "G.from_cudf_edgelist(gdf, source='src', destination='dst')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(54, 174)"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "(G.number_of_nodes(), G.number_of_edges())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [],
- "source": [
- "dc, bc, katz, pr = compute_centrality(G)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- Degree                   Betweenness                             Katz                             PageRank
- degree  vertex           betweenness_centrality  vertex          katz_centrality  vertex          pagerank  vertex
- 26      175.45.176.1     0.112091                10.40.85.1      0.213361         149.171.126.6   0.038591  175.45.176.1
- 26      175.45.176.3     0.052250                224.0.0.5       0.206289         59.166.0.4      0.038591  175.45.176.3
- 26      175.45.176.0     0.048621                10.40.182.1     0.206289         59.166.0.1      0.038591  175.45.176.0
- 26      175.45.176.2     0.033745                175.45.176.1    0.206289         59.166.0.5      0.038591  175.45.176.2
- 22      149.171.126.6    0.033745                175.45.176.3    0.206289         59.166.0.2      0.028716  10.40.85.1
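The PageRank column can be sketched with a few lines of power iteration. This is a hedged toy illustration, not cuGraph's GPU solver: the `graph` below is a made-up 3-vertex directed graph, and cuGraph additionally supports weights, personalization, and tolerance-based early stopping.

```python
# Minimal PageRank power iteration on a toy directed graph.
# damping=0.85 is the conventional default; vertex names are invented.
damping = 0.85
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 / len(graph) for v in graph}

for _ in range(50):
    # Every vertex keeps a (1-d)/N base score, then receives a damped
    # share of each in-neighbor's rank, split across its out-edges.
    new = {v: (1.0 - damping) / len(graph) for v in graph}
    for v, outs in graph.items():
        for w in outs:
            new[w] += damping * ranks[v] / len(outs)
    ranks = new

# Scores sum to ~1.0; "c" ranks highest since both "a" and "b" link to it.
```

The same intuition explains the cyber table: the 175.45.176.x vertices receive edges from many sources, so they dominate the PageRank column even though other vertices score higher on betweenness or Katz.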
"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "print_centrality(5)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "There are differences in how each centrality measure ranks the nodes. In some cases, every algorithm returns similar results, and in others, the results differ. Understanding how each centrality measure is computed and what the edges represent is key to selecting the right centrality metric."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "----\n",
- "Copyright (c) 2019-2021, NVIDIA CORPORATION.\n",
- "\n",
- "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
- "\n",
- "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "cugraph_dev",
- "language": "python",
- "name": "cugraph_dev"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/notebooks/img/zachary_black_lines.png b/notebooks/img/zachary_black_lines.png
index 937137e9649..3681dbfb902 100644
Binary files a/notebooks/img/zachary_black_lines.png and b/notebooks/img/zachary_black_lines.png differ
diff --git a/notebooks/img/zachary_graph_centrality.png b/notebooks/img/zachary_graph_centrality.png
new file mode 100644
index 00000000000..54a91314d26
Binary files /dev/null and b/notebooks/img/zachary_graph_centrality.png differ
diff --git a/notebooks/img/zachary_graph_pagerank.png b/notebooks/img/zachary_graph_pagerank.png
index 3a9c34d32bf..5a899de7403 100644
Binary files a/notebooks/img/zachary_graph_pagerank.png and b/notebooks/img/zachary_graph_pagerank.png differ