diff --git a/notebooks/applications/CostMatrix.ipynb b/notebooks/applications/CostMatrix.ipynb new file mode 100644 index 00000000000..687b1526069 --- /dev/null +++ b/notebooks/applications/CostMatrix.ipynb @@ -0,0 +1,641 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to compute a _Cost Matrix_ by replicating data\n", + "# Skip notebook test\n", + "\n", + "### Approach\n", + "A simple approach to creating a cost matrix is to run All-Source Shortest Path (ASSP), however cuGraph currently does not have an All-Source Shortest Path (ASSP) algorithm. One is on the roadmap, based on Floyd-Warshall, but that doesn't help us today. Luckily there is a work around if the graph to be processed is small. The hack is to run ASSP by creating a lot of copies of the graph and running the Single Source Shortest Path (SSSP) on one seed per graph copy. Since each SSSP run within its own disjoint component, there is no issue with path collisions between seeds. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Notebook Organization\n", + "The first portion of the notebook discusses each step independently. It gives insight into what is going on and how fast each step takes.\n", + "\n", + "The second section puts it all the steps together in a single function and times how long with would take to compute the matrix\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data\n", + "\n", + "In this notebook we will use the email-Eu-core\n", + "\n", + "* Number of Vertices: 1,005\n", + "* Number of Edges: 25,571\n", + "\n", + "We are using this dataset since it is small with a few communities, meaning that there are paths to be found." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Notebook Revisions\n", + "\n", + "| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n", + "| --------------|------------|------------------|-----------------|----------------|\n", + "| Brad Rees | 06/21/2022 | created | 22.08 | V100 w 32 GB, CUDA 11.5\n", + "| Don Acosta | 06/28/2022 | modified | 22.08 | V100 w 32 GB, CUDA 11.5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### References\n", + "\n", + "* https://www.sciencedirect.com/topics/mathematics/cost-matrix\n", + "* https://en.wikipedia.org/wiki/Shortest_path_problem\n", + "\n", + "Dataset\n", + "* Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. Local Higher-order Graph Clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.\n", + "\n", + "* J. Leskovec, J. Kleinberg and C. Faloutsos. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 1(1), 2007. http://www.cs.cmu.edu/~jure/pubs/powergrowth-tkdd.pdf\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# system and other\n", + "import time\n", + "from time import perf_counter\n", + "import math\n", + "\n", + "# rapids\n", + "import cugraph\n", + "import cudf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "-----\n", + "# Reading the data\n", + "\n", + "Let's start with data read" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# simple function to read in the CSV data file\n", + "def read_data_cudf(datafile):\n", + " gdf = cudf.read_csv(datafile,\n", + " delimiter=\" \",\n", + " header=None,\n", + " names=['src','dst', 'wt'])\n", + " return gdf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# function to determine the number of nodes in the dataset\n", + "def find_number_of_nodes(df):\n", + " node = cudf.concat([df['src'], df['dst']])\n", + " node = node.unique()\n", + " return len(node)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Read the data and verify that it is zero based (e.g. first vertex is 0)\n", + "**IMPORTANT:** The node numbering must be zero based. We use the starting index on the replicated graph to be one larger than the number of vertices. If the starting index is not zero, then the graph copies will overlap in index space and not be independent (disjoint). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t1 = perf_counter()\n", + "gdf = read_data_cudf('../data/email-Eu-core.csv')\n", + "read_t = perf_counter() - t1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f\" read {len(gdf)} edges in {read_t} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# verify that the starting ID is zero\n", + "min([gdf['src'].min(), gdf['dst'].min()])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the max ID\n", + "max([gdf['src'].max(), gdf['dst'].max()])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# the number of nodes should be one greater than the max ID\n", + "# that is the ID that we start the next instance of the data at\n", + "offset = find_number_of_nodes(gdf)\n", + "print(offset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Now let's dive into how to replicate the data\n", + "We will use a model that doubles the data at each pass. That is a lot faster \n", + "than adding one copy at a time. \n", + "The number of disjoint versions of the data will be a power of 2.\n", + "Although the power of 2 replication results in faster data set growth and Graph building, the simple order one replication is shown here for illustration purposes.\n", + "\n", + "\n", + "![Data Duplicated](../../notebooks/img/graph_after_replication.png)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This function creates additional version of the data \n", + "\n", + "def make_data(base_df, N):\n", + " id = find_number_of_nodes(base_df)\n", + " _d = base_df\n", + "\n", + " for x in range(N):\n", + " tmp = _d.copy()\n", + " tmp['src'] += id\n", + " tmp['dst'] += id\n", + " _d = cudf.concat([_d,tmp])\n", + " id = id * 2\n", + " return _d" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%timeit\n", + "_ = make_data(gdf, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "gdf2 = make_data(gdf, 3)\n", + "print()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# simple print to show tha there is not a lot more data\n", + "# print # of Edges and # of Nodes\n", + "print(f\"Old {len(gdf)} {find_number_of_nodes(gdf)}\")\n", + "print(f\"New {len(gdf2)} {find_number_of_nodes(gdf2)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build the ghost node connection set\n", + "A ghost node is an artificially added node to parallelize/simulate the all-points shortest path algorithm which is not yet supported.\n", + "After the ghost node is added, the 2nd hop is actually the all points shortest path.\n", + "The Ghost node is later removed after the Shortest path algorithms are run.\n", + "\n", + "![Ghost Node](../../notebooks/img/graph_after_ghost.png)\n", + "\n", + "The Ghost Node is connected to a different corresponding node in each replication so all sources are covered.\n", + "\n", + "In this simple example of a four-node 'square' graph after complete replication and adding the ghost node, the graph looks like this:\n", + "\n", + "![Ghost Node](../../notebooks/img/Full-four_node_replication.png)\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def add_ghost_node(_df, N):\n", + " # get the size of the graph. That number will be the ghost node ID\n", + " ghost_node_id = find_number_of_nodes(_df)\n", + " \n", + " num_copies = math.floor(math.pow(2, N))\n", + "\n", + " seeds = cudf.DataFrame()\n", + " seeds['dst'] = [((offset * x) + x) for x in range(num_copies)]\n", + " seeds['src'] = ghost_node_id\n", + " \n", + " _d = cudf.concat([_df, seeds])\n", + " \n", + " return _d, ghost_node_id" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%timeit\n", + "_, _ = add_ghost_node(gdf2, 10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gdf_with_ghost, ghost_id = add_ghost_node(gdf2, 10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create an Empty directed Graph" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "G = cugraph.Graph(directed=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Populate the new graph with an edgelist containing\n", + "* The original Data\n", + "* The replicated data copies\n", + "* Each replication connected to the Ghost Node by a single edge from a different node\n", + "in each copy of the graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%time\n", + "G.from_cudf_edgelist(gdf_with_ghost, source='src', destination='dst', renumber=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%time\n", + "G.number_of_edges()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run Single Source Shortest Path (SSSP) from the ghost node\n", + "The single Ghost node source becomes a all-source shortest path after one hop since all the\n", + "replicated data is connected through that node. This will include extraneous ghost node related data which will be removed in later steps." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%timeit\n", + "X = cugraph.sssp(G, ghost_id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X = cugraph.sssp(G, ghost_id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This result will contain a ghost node like the simple example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Now reset vertex IDs and convert to a cost matrix\n", + "All edges with the ghost node as a source are removed here." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# drop the ghost node which doesnt exist so remove from matrix.\n", + "X = X[X['predecessor'] != ghost_id]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Apply the CuGraph filter which removes all nodes not encountered during the graph traversal. In this case the SSSP." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# drop unreachable\n", + "X = cugraph.filter_unreachable(X)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Remove the path cost that was incurred by going to the single seed in each copy from the ghost node." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# adjust distances so that they don't go to the ghost node\n", + "X['distance'] -= 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Now the Ghost node and tangential edges are removed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Calculate the seed for each copy. This is where it is critical that the original graph node numbering is zero based." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add a new column for the seed\n", + "# since each seed was a different component with a different offset amount, exploit that to determine the seed number\n", + "X['seed'] = (X['vertex'] / offset).astype(int)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X.head(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Now adjust all vertices to be in the correct range\n", + "# resets the seed number to the\n", + "X['v2'] = X['vertex'] - (X['seed'] * offset)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Finally just pull out the cost matrix\n", + "cost = X.drop(columns=['vertex', 'predecessor'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cost.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# cleanup \n", + "del G\n", + "del X\n", + "del gdf_with_ghost\n", + "del gdf2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "----\n", + "# Section 2: Do it all in a single function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set the number of replications - 10 will produce 1,024 graphs\n", + "N = 10" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def build_cost_matrix(_gdf):\n", + " data = make_data(_gdf, N)\n", + " gdf_with_ghost, ghost_id = add_ghost_node(data, N)\n", + " \n", + " G = cugraph.Graph(directed=True)\n", + " G.from_cudf_edgelist(gdf_with_ghost, source='src', destination='dst', renumber=False)\n", + " \n", + " X = cugraph.sssp(G, ghost_id)\n", + " \n", + " X = X[X['predecessor'] != ghost_id]\n", + " X = cugraph.filter_unreachable(X)\n", + " X['distance'] -= 1\n", + " X['seed'] = (X['vertex'] / offset).astype(int)\n", + " X['v2'] = X['vertex'] - (X['seed'] * offset)\n", + " cost = X.drop(columns=['vertex', 'predecessor'])\n", + " \n", + " return cost" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%timeit\n", + "CM = build_cost_matrix(gdf)\n", + "CM" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "CM = build_cost_matrix(gdf)\n", + "CM.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "Copyright (c) 2022, NVIDIA CORPORATION.\n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n", + "___" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "cugraph_dev", + "language": "python", + "name": "cugraph_dev" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + }, + "vscode": { + "interpreter": { + "hash": "cee8a395f2f0c5a5bcf513ae8b620111f4346eff6dc64e1ea99c951b2ec68604" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/img/Full-four_node_replication.png b/notebooks/img/Full-four_node_replication.png new file mode 100644 index 00000000000..8cbc3cd1dca Binary files /dev/null and b/notebooks/img/Full-four_node_replication.png differ diff --git a/notebooks/img/graph_after_ghost.png b/notebooks/img/graph_after_ghost.png new file mode 100644 index 00000000000..256ac9b5425 Binary files /dev/null and b/notebooks/img/graph_after_ghost.png differ diff --git a/notebooks/img/graph_after_replication.png b/notebooks/img/graph_after_replication.png new file mode 100644 index 00000000000..3fba876e899 Binary files /dev/null and b/notebooks/img/graph_after_replication.png differ