DOC Fix for Renumber-2.ipynb (rapidsai#2335)
Fixed issue rapidsai#2178. 

Updated to the correct number of unique IPs. Added code to confirm cuGraph results using CuPy.

Authors:
  - Ralph Liu (https://github.com/oorliu)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)

URL: rapidsai#2335
oorliu authored Jun 6, 2022
1 parent 305662f commit 0dbb785
Showing 19 changed files with 80 additions and 65 deletions.
6 changes: 3 additions & 3 deletions notebooks/centrality/Betweenness.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Betweenness Centrality\n",
"\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test datase using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test database using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"\n",
"Notebook Credits\n",
"* Original Authors: Bradley Rees\n",
@@ -25,7 +25,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge . High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge. High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"\n",
"See [Betweenness on Wikipedia](https://en.wikipedia.org/wiki/Betweenness_centrality) for more details on the algorithm.\n",
"\n"
@@ -244,7 +244,7 @@
"metadata": {},
"outputs": [],
"source": [
"print_top_scores(vertex_bc, \"top vertice centrality scores\")\n",
"print_top_scores(vertex_bc, \"top vertex centrality scores\")\n",
"print_top_scores(edge_bc, \"top edge centrality scores\")"
]
},
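For reference alongside the edits above, a minimal sketch of the vertex and edge betweenness calls this notebook makes. The data path, delimiter, and column names are assumptions for the sketch, not taken from the diff:

    import cudf
    import cugraph

    # Read a tab-separated edge list (path and format are assumptions here)
    gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                        names=['src', 'dst'], dtype=['int32', 'int32'])

    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst')

    # Vertex betweenness: one row per vertex with a 'betweenness_centrality' score
    vertex_bc = cugraph.betweenness_centrality(G)

    # Edge betweenness: one row per edge with a 'betweenness_centrality' score
    edge_bc = cugraph.edge_betweenness_centrality(G)

    # Top scores, highest first
    print(vertex_bc.sort_values('betweenness_centrality', ascending=False).head())
    print(edge_bc.sort_values('betweenness_centrality', ascending=False).head())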
6 changes: 3 additions & 3 deletions notebooks/centrality/Centrality.ipynb
@@ -53,7 +53,7 @@
" <img src=\"https://latex.codecogs.com/png.latex?c_B(v)&space;=\\sum_{s,t&space;\\in&space;V}&space;\\frac{\\sigma(s,&space;t|v)}{\\sigma(s,&space;t)}\" title=\"c_B(v) =\\sum_{s,t \\in V} \\frac{\\sigma(s, t|v)}{\\sigma(s, t)}\" />\n",
"</center>\n",
"\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively smalled (under 5,000 nodes) so betweenness on every node will be computed.\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively small (under 5,000 nodes) so betweenness on every node will be computed.\n",
"\n",
"___Eigenvector Centrality - coming soon___ <br>\n",
"Eigenvectors can be thought of as the balancing points of a graph, or center of gravity of a 3D object. High centrality means that more of the graph is balanced around that node.\n",
@@ -128,7 +128,7 @@
"outputs": [],
"source": [
"# Compute Centrality\n",
"# the centrality calls are very straight forward with the graph being the primary argument\n",
"# the centrality calls are very straightforward with the graph being the primary argument\n",
"# we are using the default argument values for all centrality functions\n",
"\n",
"def compute_centrality(_graph) :\n",
@@ -257,7 +257,7 @@
"metadata": {},
"source": [
"### Results\n",
"Typically, analyst look just at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"Typically, analysts just look at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"The karate data has 32 vertices, so let's round a little and look at the top 5 vertices"
]
},
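The compute_centrality helper touched above only appears in fragments in this diff; a hedged sketch of what such a helper could look like, using cuGraph calls from this release (the k parameter is the sampling shortcut the text mentions). Function and variable names are placeholders, not the notebook's exact code:

    import cugraph

    def compute_centrality(_graph, k=None):
        # Betweenness centrality; passing k estimates the scores from k
        # randomly chosen source vertices instead of the full computation
        bc_df = cugraph.betweenness_centrality(_graph, k=k)

        # Katz centrality with default parameters
        katz_df = cugraph.katz_centrality(_graph)

        # PageRank, often read as another centrality-style score
        pr_df = cugraph.pagerank(_graph)

        return bc_df, katz_df, pr_df

    # Usage (assuming a cugraph.Graph named G was built earlier):
    # bc_df, katz_df, pr_df = compute_centrality(G)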
2 changes: 1 addition & 1 deletion notebooks/centrality/Katz.ipynb
@@ -214,7 +214,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Call the Karz algorithm"
"### Call the Katz algorithm"
]
},
{
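A minimal, self-contained sketch of the Katz call the corrected heading refers to; the tiny edge list is a stand-in, not the notebook's dataset:

    import cudf
    import cugraph

    # Tiny stand-in edge list so the sketch runs on its own
    edges = cudf.DataFrame({'src': [0, 0, 1, 2], 'dst': [1, 2, 2, 3]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Katz centrality with default parameters; the result is a cuDF DataFrame
    # with 'vertex' and 'katz_centrality' columns
    katz_df = cugraph.katz_centrality(G)
    print(katz_df.sort_values('katz_centrality', ascending=False).head())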
2 changes: 1 addition & 1 deletion notebooks/community/Louvain.ipynb
@@ -351,7 +351,7 @@
}
],
"source": [
"# How many Lieden partitions where found\n",
"# How many Leiden partitions were found\n",
"part_ids_l = df_l[\"partition\"].unique()\n",
"print(\"Leiden found \" + str(len(part_ids_l)) + \" partitions\")"
]
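The corrected comment counts Leiden partitions; a hedged sketch of the Louvain and Leiden calls and the partition count, on a stand-in edge list (the names below are placeholders, not the notebook's df_l):

    import cudf
    import cugraph

    # Two loosely connected clumps as a stand-in graph
    edges = cudf.DataFrame({'src': [0, 0, 1, 3, 3, 4, 2],
                            'dst': [1, 2, 2, 4, 5, 5, 3]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Both calls return a (DataFrame, modularity) pair; the DataFrame maps
    # each 'vertex' to a 'partition' id
    louvain_df, louvain_mod = cugraph.louvain(G)
    leiden_df, leiden_mod = cugraph.leiden(G)

    print("Louvain found", louvain_df['partition'].nunique(), "partitions")
    print("Leiden found", leiden_df['partition'].nunique(), "partitions")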
6 changes: 3 additions & 3 deletions notebooks/community/Spectral-Clustering.ipynb
@@ -187,7 +187,7 @@
"metadata": {},
"outputs": [],
"source": [
"# The algorithm requires that there are edge weights. In this case all the weights are being ste to 1\n",
"# The algorithm requires that there are edge weights. In this case all the weights are being set to 1\n",
"gdf[\"data\"] = cudf.Series(np.ones(len(gdf), dtype=np.float32))"
]
},
@@ -197,7 +197,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -234,7 +234,7 @@
"metadata": {},
"source": [
"----\n",
"#### Define and print function, but adjust vertex ID so that they match the illustration"
"#### Define and print function, but adjust vertex IDs so that they match the illustration"
]
},
{
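The comment fixes above sit in the cell that attaches constant edge weights before clustering; a hedged sketch of that step plus the balanced-cut call, on a stand-in edge list:

    import numpy as np
    import cudf
    import cugraph

    # Stand-in edge list; spectral clustering needs edge weights, so a
    # constant weight of 1 is attached, as the cell above does
    gdf = cudf.DataFrame({'src': [0, 0, 1, 3, 3, 4, 2],
                          'dst': [1, 2, 2, 4, 5, 5, 3]})
    gdf['data'] = cudf.Series(np.ones(len(gdf), dtype=np.float32))

    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='data')

    # Balanced-cut spectral clustering into two clusters; the result maps
    # each 'vertex' to a 'cluster' id
    clusters = cugraph.spectralBalancedCutClustering(G, 2)
    print(clusters.sort_values('cluster').head(10))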
2 changes: 1 addition & 1 deletion notebooks/community/Triangle-Counting.ipynb
@@ -184,7 +184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
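The typo fix above leads into the NetworkX-versus-cuGraph comparison; a hedged sketch of the two counts side by side. The counting conventions are an assumption here: NetworkX reports per-node counts (each triangle seen at three nodes), and the older cugraph.triangles call returns a single total whose convention may differ by release, so the sketch simply prints both:

    import cudf
    import cugraph
    import networkx as nx

    # Stand-in edge list used for both libraries (contains two triangles)
    src = [0, 0, 1, 1, 2, 3]
    dst = [1, 2, 2, 3, 3, 4]

    # NetworkX: per-node counts; summing counts each triangle three times
    Gnx = nx.Graph()
    Gnx.add_edges_from(zip(src, dst))
    nx_sum = sum(nx.triangles(Gnx).values())

    # cuGraph: single total from the triangle-counting call of this era
    gdf = cudf.DataFrame({'src': src, 'dst': dst})
    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst')
    cg_total = cugraph.triangles(G)

    print("NetworkX sum:", nx_sum, " cuGraph total:", cg_total)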
2 changes: 1 addition & 1 deletion notebooks/components/ConnectedComponents.ipynb
@@ -144,7 +144,7 @@
"# Test file\n",
"datafile='../data/netscience.csv'\n",
"\n",
"# the datafile contains three columns,but we only want to use the first two. \n",
"# the datafile contains three columns, but we only want to use the first two. \n",
"# We will use the \"usecols' feature of read_csv to ignore that column\n",
"\n",
"gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wgt'], dtype=['int32', 'int32', 'float32'], usecols=['src', 'dst'])\n",
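The usecols comment fixed above belongs to the data-loading cell; a hedged sketch of that read followed by weakly connected components. The output column name ('labels') is recalled from this API generation rather than taken from the diff:

    import cudf
    import cugraph

    # Same read as the cell above: three columns in the file, only two used
    gdf = cudf.read_csv('../data/netscience.csv', delimiter=' ',
                        names=['src', 'dst', 'wgt'],
                        dtype=['int32', 'int32', 'float32'],
                        usecols=['src', 'dst'])

    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst')

    # Each vertex gets a component label; the number of distinct labels
    # is the number of weakly connected components
    wcc = cugraph.weakly_connected_components(G)
    print("components:", wcc['labels'].nunique())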
2 changes: 1 addition & 1 deletion notebooks/cores/kcore.ipynb
@@ -220,7 +220,7 @@
"metadata": {},
"source": [
"### Just for fun\n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices has at least degree two. \n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices have at least degree two. \n",
"If we specify k = 2 then only one vertex should be dropped "
]
},
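The wording fix above describes the k = 2 experiment; a self-contained sketch on a stand-in graph where exactly one vertex has degree one, so k = 2 peels exactly one vertex away:

    import cudf
    import cugraph

    # Vertex 4 has degree 1; every other vertex has degree >= 2
    edges = cudf.DataFrame({'src': [0, 0, 1, 1, 2, 3],
                            'dst': [1, 2, 2, 3, 3, 4]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # k-core with k=2 keeps the maximal subgraph in which every vertex
    # has degree >= 2 within that subgraph
    kcg = cugraph.k_core(G, k=2)
    print("vertices before:", G.number_of_vertices(),
          " after:", kcg.number_of_vertices())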
2 changes: 1 addition & 1 deletion notebooks/demo/batch_betweenness.ipynb
@@ -16,7 +16,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweennes Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"Betweenness Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"In this notebook we will showcase how it would have been done with a Single GPU approach, then we will show how it can be done using multiple GPUs."
]
},
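The spelling fix above is in the single-GPU versus multi-GPU framing. The multi-GPU batch path in the notebook goes through Dask and is not reproduced here; as the single-GPU baseline it contrasts against, a hedged sketch of trading accuracy for speed by sampling k source vertices:

    import cudf
    import cugraph

    # Stand-in graph; on real data this would be the large edge list
    edges = cudf.DataFrame({'src': [0, 0, 1, 1, 2, 3],
                            'dst': [1, 2, 2, 3, 3, 4]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Exact betweenness: every vertex is a source (slow on large graphs)
    exact = cugraph.betweenness_centrality(G)

    # Approximate betweenness: only k randomly chosen source vertices
    approx = cugraph.betweenness_centrality(G, k=3, seed=42)

    print(exact.head())
    print(approx.head())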
2 changes: 1 addition & 1 deletion notebooks/link_analysis/HITS.ipynb
@@ -185,7 +185,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
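The same typo fix appears here just before the cuGraph comparison; a hedged sketch of the HITS call on a small directed stand-in graph. The directed=True constructor form is assumed from the current API, and the result carries per-vertex 'hubs' and 'authorities' columns:

    import cudf
    import cugraph

    # Small directed stand-in edge list; HITS is defined on directed graphs
    edges = cudf.DataFrame({'src': [0, 0, 1, 2, 3],
                            'dst': [1, 2, 3, 3, 0]})
    G = cugraph.Graph(directed=True)
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Hub and authority scores per vertex
    hits_df = cugraph.hits(G)
    print(hits_df.sort_values('authorities', ascending=False).head())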
10 changes: 5 additions & 5 deletions notebooks/link_analysis/Pagerank.ipynb
@@ -160,7 +160,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Read the data, this also created a NetworkX Graph \n",
"# Read the data, this also creates a NetworkX Graph \n",
"file = open(datafile, 'rb')\n",
"Gnx = nx.read_edgelist(file)"
]
@@ -232,7 +232,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
@@ -335,7 +335,7 @@
],
"source": [
"# Find the most important vertex using the scores\n",
"# This methods should only be used for small graph\n",
"# These methods should only be used for small graph\n",
"bestScore = gdf_page['pagerank'][0]\n",
"bestVert = gdf_page['vertex'][0]\n",
"\n",
@@ -351,7 +351,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The top PageRank vertex and socre match what was found by NetworkX"
"The top PageRank vertex and score match what was found by NetworkX"
]
},
{
Expand All @@ -360,7 +360,7 @@
"metadata": {},
"outputs": [],
"source": [
"# A better way to do that would be to find the max and then use that values in a query\n",
"# A better way to do that would be to find the max and then use the values in a query\n",
"pr_max = gdf_page['pagerank'].max()"
]
},
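The last comment fix above points at the max-then-query pattern; a self-contained sketch of what that looks like end to end, with a stand-in graph in place of the notebook's gdf_page:

    import cudf
    import cugraph

    # Rebuild a small PageRank result so the sketch stands alone
    edges = cudf.DataFrame({'src': [0, 0, 1, 2, 3],
                            'dst': [1, 2, 3, 3, 0]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')
    gdf_page = cugraph.pagerank(G)

    # Find the max once, then let a query pull the matching row(s);
    # no Python-side loop over the whole frame is needed
    pr_max = gdf_page['pagerank'].max()
    best = gdf_page.query('pagerank == @pr_max')
    print(best)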
12 changes: 6 additions & 6 deletions notebooks/link_prediction/Jaccard-Similarity.ipynb
@@ -253,7 +253,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -311,7 +311,7 @@
"outputs": [],
"source": [
"#%%time\n",
"# Call cugraph.nvJaccard \n",
"# Call cugraph.nvJaccard\n",
"jdf = cugraph.jaccard(G)"
]
},
@@ -424,7 +424,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Call Pagerank on the graph to get weights to use:\n",
"# Call PageRank on the graph to get weights to use:\n",
"pr_df = cugraph.pagerank(G)"
]
},
@@ -434,7 +434,7 @@
"metadata": {},
"outputs": [],
"source": [
"# take a peek at the page rank values\n",
"# take a peek at the PageRank values\n",
"pr_df.head()"
]
},
@@ -451,8 +451,8 @@
"metadata": {},
"outputs": [],
"source": [
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)",
"# Call weighted Jaccard using the Pagerank scores as weights:\n",
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)\n",
"# Call weighted Jaccard using the PageRank scores as weights:\n",
"wdf = cugraph.jaccard_w(G, pr_df)"
]
},
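The cells above rename the PageRank column to 'weight' before calling jaccard_w; a hedged sketch of that weighted-Jaccard flow from start to finish, on a stand-in graph:

    import cudf
    import cugraph

    # Stand-in graph
    edges = cudf.DataFrame({'src': [0, 0, 1, 1, 2],
                            'dst': [1, 2, 2, 3, 3]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Unweighted Jaccard for comparison
    jdf = cugraph.jaccard(G)

    # PageRank scores as per-vertex weights; jaccard_w expects a frame
    # with 'vertex' and 'weight' columns, hence the rename
    pr_df = cugraph.pagerank(G).rename(columns={'pagerank': 'weight'})
    wdf = cugraph.jaccard_w(G, pr_df)

    print(jdf.head())
    print(wdf.head())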
4 changes: 2 additions & 2 deletions notebooks/link_prediction/Overlap-Similarity.ipynb
@@ -271,7 +271,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -467,7 +467,7 @@
"outputs": [],
"source": [
"# print all similarities over a threshold, in this case 0.5\n",
"#also, drop duplicates\n",
"# also, drop duplicates\n",
"odf_s2 = ol2.query('source < destination').sort_values(by='overlap_coeff', ascending=False)\n",
"\n",
"print_overlap_threshold(odf_s2, 0.74)"
7 changes: 0 additions & 7 deletions notebooks/sampling/RandomWalk.ipynb
@@ -162,13 +162,6 @@
"\n",
"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
46 changes: 34 additions & 12 deletions notebooks/structure/Renumber-2.ipynb
@@ -14,15 +14,19 @@
"\n",
"\n",
"Notebook Credits\n",
"* Original Authors: Bradley Rees\n",
"* Created: 08/13/2019\n",
"* Updated: 07/08/2020\n",
"\n",
"RAPIDS Versions: 0.13 \n",
"| Author | Date | Update |\n",
"| --------------|------------|---------------------|\n",
"| Brad Rees | 08/13/2019 | created |\n",
"| Brad Rees | 07/08/2020 | updated |\n",
"| Ralph Liu | 06/01/2022 | docs & code change |\n",
"\n",
"RAPIDS Versions: 0.13 \n",
"cuGraph Version: 22.06 \n",
"\n",
"Test Hardware\n",
"\n",
"* GV100 32G, CUDA 10.2\n",
"* GV100 32G, CUDA 11.5\n",
"\n",
"\n",
"## Introduction\n",
@@ -68,8 +72,9 @@
"# Import needed libraries\n",
"import cugraph\n",
"import cudf\n",
"import cupy as cp\n",
"\n",
"from cugraph.structure import NumberMap\n"
"from cugraph.structure import NumberMap"
]
},
{
@@ -133,17 +138,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The data has 2.5 million edges that span a range of 3,758,096,389 \n",
"The data has 2.5 million edges that span a range of 3,758,096,389.\n",
"Even if every vertex ID was unique per edge, that would only be 5 million values versus the 3.7 billion that is currently there. \n",
"In the curret state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
"In the current state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time to Renumber\n",
"One good best practice is to have the returned edge pairs appended to the original dataframe. That will help merge results back into the datasets"
"One good best practice is to have the returned edge pairs appended to the original Dataframe. That will help merge results back into the datasets"
]
},
{
@@ -198,7 +203,21 @@
"metadata": {},
"source": [
"Just saved 3.7 billion unneeded spaces in the matrix!<br>\n",
"And we can now see that there are only 50 unique IP addresses in the dataset"
"And we can now see that there are only 52 unique IP addresses in the dataset.<br>\n",
"Let's confirm the number of unique values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge the renumbered columns\n",
"src, dst = gdf['src_r'].to_cupy(), gdf['dst_r'].to_cupy()\n",
"merged = cp.concatenate((src, dst))\n",
"\n",
"print(\"Unique IPs: \" + str(len(cp.unique(merged))))"
]
},
{
@@ -216,8 +235,11 @@
}
],
"metadata": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.6.9 64-bit",
"language": "python",
"name": "python3"
},
@@ -231,7 +253,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
"version": "3.6.9"
}
},
"nbformat": 4,
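The new cell above merges the renumbered columns with CuPy and counts unique values; the same check can be taken directly in cuDF. The column names src_r/dst_r follow the diff above, and the tiny frame here is only a stand-in so the sketch runs on its own — on the notebook's data the printed count should match the 52 unique IPs the text now states:

    import cudf

    # Stand-in for the renumbered edge columns produced earlier in the notebook
    gdf = cudf.DataFrame({'src_r': [0, 1, 2, 2],
                          'dst_r': [1, 2, 3, 0]})

    # Concatenate both renumbered columns and count distinct values
    merged = cudf.concat([gdf['src_r'], gdf['dst_r']])
    print("Unique IPs:", merged.nunique())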