DOC Fix for Renumber-2.ipynb (rapidsai#2335)
Fixed issue rapidsai#2178. 

Updated to the correct number of unique IPs. Added code to confirm cuGraph results using CuPy.

Authors:
  - Ralph Liu (https://github.com/oorliu)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)

URL: rapidsai#2335
oorliu authored Jun 6, 2022
1 parent 305662f commit 0dbb785
Showing 19 changed files with 80 additions and 65 deletions.
6 changes: 3 additions & 3 deletions notebooks/centrality/Betweenness.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Betweenness Centrality\n",
"\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test datase using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test database using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"\n",
"Notebook Credits\n",
"* Original Authors: Bradley Rees\n",
@@ -25,7 +25,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge . High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge. High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"\n",
"See [Betweenness on Wikipedia](https://en.wikipedia.org/wiki/Betweenness_centrality) for more details on the algorithm.\n",
"\n"
@@ -244,7 +244,7 @@
"metadata": {},
"outputs": [],
"source": [
"print_top_scores(vertex_bc, \"top vertice centrality scores\")\n",
"print_top_scores(vertex_bc, \"top vertex centrality scores\")\n",
"print_top_scores(edge_bc, \"top edge centrality scores\")"
]
},
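For reference alongside the edits above, a minimal sketch of the vertex and edge betweenness calls this notebook makes. The data path, delimiter, and column names are assumptions for the sketch, not taken from the diff:

    import cudf
    import cugraph

    # Read a tab-separated edge list (path and format are assumptions here)
    gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                        names=['src', 'dst'], dtype=['int32', 'int32'])

    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst')

    # Vertex betweenness: one row per vertex with a 'betweenness_centrality' score
    vertex_bc = cugraph.betweenness_centrality(G)

    # Edge betweenness: one row per edge with a 'betweenness_centrality' score
    edge_bc = cugraph.edge_betweenness_centrality(G)

    # Top scores, highest first
    print(vertex_bc.sort_values('betweenness_centrality', ascending=False).head())
    print(edge_bc.sort_values('betweenness_centrality', ascending=False).head())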
6 changes: 3 additions & 3 deletions notebooks/centrality/Centrality.ipynb
@@ -53,7 +53,7 @@
" <img src=\"https://latex.codecogs.com/png.latex?c_B(v)&space;=\\sum_{s,t&space;\\in&space;V}&space;\\frac{\\sigma(s,&space;t|v)}{\\sigma(s,&space;t)}\" title=\"c_B(v) =\\sum_{s,t \\in V} \\frac{\\sigma(s, t|v)}{\\sigma(s, t)}\" />\n",
"</center>\n",
"\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively smalled (under 5,000 nodes) so betweenness on every node will be computed.\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively small (under 5,000 nodes) so betweenness on every node will be computed.\n",
"\n",
"___Eigenvector Centrality - coming soon___ <br>\n",
"Eigenvectors can be thought of as the balancing points of a graph, or center of gravity of a 3D object. High centrality means that more of the graph is balanced around that node.\n",
@@ -128,7 +128,7 @@
"outputs": [],
"source": [
"# Compute Centrality\n",
"# the centrality calls are very straight forward with the graph being the primary argument\n",
"# the centrality calls are very straightforward with the graph being the primary argument\n",
"# we are using the default argument values for all centrality functions\n",
"\n",
"def compute_centrality(_graph) :\n",
@@ -257,7 +257,7 @@
"metadata": {},
"source": [
"### Results\n",
"Typically, analyst look just at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"Typically, analysts just look at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"The karate data has 32 vertices, so let's round a little and look at the top 5 vertices"
]
},
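The compute_centrality helper touched above only appears in fragments in this diff; a hedged sketch of what such a helper could look like, using cuGraph calls from this release (the k parameter is the sampling shortcut the text mentions). Function and variable names are placeholders, not the notebook's exact code:

    import cugraph

    def compute_centrality(_graph, k=None):
        # Betweenness centrality; passing k estimates the scores from k
        # randomly chosen source vertices instead of the full computation
        bc_df = cugraph.betweenness_centrality(_graph, k=k)

        # Katz centrality with default parameters
        katz_df = cugraph.katz_centrality(_graph)

        # PageRank, often read as another centrality-style score
        pr_df = cugraph.pagerank(_graph)

        return bc_df, katz_df, pr_df

    # Usage (assuming a cugraph.Graph named G was built earlier):
    # bc_df, katz_df, pr_df = compute_centrality(G)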
2 changes: 1 addition & 1 deletion notebooks/centrality/Katz.ipynb
@@ -214,7 +214,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Call the Karz algorithm"
"### Call the Katz algorithm"
]
},
{
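A minimal, self-contained sketch of the Katz call the corrected heading refers to; the tiny edge list is a stand-in, not the notebook's dataset:

    import cudf
    import cugraph

    # Tiny stand-in edge list so the sketch runs on its own
    edges = cudf.DataFrame({'src': [0, 0, 1, 2], 'dst': [1, 2, 2, 3]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Katz centrality with default parameters; the result is a cuDF DataFrame
    # with 'vertex' and 'katz_centrality' columns
    katz_df = cugraph.katz_centrality(G)
    print(katz_df.sort_values('katz_centrality', ascending=False).head())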
2 changes: 1 addition & 1 deletion notebooks/community/Louvain.ipynb
@@ -351,7 +351,7 @@
}
],
"source": [
"# How many Lieden partitions where found\n",
"# How many Leiden partitions were found\n",
"part_ids_l = df_l[\"partition\"].unique()\n",
"print(\"Leiden found \" + str(len(part_ids_l)) + \" partitions\")"
]
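The corrected comment counts Leiden partitions; a hedged sketch of the Louvain and Leiden calls and the partition count, on a stand-in edge list (the names below are placeholders, not the notebook's df_l):

    import cudf
    import cugraph

    # Two loosely connected clumps as a stand-in graph
    edges = cudf.DataFrame({'src': [0, 0, 1, 3, 3, 4, 2],
                            'dst': [1, 2, 2, 4, 5, 5, 3]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Both calls return a (DataFrame, modularity) pair; the DataFrame maps
    # each 'vertex' to a 'partition' id
    louvain_df, louvain_mod = cugraph.louvain(G)
    leiden_df, leiden_mod = cugraph.leiden(G)

    print("Louvain found", louvain_df['partition'].nunique(), "partitions")
    print("Leiden found", leiden_df['partition'].nunique(), "partitions")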
6 changes: 3 additions & 3 deletions notebooks/community/Spectral-Clustering.ipynb
@@ -187,7 +187,7 @@
"metadata": {},
"outputs": [],
"source": [
"# The algorithm requires that there are edge weights. In this case all the weights are being ste to 1\n",
"# The algorithm requires that there are edge weights. In this case all the weights are being set to 1\n",
"gdf[\"data\"] = cudf.Series(np.ones(len(gdf), dtype=np.float32))"
]
},
@@ -197,7 +197,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -234,7 +234,7 @@
"metadata": {},
"source": [
"----\n",
"#### Define and print function, but adjust vertex ID so that they match the illustration"
"#### Define and print function, but adjust vertex IDs so that they match the illustration"
]
},
{
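The comment fixes above sit in the cell that attaches constant edge weights before clustering; a hedged sketch of that step plus the balanced-cut call, on a stand-in edge list:

    import numpy as np
    import cudf
    import cugraph

    # Stand-in edge list; spectral clustering needs edge weights, so a
    # constant weight of 1 is attached, as the cell above does
    gdf = cudf.DataFrame({'src': [0, 0, 1, 3, 3, 4, 2],
                          'dst': [1, 2, 2, 4, 5, 5, 3]})
    gdf['data'] = cudf.Series(np.ones(len(gdf), dtype=np.float32))

    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='data')

    # Balanced-cut spectral clustering into two clusters; the result maps
    # each 'vertex' to a 'cluster' id
    clusters = cugraph.spectralBalancedCutClustering(G, 2)
    print(clusters.sort_values('cluster').head(10))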
2 changes: 1 addition & 1 deletion notebooks/community/Triangle-Counting.ipynb
@@ -184,7 +184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
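The typo fix above leads into the NetworkX-versus-cuGraph comparison; a hedged sketch of the two counts side by side. The counting conventions are an assumption here: NetworkX reports per-node counts (each triangle seen at three nodes), and the older cugraph.triangles call returns a single total whose convention may differ by release, so the sketch simply prints both:

    import cudf
    import cugraph
    import networkx as nx

    # Stand-in edge list used for both libraries (contains two triangles)
    src = [0, 0, 1, 1, 2, 3]
    dst = [1, 2, 2, 3, 3, 4]

    # NetworkX: per-node counts; summing counts each triangle three times
    Gnx = nx.Graph()
    Gnx.add_edges_from(zip(src, dst))
    nx_sum = sum(nx.triangles(Gnx).values())

    # cuGraph: single total from the triangle-counting call of this era
    gdf = cudf.DataFrame({'src': src, 'dst': dst})
    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst')
    cg_total = cugraph.triangles(G)

    print("NetworkX sum:", nx_sum, " cuGraph total:", cg_total)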
2 changes: 1 addition & 1 deletion notebooks/components/ConnectedComponents.ipynb
@@ -144,7 +144,7 @@
"# Test file\n",
"datafile='../data/netscience.csv'\n",
"\n",
"# the datafile contains three columns,but we only want to use the first two. \n",
"# the datafile contains three columns, but we only want to use the first two. \n",
"# We will use the \"usecols' feature of read_csv to ignore that column\n",
"\n",
"gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wgt'], dtype=['int32', 'int32', 'float32'], usecols=['src', 'dst'])\n",
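The usecols comment fixed above belongs to the data-loading cell; a hedged sketch of that read followed by weakly connected components. The output column name ('labels') is recalled from this API generation rather than taken from the diff:

    import cudf
    import cugraph

    # Same read as the cell above: three columns in the file, only two used
    gdf = cudf.read_csv('../data/netscience.csv', delimiter=' ',
                        names=['src', 'dst', 'wgt'],
                        dtype=['int32', 'int32', 'float32'],
                        usecols=['src', 'dst'])

    G = cugraph.Graph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst')

    # Each vertex gets a component label; the number of distinct labels
    # is the number of weakly connected components
    wcc = cugraph.weakly_connected_components(G)
    print("components:", wcc['labels'].nunique())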
2 changes: 1 addition & 1 deletion notebooks/cores/kcore.ipynb
@@ -220,7 +220,7 @@
"metadata": {},
"source": [
"### Just for fun\n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices has at least degree two. \n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices have at least degree two. \n",
"If we specify k = 2 then only one vertex should be dropped "
]
},
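The wording fix above describes the k = 2 experiment; a self-contained sketch on a stand-in graph where exactly one vertex has degree one, so k = 2 peels exactly one vertex away:

    import cudf
    import cugraph

    # Vertex 4 has degree 1; every other vertex has degree >= 2
    edges = cudf.DataFrame({'src': [0, 0, 1, 1, 2, 3],
                            'dst': [1, 2, 2, 3, 3, 4]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # k-core with k=2 keeps the maximal subgraph in which every vertex
    # has degree >= 2 within that subgraph
    kcg = cugraph.k_core(G, k=2)
    print("vertices before:", G.number_of_vertices(),
          " after:", kcg.number_of_vertices())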
2 changes: 1 addition & 1 deletion notebooks/demo/batch_betweenness.ipynb
@@ -16,7 +16,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweennes Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"Betweenness Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"In this notebook we will showcase how it would have been done with a Single GPU approach, then we will show how it can be done using multiple GPUs."
]
},
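The spelling fix above is in the single-GPU versus multi-GPU framing. The multi-GPU batch path in the notebook goes through Dask and is not reproduced here; as the single-GPU baseline it contrasts against, a hedged sketch of trading accuracy for speed by sampling k source vertices:

    import cudf
    import cugraph

    # Stand-in graph; on real data this would be the large edge list
    edges = cudf.DataFrame({'src': [0, 0, 1, 1, 2, 3],
                            'dst': [1, 2, 2, 3, 3, 4]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Exact betweenness: every vertex is a source (slow on large graphs)
    exact = cugraph.betweenness_centrality(G)

    # Approximate betweenness: only k randomly chosen source vertices
    approx = cugraph.betweenness_centrality(G, k=3, seed=42)

    print(exact.head())
    print(approx.head())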
2 changes: 1 addition & 1 deletion notebooks/link_analysis/HITS.ipynb
@@ -185,7 +185,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
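The same typo fix appears here just before the cuGraph comparison; a hedged sketch of the HITS call on a small directed stand-in graph. The directed=True constructor form is assumed from the current API, and the result carries per-vertex 'hubs' and 'authorities' columns:

    import cudf
    import cugraph

    # Small directed stand-in edge list; HITS is defined on directed graphs
    edges = cudf.DataFrame({'src': [0, 0, 1, 2, 3],
                            'dst': [1, 2, 3, 3, 0]})
    G = cugraph.Graph(directed=True)
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Hub and authority scores per vertex
    hits_df = cugraph.hits(G)
    print(hits_df.sort_values('authorities', ascending=False).head())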
10 changes: 5 additions & 5 deletions notebooks/link_analysis/Pagerank.ipynb
@@ -160,7 +160,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Read the data, this also created a NetworkX Graph \n",
"# Read the data, this also creates a NetworkX Graph \n",
"file = open(datafile, 'rb')\n",
"Gnx = nx.read_edgelist(file)"
]
@@ -232,7 +232,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
@@ -335,7 +335,7 @@
],
"source": [
"# Find the most important vertex using the scores\n",
"# This methods should only be used for small graph\n",
"# These methods should only be used for small graph\n",
"bestScore = gdf_page['pagerank'][0]\n",
"bestVert = gdf_page['vertex'][0]\n",
"\n",
@@ -351,7 +351,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The top PageRank vertex and socre match what was found by NetworkX"
"The top PageRank vertex and score match what was found by NetworkX"
]
},
{
Expand All @@ -360,7 +360,7 @@
"metadata": {},
"outputs": [],
"source": [
"# A better way to do that would be to find the max and then use that values in a query\n",
"# A better way to do that would be to find the max and then use the values in a query\n",
"pr_max = gdf_page['pagerank'].max()"
]
},
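The last comment fix above points at the max-then-query pattern; a self-contained sketch of what that looks like end to end, with a stand-in graph in place of the notebook's gdf_page:

    import cudf
    import cugraph

    # Rebuild a small PageRank result so the sketch stands alone
    edges = cudf.DataFrame({'src': [0, 0, 1, 2, 3],
                            'dst': [1, 2, 3, 3, 0]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')
    gdf_page = cugraph.pagerank(G)

    # Find the max once, then let a query pull the matching row(s);
    # no Python-side loop over the whole frame is needed
    pr_max = gdf_page['pagerank'].max()
    best = gdf_page.query('pagerank == @pr_max')
    print(best)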
12 changes: 6 additions & 6 deletions notebooks/link_prediction/Jaccard-Similarity.ipynb
@@ -253,7 +253,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -311,7 +311,7 @@
"outputs": [],
"source": [
"#%%time\n",
"# Call cugraph.nvJaccard \n",
"# Call cugraph.nvJaccard\n",
"jdf = cugraph.jaccard(G)"
]
},
@@ -424,7 +424,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Call Pagerank on the graph to get weights to use:\n",
"# Call PageRank on the graph to get weights to use:\n",
"pr_df = cugraph.pagerank(G)"
]
},
@@ -434,7 +434,7 @@
"metadata": {},
"outputs": [],
"source": [
"# take a peek at the page rank values\n",
"# take a peek at the PageRank values\n",
"pr_df.head()"
]
},
@@ -451,8 +451,8 @@
"metadata": {},
"outputs": [],
"source": [
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)",
"# Call weighted Jaccard using the Pagerank scores as weights:\n",
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)\n",
"# Call weighted Jaccard using the PageRank scores as weights:\n",
"wdf = cugraph.jaccard_w(G, pr_df)"
]
},
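The cells above rename the PageRank column to 'weight' before calling jaccard_w; a hedged sketch of that weighted-Jaccard flow from start to finish, on a stand-in graph:

    import cudf
    import cugraph

    # Stand-in graph
    edges = cudf.DataFrame({'src': [0, 0, 1, 1, 2],
                            'dst': [1, 2, 2, 3, 3]})
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source='src', destination='dst')

    # Unweighted Jaccard for comparison
    jdf = cugraph.jaccard(G)

    # PageRank scores as per-vertex weights; jaccard_w expects a frame
    # with 'vertex' and 'weight' columns, hence the rename
    pr_df = cugraph.pagerank(G).rename(columns={'pagerank': 'weight'})
    wdf = cugraph.jaccard_w(G, pr_df)

    print(jdf.head())
    print(wdf.head())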
4 changes: 2 additions & 2 deletions notebooks/link_prediction/Overlap-Similarity.ipynb
@@ -271,7 +271,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -467,7 +467,7 @@
"outputs": [],
"source": [
"# print all similarities over a threshold, in this case 0.5\n",
"#also, drop duplicates\n",
"# also, drop duplicates\n",
"odf_s2 = ol2.query('source < destination').sort_values(by='overlap_coeff', ascending=False)\n",
"\n",
"print_overlap_threshold(odf_s2, 0.74)"
7 changes: 0 additions & 7 deletions notebooks/sampling/RandomWalk.ipynb
@@ -162,13 +162,6 @@
"\n",
"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
46 changes: 34 additions & 12 deletions notebooks/structure/Renumber-2.ipynb
@@ -14,15 +14,19 @@
"\n",
"\n",
"Notebook Credits\n",
"* Original Authors: Bradley Rees\n",
"* Created: 08/13/2019\n",
"* Updated: 07/08/2020\n",
"\n",
"RAPIDS Versions: 0.13 \n",
"| Author | Date | Update |\n",
"| --------------|------------|---------------------|\n",
"| Brad Rees | 08/13/2019 | created |\n",
"| Brad Rees | 07/08/2020 | updated |\n",
"| Ralph Liu | 06/01/2022 | docs & code change |\n",
"\n",
"RAPIDS Versions: 0.13 \n",
"cuGraph Version: 22.06 \n",
"\n",
"Test Hardware\n",
"\n",
"* GV100 32G, CUDA 10.2\n",
"* GV100 32G, CUDA 11.5\n",
"\n",
"\n",
"## Introduction\n",
@@ -68,8 +72,9 @@
"# Import needed libraries\n",
"import cugraph\n",
"import cudf\n",
"import cupy as cp\n",
"\n",
"from cugraph.structure import NumberMap\n"
"from cugraph.structure import NumberMap"
]
},
{
@@ -133,17 +138,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The data has 2.5 million edges that span a range of 3,758,096,389 \n",
"The data has 2.5 million edges that span a range of 3,758,096,389.\n",
"Even if every vertex ID was unique per edge, that would only be 5 million values versus the 3.7 billion that is currently there. \n",
"In the curret state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
"In the current state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time to Renumber\n",
"One good best practice is to have the returned edge pairs appended to the original dataframe. That will help merge results back into the datasets"
"One good best practice is to have the returned edge pairs appended to the original Dataframe. That will help merge results back into the datasets"
]
},
{
@@ -198,7 +203,21 @@
"metadata": {},
"source": [
"Just saved 3.7 billion unneeded spaces in the matrix!<br>\n",
"And we can now see that there are only 50 unique IP addresses in the dataset"
"And we can now see that there are only 52 unique IP addresses in the dataset.<br>\n",
"Let's confirm the number of unique values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge the renumbered columns\n",
"src, dst = gdf['src_r'].to_cupy(), gdf['dst_r'].to_cupy()\n",
"merged = cp.concatenate((src, dst))\n",
"\n",
"print(\"Unique IPs: \" + str(len(cp.unique(merged))))"
]
},
{
@@ -216,8 +235,11 @@
}
],
"metadata": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.6.9 64-bit",
"language": "python",
"name": "python3"
},
@@ -231,7 +253,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
"version": "3.6.9"
}
},
"nbformat": 4,
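The new cell above merges the renumbered columns with CuPy and counts unique values; the same check can be taken directly in cuDF. The column names src_r/dst_r follow the diff above, and the tiny frame here is only a stand-in so the sketch runs on its own — on the notebook's data the printed count should match the 52 unique IPs the text now states:

    import cudf

    # Stand-in for the renumbered edge columns produced earlier in the notebook
    gdf = cudf.DataFrame({'src_r': [0, 1, 2, 2],
                          'dst_r': [1, 2, 3, 0]})

    # Concatenate both renumbered columns and count distinct values
    merged = cudf.concat([gdf['src_r'], gdf['dst_r']])
    print("Unique IPs:", merged.nunique())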