diff --git a/ChangeLog.md b/ChangeLog.md index a3f9746b..3d183873 100644 --- a/ChangeLog.md +++ b/ChangeLog.md @@ -4,6 +4,7 @@ Starting with v1.31.6, this file will contain a record of major features and upd ## Upcoming - Support variable injection in `%%graph_notebook_config` magic ([Link to PR](https://github.com/aws/graph-notebook/pull/287)) +- Added three notebooks to show data science workflows with Amazon Neptune - Added JupyterLab startup script to auto-load magics extensions ([Link to PR](https://github.com/aws/graph-notebook/pull/277)) - Added includeWaiting option to %oc_status, fix same for %gremlin_status ([Link to PR](https://github.com/aws/graph-notebook/pull/272)) - Added `--store-to` option to %status ([Link to PR](https://github.com/aws/graph-notebook/pull/278)) diff --git a/src/graph_notebook/notebooks/05-Data-Science/00-Identifying-Fraud-Rings-Using-Social-Network-Analytics.ipynb b/src/graph_notebook/notebooks/05-Data-Science/00-Identifying-Fraud-Rings-Using-Social-Network-Analytics.ipynb new file mode 100644 index 00000000..907ba55a --- /dev/null +++ b/src/graph_notebook/notebooks/05-Data-Science/00-Identifying-Fraud-Rings-Using-Social-Network-Analytics.ipynb @@ -0,0 +1,608 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d5d54120", + "metadata": {}, + "source": [ + "# Identifying Fraud Rings Using Social Network Analytics\n", + "\n", + "Within the financial industry, an organization can expect to lose 3-6%, and up to 10%, of its [business to fraudulent activities](http://www.crowe.ie/wp-content/uploads/2019/08/The-Financial-Cost-of-Fraud-2019.pdf). Fraudulent activities not only impact financial aspects, but victims often have negative views of the company, leading to negative market sentiment. Overall, these fraudulent activities have a significant impact on a business, in terms of both consumer confidence and bottom-line revenue. 
Due to the impact of these illicit activities on the bottom line, companies expend significant time and money to detect and prevent fraud. \n", + "\n", + "When dealing with fraud, there are two main components to a robust fraud system: fraud detection and fraud prevention. In the fraud detection component of a system, the main goal is to develop a system and methodology that allow for the rapid discovery of fraudulent activities. This usually involves a retrospective evaluation of data, such as transactions, users, credit cards, etc., to determine what patterns or combinations represent actual fraud. This process usually involves a human-in-the-loop system where automated processes flag likely or potential fraudulent activities, which are then evaluated by an expert in the domain to determine the legitimacy, or illegitimacy, of the activities flagged. The output of this process is a set of known and evolving patterns of fraud that are fed into a fraud prevention system. Generally, this consists of a real-time system that compares a transaction, or a set of transactions, against the known fraudulent patterns identified by the fraud detection system. The objective of this fraud prevention system is to reduce fraudulent activities and prevent them from occurring in the first place. \n", + "\n", + "\n", + "## Challenges of Detecting Fraud\n", + "\n", + "When dealing with fraud, it is often helpful to understand some challenges of finding fraudulent activities when looking into data. Often this is aided by first understanding the definition and nature of fraud. 
While there are many definitions of fraud, my favorite is:\n", + "\n", + "[Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms.](https://www.amazon.com/Analytics-Descriptive-Predictive-Network-Techniques/dp/1119133122)\n", + "\n", + "\n", + "This definition highlights the complex nature of the problems we must address when working on fraud systems. First, fraud is *uncommon*. Within any system of recorded transactions, only a small fraction of these transactions consists of fraudulent or illicit activities. The sparsity of these illicit activities makes them harder to identify. Second, fraud is *well-considered* and *imperceptibly concealed*, meaning that fraudulent activities are rarely impulsive. Most fraudulent activities, at least at scale, involve multiple parties colluding together to perform actions specifically designed to exploit weaknesses in the system and elude detection. Finally, fraud is *time-evolving*. Fraudsters are continuously evolving and adapting their techniques as detection and prevention improve in an endless game of hide and seek.\n", + "\n", + "With these challenges in mind, many fraud detection systems take a multi-faceted approach to identifying illicit activities. In this notebook, we will focus on the use of social networks implicitly found in fraud data to identify fraud rings committing illicit transactions using a guilt-by-association approach.\n", + "\n", + "Fraud rings are a common issue within the industry and consist of multiple individuals or entities colluding to defraud a system. These rings may consist of family members, acquaintances, or even buyer/seller pairs that span both sides of a transaction. 
These rings exist across a wide array of industries, but an important aspect of these fraud rings, at least as it is related to graph analysis, is that they are strongly linked groups of entities. " + ] + }, + { + "cell_type": "markdown", + "id": "d1bdbb3e", + "metadata": {}, + "source": [ + "## Creating a fraud graph\n", + "\n", + "In this section we'll load a fraud graph and set some visualization options. We'll then use some openCypher queries to inspect the data model used throughout the solution.\n", + "\n", + "### Load data\n", + "The cell below loads the example fraud graph into your Neptune cluster. When you run the cell below, a graph for an example Fraud dataset will load, which will take about 5 minutes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0589d25", + "metadata": {}, + "outputs": [], + "source": [ + "%seed --model Property_Graph --dataset fraud_graph --run" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "50febfd8", + "metadata": {}, + "source": [ + "Now we will install the library we will be using later on in this notebook. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e3987422", + "metadata": {}, + "outputs": [], + "source": [ + "pip install igraph -q" + ] + }, + { + "cell_type": "markdown", + "id": "43fdc65f", + "metadata": {}, + "source": [ + "### Set visualization and configuration options\n", + "\n", + "The cell below configures the visualization to use specific colors and icons for the different parts of the data model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "854a23c6", + "metadata": {}, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "\n", + "{\n", + " \"groups\": {\n", + " \"Account\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf2bb\",\n", + " \"color\": \"red\"\n", + " }\n", + " },\n", + " \"Transaction\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf155\",\n", + " \"color\": \"green\"\n", + " }\n", + " },\n", + " \"Merchant\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf290\",\n", + " \"color\": \"orange\"\n", + " }\n", + " },\n", + " \"DateOfBirth\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf1fd\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"EmailAddress\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf1fa\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"Address\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf015\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"IpAddress\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf109\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"PhoneNumber\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf095\",\n", + " \"color\": \"blue\"\n", + " }\n", + " }\n", + " },\n", + " \"edges\": {\n", + " \"color\": {\n", + " \"inherit\": false\n", + " },\n", + " \"smooth\": {\n", + " \"enabled\": true,\n", + " \"type\": \"straightCross\"\n", + " },\n", + " \"arrows\": {\n", + " \"to\": {\n", + " \"enabled\": false,\n", + " \"type\": 
\"arrow\"\n", + " }\n", + " },\n", + " \"font\": {\n", + " \"face\": \"courier new\"\n", + " }\n", + " },\n", + " \"interaction\": {\n", + " \"hover\": true,\n", + " \"hoverConnectedEdges\": true,\n", + " \"selectConnectedEdges\": false\n", + " },\n", + " \"physics\": {\n", + " \"minVelocity\": 0.75,\n", + " \"barnesHut\": {\n", + " \"centralGravity\": 0.1,\n", + " \"gravitationalConstant\": -50450,\n", + " \"springLength\": 95,\n", + " \"springConstant\": 0.04,\n", + " \"damping\": 0.09,\n", + " \"avoidOverlap\": 0.1\n", + " },\n", + " \"solver\": \"barnesHut\",\n", + " \"enabled\": true,\n", + " \"adaptiveTimestep\": true,\n", + " \"stabilization\": {\n", + " \"enabled\": true,\n", + " \"iterations\": 1\n", + " }\n", + " }\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "5b4ce1c2", + "metadata": {}, + "source": [ + "### Data model\n", + "The fraud graph included in this example models credit card accounts, account holder information, merchants, and the transactions performed when an account holder purchases goods or services from a merchant.\n", + "\n", + "**Account and features**\n", + "\n", + "An Account has a number of features, including physical Address, IpAddress, DateOfBirth of the account holder, EmailAddress, and contact PhoneNumber. An account holder can have multiple email addresses and phone numbers.\n", + "\n", + "In many graph data models, these features of the account holder would be modelled as properties of the account. With fraud detection, it's important to be able to link accounts based on shared features, and to find related accounts at query time based on one or more shared features. Hence, our fraud detection application graph data model stores each feature as a separate vertex. Multiple accounts that share the same feature value - the same physical address, for example - are connected to the single vertex representing that feature value. \n", + "\n", + "The following query shows a single account and its associated features. 
After running the query, click the Graph tab to see a visualization of the results." + ] + }, + { + "cell_type": "markdown", + "id": "59e92609", + "metadata": {}, + "source": [ + "### What does my fraud graph look like for an account?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9c3a2b3", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n)-[]-()\n", + "WHERE id(n)='account-4398046519460' \n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "c882feb8", + "metadata": {}, + "source": [ + "### What if I only want to look at the account properties, not the transactions?\n", + "\n", + "The transaction information is very important, but it is often overwhelming, especially on very active accounts. Often we want to look only at the account features to see how that specific account is connected to them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad0e5c6c", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n)-[:FEATURE_OF_ACCOUNT]-()\n", + "WHERE id(n)='account-4398046519460' \n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "9c68588e", + "metadata": {}, + "source": [ + "### What else connects to these same features?\n", + "\n", + "After isolating only the account features, we may want to extend our exploration out further to see if we can find other accounts that share these features." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "316b36ca", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n)-[:FEATURE_OF_ACCOUNT*1..2]-()\n", + "WHERE id(n)='account-4398046519460' \n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "e6695649", + "metadata": {}, + "source": [ + "This is starting to show some interesting information. 
As we can see above, there are four other accounts that all share the same birthday and one that shares the same phone number. Having a shared birthday does not seem too suspicious, as there are only so many possible birthdays, but sharing both a birthday and a phone number seems unlikely.\n", + "\n", + "### Find accounts with shared features\n", + "\n", + "Let's take a look at a visualization of all accounts that share a feature, other than a `DateOfBirth`, with another account." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5058a53", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n:Account)<-[:FEATURE_OF_ACCOUNT]-(f)-[:FEATURE_OF_ACCOUNT]->(t:Account)\n", + "WHERE NOT f:DateOfBirth\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "fdab6844", + "metadata": {}, + "source": [ + "The results show a lot of shared features. In fact, if we start to examine this data, it begins to look a lot like a social network. Using social network analytics on top of this type of graph allows us to find many interesting insights that are not apparent in the original transaction graph.\n", + "\n", + "## Finding Fraud Rings in your social graph" + ] + }, + { + "cell_type": "markdown", + "id": "839b0ce5", + "metadata": {}, + "source": [ + "When analyzing graph data, a frequent requirement is to infer information or connections from the underlying graph into a graph that is optimized to run analytics and algorithms. Many graphs in Neptune, such as our transaction graph, contain a rich collection of entities and attributes that, while useful for transactional queries, are not required to perform analytical tasks.\n", + "\n", + "### Inferring a social graph from your fraud graph\n", + "\n", + "In this example, we are going to create a social graph by inferring that any accounts that are connected by a feature, other than `DateOfBirth`, are connected within this social network. 
We will then use this social network to search for rings of fraudulent accounts within a social graph.\n", + "\n", + "Up until now we have been working with Amazon Neptune directly. For this analysis we are going to leverage an integration between Neptune and Pandas DataFrames, supplied by [AWS Data Wrangler](https://github.com/awslabs/aws-data-wrangler), to read data from and write data to Neptune, and the [iGraph](https://igraph.org/) library to perform network analysis and run graph algorithms on top of this data.\n", + "\n", + "**Note:** The AWS Data Wrangler library is suitable for use with small to medium sized graphs in Neptune, up to a few hundred million edges depending on the Neptune instance size.\n", + "\n", + "Running the cell below will retrieve the needed data from Neptune, in this case an [edge list](https://en.wikipedia.org/wiki/Edge_list), and load it into a Pandas DataFrame that we will use for later analysis. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e42b967", + "metadata": {}, + "outputs": [], + "source": [ + "import awswrangler as wr\n", + "import pandas as pd\n", + "import igraph as ig\n", + "import graph_notebook as gn\n", + "from graph_notebook.configuration.generate_config import AuthModeEnum\n", + "\n", + "# Get the configuration information for the notebook\n", + "config = gn.configuration.get_config.get_config()\n", + "iam = config.auth_mode == AuthModeEnum.IAM\n", + "\n", + "# Retrieve data from Neptune\n", + "client = wr.neptune.connect(config.host, config.port, iam_enabled=iam)\n", + "query = \"\"\"MATCH (s:Account)<-[:FEATURE_OF_ACCOUNT]-(f)-[:FEATURE_OF_ACCOUNT]->(t:Account) \n", + "WHERE NOT f:DateOfBirth RETURN id(s) as source, id(t) as target\"\"\"\n", + "df = wr.neptune.execute_opencypher(client, query)\n", + "display(df.head(10))" + ] + }, + { + "cell_type": "markdown", + "id": "cc0f52c2", + "metadata": {}, + "source": [ + "### Detecting strongly linked communities\n", + "\n", + "Fraud 
rings are a common issue within the industry and consist of multiple individuals or entities colluding to defraud a system. These rings may consist of family members, acquaintances, or even buyer/seller pairs that span both sides of a transaction. These rings exist across a wide array of industries, but an important aspect of these fraud rings, at least as it is related to graph analysis, is that they are strongly linked groups of entities. \n", + "\n", + "In network analysis, there is an entire class of graph algorithms, known as community detection algorithms, that evaluates how groups of nodes are connected or partitioned from one another. Use community detection algorithms to determine strongly linked groups of entities. While there are a large number of these algorithms, the most commonly used community detection algorithms are Weakly Connected Components, Louvain, and Label Propagation. \n", + "\n", + "#### Finding strongly linked communities\n", + "\n", + "Now that we have created our social graph, the first step in analyzing this graph for fraud rings is to find the communities of entities that are highly connected to each other, which may represent collusion. Running the cell below will run the Weakly Connected Components algorithm on the data, which will assign each group of vertices connected by a path, regardless of the edge direction, its own component value." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b11d766b", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a graph from the results returned\n", + "g = ig.Graph.TupleList(df.itertuples(index=False), directed=True, weights=False)\n", + "\n", + "# Run WCC on the data\n", + "wcc = g.clusters(mode=\"weak\")\n", + "print(f'There are {len(wcc)} communities in this data set.')\n", + "\n", + "giant = wcc.giant()\n", + "print(f'The largest community has {len(giant.vs)} accounts.')\n", + "\n", + "print('The size histogram for these clusters is:')\n", + "print(wcc.size_histogram())\n" + ] + }, + { + "cell_type": "markdown", + "id": "ef834545", + "metadata": {}, + "source": [ + "Running this algorithm we see that our dataset contains 123 communities with the largest one having 18 accounts. From looking at the size histogram, we can also see that there are two clusters which have an anomalously large number of accounts. Since this is often a sign of collusion between accounts, let us further examine the largest of these communities.\n", + "\n", + "### Identifying the most important members in a community\n", + "\n", + "Once a set of communities is identified, the next step involves identifying the most important entities in one or more of those communities. This is accomplished using a centrality algorithm, which will identify the most important entities, enabling analysts to start their investigation with the “big fish” in the potential ring. There are many different centrality algorithms, each providing a slightly different method for determining the importance of a node. The most commonly used ones are: Degree, PageRank, Betweenness, Closeness, and Eigenvector.\n", + "\n", + "For our analysis below we will run [PageRank](https://en.wikipedia.org/wiki/PageRank), which is an algorithm that was originally developed to rank web pages but is now commonly applied to many other problems. 
PageRank returns a value that represents the relative importance of an account within a community based on its relationships and the importance of the accounts it is connected to. The higher the PageRank value returned, the more important the account is inside the community." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab56de60", + "metadata": {}, + "outputs": [], + "source": [ + "pg = giant.pagerank()\n", + "print('\\n'.join(map(str, pg)))" + ] + }, + { + "cell_type": "markdown", + "id": "34ae6284", + "metadata": {}, + "source": [ + "Great, we have now analyzed our inferred social network to identify the strongly connected communities within that network and prioritized the most important accounts within the anomalous community. Now that we have done this analysis, we need to save this data back into our original graph to enable real-time investigations by trained fraud analysts. \n", + "\n", + "### Storing the risk values back into the graph\n", + "\n", + "To store our data back into our original graph, we can once again use our AWS Data Wrangler integration to save a Pandas DataFrame into Neptune. To accomplish this, we will first construct a DataFrame consisting of the account id, the community/component value, and the associated PageRank value. We then save this DataFrame back to Neptune using the [`to_property_graph()`](https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.neptune.to_property_graph.html#awswrangler.neptune.to_property_graph) method. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcd4dcbb", + "metadata": {}, + "outputs": [], + "source": [ + "rows=[]\n", + "for idx, c in enumerate(wcc):\n", + " for item in c:\n", + " rows.append({'~id': str(g.vs[item]['name']), 'component(single)': idx})\n", + "\n", + "for idx, v in enumerate(giant.vs):\n", + " r = next(s for s in rows if s['~id'] == v['name'])\n", + " r['pg(single)'] = pg[idx]\n", + "\n", + "new_df=pd.DataFrame(rows, columns=['~id','component(single)', 'pg(single)'])\n", + "res = wr.neptune.to_property_graph(client, new_df, use_header_cardinality=True, batch_size=100)\n", + "print(\"Save Complete\")" + ] + }, + { + "cell_type": "markdown", + "id": "301adab7", + "metadata": {}, + "source": [ + "## Analyzing the results\n", + "\n", + "\n", + "A critical and ongoing part of any fraud workflow is to have a mechanism to enable analysts to investigate and prove/disprove that a potentially fraudulent activity exists. " + ] + }, + { + "cell_type": "markdown", + "id": "d9ffa22f", + "metadata": {}, + "source": [ + "### Find the most important accounts to examine first\n", + "\n", + "A common workflow for a skilled fraud analyst is to start by retrieving a list of prioritized accounts to look at based on the “risk” associated with the entity. In the query below, we will find the top 5 most important accounts to examine." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b1ec9f1", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "\n", + "MATCH (a:Account) WHERE a.pg IS NOT NULL RETURN id(a), a.component, a.pg ORDER BY a.pg DESC LIMIT 5" + ] + }, + { + "cell_type": "markdown", + "id": "ad105702", + "metadata": {}, + "source": [ + "### Explore their connections\n", + "The ability to visually explore graphs is a powerful tool that helps fraud analysts understand how certain account are connected. 
Let's take a look at the most important account and see the graph of its surrounding connections. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc0a9eb4", + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "\n", + "MATCH p=(a)-[*1..2]-()\n", + "WHERE id(a)='account-4398046511937'\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "6d60c51f", + "metadata": {}, + "source": [ + "### Mark as Fraud/Not Fraud\n", + "\n", + "Visual inspection, combined with the domain expertise of a fraud analyst, is a critical factor in being able to determine if anomalous patterns in a graph represent actual fraud or legitimate activity. Expert analysts are skilled at examining the patterns of transactions and the structural connections between items to determine the legitimacy of an account/transaction. Once they have made this determination, they will often flag these accounts/transactions as fraudulent in the graph to aid in future investigations.\n", + "\n", + "Let's mark the account above as fraudulent by setting the `isFraud` property to `True`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4b36504", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH (a)\n", + "WHERE id(a)='account-4398046511937'\n", + "SET a.isFraud=True\n", + "RETURN a" + ] + }, + { + "cell_type": "markdown", + "id": "1b11de2f", + "metadata": {}, + "source": [ + "### Find all fraudulent accounts within five hops \n", + "\n", + "Now that we have completed our investigation of `account-4398046511937`, let's take a look at another account from our list above, `account-21990232559534`. In addition to looking at the connections for an account, as shown above, another common use of graphs when analyzing anomalous activity is to look at how closely an account is connected to a known fraudulent account. 
\n", + "\n", + "Let's take a look at this new account and see if it is connected to any accounts marked as fraudulent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74dffa64", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "\n", + "MATCH p=(a)-[:FEATURE_OF_ACCOUNT|ACCOUNT*1..5]-(b)\n", + "WHERE id(a)='account-21990232559534'\n", + "AND b.isFraud=True\n", + "RETURN p LIMIT 1" + ] + }, + { + "cell_type": "markdown", + "id": "b37c65a0", + "metadata": {}, + "source": [ + "That account shares a phone number with a known fraudster, so it looks suspicious. We should pass this account on to our downstream processes for action. \n", + "\n", + "## Conclusion\n", + "\n", + "This notebook has shown how you can use Amazon Neptune and AWS Data Wrangler to run analytics on your data to detect fraud rings. We used a credit card dataset with account- and transaction-centric queries to infer a social network from this data and then used that social network to look for tightly connected groups of individuals. We then identified the most influential people within these groups and stored this information within our graph. Using this information we were able to explore the connections around the most influential people to identify other potentially fraudulent accounts.\n", + "\n", + "Combating fraud is an ongoing challenge for any organization. The faster and more thoroughly a team can identify fraud, the more effective anti-fraud systems become, helping to prevent significant financial losses. Finding and understanding fraud rings is a problem that requires the ability to query, analyze, and explore the connections between accounts, transactions, and account features. Combining the ability to query a graph with the ability to run network analysis and graph algorithms on top of that data enables us to derive novel insights from this data. 
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/graph_notebook/notebooks/05-Data-Science/01-Identifying-1st-Person-Synthetic-Identity-Fraud-Using-Graph-Similarity.ipynb b/src/graph_notebook/notebooks/05-Data-Science/01-Identifying-1st-Person-Synthetic-Identity-Fraud-Using-Graph-Similarity.ipynb new file mode 100644 index 00000000..60ef4e9d --- /dev/null +++ b/src/graph_notebook/notebooks/05-Data-Science/01-Identifying-1st-Person-Synthetic-Identity-Fraud-Using-Graph-Similarity.ipynb @@ -0,0 +1,722 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d5d54120", + "metadata": {}, + "source": [ + "# Identifying First-Person/Synthetic Identity Fraud Using Graph Similarity\n", + "\n", + "Within the financial industry, an organization can expect to lose 3-6%, and up to 10%, of its [business to fraudulent activities](http://www.crowe.ie/wp-content/uploads/2019/08/The-Financial-Cost-of-Fraud-2019.pdf). Fraudulent activities not only impact financial aspects, but victims often have negative views of the company, leading to negative market sentiment. Overall, these fraudulent activities have a significant impact on a business, in terms of both consumer confidence and bottom-line revenue. Due to the impact of these illicit activities on the bottom line, companies expend significant time and money to detect and prevent fraud. \n", + "\n", + "When dealing with fraud, there are two main components to a robust fraud system: fraud detection and fraud prevention. 
In the fraud detection component of a system, the main goal is to develop a system and methodology that allow for the rapid discovery of fraudulent activities. This usually involves a retrospective evaluation of data, such as transactions, users, credit cards, etc., to determine what patterns or combinations represent actual fraud. This process usually involves a human-in-the-loop system where automated processes flag likely or potential fraudulent activities, which are then evaluated by an expert in the domain to determine the legitimacy, or illegitimacy, of the activities flagged. The output of this process is a set of known and evolving patterns of fraud that are fed into a fraud prevention system. Generally, this consists of a real-time system that compares a transaction, or a set of transactions, against the known fraudulent patterns identified by the fraud detection system. The objective of this fraud prevention system is to reduce fraudulent activities and prevent them from occurring in the first place. \n", + "\n", + "\n", + "## Challenges of Detecting Fraud\n", + "\n", + "When dealing with fraud, it is often helpful to understand some challenges of finding fraudulent activities when looking into data. Often this is aided by first understanding the definition and nature of fraud. While there are many definitions of fraud, my favorite is:\n", + "\n", + "[Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms.](https://www.amazon.com/Analytics-Descriptive-Predictive-Network-Techniques/dp/1119133122)\n", + "\n", + "\n", + "This definition highlights the complex nature of the problems we must address when working on fraud systems. First, fraud is *uncommon*. Within any system of recorded transactions, only a small fraction of these transactions consists of fraudulent or illicit activities. 
The sparsity of these illicit activities makes them harder to identify. Second, fraud is *well-considered* and *imperceptibly concealed*, meaning that fraudulent activities are rarely impulsive. Most fraudulent activities, at least at scale, involve multiple parties colluding together to perform actions specifically designed to exploit weaknesses in the system and elude detection. Finally, fraud is *time-evolving*. Fraudsters are continuously evolving and adapting their techniques as detection and prevention improve in an endless game of hide and seek.\n", + "\n", + "With these challenges in mind, many fraud detection systems take a multi-faceted approach to identifying illicit activities. In this notebook, we will focus on using graph similarity to identify first-person and synthetic identity fraud.\n", + "\n", + "First-person fraud occurs when a user supplies false information to a company. Synthetic identity fraud is when users combine real and fake information to create a new synthetic identity. This is a very common type of fraud that is rampant in most financial institutions. Detecting this type of fraud with a graph relies heavily on entity resolution techniques combined with graph analysis to identify similar or connected entities." + ] + }, + { + "cell_type": "markdown", + "id": "d1bdbb3e", + "metadata": {}, + "source": [ + "## Creating a fraud graph\n", + "\n", + "In this section, we'll load a fraud graph and set some visualization options. We'll then use some openCypher queries to inspect the data model used throughout the solution.\n", + "\n", + "### Load data\n", + "The cell below loads the example fraud graph into your Neptune cluster. When you run the cell below, a graph for an example fraud dataset will load, which will take about 5 minutes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0589d25", + "metadata": {}, + "outputs": [], + "source": [ + "%seed --model Property_Graph --dataset fraud_graph --run" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ca8bf2f", + "metadata": {}, + "source": [ + "Now we will install the library we will be using later on in this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15ee6b87", + "metadata": {}, + "outputs": [], + "source": [ + "pip install igraph -q" + ] + }, + { + "cell_type": "markdown", + "id": "dad950ed", + "metadata": {}, + "source": [ + "### Set visualization and configuration options\n", + "\n", + "The cell below configures the visualization to use specific colors and icons for the different parts of the data model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "854a23c6", + "metadata": {}, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "\n", + "{\n", + " \"groups\": {\n", + " \"Account\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf2bb\",\n", + " \"color\": \"red\"\n", + " }\n", + " },\n", + " \"Transaction\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf155\",\n", + " \"color\": \"green\"\n", + " }\n", + " },\n", + " \"Merchant\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf290\",\n", + " \"color\": \"orange\"\n", + " }\n", + " },\n", + " \"DateOfBirth\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf1fd\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"EmailAddress\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf1fa\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"Address\": 
{\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf015\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"IpAddress\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf109\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"PhoneNumber\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf095\",\n", + " \"color\": \"blue\"\n", + " }\n", + " }\n", + " },\n", + " \"edges\": {\n", + " \"color\": {\n", + " \"inherit\": false\n", + " },\n", + " \"smooth\": {\n", + " \"enabled\": true,\n", + " \"type\": \"straightCross\"\n", + " },\n", + " \"arrows\": {\n", + " \"to\": {\n", + " \"enabled\": false,\n", + " \"type\": \"arrow\"\n", + " }\n", + " },\n", + " \"font\": {\n", + " \"face\": \"courier new\"\n", + " }\n", + " },\n", + " \"interaction\": {\n", + " \"hover\": true,\n", + " \"hoverConnectedEdges\": true,\n", + " \"selectConnectedEdges\": false\n", + " },\n", + " \"physics\": {\n", + " \"minVelocity\": 0.75,\n", + " \"barnesHut\": {\n", + " \"centralGravity\": 0.1,\n", + " \"gravitationalConstant\": -50450,\n", + " \"springLength\": 95,\n", + " \"springConstant\": 0.04,\n", + " \"damping\": 0.09,\n", + " \"avoidOverlap\": 0.1\n", + " },\n", + " \"solver\": \"barnesHut\",\n", + " \"enabled\": true,\n", + " \"adaptiveTimestep\": true,\n", + " \"stabilization\": {\n", + " \"enabled\": true,\n", + " \"iterations\": 1\n", + " }\n", + " }\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "8df75803", + "metadata": {}, + "source": [ + "### Data model\n", + "The fraud graph included in this example models credit card accounts, account holder information, merchants, and the transactions performed when an account holder purchases goods or services from a merchant.\n", + "\n", + "**Account and features**\n", + "\n", + "An Account has a number of features, including physical 
Address, IpAddress, DateOfBirth of the account holder, EmailAddress, and contact PhoneNumber. An account holder can have multiple email addresses and phone numbers.\n", + "\n", + "In many graph data models, these features of the account holder would be modelled as properties of the account. With fraud detection, it's important to be able to link accounts based on shared features, and to find related accounts at query time based on one or more shared features. Hence, our fraud detection application graph data model stores each feature as a separate vertex. Multiple accounts that share the same feature value - the same physical address, for example - are connected to the single vertex representing that feature value.\n", + "\n", + "The following query shows a single account and its associated features. After running the query, click the Graph tab to see a visualization of the results." + ] + }, + { + "cell_type": "markdown", + "id": "59e92609", + "metadata": {}, + "source": [ + "### What does my fraud graph look like for an account?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9c3a2b3", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n)-[]-()\n", + "WHERE id(n)='account-4398046519460' \n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "c882feb8", + "metadata": {}, + "source": [ + "### What if I only want to look at the account properties, not the transactions?\n", + "\n", + "While the transaction information is very important, it is often overwhelming, especially on very active accounts. Often we want to look only at the account features, to see how a specific account is connected to them."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad0e5c6c", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n)-[:FEATURE_OF_ACCOUNT]-()\n", + "WHERE id(n)='account-4398046519460' \n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "9c68588e", + "metadata": {}, + "source": [ + "### What else connects to these same features?\n", + "\n", + "After isolating only the account features, we may want to extend our exploration further to see if we can find other accounts that share these features." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "316b36ca", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n)-[:FEATURE_OF_ACCOUNT*1..2]-()\n", + "WHERE id(n)='account-4398046519460' \n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "e6695649", + "metadata": {}, + "source": [ + "This is starting to show some interesting information. As we can see above, there are four other accounts that all share the same birthday and one that shares the same phone number. Having a shared birthday does not seem too suspicious, as there are only so many possible birthdays, but sharing both a birthday and a phone number seems unlikely.\n", + "\n", + "### Find accounts with shared features\n", + "\n", + "Let's take a look at a visualization of all accounts that share a feature with another account." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7cabd521", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH p=(n)-[:FEATURE_OF_ACCOUNT*1..2]-()\n", + "RETURN p LIMIT 1000" + ] + }, + { + "cell_type": "markdown", + "id": "fdab6844", + "metadata": {}, + "source": [ + "The results show a lot of shared features. Accounts that share an anomalous number of features compared to others in the graph are often a sign that some sort of identity fraud is occurring. 
\n", + "\n", + "## Finding Synthetic identities in your graph\n", + "\n", + "When analyzing graph data, a frequent requirement is to infer information or connections from the underlying graph into a graph that is optimized to run analytics and algorithms. Many graphs in Neptune, such as our transaction graph, contain a rich collection of entities and attributes that, while useful for transactional queries, are not required to perform analytical tasks." + ] + }, + { + "cell_type": "markdown", + "id": "839b0ce5", + "metadata": {}, + "source": [ + "### Inferring a feature graph from your transaction graph\n", + "\n", + "In this example, we are going to create a feature graph that connects all accounts that share a feature within this network. We will then run graph similarity algorithms on this graph to decide which accounts are most similar to other accounts.\n", + "\n", + "Up until now we have been working with Amazon Neptune directly. For this analysis we are going to leverage an integration between Neptune and Pandas DataFrames, supplied by AWS Data Wrangler, to read and write data from Neptune and the iGraph library to perform network analysis/graph algorithms on top of this data.\n", + "\n", + "Running the cell below will retrieve the required data from Neptune (in this case an edge list) and load it into a Pandas DataFrame that we will use for later analysis."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e42b967", + "metadata": {}, + "outputs": [], + "source": [ + "import awswrangler as wr\n", + "import pandas as pd\n", + "import igraph as ig\n", + "import graph_notebook as gn\n", + "from graph_notebook.configuration.generate_config import AuthModeEnum\n", + "\n", + "# Get the configuration information for the notebook\n", + "config = gn.configuration.get_config.get_config()\n", + "iam = config.auth_mode == AuthModeEnum.IAM\n", + "\n", + "# Retrieve data from Neptune\n", + "client = wr.neptune.connect(config.host, config.port, iam_enabled=iam)\n", + "query = \"\"\"MATCH p=(n)-[:FEATURE_OF_ACCOUNT*1..2]-()\n", + "WITH relationships(p) AS rels\n", + "UNWIND rels AS rel\n", + "RETURN id(startNode(rel)) AS source, id(endNode(rel)) AS target\"\"\"\n", + "df = wr.neptune.execute_opencypher(client, query)\n", + "display(df.head(10))" + ] + }, + { + "cell_type": "markdown", + "id": "cc0f52c2", + "metadata": {}, + "source": [ + "### Detecting similar nodes\n", + "\n", + "First-person fraud occurs when a user supplies false information to a company. Synthetic identity fraud is defined as a form of fraud where users combine real and fake information to create a new synthetic identity. This is a very common type of fraud that is rampant in most financial institutions today. Solving this type of fraud using a graph relies heavily on using entity resolution techniques combined with graph analysis to identify similar or connected entities. \n", + "\n", + "In network analysis and graph algorithms, there is an entire class of algorithms, known as similarity algorithms, that evaluate how groups of nodes are connected or partitioned from one another and use this network of connections to identify similar nodes within the graph. 
While there are a large number of these algorithms, the most common similarity algorithms used are Jaccard, Overlap, and k-Nearest Neighbors.\n", + "\n", + "#### Finding graph similarity\n", + "\n", + "Now that we have created our feature graph, we can run a similarity algorithm to find other similar nodes. Running the cell below will run the Jaccard similarity algorithm on the data, which will compare nodes against all other nodes in the graph to look for nodes that share the same neighbors. It returns a score between 0 and 1 showing how similar two nodes are to each other.\n", + "\n", + "Let's take a look and see what it might look like to find the nodes most similar to `account-4398046519460` that we looked at above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b11d766b", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a graph from the results returned\n", + "g = ig.Graph.TupleList(df.itertuples(index=False), directed=True, weights=False)\n", + "\n", + "# Collect the vertices that represent accounts\n", + "vids=[]\n", + "account_vs=[]\n", + "for idx, v in enumerate(g.vs):\n", + "    if v['name'].startswith('account'):\n", + "        vids.append(idx)\n", + "        account_vs.append(v)\n", + "\n", + "sim = g.similarity_jaccard(vertices=vids, loops=False)\n", + "\n", + "for idx, s in enumerate(sim):\n", + "    index = None\n", + "    best_score = 0  # avoid shadowing the built-in max()\n", + "    for i in range(len(s)):\n", + "        # A score of 1 (including a node compared with itself) is skipped\n", + "        if s[i] < 1 and s[i] > best_score:\n", + "            best_score = s[i]\n", + "            index = i\n", + "    if index is not None:\n", + "        account_vs[index]['similar_node']=account_vs[idx]['name']\n", + "        account_vs[index]['similarity_score']=s[index]\n", + "\n", + "\n", + "similar_node={}\n", + "for idx, a in enumerate(account_vs):\n", + "    if 'similar_node' in a.attributes() and a['similar_node'] and a['name'] == 'account-4398046519460':\n", + "        similar_node = {'target_node': a['name'], 'similar_node': a['similar_node'], 'score': a['similarity_score']}\n", + "\n", + "\n", + "print(similar_node)" + ] + }, + { + "cell_type": "markdown", + "id": "a03ab4aa", + 
"metadata": {}, + "source": [ + "### Let's look at the similarities for these two accounts\n", + "\n", + "Looking at the data above, we see the account that is most similar to our target account. Let's take a look at what the connections around these two accounts look like by running the cell below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "152c96eb", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "\n", + "MATCH p=(n)-[:FEATURE_OF_ACCOUNT]-()\n", + "WHERE id(n) in ['${similar_node[\"target_node\"]}', '${similar_node[\"similar_node\"]}']\n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "48a3db1a", + "metadata": {}, + "source": [ + "These two accounts only seem to share a single common feature and do not look too similar to me. Besides knowing that these two accounts are the most similar, we also have a relative score that shows how similar they are to one another. Let's see what the score is for these two accounts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f8c9bc44", + "metadata": {}, + "outputs": [], + "source": [ + "similar_node['score']" + ] + }, + { + "cell_type": "markdown", + "id": "7d0055cd", + "metadata": {}, + "source": [ + "On a scale from 0 to 1, with 1 meaning identical, that is not a very high score, so let's see if we can find some more similar accounts to examine.\n", + "\n", + "### Find the most similar accounts\n", + "\n", + "Let's take a look at all the accounts in our feature graph and find the pair of accounts that are most similar to each other."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d0b3924", + "metadata": {}, + "outputs": [], + "source": [ + "max_sim=0\n", + "most_sim=None\n", + "for idx, a in enumerate(account_vs):\n", + "    if 'similarity_score' in a.attributes() and a['similarity_score'] and a['similarity_score']>max_sim:\n", + "        max_sim = a['similarity_score']\n", + "        most_sim=a\n", + "print(f\"The two most similar accounts are {most_sim['similar_node']} and {most_sim['name']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "79dd4a70", + "metadata": {}, + "source": [ + "Now that we know the two most similar accounts, let's take a look at the graph of their connections." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "870023c8", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "\n", + "MATCH p=(n)-[:FEATURE_OF_ACCOUNT]-()\n", + "WHERE id(n) in [\"${most_sim['name']}\", \"${most_sim['similar_node']}\"]\n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "34ae6284", + "metadata": {}, + "source": [ + "As we see, these nodes share a significant number of common features, so we need to mark them as similar for further investigation. Now that we have done this analysis, we need to save this data back into our original graph to enable real-time investigations by trained fraud analysts.\n", + "\n", + "### Storing calculated edges with similarity values back into the graph\n", + "\n", + "To store our data back into our original graph, we can once again use our AWS Data Wrangler integration to save a Pandas DataFrame into Neptune. To accomplish this, we will first construct a DataFrame consisting of the target and source nodes for a `similar_to` edge and then add the similarity score as a property `jaccard_similarity` to that edge. 
Next, we save this DataFrame back to Neptune using the [`to_property_graph()`](https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.neptune.to_property_graph.html#awswrangler.neptune.to_property_graph) method. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcd4dcbb", + "metadata": {}, + "outputs": [], + "source": [ + "import uuid\n", + "\n", + "# Delete existing 'similar_to' edges\n", + "query = \"\"\"MATCH p=()-[s:similar_to]->() DELETE s\"\"\"\n", + "df = wr.neptune.execute_opencypher(client, query)\n", + "\n", + "# Create DataFrame for new edges\n", + "rows=[]\n", + "for idx, a in enumerate(account_vs): \n", + " if 'similar_node' in a.attributes() and a['similar_node']:\n", + " rows.append({'~id': uuid.uuid4(), '~label': 'similar_to',\n", + " '~from': a['name'],\n", + " '~to': a['similar_node'],\n", + " 'jaccard_similarity(single)': a['similarity_score']})\n", + "new_df=pd.DataFrame(rows, columns=['~id','~label', '~from', '~to', 'jaccard_similarity(single)'])\n", + "\n", + "# Save new edges\n", + "res = wr.neptune.to_property_graph(client, new_df, use_header_cardinality=True, batch_size=100)" + ] + }, + { + "cell_type": "markdown", + "id": "301adab7", + "metadata": {}, + "source": [ + "## Analyzing the results\n", + "\n", + "A critical and ongoing part of any fraud workflow is to have a mechanism to enable analysts to investigate and prove/disprove that a potentially fraudulent activity exists. " + ] + }, + { + "cell_type": "markdown", + "id": "d9ffa22f", + "metadata": {}, + "source": [ + "### Find the most similar accounts to examine first\n", + "\n", + "A common workflow for a skilled fraud analyst is to start by retrieving a list of prioritized accounts to look at based on the similarity between two accounts. In the query below, we will find the top 5 most similar accounts to examine." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b1ec9f1", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -de jaccard_similarity -l 20 --store-to similar\n", + "\n", + "MATCH p=(a:Account)-[s:similar_to]->()\n", + "WHERE s.jaccard_similarity IS NOT NULL \n", + "RETURN p ORDER BY s.jaccard_similarity DESC LIMIT 5" + ] + }, + { + "cell_type": "markdown", + "id": "6e8a02f5", + "metadata": {}, + "source": [ + "To make the exploration a little easier, let's retrieve the source and target vertex ids." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "537183a2", + "metadata": {}, + "outputs": [], + "source": [ + "source = similar['results'][0]['p'][0]['~id']\n", + "target = similar['results'][2]['p'][0]['~id']" + ] + }, + { + "cell_type": "markdown", + "id": "ad105702", + "metadata": {}, + "source": [ + "### Explore their connections\n", + "\n", + "The ability to visually explore graphs is a powerful tool that helps fraud analysts understand how certain accounts are connected. Let's take a look at the account information for these two accounts and see the graph of the surrounding connections. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc0a9eb4", + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "\n", + "MATCH p=(a)-[*1..2]-()\n", + "WHERE id(a) in ['${source}', \"${target}\"]\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "517602e4", + "metadata": {}, + "source": [ + "### Mark as Fraud/Not Fraud\n", + "\n", + "Visual inspection, combined with the domain expertise of a fraud analyst, is a critical factor in being able to determine if anomalous patterns in a graph represent actual fraud or legitimate activity. Expert analysts are skilled in looking at the patterns of transactions and connections and the structural connections between items to determine the legitimacy of an account/transaction. 
Once they have made this determination, they will often flag these accounts/transactions as fraudulent in the graph to aid in future investigations.\n", + "\n", + "Let's mark the account above as fraudulent by setting the `isFraud` property to `True`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69643124", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "MATCH (a)\n", + "WHERE id(a)='${source}'\n", + "SET a.isFraud=True\n", + "RETURN a" + ] + }, + { + "cell_type": "markdown", + "id": "39d2b822", + "metadata": {}, + "source": [ + "### Find all fraudulent accounts within five hops \n", + "\n", + "Now that we have completed our investigation of the source account, let's take a look at this new account and see if it is connected to any accounts marked as fraudulent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65285215", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d value -l 20\n", + "\n", + "MATCH p=(a)-[:FEATURE_OF_ACCOUNT|ACCOUNT*1..5]-(b)\n", + "WHERE id(a)='${source}'\n", + "AND b.isFraud=True\n", + "RETURN p LIMIT 1" + ] + }, + { + "cell_type": "markdown", + "id": "a4623310", + "metadata": {}, + "source": [ + "Wow, that account shares two features with a known fraudster, so it looks suspicious.\n", + "\n", + "## Conclusion\n", + "\n", + "This notebook has shown how you can use Amazon Neptune and AWS Data Wrangler to run analytics on your data to detect fraud rings. We used a credit card dataset with account- and transaction-centric queries to infer a social network from this data and then used that social network to look for similar accounts. We then identified these accounts and stored this information within our graph. Using this information we were able to explore the connections around the most similar accounts to identify other potentially fraudulent accounts.\n", + "\n", + "Combating fraud is an ongoing challenge for any organization. 
The faster a team can identify fraud, and the more of it they find, the more efficient anti-fraud systems become, preventing significant financial losses. Finding and understanding fraud rings is a problem that requires the ability to query, analyze, and explore the connections between accounts, transactions, and account features. Combining the ability to query a graph with the ability to run network analysis and graph algorithms on top of that data enables us to derive novel insights from this data." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/graph_notebook/notebooks/05-Data-Science/02-Logistics-Analysis-using-a-Transportation-Network.ipynb b/src/graph_notebook/notebooks/05-Data-Science/02-Logistics-Analysis-using-a-Transportation-Network.ipynb new file mode 100644 index 00000000..8cf4c176 --- /dev/null +++ b/src/graph_notebook/notebooks/05-Data-Science/02-Logistics-Analysis-using-a-Transportation-Network.ipynb @@ -0,0 +1,446 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7d0f97f8", + "metadata": {}, + "source": [ + "# Logistics Analysis using a Transportation Network\n", + "\n", + "For any company that moves people or goods from one location to another, the associated logistical challenges and costs represent a significant amount of effort and resources. Companies often employ a variety of route optimization techniques to improve both the cost and efficiency of their transportation logistics. 
When dealing with route logistics, there are a few main points of data that come into play including time, cost, distance, etc. Using graphs to represent the connections between data within these logistics systems provides a robust way to represent these attributes. Graphs allow us to represent attributes of connections as properties of those connections, which is a natural way to store data for a logistics system. Graphs also provide advantages in logistics systems as they enable certain types of graph algorithms, specifically around path finding, to drive insights that can feed into route optimizations.\n", + "\n", + "Route optimization can take many forms, be it minimizing distance, maximizing load, minimizing cost, maximizing speed, or balancing multiple factors to achieve an optimized solution. In this notebook, we will look at the types of inputs that graphs can help provide to these optimizations.\n", + "\n", + "\n", + "## Creating a logistics graph\n", + "\n", + "In this section, we'll load a graph focused on airports, routes between airports, and the distance of those routes.\n", + "\n", + "### Load data\n", + "The cell below loads the example graph into your Neptune cluster. When you run the cell below, a graph for an example transportation dataset will load, which will take about 1 minute." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0ce4226", + "metadata": {}, + "outputs": [], + "source": [ + "%seed --model Property_Graph --run --dataset airports" + ] + }, + { + "cell_type": "markdown", + "id": "036f588e", + "metadata": {}, + "source": [ + "### Install Required Libraries\n", + "\n", + "For this notebook, we will also need to install [iGraph](https://igraph.org/) which is an open-source network analysis library. You could also perform similar tasks using other common Python libraries such as [NetworkX](https://networkx.org/) if you desire. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "234d9a68", + "metadata": {}, + "outputs": [], + "source": [ + "pip install igraph -q" + ] + }, + { + "cell_type": "markdown", + "id": "8860c93c", + "metadata": {}, + "source": [ + "### Set visualization and configuration options\n", + "\n", + "The cell below configures the visualization to use specific colors and icons for the different parts of the data model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10528702", + "metadata": {}, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "\n", + "{\n", + " \"groups\": {\n", + " \"airport\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf072\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"US\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf072\",\n", + " \"color\": \"blue\"\n", + " }\n", + " },\n", + " \"IS\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf072\",\n", + " \"color\": \"yellow\"\n", + " }\n", + " },\n", + " \"RU\": {\n", + " \"shape\": \"icon\",\n", + " \"icon\": {\n", + " \"face\": \"FontAwesome\",\n", + " \"code\": \"\\uf072\",\n", + " \"color\": \"red\"\n", + " }\n", + " }\n", + "}}" + ] + }, + { + "cell_type": "markdown", + "id": "3cc036d5", + "metadata": {}, + "source": [ + "### Data model\n", + "The transportation graph included in this example is relatively straightforward, consisting of airports, routes connecting those airports together, and a `distance` property specifying how long the flight between airports is in miles.\n", + "\n", + "The following query shows a single `airport` (Anchorage) and all the places you can fly to from that `airport`. After running the query, click the Graph tab to see a visualization of the results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "436c1b9b", + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "%%oc -d code\n", + "\n", + "MATCH p=(a:airport {code: 'ANC'})-[:route]->()\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "ad0c60cd", + "metadata": {}, + "source": [ + "# Examining our transportation graph\n", + "\n", + "Now that we have seen what our transportation graph looks like, let's start by looking at some overall statistics on the graph and see the shape of our transportation network.\n", + "\n", + "### How many airports and routes are in the graph?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e675a06", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH (a:airport)-[r:route]->()\n", + "with count(distinct(a)) as num_airports, count(r) as num_routes \n", + "RETURN num_airports, num_routes, toFloat(num_routes)/num_airports as ratio" + ] + }, + { + "cell_type": "markdown", + "id": "c69044b7", + "metadata": {}, + "source": [ + "As we can see we have around 3400 airports and 50k routes in this dataset meaning that each airport has on average ~14 routes. \n", + "\n", + "Let's take a look at the airports with the most flight routes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "370d025e", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH (a:airport)-[r:route]->()\n", + "RETURN a.desc, count(r) as num_routes\n", + "ORDER BY num_routes DESC LIMIT 5" + ] + }, + { + "cell_type": "markdown", + "id": "f1524970", + "metadata": {}, + "source": [ + "Well, no surprise there as all of these airports are well-known airport hubs.\n", + "\n", + "### What is the distribution of airports in the data?\n", + "\n", + "After looking at the overall distribution of items in the graph, another common use case is to group these by a property to look at the distribution. 
In this case, let's group by the `region` property to see how many airports are in each region/state." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd6c4a71", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH (a:airport)\n", + "RETURN a.region, count(a) as cnt \n", + "ORDER BY cnt DESC " + ] + }, + { + "cell_type": "markdown", + "id": "e6a7baa4", + "metadata": {}, + "source": [ + "That is a bit unexpected. Alaska has 3 times more airports than any other region in this dataset. Let's take a moment to examine some of the details of the airports in Alaska.\n", + "\n", + "## Where can I fly to from Alaska?\n", + "\n", + "Running the query below will show you all the locations you can fly to, starting from an airport in Alaska." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5994fb2b", + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "%%oc -d code -g country\n", + "\n", + "MATCH p=(a:airport {region: 'US-AK'})-[]->()\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "671e8879", + "metadata": {}, + "source": [ + "# Examining Graph Logistics using Graph Analytics\n", + "\n", + "Now that we have taken a look at some of the characteristics of our transportation graph, let's start running some analysis on this graph to see how we can use it to help solve some logistics questions.\n", + "\n", + "\n", + "## Load a Pandas DataFrame for analysis\n", + "\n", + "Up until now we have been working with Amazon Neptune directly. 
For this analysis we are going to leverage an integration between Neptune and Pandas DataFrames, supplied by [AWS Data Wrangler](https://aws-data-wrangler.readthedocs.io/en/stable/#), to read and write data from Neptune and the [iGraph](https://igraph.org/) library to perform network analysis/graph algorithms on top of this data.\n", + "\n", + "Running the cell below will retrieve the required data from Neptune and load it into a Pandas DataFrame. We then break this result up into two DataFrames, one containing the airport information and the other containing the routes connecting the airports that we will use for later analysis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f59d6b9e", + "metadata": {}, + "outputs": [], + "source": [ + "import awswrangler as wr\n", + "import pandas as pd\n", + "import igraph as ig\n", + "import graph_notebook as gn\n", + "from graph_notebook.configuration.generate_config import AuthModeEnum\n", + "from IPython.display import HTML, display\n", + "\n", + "def print_path(g, paths, fields=None):\n", + "    result=[]\n", + "    for idx, n in enumerate(paths):\n", + "        path = []\n", + "        for a in g.vs[n]:\n", + "            if fields:\n", + "                values={}\n", + "                for f in fields:\n", + "                    values[f]=a[f]\n", + "                path.append(values)\n", + "            else:\n", + "                path.append(a.attributes())\n", + "        result.append(path)\n", + "    display(pd.DataFrame(result))\n", + "\n", + "# Get the configuration information for the notebook\n", + "config = gn.configuration.get_config.get_config()\n", + "iam=True if config.auth_mode==AuthModeEnum.IAM else False\n", + "\n", + "# Retrieve Data from neptune\n", + "client = wr.neptune.connect(config.host, config.port, iam_enabled=iam)\n", + "query = \"\"\"MATCH ()-[r:route]->()\n", + "RETURN startnode(r) as source, endnode(r) as target, \n", + "id(startnode(r)) as source_id, id(endnode(r)) as target_id, r.dist\"\"\"\n", + "\n", + "df = wr.neptune.execute_opencypher(client, query)\n", + "\n", + "# Create the dataframe 
of airports and remove duplicates\n", + "airports = wr.neptune.flatten_nested_df(pd.concat([df['source'].apply(pd.Series).apply(pd.Series), \n", + " df['target'].apply(pd.Series).apply(pd.Series)]), seperator=\"_\")\n", + "airports = airports.drop_duplicates(subset='~id', keep=\"first\").drop('index', axis=1)\n", + "\n", + "\n", + "# Remove the leading tildes from column names to make life easier\n", + "airports.columns = airports.columns.str[1:]\n", + "\n", + "# Create the routes dataframe\n", + "routes = pd.DataFrame(data=df,columns=['source_id', 'target_id', 'r.dist']).apply(pd.Series).apply(pd.Series)\n", + "routes.columns = routes.columns.str.replace(\".\", \"_\", regex=False)\n", + "display(airports.head(5))\n", + "display(routes.head(5))\n", + "\n", + "g = ig.Graph.DataFrame(routes, directed=True, vertices=airports)" + ] + }, + { + "cell_type": "markdown", + "id": "395f536d", + "metadata": {}, + "source": [ + "## Routing in a logistics graph\n", + "\n", + "One of the most common questions asked when looking at logistics is how to move items from `Location A` to `Location B` most effectively. These are questions where logistics graphs and graph analytics excel through the use of a category of graph algorithms known as path finding algorithms. Path finding algorithms traverse a graph from a start node to an end node to determine the \"most efficient\" path in terms of the number of connections or a weight, which can represent a relative attribute of the connection such as time, distance, cost, capacity, complexity, etc.\n", + "\n", + "For these examples, we will choose a city that presents some unique logistics problems. We will use [Deadhorse, Alaska, USA](https://en.wikipedia.org/wiki/Deadhorse,_Alaska) as the target for logistics shipping. Deadhorse, AK is a small community in the far northern reaches of Alaska's Prudhoe Bay, approximately 495 miles (~800 km) from the nearest city. 
It is also a hub of oil production in Alaska, so the logistics of flying equipment and personnel to/from this location is a real challenge. Let's take a look at some of the ways we can utilize a graph to simplify these logistics.\n", + "\n", + "\n", + "### What are the fewest flights from Deadhorse to Miami?\n", + "\n", + "Let's start by taking a look at one of the simplest path finding queries: finding the shortest path between two locations, in this case, flying from Deadhorse, AK to Miami, FL, USA." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46f93951", + "metadata": {}, + "outputs": [], + "source": [ + "path = g.get_all_shortest_paths(g.vs.find(properties_code='SCC'), g.vs.find(properties_code='MIA'))\n", + "print_path(g, path, ['properties_desc'])" + ] + }, + { + "cell_type": "markdown", + "id": "1be3757c", + "metadata": {}, + "source": [ + "As we see from the results, there are 12 different routes. However, some of these routes seem a bit out-of-the-way. For example, one route goes from `Deadhorse -> Anchorage -> Reykjavik, Iceland -> Miami`, which definitely seems unnecessary. This is because the shortest path algorithm we ran counted every connection as equal, so it minimized the number of connections. If we want to find the shortest flight distance, then we want to use a shortest path algorithm that takes into account the distance of each route, which can be represented as a weight on the `route` edge.\n", + "\n", + "### What is the shortest distance to fly from Deadhorse to Miami?\n", + "\n", + "Let's see what our route looks like if we take into account the distance of each flight." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "875ef710", + "metadata": {}, + "outputs": [], + "source": [ + "path = g.get_all_shortest_paths(g.vs.find(properties_code='SCC'), g.vs.find(properties_code='MIA'), weights=\"r_dist\")\n", + "print_path(g, path, ['properties_desc'])" + ] + }, + { + "cell_type": "markdown", + "id": "50509654", + "metadata": {}, + "source": [ + "Now this route makes a lot of sense and seems very direct. As we can see, in situations where not all connections are equal, adding weights to edges and then using those weights as part of the shortest path calculation provides us a very powerful tool to help solve logistical challenges.\n", + "\n", + "\n", + "### Traveling to multiple locations\n", + "\n", + "So far we have only looked at what it takes to travel from one location to another. However, it is also very common to need a system that routes a piece of equipment/personnel across multiple locations. This sort of problem is known as a [Traveling Salesman Problem](https://en.wikipedia.org/wiki/Travelling_salesman_problem) and is an [NP-hard](https://en.wikipedia.org/wiki/NP-hardness) problem. Luckily, there are some ways we can approximate a solution to this problem, such as using a [Minimum Spanning Tree](https://en.wikipedia.org/wiki/Minimum_spanning_tree) to find the set of routes with the lowest total distance that connects all the entities.\n", + "\n", + "Let's take a look at the minimum distance required to connect all 150 airports in Alaska, USA." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "520d26ea", + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve data from Neptune\n", + "client = wr.neptune.connect(config.host, config.port, iam_enabled=iam)\n", + "query = \"\"\"MATCH p=(a:airport)-[r:route]->(b:airport)\n", + "WHERE a.region=\"US-AK\"\n", + "AND b.region=\"US-AK\"\n", + "RETURN id(a) as source, id(b) as target, r.dist as dist\"\"\"\n", + "mst = wr.neptune.execute_opencypher(client, query)\n", + "\n", + "\n", + "# Run a minimum spanning tree, weighted by route distance\n", + "# (weights=True stores the third column as the \"weight\" edge attribute)\n", + "mst_g = ig.Graph.TupleList(mst.itertuples(index=False), directed=True, weights=True)\n", + "# return_tree=False returns the IDs of the edges in the spanning tree\n", + "tree = mst_g.spanning_tree(weights=\"weight\", return_tree=False)\n", + "print(\"The minimum spanning tree for all airports in Alaska, USA is:\")\n", + "for edge in mst_g.es[tree]:\n", + "    try:\n", + "        src = g.vs.find(name=mst_g.vs[edge.source]['name'])['properties_desc']\n", + "        dst = g.vs.find(name=mst_g.vs[edge.target]['name'])['properties_desc']\n", + "        print(f\"{src} -> {dst}\")\n", + "    except (KeyError, ValueError):\n", + "        # Skip airports that cannot be found in the full route graph\n", + "        pass" + ] + }, + { + "cell_type": "markdown", + "id": "1aa1b0a6", + "metadata": {}, + "source": [ + "With this we can now see the most efficient set of routes that we would need to take to travel to all the airports in Alaska.\n", + "\n", + "# Conclusion\n", + "Graph analysis provides unique insights into transportation/logistics graphs. The analyses demonstrated in this notebook only scratch the surface of the types of powerful insights you can draw using graphs and graph analytics. Finding and understanding these optimized paths and connection patterns between data is a strength of graphs, graph databases, and graph analysis, which answer questions dealing with logistics efficiently and effectively." 
+ ] + }, + { + "cell_type": "markdown", + "id": "e058e8e5", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/graph_notebook/notebooks/05-Data-Science/__init__.py b/src/graph_notebook/notebooks/05-Data-Science/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/test/unit/notebooks/test_validate_notebooks.py b/test/unit/notebooks/test_validate_notebooks.py index b9ca372c..e2b0ace3 100644 --- a/test/unit/notebooks/test_validate_notebooks.py +++ b/test/unit/notebooks/test_validate_notebooks.py @@ -47,7 +47,10 @@ def test_no_extra_notebooks(self): f'{NOTEBOOK_BASE_DIR}/04-Machine-Learning/Neptune-ML-SPARQL/Neptune-ML-01-Introduction-to-Object-Classification-SPARQL.ipynb', f'{NOTEBOOK_BASE_DIR}/04-Machine-Learning/Neptune-ML-SPARQL/Neptune-ML-02-Introduction-to-Object-Regression-SPARQL.ipynb', f'{NOTEBOOK_BASE_DIR}/04-Machine-Learning/Neptune-ML-SPARQL/Neptune-ML-03-Introduction-to-Link-Prediction-SPARQL.ipynb', - f'{NOTEBOOK_BASE_DIR}/04-Machine-Learning/Sample-Applications/01-People-Analytics/People-Analytics-using-Neptune-ML.ipynb'] + f'{NOTEBOOK_BASE_DIR}/04-Machine-Learning/Sample-Applications/01-People-Analytics/People-Analytics-using-Neptune-ML.ipynb', + f'{NOTEBOOK_BASE_DIR}/05-Data-Science/00-Identifying-Fraud-Rings-Using-Social-Network-Analytics.ipynb', + f'{NOTEBOOK_BASE_DIR}/05-Data-Science/01-Identifying-1st-Person-Synthetic-Identity-Fraud-Using-Graph-Similarity.ipynb', + f'{NOTEBOOK_BASE_DIR}/05-Data-Science/02-Logistics-Analysis-using-a-Transportation-Network.ipynb'] notebook_paths = get_all_notebooks_paths() 
expected_paths.sort() notebook_paths.sort()