diff --git a/ChangeLog.md b/ChangeLog.md index 1c7ad918..fd5919bc 100644 --- a/ChangeLog.md +++ b/ChangeLog.md @@ -5,6 +5,8 @@ Starting with v1.31.6, this file will contain a record of major features and upd ## Upcoming - New Neptune ML notebook - Real Time Fraud Detection using Inductive Inference ([Link to PR](https://github.com/aws/graph-notebook/pull/338)) - Path: 04-Machine-Learning > Sample-Applications > 03-Real-Time-Fraud-Detection-Using-Inductive-Inference.ipynb +- New openCypher Language Tutorial notebooks + - Path: 06-Language-Tutorials > 02-openCypher - Added `--profile-misc-args` option to `%%gremlin` ([Link to PR](https://github.com/aws/graph-notebook/pull/443)) - Added error messaging for incompatible host-specific `%%graph_notebok_config` parameters ([Link to PR](https://github.com/aws/graph-notebook/pull/456)) - Ensure default assignments for all Gremlin nodes when using grouping ([Link to PR](https://github.com/aws/graph-notebook/pull/448)) diff --git a/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/01-Basic-Read-Queries.ipynb b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/01-Basic-Read-Queries.ipynb new file mode 100644 index 00000000..6a7021c0 --- /dev/null +++ b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/01-Basic-Read-Queries.ipynb @@ -0,0 +1,889 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "eab505f3", + "metadata": {}, + "source": [ + "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", + "SPDX-License-Identifier: Apache-2.0\n", + "\n", + "# Learning openCypher - Basic Read Queries\n", + "\n", + "This notebook is the first in a series of notebooks that walk through how to write queries using openCypher. In this notebook, we will examine the basics of openCypher read queries and how these queries fit into the \"Find\", \"Filter\", \"Format\" paradigm. Let's begin by loading some sample data into our Neptune cluster. 
\n", + "\n", + "\n", + "\n", + "\n", + "## Getting Started\n", + "\n", + "For these notebooks, we will be leveraging a dataset from the book [Graph Databases in Action](https://www.manning.com/books/graph-databases-in-action?a_aid=bechberger) from Manning Publications. \n", + "\n", + "\n", + "**Note:** These notebooks do not cover data modeling or building a data loading pipeline. If you would like a more detailed description of how this dataset is constructed and where the design of the data model came from, please read the book.\n", + "\n", + "To get started, the first step is to load data into the cluster. Assuming the cluster is empty, this can be accomplished by running the cell below, which will load the Dining By Friends dataset, installed as part of the notebook.\n", + "\n", + "### Loading Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "03dd0507", + "metadata": {}, + "outputs": [], + "source": [ + "%seed --model Property_Graph --dataset dining_by_friends --run" + ] + }, + { + "attachments": { + "image-3.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "id": "48e9a85f", + "metadata": {}, + "source": [ + "### Looking at our graph data\n", + "\n", + "Now that we have loaded our data, let's take a moment to look at what our data model looks like:\n", + "\n", + "\n", + "![image-3.png](attachment:image-3.png)\n", + "\n", + "\n", + " \n", + " \n", + "\n", + " \n", + "
Element (Node/Edge) Counts
\n", + " \n", + "|Node Label|Count|\n", + "|:--|:--|\n", + "|review|109|\n", + "|restaurant|40|\n", + "|cuisine|24|\n", + "|person|8|\n", + "|state|2|\n", + "|city|2|\n", + " \n", + "\n", + "\n", + "|Edge Label|Count|\n", + "|:--|:--|\n", + "|wrote|218|\n", + "|about|218|\n", + "|within|84|\n", + "|serves|80|\n", + "|friends|20|\n", + "|lives|16|\n", + "\n", + "
\n", + "\n", + "This dataset represents a fictitious, but realistic, restaurant recommendation application that contains:\n", + "\n", + "* Users, represented by `person` nodes\n", + "* Users connected to Users via `friends` edges\n", + "* Restaurants and their associated information (`city`, `state`, `cuisine`)\n", + "* Reviews, including the body and ratings\n", + "* Ratings of reviews (helpful/not helpful)\n", + "\n", + "The data this application collects has three main aspects. First, it contains a social network consisting of `person` nodes connected to other `person` nodes via a `friends` edge. Second, it contains a restaurant review aspect consisting of `restaurant` nodes, information about those restaurants (`city`/`state`/`cuisine`), and `review` nodes for that restaurant. The third and final aspect consists of a personalization component where a `person` can rate a `review`, which allows for better recommendations based on a person's preferences.\n", + "\n", + "Throughout this set of notebooks, we will leverage the different aspects of this data to highlight different fundamental types of common property graph queries, namely neighborhood traversals, hierarchies, paths, and collaborative filtering.\n", + "\n", + "Now let's get started." + ] + }, + { + "cell_type": "markdown", + "id": "dfa24286", + "metadata": {}, + "source": [ + "### Setting up the visualizations\n", + "\n", + "Run the next two cells to configure various display options for our notebook, which we will use later on to display our results in a pleasing visual way. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e655017", + "metadata": {}, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "{\n", + " \"groups\": { \n", + " \"person\": {\n", + " \"color\": \"#9ac7bf\"\n", + " },\n", + " \"review\": {\n", + " \"color\": \"#f8cecc\"\n", + " },\n", + " \"city\": {\n", + " \"color\": \"#d5e8d4\"\n", + " },\n", + " \"state\": {\n", + " \"color\": \"#dae8fc\"\n", + " },\n", + " \"review_rating\": {\n", + " \"color\": \"#e1d5e7\"\n", + " },\n", + " \"restaurant\": {\n", + " \"color\": \"#ffe6cc\"\n", + " },\n", + " \"cuisine\": {\n", + " \"color\": \"#fff2cc\"\n", + " }\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5c80800", + "metadata": {}, + "outputs": [], + "source": [ + "node_labels = '{\"person\":\"first_name\",\"city\":\"name\",\"state\":\"name\",\"restaurant\":\"name\",\"cuisine\":\"name\"}'" + ] + }, + { + "cell_type": "markdown", + "id": "ab986dac", + "metadata": {}, + "source": [ + "\n", + "## Finding your Data\n", + "\n", + "When working with openCypher, the most common usage of the language is to find data. openCypher accomplishes this using a few clauses:\n", + "\n", + "* `MATCH` - specifies the pattern of data to look for, see below for the pattern syntax\n", + "* `RETURN` - defines what and how the data will be returned to the user\n", + "* `LIMIT` - while not required, this is useful to minimize the data returned by specifying the maximum number of matching patterns returned\n", + "\n", + "The `MATCH` clause is a fundamental part of every query. It uses an ASCII art-based syntax to define the pattern of nodes and edges that you would like to match within the graph. These matches are then used as the basis for the filter and format portions of the query. 
\n", + "\n", + "The pattern matching syntax is summarized in the table below.\n", + "\n", + "#### Pattern Matching Syntax\n", + "\n", + "| openCypher Pattern|Description|\n", + "|:--|:--|\n", + "|`( )`|A node|\n", + "|`[ ]`|An edge|\n", + "|`-->`|Follow outgoing edges from a node|\n", + "|`<--`|Follow incoming edges from a node|\n", + "|`--`|Follow edges in either direction|\n", + "|`-[]->`|Include the outgoing edges in the query (to check a label or property for example)|\n", + "|`<-[]-`|Include the incoming edges in the query (to check a label or property for example)|\n", + "|`-[]-` |Include edges in either direction in the query|\n", + "|`-[]->( )`|The node on the other end of an outgoing edge|\n", + "|`<-[]-()`|The node on the other end of an incoming edge|\n", + "\n", + "\n", + "Looking at the syntax above, you may be curious about the concept of incoming and outgoing edges. In property graphs, all edges are defined as having a direction, meaning that they start at one node and end at another. However, graph query languages such as openCypher allow you to specify patterns where you traverse these edges in either direction. \n", + "\n", + "Now that we have a basic understanding of openCypher's pattern matching syntax, let's take a look at how it is applied to answer some common graph questions.\n", + "\n", + "### Finding Nodes\n", + "\n", + "The simplest pattern you can write in openCypher is one that matches nodes. In openCypher patterns, nodes are represented by parentheses (`()`).\n", + "\n", + "Patterns, or elements within a pattern, can be associated with a variable by naming the variable within the related portion of the syntax. For nodes, this means adding it within the parentheses such as in `(n)` where `n` is the variable name. 
When we put a variable within a pattern, this portion of the pattern is then available to us later in the query for additional operations, such as filtering or formatting to return to the user, as shown below.\n", + "\n", + "Execute the query below to search for nodes and assign them to a variable `n` (`MATCH (n)`), return the nodes bound to `n` (`RETURN n`), but limit the number returned to 10." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deabe58e", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n) //find me nodes and assign them to the variable 'n'\n", + "RETURN n //return 'n'\n", + "LIMIT 10 //return only 10 results" + ] + }, + { + "cell_type": "markdown", + "id": "4eee92c8", + "metadata": {}, + "source": [ + "**Note:** If no variable is assigned to a portion of the pattern, then that portion will be used for pattern matching but will not be available for additional operations, as we will see in the next query.\n", + "\n", + "### Finding Edges\n", + "\n", + "The example above works, but it does not leverage the connections within the data, represented by edges in our graph. Edges make graph databases a powerful asset for working with connected data. 
\n", + "\n", + "To perform a search across multiple nodes and edges, we need to use our pattern to specify how the nodes and edges are related using the following syntax:\n", + "\n", + "\n", + "| openCypher Pattern|Description|\n", + "|:--|:--|\n", + "|`-->`|Follow outgoing edges from a node|\n", + "|`<--`|Follow incoming edges from a node|\n", + "|`--`|Follow edges in either direction|\n", + "|`-[]->`|Include the outgoing edges in the query (to check a label or property for example)|\n", + "|`<-[]-`|Include the incoming edges in the query (to check a label or property for example)|\n", + "|`-[]-` |Include edges in either direction in the query|\n", + "\n", + "Execute the query below to search for node-edge->node patterns, assign the edges to a variable `r` (`MATCH ()-[r]->()`), and return 10 edges bound to `r` (`RETURN r`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "168bcc76", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH ()-[r]->() //find me all node-edge->node patterns\n", + "RETURN r //return the edge\n", + "LIMIT 10 //return only 10 results" + ] + }, + { + "cell_type": "markdown", + "id": "6d36e1dd", + "metadata": {}, + "source": [ + "In the example above, we specified using outgoing edges, via the arrow direction `()-[r]->()`, but we could have also chosen to look for patterns using only incoming edges, `()<-[r]-()`, or ignoring edge direction, `()-[r]-()`. \n", + "\n", + "We can link these basic constructs together across multiple levels of connections to build more complex patterns. In the example below, we have extended our previous query to return 10 matches for nodes that have two incoming edges, by specifying a `node-edge->node<-edge-node` pattern."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55f2aaf9", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH ()-[]->(n)<-[]-() //find me all node-edge->node<-edge-node patterns\n", + "RETURN n \n", + "LIMIT 10 " + ] + }, + { + "cell_type": "markdown", + "id": "6f5ef9ca", + "metadata": {}, + "source": [ + "In the example above, we have returned matches based on a series of connected nodes and edges. When working with graphs, a series of connected nodes and edges may also be referred to as a 'path'. Often when we are looking for patterns within our graph, we would like to return not just a node or edge within the pattern but the path containing how these items are connected.\n", + "\n", + "\n", + "### Finding Paths\n", + "\n", + "To find paths within our graph, we combine two constructs we have already learned: pattern matching and variables, to specify that we want the path returned. In our previous queries, we assigned a node or edge in our pattern to a variable. When returning a path, we assign the entire pattern to a variable, as seen below where we assign the path to a variable `p`. \n", + "\n", + "Once we have assigned our path to the variable, we can return this as we have previous variables, except now our returned values will contain the `node-edge->node<-edge-node` information for the path that was matched.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c3736ac", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=()-[]->(n)<-[]-() //assign my node-edge->node<-edge-node patterns to a variable 'p'\n", + "RETURN p \n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "df52d879", + "metadata": {}, + "source": [ + "## Filtering your Data\n", + "\n", + "So far, we have learned how to find specific patterns within our graph based on how the nodes and edges connect. 
However, most of the time you will want to use attributes of the nodes and edges to filter the results to return a more specific subset of data. \n", + "\n", + "We accomplish this using the `WHERE` clause. Within the `WHERE` clause, there are a variety of operators available to perform logical operations and comparisons of the data. Below is a listing of the operators supported by openCypher.\n", + "\n", + "**Operators**\n", + "\n", + "|Type|Operators|\n", + "| ----------- | ----------- |\n", + "|General|`DISTINCT, x.y (property access)`|\n", + "|Math|`+`, `-`, `*`, `/`|\n", + "|Comparison|`=`, `>`, `<`, `<>`, `<=`, `>=`, `IS NULL`, `IS NOT NULL`|\n", + "|Boolean|`AND`, `OR`, `NOT`, `XOR`|\n", + "|String|`STARTS WITH`, `ENDS WITH`, `CONTAINS`, `+`|\n", + "|LIST|`+`, `IN`, `[]`|\n", + "\n", + "\n", + "In the next section, we will look at some common ways to apply filters using these operators. \n", + "\n", + "### Filtering Nodes by Label\n", + "\n", + "One of the most common items you will want to filter on will be the label(s) associated with a node. There are two methods for adding label filters to a query. You can either:\n", + "\n", + "* Inline the filter as part of the match clause, which is done by adding a colon (`:`) followed by one or more label names (separated by a `|`) \n", + "* Use the `labels()` function in a `WHERE` clause to filter. \n", + "\n", + "Both of these will produce identical results so whether you choose one versus the other is a decision for the query writer. 
Below, we have included examples of both:\n", + "\n", + "#### Filtering using inline filters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "651a0f21", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "RETURN d \n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "61f35676", + "metadata": {}, + "source": [ + "#### Filtering using `labels()` and the `WHERE` clause" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0460223d", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d)\n", + "WHERE 'person' IN labels(d)\n", + "RETURN d\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "51893308", + "metadata": {}, + "source": [ + "One item you might notice is the use of `IN`, instead of an equality (`=`) operator, to check whether the node has the label. In openCypher, nodes can have multiple labels associated with them. As a result, the `labels()` function returns a list of labels and the `IN` operator allows us to find a specific label inside that list. \n", + "\n", + "### Filtering Edges by Type\n", + "Another common item you will want to filter on is the type or label associated with an edge. As with nodes, there are two methods for adding edge label filters to a query, as shown here: \n", + "\n", + "* Inline the filter as part of the match clause, which is done by adding a colon (`:`) followed by one or more type names (separated by a `|`) \n", + "* Use the `type()` function in a `WHERE` clause to filter. \n", + "\n", + "As with nodes, these will produce identical results so whether you choose one versus the other is a decision for the query writer. 
Below we have included examples of both:\n", + "\n", + "#### Filtering using inline filters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b673615", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=(d:person)-[:friends]->()\n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "52375922", + "metadata": {}, + "source": [ + "#### Filtering using `type()` and `WHERE`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a76e67d6", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=(d:person)-[r]->()\n", + "WHERE type(r)='friends'\n", + "RETURN p\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "2d59628b", + "metadata": {}, + "source": [ + "Unlike when we filtered using `labels()`, when we use `type()` to filter edges we do not need to use `IN`. Edges have exactly one type, so the `type()` function returns a single value, not a list. Due to this difference, we can do a direct equality comparison.\n", + "\n", + "### Filtering by Property\n", + "\n", + "The next common use case is filtering on property values. \n", + "\n", + "As with node and edge labels, there are two methods for adding property filters to a query. 
Once again our options are either: \n", + "\n", + "* Inline the filter as part of the match clause, which is done by adding curly brackets containing the key/value you want to filter on (`{first_name: 'Dave'}`)\n", + "* Use the `WHERE` clause to filter using the operators listed above\n", + "\n", + "Below, we have included examples of both:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e755b029", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d)\n", + "WHERE d.first_name='Dave'\n", + "RETURN d \n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "738bfc3f", + "metadata": {}, + "source": [ + "Alternatively, you can include equality filters inline with the pattern by putting them into curly brackets (`{}`) with the property key name followed by the value, as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39ac2235", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d {first_name: 'Dave'})\n", + "RETURN d \n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "61497739", + "metadata": {}, + "source": [ + "Both examples return the same results and are interpreted by the engine as the same query, which raises the question of which one to choose.\n", + "\n", + "Part of the answer to that question comes down to the preference of the query writer. However, here are some general guidelines:\n", + "\n", + "* Inlining of filters works only on equality filters for one or more properties. If you need non-equality filters (e.g. 
not equals, greater than, less than, `IN`) then you should use a `WHERE` clause\n", + "* `WHERE` clauses allow more complex filtering using `AND`, `OR`, and `NOT` to build up more complex expressions\n", + "* `WHERE` clauses support List and Pattern comprehension (discussed later) where in-lining filters does not" + ] + }, + { + "cell_type": "markdown", + "id": "4b8defd4", + "metadata": {}, + "source": [ + "### Filtering on Existence\n", + "Another common need when filtering data is to check for the existence of an attribute or additional piece of topology. This is most commonly accomplished using the `exists()` function as shown below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "968444f0", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "WHERE exists(d.last_name)\n", + "RETURN d \n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "ac3261a9", + "metadata": {}, + "source": [ + "You can also use the `exists()` function with the `NOT` predicate to filter on patterns that do not exist, such as in the example below where we find all `person` nodes that do not have a property `age`.\n", + "\n", + "**Note** - With this dataset it will return all 8 people as there is no `age` property in the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d5cd5ad", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "WHERE NOT exists(d.age)\n", + "RETURN d \n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "0addc153", + "metadata": {}, + "source": [ + "In the examples above, we checked the existence of a property using the `exists()` function. In addition to properties, we can also check for the existence of patterns, using the pattern matching syntax, as part of the `WHERE` clause. 
\n", + "\n", + "Below, we can see how we can leverage pattern matching to find all `person` nodes that do not have anyone who has connected to them with an incoming `friends` edge." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09c11073", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "WHERE NOT (d)<-[:friends]-()\n", + "RETURN d \n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "2081d7b4", + "metadata": {}, + "source": [ + "## Formatting Results\n", + "\n", + "Having gone through the basics of finding and filtering data with openCypher, let's take a look at the last step, formatting our results. All openCypher read queries, and almost all mutation queries, end with a `RETURN` clause. This clause is used to specify what aspects of the data are returned and in what format.\n", + "\n", + "\n", + "### Returning all values\n", + "\n", + "The simplest way to return data is to return all values specified in the query. This can be accomplished by specifying a wildcard `*` in the `RETURN` clause as shown here." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d1af3bf", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "RETURN *\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "953027a0", + "metadata": {}, + "source": [ + "While this is easy to write, it is not very efficient, either in terms of the work the engine has to do to retrieve all the values or the amount of data that has to be transmitted over the wire.\n", + "\n", + "### Returning property values\n", + "\n", + "Most often, you want to be specific about the data elements (node/edges), attributes, or a combination of both, that a query returns. This provides for efficient processing, both at the database and client level, and efficient data transmission, since we are only retrieving, processing, and sending what is needed. 
\n", + "\n", + "To accomplish this, we use the `RETURN` clause to specify the variable associated with a node/edge/path or we specify the property or properties of elements using a `.` syntax. \n", + "\n", + "Below we have two queries: the first returns the person's first name and the second returns the first and last name values.\n", + "\n", + "**Note:** If you would like to assign a specific key to the values returned, you can rename them using the `AS` modifier. For example, to return the `first_name` as `name`, use `RETURN d.first_name AS name`. If `AS` is not specified, the key name defaults to the returned expression as written (e.g. `d.first_name`).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ba582e4", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "RETURN d.first_name\n", + "LIMIT 10" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea3c4fba", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "RETURN d.first_name, d.last_name\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "4f89c0ba", + "metadata": {}, + "source": [ + "### Returning unique values\n", + "\n", + "To return only unique values, add the `DISTINCT` keyword to the `RETURN` clause, which removes duplicate values of the specified property." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c6cf5578", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "RETURN DISTINCT d.last_name\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "1d9b3131", + "metadata": {}, + "source": [ + "### Returning static values\n", + "\n", + "While not as common as some of the other options shown here, openCypher also allows you to return static values for any matches. 
In the query below, we return a static literal `true` for each matched node." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d218be0a", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d)\n", + "RETURN true\n", + "LIMIT 1" + ] + }, + { + "cell_type": "markdown", + "id": "30566e03", + "metadata": {}, + "source": [ + "### Projecting new return values\n", + "\n", + "In addition to returning existing values, we can also perform operations on those values using functions (which will be discussed in the third notebook), such as `min()`, `max()`, `toUpper()`, or perform other operations using primitives, such as concatenating strings. \n", + "\n", + "In the example below, we show how you can return a full name by concatenating the `first_name` and `last_name` using one of these primitive operators. The exact primitives available and their behavior vary depending on the incoming data types." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f951169", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "RETURN d.first_name + ' ' + d.last_name AS full_name\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "809468ee", + "metadata": {}, + "source": [ + "### Constructing complex return types\n", + "\n", + "In addition to returning simple key-value pairs, the `RETURN` clause can construct more complex response types such as maps. This is a common requirement, especially when returning aggregations or when returning attributes from different variables in the matched patterns.\n", + "\n", + "These new projections are created by supplying curly brackets (`{}`) and specifying the key names and associated return values. These return values are specified using the same syntax we have shown above, including elements, properties, or operations. The example below shows how each of these different options can be returned."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b6d4d8c3", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (d:person)\n", + "RETURN {element: d, first: d.first_name, last: d.last_name, full: d.first_name + ' ' + d.last_name} AS full_name\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "870e2989", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "Now that we have gone through the basics of writing openCypher read queries, it's time to put it into practice. Below are several exercises you can complete to verify your understanding of the material covered in this notebook. As practice for what you have learned, please write the openCypher queries specified below.\n", + "\n", + "Using the social network portion (`person-friends->person`) of our Dining By Friends graph, let's answer the following questions. For each of these questions we are going to be working with the `friends` edge. Depending on the domain in which you are working, an edge such as `friends` could imply a mutual relationship or a one-way relationship. Sites such as LinkedIn and Facebook use a mutual friendship model where if `Person A-friends-Person B` then it means `Person B-friends-Person A`. Sites such as Twitter use a one-way friendship where a person may follow another person but that does not mean that they are followed back. For Dining By Friends, let's assume the `friends` edge represents a one-way relationship.\n", + "\n", + "\n", + "### Exercise B-1 Find the first name of Dave's friends\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the friends of Dave (i.e. 
traverse the `friends` edge)\n", + "* Return the friends' `first_name`\n", + "\n", + "The correct answer is four results: \"Josh\", \"Hank\", \"Jim\", \"Kelly\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee77c9c6", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "f4b6049f", + "metadata": {}, + "source": [ + "### Exercise B-2 Find the first name of the friends of Dave's friends\n", + "\n", + "For the next exercise, let's find the friends of Dave's friends. In this case we will not exclude people who are already Dave's direct friends.\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the friends of Dave (i.e. traverse the `friends` edge)\n", + "* Find the friends of that person (i.e. traverse the `friends` edge)\n", + "* Return the friends' `first_name`\n", + "\n", + "The correct answer contains three results: \"Hank\", \"Denise\", \"Paras\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e6cc0978", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "91433ad6", + "metadata": {}, + "source": [ + "### Exercise B-3 Find out how the friends of Dave's friends are connected\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the friends of Dave (i.e. 
traverse the `friends` edge)\n", + "* Return the path\n", + "\n", + "The correct answer contains three results: \"Hank\", \"Denise\", \"Paras\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e7b488b", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "c80aba29", + "metadata": {}, + "source": [ + "### Exercise B-4 Which friends should we recommend for Dave?\n", + "\n", + "A common use case for graphs in social networks is to recommend new connections. There is a significant amount of research in this area (example [here](https://www.science.org/doi/10.1126/sciadv.aax7310#:~:text=The%20triadic%20closure%20mechanism%20uses,features%20of%20empirical%20social%20networks)) but mainly there are two prevailing mechanisms at work in social networks that we can leverage to help provide efficient recommendations to a user. The first of these mechanisms is called homophily, which is the tendency of similar people to be connected. Homophily is a driving factor in many social networks, with an important outcome being that people connected to you, or connected to people that are connected to you, tend to be similar to you. This leads to the second mechanism in a graph, the concept of a triadic closure. Triadic closure is a way to create or recommend new connections based on common friends or acquaintances. \n", + "\n", + "\n", + "In this exercise, we are going to leverage triadic closure to recommend friends for Dave. 
To accomplish this, we will need to leverage the previously written queries but extend them to:\n", + "\n", + "* Find all the friends of friends that do not have a connection to Dave\n", + "\n", + "The correct answer contains three results: \"Hank\", \"Denise\", \"Paras\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fb26d6fd", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "d1f7523f", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, we explored the basics of writing openCypher queries and how they are represented in the \"Find\", \"Filter\", \"Format\" paradigm. First, we learned the basics of how to specify the pattern of data we would like to match in our queries. Next, we learned several different mechanisms for how to filter the data found by our queries to return the correct results. Finally, we learned how to specify the format of the data being returned from a query to make for efficient use of database and application resources.\n", + "\n", + "In the next notebook, we will take what we have learned in this notebook and extend it to show how to answer questions where the length of the patterns is variable or unknown." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/02-Variable-Length-Paths.ipynb b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/02-Variable-Length-Paths.ipynb new file mode 100644 index 00000000..a111e41e --- /dev/null +++ b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/02-Variable-Length-Paths.ipynb @@ -0,0 +1,490 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "eab505f3", + "metadata": {}, + "source": [ + "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", + "SPDX-License-Identifier: Apache-2.0\n", + "\n", + "# Learning openCypher - Variable Length Path Queries\n", + "\n", + "This notebook is the second in a series of notebooks that walk through how to write queries using openCypher. In this notebook, we will examine the basics of how to perform variable length path queries in openCypher. \n", + "\n", + "\n", + "This notebook will build upon the items covered in the notebook \"01-Basic-Read-Queries\". If you have not loaded the data from that notebook, please follow the steps in the [Getting Started](#Getting-Started) section below. 
If you have loaded the data, then you can jump ahead to the [Setting up the visualizations](#Setting-up-the-visualizations) section.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "babcb3a8-77d8-4a58-a303-ced6dfda76be", + "metadata": { + "tags": [] + }, + "source": [ + "## Getting Started \n", + "\n", + "For these notebooks, we will be leveraging a dataset from the book [Graph Databases in Action](https://www.manning.com/books/graph-databases-in-action?a_aid=bechberger) from Manning Publications. \n", + "\n", + "\n", + "**Note:** These notebooks do not cover data modeling or building a data loading pipeline. If you would like a more detailed description of how this dataset is constructed and where the design of the data model came from, then please read the book.\n", + "\n", + "To get started, the first step is to load data into the cluster. Assuming the cluster is empty, this can be accomplished by running the cell below, which will load our Dining By Friends data.\n", + "\n", + "### Loading Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2273d5e6-2769-472e-b458-7f738f9831ad", + "metadata": {}, + "outputs": [], + "source": [ + "%seed --model Property_Graph --dataset dining_by_friends --run" + ] + }, + { + "attachments": { + "image-3.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "id": "48e9a85f", + "metadata": {}, + "source": [ + "### Looking at our graph data\n", + "\n", + "Now that we have loaded our data, let's take a moment to look at what our data model looks like:\n", + "\n", + "\n", + "![image-3.png](attachment:image-3.png)\n", + "\n", + "\n", + " \n", + " \n", + "\n", + " \n", + "
Element (Node/Edge) Counts
\n", + " \n", + "|Node Label|Count|\n", + "|:--|:--|\n", + "|review|109|\n", + "|restaurant|40|\n", + "|cuisine|24|\n", + "|person|8|\n", + "|state|2|\n", + "|city|2|\n", + " \n", + "\n", + "\n", + "|Edge Label|Count|\n", + "|:--|:--|\n", + "|wrote|218|\n", + "|about|218|\n", + "|within|84|\n", + "|serves|80|\n", + "|friends|20|\n", + "|lives|16|\n", + "\n", + "
\n", + "\n", + "This dataset represents a fictitious, but realistic, restaurant recommendation application that contains:\n", + "\n", + "* Users, represented by `person` nodes\n", + "* Users connected to Users via `friends` edges\n", + "* Restaurants and their associated information (`city`, `state`, `cuisine`)\n", + "* Reviews, which include a body and a rating\n", + "* Ratings of reviews (helpful/not helpful)\n", + "\n", + "This application contains three main aspects to the data it collects. First, it contains a social network consisting of `person` nodes connected to other `person` nodes via a `friends` edge. Second, it contains a restaurant review aspect consisting of `restaurant` nodes, information about those restaurants (`city`/`state`/`cuisine`), and `review` nodes for that restaurant. The third and final aspect consists of a personalization component where a `person` can rate a `review`, which allows for better recommendations based on a person's preferences.\n", + "\n", + "Throughout this set of notebooks, we will leverage the different aspects of this data to highlight different fundamental types of common property graph queries, namely neighborhood traversals, hierarchies, paths, and collaborative filtering.\n", + "\n", + "Now let's get started." + ] + }, + { + "cell_type": "markdown", + "id": "0c12469c", + "metadata": {}, + "source": [ + "### Setting up the visualizations\n", + "\n", + "Run the next two cells to configure various display options for our notebook, which we will use later on to display our results in a visually pleasing way. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e655017", + "metadata": {}, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "{\n", + " \"groups\": { \n", + " \"person\": {\n", + " \"color\": \"#9ac7bf\"\n", + " },\n", + " \"review\": {\n", + " \"color\": \"#f8cecc\"\n", + " },\n", + " \"city\": {\n", + " \"color\": \"#d5e8d4\"\n", + " },\n", + " \"state\": {\n", + " \"color\": \"#dae8fc\"\n", + " },\n", + " \"review_rating\": {\n", + " \"color\": \"#e1d5e7\"\n", + " },\n", + " \"restaurant\": {\n", + " \"color\": \"#ffe6cc\"\n", + " },\n", + " \"cuisine\": {\n", + " \"color\": \"#fff2cc\"\n", + " }\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5c80800", + "metadata": {}, + "outputs": [], + "source": [ + "node_labels = '{\"person\":\"first_name\",\"city\":\"name\",\"state\":\"name\",\"restaurant\":\"name\",\"cuisine\":\"name\"}'" + ] + }, + { + "cell_type": "markdown", + "id": "ab986dac", + "metadata": {}, + "source": [ + "\n", + "## Variable Length Paths\n", + "\n", + "When working with any property graph, some of the most powerful queries you can write are ones where the number of connections between a source and a target entity is not known. These types of queries are so common that property graph query languages, such as openCypher, provide first-class support for them as a key piece of the query language. In openCypher, these queries are written using a mechanism known as Variable Length Path patterns or VLPs. VLPs allow us to specify a sequence of nodes and relationships, as well as the number of times to repeat the relationship in the pattern matching syntax. 
\n", + "\n", + "In openCypher, a basic VLP query to find all paths of 1 to 3 hops between `person` nodes looks like:\n", + "\n", + "```\n", + "MATCH p=(:person)-[:friends*1..3]->(:person)\n", + "RETURN p\n", + "```\n", + "\n", + "Examining this query we see that, while this looks familiar, there are a few new elements to the relationship syntax to highlight, specifically the `*1..3` portion. This portion begins with an asterisk (`*`) indicating that this is a VLP query. The next number represents the minimum length of the path, which is followed by two periods (`..`) and the maximum length of the path. \n", + "\n", + "While this is the basic pattern for VLP queries, there are several different variants which are shown in the table below: \n", + "\n", + "#### Variable Length Path Syntax\n", + "\n", + "| VLP Pattern|Description|\n", + "|:--|:--|\n", + "|`()-[*2]->()`|Find me a path containing 3 nodes and 2 edges|\n", + "|`()-[*2..3]->()`|Find me a path containing a minimum of 3 nodes and 2 edges and a maximum of 4 nodes and 3 edges|\n", + "|`()-[*2..]->()`|Find me a path containing a minimum of 3 nodes and 2 edges, with no maximum|\n", + "|`()-[*..2]->()`|Find me a path containing a maximum of 3 nodes and 2 edges, with no explicit minimum (the minimum defaults to 1 edge)|\n", + "|`()-[*]->()`|Find me a path with no explicit minimum or maximum (the minimum defaults to 1 edge)|\n", + "\n", + "\n", + "Now that we have a basic understanding of openCypher's VLP syntax, let's look at how this is applied to answer some common graph query patterns.\n", + "\n", + "### Static Length paths\n", + "\n", + "The simplest VLP pattern you can write in openCypher specifies a fixed number of repetitions for your pattern. This is accomplished using the syntax `()-[:friends*2]->()`. Let's execute the query below to search for patterns containing 3 nodes and 2 edges."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deabe58e", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=()-[:friends*2]->()\n", + "RETURN p \n", + "LIMIT 10 " + ] + }, + { + "cell_type": "markdown", + "id": "4eee92c8", + "metadata": {}, + "source": [ + "Looking at the above query, you'll notice that it looks similar to our Friends of Friends query from the last notebook. In truth, this query can be written using either VLP syntax or the syntax we previously learned (`MATCH p=()-[:friends]->()-[:friends]->()`). Why would you want to use VLP syntax here?\n", + "\n", + "The main reason you might prefer VLP syntax here is that it can provide a significantly more readable query to the user. While the queries we are currently looking at are very straightforward, as they get more complex, and the patterns being matched get longer, it can be very helpful to have a shorter, more concise query to make it easier to understand what the query is expected to do. For example, the VLP pattern:\n", + "\n", + "```\n", + "MATCH p=()-[:friends*6]->()\n", + "```\n", + "is much more understandable than the non-VLP pattern:\n", + "\n", + "`MATCH p=()-[:friends]->()-[:friends]->()-[:friends]->()-[:friends]->()-[:friends]->()-[:friends]->()`\n", + "\n", + "even though they are functionally equivalent.\n", + "\n", + "### A range of lengths\n", + "\n", + "While the example above works on a static length of paths, sometimes we do not know the number of connections we need to traverse to answer a question. In this case, we can use the minimum and maximum range parameters of our VLP pattern to specify the range of potential connections.\n", + "\n", + "#### Minimum length\n", + "Let's take the query from the last cell and modify it to specify only the minimum length of the pattern instead of a fixed length. We accomplish this by replacing the `*2` with a `*2..`. 
\n", + "\n", + "\n", + "Execute the query below to see how many paths are connected via a minimum of 2 `friends` edges." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "168bcc76", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=()-[:friends*2..]->()\n", + "RETURN p \n", + "LIMIT 10 " + ] + }, + { + "cell_type": "markdown", + "id": "6d36e1dd", + "metadata": {}, + "source": [ + "#### Maximum length\n", + "Now, let's modify that same query to specify only the maximum length of the pattern, by adding `*..2`. \n", + "\n", + "Execute the query below to see how many paths are connected via a maximum of 2 `friends` edges." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55f2aaf9", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=()-[:friends*..2]->()\n", + "RETURN p \n", + "LIMIT 10 " + ] + }, + { + "cell_type": "markdown", + "id": "6f5ef9ca", + "metadata": {}, + "source": [ + "#### Minimum and Maximum length\n", + "Let's combine these last 2 queries together to specify both the minimum and the maximum length of the pattern, by adding `*1..2`. Execute the query below to see how many paths are connected via a minimum of 1 `friends` edge and maximum of 2 `friends` edges.\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c3736ac", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=()-[:friends*1..2]->()\n", + "RETURN p \n", + "LIMIT 10 " + ] + }, + { + "cell_type": "markdown", + "id": "df52d879", + "metadata": {}, + "source": [ + "#### Unbounded Patterns\n", + "The final way to use VLP syntax is to not specify any minimum or maximum length and instead go for an unbounded range on the pattern. This is accomplished by adding just the asterisk (`*`). 
\n", + "\n", + "\n", + "**Important:** While this is a valid query, these sorts of queries tend to have a very high latency as they may traverse/touch a large portion of the graph, depending on how the graph is connected.\n", + "\n", + "Execute the query below to see how many paths are connected via any number of `friends` edges. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "651a0f21", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH p=()-[:friends*]->()\n", + "RETURN p \n", + "LIMIT 10 " + ] + }, + { + "cell_type": "markdown", + "id": "85de2ce8", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "Now that we have gone through the basics of VLP queries in openCypher, it's time to put it into practice. Below are several exercises you can complete to verify your understanding of the material covered in this notebook. As practice for what you have learned, please write the openCypher queries specified below.\n", + "\n", + "### Exercise VLP-1 Find the friends of Dave's Friends using a VLP\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the friends of Dave (i.e. traverse the `friends` edge)\n", + "* Find the friends of that person (i.e. traverse the `friends` edge)\n", + "* Return the friends `first_name`\n", + "\n", + "The correct answer is three results: \"Hank\", \"Denise\", \"Paras\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d1cc0f0", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "6521e66f", + "metadata": {}, + "source": [ + "### Exercise VLP-2 Find all `person` nodes connected to Dave\n", + "\n", + "Starting at a single node and trying to find all connected children (a.k.a. 
root to leaf) or trying to find the parent of any child node (a.k.a. leaf to root) are two very common hierarchical graph query patterns. Commonly, these queries support bill of materials, information organization, or compliance use cases.\n", + "\n", + "In this exercise, we will be applying that same query pattern to find the hierarchy of people within our social network. We'll accomplish this by writing a \"root to leaf\" type query where the root node is our `Dave` node in the social network.\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Keep traversing the outgoing `friends` edge until there are no more outgoing `friends` edges\n", + "* Return all the paths\n", + "\n", + "The correct answer is nine results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10b4aa1f", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "bda5cbf3", + "metadata": {}, + "source": [ + "### Exercise VLP-3 Find if Dave and Denise are connected\n", + "\n", + "Attempting to see if, and how, two entities in a graph are connected is a common path type query pattern. In an OLTP graph database, queries containing unbounded path traversals are best suited to point-to-point or point-to-set path questions. Set-to-set or all-pairs path questions are better answered with graph algorithms.\n", + "\n", + "In this exercise, we will be applying a path type query pattern to find out if two people are connected.\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the friends of Dave (i.e. 
traverse the `friends` edge)\n", + "* Keep traversing the `friends` edge until you find `Denise`\n", + "* Return a single `True` as the result\n", + "\n", + "The correct answer is a single result: `True`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b44692f8", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "0ce0b6c8", + "metadata": {}, + "source": [ + "### Exercise VLP-4 Find all the ways Dave and Denise are connected\n", + "\n", + "A common extension to the path traversal query we wrote in VLP-3 is to return not just \"if\" someone is connected but \"how\" they are connected.\n", + "\n", + "In this exercise, we will be making a slight modification to the previous query to return \"how\" Dave and Denise are connected, not just that they are.\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the friends of Dave (i.e. traverse the `friends` edge)\n", + "* Keep traversing the `friends` edge until you find `Denise`\n", + "* Return the path\n", + "\n", + "The correct answer has fifteen results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "007a9efd", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "b5acefc5", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, we explored writing variable length path queries in openCypher. These queries are a powerful and common way to explore connected data to answer questions, especially those where the exact number of connections is unknown. \n", + "\n", + "In the next notebook, we will take what we have learned in this notebook and extend it to demonstrate how to order, group, and aggregate values in queries."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/03-Ordering-Functions-Grouping.ipynb b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/03-Ordering-Functions-Grouping.ipynb new file mode 100644 index 00000000..206fa58d --- /dev/null +++ b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/03-Ordering-Functions-Grouping.ipynb @@ -0,0 +1,960 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "eab505f3", + "metadata": {}, + "source": [ + "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", + "SPDX-License-Identifier: Apache-2.0\n", + "\n", + "# Learning openCypher - Ordering, Functions, and Grouping\n", + "\n", + "This notebook is the third in a series of notebooks that walk through how to write queries using openCypher. \n", + "\n", + "This notebook will build upon the items covered in the notebooks \"01-Basic-Read-Queries\" and \"02-Variable-Length-Paths\". If you have not loaded the data from those notebooks, please follow the steps in the [Getting Started](#Getting-Started) section below. If you have loaded the data, then you can jump ahead to the [Setting up the visualizations](#Setting-up-the-visualizations) section.\n", + "\n", + "## Getting Started\n", + "\n", + "For these notebooks, we will be leveraging a dataset from the book [Graph Databases in Action](https://www.manning.com/books/graph-databases-in-action?a_aid=bechberger) from Manning Publications. 
\n", + "\n", + "\n", + "**Note:** These notebooks do not cover data modeling or building a data loading pipeline. If you would like a more detailed description of how this dataset is constructed and where the design of the data model came from, then please read the book.\n", + "\n", + "To get started, the first step is to load data into the cluster. Assuming the cluster is empty, this can be accomplished by running the cell below, which will load our Dining By Friends data.\n", + "\n", + "### Loading Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df7f476a-95f7-43f3-8f87-79ac8c05c65f", + "metadata": {}, + "outputs": [], + "source": [ + "%seed --model Property_Graph --dataset dining_by_friends --run" + ] + }, + { + "attachments": { + "image-3.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "id": "48e9a85f", + "metadata": {}, + "source": [ + "### Looking at our graph data\n", + "\n", + "Because we examined the data model in the previous notebooks, we will not examine it again here; however, we leave the data schema below for reference.\n", + "\n", + "![image-3.png](attachment:image-3.png)" + ] + }, + { + "cell_type": "markdown", + "id": "0c12469c", + "metadata": {}, + "source": [ + "### Setting up the visualizations\n", + "\n", + "Run the next two cells to configure various display options for our notebook, which we will use later on to display our results in a visually pleasing way. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e655017", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "{\n", + " \"groups\": { \n", + " \"person\": {\n", + " \"color\": \"#9ac7bf\"\n", + " },\n", + " \"review\": {\n", + " \"color\": \"#f8cecc\"\n", + " },\n", + " \"city\": {\n", + " \"color\": \"#d5e8d4\"\n", + " },\n", + " \"state\": {\n", + " \"color\": \"#dae8fc\"\n", + " },\n", + " \"review_rating\": {\n", + " \"color\": \"#e1d5e7\"\n", + " },\n", + " \"restaurant\": {\n", + " \"color\": \"#ffe6cc\"\n", + " },\n", + " \"cuisine\": {\n", + " \"color\": \"#fff2cc\"\n", + " }\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5c80800", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "node_labels = '{\"person\":\"first_name\",\"city\":\"name\",\"state\":\"name\",\"restaurant\":\"name\",\"cuisine\":\"name\"}'" + ] + }, + { + "cell_type": "markdown", + "id": "ab986dac", + "metadata": {}, + "source": [ + "\n", + "## Ordering Results\n", + "\n", + "The second law of thermodynamics states that the entropy of an isolated system never decreases. However, when working with data, one common requirement is to return that data in a consistent and ordered fashion. \n", + "\n", + "By default, data returned from an openCypher query does not have a specified order. To give our data a consistent order we must use the `ORDER BY` clause. This clause enables you to sort your results using the values that a query can return, such as nodes, edges, and ID values, as well as many expressions. \n", + "\n", + "**Note:** When the data being ordered contains a `null` value, these will be sorted to the end of the results for ascending sort order and the beginning of the list for descending sort order.\n", + "\n", + "\n", + "### Ordering by a property\n", + "\n", + "The simplest ordering in openCypher is to specify a single property. 
This is accomplished using the syntax `ORDER BY variable.property`. By default, items are sorted in ascending order; descending order can be specified using `ORDER BY variable.property DESC`. \n", + "\n", + "Let's first look at what our data looks like by finding all the `restaurant` nodes in our graph and returning the `name` property." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7a0df81", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name" + ] + }, + { + "cell_type": "markdown", + "id": "91e99c64", + "metadata": {}, + "source": [ + "As we see, there is no discernible order to the values returned. \n", + "\n", + "Let's see how to order our data by executing the query below to find all the `restaurant` nodes in our graph and order them by the `name` property in descending order." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deabe58e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name\n", + "ORDER BY n.name DESC" + ] + }, + { + "cell_type": "markdown", + "id": "4eee92c8", + "metadata": {}, + "source": [ + "As we see, with the addition of the `ORDER BY` clause we get our data out in a nice organized manner. \n", + "\n", + "### Ordering by multiple properties\n", + "\n", + "A common need when ordering data is to use multiple properties as the ordering criteria. In openCypher, this is achieved by adding multiple options to the `ORDER BY` clause. When multiple properties are specified, the results are first ordered by the first property, then for equal values, the next property, and so on for all the specified properties. \n", + "\n", + "Let's see how this works by executing the query below to find all the `restaurant` nodes in our graph and order them by the `name` property, then by the `address` property."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "168bcc76", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name, n.address\n", + "ORDER BY n.name, n.address" + ] + }, + { + "cell_type": "markdown", + "id": "6d36e1dd", + "metadata": {}, + "source": [ + "### Order by expressions\n", + "In addition to ordering by property values, you can use the elements themselves or expressions such as `id()` or `keys()` to order our data. In the example below, we first show how to order data using the element itself, and then by the id values of the elements being returned." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55f2aaf9", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name\n", + "ORDER BY n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c63b78f6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name\n", + "ORDER BY id(n)" + ] + }, + { + "cell_type": "markdown", + "id": "9b78432a", + "metadata": {}, + "source": [ + "### Pagination\n", + "\n", + "One of the most common requirements for applications is the ability to return the data in chunks, or pages in the response. openCypher supports pagination through the use of two clauses: `SKIP` and `LIMIT`. \n", + "\n", + "We have already used the `LIMIT` clause to specify the maximum number of entities returned. When used with the `SKIP` clause, which specifies the number of records to ignore at the beginning of the result set, we can create an effective pagination mechanism. One important thing to note about pagination is that we need to explicitly order the results to retrieve a consistent set of data in our pages. 
Without ordering the results, we have no guarantee that results will be returned in a constant order, meaning that the data shown for a specific \"page\" may differ between calls.\n", + "\n", + "Let's take a look at how we could use `SKIP` and `LIMIT` to present a paginated view of the restaurants in our graph by retrieving the first page of results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0335cb50", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name\n", + "ORDER BY n.name\n", + "SKIP 0 LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "70d12ad6", + "metadata": {}, + "source": [ + "Let's see what it looks like to retrieve the second page of data. To accomplish this, we need to set the value of `SKIP` to represent the page size we would like to skip." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7371ac06", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name\n", + "ORDER BY n.name\n", + "SKIP 10 LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "978fff4c", + "metadata": {}, + "source": [ + "As we see, the data we retrieve from the second query represents the second page of results returned from our query. Please don't hesitate to try additional values for the `SKIP` and `LIMIT` values to see how the query reacts.\n", + "\n", + "## Functions\n", + "\n", + "openCypher contains a set of functions that enables customers to perform computations on the data in an optimized manner. 
There are a variety of types of functions supported in Neptune, which are highlighted in the table below:\n", + "\n", + "|Type|Function|\n", + "| ----------- | ----------- |\n", + "|Predicate|`exists()`|\n", + "|Scalar|`coalesce()`, `endNode()`, `head()`, `id()`, `last()`, `length()`, `properties()`, `size()`, `startNode()`, `timestamp()`, `toBoolean()`, `toFloat()`, `toInteger()`, `type()`|\n", + "|Aggregating|`avg()`, `collect()`, `count()`, `max()`, `min()`, `sum()`|\n", + "|List|`keys()`, `labels()`, `nodes()`, `range()`, `relationships()`, `reverse()`, `tail()`|\n", + "|Math - numeric|`abs()`, `ceil()`, `floor()`, `rand()`, `round()`, `sign()`|\n", + "|Math - logarithmic|`e()`, `exp()`, `log()`, `log10()`, `sqrt()`|\n", + "|String|`left()`, `lTrim()`, `replace()`, `reverse()`, `right()`, `rTrim()`, `split()`, `substring()`, `toLower()`, `toString()`, `toUpper()`, `trim()`|\n", + "\n", + "While each function has a specified input format, there are two main ways that these can be used: either as part of the formatting done in the `RETURN` clause or as part of the filtering done in the `WHERE` clause. Below, we have provided examples of how to use these in each of these query sections.\n", + "\n", + "### Using functions in `RETURN`\n", + "\n", + "One common way to use functions is to apply them while formatting the results. To use a function, you first need to pass in a variable as the first parameter, like this `toUpper(n.name)`. The expected input, including whether a second parameter is required, depends on the exact function being used but this syntax is generally \n", + "\n", + "\n", + "In the example query below, we apply two different functions (`toUpper()` and `coalesce()`) to format the result based on the data matched in the find and filtering portions of the query."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e4ebade", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN toUpper(n.name) as name, \n", + " //returns the first non-null answer, since this property does not exist it should be `No Capacity Provided`\n", + " coalesce(n.max_capacity, 'No Capacity Provided') as capacity, \n", + " n.address as address\n", + "ORDER BY n.name\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "e299d1a4", + "metadata": {}, + "source": [ + "### Using functions in `WHERE`\n", + "\n", + "Another common way to use functions is as part of a comparison when filtering values in the `WHERE` portion of a query. The example below shows how you can perform a case-insensitive search on a restaurant name through the use of the `toUpper()` function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ce3f5fb", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "WHERE toUpper(n.name) = 'HAND ROLL'\n", + "RETURN n.name" + ] + }, + { + "cell_type": "markdown", + "id": "8715cd5f", + "metadata": {}, + "source": [ + "While the examples above are the most common uses of functions, many of the non-aggregating functions may also be used in other portions of the query. The example below shows how functions can be used in an `ORDER BY` to order by the length of the restaurant name." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cd7101fc", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN n.name\n", + "ORDER BY size(n.name)\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "0654b084", + "metadata": {}, + "source": [ + "### Composing multiple functions\n", + "\n", + "Often you may need to chain or compose functions together to create more complex computations. While functions can be chained together in any part of a query where they are supported, the example below shows how to compose several functions together in the `RETURN`. \n", + "\n", + "\n", + "In this example, our query will find the average number of words in the `name` for restaurants in the graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91b72e90", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN avg(size(split(n.name, \" \")))\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "952c7825", + "metadata": {}, + "source": [ + "Now that we have looked at the ordering and function capabilities in openCypher, it's time to take a look at another major set of functionality in formatting openCypher results: grouping.\n", + "\n", + "## Grouping Results\n", + "\n", + "Grouping results in openCypher is a bit different from how grouping works in other query languages, such as Gremlin or SQL. In most other query languages, the grouping of results is done by explicitly calling a step or clause: SQL uses `GROUP BY`, and Gremlin uses the `group()`/`groupCount()` steps.\n", + "\n", + "In openCypher, grouping is controlled implicitly, via the use of aggregating expressions containing one or more aggregating functions (`avg()`, `collect()`, `count()`, `max()`, `min()`, `sum()`). 
Each aggregating function computes its result over the groups formed by the non-aggregating expressions returned alongside it. For this grouping to work, each part of the final aggregating expression has to be one of the following:\n", + "\n", + "* An aggregation function (`RETURN count(*)`)\n", + "* A grouping key (`n` in `RETURN n, count(n)`)\n", + "* A local variable\n", + "\n", + "These rules ensure that the aggregation is computed over all the results within a group. Groups are determined by the grouping keys. Grouping keys are non-aggregate expressions that are specified alongside the aggregate functions and are used to group the values. Let's look at an example to understand how this works.\n", + "\n", + "**Example**\n", + "|id|name|\n", + "|---|---|\n", + "|1|Dave|\n", + "|2|Josh|\n", + "|3|Kelly|\n", + "|4|Dave|\n", + "\n", + "```\n", + "MATCH (n)\n", + "RETURN n.name AS name, count(n.name) AS cnt\n", + "```\n", + "Results:\n", + "\n", + "|name|cnt|\n", + "|---|---|\n", + "|Dave|2|\n", + "|Josh|1|\n", + "|Kelly|1|\n", + "\n", + "In this example, we are returning two values from the query: `name` and `cnt`. The first value, `name`, is not an aggregating function, so it will be the grouping key that buckets similar items. The second value, `cnt`, is the output of an aggregating function, which will calculate the result, in this case a count, based on the buckets created by the grouping key `name`.\n", + "\n", + "It may take a little while to get used to this manner of grouping items, so let's jump in and try out some common use cases for grouping.\n", + "\n", + "### Group by a property\n", + "\n", + "Running the query below returns the count of restaurants with `name` attributes of a specific length."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f66d367b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant) \n", + "RETURN count(*) AS count_in_length, size(n.name) AS name_length\n", + "ORDER BY name_length" + ] + }, + { + "cell_type": "markdown", + "id": "2c92cf2e", + "metadata": {}, + "source": [ + "In this example, the grouping key will be the name length (`size(n.name)`), with the aggregating operation being the `count()`.\n", + "\n", + "### Group on a pattern match\n", + "\n", + "Another common need is to use multiple different elements in a pattern to perform a grouping/aggregation query. To accomplish this, we combine what we know about `MATCH` and named variables with what we have just learned about grouping.\n", + "\n", + "Let's take a look at what it would look like to find the average rating of each restaurant in our graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f646e881", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant)<-[:about]-(r:review)\n", + "RETURN n.name AS name, avg(r.rating) AS rating\n", + "ORDER BY rating desc" + ] + }, + { + "cell_type": "markdown", + "id": "60b2eab1", + "metadata": {}, + "source": [ + "## Combining Queries\n", + "\n", + "Now that we have learned about all the major features (finding, filtering, formatting, ordering, functions, and grouping) of openCypher, we have one more topic to discuss in this notebook: how to combine subqueries together to create more complex queries. In openCypher, there are three main mechanisms to achieve this: `UNION`, `UNION ALL`, and `WITH`.\n", + "\n", + "### UNION\n", + "\n", + "The `UNION` clause combines the results of two or more queries and returns the combined result. In the case of `UNION`, any duplicates are removed from the result. 
\n", + "\n", + "\n", + "Let's see what an example `UNION` query looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "190af502", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant)\n", + "RETURN n \n", + "LIMIT 5\n", + "UNION\n", + "MATCH (n:review)\n", + "RETURN n\n", + "LIMIT 5" + ] + }, + { + "cell_type": "markdown", + "id": "fd55000f", + "metadata": {}, + "source": [ + "One common tripping point with `UNION` queries is that each query must return the same number of columns, and the columns must have identical names. If this is not the case, you will receive an error message, as happens when running the query below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "493ac5cf", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant)\n", + "RETURN n \n", + "LIMIT 5\n", + "UNION\n", + "MATCH (r:review)\n", + "RETURN r\n", + "LIMIT 5" + ] + }, + { + "cell_type": "markdown", + "id": "f0c9a4e2", + "metadata": {}, + "source": [ + "### UNION ALL\n", + "\n", + "The second way to combine queries is with the `UNION ALL` clause. This works fundamentally the same as the `UNION` clause, except that it will retain duplicates in the results. Of the two nearly identical queries below, the first is a `UNION` query, which de-duplicates the results and returns only 5 rows. The second is a `UNION ALL` query, which retains duplicates and returns 10 rows in total."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "baa0f999", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant)\n", + "RETURN n \n", + "LIMIT 5\n", + "UNION \n", + "MATCH (n:restaurant)\n", + "RETURN n \n", + "LIMIT 5" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d97b3b3", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:restaurant)\n", + "RETURN n \n", + "LIMIT 5\n", + "UNION ALL\n", + "MATCH (n:restaurant)\n", + "RETURN n \n", + "LIMIT 5" + ] + }, + { + "cell_type": "markdown", + "id": "ae43a10d", + "metadata": {}, + "source": [ + "### `WITH` clause\n", + "\n", + "The final mechanism to combine subqueries together is the `WITH` clause, which allows subqueries to be chained together instead of combining their results. \n", + "\n", + "The `WITH` clause allows you to manipulate the output of a query and pass all or part of the results of a subquery on for use in the next subquery. While there are a myriad of ways to leverage subqueries in openCypher, there are a few common usage patterns for the `WITH` clause:\n", + "\n", + "* Limiting the number of entries passed to other subqueries\n", + "* Introducing new intermediate results, such as from a projection or aggregation\n", + "* Filtering on aggregated values for subsequent queries\n", + "\n", + "In the cells below, we will show how to leverage each of these common patterns.\n", + "\n", + "#### Limiting Results\n", + "\n", + "Likely the most straightforward use of subqueries is to limit the results from an initial subquery for use as a starting point for the next subquery. In this example, we will find the first 5 restaurants and then find the reviews for those restaurants only. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e27d0b96", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "\n", + "MATCH (n:restaurant)\n", + "WITH n LIMIT 5\n", + "MATCH p=(n)<-[:about]-(:review)\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "9ea7476e", + "metadata": {}, + "source": [ + "#### Intermediate Results\n", + "\n", + "Another common use of the `WITH` clause is to calculate some intermediate result that is used to aid in filtering results. In the example below, we run a query to find all `restaurant` nodes that are connected to a `review`. We then pass the `restaurant` and an uppercase version of its `name`, called `upperCaseName`, to the subsequent portion of the query, where we filter for restaurants whose name starts with `WITH`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9a2ddc2", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (r:restaurant)<--(rev:review)\n", + "WITH r, toUpper(r.name) AS upperCaseName\n", + "WHERE upperCaseName STARTS WITH 'WITH'\n", + "RETURN r.name" + ] + }, + { + "cell_type": "markdown", + "id": "b6b6b33f", + "metadata": {}, + "source": [ + "#### Filtering on Aggregated Values\n", + "\n", + "The final common pattern for using `WITH` in a query is to use an initial subquery to compute an aggregated value, filter on it, and pass the matching results along to the next portion of the query. In the example below, we find all restaurants with more than 5 reviews and then find the city for each of those restaurants."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2c6dc9c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (r:restaurant)<--(rev:review)\n", + "WITH size(collect(rev)) as num_reviews, r\n", + "WHERE num_reviews >5\n", + "MATCH (r)-[:within]->(c:city)\n", + "RETURN c.name as city, r.name as name" + ] + }, + { + "cell_type": "markdown", + "id": "500ed04f", + "metadata": {}, + "source": [ + "#### Variable scope in `WITH`\n", + "\n", + "When working with the `WITH` clause, it is important to understand the scope of variables. Variables exist only within the subquery that defines them unless you pass them on to the next subquery, either by naming them explicitly or by using the wildcard (`*`). Let's look at the previous query:\n", + "```\n", + "MATCH (r:restaurant)<--(rev:review)\n", + "WITH size(collect(rev)) as num_reviews, r\n", + "WHERE num_reviews >5\n", + "MATCH (r)-[:within]->(c:city)\n", + "RETURN c.name as city, r.name as name\n", + "```\n", + "\n", + "Examining the `WITH` clause here, we see that it passes both the calculated value `num_reviews` and one of the original variables, `r`, on to the next subquery. This means that both `num_reviews` and `r` are in scope for the second portion of the query. However, the variable `rev` from the first part is no longer accessible in the second half of this query. Running the cell below will display an error message."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28ca4c56", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (r:restaurant)<--(rev:review)\n", + "WITH size(collect(rev)) as num_reviews, r\n", + "WHERE num_reviews >5\n", + "RETURN rev" + ] + }, + { + "cell_type": "markdown", + "id": "4fed7997", + "metadata": {}, + "source": [ + "To resolve this issue, we must modify the query slightly so that the `WITH` clause collects the `rev` elements for each restaurant into a list named `revs`, keeping `r` as the grouping key. This leaves the matched `review` nodes in scope. We can then filter the results of our query using the `size()` function on `revs` and return the matching values.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f38aa970", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (r:restaurant)<--(rev:review)\n", + "WITH r, collect(rev) as revs\n", + "WHERE size(revs)>5\n", + "RETURN revs" + ] + }, + { + "cell_type": "markdown", + "id": "f5d352e5", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "Now that we have gone through the main concepts of openCypher read queries, it's time to put them into practice. Below are several exercises you can complete to verify your understanding of the material covered in this notebook. As practice for what you have learned, please write the openCypher queries specified below.\n", + "\n", + "For these exercises, we will be leveraging the majority of the different entities in our data to show how we would build a common graph pattern known as \"collaborative filtering\", which is often used to provide recommendations to users based on others' reviews. Collaborative filtering works on the idea that if two people share the same opinion on a topic, such as a restaurant, then they are more likely to share similar opinions on other topics. 
With a graph, we can leverage these connections to help provide recommendations based on these patterns of connections. In these exercises, we will be recommending restaurants to our users based upon reviews.\n", + "\n", + "\n", + "### Exercise G-1 What are the 3 highest rated restaurants?\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find the 3 restaurants with the highest average rating\n", + "* Find the associated cuisine\n", + "* Return the restaurant name, the cuisine name, and the average rating\n", + "* Order the results by average rating descending\n", + "\n", + "The results for this query are:\n", + "\n", + "|Restaurant name|Cuisine|Avg Rating|\n", + "|---|---|---|\n", + "|Lonely Grape|bar|5.0|\n", + "|Perryman's|bar|4.5|\n", + "|Rare Bull|steakhouse|4.333333|\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36adacaf", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (c:cuisine)\n", + "RETURN c" + ] + }, + { + "cell_type": "markdown", + "id": "04ab5b59", + "metadata": {}, + "source": [ + "### Exercise G-2 Find the top 3 highest rated restaurants in the city where Dave lives\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the `city` that Dave lives in\n", + "* Find the average rating of restaurants in that city\n", + "* Find the top 3 average ratings\n", + "* Return the restaurant name, address, and average rating\n", + "* Order by the average rating descending\n", + "\n", + "The results for this query are:\n", + "\n", + "|Restaurant name|Address|Avg Rating|\n", + "|---|---|---|\n", + "|Dave's Big Deluxe|490 Ivan Cape|4.0|\n", + "|Pick & Go|4881 Upton Falls|3.75|\n", + "|Without Chaser|01511 Casper Fall|3.5|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dae9d211", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d 
$node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "678a243d", + "metadata": {}, + "source": [ + "### Exercise G-3 Which Mexican or Chinese restaurant near Dave is the highest rated?\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the `city` that Dave lives in\n", + "* Find the restaurants in that city that serve 'Mexican' or 'Chinese' food\n", + "* Find the average rating of those restaurants\n", + "* Return the restaurant name, address, and average rating\n", + "* Order by the average rating descending\n", + "* Return the top 1 result\n", + "\n", + "The results for this query are:\n", + "\n", + "|Restaurant name|Address|Avg Rating|\n", + "|---|---|---|\n", + "|With Salsa|24320 Williamson Causeway|3.5|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f96b91e5", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "e93f266d", + "metadata": {}, + "source": [ + "### Exercise G-4 What are the top 3 restaurants, recommended by his friends, where Dave lives? 
(Personalized Recommendation)\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", + "* Find the `city` that Dave lives in\n", + "* Find Dave's friends\n", + "* Find reviews written by Dave's friends about restaurants in the city Dave lives in\n", + "* Find the average rating of those restaurants\n", + "* Return the restaurant name, address, and average rating\n", + "* Order by the average rating descending\n", + "* Return the top 3\n", + "\n", + "The results for this query are:\n", + "\n", + "|Restaurant name|Address|Avg Rating|\n", + "|---|---|---|\n", + "|Dave's Big Deluxe|490 Ivan Cape|4.0|\n", + "|With Salsa|24320 Williamson Causeway|4.0|\n", + "|Satiated|370 Hills Estates|3.666667|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81e5d4bc", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "b5acefc5", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, we explored ordering, functions, and grouping in openCypher queries. These features are a powerful and common way to format and summarize the data returned from your graph. This is also the last notebook in the set dedicated to writing read queries. In the next notebook, we will take a look at how to write queries that mutate data through insert, update, and delete operations."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/04-Creating-Updating-Delete-Queries.ipynb b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/04-Creating-Updating-Delete-Queries.ipynb new file mode 100644 index 00000000..c10f03bb --- /dev/null +++ b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/04-Creating-Updating-Delete-Queries.ipynb @@ -0,0 +1,727 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "eab505f3", + "metadata": {}, + "source": [ + "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", + "SPDX-License-Identifier: Apache-2.0\n", + "\n", + "# Learning openCypher - Create, Update, and Delete Queries\n", + "\n", + "This notebook is the fourth in a series of notebooks that walk through how to write queries using openCypher.\n", + "\n", + "This notebook will build upon the items covered in the notebooks \"01-Basic-Read-Queries\", \"02-Variable-Length-Paths\", and \"03-Ordering-Functions-Grouping\". If you have not loaded the data from those notebooks, please follow the steps in the [Getting Started](#Getting-Started) section below. If you have loaded the data, then you can jump ahead to the [Setting up the visualizations](#Setting-up-the-visualizations) section.\n", + "\n", + "## Getting Started\n", + "\n", + "For these notebooks, we will be leveraging a dataset from the book [Graph Databases in Action](https://www.manning.com/books/graph-databases-in-action?a_aid=bechberger) from Manning Publications. 
\n", + "\n", + "\n", + "**Note:** These notebooks do not cover data modeling or building a data loading pipeline. If you would like a more detailed description of how this dataset is constructed and where the design of the data model came from, then please read the book.\n", + "\n", + "To get started, the first step is to load data into the cluster. Assuming the cluster is empty, this can be accomplished by running the cell below, which will load our Dining By Friends data.\n", + "\n", + "### Loading Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff989cbd-55c7-4c0e-81cc-7f0495b37389", + "metadata": {}, + "outputs": [], + "source": [ + "%seed --model Property_Graph --dataset dining_by_friends --run" + ] + }, + { + "attachments": { + "image-3.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "id": "48e9a85f", + "metadata": {}, + "source": [ + "### Looking at our graph data\n", + "\n", + "Now that we have loaded our data, let's take a moment to look at what our data model looks like:\n", + "\n", + "\n", + "![image-3.png](attachment:image-3.png)\n", + "\n", + "\n", + " \n", + " \n", + "\n", + " \n", + "
Element (Node/Edge) Counts
\n", + " \n", + "|Node Label|Count|\n", + "|:--|:--|\n", + "|review|109|\n", + "|restaurant|40|\n", + "|cuisine|24|\n", + "|person|8|\n", + "|state|2|\n", + "|city|2|\n", + " \n", + "\n", + "\n", + "|Edge Label|Count|\n", + "|:--|:--|\n", + "|wrote|218|\n", + "|about|218|\n", + "|within|84|\n", + "|serves|80|\n", + "|friends|20|\n", + "|lives|16|\n", + "\n", + "
\n", + "\n", + "This dataset represents a fictitious, but realistic, restaurant recommendation application that contains:\n", + "\n", + "* Users, represented by `person` nodes\n", + "* Users connected to Users via `friends` edges\n", + "* Restaurants and their associated information (`city`, `state`, `cuisine`)\n", + "* Reviews, which include the body and ratings\n", + "* Ratings of reviews (helpful/not helpful)\n", + "\n", + "The data this application collects has three main aspects. First, it contains a social network consisting of `person` nodes connected to other `person` nodes via a `friends` edge. Second, it contains a restaurant review aspect consisting of `restaurant` nodes, information about those restaurants (`city`/`state`/`cuisine`), and `review` nodes for that restaurant. The third, and final, aspect consists of a personalization component where a `person` can rate a `review`, which allows for better recommendations based on a person's preferences.\n", + "\n", + "Throughout this set of notebooks, we will leverage the different aspects of this data to highlight different fundamental types of common property graph queries, namely neighborhood traversals, hierarchies, paths, and collaborative filtering.\n", + "\n", + "Now let's get started." + ] + }, + { + "cell_type": "markdown", + "id": "0c12469c", + "metadata": {}, + "source": [ + "### Setting up the visualizations\n", + "\n", + "Run the next two cells to configure various display options for our notebook, which we will use later on to display our results in a visually pleasing way. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e655017", + "metadata": {}, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "{\n", + " \"groups\": { \n", + " \"person\": {\n", + " \"color\": \"#9ac7bf\"\n", + " },\n", + " \"review\": {\n", + " \"color\": \"#f8cecc\"\n", + " },\n", + " \"city\": {\n", + " \"color\": \"#d5e8d4\"\n", + " },\n", + " \"state\": {\n", + " \"color\": \"#dae8fc\"\n", + " },\n", + " \"review_rating\": {\n", + " \"color\": \"#e1d5e7\"\n", + " },\n", + " \"restaurant\": {\n", + " \"color\": \"#ffe6cc\"\n", + " },\n", + " \"cuisine\": {\n", + " \"color\": \"#fff2cc\"\n", + " }\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5c80800", + "metadata": {}, + "outputs": [], + "source": [ + "node_labels = '{\"person\":\"first_name\",\"city\":\"name\",\"state\":\"name\",\"restaurant\":\"name\",\"cuisine\":\"name\"}'" + ] + }, + { + "cell_type": "markdown", + "id": "ab986dac", + "metadata": {}, + "source": [ + "\n", + "## Creating Data\n", + "\n", + "When working with any database, one of the most common tasks is adding new data. To add new nodes, edges, or paths in openCypher, we use the `CREATE` clause. \n", + "\n", + "\n", + "### Creating a node with a label and properties\n", + "The simplest way to create a node in openCypher is to run a query similar to this:\n", + "\n", + "```\n", + "CREATE (n)\n", + "```\n", + "This query will create a node with a default label (`vertex`) and no properties. 
If we wanted to return the newly created element, we could do so by adding a `RETURN` clause, as shown here:\n", + "\n", + "```\n", + "CREATE (n)\n", + "RETURN n\n", + "```\n", + "\n", + "We can also create multiple elements simultaneously by specifying all of them in the `CREATE` clause, as seen here:\n", + "\n", + "```\n", + "CREATE (n), (m)\n", + "RETURN n, m\n", + "```\n", + "\n", + "\n", + "While these examples help in understanding the basic syntax, they are not very realistic. In most scenarios, you will not want to just add a node; instead, you will want to add a node with a specific label and associated properties.\n", + "\n", + "Let's look at what our query looks like to create a new `person` node with the first name of `John` and a last name of `Doe`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7a0df81", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "CREATE (n:person {first_name: 'John', last_name: 'Doe'})\n", + "RETURN n" + ] + }, + { + "cell_type": "markdown", + "id": "8364e41a", + "metadata": {}, + "source": [ + "In the example above, the first and last name properties were added by specifying them inline with the node being created. This can also be accomplished using the `SET` clause. The `SET` clause allows you to specify a single property to be added/updated (`SET n.first_name='Jane'`), or to assign a map of key-value pairs which will all be added/updated, as shown in the example below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc969671", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "CREATE (n:person)\n", + "SET n={first_name: 'Jane', last_name: 'Doe'}\n", + "RETURN n" + ] + }, + { + "cell_type": "markdown", + "id": "91e99c64", + "metadata": {}, + "source": [ + "### Creating edges\n", + "\n", + "Another common task is to create edges between nodes. 
To create edges, we first use the `MATCH` clause to get the two nodes we would like to connect and then add the relationship between them using the `CREATE` clause. \n", + "\n", + "In the query below, we find the nodes we created above for `John Doe` and `Jane Doe` and connect them with a `friends` edge." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deabe58e", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (john:person {first_name: 'John', last_name:'Doe'}), (jane:person {first_name: 'Jane', last_name:'Doe'}) \n", + "CREATE (john)-[r:friends]->(jane)\n", + "RETURN r" + ] + }, + { + "cell_type": "markdown", + "id": "4eee92c8", + "metadata": {}, + "source": [ + "As with nodes, we can also set properties on these edges using the `SET` clause, as shown here:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "168bcc76", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (john:person {first_name: 'John', last_name:'Doe'}), (jane:person {first_name: 'Jane', last_name:'Doe'}) \n", + "CREATE (john)-[r:friends]->(jane)\n", + "SET r.relationship='coworker'\n", + "RETURN r" + ] + }, + { + "cell_type": "markdown", + "id": "6d36e1dd", + "metadata": {}, + "source": [ + "### Creating paths\n", + "The last major task people want to do when adding data to their graph is to create entire paths containing both nodes and the connecting edges. Using what we have already learned, we can accomplish this using a query like this:\n", + "\n", + "```\n", + "CREATE (jim:person {first_name: 'Jim', last_name: 'Doe'})\n", + "CREATE (joe:person {first_name: 'Joe', last_name: 'Doe'})\n", + "CREATE (jim)-[:friends]->(joe)\n", + "```\n", + "\n", + "While this is a reasonable approach, we can increase the readability of this query a bit by combining everything into a single statement, as shown below."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55f2aaf9", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "CREATE p= (jim:person {first_name: 'Jim', last_name: 'Doe'})-[:friends]->\n", + " (joe:person {first_name: 'Joe', last_name: 'Doe'})\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "9b78432a", + "metadata": {}, + "source": [ + "## Updating Data\n", + "\n", + "After creating data, the next most common task is to update data within the graph. Lucky for us, we have already learned the building blocks we need to know to accomplish this task. In openCypher, we combine the `MATCH` and `SET` clauses to update attributes on nodes and edges. In the example below, let's update the `first_name` of the `Joe Doe` node we created in the previous step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0335cb50", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (joe:person {first_name: 'Joe', last_name: 'Doe'}) \n", + "SET joe.first_name='Joseph'\n", + "RETURN joe" + ] + }, + { + "cell_type": "markdown", + "id": "70d12ad6", + "metadata": {}, + "source": [ + "As we showed with the `CREATE` clause you can update the values by assigning a map of key-value pairs to an element. When updating data you can also use a map with a `+=` to append or overwrite the key-value pairs from the map to that element. In the example below, we are going to add a new property `age` to our `Joseph Doe` node." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7371ac06", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (joe:person {first_name: 'Joseph', last_name: 'Doe'}) \n", + "SET joe += {age: 43}\n", + "RETURN joe" + ] + }, + { + "cell_type": "markdown", + "id": "978fff4c", + "metadata": {}, + "source": [ + "Another feature of `SET` is that you can remove a property by setting that property key to `null`. The example below removes the `age` property we just added." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e4ebade", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (joe:person {first_name: 'Joseph', last_name: 'Doe'}) \n", + "SET joe.age = null\n", + "RETURN joe" + ] + }, + { + "cell_type": "markdown", + "id": "e299d1a4", + "metadata": {}, + "source": [ + "## Upserting Data\n", + "\n", + "We have learned how to create and update data in our graph. However, there is another important mutation operation that we want to cover: the upsert, where data is created if it doesn't exist or updated if it does. In openCypher, this operation is performed using the `MERGE` clause. `MERGE` can be thought of as a combination of `MATCH` and `CREATE` with the additional ability to specify what to do on a create or a match. \n", + "\n", + "**Note:** The `MERGE` clause does not support partial pattern matches. When matching against patterns in `MERGE`, either the entire pattern is matched, or the entire pattern is created.\n", + "\n", + "The `MERGE` clause is divided into three sections:\n", + "\n", + "* `MERGE` - This specifies the pattern(s) that you want to match in the graph\n", + "* `ON CREATE` - This specifies the behavior to occur if the pattern is created\n", + "* `ON MATCH` - This specifies the behavior to occur if the pattern is matched\n", + "\n", + "Of these three portions, only the `MERGE` is required. 
\n", + "\n", + "### Upserting Nodes\n", + "\n", + "Let's take a look at what a simple `MERGE` statement looks like with a single node pattern match." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ce3f5fb", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MERGE (p:person {first_name: \"Jamie\"})\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "8715cd5f", + "metadata": {}, + "source": [ + "In this case, we created a new node as there are no matches for the specified pattern. However, suppose that when the node is created you want to set additional properties beyond the ones in the pattern. This is where we can leverage the `ON CREATE` portion of the `MERGE` clause, as shown below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cd7101fc", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MERGE (p:person {first_name: \"Jamie\"})\n", + "ON CREATE\n", + " SET p.creation = 'Now'\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "0654b084", + "metadata": {}, + "source": [ + "We can also specify the behavior for when the pattern is matched via the `ON MATCH` portion, as shown here:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91b72e90", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MERGE (p:person {first_name: \"Jamie\"})\n", + "ON CREATE\n", + " SET p.creation = 'Now'\n", + "ON MATCH\n", + " SET p.creation = 'Earlier'\n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "952c7825", + "metadata": {}, + "source": [ + "### Upserting Edges\n", + "\n", + "The `MERGE` clause can also be used to upsert edges. Let's take the query we used above to create edges and update it to use `MERGE` instead." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f66d367b", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (john:person {first_name: 'John', last_name:'Doe'}), (jane:person {first_name: 'Jane', last_name:'Doe'}) \n", + "MERGE (john)-[r:friends]->(jane)\n", + "RETURN r" + ] + }, + { + "cell_type": "markdown", + "id": "2c92cf2e", + "metadata": {}, + "source": [ + "### `UNWIND` with `MERGE`\n", + "\n", + "So far, the operations we have shown work on a single item. However, a common requirement is to send in a list of values and have them all added iteratively. To accomplish this, we can combine the `MERGE` clause with a new clause, `UNWIND`. `UNWIND` is a clause in openCypher that takes a list and expands it into individual rows. For example, if we ran the query below:\n", + "\n", + "```\n", + "UNWIND [1,2,3] as x\n", + "RETURN x\n", + "```\n", + "\n", + "We would get back the following results:\n", + "\n", + "|x|\n", + "|--|\n", + "|1|\n", + "|2|\n", + "|3|\n", + "\n", + "**Note:** While we are demonstrating the `UNWIND` clause in the context of creating/merging data, this is not the only way it can be used. The `UNWIND` clause has multiple other uses when dealing with lists beyond what is shown here.\n", + "\n", + "The ability to pass in a list of values and have them transformed into individual rows can be used to optimize the insertion of data. In the example below, we use a `WITH` clause to inject an array of maps. We then use the `UNWIND` clause to turn this into 3 individual rows, which are then used as inputs to the `MERGE` statement." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f646e881", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "WITH [{first_name:\"Dave\"},{first_name:\"Josh\"},{first_name:\"Steve\"}] as names\n", + "\n", + "UNWIND names as name\n", + "MERGE (n:person {first_name: name.first_name})\n", + "RETURN n" + ] + }, + { + "cell_type": "markdown", + "id": "60b2eab1", + "metadata": {}, + "source": [ + "## Deleting Data\n", + "\n", + "Now that we have learned how to add and update data in our graph, the final operation we need to learn is how to delete data. In openCypher, nodes and edges are removed with the `DELETE` clause.\n", + "\n", + "### Removing a Node\n", + "\n", + "To remove one or more nodes in openCypher, we first match the items we want to delete, via `MATCH`, and then remove them using `DELETE`. In the example below, we will remove any nodes with the `first_name` of `Steve` from our graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "190af502", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:person {first_name: 'Steve'})\n", + "DELETE n\n", + "RETURN n" + ] + }, + { + "cell_type": "markdown", + "id": "a68c7a83", + "metadata": {}, + "source": [ + "### Removing an Edge\n", + "Removing one or more edges in openCypher is very similar to removing a node, except that we pass the edge to `DELETE`. In the example below, we will remove any edges associated with nodes with the `first_name` of `Joseph` from our graph." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2beb5c72", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (n:person {first_name: 'Joseph'})-[r]-()\n", + "DELETE r\n", + "RETURN r" + ] + }, + { + "cell_type": "markdown", + "id": "fd55000f", + "metadata": {}, + "source": [ + "### Deleting Nodes and Edges\n", + "\n", + "One common stumbling block with `DELETE` queries is trying to delete a node that still has edges attached. If you do this, the following error is returned:\n", + "\n", + "```\n", + "\"Cannot delete node, because it still has relationships. To delete this node, you must first delete its relationships.\"\n", + "```\n", + "\n", + "To avoid this, we can delete all the relationships and then the node, like this:\n", + "\n", + "```\n", + "MATCH (n:person {first_name: 'Joseph'})-[r]-()\n", + "DELETE r\n", + "DELETE n\n", + "```\n", + "\n", + "The other option is to use the `DETACH DELETE` form of the clause, which lets you specify just the node; it removes all adjacent edges and then the node. The example below removes every node with a name of `John Doe`, along with its edges. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "493ac5cf", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n", + "MATCH (n:person {first_name: 'John', last_name: 'Doe'})\n", + "DETACH DELETE n\n", + "RETURN n" + ] + }, + { + "cell_type": "markdown", + "id": "f5d352e5", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "Now that we have gone through the concepts of openCypher mutation queries, it's time to put them into practice. Below are several exercises you can complete to verify your understanding of the material covered in this notebook. 
As practice for what you have learned, please write the openCypher queries specified below.\n", + "\n", + "### Exercise M-1 Create a new person `Leonhard Euler` and connect them to `Dave`?\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Create a new `person` node with a name of `Leonhard Euler` \n", + "* Connect the new node to `Dave` via a `friends` edge\n", + "* Return the new connection\n", + "\n", + "The result for this query is the new edge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36adacaf", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "04ab5b59", + "metadata": {}, + "source": [ + "### Exercise M-2 Upsert a list of followers and add an edge to `Dave`?\n", + "\n", + "Using the data model above, write a query that will:\n", + "\n", + "* Given the following list:\n", + " ```\n", + " [{first_name: 'Taylor', last_name: 'Hall'},{first_name: 'Kelvin', last_name: 'Fernsby'},{first_name: 'Ian', last_name: 'Rochester'}]\n", + " ```\n", + "* Add or update `person` nodes for each item in the list\n", + "* Add or update a `follower` relationship between each new node and `Dave`\n", + "* If the edge is created, write a property `creation` with a value `Created`\n", + "* If the edge already exists, write a property `creation` with a value `Updated`\n", + "* Return the new edge elements\n", + "* This query should be re-runnable without creating new nodes or edges\n", + "\n", + "The results for this query are the three edge elements" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dae9d211", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "678a243d", + "metadata": {}, + "source": [ + "### Exercise M-3 Delete all `follower` edges and remove any connected nodes with no other edges?\n", + "\n", + "Using the data model above, 
write a query that will:\n", + "\n", + "* Find all the followers edges and connected nodes and remove the edges\n", + "* For each of the connected nodes see if they have any other edges\n", + "* If they have edges then ignore them\n", + "* If they have no edges then remove them\n", + "* Return the number of edges removed and the number of nodes removed\n", + "\n", + "The results for this query are:\n", + "\n", + "|node_cnt|edge_cnt|\n", + "|--|--|\n", + "|3|6|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f96b91e5", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc -d $node_labels\n" + ] + }, + { + "cell_type": "markdown", + "id": "b5acefc5", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, we explored how to write queries to mutate data in openCypher. This brings us to the end of our learning session on openCypher. Throughout these notebooks, we have walked through the most common usage patterns for openCypher and applied those to common query patterns for property graph queries. We hope that you now feel that you have a strong foundational understanding of how to approach and write openCypher queries." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/__init__.py b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/openCypher-Exercises-Answer-Key.ipynb b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/openCypher-Exercises-Answer-Key.ipynb new file mode 100644 index 00000000..63c40676 --- /dev/null +++ b/src/graph_notebook/notebooks/06-Language-Tutorials/02-openCypher/openCypher-Exercises-Answer-Key.ipynb @@ -0,0 +1,435 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b6612db4", + "metadata": {}, + "source": [ + "# Answer Key\n", + "\n", + "Below are the answers to the exercises given in the companion notebooks, but first let's set up the visualization options.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16995693-2a2c-42bf-8461-a32062ebaac3", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%graph_notebook_vis_options\n", + "{\n", + " \"groups\": { \n", + " \"person\": {\n", + " \"color\": \"#9ac7bf\"\n", + " },\n", + " \"review\": {\n", + " \"color\": \"#f8cecc\"\n", + " },\n", + " \"city\": {\n", + " \"color\": \"#d5e8d4\"\n", + " },\n", + " \"state\": {\n", + " \"color\": \"#dae8fc\"\n", + " },\n", + " \"review_rating\": {\n", + " \"color\": \"#e1d5e7\"\n", + " },\n", + " \"restaurant\": {\n", + " \"color\": \"#ffe6cc\"\n", + " },\n", + " \"cuisine\": {\n", + " \"color\": \"#fff2cc\"\n", + 
}\n", + " }\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "e76dc888-a0c7-4ed1-bac5-52e96578dfea", + "metadata": {}, + "source": [ + "## 01-Basic-Read-Queries\n", + "\n", + "### Exercise B-1 Find the first name of Dave's friends" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28840d3b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc \n", + "\n", + "MATCH (:person {first_name: 'Dave'})-[:friends]->(f:person) \n", + "RETURN f.first_name" + ] + }, + { + "cell_type": "markdown", + "id": "78ec7c0d", + "metadata": {}, + "source": [ + "### Exercise B-2 Find the first name of the friends of Dave's friends" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65e46798", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc \n", + "\n", + "MATCH (:person {first_name:'Dave'})-[:friends]->()-[:friends]->(p:person) \n", + "RETURN DISTINCT p.first_name" + ] + }, + { + "cell_type": "markdown", + "id": "17a15554", + "metadata": {}, + "source": [ + "### Exercise B-3 Find out how the friends of Dave's friends are connected" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cec9a543", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH p=(d:person {first_name:'Dave'})-[:friends]->()-[:friends]->(:person) \n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "7d193248", + "metadata": {}, + "source": [ + "### Exercise B-4 Which friends should we recommend for Dave?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "670edcbc", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH p=(d:person {first_name:'Dave'})-[:friends]->()-[:friends]->(foff:person) \n", + "WHERE NOT (foff)-[:friends]->(d)\n", + "RETURN DISTINCT foff.first_name" + ] + }, + { + "cell_type": "markdown", + "id": "c222fc68", + "metadata": {}, + "source": [ + "## 02-Variable-Length-Paths\n", + "\n", + "### Exercise 
VLP-1 Find the first name of Dave's friends" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2064f90", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH (:person {first_name:'Dave'})-[:friends*2]->(p:person) \n", + "RETURN DISTINCT p.first_name" + ] + }, + { + "cell_type": "markdown", + "id": "f6f7229b", + "metadata": {}, + "source": [ + "### Exercise VLP-2 Find all `person` nodes connected to Dave" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60a9d820", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH p=(:person {first_name:'Dave'})-[:friends*]->(:person) \n", + "RETURN p" + ] + }, + { + "cell_type": "markdown", + "id": "615c2be9", + "metadata": {}, + "source": [ + "### Exercise VLP-3 Find if Dave and Denise are connected" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd4b86e2", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH p=(:person {first_name:'Dave'})-[:friends*1..]-(:person {first_name:'Denise'})\n", + "RETURN True \n", + "LIMIT 1" + ] + }, + { + "cell_type": "markdown", + "id": "62d52023", + "metadata": {}, + "source": [ + "### Exercise VLP-4 Find all the ways Dave and Denise are connected" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "044ea076", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "\n", + "MATCH p=(:person {first_name:'Dave'})-[:friends*1..]-(:person {first_name:'Denise'})\n", + "RETURN p " + ] + }, + { + "cell_type": "markdown", + "id": "76439a5a", + "metadata": {}, + "source": [ + "## 03-Ordering-Functions-Grouping\n", + "\n", + "### Exercise G-1 What are the 3 highest restaurants?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43555803", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (r:restaurant)<-[:about]-(rev:review)\n", + "WITH r, avg(rev.rating) AS avg_rating\n", + "ORDER BY avg_rating DESC\n", + "LIMIT 3\n", + "MATCH (r)-[:serves]->(c:cuisine)\n", + "RETURN r.name, c.name, avg_rating\n", + "ORDER BY avg_rating DESC" + ] + }, + { + "cell_type": "markdown", + "id": "85285f40", + "metadata": {}, + "source": [ + "### Exercise G-2 Find the top 3 highest rated restaurants in the city where Dave lives" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9726904", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (p:person {first_name: 'Dave'})-[:lives]->(:city)<-[:within]-(r:restaurant)<-[:about]-(v:review)\n", + "WITH r, avg(v.rating) AS rating_average, p\n", + "RETURN r.name AS name,\n", + "r.address AS address, rating_average\n", + "ORDER BY rating_average DESC \n", + "LIMIT 3" + ] + }, + { + "cell_type": "markdown", + "id": "d87c8a58", + "metadata": {}, + "source": [ + "### Exercise G-3 What Mexican or Chinese restaurant near Dave is the highest rated?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d43663c", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (p:person {first_name: 'Dave'})-[:lives]->(:city)<-[:within]-(r:restaurant)-[:serves]->(c:cuisine) \n", + "WHERE c.name IN ['Mexican', 'Chinese'] \n", + "WITH r\n", + "MATCH (r)<-[:about]-(v:review) \n", + "WITH r, avg(v.rating) AS rating_average \n", + "RETURN r.name AS name, \n", + " r.address AS address, rating_average\n", + "ORDER BY rating_average DESC \n", + "LIMIT 1" + ] + }, + { + "cell_type": "markdown", + "id": "323feeda", + "metadata": {}, + "source": [ + "### Exercise G-4 What are the top 3 restaurants, recommended by his friends, where Dave lives? 
(Personalized Recommendation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b1f39a5", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (p:person {first_name: 'Dave'})-[:lives]->(c:city)\n", + "MATCH (p)-[:friends]-()-[:wrote]-(v)-[:about]-\n", + "(r:restaurant)-[:within]-(c) \n", + "RETURN r.name AS name, r.address AS address, avg(v.rating) as rating_average\n", + "ORDER BY rating_average DESC\n", + "LIMIT 3" + ] + }, + { + "cell_type": "markdown", + "id": "755abc7f", + "metadata": {}, + "source": [ + "## 04-Creating-Updating-Delete-Queries\n", + "\n", + "### Exercise M-1 Create a new person `Leonhard Euler` and connect them to `Dave`?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c04c4f37", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc\n", + "CREATE (le:person {first_name: 'Leonhard', last_name: 'Euler'})\n", + "WITH le\n", + "MATCH (dave:person {first_name: 'Dave'})\n", + "CREATE (le)-[r:friends]->(dave)\n", + "RETURN r" + ] + }, + { + "cell_type": "markdown", + "id": "882bb492", + "metadata": {}, + "source": [ + "### Exercise M-2 Upsert a list of followers and add an edge to `Dave`?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f323a03", + "metadata": {}, + "outputs": [], + "source": [ + "%%oc\n", + "WITH [{first_name: 'Taylor', last_name: 'Hall'},{first_name: 'Kelvin', last_name: 'Fernsby'},{first_name: 'Ian', last_name: 'Rochester'}] as followers\n", + "\n", + "UNWIND followers as f\n", + "MATCH (dave:person {first_name: 'Dave'})\n", + "MERGE (n:person {first_name: f.first_name, last_name: f.last_name})\n", + "MERGE (n)-[r:follower]->(dave)\n", + "ON CREATE\n", + " SET r.creation='Created'\n", + "ON MATCH\n", + " SET r.creation='Updated'\n", + "RETURN r" + ] + }, + { + "cell_type": "markdown", + "id": "a2643abc", + "metadata": {}, + "source": [ + "### Exercise M-3 Delete all `follower` edges and remove any connected nodes with no other edges?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48acc3db", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%oc\n", + "MATCH (a)-[r:follower]-()\n", + "DELETE r\n", + "WITH collect(DISTINCT a) as a, count(r) as edge_cnt\n", + "UNWIND a as n\n", + "MATCH (n)\n", + "WHERE size((n)--()) = 0\n", + "DELETE n\n", + "WITH count(n) as node_cnt, edge_cnt\n", + "RETURN node_cnt, edge_cnt" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/test/unit/notebooks/test_validate_notebooks.py b/test/unit/notebooks/test_validate_notebooks.py index 66fdafad..479f1917 100644 --- a/test/unit/notebooks/test_validate_notebooks.py +++ b/test/unit/notebooks/test_validate_notebooks.py @@ -56,7 +56,12 @@ def test_no_extra_notebooks(self): f'{NOTEBOOK_BASE_DIR}/05-Data-Science/00-Identifying-Fraud-Rings-Using-Social-Network-Analytics.ipynb', f'{NOTEBOOK_BASE_DIR}/05-Data-Science/01-Identifying-1st-Person-Synthetic-Identity-Fraud-Using-Graph-Similarity.ipynb', f'{NOTEBOOK_BASE_DIR}/05-Data-Science/02-Logistics-Analysis-using-a-Transportation-Network.ipynb', - f'{NOTEBOOK_BASE_DIR}/06-Language-Tutorials/01-SPARQL/01-SPARQL-Basics.ipynb'] + f'{NOTEBOOK_BASE_DIR}/06-Language-Tutorials/01-SPARQL/01-SPARQL-Basics.ipynb', + f'{NOTEBOOK_BASE_DIR}/06-Language-Tutorials/02-openCypher/01-Basic-Read-Queries.ipynb', + f'{NOTEBOOK_BASE_DIR}/06-Language-Tutorials/02-openCypher/02-Variable-Length-Paths.ipynb', + f'{NOTEBOOK_BASE_DIR}/06-Language-Tutorials/02-openCypher/03-Ordering-Functions-Grouping.ipynb', +
f'{NOTEBOOK_BASE_DIR}/06-Language-Tutorials/02-openCypher/04-Creating-Updating-Delete-Queries.ipynb', + f'{NOTEBOOK_BASE_DIR}/06-Language-Tutorials/02-openCypher/openCypher-Exercises-Answer-Key.ipynb'] notebook_paths = get_all_notebooks_paths() expected_paths.sort() notebook_paths.sort()