From 6ccaf6c3a71bd7c1508f507ec70590911af70d4c Mon Sep 17 00:00:00 2001 From: Ben Lackey Date: Fri, 19 Aug 2022 11:07:21 -0700 Subject: [PATCH] Add SageMaker Autopilot and Neo4j portfolio churn notebook. (#3505) * Add SageMaker Autopilot and Neo4j portfolio churn notebook. * update table of contents for graph embedding notebook * correct link * newline * note on edgar, s3 * notes on ASG * url anonymized * spelling * use s3 * spelling * name for link * comment drop * formatting * 20 minutes * more descriptive va name * branding issues * remove extra comment * note on validation * conclusion * no more ' * brackets on URL * black-nb -l 100 sagemaker_autopilot_neo4j_portfolio_churn.ipynb * incorporate Julia changes to downloadNotebook function * performance issue * working with large notebook * clear outputs. run linter one more time * typo * render link * format * remove link * insert link * no dash * fiddling w link * maybe it's a bad character escape? * AutoPilot caps * camel case SageMaker * bucket specfics * Bump version to 4.4.9 from 4.4.8 * add stack name, disk size * add note per Aramide on stack delete. * note * typos Co-authored-by: Julia Kroll <75504951+jkroll-aws@users.noreply.github.com> --- README.md | 1 + autopilot/index.rst | 1 + ...aker_autopilot_neo4j_portfolio_churn.ipynb | 944 ++++++++++++++++++ 3 files changed, 946 insertions(+) create mode 100644 autopilot/sagemaker_autopilot_neo4j_portfolio_churn.ipynb diff --git a/README.md b/README.md index 14d69da747..ff96af648a 100644 --- a/README.md +++ b/README.md @@ -75,6 +75,7 @@ These examples introduce SageMaker Autopilot. Autopilot automatically performs f - [Customer Churn AutoML](autopilot/) shows how to use SageMaker Autopilot to automatically train a model for the [Predicting Customer Churn](introduction_to_applying_machine_learning/xgboost_customer_churn) task. - [Targeted Direct Marketing AutoML](autopilot/) shows how to use SageMaker Autopilot to automatically train a model. 
- [Housing Prices AutoML](sagemaker-autopilot/housing_prices) shows how to use SageMaker Autopilot for a linear regression problem (predict housing prices). +- [Portfolio Churn Prediction with Amazon SageMaker Autopilot and Neo4j](autopilot/sagemaker_autopilot_neo4j_portfolio_churn.ipynb) shows how to use SageMaker Autopilot with graph embeddings to predict investment portfolio churn. ### Introduction to Amazon Algorithms diff --git a/autopilot/index.rst b/autopilot/index.rst index b8afd1c000..d8e2f4a3f7 100644 --- a/autopilot/index.rst +++ b/autopilot/index.rst @@ -8,6 +8,7 @@ Get started with Autopilot sagemaker_autopilot_direct_marketing sagemaker_autopilot_abalone_parquet_input + sagemaker_autopilot_neo4j_portfolio_churn Feature selection diff --git a/autopilot/sagemaker_autopilot_neo4j_portfolio_churn.ipynb b/autopilot/sagemaker_autopilot_neo4j_portfolio_churn.ipynb new file mode 100644 index 0000000000..3f8774acfe --- /dev/null +++ b/autopilot/sagemaker_autopilot_neo4j_portfolio_churn.ipynb @@ -0,0 +1,944 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "4rQ6G65n6OxV" + }, + "source": [ + "# Portfolio Churn Prediction with Amazon SageMaker AutoPilot and Neo4j\n", + "This notebook describes how to use Neo4j and SageMaker together. In it you connect to a Neo4j instance, load data and compute an embedding. You then load that data into Amazon S3. Finally, you use SageMaker to train a model using the new embedding as an additional feature. \n", + "\n", + "The data set represents a binary classification problem based on data from the SEC's EDGAR database. It was scraped from the EDGAR system using the code [here](https://github.com/neo4j-partners/neo4j-sec-edgar-form13). The data set consists of Form 13 data, the quarterly filings of asset managers with $100M or more of assets under management (AUM).\n", + "\n", + "**Important:** This example notebook is for demonstrative purposes only. 
It is not financial advice and should not be relied on as financial or investment advice." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IdMFRbqGzSqF" + }, + "source": [ + "## Deploy Neo4j\n", + "You're going to need a Neo4j deployment to run this lab. The easiest way to get that is via the [AWS Marketplace](https://aws.amazon.com/marketplace/seller-profile?id=23ec694a-d2af-4641-b4d3-b7201ab2f5f9). Select \"Neo4j Enterprise Edition\" and deploy that. Suggested parameters are:\n", + "\n", + "* Stack name - neo4j-ee\n", + "* Graph Database Version - 4.4.9\n", + "* Install Graph Data Science - True\n", + "* Graph Data Science License Key - None\n", + "* Install Bloom - False\n", + "* Bloom License Key - None\n", + "* Password - Enter something here\n", + "* Node Count - 1\n", + "* Instance Type - r6i.4xlarge\n", + "* Disk Size - 100\n", + "* SSH CIDR - 0.0.0.0/0\n", + "\n", + "The Marketplace listing deploys an Auto Scaling Group (ASG) and a Load Balancer (LB) in front of that. When deployment is complete, you can get the DNS name of your LB from the console and use that to connect. You can view deployed NLBs at [Load Balancer](https://console.aws.amazon.com/ec2/v2/home?#LoadBalancers:sort=loadBalancerName).\n", + "\n", + "If you need to change any parameters after you've deployed, you'll want to delete the stack and redeploy rather than attempting to update the stack." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9MwTYwKk6OxX" + }, + "source": [ + "## Using the Neo4j API\n", + "Now that we have a Neo4j deployment, let's connect to Neo4j. First off, install the Neo4j Graph Data Science package." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FT0KaLYj6OxX" + }, + "outputs": [], + "source": [ + "%pip install graphdatascience" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sFokFbiL6OxY" + }, + "source": [ + "Now, you're going to need the connection string and credentials from the deployment you created above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "P41l_P4zzSqF" + }, + "outputs": [], + "source": [ + "# Edit these variables!\n", + "DB_URL = \"neo4j://.amazonaws.com:7687\"\n", + "DB_PASS = \"\"\n", + "\n", + "# You can leave this default\n", + "DB_USER = \"neo4j\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8lUkSvmozSqF" + }, + "outputs": [], + "source": [ + "from graphdatascience import GraphDataScience\n", + "\n", + "gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_7-MlyTU6OxZ" + }, + "source": [ + "## Load Data into Neo4j\n", + "Now that we've got our connection object, let's load the dataset into Neo4j.\n", + "\n", + "The dataset is pulled from the SEC's EDGAR database. It consists of public Form 13 filings. Asset managers with over \$100M of assets under management (AUM) are required to file Form 13 quarterly. Those filings are then made available to the public over HTTP. The CSV files loaded below were pulled from EDGAR using the Python scripts linked above. We've filtered the data to only include filings over \$10M in value.\n", + "\n", + "We're going to create constraints for our data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VxgUxjVQ6OxZ" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"CREATE CONSTRAINT IF NOT EXISTS ON (p:Company) ASSERT (p.cusip) IS NODE KEY;\"\n", + ")\n", + "display(result)\n", + "\n", + "result = gds.run_cypher(\n", + " \"CREATE CONSTRAINT IF NOT EXISTS ON (p:Manager) ASSERT (p.filingManager) IS NODE KEY;\"\n", + ")\n", + "display(result)\n", + "\n", + "result = gds.run_cypher(\n", + " \"CREATE CONSTRAINT IF NOT EXISTS ON (p:Holding) ASSERT (p.filingManager, p.cusip, p.reportCalendarOrQuarter) IS NODE KEY;\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BdKOItse6Oxa" + }, + "source": [ + "Now let's load the nodes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JgCgdkCt6Oxa" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " LOAD CSV WITH HEADERS FROM \"https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv\" AS row\n", + " MERGE (c:Company {cusip:row.cusip})\n", + " ON CREATE SET\n", + " c.nameOfIssuer=row.nameOfIssuer\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MqJZYNES6Oxa" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " LOAD CSV WITH HEADERS FROM \"https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv\" AS row\n", + " MERGE (m:Manager {filingManager:row.filingManager})\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rERDJtCi6Oxa" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " LOAD CSV WITH HEADERS FROM \"https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv\" AS row\n", + " MERGE (h:Holding {filingManager:row.filingManager, cusip:row.cusip, 
reportCalendarOrQuarter:row.reportCalendarOrQuarter})\n", + " ON CREATE SET\n", + " h.value=row.value, \n", + " h.shares=row.shares,\n", + " h.target=row.target,\n", + " h.nameOfIssuer=row.nameOfIssuer\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vzdC3x316Oxa" + }, + "source": [ + "Now let's create relationships between those nodes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rggD5Yho6Oxa" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " LOAD CSV WITH HEADERS FROM \"https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv\" AS row\n", + " MATCH (m:Manager {filingManager:row.filingManager})\n", + " MATCH (h:Holding {filingManager:row.filingManager, cusip:row.cusip, reportCalendarOrQuarter:row.reportCalendarOrQuarter})\n", + " MERGE (m)-[r:OWNS]->(h)\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rpsRbdhe6Oxb" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " LOAD CSV WITH HEADERS FROM \"https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv\" AS row\n", + " MATCH (h:Holding {filingManager:row.filingManager, cusip:row.cusip, reportCalendarOrQuarter:row.reportCalendarOrQuarter})\n", + " MATCH (c:Company {cusip:row.cusip})\n", + " MERGE (h)-[r:PARTOF]->(c)\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZtJy4eO_zSqF" + }, + "source": [ + "## Graph Data Science\n", + "Now we're going to use Neo4j Graph Data Science to create an in-memory graph representation of the data. We'll enhance that representation with features we engineer using a graph embedding." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "x76ZEtR16Oxb" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " CALL gds.graph.project(\n", + " \"mygraph\",\n", + " [\"Company\", \"Manager\", \"Holding\"],\n", + " {\n", + " OWNS: {orientation: \"UNDIRECTED\"},\n", + " PARTOF: {orientation: \"UNDIRECTED\"}\n", + " }\n", + " )\n", + " YIELD\n", + " graphName AS graph,\n", + " relationshipProjection AS readProjection,\n", + " nodeCount AS nodes,\n", + " relationshipCount AS rels\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HiwL552u6Oxb" + }, + "source": [ + "If you get an error saying the graph already exists, that's probably because you ran this code before. You can destroy it using this command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EPZIIIJc6Oxb" + }, + "outputs": [], + "source": [ + "# result = gds.run_cypher(\n", + "# \"\"\"\n", + "# CALL gds.graph.drop(\"mygraph\")\n", + "# \"\"\"\n", + "# )\n", + "# display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zG1novOj6Oxb" + }, + "source": [ + "Now, let's list the details of the graph to make sure the projection was created as we want." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yyaw5itE6Oxb" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " CALL gds.graph.list()\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XEQAChAa6Oxb" + }, + "source": [ + "Now we can generate an embedding from that graph. This is a new feature we can use in our predictions. We're using FastRP, which is a more full-featured and higher-performance alternative to Node2Vec. 
You can learn more about that at the [Fast Random Projection\n", + "](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/) documentation page.\n", + "\n", + "There are a bunch of parameters we could adjust in this. One of the most obvious is the embeddingDimension. The documentation covers many more." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qLFxuPb66Oxc" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " CALL gds.fastRP.mutate(\"mygraph\",{\n", + " embeddingDimension: 16,\n", + " randomSeed: 1,\n", + " mutateProperty:\"embedding\"\n", + " })\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iRpgM-NV6Oxc" + }, + "source": [ + "That creates an embedding for each node type. However, we only want the embedding on the nodes of type holding.\n", + "\n", + "We're going to take the embedding from our projection and write it to the holding nodes in the underlying database." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3dBS16zD6Oxc" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " CALL gds.graph.writeNodeProperties(\"mygraph\", [\"embedding\"], [\"Holding\"])\n", + " YIELD writeMillis\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mK6LeBne6Oxc" + }, + "outputs": [], + "source": [ + "result = gds.run_cypher(\n", + " \"\"\"\n", + " MATCH (n:Holding) RETURN n\n", + " \"\"\"\n", + ")\n", + "display(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1N_x38Ci6Oxc" + }, + "source": [ + "Note that this query will take 2-3 minutes to run as it's grabbing nearly half a million nodes along with all their properties and our new embedding." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "197ZaAH16Oxc" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "df = pd.DataFrame([dict(record.items()) for record in result[\"n\"]])\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A3esUO8s6Oxc" + }, + "source": [ + "Note that each value in the embedding column is an array. To make this dataset more consumable, we should flatten that out into multiple individual features: embedding_0, embedding_1, ... embedding_n.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-i0_txCB6Oxc" + }, + "outputs": [], + "source": [ + "embeddings = pd.DataFrame(df[\"embedding\"].values.tolist()).add_prefix(\"embedding_\")\n", + "merged = df.drop(columns=[\"embedding\"]).merge(embeddings, left_index=True, right_index=True)\n", + "merged" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Zb7lH366Oxc" + }, + "source": [ + "Now that we have the data formatted properly, let's split it into training, testing and validation sets. We'll write those to disk.\n", + "\n", + "Our data is, in some sense, a time series. We're going to window over three quarters. Q4 of 2021 is used to generate labels, so it's not present in the data set. That leaves Q3 as our validation data set. Q2 becomes test and Q1 is for training.\n", + "\n", + "We take this approach rather than generating random folds or similar to avoid time-based leakage." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uLg34zlu6Oxc" + }, + "outputs": [], + "source": [ + "df = merged\n", + "\n", + "train = df.loc[df[\"reportCalendarOrQuarter\"] == \"03-31-2021\"]\n", + "train.to_csv(\"train.csv\", index=False)\n", + "\n", + "test = df.loc[df[\"reportCalendarOrQuarter\"] == \"06-30-2021\"]\n", + "test = test.drop([\"target\"], axis=1)\n", + "test.to_csv(\"test.csv\", index=False)\n", + "\n", + "validate = df.loc[df[\"reportCalendarOrQuarter\"] == \"09-30-2021\"]\n", + "validate = validate.drop([\"target\"], axis=1)\n", + "validate.to_csv(\"validate.csv\", index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SageMaker Connection\n", + "Let's set up our SageMaker connection." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker\n", + "import boto3\n", + "\n", + "region = boto3.Session().region_name\n", + "\n", + "session = sagemaker.Session()\n", + "bucket = session.default_bucket()\n", + "prefix = \"sagemaker/form13\"\n", + "\n", + "role = sagemaker.get_execution_role()\n", + "\n", + "sm = boto3.Session().client(service_name=\"sagemaker\", region_name=region)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Upload to Amazon S3\n", + "Now we're going to upload the training, testing and validation data to our default SageMaker bucket." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_data_s3_path = session.upload_data(path=\"train.csv\", key_prefix=prefix + \"/train\")\n", + "print(\"Training data uploaded to: \" + train_data_s3_path)\n", + "\n", + "test_data_s3_path = session.upload_data(path=\"test.csv\", key_prefix=prefix + \"/test\")\n", + "print(\"Testing data uploaded to: \" + test_data_s3_path)\n", + "\n", + "validation_data_s3_path = session.upload_data(path=\"validate.csv\", key_prefix=prefix + \"/validate\")\n", + "print(\"Validation data uploaded to: \" + validation_data_s3_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting up the SageMaker AutoPilot Job\n", + "After uploading the dataset to Amazon S3, you can invoke AutoPilot to find the best ML pipeline to train a model on this dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "auto_ml_job_config = {\"CompletionCriteria\": {\"MaxCandidates\": 3}}\n", + "\n", + "input_data_config = [\n", + " {\n", + " \"DataSource\": {\n", + " \"S3DataSource\": {\n", + " \"S3DataType\": \"S3Prefix\",\n", + " \"S3Uri\": \"s3://{}/{}/train\".format(bucket, prefix),\n", + " }\n", + " },\n", + " \"TargetAttributeName\": \"target\",\n", + " }\n", + "]\n", + "\n", + "output_data_config = {\"S3OutputPath\": \"s3://{}/{}/output\".format(bucket, prefix)}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Launching the SageMaker AutoPilot Job\n", + "You can now launch the AutoPilot job by calling the `create_auto_ml_job` method." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from time import gmtime, strftime, sleep\n", + "\n", + "timestamp_suffix = strftime(\"%d-%H-%M-%S\", gmtime())\n", + "\n", + "auto_ml_job_name = \"automl-form13-\" + timestamp_suffix\n", + "print(\"AutoMLJobName: \" + auto_ml_job_name)\n", + "\n", + "sm.create_auto_ml_job(\n", + " AutoMLJobName=auto_ml_job_name,\n", + " InputDataConfig=input_data_config,\n", + " OutputDataConfig=output_data_config,\n", + " AutoMLJobConfig=auto_ml_job_config,\n", + " RoleArn=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tracking SageMaker AutoPilot job progress\n", + "A SageMaker AutoPilot job consists of the following high-level steps:\n", + "\n", + "* Analyzing Data, where the dataset is analyzed and AutoPilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets. \n", + "* Feature Engineering, where AutoPilot performs feature transformation on individual features of the dataset as well as at an aggregate level. \n", + "* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline).\n", + "\n", + "This job typically takes 20-80 minutes to run. That time presumably varies based on the underlying ML algorithm in AutoPilot as well as provisioning times for components of AutoPilot." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"JobStatus - Secondary Status\")\n", + "print(\"----------------------------\")\n", + "\n", + "describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n", + "print(describe_response[\"AutoMLJobStatus\"] + \" - \" + describe_response[\"AutoMLJobSecondaryStatus\"])\n", + "job_run_status = describe_response[\"AutoMLJobStatus\"]\n", + "\n", + "while job_run_status not in (\"Failed\", \"Completed\", \"Stopped\"):\n", + " describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n", + " job_run_status = describe_response[\"AutoMLJobStatus\"]\n", + "\n", + " print(\n", + " describe_response[\"AutoMLJobStatus\"] + \" - \" + describe_response[\"AutoMLJobSecondaryStatus\"]\n", + " )\n", + " sleep(30)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Results\n", + "Now use the describe_auto_ml_job API to look up the best candidate selected by the SageMaker AutoPilot job." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pprint\n", + "\n", + "best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)[\"BestCandidate\"]\n", + "best_candidate_name = best_candidate[\"CandidateName\"]\n", + "\n", + "print(\"CandidateName: \" + best_candidate_name)\n", + "print(\n", + " \"FinalAutoMLJobObjectiveMetricName: \"\n", + " + best_candidate[\"FinalAutoMLJobObjectiveMetric\"][\"MetricName\"]\n", + ")\n", + "print(\n", + " \"FinalAutoMLJobObjectiveMetricValue: \"\n", + " + str(best_candidate[\"FinalAutoMLJobObjectiveMetric\"][\"Value\"])\n", + ")\n", + "print()\n", + "pprint.pprint(best_candidate)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Batch Inference\n", + "Now that we completed the SageMaker AutoPilot job on the dataset, let's create a model from the best candidate with Inference Pipelines." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model_name = \"automl-form13-model-\" + timestamp_suffix\n", + "model = sm.create_model(\n", + " Containers=best_candidate[\"InferenceContainers\"], ModelName=model_name, ExecutionRoleArn=role\n", + ")\n", + "print(\"Model ARN corresponding to the best candidate is: {}\".format(model[\"ModelArn\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use batch inference through Amazon SageMaker batch transform. The same model can also be deployed to perform online inference using Amazon SageMaker hosting." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "transform_job_name = \"automl-form13-transform-\" + timestamp_suffix\n", + "\n", + "transform_input = {\n", + " \"DataSource\": {\"S3DataSource\": {\"S3DataType\": \"S3Prefix\", \"S3Uri\": test_data_s3_path}},\n", + " \"ContentType\": \"text/csv\",\n", + " \"CompressionType\": \"None\",\n", + " \"SplitType\": \"Line\",\n", + "}\n", + "\n", + "transform_output = {\n", + " \"S3OutputPath\": \"s3://{}/{}/inference-results\".format(bucket, prefix),\n", + "}\n", + "\n", + "transform_resources = {\"InstanceType\": \"ml.m5.4xlarge\", \"InstanceCount\": 1}\n", + "\n", + "sm.create_transform_job(\n", + " TransformJobName=transform_job_name,\n", + " ModelName=model_name,\n", + " TransformInput=transform_input,\n", + " TransformOutput=transform_output,\n", + " TransformResources=transform_resources,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can watch the transform job for completion. That takes approximately 20 minutes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"JobStatus\")\n", + "print(\"---------\")\n", + "\n", + "describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)\n", + "job_run_status = describe_response[\"TransformJobStatus\"]\n", + "print(job_run_status)\n", + "\n", + "while job_run_status not in (\"Failed\", \"Completed\", \"Stopped\"):\n", + " describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)\n", + " job_run_status = describe_response[\"TransformJobStatus\"]\n", + " print(job_run_status)\n", + " sleep(30)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let’s get the URL of the transform job results. You can open this in S3." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bucket = session.default_bucket()\n", + "key = \"{}/inference-results/test.csv.out\".format(prefix)\n", + "url = \"s3://\" + bucket + \"/\" + key\n", + "\n", + "print(url)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## View All Candidates\n", + "You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker AutoPilot and sort them by their final performance metric." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "candidates = sm.list_candidates_for_auto_ml_job(\n", + " AutoMLJobName=auto_ml_job_name, SortBy=\"FinalObjectiveMetricValue\"\n", + ")[\"Candidates\"]\n", + "index = 0\n", + "for candidate in candidates:\n", + " print(\n", + " str(index)\n", + " + \" \"\n", + " + candidate[\"CandidateName\"]\n", + " + \" \"\n", + " + str(candidate[\"FinalAutoMLJobObjectiveMetric\"][\"Value\"])\n", + " )\n", + " index += 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Candidate Generation Notebook\n", + "SageMaker AutoPilot also auto-generates a Candidate Definitions notebook. This notebook can be used to interactively step through the various steps taken by SageMaker AutoPilot to arrive at the best candidate. This notebook can also be used to override various runtime parameters like parallelism, hardware used, algorithms explored, feature extraction scripts and more." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This code downloads a file from our SageMaker bucket using the SageMaker session." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def downloadNotebook(s3_path):\n", + "    session = sagemaker.Session()\n", + "    role = sagemaker.get_execution_role()\n", + "\n", + "    # reformat the s3 URL into something boto3 can handle\n", + "    s3_path_parts = s3_path.replace(\"s3://\", \"\").split(\"/\")\n", + "    bucket, key, file = s3_path_parts[0], \"/\".join(s3_path_parts[1:]), s3_path_parts[-1]\n", + "\n", + "    print(bucket)\n", + "    print(key)\n", + "    print(file)\n", + "\n", + "    print(\"file\" + file)\n", + "    notebook = session.read_s3_file(bucket, key)\n", + "    with open(file, \"w\") as text_file:\n", + "        text_file.write(notebook)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can download the notebook with the command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_s3_path = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)[\"AutoMLJobArtifacts\"][\n", + "    \"CandidateDefinitionNotebookLocation\"\n", + "]\n", + "downloadNotebook(notebook_s3_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Exploration Notebook\n", + "SageMaker Autopilot also auto-generates a Data Exploration notebook. This code will download that notebook:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_s3_path = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)[\"AutoMLJobArtifacts\"][\n", + "    \"DataExplorationNotebookLocation\"\n", + "]\n", + "downloadNotebook(notebook_s3_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleanup\n", + "SageMaker stores its data in an Amazon S3 bucket. 
You may want to delete the results of our job from that bucket once you're done working with them.\n", + "\n", + "The AWS Marketplace listing we used to deploy Neo4j Enterprise Edition created a CloudFormation stack. To delete the deployment, navigate to [AWS CloudFormation](https://console.aws.amazon.com/cloudformation) in the console and delete the stack there. Be sure to delete the entire stack as that will delete all the subcomponents of the stack." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "In this notebook, you deployed Neo4j Enterprise Edition. Within SageMaker Studio, you then loaded a data set into Neo4j Graph Database. You used Neo4j Graph Data Science to compute a graph embedding on that dataset. Using that embedding, you ran a SageMaker AutoPilot job and inspected the output.\n", + "\n", + "This same flow can be repurposed to add graph embeddings to your own machine learning jobs. Graph embeddings are just one sort of graph feature that can be used in machine learning. The approach we used here would apply to incorporating other features like betweenness or neighborhood as well." + ] + } + ], + "metadata": { + "colab": { + "name": "embedding.ipynb", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3.9.5 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.5" + }, + "vscode": { + "interpreter": { + "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}