diff --git a/introduction_to_amazon_algorithms/linear_learner_abalone/Linear_Learner_Regression_csv_format.ipynb b/introduction_to_amazon_algorithms/linear_learner_abalone/Linear_Learner_Regression_csv_format.ipynb
new file mode 100644
index 0000000000..19129ca971
--- /dev/null
+++ b/introduction_to_amazon_algorithms/linear_learner_abalone/Linear_Learner_Regression_csv_format.ipynb
@@ -0,0 +1,506 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Regression with Amazon SageMaker Linear Learner algorithm\n",
+ "_**Single machine training for regression with Amazon SageMaker Linear Learner algorithm**_\n",
+ "\n",
+ "---\n",
+ "\n",
+ "---\n",
+ "## Contents\n",
+ "1. [Introduction](#Introduction)\n",
+ "2. [Setup](#Setup)\n",
+ "    1. [Fetching and exploring the dataset](#Fetching-and-exploring-the-dataset)\n",
+ "    2. [libsvm to csv conversion](#libsvm-to-csv-conversion)\n",
+ "    3. [Dividing the data](#Dividing-the-data)\n",
+ "    4. [Data ingestion](#Data-ingestion)\n",
+ "3. [Training the Linear Learner model](#Training-the-Linear-Learner-model)\n",
+ "4. [Set up hosting for the model](#Set-up-hosting-for-the-model)\n",
+ "5. [Inference](#Inference)\n",
+ "6. [Delete the Endpoint](#Delete-the-Endpoint)\n",
+ "\n",
+ "---\n",
+ "## Introduction\n",
+ "\n",
+ "This notebook demonstrates the use of Amazon SageMaker’s implementation of the Linear Learner algorithm to train and host a regression model. We use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone). \n",
+ "\n",
+ "The dataset contains 9 fields, starting with the number of rings, which indicates the age of the abalone (the age equals the number of rings plus 1.5). The rings are usually counted under a microscope, so we train a model to predict the age from the other features, which appear in the dataset in the following order:\n",
+ "\n",
+ "'Rings', 'Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight' and 'Shell weight'\n",
+ "\n",
+ "The features from Sex through Shell weight are physical measurements that can be taken with simple tools, so predicting the age from them avoids the complexity of examining each abalone under a microscope.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## Setup\n",
+ "\n",
+ "\n",
+ "This notebook was created and tested on an ml.m4.4xlarge notebook instance.\n",
+ "\n",
+ "Let's start by specifying:\n",
+ "1. The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n",
+ "1. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role ARN string(s).\n",
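+ "\n",
+ "If you run this notebook outside of a SageMaker notebook instance, the execution role lookup in the next cell may not resolve to a suitable role; in that case you could set the role explicitly. The ARN below is only a placeholder:\n",
+ "\n",
+ "```python\n",
+ "# placeholder ARN - replace with a role that has the AmazonSageMakerFullAccess policy attached\n",
+ "role = 'arn:aws:iam::<account-id>:role/<your-sagemaker-execution-role>'\n",
+ "```"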
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import boto3\n",
+ "import re\n",
+ "import sagemaker\n",
+ "\n",
+ "\n",
+ "role = sagemaker.get_execution_role()\n",
+ "region = boto3.Session().region_name\n",
+ "\n",
+ "# S3 bucket for saving code and model artifacts.\n",
+ "# Feel free to specify a different bucket and prefix\n",
+ "bucket = sagemaker.Session().default_bucket()\n",
+ "prefix = 'sagemaker/DEMO-linear-learner-abalone-regression'\n",
+ "bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Fetching and exploring the dataset\n",
+ "\n",
+ "First we download the dataset in its original libsvm format; more information about this format can be found [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html).\n",
+ "\n",
+ "\n",
+ "The dataset contains a total of 9 fields. Throughout this notebook they will be named 'age', 'sex', 'Length', 'Diameter', 'Height', 'Whole.weight', 'Shucked.weight', 'Viscera.weight' and 'Shell.weight', respectively.\n",
+ "\n",
+ "The data frame representation below describes the values of each field.\n",
+ "Note: the age field is an integer, and the remaining features are in the format \"feature_number\":\"feature_value\".\n",
+ "\n",
+ "```\n",
+ "**'data.frame'**:\n",
+ "age            : int 15 7 9 10 7 8 20 16 9 19 ...\n",
+ "Sex            : Factor w/ 3 levels \"F\",\"I\",\"M\", with values 1,2,3\n",
+ "Length         : float 0.455 0.35 0.53 0.44 0.33 0.425 ...\n",
+ "Diameter       : float 0.365 0.265 0.42 0.365 0.255 0.3 ...\n",
+ "Height         : float 0.095 0.09 0.135 0.125 0.08 0.095 ...\n",
+ "Whole.weight   : float 0.514 0.226 0.677 0.516 0.205 ...\n",
+ "Shucked.weight : float 0.2245 0.0995 0.2565 0.2155 0.0895 ...\n",
+ "Viscera.weight : float 0.101 0.0485 0.1415 0.114 0.0395 ...\n",
+ "Shell.weight   : float 0.15 0.07 0.21 0.155 0.055 0.12 ...\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "import urllib.request\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Download the dataset in libsvm format\n",
+ "SOURCE_DATA = 'abalone'\n",
+ "urllib.request.urlretrieve(\"https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone\", SOURCE_DATA)\n",
+ "\n",
+ "# Read the downloaded file into a pandas dataframe\n",
+ "df = pd.read_csv(SOURCE_DATA, sep=' ', encoding='latin1', names=['age','sex','Length','Diameter','Height','Whole.weight','Shucked.weight','Viscera.weight','Shell.weight'])\n",
+ "print(df.head(1))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### libsvm to csv conversion\n",
+ "\n",
+ "Next we convert the dataset into csv format, which is one of the input formats accepted by the Linear Learner algorithm; more information [here](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html#ll-input_output). \n",
+ "\n",
+ "The age field is parsed as an integer, and each feature is extracted from its \"feature_number\":\"feature_value\" pair so that only the value remains; the resulting frame is then written to the output file. Note that the label (age) stays in the first column and no header row is written, which is what the Linear Learner algorithm expects for csv input.\n",
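+ "\n",
+ "For illustration, a libsvm record of the form below becomes a plain csv row after the conversion (the values shown are illustrative):\n",
+ "\n",
+ "```\n",
+ "# libsvm: label followed by feature_number:feature_value pairs\n",
+ "15 1:1 2:0.455 3:0.365 4:0.095 5:0.514 6:0.2245 7:0.101 8:0.15\n",
+ "# csv: label followed by the feature values only\n",
+ "15,1,0.455,0.365,0.095,0.514,0.2245,0.101,0.15\n",
+ "```"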
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "\n",
+ "# import numpy and pandas libraries for working with data\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Read the libsvm file into a pandas dataframe\n",
+ "df = pd.read_csv(SOURCE_DATA, sep=' ', encoding='latin1', names=['age','sex','Length','Diameter','Height','Whole.weight','Shucked.weight','Viscera.weight','Shell.weight'])\n",
+ "\n",
+ "# convert the age to an integer value\n",
+ "df['age'] = df['age'].astype(int)\n",
+ "\n",
+ "# drop any null values\n",
+ "df.dropna(inplace=True)\n",
+ "\n",
+ "# extract the feature values from the libsvm \"feature_number\":\"feature_value\" pairs\n",
+ "features = ['sex','Length','Diameter','Height','Whole.weight','Shucked.weight','Viscera.weight','Shell.weight']\n",
+ "for feature in features:\n",
+ "    if feature == 'sex':\n",
+ "        df[feature] = (df[feature].str.split(\":\", n=1, expand=True)[1]).astype(int)\n",
+ "    else:\n",
+ "        df[feature] = (df[feature].str.split(\":\", n=1, expand=True)[1]).astype(float)\n",
+ "\n",
+ "\n",
+ "# write the final data in csv format (no header, label in the first column)\n",
+ "df.to_csv('new_data_set_float32.csv', sep=',', index=False, header=None)\n",
+ "\n",
+ "print(df.head(1))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Dividing the data\n",
+ "\n",
+ "The following methods split the data into train/validation/test datasets and upload the files to S3.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "\n",
+ "import io\n",
+ "import boto3\n",
+ "import random\n",
+ "\n",
+ "def data_split(FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST):\n",
+ "    # read all records and open one output file per split\n",
+ "    data = [l for l in open(FILE_DATA, 'r')]\n",
+ "    train_file = open(FILE_TRAIN, 'w')\n",
+ "    valid_file = open(FILE_VALIDATION, 'w')\n",
+ "    tests_file = open(FILE_TEST, 'w')\n",
+ "\n",
+ "    num_of_data = len(data)\n",
+ "    num_train = int((PERCENT_TRAIN/100.0)*num_of_data)\n",
+ "    num_valid = int((PERCENT_VALIDATION/100.0)*num_of_data)\n",
+ "    num_tests = int((PERCENT_TEST/100.0)*num_of_data)\n",
+ "\n",
+ "    # randomly assign records to the three splits without replacement\n",
+ "    data_fractions = [num_train, num_valid, num_tests]\n",
+ "    split_data = [[], [], []]\n",
+ "\n",
+ "    rand_data_ind = 0\n",
+ "\n",
+ "    for split_ind, fraction in enumerate(data_fractions):\n",
+ "        for i in range(fraction):\n",
+ "            rand_data_ind = random.randint(0, len(data)-1)\n",
+ "            split_data[split_ind].append(data[rand_data_ind])\n",
+ "            data.pop(rand_data_ind)\n",
+ "\n",
+ "    for l in split_data[0]:\n",
+ "        train_file.write(l)\n",
+ "\n",
+ "    for l in split_data[1]:\n",
+ "        valid_file.write(l)\n",
+ "\n",
+ "    for l in split_data[2]:\n",
+ "        tests_file.write(l)\n",
+ "\n",
+ "    train_file.close()\n",
+ "    valid_file.close()\n",
+ "    tests_file.close()\n",
+ "\n",
+ "def write_to_s3(fobj, bucket, key):\n",
+ "    return boto3.Session(region_name=region).resource('s3').Bucket(bucket).Object(key).upload_fileobj(fobj)\n",
+ "\n",
+ "def upload_to_s3(bucket, channel, filename):\n",
+ "    fobj = open(filename, 'rb')\n",
+ "    key = prefix + '/' + channel + '/' + filename\n",
+ "    url = 's3://{}/{}'.format(bucket, key)\n",
+ "    print('Writing to {}'.format(url))\n",
+ "    write_to_s3(fobj, bucket, key)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Data ingestion\n",
+ "\n",
+ "Next, we read the dataset from the existing repository into memory for preprocessing prior to training. This processing could be done *in situ* by AWS Glue, Amazon Athena, Apache Spark on Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "# the csv file produced by the conversion step\n",
+ "FILE_DATA = 'new_data_set_float32.csv'\n",
+ "\n",
+ "# split the data into train/validation/test files\n",
+ "FILE_TRAIN = 'abalone_dataset1_train.csv'\n",
+ "FILE_VALIDATION = 'abalone_dataset1_validation.csv'\n",
+ "FILE_TEST = 'abalone_dataset1_test.csv'\n",
+ "PERCENT_TRAIN = 70\n",
+ "PERCENT_VALIDATION = 15\n",
+ "PERCENT_TEST = 15\n",
+ "data_split(FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST)\n",
+ "\n",
+ "# upload the files to the S3 bucket\n",
+ "upload_to_s3(bucket, 'train', FILE_TRAIN)\n",
+ "upload_to_s3(bucket, 'validation', FILE_VALIDATION)\n",
+ "upload_to_s3(bucket, 'test', FILE_TEST)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "Let us prepare the handshake between our data channels and the algorithm. To do this, we need to create `sagemaker.inputs.TrainingInput` objects from our data channels ([link](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html)). These objects are then put in a simple dictionary, which the algorithm consumes. Notice that we use a content_type of text/csv for the files that we uploaded in the previous step. We use two channels here: one for training and one for validation. The test samples from above will be used in the inference step.\n",
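+ "\n",
+ "These channel objects are later handed to the estimator's `fit()` call as a simple dictionary keyed by channel name, for example:\n",
+ "\n",
+ "```python\n",
+ "data_channels = {'train': train_data, 'validation': validation_data}\n",
+ "```"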
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create the S3 locations for the training and validation channels\n",
+ "s3_train_data = 's3://{}/{}/train'.format(bucket, prefix)\n",
+ "print('training files will be taken from: {}'.format(s3_train_data))\n",
+ "s3_validation_data = 's3://{}/{}/validation'.format(bucket, prefix)\n",
+ "print('validation files will be taken from: {}'.format(s3_validation_data))\n",
+ "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n",
+ "print('training artifacts output location: {}'.format(output_location))\n",
+ "\n",
+ "# generate the TrainingInput objects accepted by the fit() function\n",
+ "train_data = sagemaker.inputs.TrainingInput(s3_train_data, distribution='FullyReplicated',\n",
+ "                                            content_type='text/csv', s3_data_type='S3Prefix', record_wrapping=None, compression=None)\n",
+ "validation_data = sagemaker.inputs.TrainingInput(s3_validation_data, distribution='FullyReplicated',\n",
+ "                                                 content_type='text/csv', s3_data_type='S3Prefix', record_wrapping=None, compression=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Training the Linear Learner model\n",
+ "\n",
+ "First, we retrieve the image for the Linear Learner algorithm according to the region.\n",
+ "\n",
+ "Then we create an estimator from the SageMaker SDK ([link](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)) using the Linear Learner container image, and we set up the training parameters and hyperparameter configuration.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "# get the linear learner image according to the region\n",
+ "from sagemaker.image_uris import retrieve\n",
+ "container = retrieve('linear-learner', boto3.Session().region_name, version='1')\n",
+ "print(container)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "import boto3\n",
+ "import sagemaker\n",
+ "from time import gmtime, strftime\n",
+ "\n",
+ "sess = sagemaker.Session()\n",
+ "\n",
+ "job_name = 'DEMO-linear-learner-abalone-regression-' + strftime(\"%H-%M-%S\", gmtime())\n",
+ "print(\"Training job\", job_name)\n",
+ "\n",
+ "linear = sagemaker.estimator.Estimator(container,\n",
+ "                                       role,\n",
+ "                                       input_mode='File',\n",
+ "                                       instance_count=1,\n",
+ "                                       instance_type='ml.m4.xlarge',\n",
+ "                                       output_path=output_location,\n",
+ "                                       sagemaker_session=sess\n",
+ "                                       )\n",
+ "\n",
+ "linear.set_hyperparameters(feature_dim=8,\n",
+ "                           epochs=16,\n",
+ "                           wd=0.01,\n",
+ "                           loss='absolute_loss',\n",
+ "                           predictor_type='regressor',\n",
+ "                           normalize_data=True,\n",
+ "                           optimizer='adam',\n",
+ "                           mini_batch_size=100,\n",
+ "                           lr_scheduler_step=100,\n",
+ "                           lr_scheduler_factor=0.99,\n",
+ "                           lr_scheduler_minimum_lr=0.0001,\n",
+ "                           learning_rate=0.1\n",
+ "                           )\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "After configuring the Estimator object and setting its hyperparameters, the only remaining step is to train the algorithm, which the following cell does. Training involves a few steps. First, the instances requested when creating the Estimator are provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instances. Once this is done, the training job begins. Provisioning and data downloading take time, depending on the size of the data, so it might be a few minutes before we start seeing training logs. The logs report the training and validation loss, among other metrics, once per epoch (one full pass over the dataset). These metrics are a proxy for the quality of the model.\n",
+ "\n",
+ "Once the job has finished, a \"Job complete\" message will be printed. The trained model can be found in the S3 bucket that was set up as output_path in the estimator. For this example, the training time is between 4 and 6 minutes.\n",
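+ "\n",
+ "After the job completes, you can also pull the logged metrics programmatically. A minimal sketch, assuming the SDK's `TrainingJobAnalytics` helper and the metric names published for the Linear Learner algorithm:\n",
+ "\n",
+ "```python\n",
+ "from sagemaker.analytics import TrainingJobAnalytics\n",
+ "\n",
+ "# returns a pandas dataframe of the metric values emitted during training\n",
+ "metrics_df = TrainingJobAnalytics(training_job_name=job_name,\n",
+ "                                  metric_names=['validation:objective_loss']).dataframe()\n",
+ "print(metrics_df)\n",
+ "```"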
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "linear.fit(inputs={'train': train_data,\n",
+ "                   'validation': validation_data}, job_name=job_name)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Set up hosting for the model\n",
+ "\n",
+ "Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same instance type that we used to train. Training is a prolonged, compute-heavy job with different compute and memory requirements than hosting typically has. We can choose any instance type to host the model. In our case we trained on an ml.m4.xlarge instance, but we host the model on the less expensive ml.c4.xlarge CPU instance. The endpoint deployment can be accomplished as follows:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "# create the endpoint from the trained model\n",
+ "linear_predictor = linear.deploy(initial_instance_count=1,\n",
+ "                                 instance_type='ml.c4.xlarge')\n",
+ "print(\"\\ncreated endpoint: \" + linear_predictor.endpoint_name)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Inference\n",
+ "\n",
+ "Now that the trained model is deployed at an endpoint that is up and running, we can use this endpoint for inference. To do this, we configure the predictor object ([link](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html)) to serialize the request as text/csv and to deserialize the JSON response received from the endpoint.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# configure the predictor to serialize the csv input and parse the response as json\n",
+ "from sagemaker.serializers import CSVSerializer\n",
+ "from sagemaker.deserializers import JSONDeserializer\n",
+ "\n",
+ "linear_predictor.serializer = CSVSerializer()\n",
+ "linear_predictor.deserializer = JSONDeserializer()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "We then use the test file containing the records that we held out to evaluate the model's predictions. Each run of the cell below selects a random sample from the test set and performs inference on it.\n",
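+ "\n",
+ "For a regressor, the Linear Learner endpoint returns a JSON document of the form shown below (the score value is illustrative), which is why the predicted age is read from `predictions[0]['score']` in the next cell:\n",
+ "\n",
+ "```json\n",
+ "{\"predictions\": [{\"score\": 9.53}]}\n",
+ "```"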
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "import random\n",
+ "\n",
+ "# get a random sample from our test file\n",
+ "test_data = [l for l in open(FILE_TEST, 'r')]\n",
+ "sample = random.choice(test_data).split(',')\n",
+ "actual_age = sample[0]\n",
+ "payload = sample[1:]  # removing the actual age from the sample\n",
+ "payload = ','.join(map(str, payload))\n",
+ "\n",
+ "\n",
+ "# invoke the predictor and analyze the result\n",
+ "result = linear_predictor.predict(payload)\n",
+ "\n",
+ "# extract the prediction value\n",
+ "result = round(float(result['predictions'][0]['score']), 2)\n",
+ "\n",
+ "\n",
+ "# accuracy here is 100 minus the relative error of the prediction, in percent\n",
+ "accuracy = str(round(100 - ((abs(float(result) - float(actual_age)) / float(actual_age)) * 100), 2))\n",
+ "print('Actual age: ', actual_age, '\\nPrediction: ', result, '\\nAccuracy: ', accuracy)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Delete the Endpoint\n",
+ "A running endpoint incurs cost, so as a final clean-up step we delete the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sagemaker.Session().delete_endpoint(linear_predictor.endpoint_name)\n",
+ "print(f\"deleted {linear_predictor.endpoint_name} successfully!\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "conda_python3",
+ "language": "python",
+ "name": "conda_python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}