Added TF Iris, MXNet MNIST, KMeans high level + low level notebooks (#1)
* Added client libraries notebooks + kmeans notebooks
* Fixed delete endpoint cell in mxnet notebook
* Removed leftover debugging cells from kmeans high level
* Reset IOBytes buffer after writing in KMeans low-level
* kmeans lowlevel notebook fixes
* Move examples into im-python-sdk directory
1 parent 33158d0, commit 20e6ef0
Showing 9 changed files with 1,801 additions and 0 deletions.
@@ -0,0 +1 @@ | ||
**/.ipynb_checkpoints |
@@ -0,0 +1,379 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# End-to-End Example #1\n", | ||
"\n", | ||
"1. [Introduction](#Introduction)\n", | ||
"2. [Prerequisites and Preprocessing](#Prequisites-and-Preprocessing)\n", | ||
" 1. [Permissions and environment variables](#Permissions-and-environment-variables)\n", | ||
" 2. [Data ingestion](#Data-ingestion)\n", | ||
" 3. [Data inspection](#Data-inspection)\n", | ||
" 4. [Data conversion](#Data-conversion)\n", | ||
"3. [Training the K-Means model](#Training-the-K-Means-model)\n", | ||
"4. [Set up hosting for the model](#Set-up-hosting-for-the-model)\n", | ||
" 1. [Import model into hosting](#Import-model-into-hosting)\n", | ||
" 2. [Create endpoint configuration](#Create-endpoint-configuration)\n", | ||
" 3. [Create endpoint](#Create-endpoint)\n", | ||
"5. [Validate the model for use](#Validate-the-model-for-use)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Introduction\n", | ||
"\n", | ||
"Welcome to our first end-to-end example! Today, we're working through a classification problem, specifically of images of handwritten digits, from zero to nine. Let's imagine that this dataset doesn't have labels, so we don't know for sure what the true answer is. In later examples, we'll show the value of \"ground truth\", as it's commonly known.\n", | ||
"\n", | ||
"Today, however, we need to get these digits classified without ground truth. A common method for doing this is a set of methods known as \"clustering\", and in particular, the method that we'll look at today is called k-means clustering. In this method, each point belongs to the cluster with the closest mean, and the data is partitioned into a number of clusters that is specified when framing the problem. In this case, since we know there are 10 clusters, and we have no labeled data (in the way we framed the problem), this is a good fit.\n", | ||
"\n", | ||
"To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on." | ||
] | ||
}, | ||
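{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"_Aside (illustrative only):_ the cell below is a minimal NumPy sketch of a single k-means iteration on a tiny random 2-D dataset. It is not the hosted algorithm, just a reminder of the two steps it repeats at scale: assign each point to its nearest mean, then move each mean to the centroid of its assigned points." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Illustrative sketch only: one iteration of k-means on toy data\n", | ||
"import numpy\n", | ||
"\n", | ||
"rng = numpy.random.RandomState(0)\n", | ||
"points = rng.rand(20, 2)                                       # 20 two-dimensional points\n", | ||
"centroids = points[rng.choice(len(points), 3, replace=False)]  # k = 3 initial means\n", | ||
"\n", | ||
"# Assignment step: each point joins the cluster with the closest mean\n", | ||
"distances = numpy.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)\n", | ||
"assignments = distances.argmin(axis=1)\n", | ||
"\n", | ||
"# Update step: each mean moves to the centroid of its assigned points\n", | ||
"centroids = numpy.array([points[assignments == c].mean(axis=0) for c in range(3)])\n", | ||
"print(assignments)" | ||
] | ||
}, | ||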
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Prequisites and Preprocessing\n", | ||
"\n", | ||
"### Permissions and environment variables\n", | ||
"\n", | ||
"Here we set up the linkage and authentication to AWS services. There are three parts to this:\n", | ||
"\n", | ||
"1. The credentials and region for the account that's running training. Upload the credentials in the normal AWS credentials file format using the jupyter upload feature. The region must always be `us-west-2` during the Beta program.\n", | ||
"2. The roles used to give learning and hosting access to your data. See the documentation for how to specify these.\n", | ||
"3. The S3 bucket that you want to use for training and model data.\n", | ||
"\n", | ||
"_Note:_ Credentials for hosted notebooks will be automated before the final release." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"role='<your IM execution role here>'\n", | ||
"bucket='<<bucket-name>>'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Data ingestion\n", | ||
"\n", | ||
"Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"import pickle, gzip, numpy, urllib.request, json\n", | ||
"\n", | ||
"# Load the dataset\n", | ||
"urllib.request.urlretrieve(\"http://deeplearning.net/data/mnist/mnist.pkl.gz\", \"mnist.pkl.gz\")\n", | ||
"with gzip.open('mnist.pkl.gz', 'rb') as f:\n", | ||
" train_set, valid_set, test_set = pickle.load(f, encoding='latin1')" | ||
] | ||
}, | ||
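{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a quick sanity check, the cell below prints the shapes of the splits we just loaded. Each image is a flattened 28x28 vector, i.e. 784 features, which matches the `feature_dim` we pass to training later." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Each split is an (images, labels) pair; images are flattened 28x28 = 784 vectors\n", | ||
"print('train:', train_set[0].shape, train_set[1].shape)\n", | ||
"print('valid:', valid_set[0].shape, valid_set[1].shape)\n", | ||
"print('test: ', test_set[0].shape, test_set[1].shape)" | ||
] | ||
}, | ||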
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Data inspection\n", | ||
"\n", | ||
"Once the dataset is imported, it's typical as part of the machine learning process to inspect the data, understand the distributions, and determine what type(s) of preprocessing might be needed. You can perform those tasks right here in the notebook. As an example, let's go ahead and look at one of the digits that is part of the dataset." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%matplotlib inline\n", | ||
"import matplotlib.pyplot as plt\n", | ||
"plt.rcParams[\"figure.figsize\"] = (2,10)\n", | ||
"\n", | ||
"\n", | ||
"def show_digit(img, caption='', subplot=None):\n", | ||
" if subplot==None:\n", | ||
" _,(subplot)=plt.subplots(1,1)\n", | ||
" imgr=img.reshape((28,28))\n", | ||
" subplot.axis('off')\n", | ||
" subplot.imshow(imgr, cmap='gray')\n", | ||
" plt.title(caption)\n", | ||
"\n", | ||
"show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Data conversion\n", | ||
"\n", | ||
"Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the hosted implementation of k-means takes recordio-wrapped protobuf, where the data we have today is a pickle-ized numpy array on disk.\n", | ||
"\n", | ||
"Some of the effort involved in the protobuf format conversion is hidden in a library that is imported, below. This library will be folded into the SDK for algorithm authors to make it easier for algorithm authors to support multiple formats. This doesn't __prevent__ algorithm authors from requiring non-standard formats, but it encourages them to support the standard ones.\n", | ||
"\n", | ||
"For this dataset, conversion takes approximately one minute." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"import io\n", | ||
"import im.kmeans\n", | ||
"\n", | ||
"vectors = [t.tolist() for t in train_set[0]]\n", | ||
"labels = [t.tolist() for t in train_set[1]]\n", | ||
"\n", | ||
"buf = io.BytesIO()\n", | ||
"im.kmeans.write_data_as_pb_recordio(vectors, labels, buf)\n", | ||
"buf.seek(0)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Upload training data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"import boto3\n", | ||
"\n", | ||
"key = 'MNIST-1P-Test/recordio-pb-data'\n", | ||
"boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)\n", | ||
"s3_train_data = 's3://{}/{}'.format(bucket, key)\n", | ||
"print('uploaded training data location: {}'.format(s3_train_data))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"output_location = 's3://{}/kmeansoutput'.format(bucket)\n", | ||
"print('training artifacts will be uploaded to: {}'.format(output_location))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Training the K-Means model\n", | ||
"\n", | ||
"Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the kmeans training algorithm - we will visit that in another example.\n", | ||
"\n", | ||
"After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 7 and 11 minutes." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from im.kmeans import KMeans\n", | ||
"\n", | ||
"kmeans = KMeans(role=role,\n", | ||
" train_instance_count=2,\n", | ||
" train_instance_type='c4.8xlarge',\n", | ||
" output_path=output_location,\n", | ||
" k=10,\n", | ||
" feature_dim=784)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"\n", | ||
"kmeans.fit({'train': s3_train_data})" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Set up hosting for the model\n", | ||
"In order to set up hosting, we have to import the model from training to hosting. A common question would be, why wouldn't we automatically go from training to hosting? As we worked through examples of what customers were looking to do with hosting, we realized that the Amazon ML model of hosting was unlikely to be sufficient for all customers.\n", | ||
"\n", | ||
"As a result, we have introduced some flexibility with respect to model deployment, with the goal of additional model deployment targets after launch. In the short term, that introduces some complexity, but we are actively working on making that easier for customers, even before GA.\n", | ||
"\n", | ||
"### Import model into hosting\n", | ||
"Next, you register the model with hosting. This allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or other deployment target." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"\n", | ||
"kmeans_predictor = kmeans.deploy(min_instances=1,\n", | ||
" max_instances=1,\n", | ||
" instance_type='c4.xlarge')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Validate the model for use\n", | ||
"Finally, the customer can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate classifications from the trained model using that endpoint." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"result = kmeans_predictor.predict(train_set[0][30:31])\n", | ||
"print(result)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"OK, a single prediction works.\n", | ||
"\n", | ||
"Let's do a whole batch and see how well the clustering works." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time \n", | ||
"\n", | ||
"result = kmeans_predictor.predict(valid_set[0][0:100])\n", | ||
"clusters = result['labels']" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for cluster in range(10):\n", | ||
" print('\\n\\n\\nCluster {}:'.format(int(cluster)))\n", | ||
" digits = [ img for l, img in zip(clusters, valid_set[0]) if int(l) == cluster ]\n", | ||
" height=((len(digits)-1)//5)+1\n", | ||
" width=5\n", | ||
" plt.rcParams[\"figure.figsize\"] = (width,height)\n", | ||
" _, subplots = plt.subplots(height, width)\n", | ||
" subplots=numpy.ndarray.flatten(subplots)\n", | ||
" for subplot, image in zip(subplots, digits):\n", | ||
" show_digit(image, subplot=subplot)\n", | ||
" for subplot in subplots[len(digits):]:\n", | ||
" subplot.axis('off')\n", | ||
"\n", | ||
" plt.show()" | ||
] | ||
}, | ||
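{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"One rough, optional way to put a number on \"how well the clustering works\" is cluster purity: for each cluster, the fraction of its members whose true digit matches the cluster's majority digit. The sketch below computes it for the 100 predictions above, assuming (as the plotting cell does) that the returned labels can be cast to `int`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Rough purity check for the 100 validation points clustered above\n", | ||
"from collections import Counter\n", | ||
"\n", | ||
"true_labels = valid_set[1][0:100]\n", | ||
"assigned = [int(c) for c in clusters]\n", | ||
"\n", | ||
"matched = 0\n", | ||
"for cluster in set(assigned):\n", | ||
"    members = [int(t) for a, t in zip(assigned, true_labels) if a == cluster]\n", | ||
"    matched += Counter(members).most_common(1)[0][1]\n", | ||
"\n", | ||
"print('overall purity: {:.2f}'.format(matched / len(assigned)))" | ||
] | ||
}, | ||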
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### The bottom line\n", | ||
"\n", | ||
"K-Means clustering is not the best algorithm for image analysis problems, but we do see pretty reasonable clusters being built." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### (Optional) Delete the Endpoint" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print(kmeans_predictor.endpoint)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import im\n", | ||
"\n", | ||
"im.Session().delete_endpoint(kmeans_predictor.endpoint)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.5.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |