From 32bc20f771249c440c7a5df105a7976208cbaa03 Mon Sep 17 00:00:00 2001 From: neelamkoshiya Date: Tue, 9 Aug 2022 15:34:12 -0700 Subject: [PATCH] Delete TS-Workshop.ipynb --- .../timeseries-dataflow/TS-Workshop.ipynb | 923 ------------------ 1 file changed, 923 deletions(-) delete mode 100644 sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb diff --git a/sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb b/sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb deleted file mode 100644 index 5f2ea3944e..0000000000 --- a/sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb +++ /dev/null @@ -1,923 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Amazon SageMaker Data Wrangler time series workshop\n", - "\n", - "Our business goal: predict the number of New York City yellow taxi pickups per pickup zone per hour for the next 24 hours, and provide some insights for drivers such as average tips and average distance.\n", - "The first and most important step toward this goal is data preparation, and this workshop focuses on that step. \n", - "\n", - "Data used in this demo notebook:\n", - "- Original data source for all open data from 2008-now: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page\n", - "- AWS-hosted location: https://registry.opendata.aws/nyc-tlc-trip-records-pds/\n", - "\n", - "The raw data is split into one file per month per service (Yellow, Green, or ForHire) from 2008 through 2020, with each file around 600 MB. The entire raw data S3 bucket is huge. \n", - "\n", - "We will use just 14 files: yellow cabs for Jan-Dec 2019 and Jan-Feb 2020, to avoid COVID effects." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Amazon SageMaker Data Wrangler can use different data sources, but we need an S3 bucket to act as the source of our data. The code below creates a bucket with a unique `bucket_name`.\n", - "\n", - "We recommend keeping the S3 bucket in the same region as Amazon SageMaker Data Wrangler."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import boto3\n", - "import json\n", - "import random" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r bucket_name\n", - "%store -r data_uploaded\n", - "%store -r region\n", - "\n", - "try:\n", - " bucket_name\n", - " data_uploaded\n", - "except NameError:\n", - " data_uploaded = False" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if data_uploaded:\n", - " print(f'using previously uploaded data into {bucket_name}')\n", - "else:\n", - " # Sets the same region as current Amazon SageMaker Notebook\n", - " with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n", - " data = json.load(notebook_info)\n", - " resource_arn = data['ResourceArn']\n", - " region = resource_arn.split(':')[3]\n", - " print('region:', region)\n", - "\n", - " # Or you can specify the region where your bucket and model will be located in this region\n", - " # region = \"us-east-1\" \n", - "\n", - " s3 = boto3.resource('s3')\n", - " account_id = boto3.client('sts').get_caller_identity().get('Account')\n", - " bucket_name = account_id + \"-\" + region + \"-\" + \"datawranglertimeseries\" + \"-\" + str(random.randrange(0, 10001, 4))\n", - " print('bucket_name:', bucket_name)\n", - "\n", - " try: \n", - " print(f'creating a S3 bucket in {region}')\n", - " if region == \"us-east-1\":\n", - " s3.create_bucket(Bucket=bucket_name)\n", - " else:\n", - " s3.create_bucket(\n", - " Bucket = bucket_name,\n", - " CreateBucketConfiguration={'LocationConstraint': region}\n", - " )\n", - " except Exception as e:\n", - " print (e)\n", - " print(\"Bucket already exists. 
Using bucket\", bucket_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First we need to download the data (training data) to our new bucket" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "CopySource = [\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-01.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-02.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-03.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-04.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-05.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-06.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-07.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-08.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-09.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-10.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-11.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-12.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2020-01.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2020-02.parquet'}\n", - "]\n", - "\n", - "if not data_uploaded:\n", - " for copy_source in CopySource:\n", - " s3.meta.client.copy(copy_source, bucket_name, copy_source['Key'])\n", - " \n", - " data_uploaded = True\n", - "else:\n", - " print(f'skipping data upload as data already in the bucket {bucket_name}')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 ls s3://{bucket_name}/'trip data'/" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store bucket_name\n", - "%store data_uploaded\n", - "%store region" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we have raw data in our S3 bucket and ready to explore it and build a training dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Dataset Import\n", - "\n", - "Our first step is to launch a new Data Wrangler session and there are multiple ways how to do that. For example, use the following: Click File -> New -> Data Wrangler Flow\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![newDWF](./pictures/newDWF.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Amazon SageMaker will start to provision a resources for you and you a could find a new Data Wrangler Flow file in a File Browser section\n", - "![DWStarting](./pictures/DWStarting.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Lets rename our new workflow: Right click on file -> Rename Data Wrangler Flow\n", - "\n", - "![DWRename](./pictures/DWRename.png)\n", - "\n", - "Put a new name, for example: `TS-Workshop-DataPreparation.flow`\n", - "\n", - "![DWRename](./pictures/DWNewName.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In a few minutes DataWragler will finish to provision recources and you could see \"Import Data\" screen. 
\n", - "Data Wrangles support many data sources: Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, Databricks.\n", - "Our data already in S3, let's import it by clicking \"Amazon S3\" button. \n", - "![SelectS3](./pictures/SelectS3.png)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f'S3 bucket name with data: {bucket_name}')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You will see all your S3 buckets and if you want you could manually select a bucket which we created at the begining of the workshop. \n", - "\n", - "![EntireS3](./pictures/EntireS3.png)\n", - "\n", - "As you might have thouthands of buckets I recommend to provide a name in a \"S3 URI path field\". Use a `s3`-format `s3://= 0)\n", - "df = df.filter(df.tip_amount >= 0)\n", - "df = df.filter(df.total_amount >= 0)\n", - "df = df.filter(df.duration >= 1)\n", - "df = df.filter((1 <= df.PULocationID) & (df.PULocationID <= 263))\n", - "df = df.filter((df.tpep_pickup_datetime >= \"2019-01-01 00:00:00\") & (df.tpep_pickup_datetime < \"2020-03-01 00:00:00\"))\n", - "``` \n", - "\n", - "![FilterTransform](./pictures/FilterTransform.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When transfromation is applied on a sampled data you should see all curent steps and a preview of a resulted dataset with a new column duration and without column tpep_dropoff_datetime\n", - "![FilterTransformResult](./pictures/FilterTransformResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Quick analysis of dataset\n", - "\n", - "Amazon SageMaker Data Wrangler includes built-in analysis that help you generate visualizations and data insights in a few clicks. You can create custom analysis using your own code. We use the **Table Summary** analysis to quickly summarize our data. \n", - "\n", - "For columns with numerical data, including long and float data, table summary reports the number of entries (`count`), minimum (`min`), maximum (`max`), mean, and standard deviation (`stddev`) for each column. \n", - "\n", - "For columns with non-numerical data, including columns with String, Boolean, or DateTime data, table summary reports the number of entries (`count`), least frequent value (`min`), and most frequent value (`max`). \n", - "\n", - "To create this analyses you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose \"Add analyses\".\n", - "![addFirstAnalyses](./pictures/addFirstAnalyses.png)\n", - "1. In a \"analyses type\" drop down menu select \"Table Summary\" and provide a name for \"Analysis name\", for example: \"Cleaned dataset summary\"\n", - "![AnalysesConfig](./pictures/AnalysesConfig.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the analyses.\n", - "1. You could find your first analyses on a \"Analysis\" tab. All future visualisations will could be also found here. \n", - "![AnalysesPreview](./pictures/AnalysesPreview.png)\n", - "1. Click on analyses icon to open it. Take a look on a data. Most interesting part is a summary for \"duration\" column: maximum value is 1439 and this is minutes! 1439 minutes = almost 24 hours and this is defenetly an error which will reduce quality of our model. Such errors also could be called outliers and our next step is to handle them. 
\n", - "![AnalysesResult](./pictures/AnalysesResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Handling outliers in numeric attributes\n", - "In statistics, an outlier is a data point that differs significantly from other observations in the same dataset. An outlier may be due to variability in the measurement or it may indicate experimental error. The latter are sometimes excluded from the dataset. For example, in our dataset we have `tip_amount` feature and usually it is less than 10 dollars, but due to an error in a data collection, some values can show thousands of dollar as a tip. Such data errors will skew statistics and aggregated values which will lead to a lower model accuracy. \n", - "\n", - "An outlier can cause serious problems in statistical analysis. Machine learning models are sensitive to the distribution and range of feature values. Outliers, or rare values, can negatively impact model accuracy and lead to longer training times. \n", - "\n", - "When you define a **Handle outliers** transform step, the statistics used to detect outliers are generated on the data available in Data Wrangler when defining this step. These same statistics are used when running a Data Wrangler job. \n", - "\n", - "Data Wrangler support several outliers detection and handle methods. We are going to use **Standard Deviation Numeric Outliers** and we remove all outliers as our dataset is big enough. This transform detects and fixes outliers in numeric features using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify 3 for standard deviations, a value falling more than 3 standard deviations from the mean is considered an outlier. \n", - "\n", - "To create this transformation you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![AddTransformOutliers](./pictures/AddTransformOutliers.png)\n", - "1. Click \"+ Add step\" orange button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Handle Outliers. \\\n", - "![SelectOutliers](./pictures/SelectOutliers.png)\n", - " 1. For \"Transform\" choose \"Standard deviation numeric outliers\"\n", - " 1. For \"Inputs columns\" choose `tip_amount`, `total_amount`, `duration`, and `trip_distance`\n", - " 1. For \"Fix method\" choose \"Remove\" \n", - " 1. For \"Standard deviations\" put 4 \\\n", - "![outliersConfig](./pictures/outliersConfig.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When transfromation is applied on a sampled data you should see all curent steps and a preview of a resulted dataset. \n", - "![outliersResult](./pictures/outliersResult.png)\n", - "\n", - "Optional: if you want you could repeat steps from a previous step (\"Quick analysis of a current dataset\") to create a new table summary and check new maximum for duration. Now the max value for `duration` is 130 minutes, which is more realistic. \n", - "![newTableSumary](./pictures/newTableSumary.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Grouping and aggregating data\n", - "At this moment we have cleaned dataset by removing outliers, invalid values, and added new features. There are few more steps before we start training our forecasting model. 
\n", - "\n", - "As we are interested in a hourly forecast we have to count number of trips per hour per station and also aggregate (with mean) all metrics such as distance, duration, tip, total amount. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Truncating timestamp\n", - "We don't need minutes and seconds in out timestamp, so we remove them.\n", - "There is no built-in filter transformation in Data Wrangler, so we create a custom transformation again.\n", - "\n", - "To create a custom transformation you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![addTrandformDate](./pictures/addTrandformDate.png)\n", - "1. Click \"+ Add step\" orange button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Custom Transform. \\\n", - "![CustomTransform](./pictures/CustomTransform.png)\n", - "1. In drop down menu select Python (PySpark) and use code below. This code will create a new column with a truncated timestemp and then drop original pickup column. \n", - "\n", - "```Python\n", - "from pyspark.sql.functions import col, date_trunc\n", - "df = df.withColumn('pickup_time', date_trunc(\"hour\",col(\"tpep_pickup_datetime\")))\n", - "df = df.drop(\"tpep_pickup_datetime\")\n", - "``` \n", - "\n", - "![DateTruncCode](./pictures/DateTruncCode.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When you apply the transfromation on sampled data, you must see all curent steps and a preview of a resulted dataset with a new column `pickup_time` and without column `tpep_pickup_datetime`\n", - "![DateTruncResult](./pictures/DateTruncResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Count number of trips per hour per station\n", - "Currenly we have only piece of information about each trip, but we don't know how many trips were made from each station per hour. The simplest way to do that is count number of records per stationID per hourly timestamp. While DataWrangles provide **GroupBy** transfromation. This transformation doesn't support grouping by multiple columns, so we use a custom transformation again. \n", - "\n", - "To create a custom transformation you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![addTrandformDate](./pictures/addTrandformDate.png)\n", - "1. Click \"+ Add step\" orange button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Custom Transform. \\\n", - "![CustomTransform](./pictures/CustomTransform.png)\n", - "1. In drop down menu select Python (PySpark) and use code below. This code will create a new column with a number of trips from each location for each timestamp. \n", - "\n", - "```Python\n", - "from pyspark.sql import functions as f\n", - "from pyspark.sql import Window\n", - "df = df.withColumn('count', f.count('duration').over(Window.partitionBy([f.col(\"pickup_time\"), f.col(\"PULocationID\")])))\n", - "``` \n", - "\n", - "![CountCode](./pictures/CountCode.png)\n", - "1. Choose Preview\n", - "1. 
Choose Add to save the step.\n", - "\n", - "When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset with a new column `count`.\n", - "![CountResult](./pictures/CountResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Resample time series\n", - "Now we are ready to make the final aggregation! We aggregate all rows by the combination of `PULocationID` and the `pickup_time` timestamp, replacing each feature with its mean value for each combination. \n", - "\n", - "We use the special built-in Time Series transformation **Resample**. The Resample transformation changes the frequency of the time series observations to a specified granularity. It comes with both upsampling and downsampling options. Applying upsampling increases the frequency of the observations, for example from daily to hourly, whereas downsampling decreases the frequency of the observations, for example from hourly to daily.\n", - "\n", - "To create this transformation:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![AddResample](./pictures/AddResample.png)\n", - "1. Click the orange \"+ Add step\" button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Time Series. \\\n", - "![SelectTimeSeries](./pictures/SelectTimeSeries.png)\n", - " 1. For \"Transform\" choose \"Resample\"\n", - " 1. For \"Timestamp\" choose pickup_time\n", - " 1. For \"ID column\" choose \"PULocationID\" \n", - " 1. For \"Frequency unit\" choose \"Hourly\"\n", - " 1. For \"Frequency quantity\" enter 1\n", - " 1. For \"Method to aggregate numeric values\" choose \"mean\"\n", - " 1. Use the default values for the rest of the parameters\n", - "![ResampleConfig](./pictures/ResampleConfig.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset. \n", - "![ResampleResult](./pictures/ResampleResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Dataset export\n", - "\n", - "Let's summarize our steps up to this stage:\n", - "1. Data import\n", - "1. Data type validation\n", - "1. Dropping columns\n", - "1. Handling missing and invalid timestamps\n", - "1. Feature engineering\n", - "1. Handling missing and invalid data in features\n", - "1. Removing corrupted data\n", - "1. Quick analysis \n", - "1. Handling outliers\n", - "1. Grouping and aggregating data\n", - "\n", - "At this stage we have a new dataset with cleaned data and newly engineered features. This dataset can already be used to create a forecast with open-source libraries or low-code / no-code tools like Amazon SageMaker Canvas or the Amazon Forecast service. \n", - "\n", - "We only have to run the Data Wrangler processing flow on the entire dataset. You can export this processing flow in several ways: as a single processing job, as a SageMaker pipeline step, or as Python code.\n", - "\n", - "We are going to export the data to S3. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Export to S3\n", - "This option creates a SageMaker processing job which runs the Data Wrangler processing flow and saves the resulting dataset to a specified S3 bucket.\n", - "\n", - "Follow the next steps to set up the export to S3.\n", - "1. 
Click the plus sign next to a collection of transformation elements and choose \"Add destination\"->\"Amazon S3\".\n", - "![addDestination](./pictures/addDestination.png)\n", - "1. Provide the parameters for the S3 destination:\n", - " 1. Dataset name - a name for the new dataset, for example \"NYC_export\"\n", - " 1. File type - CSV\n", - " 1. Delimiter - Comma\n", - " 1. Compression - none\n", - " 1. Amazon S3 location - you can use the same bucket we created at the beginning\n", - "1. Click the orange \"Add destination\" button \\\n", - "![addDestinationConfig](./pictures/addDestinationConfig.png)\n", - "1. Now your data flow has a final step and you see a new orange \"Create job\" button. Click it. \n", - "![flowCompleated](./pictures/flowCompleated.png)\n", - "1. Provide a \"Job name\" or keep the autogenerated one, and select the \"destination\". We have only one (\"S3:NYC_export\"), but you might have multiple destinations from different steps in your workflow. Leave the \"KMS key ARN\" field empty and click the orange \"Next\" button. \n", - "![Job1](./pictures/Job1.png)\n", - "1. Now you have to configure the compute capacity for the job. You can keep all the default values:\n", - " 1. For Instance type use \"ml.m5.4xlarge\"\n", - " 1. For Instance count use \"2\"\n", - " 1. You can explore \"Additional configuration\", but keep it unchanged. \n", - " 1. Click the orange \"Run\" button \\\n", - "![Job2](./pictures/Job2.png)\n", - "1. Your job is now started, and it takes about 1 hour to process 6 GB of data according to our Data Wrangler processing flow. The cost for this job will be around 2 USD, as \"ml.m5.4xlarge\" costs 0.922 USD per hour and we are using two of them. \\\n", - "![Job3](./pictures/Job3.png)\n", - "1. If you click on the job name you will be redirected to a new window with the job details. On the job details page you can see all parameters from the previous steps.\n", - "![Job4](./pictures/Job4.png)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you have already closed the previous window and want to take a look at the job details, run the following code cell and click on the \"Processing Jobs\" link." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.core.display import display, HTML\n", - "\n", - "# The link below opens the SageMaker console page that lists processing jobs\n", - "display(\n", - " HTML(\n", - " 'Open <a target=\"_blank\" href=\"https://{}.console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs\">Processing Jobs</a>'.format(\n", - " region, region\n", - " )\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In approximately one hour you should see that the job status has changed to \"Completed\"; you can also check the \"Processing time (seconds)\" value. \n", - "![Job5](./pictures/Job5.png)\n", - "Now you can close the job details page." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Check data processing results\n", - "After the Data Wrangler processing job is completed, we can check the results saved in our S3 bucket. Do not forget to update the \"job_name\" variable with your job name. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Data Wrangler exported the data to the selected bucket (created at the beginning) under a prefix equal to the job name\n", - "print(f'checking files in s3://{bucket_name}')\n", - "\n", - "job_name = \"TS-Workshop-DataPreparation-2022-05-17T00-04-33\" #!!! 
Replace with your job name!!!\n", - "\n", - "s3_client = boto3.client(\"s3\")\n", - "response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=job_name)\n", - "files = response.get(\"Contents\")\n", - "if not files:\n", - " print(f'no files found in the location s3://{bucket_name}/{job_name}. Check that the processing job is completed.')\n", - "else:\n", - " for file in files:\n", - " print(f\"file_name: {file['Key']}, size: {file['Size']}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We have just one file with a size of 223 MB. Let's import it and explore it a little bit." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!mkdir \"./data\"\n", - "s3_client.download_file(bucket_name, files[0]['Key'], \"./data/data.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "df = pd.read_csv('./data/data.csv') \n", - "df.dtypes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our file schema looks exactly as we expected: all columns are in place. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df.describe().apply(lambda s: s.apply('{0:.5f}'.format))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The statistics also look good. The maximum values might be a little high, but we can fix that by adjusting the \"Standard deviations\" value in the \"Handling outliers\" step. You can build several models with different values and select the one that produces the more accurate model. \n", - "\n", - "Congratulations! At this stage you have designed a workflow and successfully launched it. Of course, it is not mandatory to always run a job by clicking the \"Run\" button; you can automate it, but that is a topic of another workshop in this series. " - ] - },
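If you would like a quick visual sanity check of the prepared time series before wrapping up, a small sketch like the one below plots the hourly pickup counts for the busiest zone. The column names (`pickup_time`, `PULocationID`, `count`) are assumptions based on the schema printed by `df.dtypes` above, so adjust them if your exported file differs.

```Python
# Optional sketch: plot hourly pickups for the busiest zone in the exported file.
# Column names are assumed to match the exported schema (adjust if needed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./data/data.csv", parse_dates=["pickup_time"])
zone = df["PULocationID"].value_counts().idxmax()  # zone with the most rows

series = (
    df[df["PULocationID"] == zone]
    .sort_values("pickup_time")
    .set_index("pickup_time")["count"]
)
series.plot(figsize=(12, 4), title=f"Hourly pickups for zone {zone}")
plt.show()
```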
💡\n", - "Congratulations!
\n", - "You reached the end of this part. Now you know how to use Amazon SageMaker Data Wrangler for time series dataset preparation!\n", - "
\n", - "\n", - "You can now move to an optional advanced time series transformation exercise in the notebook [`TS-Workshop-Advanced.ipynb`](./TS-Workshop-Advanced.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean up\n", - "If you choose not to run the notebook with advanced transformation, please move to the cleanup notebook [`TS-Workshop-Cleanup.ipynb`](./TS-Workshop-Cleanup.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Release resources\n", - "The following code will stop the kernel in this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%html\n", - "\n", - "

Shutting down your kernel for this notebook to release resources.

\n", - "\n", - " \n", - "" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (Data Science)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}