From 32bc20f771249c440c7a5df105a7976208cbaa03 Mon Sep 17 00:00:00 2001 From: neelamkoshiya Date: Tue, 9 Aug 2022 15:34:12 -0700 Subject: [PATCH] Delete TS-Workshop.ipynb --- .../timeseries-dataflow/TS-Workshop.ipynb | 923 ------------------ 1 file changed, 923 deletions(-) delete mode 100644 sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb diff --git a/sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb b/sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb deleted file mode 100644 index 5f2ea3944e..0000000000 --- a/sagemaker-datawrangler/timeseries-dataflow/TS-Workshop.ipynb +++ /dev/null @@ -1,923 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Amazon SageMaker Data Wrangler time series workshop\n", - "\n", - "Our business goal: predict the number of New York City yellow taxi pickups per pickup zone per hour for the next 24 hours, and provide some insights for drivers such as average tips and average distance.\n", - "The first and most important step toward this goal is data preparation, and this workshop focuses on that step. \n", - "\n", - "Data used in this demo notebook:\n", - "- Original data source for all open data from 2008-now: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page\n", - "- AWS-hosted location: https://registry.opendata.aws/nyc-tlc-trip-records-pds/\n", - "\n", - "The raw data is split into one file per month per service (Yellow, Green, or ForHire) from 2008 through 2020, with each file around 600 MB. The entire raw data S3 bucket is huge. \n", - "\n", - "We will use just 14 files: yellow cabs for Jan-Dec 2019 and Jan-Feb 2020, to avoid COVID effects." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Amazon SageMaker Data Wrangler can use different data sources, but we need an S3 bucket to act as the source of our data. The code below creates a bucket with a unique `bucket_name`.\n", - "\n", - "We recommend keeping the S3 bucket in the same region as Amazon SageMaker Data Wrangler."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import boto3\n", - "import json\n", - "import random" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r bucket_name\n", - "%store -r data_uploaded\n", - "%store -r region\n", - "\n", - "try:\n", - " bucket_name\n", - " data_uploaded\n", - "except NameError:\n", - " data_uploaded = False" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if data_uploaded:\n", - " print(f'using previously uploaded data into {bucket_name}')\n", - "else:\n", - " # Sets the same region as current Amazon SageMaker Notebook\n", - " with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n", - " data = json.load(notebook_info)\n", - " resource_arn = data['ResourceArn']\n", - " region = resource_arn.split(':')[3]\n", - " print('region:', region)\n", - "\n", - " # Or you can specify the region where your bucket and model will be located in this region\n", - " # region = \"us-east-1\" \n", - "\n", - " s3 = boto3.resource('s3')\n", - " account_id = boto3.client('sts').get_caller_identity().get('Account')\n", - " bucket_name = account_id + \"-\" + region + \"-\" + \"datawranglertimeseries\" + \"-\" + str(random.randrange(0, 10001, 4))\n", - " print('bucket_name:', bucket_name)\n", - "\n", - " try: \n", - " print(f'creating a S3 bucket in {region}')\n", - " if region == \"us-east-1\":\n", - " s3.create_bucket(Bucket=bucket_name)\n", - " else:\n", - " s3.create_bucket(\n", - " Bucket = bucket_name,\n", - " CreateBucketConfiguration={'LocationConstraint': region}\n", - " )\n", - " except Exception as e:\n", - " print (e)\n", - " print(\"Bucket already exists. 
Using bucket\", bucket_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First we need to download the data (training data) to our new bucket" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "CopySource = [\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-01.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-02.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-03.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-04.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-05.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-06.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-07.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-08.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-09.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-10.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-11.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2019-12.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2020-01.parquet'},\n", - " {'Bucket': 'nyc-tlc', 'Key': 'trip data/yellow_tripdata_2020-02.parquet'}\n", - "]\n", - "\n", - "if not data_uploaded:\n", - " for copy_source in CopySource:\n", - " s3.meta.client.copy(copy_source, bucket_name, copy_source['Key'])\n", - " \n", - " data_uploaded = True\n", - "else:\n", - " print(f'skipping data upload as data already in the bucket {bucket_name}')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 ls s3://{bucket_name}/'trip data'/" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store bucket_name\n", - "%store data_uploaded\n", - "%store region" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we have raw data in our S3 bucket and ready to explore it and build a training dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Dataset Import\n", - "\n", - "Our first step is to launch a new Data Wrangler session and there are multiple ways how to do that. For example, use the following: Click File -> New -> Data Wrangler Flow\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![newDWF](./pictures/newDWF.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Amazon SageMaker will start to provision a resources for you and you a could find a new Data Wrangler Flow file in a File Browser section\n", - "![DWStarting](./pictures/DWStarting.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Lets rename our new workflow: Right click on file -> Rename Data Wrangler Flow\n", - "\n", - "![DWRename](./pictures/DWRename.png)\n", - "\n", - "Put a new name, for example: `TS-Workshop-DataPreparation.flow`\n", - "\n", - "![DWRename](./pictures/DWNewName.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In a few minutes DataWragler will finish to provision recources and you could see \"Import Data\" screen. 
\n", - "Data Wrangles support many data sources: Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, Databricks.\n", - "Our data already in S3, let's import it by clicking \"Amazon S3\" button. \n", - "![SelectS3](./pictures/SelectS3.png)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f'S3 bucket name with data: {bucket_name}')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You will see all your S3 buckets and if you want you could manually select a bucket which we created at the begining of the workshop. \n", - "\n", - "![EntireS3](./pictures/EntireS3.png)\n", - "\n", - "As you might have thouthands of buckets I recommend to provide a name in a \"S3 URI path field\". Use a `s3`-format `s3://= 0)\n", - "df = df.filter(df.tip_amount >= 0)\n", - "df = df.filter(df.total_amount >= 0)\n", - "df = df.filter(df.duration >= 1)\n", - "df = df.filter((1 <= df.PULocationID) & (df.PULocationID <= 263))\n", - "df = df.filter((df.tpep_pickup_datetime >= \"2019-01-01 00:00:00\") & (df.tpep_pickup_datetime < \"2020-03-01 00:00:00\"))\n", - "``` \n", - "\n", - "![FilterTransform](./pictures/FilterTransform.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When transfromation is applied on a sampled data you should see all curent steps and a preview of a resulted dataset with a new column duration and without column tpep_dropoff_datetime\n", - "![FilterTransformResult](./pictures/FilterTransformResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Quick analysis of dataset\n", - "\n", - "Amazon SageMaker Data Wrangler includes built-in analysis that help you generate visualizations and data insights in a few clicks. You can create custom analysis using your own code. We use the **Table Summary** analysis to quickly summarize our data. \n", - "\n", - "For columns with numerical data, including long and float data, table summary reports the number of entries (`count`), minimum (`min`), maximum (`max`), mean, and standard deviation (`stddev`) for each column. \n", - "\n", - "For columns with non-numerical data, including columns with String, Boolean, or DateTime data, table summary reports the number of entries (`count`), least frequent value (`min`), and most frequent value (`max`). \n", - "\n", - "To create this analyses you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose \"Add analyses\".\n", - "![addFirstAnalyses](./pictures/addFirstAnalyses.png)\n", - "1. In a \"analyses type\" drop down menu select \"Table Summary\" and provide a name for \"Analysis name\", for example: \"Cleaned dataset summary\"\n", - "![AnalysesConfig](./pictures/AnalysesConfig.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the analyses.\n", - "1. You could find your first analyses on a \"Analysis\" tab. All future visualisations will could be also found here. \n", - "![AnalysesPreview](./pictures/AnalysesPreview.png)\n", - "1. Click on analyses icon to open it. Take a look on a data. Most interesting part is a summary for \"duration\" column: maximum value is 1439 and this is minutes! 1439 minutes = almost 24 hours and this is defenetly an error which will reduce quality of our model. Such errors also could be called outliers and our next step is to handle them. 
\n", - "![AnalysesResult](./pictures/AnalysesResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Handling outliers in numeric attributes\n", - "In statistics, an outlier is a data point that differs significantly from other observations in the same dataset. An outlier may be due to variability in the measurement or it may indicate experimental error. The latter are sometimes excluded from the dataset. For example, in our dataset we have `tip_amount` feature and usually it is less than 10 dollars, but due to an error in a data collection, some values can show thousands of dollar as a tip. Such data errors will skew statistics and aggregated values which will lead to a lower model accuracy. \n", - "\n", - "An outlier can cause serious problems in statistical analysis. Machine learning models are sensitive to the distribution and range of feature values. Outliers, or rare values, can negatively impact model accuracy and lead to longer training times. \n", - "\n", - "When you define a **Handle outliers** transform step, the statistics used to detect outliers are generated on the data available in Data Wrangler when defining this step. These same statistics are used when running a Data Wrangler job. \n", - "\n", - "Data Wrangler support several outliers detection and handle methods. We are going to use **Standard Deviation Numeric Outliers** and we remove all outliers as our dataset is big enough. This transform detects and fixes outliers in numeric features using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify 3 for standard deviations, a value falling more than 3 standard deviations from the mean is considered an outlier. \n", - "\n", - "To create this transformation you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![AddTransformOutliers](./pictures/AddTransformOutliers.png)\n", - "1. Click \"+ Add step\" orange button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Handle Outliers. \\\n", - "![SelectOutliers](./pictures/SelectOutliers.png)\n", - " 1. For \"Transform\" choose \"Standard deviation numeric outliers\"\n", - " 1. For \"Inputs columns\" choose `tip_amount`, `total_amount`, `duration`, and `trip_distance`\n", - " 1. For \"Fix method\" choose \"Remove\" \n", - " 1. For \"Standard deviations\" put 4 \\\n", - "![outliersConfig](./pictures/outliersConfig.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When transfromation is applied on a sampled data you should see all curent steps and a preview of a resulted dataset. \n", - "![outliersResult](./pictures/outliersResult.png)\n", - "\n", - "Optional: if you want you could repeat steps from a previous step (\"Quick analysis of a current dataset\") to create a new table summary and check new maximum for duration. Now the max value for `duration` is 130 minutes, which is more realistic. \n", - "![newTableSumary](./pictures/newTableSumary.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Grouping and aggregating data\n", - "At this moment we have cleaned dataset by removing outliers, invalid values, and added new features. There are few more steps before we start training our forecasting model. 
\n", - "\n", - "As we are interested in a hourly forecast we have to count number of trips per hour per station and also aggregate (with mean) all metrics such as distance, duration, tip, total amount. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Truncating timestamp\n", - "We don't need minutes and seconds in out timestamp, so we remove them.\n", - "There is no built-in filter transformation in Data Wrangler, so we create a custom transformation again.\n", - "\n", - "To create a custom transformation you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![addTrandformDate](./pictures/addTrandformDate.png)\n", - "1. Click \"+ Add step\" orange button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Custom Transform. \\\n", - "![CustomTransform](./pictures/CustomTransform.png)\n", - "1. In drop down menu select Python (PySpark) and use code below. This code will create a new column with a truncated timestemp and then drop original pickup column. \n", - "\n", - "```Python\n", - "from pyspark.sql.functions import col, date_trunc\n", - "df = df.withColumn('pickup_time', date_trunc(\"hour\",col(\"tpep_pickup_datetime\")))\n", - "df = df.drop(\"tpep_pickup_datetime\")\n", - "``` \n", - "\n", - "![DateTruncCode](./pictures/DateTruncCode.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When you apply the transfromation on sampled data, you must see all curent steps and a preview of a resulted dataset with a new column `pickup_time` and without column `tpep_pickup_datetime`\n", - "![DateTruncResult](./pictures/DateTruncResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Count number of trips per hour per station\n", - "Currenly we have only piece of information about each trip, but we don't know how many trips were made from each station per hour. The simplest way to do that is count number of records per stationID per hourly timestamp. While DataWrangles provide **GroupBy** transfromation. This transformation doesn't support grouping by multiple columns, so we use a custom transformation again. \n", - "\n", - "To create a custom transformation you have to:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![addTrandformDate](./pictures/addTrandformDate.png)\n", - "1. Click \"+ Add step\" orange button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Custom Transform. \\\n", - "![CustomTransform](./pictures/CustomTransform.png)\n", - "1. In drop down menu select Python (PySpark) and use code below. This code will create a new column with a number of trips from each location for each timestamp. \n", - "\n", - "```Python\n", - "from pyspark.sql import functions as f\n", - "from pyspark.sql import Window\n", - "df = df.withColumn('count', f.count('duration').over(Window.partitionBy([f.col(\"pickup_time\"), f.col(\"PULocationID\")])))\n", - "``` \n", - "\n", - "![CountCode](./pictures/CountCode.png)\n", - "1. Choose Preview\n", - "1. 
Choose Add to save the step.\n", - "\n", - "When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset with a new column `count`.\n", - "![CountResult](./pictures/CountResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Resample time series\n", - "Now we are ready to make the final aggregation! We aggregate all rows by the combination of `PULocationID` and the `pickup_time` timestamp, replacing each feature with its mean value for each combination. \n", - "\n", - "We use the special built-in Time Series transformation **Resample**. The Resample transformation changes the frequency of the time series observations to a specified granularity. It comes with both upsampling and downsampling options. Applying upsampling increases the frequency of the observations, for example from daily to hourly, whereas downsampling decreases the frequency of the observations, for example from hourly to daily.\n", - "\n", - "To create this transformation:\n", - "1. Click the plus sign next to a collection of transformation elements and choose Add transform.\n", - "![AddResample](./pictures/AddResample.png)\n", - "1. Click the orange \"+ Add step\" button in the TRANSFORMS menu.\n", - "![AddStep](./pictures/AddStep.png)\n", - "1. Choose Time Series. \\\n", - "![SelectTimeSeries](./pictures/SelectTimeSeries.png)\n", - " 1. For \"Transform\" choose \"Resample\"\n", - " 1. For \"Timestamp\" choose pickup_time\n", - " 1. For \"ID column\" choose \"PULocationID\" \n", - " 1. For \"Frequency unit\" choose \"Hourly\"\n", - " 1. For \"Frequency quantity\" enter 1\n", - " 1. For \"Method to aggregate numeric values\" choose \"mean\"\n", - " 1. Use the default values for the rest of the parameters\n", - "![ResampleConfig](./pictures/ResampleConfig.png)\n", - "1. Choose Preview\n", - "1. Choose Add to save the step.\n", - "\n", - "When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset. \n", - "![ResampleResult](./pictures/ResampleResult.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Dataset export\n", - "\n", - "Let's summarize our steps up to this stage:\n", - "1. Data import\n", - "1. Data type validation\n", - "1. Dropping columns\n", - "1. Handling missing and invalid timestamps\n", - "1. Feature engineering\n", - "1. Handling missing and invalid data in features\n", - "1. Removing corrupted data\n", - "1. Quick analysis \n", - "1. Handling outliers\n", - "1. Grouping and aggregating data\n", - "\n", - "At this stage we have a new dataset with cleaned data and newly engineered features. This dataset can already be used to create a forecast with open-source libraries or low-code / no-code tools like Amazon SageMaker Canvas or the Amazon Forecast service. \n", - "\n", - "We only have to run the Data Wrangler processing flow on the entire dataset. You can export this processing flow in several ways: as a single processing job, as a SageMaker pipeline step, or as Python code.\n", - "\n", - "We are going to export the data to S3. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Export to S3\n", - "This option creates a SageMaker processing job which runs the Data Wrangler processing flow and saves the resulting dataset to a specified S3 bucket.\n", - "\n", - "Follow the next steps to set up the export to S3.\n", - "1. 
Click the plus sign next to a collection of transformation elements and choose \"Add destination\"->\"Amazon S3\".\n", - "![addDestination](./pictures/addDestination.png)\n", - "1. Provide the parameters for the S3 destination:\n", - " 1. Dataset name - a name for the new dataset, for example \"NYC_export\"\n", - " 1. File type - CSV\n", - " 1. Delimiter - Comma\n", - " 1. Compression - none\n", - " 1. Amazon S3 location - you can use the same bucket we created at the beginning\n", - "1. Click the orange \"Add destination\" button \\\n", - "![addDestinationConfig](./pictures/addDestinationConfig.png)\n", - "1. Now your data flow has a final step and you see a new orange \"Create job\" button. Click it. \n", - "![flowCompleated](./pictures/flowCompleated.png)\n", - "1. Provide a \"Job name\" or keep the autogenerated one, and select the \"destination\". We have only one (\"S3:NYC_export\"), but you might have multiple destinations from different steps in your workflow. Leave the \"KMS key ARN\" field empty and click the orange \"Next\" button. \n", - "![Job1](./pictures/Job1.png)\n", - "1. Now you have to configure the compute capacity for the job. You can keep all the default values:\n", - " 1. For Instance type use \"ml.m5.4xlarge\"\n", - " 1. For Instance count use \"2\"\n", - " 1. You can explore \"Additional configuration\", but keep it unchanged. \n", - " 1. Click the orange \"Run\" button \\\n", - "![Job2](./pictures/Job2.png)\n", - "1. Your job is now started, and it takes about 1 hour to process 6 GB of data according to our Data Wrangler processing flow. The cost for this job will be around 2 USD, as \"ml.m5.4xlarge\" costs 0.922 USD per hour and we are using two of them. \\\n", - "![Job3](./pictures/Job3.png)\n", - "1. If you click on the job name you will be redirected to a new window with the job details. On the job details page you can see all parameters from the previous steps.\n", - "![Job4](./pictures/Job4.png)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you have already closed the previous window and want to take a look at the job details, run the following code cell and click on the \"Processing Jobs\" link." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.core.display import display, HTML\n", - "\n", - "# The link below opens the SageMaker console page that lists processing jobs\n", - "display(\n", - " HTML(\n", - " 'Open <a target=\"_blank\" href=\"https://{}.console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs\">Processing Jobs</a>'.format(\n", - " region, region\n", - " )\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In approximately one hour you should see that the job status has changed to \"Completed\"; you can also check the \"Processing time (seconds)\" value. \n", - "![Job5](./pictures/Job5.png)\n", - "Now you can close the job details page." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Check data processing results\n", - "After the Data Wrangler processing job is completed, we can check the results saved in our S3 bucket. Do not forget to update the \"job_name\" variable with your job name. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Data Wrangler exported the data to the selected bucket (created at the beginning) under a prefix equal to the job name\n", - "print(f'checking files in s3://{bucket_name}')\n", - "\n", - "job_name = \"TS-Workshop-DataPreparation-2022-05-17T00-04-33\" #!!! 
Replace with your job name!!!\n", - "\n", - "s3_client = boto3.client(\"s3\")\n", - "response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=job_name)\n", - "files = response.get(\"Contents\")\n", - "if not files:\n", - " print(f'no files found in the location s3://{bucket_name}/{job_name}. Check that the processing job is completed.')\n", - "else:\n", - " for file in files:\n", - " print(f\"file_name: {file['Key']}, size: {file['Size']}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We have just one file with a size of 223 MB. Let's import it and explore it a little bit." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!mkdir \"./data\"\n", - "s3_client.download_file(bucket_name, files[0]['Key'], \"./data/data.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "df = pd.read_csv('./data/data.csv') \n", - "df.dtypes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our file schema looks exactly as we expected: all columns are in place. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df.describe().apply(lambda s: s.apply('{0:.5f}'.format))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The statistics also look good. The maximum values might be a little high, but we can fix that by adjusting the \"Standard deviations\" value in the \"Handling outliers\" step. You can build several models with different values and select the one that produces the more accurate model. \n", - "\n", - "Congratulations! At this stage you have designed a workflow and successfully launched it. Of course, it is not mandatory to always run a job by clicking the \"Run\" button; you can automate it, but that is a topic of another workshop in this series. " - ] - },
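If you would like a quick visual sanity check of the prepared time series before wrapping up, a small sketch like the one below plots the hourly pickup counts for the busiest zone. The column names (`pickup_time`, `PULocationID`, `count`) are assumptions based on the schema printed by `df.dtypes` above, so adjust them if your exported file differs.

```Python
# Optional sketch: plot hourly pickups for the busiest zone in the exported file.
# Column names are assumed to match the exported schema (adjust if needed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./data/data.csv", parse_dates=["pickup_time"])
zone = df["PULocationID"].value_counts().idxmax()  # zone with the most rows

series = (
    df[df["PULocationID"] == zone]
    .sort_values("pickup_time")
    .set_index("pickup_time")["count"]
)
series.plot(figsize=(12, 4), title=f"Hourly pickups for zone {zone}")
plt.show()
```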
💡\n", - "Congratulations!
\n", - "You reached the end of this part. Now you know how to use Amazon SageMaker Data Wrangler for time series dataset preparation!\n", - "
\n", - "\n", - "You can now move to an optional advanced time series transformation exercise in the notebook [`TS-Workshop-Advanced.ipynb`](./TS-Workshop-Advanced.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean up\n", - "If you choose not to run the notebook with advanced transformation, please move to the cleanup notebook [`TS-Workshop-Cleanup.ipynb`](./TS-Workshop-Cleanup.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Release resources\n", - "The following code will stop the kernel in this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%html\n", - "\n", - "

Shutting down your kernel for this notebook to release resources.

\n", - "\n", - " \n", - "" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (Data Science)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}