Commit

[FEATURE] Add SageMaker Pipeline local mode example with BYOC and FrameworkProcessor (aws#3614)

* added framework-processor-local-pipelines

* black-nb on notebook

* updated README.md

* solving problems for commit id fc80e0d

* solved formatting problem in notebook

* reviewed notebook content, added dataset description, download dataset from public sagemaker s3 bucket

* grammar check

* changed dataset to synthetic transactions dataset

* removed reference to dataset origin

* updated to main branch

* fixing grammar and spelling

Co-authored-by: Aaron Markham <[email protected]>
2 people authored and atqy committed Oct 28, 2022
1 parent 004040c commit 822112a
Showing 10 changed files with 1,526 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -226,6 +226,7 @@ These examples show you how to use [SageMaker Pipelines](https://aws.amazon.com/
- [Amazon Comprehend with SageMaker Pipelines](sagemaker-pipelines/nlp/amazon_comprehend_sagemaker_pipeline) shows how to deploy a custom text classification using Amazon Comprehend and SageMaker Pipelines.
- [Amazon Forecast with SageMaker Pipelines](sagemaker-pipelines/time_series_forecasting/amazon_forecast_pipeline) shows how you can create a dataset, dataset group and predictor with Amazon Forecast and SageMaker Pipelines.
- [Multi-model SageMaker Pipeline with Hyperparameter Tuning and Experiments](sagemaker-pipeline-multi-model) shows how you can generate a regression model by training on real estate data from Athena using Data Wrangler, and uses multiple algorithms both from a custom container and a SageMaker container in a single pipeline.
- [SageMaker Pipeline Local Mode with FrameworkProcessor and BYOC for PyTorch with sagemaker-training-toolkit](sagemaker-pipelines/tabular/local-mode/framework-processor-byoc)

### Amazon SageMaker Pre-Built Framework Containers and the Python SDK

@@ -0,0 +1,44 @@
## buildspec.sh

This script creates the source artifact sourcedir.tar.gz to use for SageMaker Training Jobs or SageMaker Processing Jobs.

Parameters:
* ALGORITHM_NAME: *Mandatory* - Name of the algorithm you want to package
* S3_BUCKET_NAME: *Optional* - S3 bucket name where the package will be uploaded under the path s3://<S3_BUCKET_NAME>/artifact/<ALGORITHM_NAME>

```
./buildspec.sh <ALGORITHM_NAME> <S3_BUCKET_NAME>
```

### Examples

#### Processing

```
./buildspec.sh processing test-bucket
```

#### Training

```
./buildspec.sh training test-bucket
```

Each invocation packages the content of <ALGORITHM_NAME>/src into sourcedir.tar.gz and, if a bucket is given, uploads it to s3://<S3_BUCKET_NAME>/artifact/<ALGORITHM_NAME>/sourcedir.tar.gz.

## build_image.sh

This script builds a custom Docker image and pushes it to Amazon ECR.

Parameters:
* IMAGE_NAME: *Mandatory* - Name of the algorithm folder containing the Dockerfile; the script switches into this directory before building
* REGISTRY_NAME: *Mandatory* - Name of the ECR repository to push the image to
* IMAGE_TAG: *Mandatory* - Tag to apply to the ECR image
* DOCKER_FILE: *Mandatory* - Dockerfile to build
* PLATFORM: *Optional* - Target platform for the image (e.g. linux/amd64)

```
./build_image.sh <IMAGE_NAME> <REGISTRY_NAME> <IMAGE_TAG> <DOCKER_FILE> <PLATFORM>
```

Examples:

```
./build_image.sh training torch-1.12.1 latest Dockerfile linux/amd64
```

build_image.sh
@@ -0,0 +1,48 @@
#!/bin/sh

# The name of our algorithm
repo=$1
registry_name=$2
image_tag=$3
docker_file=$4
platforms=$5

echo "[INFO]: registry_name=${registry_name}"
echo "[INFO]: image_tag=${image_tag}"
echo "[INFO]: docker_file=${docker_file}"
echo "[INFO]: platforms=${platforms}"

cd "${repo}"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

echo "[INFO]: Region ${region}"

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${registry_name}:${image_tag}"

echo "[INFO]: Image name: ${fullname}"

# If the repository doesn't exist in ECR, create it.
if ! aws ecr describe-repositories --repository-names "${registry_name}" > /dev/null 2>&1
then
    aws ecr create-repository --repository-name "${registry_name}" > /dev/null
fi

## If you are extending Amazon SageMaker images, you need to log in to the account
# Get the login password from ECR and pipe it straight into docker login
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"

if [ -z "${platforms}" ]
then
    docker build -t ${fullname} -f ${docker_file} .
else
    echo "Provided platform = ${platforms}"
    docker build -t ${fullname} -f ${docker_file} . --platform=${platforms}
fi

docker push ${fullname}
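
After the push, you can confirm that the image landed in the repository. A minimal sketch using boto3; the repository name is the one from the README example above, so adjust it to your own:

```
import boto3

# List the tags and push times of the images in the ECR repository
ecr = boto3.client("ecr")

for detail in ecr.describe_images(repositoryName="torch-1.12.1")["imageDetails"]:
    print(detail.get("imageTags", []), detail["imagePushedAt"])
```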

buildspec.sh
@@ -0,0 +1,47 @@
#!/bin/sh

# © 2021 Amazon Web Services, Inc. or its affiliates. All Rights Reserved.
#
# This AWS Content is provided subject to the terms of the AWS Customer Agreement
# available at http://aws.amazon.com/agreement or other written agreement between
# Customer and either Amazon Web Services, Inc. or Amazon Web Services EMEA SARL or both.

REPO=$1
S3_BUCKET_NAME=$2

NAME=sourcedir
PUSH=true

if [ -z "${REPO}" ] ;
then
    echo "Repository not specified"
    exit 1
fi

SCRIPT=$(readlink -f "$0")
# Absolute path this script is in, thus /home/user/bin
SCRIPTPATH=$(dirname "$SCRIPT")

cd "$SCRIPTPATH/$REPO/src"
tar --exclude='data' -czvf ${NAME}.tar.gz *

rm -rf ../dist/$REPO
mkdir -p ../dist/$REPO

mv ${NAME}.tar.gz ../dist/$REPO

if [ -z "${S3_BUCKET_NAME}" ] ;
then
echo "S3 Bucket not specified, no upload"
PUSH=false
fi

if $PUSH ;
then
    echo "Uploading s3://${S3_BUCKET_NAME}/artifact/${REPO}/${NAME}.tar.gz"

    aws s3 cp ../dist/${REPO}/${NAME}.tar.gz s3://${S3_BUCKET_NAME}/artifact/${REPO}/${NAME}.tar.gz
else
    # S3 bucket not specified: the artifact stays in ../dist/${REPO}
    exit 0
fi
@@ -0,0 +1,137 @@
import argparse
import csv
import logging
import os
import traceback
from os import listdir
from os.path import isfile, join

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

BASE_PATH = os.path.join("/", "opt", "ml")
PROCESSING_PATH = os.path.join(BASE_PATH, "processing")
PROCESSING_PATH_INPUT = os.path.join(PROCESSING_PATH, "input")
PROCESSING_PATH_OUTPUT = os.path.join(PROCESSING_PATH, "output")


def extract_data(file_path, percentage=100):
    """Read every CSV file in file_path and return the first `percentage` percent of each, concatenated."""
    try:
        files = [f for f in listdir(file_path) if isfile(join(file_path, f)) and f.endswith(".csv")]
        LOGGER.info("{}".format(files))

        frames = []

        for file in files:
            df = pd.read_csv(
                os.path.join(file_path, file),
                sep=",",
                quotechar='"',
                quoting=csv.QUOTE_ALL,
                escapechar="\\",
                encoding="utf-8",
                on_bad_lines="skip",  # replaces error_bad_lines=False, deprecated since pandas 1.3
            )

            # Keep only the requested fraction of each file
            df = df.head(int(len(df) * (percentage / 100)))

            frames.append(df)

        df = pd.concat(frames)

        return df
    except Exception as e:
        stacktrace = traceback.format_exc()
        LOGGER.error("{}".format(stacktrace))

        raise e


def load_data(df, file_path, file_name):
    """Write df as <file_path>/<file_name>.csv, creating the directory if needed."""
    try:
        if not os.path.exists(file_path):
            os.makedirs(file_path)

        path = os.path.join(file_path, file_name + ".csv")

        LOGGER.info("Saving file in {}".format(path))

        df.to_csv(
            path,
            index=False,
            header=True,
            quoting=csv.QUOTE_ALL,
            encoding="utf-8",
            escapechar="\\",
            sep=",",
        )
    except Exception as e:
        stacktrace = traceback.format_exc()
        LOGGER.error("{}".format(stacktrace))

        raise e


def transform_data(df):
    """Encode the categorical columns and map 'Is Fraud?' to binary labels."""
    try:
        # Copy the filtered slice so later assignments don't touch a view
        df = df[df["Is Fraud?"].notna()].copy()

        df.insert(0, "ID", range(1, len(df) + 1))

        # Encode the transaction error categories as integers (8 = no error)
        df["Errors?"] = df["Errors?"].fillna("")
        df["Errors?"] = df["Errors?"].map(lambda x: x.strip())
        df["Errors?"] = df["Errors?"].map(
            {
                "Insufficient Balance": 0,
                "Technical Glitch": 1,
                "Bad PIN": 2,
                "Bad Expiration": 3,
                "Bad Card Number": 4,
                "Bad CVV": 5,
                "Bad PIN,Insufficient Balance": 6,
                "Bad PIN,Technical Glitch": 7,
                "": 8,
            }
        )

        df["Use Chip"] = df["Use Chip"].fillna("")
        df["Use Chip"] = df["Use Chip"].map(lambda x: x.strip())
        df["Use Chip"] = df["Use Chip"].map(
            {"Swipe Transaction": 0, "Chip Transaction": 1, "Online Transaction": 2}
        )

        # Normalize the label column and map it to 0/1
        df["Is Fraud?"] = df["Is Fraud?"].map(lambda x: x.replace("'", ""))
        df["Is Fraud?"] = df["Is Fraud?"].map(lambda x: x.strip())
        df["Is Fraud?"] = df["Is Fraud?"].replace("", np.nan)
        df["Is Fraud?"] = df["Is Fraud?"].replace(" ", np.nan)

        df["Is Fraud?"] = df["Is Fraud?"].map({"No": 0, "Yes": 1})

        df = df.rename(
            columns={
                "Card": "card",
                "MCC": "mcc",
                "Errors?": "errors",
                "Use Chip": "use_chip",
                "Is Fraud?": "labels",
            }
        )

        df = df[["card", "mcc", "errors", "use_chip", "labels"]]

        return df

    except Exception as e:
        stacktrace = traceback.format_exc()
        LOGGER.error("{}".format(stacktrace))

        raise e


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-percentage", type=int, required=False, default=100)
    args = parser.parse_args()

    LOGGER.info("Arguments: {}".format(args))

    df = extract_data(PROCESSING_PATH_INPUT, args.dataset_percentage)

    df = transform_data(df)

    data_train, data_test = train_test_split(df, test_size=0.2, shuffle=True)

    load_data(data_train, os.path.join(PROCESSING_PATH_OUTPUT, "train"), "train")
    load_data(data_test, os.path.join(PROCESSING_PATH_OUTPUT, "test"), "test")
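
This script reads raw CSVs from /opt/ml/processing/input and writes the train/test splits under /opt/ml/processing/output, which are exactly the container paths a processing job mounts. A minimal sketch of wiring such a script into a local-mode pipeline with FrameworkProcessor; the script name, bucket, and role here are illustrative assumptions, not values from this commit:

```
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.pytorch.estimator import PyTorch
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.steps import ProcessingStep

local_session = LocalPipelineSession()

# Hypothetical role ARN and input location; replace with your own
role = "arn:aws:iam::111122223333:role/service-role/SageMakerRole"
input_data = "s3://test-bucket/data/input"

processor = FrameworkProcessor(
    estimator_cls=PyTorch,
    framework_version="1.12",
    py_version="py38",
    role=role,
    instance_count=1,
    instance_type="local",
    sagemaker_session=local_session,
)

# With a (Local)PipelineSession, run() returns step arguments instead of starting a job
step_args = processor.run(
    code="processing.py",  # assumed file name of the script above
    source_dir="code",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test"),
    ],
    arguments=["--dataset-percentage", "100"],
)

pipeline = Pipeline(
    name="local-processing-pipeline",
    steps=[ProcessingStep(name="PreprocessData", step_args=step_args)],
    sagemaker_session=local_session,
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()  # runs the step in Docker containers on the local machine
```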
@@ -0,0 +1 @@
pandas
@@ -0,0 +1,54 @@
.idea/
.vscode/
*/**/data/

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Logs
**/logs
*.log

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

Dockerfile
@@ -0,0 +1,25 @@
FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3

RUN apt-get update && apt-get install gcc -y

RUN ${PIP} --no-cache-dir install --upgrade pip

RUN ${PIP} install \
    pandas \
    scikit-learn

WORKDIR /

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib" \
    PYTHONIOENCODING=UTF-8 \
    LANG=C.UTF-8 \
    LC_ALL=C.UTF-8

RUN ${PIP} install --no-cache-dir --upgrade \
    sagemaker-pytorch-training
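
Because the image bundles sagemaker-pytorch-training (the SageMaker training toolkit), it can serve as a BYOC training container for the SageMaker PyTorch estimator, with the sourcedir.tar.gz artifact from buildspec.sh as the source archive. A hedged sketch, where the account ID, region, bucket, entry point, and role are placeholders rather than values from this commit:

```
from sagemaker.pytorch.estimator import PyTorch

# Placeholder image URI: the result of pushing this Dockerfile with build_image.sh
image_uri = "111122223333.dkr.ecr.us-east-1.amazonaws.com/torch-1.12.1:latest"

estimator = PyTorch(
    entry_point="train.py",  # assumed entry point inside the source archive
    source_dir="s3://test-bucket/artifact/training/sourcedir.tar.gz",  # produced by buildspec.sh
    image_uri=image_uri,  # BYOC image; the toolkit inside it runs the entry point
    role="arn:aws:iam::111122223333:role/service-role/SageMakerRole",
    instance_count=1,
    instance_type="local",  # local mode: trains in a container on this machine
)

estimator.fit({"train": "s3://test-bucket/data/train"})
```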