Adding R Studio Examples #2997

Merged
merged 26 commits into from
Oct 29, 2021
Commits
1556e09
add lineage example notebooks (#90)
danabens Oct 1, 2020
d3a6c89
add example notebook skeleton for fairness and explainability (#91)
xinyu7030 Oct 16, 2020
071cf58
merge changes from public repository (#92)
ajaykarpur Nov 12, 2020
5fd50a8
Revert "add lineage example notebooks (#90)" (#94)
danabens Nov 12, 2020
1fbfcc8
Revert "add example notebook skeleton for fairness and explainability…
nigenda-amazon Nov 12, 2020
1ffb996
add lineage example notebooks (#90)
danabens Oct 1, 2020
ab01615
merge changes from public repository (#92)
ajaykarpur Nov 12, 2020
d41663e
Revert "add lineage example notebooks (#90)" (#94)
danabens Nov 12, 2020
57fdf88
merge changes from public repository
ajaykarpur Nov 30, 2020
7afb190
Update README.md
shreyapandit Aug 12, 2021
b31a45a
Update from aws example notebooks public repo
shreyapandit Sep 16, 2021
e0e03cb
Update config files to match public repo
shreyapandit Sep 16, 2021
c5009ba
- Added RSW/RSC examples. Modified root README.
michaelhsieh42 Sep 22, 2021
10ebaf5
- Fixed data sources and other edits.
michaelhsieh42 Sep 23, 2021
a02a0b1
- Added comments.
michaelhsieh42 Sep 23, 2021
5d20ffe
- Fixed the s3 URI to http form.
michaelhsieh42 Sep 24, 2021
ef00312
Bring in changes from public samples repo
shreyapandit Oct 21, 2021
7efac02
Merge branch 'aws:master' into master
shreyapandit Oct 25, 2021
38597c1
Merge remote-tracking branch 'amazon-sagemaker-examples/master' into …
shreyapandit Oct 25, 2021
73afcb6
Merge pull request #138 from shreyapandit/master
shreyapandit Oct 26, 2021
29ac47d
- Fixed the service name.
michaelhsieh42 Oct 28, 2021
95be530
Merge pull request #133 from michaelhsieh42/master
shreyapandit Oct 29, 2021
ebf6d25
Merge remote-tracking branch 'staging1/master' into master
shreyapandit Oct 29, 2021
02d6aad
Removes title change
shreyapandit Oct 29, 2021
acc61dc
removes redundant notebook
shreyapandit Oct 29, 2021
dcb7cba
Add back test image file
shreyapandit Oct 29, 2021
8 changes: 8 additions & 0 deletions README.md
@@ -139,6 +139,14 @@ These examples provide an introduction to SageMaker Clarify which provides machi
* [Fairness and Explainability with SageMaker Clarify](sagemaker_processing/fairness_and_explainability) shows how to use SageMaker Clarify Processor API to measure the pre-training bias of a dataset and post-training bias of a model, and explain the importance of the input features on the model's decision.
* [Amazon SageMaker Clarify Model Monitors](sagemaker_model_monitor/fairness_and_explainability) shows how to use SageMaker Clarify Model Monitor API to schedule bias monitor to monitor predictions for bias drift on a regular basis, and schedule explainability monitor to monitor predictions for feature attribution drift on a regular basis.

### Publishing content from RStudio on Amazon SageMaker to RStudio Connect

These examples show you how to run R examples in RStudio on Amazon SageMaker and publish applications to RStudio Connect.

- [Publishing R Markdown](r_examples/rsconnect_rmarkdown/) shows how you can author an R Markdown document (.Rmd, .Rpres) within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
- [Publishing R Shiny Apps](r_examples/rsconnect_shiny/) shows how you can author an R Shiny application within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
- [Publishing Streamlit Apps](r_examples/rsconnect_streamlit/) shows how you can author a Streamlit application within Amazon SageMaker Studio and publish to RStudio Connect for wide consumption.

### Advanced Amazon SageMaker Functionality

These examples showcase unique functionality available in Amazon SageMaker. They cover a broad range of topics and utilize a variety of methods, but aim to provide the user with sufficient insight or inspiration to develop within Amazon SageMaker.
37 changes: 37 additions & 0 deletions r_examples/rsconnect_rmarkdown/README.md
@@ -0,0 +1,37 @@
# Publishing R Markdown documents from RStudio on Amazon SageMaker to RStudio Connect

You can programmatically create an analysis within RStudio on Amazon SageMaker and publish it to RStudio Connect so that your collaborators can easily consume it. In this example, we use a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) to walk through two common publication use cases: R Markdown and R Presentation documents.

## R Markdown

R Markdown is a great tool for running your analyses in R as part of a markdown file and sharing them in RStudio Connect. In the R Markdown example [breast_cancer_eda.Rmd](./breast_cancer_eda.Rmd) in the GitHub repo, we run two simple analyses on the dataset, a summary and a plot, alongside text written in markdown.

````{r}
```{r breastcancer}
data(BreastCancer)
df <- BreastCancer
# convert input values to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}
summary(df)
```

```{r cl_thickness, echo=FALSE}
ggplot(df, aes(x=Cl.thickness))+
  geom_histogram(color="black", fill="white", binwidth = 1)+
  facet_grid(Class ~ .)
```
````
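
The `as.numeric(as.character(...))` idiom in the chunk above matters because the feature columns of `BreastCancer` are stored as factors. The following is a minimal sketch of how the two conversions can differ. It uses a made-up factor (an assumption for illustration, not values from the dataset):

```r
# Hypothetical factor whose labels are numbers stored as text
# (illustrative values, not taken from the BreastCancer data).
scores <- factor(c("10", "2", "7", "2"))

# as.numeric() on a factor returns the internal level codes,
# which follow the sorted order of the labels ("10" < "2" < "7").
codes <- as.numeric(scores)

# Converting to character first recovers the values the labels spell out.
values <- as.numeric(as.character(scores))

print(codes)
print(values)
```

Going through `as.character()` first keeps the conversion independent of how the factor levels happen to be ordered.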

We can preview the file by clicking the **Knit** button (1) and publish it to RStudio Connect with the **Publish** button (2).
![publish-rmd](./images/publish-rmd.png)

## R Presentation

We can also run a similar analysis inline to create an R Presentation deck that can be published to your collaborators.
In the example [breast_cancer_eda.Rpres](./breast_cancer_eda.Rpres) in the GitHub repo, we combine the presentation, markdown, and R commands to create a slide deck. You can preview the slides while writing code with the **Preview** button (1). Once you are done, you can publish the deck with the **Publish** button (2) in the **Presentation** tab on the right.

![publish-rpres](./images/publish-rpres.png)

We showed you static work that can be published and shared on RStudio Connect from RStudio on Amazon SageMaker. More often than not, you are building an interactive application or dashboard with Shiny. Let’s take a look at how we can publish Shiny apps from RStudio on Amazon SageMaker to RStudio Connect in [Publishing R Shiny Apps](../rsconnect_shiny).
80 changes: 80 additions & 0 deletions r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rmd
@@ -0,0 +1,80 @@
---
title: "Breast Cancer data analysis"
author: "Amazon Web Services"
date: "9/7/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(mlbench)
library(ggplot2)
library(caret)
```

## Breast Cancer data summary

This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library. The dataset contains 699 observations. Nine features, each graded on a scale from 1 to 10, describe characteristics of the cells in a fine needle aspirate (FNA) of a breast mass. Let's look at the descriptive statistics of the dataset in numeric format.

```{r breastcancer}
data(BreastCancer)
df <- BreastCancer
# convert input columns 2 to 10 from factor to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}
summary(df)
```

## Histogram of clump thickness by class

We want to compare the distribution of clump thickness between the two classes: *Benign* and *Malignant*.

```{r cl_thickness, echo=FALSE}
ggplot(df, aes(x=Cl.thickness))+
  geom_histogram(color="black", fill="white", binwidth = 1)+
  facet_grid(Class ~ .)
```

It turns out that *benign* cases tend to have thinner clumps, as opposed to *malignant* cases, which tend to have thicker clumps.

## Training a machine learning model
Let's split the data, standardize it, and train an ML model. The training process uses 10-fold cross-validation, repeated ten times, with a gradient boosting model, and selects the final model by area under the ROC curve.
```{r modeling}
# fix the seed so that the train/test split is reproducible
set.seed(825)

# split the data into train and test and perform preprocessing
trainIndex <- createDataPartition(df$Class, p = .8,
                                  list = FALSE,
                                  times = 1)
df_train <- df[ trainIndex,]
df_test <- df[-trainIndex,]
preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute"))
df_train_transformed <- predict(preProcValues, df_train)

# train a model on df_train
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

set.seed(825)
gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11],
                method = "gbm",
                trControl = fitControl,
                ## This last option is actually one
                ## for gbm() that passes through
                verbose = FALSE,
                metric = "ROC")
```

We can see the feature importance based on the algorithm.
```{r featureimportance, echo=FALSE}
summary(gbmFit)
```

This is the end of a simple analysis and plotting exercise in an R Markdown file. We develop it in RStudio Workbench in Amazon SageMaker and will publish it to an RStudio Connect server.
49 changes: 49 additions & 0 deletions r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rpres
@@ -0,0 +1,49 @@
Breast Cancer data analysis
========================================================
author: Amazon Web Services
date: 09/07/2021
autosize: true

Dataset
========================================================

This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library.

The dataset contains 699 observations. Nine features, each graded on a scale from 1 to 10, describe characteristics of the cells in a fine needle aspirate (FNA) of a breast mass.

Descriptive Statistics
========================================================

The summary statistics below show a class imbalance between the *Benign* and *Malignant* cases.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(mlbench)
library(ggplot2)
```

```{r breastcancer, echo=FALSE}
data(BreastCancer)
df <- BreastCancer
# convert input columns 2 to 10 from factor to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}
summary(df)
```

Thicker clumps in malignant cases
========================================================

It turns out that *benign* cases tend to have thinner clumps, as opposed to *malignant* cases, which tend to have thicker clumps.

```{r cl_thickness, dpi=100, fig.width = 10, echo=FALSE}
theme_set(theme_gray(base_size = 20))
ggplot(df, aes(x=Cl.thickness))+
  geom_histogram(color="black", fill="white", binwidth = 1)+
  facet_grid(Class ~ .)
```

Thank you
========================================================

This is the end of the presentation.
11 changes: 11 additions & 0 deletions r_examples/rsconnect_shiny/README.md
@@ -0,0 +1,11 @@
# Publishing R Shiny apps from RStudio on Amazon SageMaker to RStudio Connect

[Shiny](https://shiny.rstudio.com/) is an R package that makes it easy to create interactive web applications programmatically. Data scientists commonly use Shiny applications to share their analyses and models with stakeholders. In this example, [breast-cancer-app](./breast-cancer-app), we develop a machine learning model using a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) in `breast_cancer_modeling.r` and create a web application that lets users interact with the data and the ML model.

To publish, open [breast-cancer-app/app.R](./breast-cancer-app/app.R) and click the **Publish** button. Select both `app.R` and `breast_cancer_modeling.r` for publication.

![publish-shiny-app-2](./images/publish-shiny-app-2.png)
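
The same publication step can also be scripted instead of clicked. The following is a minimal sketch, assuming the `rsconnect` package is installed and an RStudio Connect account is already registered in this R session's environment. The helper name `publish_breast_cancer_app` and its default directory are illustrative assumptions, not part of this example:

```r
# Hypothetical helper: deploy the Shiny app directory with rsconnect.
# Nothing is deployed until the function is called.
publish_breast_cancer_app <- function(app_dir = "breast-cancer-app") {
  rsconnect::deployApp(
    appDir   = app_dir,
    # Publish both files, matching the selection made in the Publish dialog.
    appFiles = c("app.R", "breast_cancer_modeling.r")
  )
}
```

Calling `publish_breast_cancer_app()` from the `rsconnect_shiny` directory would then start the deployment. Listing `appFiles` explicitly keeps locally generated `.rds` files out of the bundle.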

In the application, you can choose which features to plot and select data points in the plot to see their details along with the model's prediction of whether each case is benign or malignant. By sliding the probability threshold, you can interact with the model and get a different classification count.

![shiny-dashboard-breast-cancer2.gif](./images/shiny-dashboard-breast-cancer2.gif)
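
The effect of the probability threshold on the classification count can be sketched in a few lines. This is a simplified illustration using made-up probabilities. The vector `p_malignant` and the helper `count_by_threshold` are assumptions for this sketch, not objects from the app:

```r
# Hypothetical predicted probabilities of the malignant class
# (illustrative numbers, not model output).
p_malignant <- c(0.05, 0.30, 0.55, 0.80, 0.95)

# Count predictions on each side of a probability threshold,
# using the same comparisons the app applies to its predictions.
count_by_threshold <- function(p, threshold) {
  c(benign = sum(p < threshold), malignant = sum(p >= threshold))
}

print(count_by_threshold(p_malignant, 0.5))
print(count_by_threshold(p_malignant, 0.9))
```

Raising the threshold moves borderline cases from the malignant count to the benign count, which is what the slider in the app lets you explore interactively.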
149 changes: 149 additions & 0 deletions r_examples/rsconnect_shiny/breast-cancer-app/app.R
@@ -0,0 +1,149 @@
library(shiny)
library(caret)
library(gbm)
library(e1071)

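# Train the model and write the .rds artifacts that are read below
source('breast_cancer_modeling.r')
# Load the held-out test data, the fitted model, and the preprocessor
test_data <- readRDS('./breast_cancer_test_data.rds')
gbmFit <- readRDS('./gbm_model.rds')
preProcessor <- readRDS('./preProcessor.rds')
# Apply the training-set preprocessing, then predict class probabilities
# from the nine feature columns (column 1 is the sample Id)
test_data_transformed <- predict(preProcessor, test_data)
prediction <- predict(gbmFit, newdata = test_data_transformed[,2:10], type = "prob")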

inputs1 <- c("Clump Thickness" = "Cl.thickness",
"Uniformity of Cell Size" = "Cell.size",
"Uniformity of Cell Shape" = "Cell.shape",
"Marginal Adhesion" = "Marg.adhesion",
"Single Epithelial Cell Size" = "Epith.c.size",
"Bare Nuclei" = "Bare.nuclei",
"Bland Chromatin" = "Bl.cromatin",
"Normal Nucleoli" = "Normal.nucleoli",
"Mitoses" = "Mitoses")

inputs2 <- c("Uniformity of Cell Size" = "Cell.size",
"Clump Thickness" = "Cl.thickness",
"Uniformity of Cell Shape" = "Cell.shape",
"Marginal Adhesion" = "Marg.adhesion",
"Single Epithelial Cell Size" = "Epith.c.size",
"Bare Nuclei" = "Bare.nuclei",
"Bland Chromatin" = "Bl.cromatin",
"Normal Nucleoli" = "Normal.nucleoli",
"Mitoses" = "Mitoses")


# Define UI for the app ----
ui <- fluidPage(

  # App title ----
  titlePanel("Breast Cancer"),

  # Sidebar layout with input and output definitions ----
  sidebarLayout(

    # Sidebar panel for inputs ----
    sidebarPanel(
      # Input: Probability threshold in steps of 0.01 ----
      sliderInput("threshold", "Probability Threshold:",
                  min = 0, max = 1,
                  value = 0.5, step = 0.01),

      # Input: Selector for variable to plot on x axis ----
      selectInput("variable_x", "Variable on X:",
                  inputs1),

      # Input: Selector for variable to plot on y axis ----
      selectInput("variable_y", "Variable on Y:",
                  inputs2)
    ),

    # Main panel for displaying outputs ----
    mainPanel(

      # Output: Formatted text for caption ----
      h3(textOutput("caption")),

      # Output: Prediction counts at the current threshold ----
      tableOutput("predictions"),

      # Output: Verbatim text for data summary ----
      verbatimTextOutput("summary"),

      # Output: Formatted text for formula ----
      h3(textOutput("formula")),

      # Output: Plot of the data; brushing selects points ----
      plotOutput("scatterPlot", brush = "plot_brush"),

      # Output: Table of the brushed points ----
      tableOutput("info")

    )
  )
)

# Define server logic to plot various variables ----
server <- function(input, output) {

  # Compute the formula text ----
  # This is in a reactive expression since it is shared by the
  # output$formula and output$scatterPlot functions
  formulaText <- reactive({
    paste(input$variable_y, "~", input$variable_x)
  })

  # Count the predictions per class at the current threshold ----
  # This is in a reactive expression so that it is recomputed
  # whenever input$threshold changes
  total_count <- reactive({
    data.frame(Class = colnames(prediction),
               Count = c(sum(prediction$malignant<input$threshold),
                         sum(prediction$malignant>=input$threshold)))
  })

  # Label each test record with its predicted class ----
  # This is in a reactive expression since it is shared by the
  # output$scatterPlot and output$info functions
  threshold_proba <- reactive({
    cbind(Prediction = ifelse(prediction$malignant>=input$threshold,
                              "malignant", "benign"),
          test_data)
  })

  # Return the prediction counts ----
  output$predictions <- renderTable({
    total_count()
  })

  # Return the caption text ----
  output$caption <- renderText({
    "Breast cancer test data summary"
  })

  # Generate a summary of the test dataset ----
  # test_data is static, so this summary does not depend on any input
  output$summary <- renderPrint({
    summary(test_data)
  })

  # Return the formula text for printing ----
  output$formula <- renderText({
    formulaText()
  })

  # Generate a scatter plot of the requested variables ----
  output$scatterPlot <- renderPlot({
    plot(as.formula(formulaText()), data = threshold_proba())
  })

  # Return the test records selected by brushing the plot ----
  output$info <- renderTable({
    brushedPoints(threshold_proba(), input$plot_brush,
                  xvar = input$variable_x, yvar = input$variable_y)
  })

}

# Create Shiny app ----
shinyApp(ui, server)
47 changes: 47 additions & 0 deletions r_examples/rsconnect_shiny/breast-cancer-app/breast_cancer_modeling.r
@@ -0,0 +1,47 @@
library(caret)
library(mlbench)

data(BreastCancer)
summary(BreastCancer) #Summary of Dataset

df <- BreastCancer
# convert input values to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}

# fix the seed so that the train/test split is reproducible
set.seed(825)

# split the data into train and test and perform preprocessing
trainIndex <- createDataPartition(df$Class, p = .8,
                                  list = FALSE,
                                  times = 1)
df_train <- df[ trainIndex,]
df_test <- df[-trainIndex,]
preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute"))
df_train_transformed <- predict(preProcValues, df_train)

# train a model on df_train
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

set.seed(825)
gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11],
                method = "gbm",
                trControl = fitControl,
                ## This last option is actually one
                ## for gbm() that passes through
                verbose = FALSE,
                metric = "ROC")
gbmFit

saveRDS(preProcValues, file = './preProcessor.rds')
saveRDS(gbmFit, file = './gbm_model.rds')
saveRDS(df_test[,1:10], file = './breast_cancer_test_data.rds')
