Adding R Studio Examples (#2997)
shreyapandit authored Oct 29, 2021
1 parent 2d81c82 commit 9dd3fce
Showing 15 changed files with 461 additions and 0 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -139,6 +139,14 @@ These examples provide an introduction to SageMaker Clarify which provides machi
* [Fairness and Explainability with SageMaker Clarify](sagemaker_processing/fairness_and_explainability) shows how to use SageMaker Clarify Processor API to measure the pre-training bias of a dataset and post-training bias of a model, and explain the importance of the input features on the model's decision.
* [Amazon SageMaker Clarify Model Monitors](sagemaker_model_monitor/fairness_and_explainability) shows how to use SageMaker Clarify Model Monitor API to schedule bias monitor to monitor predictions for bias drift on a regular basis, and schedule explainability monitor to monitor predictions for feature attribution drift on a regular basis.

### Publishing content from RStudio on Amazon SageMaker to RStudio Connect

These examples show you how to author R Markdown documents, Shiny applications, and Streamlit applications on Amazon SageMaker and publish them to RStudio Connect.

- [Publishing R Markdown](r_examples/rsconnect_rmarkdown/) shows how you can author an R Markdown document (.Rmd, .Rpres) within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
- [Publishing R Shiny Apps](r_examples/rsconnect_shiny/) shows how you can author an R Shiny application within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
- [Publishing Streamlit Apps](r_examples/rsconnect_streamlit/) shows how you can author a Streamlit application within Amazon SageMaker Studio and publish to RStudio Connect for wide consumption.

### Advanced Amazon SageMaker Functionality

These examples showcase unique functionality available in Amazon SageMaker. They cover a broad range of topics and use a variety of methods, but aim to provide the user with sufficient insight or inspiration to develop within Amazon SageMaker.
37 changes: 37 additions & 0 deletions r_examples/rsconnect_rmarkdown/README.md
@@ -0,0 +1,37 @@
# Publishing R Markdown documents from RStudio on Amazon SageMaker to RStudio Connect

You can programmatically create an analysis within RStudio on Amazon SageMaker and publish it to RStudio Connect so that your collaborators can easily consume it. In this example, we use a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) to walk through two common publication use cases: R Markdown documents and R Presentation documents.

## R Markdown

R Markdown lets you run your analyses in R as part of a markdown file and share the result in RStudio Connect. In the R Markdown example in [breast_cancer_eda.Rmd](./breast_cancer_eda.Rmd) in the GitHub repo, we run two simple analyses on the dataset, a summary and a histogram, alongside text written in markdown.

````{r}
```{r breastcancer}
data(BreastCancer)
df <- BreastCancer
# convert input values to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}
summary(df)
```
```{r cl_thickness, echo=FALSE}
ggplot(df, aes(x=Cl.thickness))+
  geom_histogram(color="black", fill="white", binwidth = 1)+
  facet_grid(Class ~ .)
```
````

We can preview the file by clicking on the **Knit** button (1) and publish it to our RStudio Connect with the **Publish** button (2).
![publish-rmd](./images/publish-rmd.png)

## R Presentation

We can also run a similar analysis inline to create an R Presentation deck to share with your collaborators.
In the example in [breast_cancer_eda.Rpres](./breast_cancer_eda.Rpres) in the GitHub repo, we combine presentation layout, markdown, and R commands to create a slide deck. You can preview the slides while writing code with the **Preview** button (1). Once you are done, you can publish the deck with the **Publish** button (2) in the **Presentation** tab on the right.

![publish-rpres](./images/publish-rpres.png)

So far we have shown static content that can be published and shared on RStudio Connect from RStudio on Amazon SageMaker. Often, though, you are building an interactive application or dashboard with Shiny. To see how to publish Shiny apps from RStudio on Amazon SageMaker to RStudio Connect, continue to [Publishing R Shiny Apps](../rsconnect_shiny).
80 changes: 80 additions & 0 deletions r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rmd
@@ -0,0 +1,80 @@
---
title: "Breast Cancer data analysis"
author: "Amazon Web Services"
date: "9/7/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(mlbench)
library(ggplot2)
library(caret)
```

## Breast Cancer data summary

This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library. The dataset contains 699 observations. Each has 9 features, scored from 1 to 10, that describe the cells in a fine needle aspirate (FNA) of a breast mass. Let's look at the descriptive statistics of the dataset after converting the features to numeric format.

```{r breastcancer}
data(BreastCancer)
df <- BreastCancer
# convert input columns 2 to 10 from factor to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}
summary(df)
```

## Histogram of clump thickness by class

We want to compare the distribution of clump thickness between the two classes: *benign* and *malignant*.

```{r cl_thickness, echo=FALSE}
ggplot(df, aes(x=Cl.thickness))+
  geom_histogram(color="black", fill="white", binwidth = 1)+
  facet_grid(Class ~ .)
```

It turns out that *benign* cases tend to have thinner clumps, as opposed to *malignant* cases, which tend to have thicker clumps.

## Training a machine learning model
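Let's split the data, preprocess it (center, scale, and impute missing values with the median), and train an ML model. The training process uses 10-fold cross-validation, repeated ten times, to fit a gradient boosting model, and selects the tuning parameters that maximize the area under the ROC curve.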
```{r modeling}
# split the data into train and test and perform preprocessing
trainIndex <- createDataPartition(df$Class, p = .8,
                                  list = FALSE,
                                  times = 1)
df_train <- df[ trainIndex,]
df_test <- df[-trainIndex,]
preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute"))
df_train_transformed <- predict(preProcValues, df_train)
# train a model on df_train
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
set.seed(825)
gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11],
                method = "gbm",
                trControl = fitControl,
                ## This last option is actually one
                ## for gbm() that passes through
                verbose = FALSE,
                metric = "ROC")
```

We can see the feature importance based on the algorithm.
```{r featureimportance, echo=FALSE}
summary(gbmFit)
```

This is the end of a simple analysis and plotting in an R Markdown file. We developed it in RStudio Workbench in Amazon SageMaker and will publish it to an RStudio Connect server.
49 changes: 49 additions & 0 deletions r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rpres
@@ -0,0 +1,49 @@
Breast Cancer data analysis
========================================================
author: Amazon Web Services
date: 09/07/2021
autosize: true

Dataset
========================================================

This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library.

The dataset contains 699 observations. Each has 9 features, scored from 1 to 10, that describe the cells in a fine needle aspirate (FNA) of a breast mass.

Descriptive Statistics
========================================================

There is a class imbalance between the *benign* and *malignant* cases. Summary statistics are shown below.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(mlbench)
library(ggplot2)
```

```{r breastcancer, echo=FALSE}
data(BreastCancer)
df <- BreastCancer
# convert input columns 2 to 10 from factor to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}
summary(df)
```

Thicker clumps in malignant cases
========================================================

It turns out that *benign* cases tend to have thinner clumps, as opposed to *malignant* cases, which tend to have thicker clumps.

```{r cl_thickness, dpi=100, fig.width = 10, echo=FALSE}
theme_set(theme_gray(base_size = 20))
ggplot(df, aes(x=Cl.thickness))+
  geom_histogram(color="black", fill="white", binwidth = 1)+
  facet_grid(Class ~ .)
```

Thank you
========================================================

This is the end of the presentation.
11 changes: 11 additions & 0 deletions r_examples/rsconnect_shiny/README.md
@@ -0,0 +1,11 @@
# Publishing R Shiny apps from RStudio on Amazon SageMaker to RStudio Connect

[Shiny](https://shiny.rstudio.com/) is an R package that makes it easy to create interactive web applications programmatically. Data scientists often use Shiny applications to share their analyses and models with stakeholders. In this example, [breast-cancer-app](./breast-cancer-app), we develop a machine learning model using a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) in `breast_cancer_modeling.r` and create a web application that lets users interact with the data and the ML model.

To publish, open [breast-cancer-app/app.R](./breast-cancer-app/app.R) and click the **Publish** button. Select both `app.R` and `breast_cancer_modeling.r` for publication.

![publish-shiny-app-2](./images/publish-shiny-app-2.png)

In the application, you can choose which features to plot, then select data points in the plot to see their details and the model's prediction of whether each case is benign or malignant. Moving the probability threshold slider changes how the predictions are classified and updates the classification counts.

![shiny-dashboard-breast-cancer2.gif](./images/shiny-dashboard-breast-cancer2.gif)
149 changes: 149 additions & 0 deletions r_examples/rsconnect_shiny/breast-cancer-app/app.R
@@ -0,0 +1,149 @@
library(shiny)
library(caret)
library(gbm)
library(e1071)

source('breast_cancer_modeling.r')
test_data <- readRDS('./breast_cancer_test_data.rds')
gbmFit <- readRDS('./gbm_model.rds')
preProcessor <- readRDS('./preProcessor.rds')
test_data_transformed <- predict(preProcessor, test_data)
prediction <- predict(gbmFit, newdata = test_data_transformed[,2:10], type = "prob")

inputs1 <- c("Clump Thickness" = "Cl.thickness",
             "Uniformity of Cell Size" = "Cell.size",
             "Uniformity of Cell Shape" = "Cell.shape",
             "Marginal Adhesion" = "Marg.adhesion",
             "Single Epithelial Cell Size" = "Epith.c.size",
             "Bare Nuclei" = "Bare.nuclei",
             "Bland Chromatin" = "Bl.cromatin",
             "Normal Nucleoli" = "Normal.nucleoli",
             "Mitoses" = "Mitoses")

inputs2 <- c("Uniformity of Cell Size" = "Cell.size",
             "Clump Thickness" = "Cl.thickness",
             "Uniformity of Cell Shape" = "Cell.shape",
             "Marginal Adhesion" = "Marg.adhesion",
             "Single Epithelial Cell Size" = "Epith.c.size",
             "Bare Nuclei" = "Bare.nuclei",
             "Bland Chromatin" = "Bl.cromatin",
             "Normal Nucleoli" = "Normal.nucleoli",
             "Mitoses" = "Mitoses")


# Define UI for the app ----
ui <- fluidPage(

  # App title ----
  titlePanel("Breast Cancer"),

  # Sidebar layout with input and output definitions ----
  sidebarLayout(

    # Sidebar panel for inputs ----
    sidebarPanel(
      # Input: Decimal interval with step value ----
      sliderInput("threshold", "Probability Threshold:",
                  min = 0, max = 1,
                  value = 0.5, step = 0.01),

      # Input: Selector for variable to plot on x axis ----
      selectInput("variable_x", "Variable on X:",
                  inputs1),

      # Input: Selector for variable to plot on y axis ----
      selectInput("variable_y", "Variable on Y:",
                  inputs2)
    ),

    # Main panel for displaying outputs ----
    mainPanel(

      # Output: Formatted text for caption ----
      h3(textOutput("caption")),

      # Output: prediction outcome
      tableOutput("predictions"),

      # Output: Verbatim text for data summary ----
      verbatimTextOutput("summary"),

      # Output: Formatted text for formula ----
      h3(textOutput("formula")),

      # Output: Plot of the data; brushing selects points ----
      plotOutput("scatterPlot", brush = "plot_brush"),

      # Output: details of the brushed points
      tableOutput("info")

    )
  )
)

# Define server logic to plot various variables ----
server <- function(input, output) {

  # Compute the formula text ----
  # This is in a reactive expression since it is shared by the
  # output$formula and output$scatterPlot functions
  formulaText <- reactive({
    paste(input$variable_y, "~", input$variable_x)
  })

  # Count the predictions per class at the selected threshold ----
  total_count <- reactive({
    data.frame(Class = colnames(prediction),
               Count = c(sum(prediction$malignant<input$threshold),
                         sum(prediction$malignant>=input$threshold)))
  })

  # Label each test row with its predicted class at the selected threshold ----
  threshold_proba <- reactive({
    cbind(Prediction = ifelse(prediction$malignant>=input$threshold,
                              "malignant", "benign"),
          test_data)
  })

  # Return the prediction counts ----
  output$predictions <- renderTable({
    total_count()
  })

  # Return the caption text ----
  output$caption <- renderText({
    "Breast cancer test data summary"
  })

  # Generate a summary of the test dataset ----
  output$summary <- renderPrint({
    summary(test_data)
  })

  # Return the formula text for printing ----
  output$formula <- renderText({
    formulaText()
  })

  # Generate a scatter plot of the requested variables ----
  output$scatterPlot <- renderPlot({
    plot(as.formula(formulaText()), data = threshold_proba())
  })

  # Show the rows for the points selected with the brush ----
  output$info <- renderTable({
    brushedPoints(threshold_proba(), input$plot_brush,
                  xvar = input$variable_x, yvar = input$variable_y)
  })

}

# Create Shiny app ----
shinyApp(ui, server)
@@ -0,0 +1,47 @@
library(caret)
library(mlbench)

data(BreastCancer)
summary(BreastCancer) #Summary of Dataset

df <- BreastCancer
# convert input values to numeric
for(i in 2:10) {
  df[,i] <- as.numeric(as.character(df[,i]))
}

# split the data into train and test and perform preprocessing
trainIndex <- createDataPartition(df$Class, p = .8,
                                  list = FALSE,
                                  times = 1)
df_train <- df[ trainIndex,]
df_test <- df[-trainIndex,]
preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute"))
df_train_transformed <- predict(preProcValues, df_train)

# train a model on df_train
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

set.seed(825)
gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11],
                method = "gbm",
                trControl = fitControl,
                ## This last option is actually one
                ## for gbm() that passes through
                verbose = FALSE,
                metric = "ROC")
gbmFit

saveRDS(preProcValues, file = './preProcessor.rds')
saveRDS(gbmFit, file = './gbm_model.rds')
saveRDS(df_test[,1:10], file = './breast_cancer_test_data.rds')

26 changes: 26 additions & 0 deletions r_examples/rsconnect_streamlit/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Publishing Streamlit apps from Amazon SageMaker Studio to RStudio Connect

[Streamlit](https://docs.streamlit.io/en/latest/index.html) is an open source Python project that makes it easy for machine learning and data science developers to create web applications. In this example, we develop a simple Streamlit application in `app.py` for interactive data exploration and visualization of the [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29). Let’s see how we can deploy this app to RStudio Connect from SageMaker Studio.

1. Before we proceed, create an API key from your RStudio Connect account. Follow the [instructions](https://docs.rstudio.com/connect/user/api-keys/#api-keys-creating) to create one, and save it, because the publication process requires it.
1. Open a system terminal in SageMaker Studio in **File**->**New**->**Terminal**.
1. Install the [`rsconnect-python`](https://github.com/rstudio/rsconnect-python) library in the terminal.

   ```bash
   pip install rsconnect-python
   ```

1. Deploy the app using the `rsconnect deploy` command in the terminal. In the command, specify the deploy type, the server URL, an API key for the RStudio Connect account, and the path to the folder containing `app.py` and `requirements.txt`. Note that the Python libraries the app requires must be listed in [`requirements.txt`](./breast-cancer-streamlit-app/requirements.txt) for deployment.

   ```bash
   rsconnect deploy streamlit \
       --server https://xxxx.rstudioconnect.com/ \
       --api-key <your-rstudio-connect-api-key> \
       /path/to/breast-cancer-streamlit-app/
   ```
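If you prefer not to type the API key on the command line, the same command can be assembled from environment variables. The following is a minimal sketch, not part of the example repository. The variable names `CONNECT_SERVER` and `CONNECT_API_KEY` and both helper functions are this sketch's own choices, and the only `rsconnect` flags it uses are the ones shown in the command above.

```python
import os
import shlex


def build_deploy_command(app_dir, env=os.environ):
    """Return the `rsconnect deploy streamlit` argument list for app_dir.

    The server URL and API key are read from environment variables whose
    names are an assumption of this sketch.
    """
    server = env.get("CONNECT_SERVER")
    api_key = env.get("CONNECT_API_KEY")
    if not server or not api_key:
        raise ValueError("Set CONNECT_SERVER and CONNECT_API_KEY first")
    return [
        "rsconnect", "deploy", "streamlit",
        "--server", server,
        "--api-key", api_key,
        app_dir,
    ]


def redact(command):
    """Return a printable copy of the command with the API key masked."""
    shown = list(command)
    shown[shown.index("--api-key") + 1] = "****"
    return shlex.join(shown)
```

To run the deployment, pass the list to `subprocess.run(build_deploy_command("/path/to/breast-cancer-streamlit-app/"), check=True)`, and log `redact(...)` rather than the raw command so the key does not end up in terminal output.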

At the end of the execution, you should see URLs for the app in RStudio Connect. You can open the URL to see the published Streamlit app.

![streamlit-app-in-action.gif](./images/streamlit-app-in-action.gif)
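Because the deployment relies on `requirements.txt` listing every library that `app.py` imports, it can help to compare the two before deploying. The sketch below is one possible check, not part of the example repository. Both function names are hypothetical, and it assumes each import name equals its pip package name, which holds for the four libraries used here but not in general (for example, `sklearn` is installed as `scikit-learn`).

```python
import ast
import re
import sys


def imported_modules(source):
    """Return the top-level module names imported by Python source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def missing_requirements(source, requirements_text):
    """Return imported modules that the requirements text does not list."""
    listed = set()
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and blank lines
        if line:
            # Keep only the name from entries such as "pandas>=1.0".
            listed.add(re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0].lower())
    # sys.stdlib_module_names exists on Python 3.10+; on older versions
    # standard-library imports are reported as missing too.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(
        name for name in imported_modules(source)
        if name not in stdlib and name.lower() not in listed
    )
```

Calling `missing_requirements(Path("app.py").read_text(), Path("requirements.txt").read_text())` with `pathlib.Path` from inside the app folder should return an empty list when every imported library is listed.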

Note that [RStudio Connect insists on matching `<MAJOR.MINOR>` versions of Python](https://github.com/rstudio/rsconnect-python#deploying-python-content-to-rstudio-connect). Make sure that your RStudio Connect instance has a Python version compatible with the one available on the notebook kernel. You can check the local Python version in the SageMaker Studio terminal with `python --version`. You can check the versions available on RStudio Connect on the **Documentation** page of your RStudio Connect instance, for example, `https://xxxx.rstudioconnect.com/connect/#/help/docs`.
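The `<MAJOR.MINOR>` comparison described above can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, and the list of server versions is assumed to be copied by hand from the **Documentation** page, since this sketch does not query RStudio Connect.

```python
import sys


def major_minor(version_info=sys.version_info):
    """Return the MAJOR.MINOR part of a Python version, e.g. "3.8"."""
    return f"{version_info[0]}.{version_info[1]}"


def is_compatible(local, server_versions):
    """Return True if any server version shares the local MAJOR.MINOR."""
    return any(
        ".".join(version.split(".")[:2]) == local
        for version in server_versions
    )
```

For example, `is_compatible(major_minor(), ["3.8.12", "3.9.5"])` tells you whether the Python in your SageMaker Studio terminal matches one of the versions you listed for the server.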
50 changes: 50 additions & 0 deletions r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/app.py
@@ -0,0 +1,50 @@
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import streamlit as st

st.title("Breast Cancer Analysis")
data_url = "https://sagemaker-sample-files.s3.amazonaws.com/datasets/tabular/breast_cancer/breast-cancer-wisconsin.csv"
columns = [
"Id",
"Cl.thickness",
"Cell.size",
"Cell.shape",
"Marg.adhesion",
"Epith.c.size",
"Bare.nuclei",
"Bl.cromatin",
"Normal.nucleoli",
"Mitoses",
"Class",
]


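# Cache the download so widget interactions do not refetch the CSV.
# Newer Streamlit releases deprecate st.cache in favor of st.cache_data.
@st.cache
def load_data():
    df = pd.read_csv(data_url, names=columns)
    df["Class"] = df["Class"].replace(to_replace=[2, 4], value=["benign", "malignant"])
    return df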


data_load_state = st.text("Loading data...")
data = load_data()
data_load_state.text("")

column = st.selectbox("Features", columns[1:-1])

st.subheader("Histogram of %s by diagnosis type" % column)
fig, ax = plt.subplots(
    1,
    2,
    sharex=True,
    sharey=True,
)
data.hist(column=column, by=columns[-1], layout=(1, 2), ax=ax)
st.pyplot(fig)

if st.checkbox("Show raw data"):
    st.subheader("Raw data")
    st.write(data[[column, columns[-1]]])

st.markdown(f"Source: <{data_url}>")
@@ -0,0 +1,4 @@
matplotlib
pandas
numpy
streamlit