Adding R Studio Examples (aws#2997)

hchings · Oct 29, 2021 · 9dd3fce
1 parent 2d81c82

Showing 15 changed files with 461 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -139,6 +139,14 @@ These examples provide an introduction to SageMaker Clarify which provides machi
 * [Fairness and Explainability with SageMaker Clarify](sagemaker_processing/fairness_and_explainability) shows how to use SageMaker Clarify Processor API to measure the pre-training bias of a dataset and post-training bias of a model, and explain the importance of the input features on the model's decision.
 * [Amazon SageMaker Clarify Model Monitors](sagemaker_model_monitor/fairness_and_explainability) shows how to use SageMaker Clarify Model Monitor API to schedule bias monitor to monitor predictions for bias drift on a regular basis, and schedule explainability monitor to monitor predictions for feature attribution drift on a regular basis.
 
+### Publishing content from RStudio on Amazon SageMaker to RStudio Connect
+
+These examples show how to run R examples in RStudio on Amazon SageMaker and publish applications from there to RStudio Connect.
+
+- [Publishing R Markdown](r_examples/rsconnect_rmarkdown/) shows how you can author an R Markdown document (.Rmd, .Rpres) within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
+- [Publishing R Shiny Apps](r_examples/rsconnect_shiny/) shows how you can author an R Shiny application within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
+- [Publishing Streamlit Apps](r_examples/rsconnect_streamlit/) shows how you can author a Streamlit application within Amazon SageMaker Studio and publish to RStudio Connect for wide consumption.
+
 ### Advanced Amazon SageMaker Functionality
 
 These examples that showcase unique functionality available in Amazon SageMaker. They cover a broad range of topics and will utilize a variety of methods, but aim to provide the user with sufficient insight or inspiration to develop within Amazon SageMaker.

diff --git a/r_examples/rsconnect_rmarkdown/README.md b/r_examples/rsconnect_rmarkdown/README.md
@@ -0,0 +1,37 @@
+# Publishing R Markdown documents from RStudio on Amazon SageMaker to RStudio Connect
+
+You can programmatically create an analysis within RStudio on Amazon SageMaker and publish it to RStudio Connect so that your collaborators can easily consume it. In this example, we use a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) to walk through two common publication use cases: R Markdown and R Presentation documents.
+
+## R Markdown
+
+R Markdown lets you run your R analyses inside a Markdown file and share the result on RStudio Connect. In the example [breast_cancer_eda.Rmd](./breast_cancer_eda.Rmd) in the GitHub repo, we run two simple analyses on the dataset and plot the results alongside Markdown text.
+
+```{r}
+    ```{r breastcancer}
+    data(BreastCancer)
+    df <- BreastCancer
+    # convert input columns 2 to 10 from factor to numeric
+    for(i in 2:10) {
+        df[,i] <- as.numeric(as.character(df[,i]))
+    }
+    summary(df)
+    ```
+
+    ```{r cl_thickness, echo=FALSE}
+    ggplot(df, aes(x=Cl.thickness))+
+        geom_histogram(color="black", fill="white", binwidth = 1)+
+        facet_grid(Class ~ .)
+    ```
+```
+
+We can preview the file by clicking on the **Knit** button (1) and publish it to our RStudio Connect with the **Publish** button (2).
+![publish-rmd](./images/publish-rmd.png)
+
+## R Presentation
+
+We can also run a similar analysis inline to create an R Presentation deck and publish it for your collaborators.
+In the example [breast_cancer_eda.Rpres](./breast_cancer_eda.Rpres) in the GitHub repo, we combine presentation markup, Markdown, and R commands to create a slide deck. You can preview the slides while writing code with the **Preview** button (1). Once you are done, you can publish the deck with the **Publish** button (2) in the **Presentation** tab on the right.
+
+![publish-rpres](./images/publish-rpres.png)
+
+So far we have shown static work that can be published and shared on RStudio Connect from RStudio on Amazon SageMaker. More often than not, you are building an interactive application or dashboard with Shiny. Let’s take a look at how we can publish Shiny apps from RStudio on Amazon SageMaker to RStudio Connect in [Publishing R Shiny Apps](../rsconnect_shiny).
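The **Publish** button has a scripted counterpart in the `rsconnect` package. The following is a minimal sketch rather than part of this commit: the helper name `publish_document`, the server name, the account, and the `CONNECT_API_KEY` environment variable are all hypothetical placeholders, and it assumes `rsconnect` is installed and that your RStudio Connect server issues API keys. Depending on the `rsconnect` version, the server URL may need to point at the server's API endpoint.

```r
# Hypothetical helper: publish an R Markdown document to RStudio Connect.
# All default values below are placeholders, not values from this repository.
publish_document <- function(doc,
                             server_url,
                             server_name = "my-connect",
                             account = "my-user",
                             api_key = Sys.getenv("CONNECT_API_KEY")) {
  # Fail early, before any network call is attempted
  if (!file.exists(doc)) stop("Document not found: ", doc)
  if (!nzchar(api_key)) stop("Set CONNECT_API_KEY before publishing.")

  # Register the server and authenticate with an API key
  rsconnect::addServer(url = server_url, name = server_name)
  rsconnect::connectApiUser(account = account, server = server_name,
                            apiKey = api_key)

  # Render and deploy the document
  rsconnect::deployDoc(doc, server = server_name, account = account)
}
```

A call would look like `publish_document("breast_cancer_eda.Rmd", "https://connect.example.com")`, where the URL is again only an illustration.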
diff --git a/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rmd b/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rmd
@@ -0,0 +1,80 @@
+---
+title: "Breast Cancer data analysis"
+author: "Amazon Web Services"
+date: "9/7/2021"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+library(mlbench)
+library(ggplot2)
+library(caret)
+```
+
+## Breast Cancer data summary
+
+This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library. The data contains 699 observations. Nine features, each graded on a scale from 1 to 10, describe the characteristics of cells in a fine needle aspirate (FNA) of a breast mass. Let's look at the descriptive statistics of the dataset after converting the features to numeric format.
+
+```{r breastcancer}
+data(BreastCancer)
+df <- BreastCancer
+# convert input columns 2 to 10 from factor to numeric
+for(i in 2:10) {
+  df[,i] <- as.numeric(as.character(df[,i]))
+}
+summary(df)
+```
+
+## Histogram of clump thickness by class
+
+We want to compare the distribution of clump thickness between the two classes: *benign* and *malignant*.
+
+```{r cl_thickness, echo=FALSE}
+ggplot(df, aes(x=Cl.thickness))+
+       geom_histogram(color="black", fill="white", binwidth = 1)+
+       facet_grid(Class ~ .)
+```
+
+It turns out that *benign* cases tend to have thinner clumps, whereas *malignant* cases tend to have thicker clumps.
+
+## Training a machine learning model
+Let's split the data, standardize it, and train an ML model. Training uses 10-fold cross-validation, repeated 10 times, with a gradient boosting model tuned for the area under the ROC curve.
+```{r modeling}
+# split the data into train and test and perform preprocessing
+trainIndex <- createDataPartition(df$Class, p = .8, 
+                                  list = FALSE, 
+                                  times = 1)
+df_train <- df[ trainIndex,]
+df_test  <- df[-trainIndex,]
+preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute"))
+df_train_transformed <- predict(preProcValues, df_train)
+
+# train a model on df_train
+fitControl <- trainControl(## 10-fold CV
+  method = "repeatedcv",
+  number = 10,
+  ## repeated ten times
+  repeats = 10,
+  ## Estimate class probabilities
+  classProbs = TRUE,
+  ## Evaluate performance using 
+  ## the following function
+  summaryFunction = twoClassSummary)
+
+set.seed(825)
+gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11], 
+                method = "gbm", 
+                trControl = fitControl,
+                ## This last option is actually one
+                ## for gbm() that passes through
+                verbose = FALSE,
+                metric = "ROC")
+```
+
+We can see the feature importance based on the algorithm.
+```{r featureimportance, echo=FALSE}
+summary(gbmFit)
+```
+
+This is the end of a simple analysis with plots in an R Markdown file. We develop it in RStudio Workbench on Amazon SageMaker and will publish it to an RStudio Connect server.
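The `trainControl` settings in the modeling chunk (`method = "repeatedcv"`, `number = 10`, `repeats = 10`) mean that each candidate model is fitted on 10 x 10 = 100 resamples. The base-R sketch below illustrates the idea only. It is not caret's implementation: it does not stratify folds by class, and the function name `make_repeated_folds` and the row count of 560 are illustrative assumptions.

```r
# Illustration only: assign each of n rows to one of k folds, `repeats` times.
# Unlike caret, this sketch does not balance the classes within each fold.
make_repeated_folds <- function(n, k = 10, repeats = 10, seed = 825) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    # Shuffle a balanced vector of fold labels 1..k
    sample(rep(seq_len(k), length.out = n))
  })
}

folds <- make_repeated_folds(n = 560)

# One vector of fold labels per repeat
n_repeats <- length(folds)

# In each repeat, every fold serves once as the held-out set
fold_sizes <- table(folds[[1]])

# Total number of train/validate resamples per candidate model
n_resamples <- n_repeats * length(fold_sizes)
```

With these settings `n_resamples` is 100, which is why repeated cross-validation with `gbm` can take noticeably longer than a single 10-fold run.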
diff --git a/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rpres b/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rpres
@@ -0,0 +1,49 @@
+Breast Cancer data analysis
+========================================================
+author: Amazon Web Services
+date: 09/07/2021
+autosize: true
+
+Dataset
+========================================================
+
+This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library.
+
+The data contains 699 observations. Nine features, each graded on a scale from 1 to 10, describe the characteristics of cells in a fine needle aspirate (FNA) of a breast mass.
+
+Descriptive Statistics
+========================================================
+
+The summary statistics below show a class imbalance between the *benign* and *malignant* cases.
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+library(mlbench)
+library(ggplot2)
+```
+
+```{r breastcancer, echo=FALSE}
+data(BreastCancer)
+df <- BreastCancer
+# convert input columns 2 to 10 from factor to numeric
+for(i in 2:10) {
+  df[,i] <- as.numeric(as.character(df[,i]))
+}
+summary(df)
+```
+
+Thicker clumps in malignant cases
+========================================================
+
+It turns out that *benign* cases tend to have thinner clumps, whereas *malignant* cases tend to have thicker clumps.
+
+```{r cl_thickness, dpi=100, fig.width = 10, echo=FALSE}
+theme_set(theme_gray(base_size = 20))
+ggplot(df, aes(x=Cl.thickness))+
+       geom_histogram(color="black", fill="white", binwidth = 1)+
+       facet_grid(Class ~ .)
+```
+
+Thank you
+========================================================
+
+This is the end of the presentation.
diff --git a/r_examples/rsconnect_rmarkdown/images/publish-rmd.png b/r_examples/rsconnect_rmarkdown/images/publish-rmd.png
diff --git a/r_examples/rsconnect_rmarkdown/images/publish-rpres.png b/r_examples/rsconnect_rmarkdown/images/publish-rpres.png
diff --git a/r_examples/rsconnect_shiny/README.md b/r_examples/rsconnect_shiny/README.md
@@ -0,0 +1,11 @@
+# Publishing R Shiny apps from RStudio on Amazon SageMaker to RStudio Connect
+
+[Shiny](https://shiny.rstudio.com/) is an R package that makes it easy to create interactive web applications programmatically. Data scientists often use Shiny applications to share their analyses and models with stakeholders. In this example, [breast-cancer-app](./breast-cancer-app), we develop a machine learning model using a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) in `breast_cancer_modeling.r` and create a web application that lets users interact with the data and the ML model.
+
+To publish, open [breast-cancer-app/app.R](./breast-cancer-app/app.R) and click the **Publish** button. Select both `app.R` and `breast_cancer_modeling.r` for publication.
+
+![publish-shiny-app-2](./images/publish-shiny-app-2.png)
+
+In the application, you can choose which features to plot and select data points in the plot to see their details, including whether the model predicts them as benign or malignant. By sliding the probability threshold, you can interact with the model and get a different classification count.
+
+![shiny-dashboard-breast-cancer2.gif](./images/shiny-dashboard-breast-cancer2.gif)
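The effect of the probability threshold slider can be reproduced outside Shiny. The sketch below uses made-up probabilities, not output from the model in this example, and the helper name `count_by_threshold` is hypothetical. It applies the same rule as `app.R`: a case is labeled malignant when its predicted probability of malignancy is at or above the threshold.

```r
# Hypothetical helper mirroring the app's thresholding rule.
count_by_threshold <- function(p_malignant, threshold = 0.5) {
  stopifnot(threshold >= 0, threshold <= 1)
  label <- ifelse(p_malignant >= threshold, "malignant", "benign")
  # Fix the factor levels so both classes appear even when one count is zero
  table(factor(label, levels = c("benign", "malignant")))
}

# Made-up predicted probabilities of malignancy for six cases
probs <- c(0.05, 0.20, 0.45, 0.55, 0.80, 0.95)

counts_default <- count_by_threshold(probs, 0.5)  # 3 benign, 3 malignant
counts_strict  <- count_by_threshold(probs, 0.9)  # 5 benign, 1 malignant
```

Raising the threshold moves borderline cases from malignant to benign, which is the change in counts you see in the app's prediction table.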
diff --git a/r_examples/rsconnect_shiny/breast-cancer-app/app.R b/r_examples/rsconnect_shiny/breast-cancer-app/app.R
@@ -0,0 +1,149 @@
+library(shiny)
+library(caret)
+library(gbm)
+library(e1071)
+
+source('breast_cancer_modeling.r')
+test_data <- readRDS('./breast_cancer_test_data.rds')
+gbmFit <- readRDS('./gbm_model.rds')
+preProcessor <- readRDS('./preProcessor.rds')
+test_data_transformed <- predict(preProcessor, test_data)
+prediction <- predict(gbmFit, newdata = test_data_transformed[,2:10], type = "prob")
+
+inputs1 <- c("Clump Thickness" = "Cl.thickness",
+             "Uniformity of Cell Size" = "Cell.size",
+             "Uniformity of Cell Shape" = "Cell.shape",
+             "Marginal Adhesion" = "Marg.adhesion",
+             "Single Epithelial Cell Size" = "Epith.c.size",
+             "Bare Nuclei" = "Bare.nuclei",
+             "Bland Chromatin" = "Bl.cromatin",
+             "Normal Nucleoli" = "Normal.nucleoli",
+             "Mitoses" = "Mitoses")
+
+inputs2 <- c("Uniformity of Cell Size" = "Cell.size",
+             "Clump Thickness" = "Cl.thickness",
+             "Uniformity of Cell Shape" = "Cell.shape",
+             "Marginal Adhesion" = "Marg.adhesion",
+             "Single Epithelial Cell Size" = "Epith.c.size",
+             "Bare Nuclei" = "Bare.nuclei",
+             "Bland Chromatin" = "Bl.cromatin",
+             "Normal Nucleoli" = "Normal.nucleoli",
+             "Mitoses" = "Mitoses")
+
+
+# Define UI for the app ----
+ui <- fluidPage(
+
+  # App title ----
+  titlePanel("Breast Cancer"),
+
+  # Sidebar layout with input and output definitions ----
+  sidebarLayout(
+
+    # Sidebar panel for inputs ----
+    sidebarPanel(
+      # Input: Decimal interval with step value ----
+      sliderInput("threshold", "Probability Threshold:",
+                  min = 0, max = 1,
+                  value = 0.5, step = 0.01),
+
+      # Input: Selector for variable to plot on x axis ----
+      selectInput("variable_x", "Variable on X:",
+                  inputs1),
+
+      # Input: Selector for variable to plot on y axis ----
+      selectInput("variable_y", "Variable on Y:",
+                  inputs2)
+    ),
+
+    # Main panel for displaying outputs ----
+    mainPanel(
+
+      # Output: Formatted text for caption ----
+      h3(textOutput("caption")),
+
+      # Output: prediction outcome
+      tableOutput("predictions"),
+
+      # Output: Verbatim text for data summary ----
+      verbatimTextOutput("summary"),
+
+      # Output: Formatted text for formula ----
+      h3(textOutput("formula")),
+
+      # Output: Plot of the data ----
+      # Drag on the plot to brush (select) a region of points
+      plotOutput("scatterPlot", brush = "plot_brush"),
+
+      # Output: details of the brushed points
+      tableOutput("info")
+
+    )
+  )
+)
+
+# Define server logic to plot various variables ----
+server <- function(input, output) {
+
+  # Compute the formula text ----
+  # This is in a reactive expression since it is shared by the
+  # output$formula and output$scatterPlot functions
+  formulaText <- reactive({
+    paste(input$variable_y, "~", input$variable_x)
+  })
+
+  # Count predictions per class ----
+  # Cases with a malignant probability below the threshold count as
+  # benign; the rest count as malignant
+  total_count <- reactive({
+    data.frame(Class = colnames(prediction),
+               Count = c(sum(prediction$malignant<input$threshold),
+                         sum(prediction$malignant>=input$threshold)))
+  })
+
+  # Label each test case ----
+  # Prepend the thresholded prediction to the test data
+  threshold_proba <- reactive({
+    cbind(Prediction = ifelse(prediction$malignant>=input$threshold, 
+                              "malignant", "benign"),
+          test_data)
+  })
+
+  # return prediction summary
+  output$predictions <- renderTable({
+    total_count()
+  })
+
+  # Return a static caption for the data summary ----
+  output$caption <- renderText({
+    "Breast cancer test data summary"
+  })
+
+  # Generate a summary of the test dataset ----
+  # The output$summary depends only on the static test_data
+  # object, so it is rendered once and is not re-executed
+  # when the inputs change
+  output$summary <- renderPrint({
+    summary(test_data)
+  })
+
+  # Return the formula text for printing as a caption ----
+  output$formula <- renderText({
+    formulaText()
+  })
+
+  # Generate a scatter plot of the requested variables ----
+  # using the test data labeled at the current threshold
+  output$scatterPlot <- renderPlot({
+    plot(as.formula(formulaText()), data = threshold_proba())
+  })
+
+  output$info <- renderTable({
+    brushedPoints(threshold_proba(), input$plot_brush, 
+                  xvar = input$variable_x, yvar = input$variable_y)
+  })
+
+}
+
+# Create Shiny app ----
+shinyApp(ui, server)
diff --git a/r_examples/rsconnect_shiny/breast-cancer-app/breast_cancer_modeling.r b/r_examples/rsconnect_shiny/breast-cancer-app/breast_cancer_modeling.r
@@ -0,0 +1,47 @@
+library(caret)
+library(mlbench)
+
+data(BreastCancer)
+summary(BreastCancer)  # summary of the dataset
+
+df <- BreastCancer
+# convert input columns 2 to 10 from factor to numeric
+for(i in 2:10) {
+  df[,i] <- as.numeric(as.character(df[,i]))
+}
+
+# split the data into train and test and perform preprocessing
+trainIndex <- createDataPartition(df$Class, p = .8, 
+                                  list = FALSE, 
+                                  times = 1)
+df_train <- df[ trainIndex,]
+df_test  <- df[-trainIndex,]
+preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute"))
+df_train_transformed <- predict(preProcValues, df_train)
+
+# train a model on df_train
+fitControl <- trainControl(## 10-fold CV
+  method = "repeatedcv",
+  number = 10,
+  ## repeated ten times
+  repeats = 10,
+  ## Estimate class probabilities
+  classProbs = TRUE,
+  ## Evaluate performance using 
+  ## the following function
+  summaryFunction = twoClassSummary)
+
+set.seed(825)
+gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11], 
+                method = "gbm", 
+                trControl = fitControl,
+                ## This last option is actually one
+                ## for gbm() that passes through
+                verbose = FALSE,
+                metric = "ROC")
+gbmFit
+
+saveRDS(preProcValues, file = './preProcessor.rds')
+saveRDS(gbmFit, file = './gbm_model.rds')
+saveRDS(df_test[,1:10], file = './breast_cancer_test_data.rds')
+
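The script saves `preProcValues` so that the Shiny app can apply the same transformation to the test data. The key point is that the centering, scaling, and imputation values are learned from the training rows only and then reused on new rows. The base-R sketch below illustrates that idea on toy data. It is a simplification, not caret's `preProcess` internals, and the names `fit_preprocessor` and `apply_preprocessor` are hypothetical.

```r
# Illustration only: learn preprocessing statistics from training data.
fit_preprocessor <- function(train) {
  list(center = vapply(train, mean, numeric(1), na.rm = TRUE),
       scale  = vapply(train, sd, numeric(1), na.rm = TRUE),
       median = vapply(train, median, numeric(1), na.rm = TRUE))
}

# Apply the stored training statistics to any data frame with the same columns.
apply_preprocessor <- function(pp, newdata) {
  for (col in names(pp$center)) {
    x <- newdata[[col]]
    x[is.na(x)] <- pp$median[[col]]  # fill gaps with the training median
    newdata[[col]] <- (x - pp$center[[col]]) / pp$scale[[col]]
  }
  newdata
}

# Toy data, not the breast cancer dataset
train <- data.frame(a = c(1, 2, 3, NA, 4), b = c(10, 20, 30, 40, 50))
test  <- data.frame(a = c(NA, 5), b = c(30, 60))

pp  <- fit_preprocessor(train)
out <- apply_preprocessor(pp, test)
```

Because `test` is transformed with the statistics stored in `pp`, no information from the test rows leaks into the preprocessing, which is the same reason the app loads `preProcessor.rds` instead of refitting on the test data.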
diff --git a/r_examples/rsconnect_shiny/images/publish-shiny-app-2.png b/r_examples/rsconnect_shiny/images/publish-shiny-app-2.png
diff --git a/r_examples/rsconnect_shiny/images/shiny-dashboard-breast-cancer2.gif b/r_examples/rsconnect_shiny/images/shiny-dashboard-breast-cancer2.gif