diff --git a/README.md b/README.md index 7a1b1acffe..84369ff0cd 100644 --- a/README.md +++ b/README.md @@ -139,6 +139,14 @@ These examples provide an introduction to SageMaker Clarify which provides machi * [Fairness and Explainability with SageMaker Clarify](sagemaker_processing/fairness_and_explainability) shows how to use SageMaker Clarify Processor API to measure the pre-training bias of a dataset and post-training bias of a model, and explain the importance of the input features on the model's decision. * [Amazon SageMaker Clarify Model Monitors](sagemaker_model_monitor/fairness_and_explainability) shows how to use SageMaker Clarify Model Monitor API to schedule bias monitor to monitor predictions for bias drift on a regular basis, and schedule explainability monitor to monitor predictions for feature attribution drift on a regular basis. +### Publishing content from RStudio on Amazon SageMaker to RStudio Connect + +These examples show you how to run R examples and publish applications from RStudio on Amazon SageMaker to RStudio Connect. + +- [Publishing R Markdown](r_examples/rsconnect_rmarkdown/) shows how you can author an R Markdown document (.Rmd, .Rpres) within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption. +- [Publishing R Shiny Apps](r_examples/rsconnect_shiny/) shows how you can author an R Shiny application within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption. +- [Publishing Streamlit Apps](r_examples/rsconnect_streamlit/) shows how you can author a Streamlit application within Amazon SageMaker Studio and publish to RStudio Connect for wide consumption. + ### Advanced Amazon SageMaker Functionality These examples that showcase unique functionality available in Amazon SageMaker. They cover a broad range of topics and will utilize a variety of methods, but aim to provide the user with sufficient insight or inspiration to develop within Amazon SageMaker. 
diff --git a/r_examples/rsconnect_rmarkdown/README.md b/r_examples/rsconnect_rmarkdown/README.md new file mode 100644 index 0000000000..f3306e3ff2 --- /dev/null +++ b/r_examples/rsconnect_rmarkdown/README.md @@ -0,0 +1,37 @@ +# Publishing R Markdown documents from RStudio on Amazon SageMaker to RStudio Connect + +You can programmatically create an analysis within RStudio on Amazon SageMaker and publish it to RStudio Connect so that your collaborators can easily consume it. In this example, we use a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) to walk through two common publication use cases: R Markdown and R Presentation documents. + +## R Markdown + +R Markdown is a great tool to run your analyses in R as part of a markdown file and share them in RStudio Connect. In the R Markdown example in [breast_cancer_eda.Rmd](./breast_cancer_eda.Rmd) in the GitHub repo, we run two simple analyses with plots on the dataset, alongside explanatory text in markdown. + +```{r} + ```{r breastcancer} + data(BreastCancer) + df <- BreastCancer + # convert input values to numeric + for(i in 2:10) { + df[,i] <- as.numeric(as.character(df[,i])) + } + summary(df) + ``` + + ```{r cl_thickness, echo=FALSE} + ggplot(df, aes(x=Cl.thickness))+ + geom_histogram(color="black", fill="white", binwidth = 1)+ + facet_grid(Class ~ .) + ``` +``` + +We can preview the file by clicking on the **Knit** button (1) and publish it to our RStudio Connect with the **Publish** button (2). +![publish-rmd](./images/publish-rmd.png) + +## R Presentation + +We can also run a similar analysis inline to create an R Presentation deck that can be published for your collaborators. +In the example in [breast_cancer_eda.Rpres](./breast_cancer_eda.Rpres) in the GitHub repo, we combine the presentation, markdown and the R commands together to create a slide deck. 
You can preview the slides while writing code with the **Preview** button (1). Once you are done, you can publish the deck with the **Publish** button (2) in the **Presentation** tab on the right. + +![publish-rpres](./images/publish-rpres.png) + +We showed you the static work that can be published and shared on RStudio Connect from RStudio on Amazon SageMaker. More often than not, you are building an interactive application or dashboard with Shiny. Let’s take a look at how we can publish Shiny apps from RStudio on Amazon SageMaker to RStudio Connect in [Publishing R Shiny Apps](../rsconnect_shiny). diff --git a/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rmd b/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rmd new file mode 100644 index 0000000000..9cf46d183a --- /dev/null +++ b/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rmd @@ -0,0 +1,80 @@ +--- +title: "Breast Cancer data analysis" +author: "Amazon Web Services" +date: "9/7/2021" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +library(mlbench) +library(ggplot2) +library(caret) +``` + +## Breast Cancer data summary + +This is an exploratory analysis on the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library. The data is collected from 699 people who were eligible for the study. 9 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, describing the characteristics of the cell nuclei present in the image. Let's look at the descriptive statistics of the dataset that are in numeric format. 
+ +```{r breastcancer} +data(BreastCancer) +df <- BreastCancer +# convert input columns 2 to 10 from factor to numeric +for(i in 2:10) { + df[,i] <- as.numeric(as.character(df[,i])) +} +summary(df) +``` + +## Histogram of clump thickness by class + +We are interested in the distribution of clump thickness across the two classes: *Benign* and *Malignant*. + +```{r cl_thickness, echo=FALSE} +ggplot(df, aes(x=Cl.thickness))+ + geom_histogram(color="black", fill="white", binwidth = 1)+ + facet_grid(Class ~ .) +``` + +It turns out that *benign* cases tend to have smaller clumps, as opposed to *malignant* cases, which tend to have thicker clumps in the breasts. + +## Training a machine learning model +Let's split the data, standardize it, and train an ML model. The training process uses 10-fold cross-validation, repeated ten times, with a gradient boosting model optimized for area under the ROC curve. +```{r modeling} +# split the data into train and test and perform preprocessing +trainIndex <- createDataPartition(df$Class, p = .8, + list = FALSE, + times = 1) +df_train <- df[ trainIndex,] +df_test <- df[-trainIndex,] +preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute")) +df_train_transformed <- predict(preProcValues, df_train) + +# train a model on df_train +fitControl <- trainControl(## 10-fold CV + method = "repeatedcv", + number = 10, + ## repeated ten times + repeats = 10, + ## Estimate class probabilities + classProbs = TRUE, + ## Evaluate performance using + ## the following function + summaryFunction = twoClassSummary) + +set.seed(825) +gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11], + method = "gbm", + trControl = fitControl, + ## This last option is actually one + ## for gbm() that passes through + verbose = FALSE, + metric = "ROC") +``` + +We can see the feature importance based on the algorithm. 
+```{r featureimportance, echo=FALSE} +summary(gbmFit) +``` + +This is the end of a simple analysis and plotting in an R Markdown file. We develop it in RStudio Workbench in Amazon SageMaker and will publish it to an RStudio Connect server. \ No newline at end of file diff --git a/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rpres b/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rpres new file mode 100644 index 0000000000..4344a9e1c2 --- /dev/null +++ b/r_examples/rsconnect_rmarkdown/breast_cancer_eda.Rpres @@ -0,0 +1,49 @@ +Breast Cancer data analysis +======================================================== +author: Amazon Web Services +date: 09/07/2021 +autosize: true + +Dataset +======================================================== + +This is an exploratory analysis on the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library. + +The data is collected from 699 people who were eligible for the study. 9 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, describing the characteristics of the cell nuclei present in the image. + +Descriptive Statistics +======================================================== + +We can see a class imbalance between the *Benign* and *Malignant* cases. Summary statistics are shown below. 
+```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +library(mlbench) +library(ggplot2) +``` + +```{r breastcancer, echo=FALSE} +data(BreastCancer) +df <- BreastCancer +# convert input columns 2 to 10 from factor to numeric +for(i in 2:10) { + df[,i] <- as.numeric(as.character(df[,i])) +} +summary(df) +``` + +Thicker clumps in malignant cases +======================================================== + +It turns out that *benign* cases tend to have smaller clumps, as opposed to *malignant* cases, which tend to have thicker clumps in the breasts. 
+ +```{r cl_thickness, dpi=100, fig.width = 10, echo=FALSE} +theme_set(theme_gray(base_size = 20)) +ggplot(df, aes(x=Cl.thickness))+ + geom_histogram(color="black", fill="white", binwidth = 1)+ + facet_grid(Class ~ .) +``` + +Thank you +======================================================== + +This is the end of the presentation. \ No newline at end of file diff --git a/r_examples/rsconnect_rmarkdown/images/publish-rmd.png b/r_examples/rsconnect_rmarkdown/images/publish-rmd.png new file mode 100644 index 0000000000..61eb918d79 Binary files /dev/null and b/r_examples/rsconnect_rmarkdown/images/publish-rmd.png differ diff --git a/r_examples/rsconnect_rmarkdown/images/publish-rpres.png b/r_examples/rsconnect_rmarkdown/images/publish-rpres.png new file mode 100644 index 0000000000..6b458b8aec Binary files /dev/null and b/r_examples/rsconnect_rmarkdown/images/publish-rpres.png differ diff --git a/r_examples/rsconnect_shiny/README.md b/r_examples/rsconnect_shiny/README.md new file mode 100644 index 0000000000..32918aa20d --- /dev/null +++ b/r_examples/rsconnect_shiny/README.md @@ -0,0 +1,11 @@ +# Publishing R Shiny apps from RStudio on Amazon SageMaker to RStudio Connect + +[Shiny](https://shiny.rstudio.com/) is an R package that makes it easy to create interactive web applications programmatically. Data scientists often use Shiny applications to share their analyses and models with stakeholders. In this example [breast-cancer-app](./breast-cancer-app), we develop a machine learning model using a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) in `breast_cancer_modeling.r` and create a web application to allow users to interact with the data and ML model. + +To publish, open [breast-cancer-app/app.R](./breast-cancer-app/app.R) and click the **Publish** button. Make sure to select both `app.R` and `breast_cancer_modeling.r` for publication. 
+ +![publish-shiny-app-2](./images/publish-shiny-app-2.png) + +In the application, you can change the features to visualize in the plot and select data points in the plot to see more details, including the model's prediction of whether each case is benign or malignant. By sliding the probability threshold, you can interact with the model and get a different classification count. + +![shiny-dashboard-breast-cancer2.gif](./images/shiny-dashboard-breast-cancer2.gif) diff --git a/r_examples/rsconnect_shiny/breast-cancer-app/app.R b/r_examples/rsconnect_shiny/breast-cancer-app/app.R new file mode 100644 index 0000000000..62689b0d20 --- /dev/null +++ b/r_examples/rsconnect_shiny/breast-cancer-app/app.R @@ -0,0 +1,149 @@ +library(shiny) +library(caret) +library(gbm) +library(e1071) + +source('breast_cancer_modeling.r') +test_data <- readRDS('./breast_cancer_test_data.rds') +gbmFit <- readRDS('./gbm_model.rds') +preProcessor <- readRDS('./preProcessor.rds') +test_data_transformed <- predict(preProcessor, test_data) +prediction <- predict(gbmFit, newdata = test_data_transformed[,2:10], type = "prob") + +inputs1 <- c("Clump Thickness" = "Cl.thickness", + "Uniformity of Cell Size" = "Cell.size", + "Uniformity of Cell Shape" = "Cell.shape", + "Marginal Adhesion" = "Marg.adhesion", + "Single Epithelial Cell Size" = "Epith.c.size", + "Bare Nuclei" = "Bare.nuclei", + "Bland Chromatin" = "Bl.cromatin", + "Normal Nucleoli" = "Normal.nucleoli", + "Mitoses" = "Mitoses") + +inputs2 <- c("Uniformity of Cell Size" = "Cell.size", + "Clump Thickness" = "Cl.thickness", + "Uniformity of Cell Shape" = "Cell.shape", + "Marginal Adhesion" = "Marg.adhesion", + "Single Epithelial Cell Size" = "Epith.c.size", + "Bare Nuclei" = "Bare.nuclei", + "Bland Chromatin" = "Bl.cromatin", + "Normal Nucleoli" = "Normal.nucleoli", + "Mitoses" = "Mitoses") + + +# Define UI for the app ---- +ui <- fluidPage( + + # App title ---- + titlePanel("Breast Cancer"), + + # Sidebar layout with input and output definitions 
---- + sidebarLayout( + + # Sidebar panel for inputs ---- + sidebarPanel( + # Input: Decimal interval with step value ---- + sliderInput("threshold", "Probability Threshold:", + min = 0, max = 1, + value = 0.5, step = 0.01), + + # Input: Selector for variable to plot on x axis ---- + selectInput("variable_x", "Variable on X:", + inputs1), + + # Input: Selector for variable to plot on y axis ---- + selectInput("variable_y", "Variable on Y:", + inputs2) + ), + + # Main panel for displaying outputs ---- + mainPanel( + + # Output: Formatted text for caption ---- + h3(textOutput("caption")), + + # Output: prediction outcome + tableOutput("predictions"), + + # Output: Verbatim text for data summary ---- + verbatimTextOutput("summary"), + + # Output: Formatted text for formula ---- + h3(textOutput("formula")), + + # Output: Plot of the data ---- + # was click = "plot_click" + plotOutput("scatterPlot", brush = "plot_brush"), + + # Output: details of the brushed points + tableOutput("info") + + ) + ) +) + +# Define server logic to plot various variables ---- +server <- function(input, output) { + + # Compute the formula text ---- + # This is in a reactive expression since it is shared by the + # output$formula and output$scatterPlot functions + formulaText <- reactive({ + paste(input$variable_y, "~", input$variable_x) + }) + + # Count predictions on each side of the threshold ---- + # This is in a reactive expression so that the counts are + # recomputed whenever input$threshold changes + total_count <- reactive({ + data.frame(Class = colnames(prediction), + Count = c(sum(prediction$malignant<input$threshold), sum(prediction$malignant>=input$threshold))) + }) + + # Label each test row using the threshold ---- + # This is in a reactive expression + threshold_proba <- reactive({ + cbind(Prediction = ifelse(prediction$malignant>=input$threshold, + "malignant", "benign"), + test_data) + }) + + # return prediction summary + output$predictions <- renderTable({ + total_count() + }) + + # Return the caption text ---- + output$caption <- renderText({ + "Breast cancer test data summary" + 
}) + + # Generate a summary of the dataset ---- + # test_data is loaded once at startup and does not depend on + # any input, so this summary is not re-executed when the + # inputs change + output$summary <- renderPrint({ + summary(test_data) + }) + + # Return the formula text for printing as a caption ---- + output$formula <- renderText({ + formulaText() + }) + + # Generate a plot of the requested variables ---- + # using the test data labeled by the current threshold + output$scatterPlot <- renderPlot({ + plot(as.formula(formulaText()), data = threshold_proba()) + }) + + output$info <- renderTable({ + brushedPoints(threshold_proba(), input$plot_brush, + xvar = input$variable_x, yvar = input$variable_y) + }) + +} + +# Create Shiny app ---- +shinyApp(ui, server) \ No newline at end of file diff --git a/r_examples/rsconnect_shiny/breast-cancer-app/breast_cancer_modeling.r b/r_examples/rsconnect_shiny/breast-cancer-app/breast_cancer_modeling.r new file mode 100644 index 0000000000..9e2473733b --- /dev/null +++ b/r_examples/rsconnect_shiny/breast-cancer-app/breast_cancer_modeling.r @@ -0,0 +1,47 @@ +library(caret) +library(mlbench) + +data(BreastCancer) +summary(BreastCancer) #Summary of Dataset + +df <- BreastCancer +# convert input values to numeric +for(i in 2:10) { + df[,i] <- as.numeric(as.character(df[,i])) +} + +# split the data into train and test and perform preprocessing +trainIndex <- createDataPartition(df$Class, p = .8, + list = FALSE, + times = 1) +df_train <- df[ trainIndex,] +df_test <- df[-trainIndex,] +preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute")) +df_train_transformed <- predict(preProcValues, df_train) + +# train a model on df_train +fitControl <- trainControl(## 10-fold CV + method = "repeatedcv", + number = 10, + ## repeated ten times + repeats = 10, + ## Estimate class probabilities + classProbs = TRUE, + ## Evaluate performance using + ## the following function + summaryFunction 
= twoClassSummary) + +set.seed(825) +gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11], + method = "gbm", + trControl = fitControl, + ## This last option is actually one + ## for gbm() that passes through + verbose = FALSE, + metric = "ROC") +gbmFit + +saveRDS(preProcValues, file = './preProcessor.rds') +saveRDS(gbmFit, file = './gbm_model.rds') +saveRDS(df_test[,1:10], file = './breast_cancer_test_data.rds') + diff --git a/r_examples/rsconnect_shiny/images/publish-shiny-app-2.png b/r_examples/rsconnect_shiny/images/publish-shiny-app-2.png new file mode 100644 index 0000000000..ba9afb2857 Binary files /dev/null and b/r_examples/rsconnect_shiny/images/publish-shiny-app-2.png differ diff --git a/r_examples/rsconnect_shiny/images/shiny-dashboard-breast-cancer2.gif b/r_examples/rsconnect_shiny/images/shiny-dashboard-breast-cancer2.gif new file mode 100644 index 0000000000..3f75d1caf8 Binary files /dev/null and b/r_examples/rsconnect_shiny/images/shiny-dashboard-breast-cancer2.gif differ diff --git a/r_examples/rsconnect_streamlit/README.md b/r_examples/rsconnect_streamlit/README.md new file mode 100644 index 0000000000..466bc2a1d7 --- /dev/null +++ b/r_examples/rsconnect_streamlit/README.md @@ -0,0 +1,26 @@ +# Publishing Streamlit apps from Amazon SageMaker Studio to RStudio Connect + +[Streamlit](https://docs.streamlit.io/en/latest/index.html) is an open source project in Python that makes it easy to create web applications for machine learning and data science developers. In this example, we develop a simple Streamlit application in `app.py` to allow interactive data exploration and visualization on the [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29). Let’s see how we can deploy this app to RStudio Connect from SageMaker Studio. + +1. Before we proceed, we first need to create an API key from your RStudio Connect account. 
Please follow the [instructions](https://docs.rstudio.com/connect/user/api-keys/#api-keys-creating) to create one, and save it, as the publication process requires the API key. +1. Open a system terminal in SageMaker Studio in **File**->**New**->**Terminal**. +1. Install the [`rsconnect-python`](https://github.com/rstudio/rsconnect-python) library in the terminal. + + ```bash + pip install rsconnect-python + ``` + +1. Deploy the app using the `rsconnect deploy` command in the terminal. In the command, we need to specify the deploy type, the server URL, an API key for the RStudio Connect account, and a file path to the folder containing `app.py` and `requirements.txt`. Note that you need to list the required Python libraries in [`requirements.txt`](./breast-cancer-streamlit-app/requirements.txt) for deployment. + + ```bash + rsconnect deploy streamlit \ + --server https://xxxx.rstudioconnect.com/ \ + --api-key <API-KEY> \ + /path/to/breast-cancer-streamlit-app/ + ``` + +At the end of the execution, you should see URLs for the app in RStudio Connect. You can open the URL to see the published Streamlit app. + +![streamlit-app-in-action.gif](./images/streamlit-app-in-action.gif) + +Note that [RStudio Connect insists on matching `<MAJOR.MINOR>` versions of Python](https://github.com/rstudio/rsconnect-python#deploying-python-content-to-rstudio-connect). You need to make sure that your RStudio Connect instance has a Python version compatible with the one available in the notebook kernel. You can verify the Python version in the terminal in SageMaker Studio with `python --version`. You can check the version available on RStudio Connect in the **Documentation** page in your RStudio Connect, for example, `https://xxxx.rstudioconnect.com/connect/#/help/docs`. 
\ No newline at end of file diff --git a/r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/app.py b/r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/app.py new file mode 100644 index 0000000000..de50c1a31a --- /dev/null +++ b/r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/app.py @@ -0,0 +1,50 @@ +import matplotlib.pyplot as plt +import pandas as pd +import numpy as np +import streamlit as st + +st.title("Breast Cancer Analysis") +data_url = "https://sagemaker-sample-files.s3.amazonaws.com/datasets/tabular/breast_cancer/breast-cancer-wisconsin.csv" +columns = [ + "Id", + "Cl.thickness", + "Cell.size", + "Cell.shape", + "Marg.adhesion", + "Epith.c.size", + "Bare.nuclei", + "Bl.cromatin", + "Normal.nucleoli", + "Mitoses", + "Class", +] + + +@st.cache +def load_data(): + df = pd.read_csv(data_url, names=columns) + df["Class"] = df["Class"].replace(to_replace=[2, 4], value=["benign", "malignant"]) + return df + + +data_load_state = st.text("Loading data...") +data = load_data() +data_load_state.text("") + +column = st.selectbox("Features", columns[1:-1]) + +st.subheader("Histogram of %s by diagnosis type" % column) +fig, ax = plt.subplots( + 1, + 2, + sharex=True, + sharey=True, +) +data.hist(column=column, by=columns[-1], layout=(1, 2), ax=ax) +st.pyplot(fig) + +if st.checkbox("Show raw data"): + st.subheader("Raw data") + st.write(data[[column, columns[-1]]]) + +st.markdown(f"Source: <{data_url}>") diff --git a/r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/requirements.txt b/r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/requirements.txt new file mode 100644 index 0000000000..97b1237b30 --- /dev/null +++ b/r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/requirements.txt @@ -0,0 +1,4 @@ +matplotlib +pandas +numpy +streamlit diff --git a/r_examples/rsconnect_streamlit/images/streamlit-app-in-action.gif b/r_examples/rsconnect_streamlit/images/streamlit-app-in-action.gif new file mode 100644 index 
0000000000..dc12f3e3ec Binary files /dev/null and b/r_examples/rsconnect_streamlit/images/streamlit-app-in-action.gif differ
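
Reviewer note: the Shiny app's probability threshold slider splits the test predictions into benign and malignant counts, with a case labeled malignant when its predicted malignant probability is at or above the threshold. A minimal Python sketch of that counting rule follows. It is an illustration only and not part of this change; the function name `count_by_threshold` and the probabilities in the usage note are hypothetical.

```python
from typing import Dict, Sequence


def count_by_threshold(malignant_probs: Sequence[float], threshold: float) -> Dict[str, int]:
    """Count cases on each side of a probability threshold.

    A case is counted as malignant when its malignant probability is
    greater than or equal to the threshold, and as benign otherwise.
    (Hypothetical helper, written only to illustrate the rule.)
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    malignant = sum(1 for p in malignant_probs if p >= threshold)
    benign = len(malignant_probs) - malignant
    return {"benign": benign, "malignant": malignant}
```

For example, with the made-up probabilities `[0.05, 0.20, 0.50, 0.75, 0.95]`, a threshold of 0.5 gives 2 benign and 3 malignant cases, and raising the threshold to 0.9 gives 4 benign and 1 malignant.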