Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 2d81c82, commit 9dd3fce
Showing 15 changed files with 461 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Publishing R Markdown documents from RStudio on Amazon SageMaker to RStudio Connect | ||
|
||
You can programmatically create an analysis in RStudio on Amazon SageMaker and publish it to RStudio Connect so that your collaborators can easily consume it. In this example, we use a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) to walk through two common publication use cases: R Markdown documents and R Presentation documents. | ||
|
||
## R Markdown | ||
|
||
R Markdown lets you run analyses in R as part of a markdown file and share the rendered result on RStudio Connect. In the example [breast_cancer_eda.Rmd](./breast_cancer_eda.Rmd) in the GitHub repo, we perform simple analyses and plotting on the dataset alongside text written in markdown. Two of its code chunks are shown here: | ||
|
||
````markdown | ||
```{r breastcancer} | ||
data(BreastCancer) | ||
df <- BreastCancer | ||
# convert input values to numeric | ||
for(i in 2:10) { | ||
df[,i] <- as.numeric(as.character(df[,i])) | ||
} | ||
summary(df) | ||
``` | ||
```{r cl_thickness, echo=FALSE} | ||
ggplot(df, aes(x=Cl.thickness))+ | ||
geom_histogram(color="black", fill="white", binwidth = 1)+ | ||
facet_grid(Class ~ .) | ||
``` | ||
```` | ||
|
||
We can preview the file by clicking on the **Knit** button (1) and publish it to our RStudio Connect with the **Publish** button (2). | ||
![publish-rmd](./images/publish-rmd.png) | ||
|
||
## R Presentation | ||
|
||
We can also run a similar analysis inline to create an R Presentation deck to share with your collaborators. | ||
In the example [breast_cancer_eda.Rpres](./breast_cancer_eda.Rpres) in the GitHub repo, we combine slide text, markdown, and R commands to create a slide deck. You can preview the slides while writing code with the **Preview** button (1). Once you are done, you can publish the deck with the **Publish** button (2) in the **Presentation** tab on the right. | ||
|
||
![publish-rpres](./images/publish-rpres.png) | ||
|
||
So far we have shown static work that can be published and shared on RStudio Connect from RStudio on Amazon SageMaker. More often than not, you are building an interactive application or dashboard with Shiny. Let's take a look at how to publish Shiny apps from RStudio on Amazon SageMaker to RStudio Connect in [Publishing R Shiny Apps](../rsconnect_shiny).
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
--- | ||
title: "Breast Cancer data analysis" | ||
author: "Amazon Web Services" | ||
date: "9/7/2021" | ||
output: html_document | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE) | ||
library(mlbench) | ||
library(ggplot2) | ||
library(caret) | ||
``` | ||
|
||
## Breast Cancer data summary | ||
|
||
This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library. The data contains 699 observations. Nine features, each graded on a scale of 1 to 10, describe cell characteristics in fine needle aspirate (FNA) samples of breast masses. Let's look at the descriptive statistics of the dataset after converting the features to numeric format. | ||
|
||
```{r breastcancer} | ||
data(BreastCancer) | ||
df <- BreastCancer | ||
# convert input columns 2 to 10 from factor to numeric | ||
for(i in 2:10) { | ||
df[,i] <- as.numeric(as.character(df[,i])) | ||
} | ||
summary(df) | ||
``` | ||
|
||
## Histogram of clump thickness by class | ||
|
||
We are interested in the distribution of clump thickness in the two classes: *benign* and *malignant*. | ||
|
||
```{r cl_thickness, echo=FALSE} | ||
ggplot(df, aes(x=Cl.thickness))+ | ||
geom_histogram(color="black", fill="white", binwidth = 1)+ | ||
facet_grid(Class ~ .) | ||
``` | ||
|
||
It turns out that *benign* cases tend to have thinner clumps, as opposed to *malignant* cases, which tend to have thicker clumps. | ||
|
||
## Training a machine learning model | ||
Let's split the data, standardize the features, and train an ML model. The training process uses 10-fold cross-validation, repeated 10 times, to fit a gradient boosting model tuned for area under the ROC curve. | ||
```{r modeling} | ||
# split the data into train and test and perform preprocessing | ||
trainIndex <- createDataPartition(df$Class, p = .8, | ||
list = FALSE, | ||
times = 1) | ||
df_train <- df[ trainIndex,] | ||
df_test <- df[-trainIndex,] | ||
preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute")) | ||
df_train_transformed <- predict(preProcValues, df_train) | ||
# train a model on df_train | ||
fitControl <- trainControl(## 10-fold CV | ||
method = "repeatedcv", | ||
number = 10, | ||
## repeated ten times | ||
repeats = 10, | ||
## Estimate class probabilities | ||
classProbs = TRUE, | ||
## Evaluate performance using | ||
## the following function | ||
summaryFunction = twoClassSummary) | ||
set.seed(825) | ||
gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11], | ||
method = "gbm", | ||
trControl = fitControl, | ||
## This last option is actually one | ||
## for gbm() that passes through | ||
verbose = FALSE, | ||
metric = "ROC") | ||
``` | ||
|
||
We can see the feature importance based on the algorithm. | ||
```{r featureimportance, echo=FALSE} | ||
summary(gbmFit) | ||
``` | ||
|
||
This is the end of a simple analysis and plotting example in an R Markdown file. We develop it in RStudio Workbench in Amazon SageMaker and will publish it to an RStudio Connect server.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
Breast Cancer data analysis | ||
======================================================== | ||
author: Amazon Web Services | ||
date: 09/07/2021 | ||
autosize: true | ||
|
||
Dataset | ||
======================================================== | ||
|
||
This is an exploratory analysis of the [UCI Breast Cancer Wisconsin (Original) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) from the [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html) library. | ||
|
||
The data contains 699 observations. Nine features, each graded on a scale of 1 to 10, describe cell characteristics in fine needle aspirate (FNA) samples of breast masses. | ||
|
||
Descriptive Statistics | ||
======================================================== | ||
|
||
We can see a class imbalance between the *benign* and *malignant* cases. Summary statistics are shown below. | ||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE) | ||
library(mlbench) | ||
library(ggplot2) | ||
``` | ||
|
||
```{r breastcancer, echo=FALSE} | ||
data(BreastCancer) | ||
df <- BreastCancer | ||
# convert input columns 2 to 10 from factor to numeric | ||
for(i in 2:10) { | ||
df[,i] <- as.numeric(as.character(df[,i])) | ||
} | ||
summary(df) | ||
``` | ||
|
||
Thicker clumps in malignant cases | ||
======================================================== | ||
|
||
It turns out that *benign* cases tend to have thinner clumps, as opposed to *malignant* cases, which tend to have thicker clumps. | ||
|
||
```{r cl_thickness, dpi=100, fig.width = 10, echo=FALSE} | ||
theme_set(theme_gray(base_size = 20)) | ||
ggplot(df, aes(x=Cl.thickness))+ | ||
geom_histogram(color="black", fill="white", binwidth = 1)+ | ||
facet_grid(Class ~ .) | ||
``` | ||
|
||
Thank you | ||
======================================================== | ||
|
||
This is the end of the presentation. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Publishing R Shiny apps from RStudio on Amazon SageMaker to RStudio Connect | ||
|
||
[Shiny](https://shiny.rstudio.com/) is an R package that makes it easy to create interactive web applications programmatically. Data scientists often use Shiny applications to share their analyses and models with stakeholders. In this example, [breast-cancer-app](./breast-cancer-app), we develop a machine learning model on a [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29) in `breast_cancer_modeling.r` and create a web application that lets users interact with the data and the ML model. | ||
|
||
To publish, open the [breast-cancer-app/app.R](./breast-cancer-app/app.R) and click the **Publish** button to publish the application. Please select both `app.R` and `breast_cancer_modeling.r` to publish. | ||
|
||
![publish-shiny-app-2](./images/publish-shiny-app-2.png) | ||
|
||
In the application, you can choose which features to plot and select data points in the plot to see more details, including the model's prediction of whether each case is benign or malignant. By moving the probability threshold slider, you can interact with the model and get a different classification count. | ||
|
||
![shiny-dashboard-breast-cancer2.gif](./images/shiny-dashboard-breast-cancer2.gif) |
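
The effect of the threshold slider can be sketched outside of Shiny. The following Python snippet is a minimal illustration and is not part of the app: the function name `count_by_threshold` and the sample probabilities are hypothetical, while the rule itself (a case is labelled malignant when its predicted malignant probability is at or above the threshold) follows `app.R`.

```python
from typing import Dict, List


def count_by_threshold(malignant_probs: List[float], threshold: float) -> Dict[str, int]:
    """Count predictions per class for a given probability threshold.

    A case is labelled malignant when its predicted malignant
    probability is >= threshold, and benign otherwise.
    """
    malignant = sum(1 for p in malignant_probs if p >= threshold)
    return {"benign": len(malignant_probs) - malignant, "malignant": malignant}


# Hypothetical predicted probabilities, for illustration only.
probs = [0.02, 0.10, 0.45, 0.55, 0.97]
for t in (0.25, 0.50, 0.75):
    print(t, count_by_threshold(probs, t))
```

Raising the threshold can only move cases from the malignant count to the benign count, which is why the table in the app shifts in one direction as the slider moves.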
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,149 @@ | ||
library(shiny) | ||
library(caret) | ||
library(gbm) | ||
library(e1071) | ||
|
||
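# Train the model and write the .rds artifacts loaded below | ||
source('breast_cancer_modeling.r') | ||
test_data <- readRDS('./breast_cancer_test_data.rds') | ||
gbmFit <- readRDS('./gbm_model.rds') | ||
preProcessor <- readRDS('./preProcessor.rds') | ||
# Apply the training-set preprocessing to the held-out test data | ||
test_data_transformed <- predict(preProcessor, test_data) | ||
# Class probabilities from the nine feature columns (column 1 is Id) | ||
prediction <- predict(gbmFit, newdata = test_data_transformed[,2:10], type = "prob") | ||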
|
||
inputs1 <- c("Clump Thickness" = "Cl.thickness", | ||
"Uniformity of Cell Size" = "Cell.size", | ||
"Uniformity of Cell Shape" = "Cell.shape", | ||
"Marginal Adhesion" = "Marg.adhesion", | ||
"Single Epithelial Cell Size" = "Epith.c.size", | ||
"Bare Nuclei" = "Bare.nuclei", | ||
"Bland Chromatin" = "Bl.cromatin", | ||
"Normal Nucleoli" = "Normal.nucleoli", | ||
"Mitoses" = "Mitoses") | ||
|
||
inputs2 <- c("Uniformity of Cell Size" = "Cell.size", | ||
"Clump Thickness" = "Cl.thickness", | ||
"Uniformity of Cell Shape" = "Cell.shape", | ||
"Marginal Adhesion" = "Marg.adhesion", | ||
"Single Epithelial Cell Size" = "Epith.c.size", | ||
"Bare Nuclei" = "Bare.nuclei", | ||
"Bland Chromatin" = "Bl.cromatin", | ||
"Normal Nucleoli" = "Normal.nucleoli", | ||
"Mitoses" = "Mitoses") | ||
|
||
|
||
# Define UI for the app ---- | ||
ui <- fluidPage( | ||
|
||
# App title ---- | ||
titlePanel("Breast Cancer"), | ||
|
||
# Sidebar layout with input and output definitions ---- | ||
sidebarLayout( | ||
|
||
# Sidebar panel for inputs ---- | ||
sidebarPanel( | ||
# Input: Decimal interval with step value ---- | ||
sliderInput("threshold", "Probability Threshold:", | ||
min = 0, max = 1, | ||
value = 0.5, step = 0.01), | ||
|
||
# Input: Selector for variable to plot on x axis ---- | ||
selectInput("variable_x", "Variable on X:", | ||
inputs1), | ||
|
||
# Input: Selector for variable to plot on y axis ---- | ||
selectInput("variable_y", "Variable on Y:", | ||
inputs2) | ||
), | ||
|
||
# Main panel for displaying outputs ---- | ||
mainPanel( | ||
|
||
# Output: Formatted text for caption ---- | ||
h3(textOutput("caption")), | ||
|
||
# Output: prediction outcome | ||
tableOutput("predictions"), | ||
|
||
# Output: Verbatim text for data summary ---- | ||
verbatimTextOutput("summary"), | ||
|
||
# Output: Formatted text for formula ---- | ||
h3(textOutput("formula")), | ||
|
||
# Output: Plot of the data; brushing selects points ---- | ||
plotOutput("scatterPlot", brush = "plot_brush"), | ||
|
||
# Output: details of the brushed points | ||
tableOutput("info") | ||
|
||
) | ||
) | ||
) | ||
|
||
# Define server logic to plot various variables ---- | ||
server <- function(input, output) { | ||
|
||
# Compute the formula text ---- | ||
# This is in a reactive expression since it is shared by the | ||
# output$formula and output$scatterPlot functions | ||
formulaText <- reactive({ | ||
paste(input$variable_y, "~", input$variable_x) | ||
}) | ||
|
||
# Count predictions per class at the selected threshold ---- | ||
# This is in a reactive expression so it is recomputed whenever | ||
# input$threshold changes | ||
total_count <- reactive({ | ||
data.frame(Class = colnames(prediction), | ||
Count = c(sum(prediction$malignant<input$threshold), | ||
sum(prediction$malignant>=input$threshold))) | ||
}) | ||
|
||
# Label each test case using the selected threshold ---- | ||
# This is in a reactive expression shared by the plot and the info table | ||
threshold_proba <- reactive({ | ||
cbind(Prediction = ifelse(prediction$malignant>=input$threshold, | ||
"malignant", "benign"), | ||
test_data) | ||
}) | ||
|
||
# return prediction summary | ||
output$predictions <- renderTable({ | ||
total_count() | ||
}) | ||
|
||
# Return a fixed caption for the data summary ---- | ||
output$caption <- renderText({ | ||
"Breast cancer test data summary" | ||
}) | ||
|
||
# Generate a summary of the test dataset ---- | ||
# test_data is static, so this output does not depend on | ||
# any input | ||
output$summary <- renderPrint({ | ||
summary(test_data) | ||
}) | ||
|
||
# Return the formula text for printing as a caption ---- | ||
output$formula <- renderText({ | ||
formulaText() | ||
}) | ||
|
||
# Generate a scatter plot of the requested variables ---- | ||
output$scatterPlot <- renderPlot({ | ||
plot(as.formula(formulaText()), data = threshold_proba()) | ||
}) | ||
|
||
output$info <- renderTable({ | ||
brushedPoints(threshold_proba(), input$plot_brush, | ||
xvar = input$variable_x, yvar = input$variable_y) | ||
}) | ||
|
||
} | ||
|
||
# Create Shiny app ---- | ||
shinyApp(ui, server) |
47 changes: 47 additions & 0 deletions
47
r_examples/rsconnect_shiny/breast-cancer-app/breast_cancer_modeling.r
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
library(caret) | ||
library(mlbench) | ||
|
||
data(BreastCancer) | ||
summary(BreastCancer) #Summary of Dataset | ||
|
||
df <- BreastCancer | ||
# convert input values to numeric | ||
for(i in 2:10) { | ||
df[,i] <- as.numeric(as.character(df[,i])) | ||
} | ||
|
||
# split the data into train and test and perform preprocessing | ||
trainIndex <- createDataPartition(df$Class, p = .8, | ||
list = FALSE, | ||
times = 1) | ||
df_train <- df[ trainIndex,] | ||
df_test <- df[-trainIndex,] | ||
preProcValues <- preProcess(df_train, method = c("center", "scale", "medianImpute")) | ||
df_train_transformed <- predict(preProcValues, df_train) | ||
|
||
# train a model on df_train | ||
fitControl <- trainControl(## 10-fold CV | ||
method = "repeatedcv", | ||
number = 10, | ||
## repeated ten times | ||
repeats = 10, | ||
## Estimate class probabilities | ||
classProbs = TRUE, | ||
## Evaluate performance using | ||
## the following function | ||
summaryFunction = twoClassSummary) | ||
|
||
set.seed(825) | ||
gbmFit <- train(Class ~ ., data = df_train_transformed[,2:11], | ||
method = "gbm", | ||
trControl = fitControl, | ||
## This last option is actually one | ||
## for gbm() that passes through | ||
verbose = FALSE, | ||
metric = "ROC") | ||
gbmFit | ||
|
||
saveRDS(preProcValues, file = './preProcessor.rds') | ||
saveRDS(gbmFit, file = './gbm_model.rds') | ||
saveRDS(df_test[,1:10], file = './breast_cancer_test_data.rds') | ||
|
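
The script saves `preProcValues` so that the app can apply the same transformation to the held-out test data. The idea of fitting preprocessing parameters on the training split only and reusing them elsewhere can be sketched in plain Python. This is a simplified illustration under stated assumptions, not caret's implementation: the helper names `fit_preprocessor` and `apply_preprocessor` and the sample values are hypothetical, statistics are computed from the observed (non-missing) training values, and caret's internal order of operations may differ.

```python
from statistics import mean, median, stdev
from typing import List, Optional, Tuple

Column = List[Optional[float]]


def fit_preprocessor(train: Column) -> Tuple[float, float, float]:
    """Learn median, mean, and sample standard deviation from training data only."""
    observed = [v for v in train if v is not None]
    return median(observed), mean(observed), stdev(observed)


def apply_preprocessor(values: Column, params: Tuple[float, float, float]) -> List[float]:
    """Impute missing values with the training median, then center and scale."""
    med, mu, sd = params
    return [((med if v is None else v) - mu) / sd for v in values]


# Hypothetical single-feature data; None marks a missing value.
train = [1.0, 2.0, None, 4.0, 10.0]
test = [None, 3.0]

params = fit_preprocessor(train)          # fitted once, on the training split
print(apply_preprocessor(test, params))   # reused unchanged on the test split
```

Because the parameters come from the training split alone, no information from the test data leaks into the transformation.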
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Publishing Streamlit apps from Amazon SageMaker Studio to RStudio Connect | ||
|
||
[Streamlit](https://docs.streamlit.io/en/latest/index.html) is an open source Python library that makes it easy for machine learning and data science developers to create web applications. In this example, we develop a simple Streamlit application in `app.py` to allow interactive data exploration and visualization on the [UCI breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29). Let's see how we can deploy this app to RStudio Connect from SageMaker Studio. | ||
|
||
1. Before we proceed, create an API key from your RStudio Connect account. Follow the [instructions](https://docs.rstudio.com/connect/user/api-keys/#api-keys-creating) to create one, and save it, because the publication process requires the API key. | ||
1. Open a system terminal in SageMaker Studio in **File**->**New**->**Terminal**. | ||
1. Install the [`rsconnect-python`](https://github.com/rstudio/rsconnect-python) library in the terminal. | ||
|
||
```bash | ||
pip install rsconnect-python | ||
``` | ||
|
||
1. Deploy the app using the `rsconnect deploy` command in the terminal. In the command, specify the deployment type, the server URL, an API key for the RStudio Connect account, and the path to the folder containing `app.py` and `requirements.txt`. Note that the Python libraries the app requires must be listed in [`requirements.txt`](./breast-cancer-streamlit-app/requirements.txt) for deployment. | ||
|
||
```bash | ||
rsconnect deploy streamlit \ | ||
--server https://xxxx.rstudioconnect.com/ \ | ||
--api-key <your-rstudio-connect-api-key> \ | ||
/path/to/breast-cancer-streamlit-app/ | ||
``` | ||
|
||
At the end of the execution, you should see URLs for the app in RStudio Connect. You can open the URL to see the published Streamlit app. | ||
|
||
![streamlit-app-in-action.gif](./images/streamlit-app-in-action.gif) | ||
|
||
Note that [RStudio Connect insists on matching `<MAJOR.MINOR>` versions of Python](https://github.com/rstudio/rsconnect-python#deploying-python-content-to-rstudio-connect). Make sure that your RStudio Connect instance has a Python version whose `<MAJOR.MINOR>` matches the one available in the notebook kernel. You can verify the local Python version in the SageMaker Studio terminal with `python --version`, and you can check the versions available on RStudio Connect on the **Documentation** page of your RStudio Connect instance, for example, `https://xxxx.rstudioconnect.com/connect/#/help/docs`.
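
The `<MAJOR.MINOR>` comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than anything `rsconnect-python` provides: the helper names `major_minor` and `same_major_minor` are hypothetical, and `server_version` is a placeholder that you would replace with a version listed on your own RStudio Connect **Documentation** page.

```python
import sys
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '3.8.10'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def same_major_minor(local: str, server: str) -> bool:
    """True when both versions share MAJOR.MINOR; the patch level is ignored."""
    return major_minor(local) == major_minor(server)


# Version of the interpreter running this script (the SageMaker Studio side).
local_version = "{}.{}.{}".format(*sys.version_info[:3])
# Hypothetical placeholder: use a version listed by your RStudio Connect server.
server_version = "3.8.10"

print(local_version, "vs", server_version, "->",
      same_major_minor(local_version, server_version))
```

Running this in the same terminal used for `rsconnect deploy` gives a quick check before deploying, since that is the interpreter whose version is compared against the server.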
50 changes: 50 additions & 0 deletions
50
r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/app.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
import matplotlib.pyplot as plt | ||
import pandas as pd | ||
import numpy as np | ||
import streamlit as st | ||
|
||
st.title("Breast Cancer Analysis") | ||
data_url = "https://sagemaker-sample-files.s3.amazonaws.com/datasets/tabular/breast_cancer/breast-cancer-wisconsin.csv" | ||
columns = [ | ||
"Id", | ||
"Cl.thickness", | ||
"Cell.size", | ||
"Cell.shape", | ||
"Marg.adhesion", | ||
"Epith.c.size", | ||
"Bare.nuclei", | ||
"Bl.cromatin", | ||
"Normal.nucleoli", | ||
"Mitoses", | ||
"Class", | ||
] | ||
|
||
|
||
@st.cache | ||
def load_data(): | ||
    # "?" marks missing values in the original UCI file; parse it as NaN | ||
    df = pd.read_csv(data_url, names=columns, na_values="?") | ||
    df["Class"] = df["Class"].replace(to_replace=[2, 4], value=["benign", "malignant"]) | ||
    return df | ||
|
||
|
||
data_load_state = st.text("Loading data...") | ||
data = load_data() | ||
data_load_state.text("") | ||
|
||
column = st.selectbox("Features", columns[1:-1]) | ||
|
||
st.subheader("Histogram of %s by diagnosis type" % column) | ||
fig, ax = plt.subplots( | ||
1, | ||
2, | ||
sharex=True, | ||
sharey=True, | ||
) | ||
data.hist(column=column, by=columns[-1], layout=(1, 2), ax=ax) | ||
st.pyplot(fig) | ||
|
||
if st.checkbox("Show raw data"): | ||
    st.subheader("Raw data") | ||
    st.write(data[[column, columns[-1]]]) | ||
|
||
st.markdown(f"Source: <{data_url}>") |
4 changes: 4 additions & 0 deletions
4
r_examples/rsconnect_streamlit/breast-cancer-streamlit-app/requirements.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
matplotlib | ||
pandas | ||
numpy | ||
streamlit |
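
The file above lists package names without versions. If you prefer to pin the versions that are installed in your local environment, one possible approach is sketched below. This is an assumption about workflow, not something the commit does: the helper name `pin_requirements` is hypothetical, and it relies on `importlib.metadata`, which is in the standard library from Python 3.8.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Iterable, List


def pin_requirements(packages: Iterable[str]) -> List[str]:
    """Return 'name==version' for installed packages, bare names otherwise."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed locally: leave the entry unpinned.
            lines.append(name)
    return lines


# The four packages listed in requirements.txt.
for line in pin_requirements(["matplotlib", "pandas", "numpy", "streamlit"]):
    print(line)
```

The printed lines can be copied into `requirements.txt` so that the server installs the versions the app was developed against.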