---
title: "Using Markdown"
teaching: 10
exercises: 2
title: "Data Storage: Setting up S3"
teaching: 15
exercises: 5
---

:::::::::::::::::::::::::::::::::::::: questions

- How can I store and manage data effectively in AWS for SageMaker workflows?
- What are the best practices for using S3 versus EC2 storage for machine learning projects?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain data storage options in AWS for machine learning projects.
- Describe the advantages of S3 for large datasets and multi-user workflows.
- Outline steps to set up an S3 bucket and manage data within SageMaker.

::::::::::::::::::::::::::::::::::::::::::::::::

## Step 1: Data Storage

> **Hackathon Attendees**: All data uploaded to AWS must relate to your specific Kaggle challenge, except for auxiliary datasets for transfer learning or pretraining. **DO NOT upload any restricted or sensitive data to AWS.**
## Options for Storage: S3 or EC2 Instance

Storing data in an **S3 bucket** is generally preferred for machine learning workflows on AWS, especially when using SageMaker.

### What is an S3 Bucket?

An **S3 bucket** is a container in Amazon S3 (Simple Storage Service) where you can store, organize, and manage data files. Buckets act as the top-level directory within S3 and can hold a virtually unlimited number of files and folders, making them ideal for storing large datasets, backups, logs, or any files needed for your project. You access objects in a bucket via a unique **S3 URI** (e.g., `s3://your-bucket-name/your-file.csv`), which you can use to reference data across various AWS services like EC2 and SageMaker.
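
Once a file is in a bucket, its URI is all you need to load it from other services. As a quick illustration, here is a minimal sketch that reads a CSV straight from S3 with pandas; the bucket and file names are placeholders, and it assumes the `s3fs` package is installed (pandas relies on it for `s3://` paths):

```python
import pandas as pd

# Read a CSV directly from S3 by its URI.
# "your-bucket-name" and "your-file.csv" are placeholders.
df = pd.read_csv("s3://your-bucket-name/your-file.csv")
print(df.head())
```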

::::::::::::::::::::::::::::::::::::: callout

### Benefits of Using S3 (Recommended for SageMaker)

- **Scalability**: S3 handles large datasets efficiently, enabling storage beyond the limits of an EC2 instance's disk space.
- **Cost Efficiency**: S3 storage costs are generally lower than expanding EC2 disk volumes. You only pay for the storage you use.
- **Separation of Storage and Compute**: You can start and stop EC2 instances without losing access to data stored in S3.
- **Integration with AWS Services**: SageMaker can read directly from and write back to S3, making it ideal for AWS-based workflows.
- **Easy Data Sharing**: Datasets in S3 are easier to share with team members or across projects compared to EC2 storage.
- **Cost-Effective Data Transfer**: When S3 and EC2 are in the same region, data transfer between them is free.

::::::::::::::::::::::::::::::::::::::::::::::::

### When to Store Data Directly on EC2 (e.g., on a Jupyter notebook instance)

An **EC2 instance** provides a virtual server environment with its own local storage, which can be used to store and process data directly on the instance. This approach is suitable for **temporary or small datasets** and for **one-off experiments** that don't require long-term data storage or frequent access from multiple services. For anything beyond that, **S3 is generally preferred** for its scalability, cost-efficiency, and ease of integration across AWS services, especially for machine learning workflows.

#### Use EC2 storage for:

- **Temporary or Small Datasets**: If your dataset is under 1 GB and you need quick, one-time processing, EC2 storage can be simpler and faster to set up.
- **No S3 Access Required**: If your environment has limited permissions or network restrictions preventing S3 access, storing data on EC2 may be preferable.
- **One-off Experiments**: For experiments that won’t require scaling or future access to data, storing directly on EC2 can be convenient.

#### Limitations of EC2 Storage:
- **Scalability**: EC2 storage is limited to the instance’s disk capacity, so it may not be ideal for very large datasets.
- **Cost**: EC2 storage can be more costly for long-term use compared to S3.
- **Data Persistence**: Data stored on an EC2 instance may be lost when the instance is stopped or terminated, unless you use Elastic Block Store (EBS) volumes for persistent storage.

## Recommended Approach: Use S3 for Data Storage

For flexibility, scalability, and cost efficiency, store data in S3 and load it into EC2 as needed. This setup allows:

- Starting and stopping EC2 instances as needed
- Scaling storage without reconfiguring the instance
- Seamless integration across AWS services
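
As a concrete example of this pattern, here is a minimal boto3 sketch that pulls a file from S3 onto the instance's local disk only when it is needed (the bucket name and key below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Download an object from S3 to the local disk of the EC2/SageMaker instance.
# Replace the bucket name and key with your own.
s3.download_file(
    Bucket="your-bucket-name",
    Key="data/train.csv",
    Filename="train.csv",  # local path on the instance
)
```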

### Steps to Access S3 and Upload Your Dataset

1. Log in to AWS Console and navigate to S3.
2. Create a new bucket or use an existing one.
3. Upload your dataset files.
4. Use the file's **S3 URI** to reference your data in future experiments.

### Adding Tags for Cost Tracking

Adding tags to your S3 buckets is a great way to track project-specific costs and usage over time, especially as data and resources scale up. While tags are required for hackathon participants, we suggest that all users apply tags to easily identify and analyze costs later.

![Example of Recommended Tags for an S3 Bucket](path/to/your-image.png){alt="Screenshot showing recommended tags for an S3 bucket, such as Team, Dataset, and Environment"}


Suggested tags include:
- **Team**: Your team name (e.g., `DataScienceClub`)
- **Dataset**: The specific dataset name (e.g., `CustomerRetention`)
- **Environment**: The type of environment (e.g., `Development`, `Production`)
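
If you prefer to script this step, here is a hedged boto3 sketch that applies the suggested tags to an existing bucket (the bucket name and tag values are placeholders; note that `put_bucket_tagging` replaces the bucket's entire tag set):

```python
import boto3

s3 = boto3.client("s3")

# Apply cost-tracking tags to an existing bucket.
# put_bucket_tagging overwrites any tags already on the bucket.
s3.put_bucket_tagging(
    Bucket="your-bucket-name",
    Tagging={
        "TagSet": [
            {"Key": "Team", "Value": "DataScienceClub"},
            {"Key": "Dataset", "Value": "CustomerRetention"},
            {"Key": "Environment", "Value": "Development"},
        ]
    },
)
```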

### Detailed Procedure:

1. **Sign in to the AWS Management Console**:
- Log in to [AWS Console](https://aws.amazon.com/console/) using your credentials.

2. **Navigate to S3**:
- Type “S3” in the search bar and select **S3 - Scalable Storage in the Cloud**.

3. **Create a New Bucket (or Use an Existing One)**:
- Click **Create Bucket** and enter a unique name.
- **Hackathon participants**: Use a format like `teamname-datasetname` (e.g., `emissionimpossible-co2data`); bucket names must be globally unique and may contain only lowercase letters, numbers, and hyphens.
- **Region**: Leave as `us-east-1` (US East N. Virginia).
- **Access Control**: Disable ACLs (recommended).
- **Public Access**: Turn on "Block all public access".
- **Versioning**: Disable unless you need multiple versions of objects.
- **Tags**: Include suggested tags for easier cost tracking.
- **Encryption**: Use **Server-side encryption with Amazon S3 managed keys (SSE-S3)**.

4. **Upload Files to the Bucket**:
- Click on your bucket’s name, then **Upload**.
- **Add Files** (e.g., `train.csv`, `test.csv`) and click **Upload** to complete.

5. **Getting the S3 URI for Your Data**:
- After uploading, click on a file to find its **S3 URI** (e.g., `s3://titanic-dataset-test/test.csv`). Use this URI to load data into SageMaker or EC2.
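
The same procedure can also be scripted. Here is a minimal boto3 sketch of steps 3–5, using placeholder names (in `us-east-1` no `CreateBucketConfiguration` is needed; other regions require one):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "emissionimpossible-co2data"  # placeholder; must be globally unique and lowercase

# Step 3: create the bucket (us-east-1 needs no CreateBucketConfiguration).
s3.create_bucket(Bucket=bucket)

# Step 4: upload local files into the bucket.
for filename in ["train.csv", "test.csv"]:
    s3.upload_file(Filename=filename, Bucket=bucket, Key=filename)
    # Step 5: the resulting S3 URI for each object:
    print(f"s3://{bucket}/{filename}")
```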

## S3 Bucket Costs

S3 bucket storage incurs costs based on data storage, data transfer, and request counts.

### Storage Costs:
- Storage is charged per GB per month.
- Example: Storing 10 GB costs approximately $0.23/month in S3 Standard.
- **Pricing Tiers**: S3 offers multiple storage classes (Standard, Intelligent-Tiering, Glacier, etc.), with different costs based on access frequency and retrieval times.
- To calculate specific costs based on your needs, refer to AWS's [S3 Pricing Information](https://aws.amazon.com/s3/pricing/).

### Data Transfer Costs:
- **Uploading** data to S3 is free.
- **Downloading** data (out of S3) incurs charges (~$0.09/GB).
- **In-region transfer** (e.g., S3 to EC2) is free, while cross-region data transfer is charged (~$0.02/GB).

> **[Data Transfer Pricing](https://aws.amazon.com/s3/pricing/)**
### Request Costs:
- GET requests are $0.0004 per 1,000 requests.

> **[Request Pricing](https://aws.amazon.com/s3/pricing/)**
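
To make these numbers concrete, here is a back-of-the-envelope estimate using the approximate S3 Standard rates quoted above (always check the pricing pages for current values):

```python
# Rough monthly cost estimate from the approximate rates above.
storage_gb = 10         # data kept in S3 Standard
download_gb = 5         # data transferred out of S3
get_requests = 100_000  # GET requests per month

storage_cost = storage_gb * 0.023               # ~$0.023 per GB-month
transfer_cost = download_gb * 0.09              # ~$0.09 per GB out
request_cost = (get_requests / 1_000) * 0.0004  # ~$0.0004 per 1,000 GETs

total = storage_cost + transfer_cost + request_cost
print(f"Estimated monthly cost: ${total:.2f}")  # -> $0.72
```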
## Removing Unused Data

Choose one of these options:

### Option 1: Delete Data Only
- **When to Use**: You plan to reuse the bucket.
- **Steps**:
- Go to S3, navigate to the bucket.
- Select files to delete, then **Actions > Delete**.
- **CLI** (optional): `aws s3 rm s3://your-bucket-name --recursive` (prefix with `!` when running from a Jupyter notebook cell)

### Option 2: Delete the S3 Bucket Entirely
- **When to Use**: You no longer need the bucket or data.
- **Steps**:
- Select the bucket, click **Actions > Delete**.
- Type the bucket name to confirm deletion.

Deleting the bucket stops all costs associated with storage, requests, and data transfer.
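
Both options can also be scripted. Here is a hedged boto3 sketch with a placeholder bucket name; these operations are irreversible, so double-check the name before running them:

```python
import boto3

bucket = boto3.resource("s3").Bucket("your-bucket-name")  # placeholder

# Option 1: delete every object but keep the bucket for reuse.
bucket.objects.all().delete()

# Option 2: a bucket must be empty before it can be deleted,
# so after removing its objects the bucket itself can go too.
bucket.delete()
```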

::::::::::::::::::::::::::::::::::::: keypoints

- Use S3 for scalable, cost-effective, and flexible storage.
- EC2 storage is suitable for small, temporary datasets.
- Track your S3 storage costs, data transfer, and requests to manage expenses.
- Regularly delete unused data or buckets to avoid ongoing costs.

::::::::::::::::::::::::::::::::::::::::::::::::

