DS1: Data Analysis, Databases, and Data Visualization

NOTE: This repo is no longer maintained

This repo is still under construction. Things will be added and changed as the course progresses, so check back often!

Getting started

To make sure you are prepared for the course, follow the instructions in Installation_Instructions.md.

Course Description

Learn the foundational skills of data science, including data collection, scrubbing, analysis, and visualization with modern tools, libraries, and databases. Master the science and art of data exploration and visualization to tell stories with discoveries and persuade decision makers with data-driven insights. Collect a data set, explore, analyze, and visualize it to discover trends, then present insights to the class. Create and manage relational databases with SQL and document-based (NoSQL) databases, and gain an appreciation of the tradeoffs of both paradigms. Draw entity-relationship diagrams and connect entities with foreign keys and many-to-many relation tables. Balance the tradeoff between minimizing redundancy and maximizing performance.

Learning Objectives

  1. Pull, manipulate, process, and clean data
  2. Understand data visualization for conveying findings to non-experts
  3. Apply statistical tests to draw conclusions about findings
  4. A/B testing
  5. Analyze data statistically
  6. Complete multiple hands-on exercises and projects in Python using packages such as Pandas, Matplotlib, Seaborn, and scipy.stats

Course Schedule

This schedule is tentative and subject to change as the course progresses.

  • Class 1: Tuesday, August 28th

    • Lecture & Discussion
    • NOTE: Students should ALREADY have Class 1 repo/dataset setup!
    • Data Science Toolkit/Process
    • Anaconda
    • Jupyter Notebooks
    • Pandas
    • Basic Data Literacy
    • Dataframes and series
    • Data Manipulation
    • Creating dataframes from various sources
    • Manipulating data with Pandas and NumPy
    • Challenges
  • Class 2: Thursday, August 30th

    • Lecture & Discussion
    • Building Visualizations with Seaborn
    • Basic Data Visualization
    • Scatter Plots
    • Bar Graphs/Histograms
    • Pie Charts
    • Review of Data Analysis
    • Common data problems
    • DataFrame slicing/indexing/filtering
    • Data types/structures
    • Basic Data Literacy
    • Continuous vs. discrete data
    • 10 minutes to pandas review and Q&A session
    • Challenges
      • Build 3-5 visualizations based on descriptive questions you have about the Titanic data.
  • Class 3: Tuesday, September 4th

    • Lecture & Discussion
    • Brief Review of DS Process (so far)
    • Brief Review of Data Toolkit for Visualization
    • Anaconda
    • Jupyter Notebooks
    • Pandas
    • Seaborn
    • NumPy
    • Matplotlib
    • Light Introduction to Descriptive Statistics
    • Mean, Median, Mode → How do they relate to data?
    • Introduction to Measures of Central Tendency and Data Spread/Distribution
    • Data Wrangling/Cleaning
    • Explaining Data with a Data Dictionary
    • Intermediate Data Wrangling Concepts
    • CSV Files with Header Rows
    • Introducing New Columns/Attributes
    • Details on Data Types (str ←→ int/float)
    • Normalizing Values by Units (Integer Encoding)
    • Dealing with Categorical Values (One-Hot Encoding)
    • Challenges
      • Sections 1-3 of Basic EDA Tutorial (App Store Dataset)
      • Section 1 is Basic Environment Setup: should already be done.
      • Data Wrangling and Collating
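The two encodings named in Class 3 can be sketched in a few lines. This is a minimal illustration using only the standard library, not the course's own notebook code; the `genres` values and both helper names are hypothetical. In pandas, `pandas.get_dummies` performs one-hot encoding on a column.

```python
# Hypothetical category values for illustration; not taken from the course dataset.
genres = ["Games", "Music", "Games", "Finance"]


def integer_encode(values):
    """Map each distinct label to an integer, numbering labels in sorted order."""
    mapping = {label: index for index, label in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Return one dict per row holding a 0/1 flag for every distinct label."""
    labels = sorted(set(values))
    return [{label: int(value == label) for label in labels} for value in values]


codes, mapping = integer_encode(genres)
rows = one_hot_encode(genres)
```

Integer encoding implies an ordering between labels, which is why one-hot encoding is usually preferred for categories that have no natural order.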
  • Class 4: Thursday, September 6th

    • Lecture & Discussion

    • Quick Review of Homework: Titanic functions and Sections 2-3 of EDA

    • Introduction to Probability

    • Probability Game ($)

    • Conditional Probability

    • Titanic Data Exploration Question Ideation

    • Reading

      • Article on conditional probability of Bob Ross artwork
    • Challenges

      • Sections 4-5 of Basic EDA Tutorial (App Store Dataset)
      • Answer 3-5 probability and conditional probability questions about the Titanic dataset
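As a rough sketch of the kind of question the Class 4 challenge asks, the snippet below computes a probability and a conditional probability by counting. The passenger records are made up for illustration and are not the real Titanic data; the helper names are hypothetical.

```python
from fractions import Fraction

# Made-up records for illustration only; not the real Titanic dataset.
passengers = [
    {"sex": "female", "survived": True},
    {"sex": "female", "survived": True},
    {"sex": "female", "survived": False},
    {"sex": "male", "survived": True},
    {"sex": "male", "survived": False},
    {"sex": "male", "survived": False},
    {"sex": "male", "survived": False},
    {"sex": "male", "survived": False},
]


def probability(rows, event):
    """P(event): the fraction of rows for which event(row) is true."""
    return Fraction(sum(1 for row in rows if event(row)), len(rows))


def conditional_probability(rows, event, given):
    """P(event | given): restrict to rows matching `given`, then count.

    Raises ZeroDivisionError if no row satisfies `given`.
    """
    subset = [row for row in rows if given(row)]
    return probability(subset, event)


p_survived = probability(passengers, lambda r: r["survived"])
p_survived_given_female = conditional_probability(
    passengers, lambda r: r["survived"], lambda r: r["sex"] == "female"
)
```

In this toy sample the conditional probability differs from the overall one, which is the signal that the two attributes are not independent.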
  • Class 5: Tuesday, September 11th

    • Lecture & Discussion
    • Statistical Distributions
    • Review Histograms & Descriptive Stats (Mean, Median, Mode)
    • More Descriptive Stats (Standard Deviation & Variance)
    • Normal Distributions
    • Central Limit Theorem
    • Sampling Methods
    • Reading
      • Article on Mean, Median, Mode, and Range
      • Normal Distribution and Standard Scores
    • Challenges
      • Calculate the mean, median and mode, as well as the standard deviation of the ticket price, age, and Parch columns from the Titanic dataset
      • Visualize more results from Sections 3-5 in App Store EDA tutorial
      • Complete Jupyter notebook on Descriptive Statistics, Sampling, and Distributions
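The descriptive statistics covered in Class 5 can be computed with Python's built-in `statistics` module. This is a minimal sketch on made-up numbers, not the Titanic columns. Note that pandas' `Series.std` uses the sample standard deviation by default, so both variants are shown.

```python
import statistics

# Made-up values for illustration; not a column from the Titanic dataset.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(values)            # arithmetic average
median_value = statistics.median(values)        # middle value (average of the two middle values here)
mode_value = statistics.mode(values)            # most frequent value
population_std = statistics.pstdev(values)      # divides by n
sample_std = statistics.stdev(values)           # divides by n - 1
population_variance = statistics.pvariance(values)
```

The sample standard deviation is always slightly larger than the population one on the same data, because it divides by n - 1 instead of n.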
  • Class 6: Thursday, September 13th

    • Lecture & Discussion
    • Deeper Descriptive Statistics
    • Standard Normal Distribution
    • Z-scores
    • Z-scores to probability (cdf & sf)
    • Introduce Project 1: SA NPS Data Wrangling & Analysis
    • Reading
      • Net Promoter Score (NPS) Related Info
    • Challenges
      • Start Project 1: SA NPS Part 1: Data Wrangling
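A short sketch of turning a z-score into a probability, as listed for Class 6. The "cdf" and "sf" in the schedule most likely refer to functions such as `scipy.stats.norm.cdf` and `scipy.stats.norm.sf`; that is an assumption. This example uses the standard library's `statistics.NormalDist` so it runs without SciPy, and the mean of 100 and standard deviation of 15 are made-up inputs.

```python
from statistics import NormalDist


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / std_dev


standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Hypothetical measurement: 130 on a scale with mean 100 and standard deviation 15.
z = z_score(130, mean=100, std_dev=15)

# cdf: probability of a value at or below z.
prob_below = standard_normal.cdf(z)

# sf (survival function): probability of a value above z, i.e. 1 - cdf.
prob_above = 1.0 - prob_below
```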
  • Class 7: Tuesday, September 18th

    • Activities & Discussion
    • Code Review of Project 1: SA NPS Part 1: Data Wrangling
    • Lab Time for Project 1: SA NPS Part 2: Data Analysis
  • Class 8: Thursday, September 20th

    • Activities & Discussion
    • Presentations of Project 1: SA NPS Part 2: Data Analysis
    • Maybe using JSON APIs and file globbing in Python
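If the JSON and file globbing topic is covered, one possible shape for it is sketched below: match files with a wildcard pattern and parse each one. The function name and the file pattern are hypothetical, and the sketch reads local JSON files only; it does not call any web API.

```python
import glob
import json


def load_json_files(pattern):
    """Parse every file matching a glob pattern and return the decoded objects.

    Paths are sorted so the result order does not depend on the filesystem.
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


# Example call with a hypothetical directory layout:
# responses = load_json_files("data/responses_*.json")
```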
  • Class 9: Tuesday, September 25th

    • Lecture & Discussion
    • Identifying Outliers (Tukey’s Method)
    • Sample size and confidence intervals
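Tukey's method, named in Class 9, flags values that fall more than 1.5 interquartile ranges outside the quartiles. The sketch below is one way to write it with the standard library; the function name and sample data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since different tools compute quartiles slightly differently.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Made-up sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = tukey_outliers(sample)
```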
  • Class 10: Thursday, September 27th

    • Null Hypothesis – read this article before class
    • Hypothesis Testing
    • Null hypothesis vs. alternative hypothesis
    • Type I & type II errors (false positives & negatives)
    • Acceptable error, Alpha values, P-values
    • Discussion: Final Project Selection
    • Find Dataset for Final Project
    • Project Dataset & Question Approval
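To connect the Class 10 vocabulary (null hypothesis, alpha, p-value) to the A/B testing objective, here is a minimal sketch of a two-sided two-proportion z-test. It uses the normal approximation, which is only reasonable for fairly large samples; the function name, the counts, and the alpha of 0.05 are illustrative assumptions, not values from the course.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Test the null hypothesis that groups A and B share one success rate.

    Returns the z statistic and the two-sided p-value.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    # Under the null hypothesis both groups share one pooled rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


alpha = 0.05  # conventional acceptable Type I error rate (an assumption here)

# Made-up A/B counts: 60 of 100 conversions versus 40 of 100.
z_stat, p_value = two_proportion_z_test(60, 100, 40, 100)
reject_null = p_value < alpha
```

Rejecting the null hypothesis when it is actually true is a Type I error; failing to reject it when it is false is a Type II error.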
  • Class 11: Tuesday, October 2nd

    • Read this article before class
    • Discuss the concept of time and 3 dimensions – what’s the direct application?
    • Open discussion with a partner or group: what does the concept of time mean?
    • Discuss why time is important in data science and programming
    • Interview questions, as well as importance in multiple fields
    • Example 1 of a time series in a data frame
  • Class 12: Thursday, October 4th

    • Code review and discussion of time series manipulation
    • Discuss common pitfalls of working with time series in applications
    • Working with time series in an application
      • JS application
      • Python application (Flask)
    • Discussion of applications and importance in industry
    • Intro to HW (applications and editing)
    • Go over dummy data for assignment
    • Go over HW guidelines
    • Most common uses of time series
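One common time series pitfall is grouping by calendar day without first normalizing time zones. The sketch below parses ISO 8601 timestamps, converts them to UTC, and sums values per day. The readings and the function name are made up, and this is not the course's dummy data for the assignment.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Made-up readings: (ISO 8601 timestamp with UTC offset, value).
readings = [
    ("2018-10-01T23:30:00+00:00", 3),
    ("2018-10-01T20:30:00-04:00", 5),  # already 2018-10-02 00:30 in UTC
    ("2018-10-02T12:00:00+00:00", 2),
]


def daily_totals(rows):
    """Sum values per UTC calendar day, returned in date order."""
    totals = defaultdict(int)
    for stamp, value in rows:
        moment = datetime.fromisoformat(stamp).astimezone(timezone.utc)
        totals[moment.date().isoformat()] += value
    return dict(sorted(totals.items()))


totals_by_day = daily_totals(readings)
```

Grouping on the raw date prefix of each string would have put the second reading on October 1st; converting to a single time zone first places it on October 2nd.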
  • Class 13: Tuesday, October 9th

    • Lab Time for Final Project
  • Class 14: Thursday, October 11th

    • Presentations of Final Project