machine-learning-using-pyspark

1. Understanding the PySpark Ecosystem

  • Big Data
  • Hadoop
  • Spark
  • PySpark
  • Machine Learning using PySpark

2. Foundations of Machine Learning

  • Introduction to Machine Learning
  • Supervised vs Unsupervised
  • Classification vs Regression
  • Data Ingestion
  • Data Wrangling
  • Data Preprocessing
  • Model Training
  • Model Validation
  • Deployment

3. Internal Details of Spark

  • Driver
  • Executors
  • Partitions
  • Jobs
  • Stages
  • Tasks
  • Resilient Distributed Dataset (RDD)
  • DataFrames as a High-Level Data Structure

4. Low-Level Understanding Using RDDs

  • Creation of RDD
  • Transformation methods
  • Aggregation methods
  • Actions
  • Caching
  • Debugging

5. Data Ingestion

  • Loading CSV, JSON & Parquet
  • Connecting to databases
  • Getting data from a streaming server

6. Data Wrangling using DataFrames

  • Descriptive Statistics
  • Accessing subsets of data - Rows, Columns, Filters
  • Handling Missing Data
  • Dropping rows & columns
  • Handling Duplicates
  • Aggregate functions
  • Merge, Join & Concatenate

7. Data Preprocessing

  • Why Preprocessing?
  • Scaling Techniques
  • Encoding Techniques
  • Text Processing
  • Dimensionality Reduction
  • Vectorization of Data

8. Regression Learning Models

  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
  • GBT Regressor
  • Evaluation of Regression Models

9. Classification Learning Models

  • LogisticRegression
  • DecisionTreeClassifier
  • GBTClassifier
  • RandomForestClassifier
  • NaiveBayes
  • MultilayerPerceptronClassifier
  • Evaluation of Classification Models

10. Clustering Learning Models

  • Motivation behind clustering
  • KMeans
  • GaussianMixtureModel
  • Latent Dirichlet Allocation

11. Recommendation Engine

12. Pipeline & Hyper-parameter Tuning

  • Composite Estimators using Pipelines
  • Model Selection
  • Hyper-parameter Tuning
  • Persisting trained models
  • Deployment