
Data Science Cheat Sheet

Data science is a concept to unify statistics, data analysis, machine learning, domain knowledge and their related methods in order to understand and analyze actual phenomena with data.

Table of Contents

Statistics

TODO

https://www.youtube.com/watch?v=xxpc-HPKN28

Individuals vs characteristics

Population (census) vs sample

Parameters vs statistics

Descriptive vs inferential

Normal distribution and empirical rule (68-95-99.7)

z-score
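
The z-score and the empirical rule listed above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names (`z_score`, `normal_cdf`, `prob_within`) and the example numbers are illustrative assumptions, not part of the original notes.

```python
import math

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

def normal_cdf(z):
    """CDF of the standard normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return normal_cdf(k) - normal_cdf(-k)

# Hypothetical example: a value of 130 when the mean is 100 and the std is 15.
z = z_score(130, mean=100, std=15)

# Empirical rule: roughly 68%, 95% and 99.7% for k = 1, 2, 3.
for k in (1, 2, 3):
    print(f"within {k} std: {prob_within(k):.4f}")
```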

Inference: Estimation, Testing, Regression

Central limit theorem:

  • The sampling distribution (the distribution of x-bars, i.e. the sample means, over all possible samples of size n) is approximately normal when n is large, even if the population itself is not normal.
  • The mean of the x-bars is equal to the mean of the population.
  • The standard deviation of the x-bars is the standard deviation of the population divided by sqrt(n).
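
The three statements above can be illustrated with a small simulation. This is only a sketch: the exponential population, the sample size `n`, and the number of samples are arbitrary assumptions chosen for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: exponential with mean 1 and standard deviation 1
# (deliberately not normal).
POP_MEAN = 1.0
POP_STD = 1.0
n = 50              # size of each sample
num_samples = 5000  # number of samples drawn

# One x-bar per sample.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # should be close to POP_MEAN
std_of_means = statistics.stdev(sample_means)   # should be close to POP_STD / sqrt(n)
expected_std = POP_STD / n ** 0.5

print(mean_of_means, std_of_means, expected_std)
```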

https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/more-significance-testing-videos/v/hypothesis-testing-and-p-values

p-value - the probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true.
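
As a worked example, here is a sketch of a two-sided p-value for a one-sample z-test (known population standard deviation). The function name and all numbers are hypothetical; other tests compute the p-value differently.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_p_value(sample_mean, null_mean, pop_std, n):
    """P(result at least as extreme as sample_mean | null hypothesis), z-test."""
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# Hypothetical data: 100 observations with mean 103; the null hypothesis says
# the population mean is 100 and the population std is 15, so z = 2.
p = two_sided_p_value(sample_mean=103.0, null_mean=100.0, pop_std=15.0, n=100)
print(f"p-value: {p:.4f}")
```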

Machine Learning

Neural Networks

DNN implementations:

Structuring ML Projects

It is better to have a single optimization metric, because that makes it easier to choose the better model. When a single optimization metric is not possible, you can add satisficing metrics: constraints that only need to meet a threshold. For example, error rate is an optimization metric and the time it takes to classify an object is a satisficing metric.

Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories) and is analogous to the irreducible error.

If your algorithm is performing worse than a human, then to improve your algorithm you can:

  • Get labeled data from humans.
  • Gain insight from manual error analysis: why did a person get this right?
  • Better analysis of bias/variance.

When to focus on bias and when on variance:

  • If human error is 1%, train error is 8%, and dev error is 10%, then focus on avoidable bias, i.e. reducing the train error, because it can potentially be reduced by 7 pp, compared to just 2 pp for the dev error. To reduce the train error you can try the following:
    • Train a bigger model
    • Train longer
    • Try better optimization algorithms: RMSProp, Adam
    • Try a different NN architecture: RNN, CNN
    • Hyperparameter search
  • If human error is 7%, train error is 8%, and dev error is 10%, then focus on variance, i.e. reducing the dev error, because it can potentially be reduced by 2 pp, compared to just 1 pp for the train error. To reduce the dev error you can try the following:
    • Add more data
    • Regularization: L1, L2, dropout
    • Data augmentation
    • Try a different NN architecture: RNN, CNN
    • Hyperparameter search
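
The two cases above reduce to comparing two gaps. A minimal sketch, assuming human-level error is used as a proxy for Bayes error and all errors are given in percent (the function name `diagnose` is hypothetical):

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus), all gaps in percentage points."""
    avoidable_bias = train_error - human_error  # room left on the train set
    variance = dev_error - train_error          # gap between train and dev
    # On a tie this sketch arbitrarily picks "variance".
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print(diagnose(human_error=1, train_error=8, dev_error=10))  # first case above
print(diagnose(human_error=7, train_error=8, dev_error=10))  # second case above
```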

Problems where ML significantly surpasses human-level performance:

  • Online advertising
  • Product recommendations
  • Logistics (predicting transit time)
  • Loan approvals

All of the above are ML on structured data as opposed to natural perception.

Advice from Andrej Karpathy for learning ML: try implementing a NN from scratch, without relying on any libraries like TensorFlow. This will help you learn how deep learning works under the hood.

Error Analysis: focus on what contributes most to the algorithm's error. For example, if 90% of errors are due to blurry images and 10% are due to cats misclassified as dogs, then focus on blurry images to reduce the error.
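
In practice this means tagging a sample of misclassified examples by hand and counting the tags. A small sketch, with hypothetical tag names and counts:

```python
from collections import Counter

def error_breakdown(error_tags):
    """Return (tag, share_of_errors) pairs, largest share first."""
    counts = Counter(error_tags)
    total = sum(counts.values())
    return [(tag, count / total) for tag, count in counts.most_common()]

# Hypothetical manual review of 10 misclassified images.
tags = ["blurry"] * 9 + ["cat_labeled_as_dog"] * 1
breakdown = error_breakdown(tags)
for tag, share in breakdown:
    print(f"{tag}: {share:.0%}")
```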

Incorrectly labeled data:

  • fix it if it contributes a significant portion to error;
  • fix it across train, dev, test datasets universally. Otherwise it may introduce bias to the dataset.

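It's important to make your dev and test datasets as close as possible to real-world data, even if that means the train and dev/test datasets are drawn from different distributions. This way you optimize for the right target. In this case, to perform bias/variance analysis, introduce a train-dev dataset (held-out data drawn from the train distribution): the gap between train and train-dev error measures the variance contribution, and the gap between train-dev and dev error measures data mismatch.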

data-mismatch.png
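
With a train-dev dataset the total error can be split into three gaps. The sketch below assumes errors in percent and uses made-up numbers; `decompose_error` is a hypothetical helper.

```python
def decompose_error(human, train, train_dev, dev):
    """Split the gap between human-level and dev error into three parts (in pp)."""
    return {
        "avoidable_bias": train - human,   # model underfits even the train set
        "variance": train_dev - train,     # fails to generalize within the train distribution
        "data_mismatch": dev - train_dev,  # train and dev distributions differ
    }

# Hypothetical errors: here data mismatch dominates.
parts = decompose_error(human=1, train=2, train_dev=3, dev=10)
print(parts)
```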

Transfer Learning - using intermediate NN layers that were pre-trained on some problem A for a different problem B. For example, problem A can be classifying cats and dogs, and problem B can be classifying lung disease in radiology images. It makes sense when:

  • Problems A and B have the same type of input (e.g. images).
  • There is a lot more data for problem A than for problem B.
  • Low level features from A could be helpful for learning B.

Multi-task Learning - training a NN for a classification problem where an input can be assigned multiple classes, for example an image which can contain cars, pedestrians, stop signs, traffic lights, or any combination of those. It can give better results than training a separate NN for each class, because the intermediate layers are reused.

End-to-end ML - solving a problem using just an ML algorithm without any hand-designed components as part of the whole system. For example, for a speech recognition task an end-to-end ML approach is to use audio as the input for an ML algorithm and the transcript as the output, as opposed to manually extracting features from the audio first, then phonemes, then words, and then generating a transcript.

  • Pros: let the data speak, less hand-designing of components needed.
  • Cons: may need a large amount of data, excludes potentially useful hand-designed components.

Convolutional Neural Networks

Colab Notebooks:

Why convolutions:

  • Parameter sharing
  • Sparsity of connections
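
Parameter sharing is easiest to see by counting parameters. The sketch below compares a fully connected layer with a convolutional layer that map the same input to the same output size; the layer sizes (a 32x32x3 input and six 5x5 filters) are an assumed example.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

# Assumed example: 32x32x3 input, six 5x5 filters, "valid" padding -> 28x28x6 output.
n_in = 32 * 32 * 3
n_out = 28 * 28 * 6

dense = dense_params(n_in, n_out)
conv = conv_params(5, 5, 3, 6)
print(f"fully connected: {dense:,} parameters, convolutional: {conv:,} parameters")
```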

Computer Vision Networks:

  • AlexNet
  • VGG-16 - 16 layers of "same" ConvLayers and MaxPooling layers
  • ResNet
  • Inception Network

Face Recognition:

  • One Shot Learning, Triplet Loss
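
A minimal sketch of the triplet loss on plain Python lists. It assumes the common formulation with squared Euclidean distances and a margin; the embeddings and the margin value are made up for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor closer to the positive than to the negative by at least `margin`."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

# Hypothetical 2-D embeddings of three face images.
anchor = [0.0, 0.0]
same_person = [0.0, 1.0]
other_person = [3.0, 0.0]

easy = triplet_loss(anchor, same_person, other_person)  # negative is far away: loss is 0
hard = triplet_loss(anchor, same_person, [1.0, 0.0])    # negative is as close as the positive
print(easy, hard)
```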

Neural Style Transfer:

DeepFake Colab:

YOLO (You Only Look Once) algorithm:

Sequence Models

TODO: Add more details

  • RNN (Recurrent Neural Network) - has a problem of exploding/vanishing gradients.
  • LSTM (Long Short-Term Memory Network) - mitigates the vanishing-gradient problem by adding gated memory cells (exploding gradients are usually handled with gradient clipping).
  • GRU (Gated Recurrent Unit) - simplified version of LSTM.
  • Attention Model - adds attention mechanism to LSTM: Colab Notebook
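
The exploding/vanishing gradient problem mentioned for plain RNNs can be seen in a toy case. The sketch assumes a scalar linear recurrence h_t = w * h_(t-1) + x_t, for which the gradient of h_T with respect to h_0 is w^T; real RNNs have matrices and nonlinearities, but the repeated multiplication is the same.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar recurrence h_t = w * h_(t-1) + x_t."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per time step (chain rule)
    return grad

vanishing = gradient_through_time(0.5, 50)  # |w| < 1: shrinks toward 0
exploding = gradient_through_time(1.5, 50)  # |w| > 1: grows without bound
print(vanishing, exploding)
```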

Heroes of ML

AI for Industries

AI for Healthcare

AI for Medicine Specialization, Coursera

AI for Diagnosis:

  • Applications of AI for diagnosis (mostly computer vision):
    • Diagnosing edema in lungs from X-ray scans.
    • Dermatology: detecting whether a mole is skin cancer: https://www.nature.com/articles/nature21056.
    • Ophthalmology: diagnosing eye disorders using retinal fundus photos (e.g. diagnosing diabetic retinopathy).
    • Histopathology: determining the extent to which a cancer has spread from microscopic images of tissues.
    • Identifying tumors in MRI data - image segmentation. A CNN called U-Net is used for this.
  • Challenges:
    • Patient Overlap - as an example, the model can memorize a necklace on the X-ray images of a single patient and give an over-optimistic test evaluation. To fix this, split the train and test sets by patient, so that all images of the same patient are either in the train set or in the test set.
    • Set Sampling - the dataset is often imbalanced, so minority class sampling is used to make sure the sets contain enough examples of the rare class.
    • Ground Truth / Reference Standard - determined by consensus voting among several experts.
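
The patient overlap fix above can be sketched as a split over patient IDs rather than over images. The record layout `(patient_id, image_id)` and the function name are assumptions made for this example.

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_id) records so that no patient is in both sets."""
    patient_ids = sorted({patient_id for patient_id, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test

# Hypothetical dataset: 10 patients with 3 X-ray images each.
records = [(f"patient_{p}", f"xray_{p}_{i}") for p in range(10) for i in range(3)]
train, test = split_by_patient(records)

train_patients = {patient_id for patient_id, _ in train}
test_patients = {patient_id for patient_id, _ in test}
print(len(train), len(test), train_patients & test_patients)
```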

AI for Prognosis:

  • Applications of AI for prognosis (mostly applications of Survival analysis):
    • Predicting the risk of an event or when an event is likely to happen, e.g. death, heart attack or stroke, for people with a specific condition or for the general population. It's used to inform the patient and to guide the treatment.
      • Risk of breast or ovarian cancer using data from blood tests.
      • Risk of death for a person with a particular cancer.
      • Risk of a heart attack.
      • Risk of lung cancer recurrence after therapy.
  • Survival analysis is a field in statistics that is used to predict when an event of interest will happen. The field emerged from medical research as a way to model a patient's survival — hence the term "survival analysis".
    • Censored data (end-of-study censoring, loss-to-follow-up censoring) - we don't know the exact time of an event but we know that the event didn't happen before time X.
    • Missing data: completely at random, at random, not at random. E.g. blood pressure measurements are missing more often for younger patients (missing at random, because the missingness depends on age, which is observed).
    • Hazard, Survival to Hazard, Cumulative Hazard - functions that describe the probability of an event over time.
    • C-index - a measure of performance for a survival model (concordance - patients with a worse outcome should have a higher risk score).
    • Mortality score - the sum of hazards for different times.
    • Python library for Survival analysis https://github.com/square/pysurvival/.
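
A simplified sketch of the C-index with right-censored data, written from the concordance idea described above. It is an illustration only, not the implementation of any particular library: a pair is counted only if the patient with the shorter time actually had the event, ties in risk score count as half, and pairs with equal times are skipped.

```python
def c_index(times, events, risk_scores):
    """Fraction of permissible pairs in which the earlier event has the higher risk score."""
    concordant = 0.0
    permissible = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Patient i must have had the event strictly before patient j's time.
            # If i is censored, we cannot tell who had the worse outcome.
            if events[i] and times[i] < times[j]:
                permissible += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if permissible == 0:
        raise ValueError("no permissible pairs")
    return concordant / permissible

# Hypothetical data: the third patient is censored at time 6 (event flag 0).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
good_scores = [0.9, 0.7, 0.5, 0.1]  # earlier events get higher risk
bad_scores = [0.1, 0.5, 0.7, 0.9]   # exactly the wrong way round

print(c_index(times, events, good_scores), c_index(times, events, bad_scores))
```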

AI for Treatment:

  • Applications of AI for treatment (mostly statistical methods):
    • Treatment effect estimation - determining whether a certain treatment will be effective for a particular patient. The input is the features of the patient, e.g. age and blood pressure, and the output is a number representing the risk reduction or increase for an event, e.g. stroke or heart attack. Data from randomized controlled trials is used to train the model.
  • Treatment effect estimation:
    • NNT (number needed to treat) = 1 / ARR (absolute risk reduction) - the number of people who need to receive the treatment for one of them to benefit.
    • Factual - what happens to the patient with/without treatment - we know it. Counterfactual - what would happen to the patient without/with treatment - we don't know it.
    • Average Treatment Effect - difference between means of outcomes with treatment and without treatment.
    • Conditional Average Treatment Effect - Average Treatment Effect given some conditions on the patient, e.g. age, blood pressure.
    • Two Tree Method (T-Learner) - build two decision trees to estimate risk with and without treatment, then subtract the values given by these trees.
    • C-for-benefit - similar to C-index but for treatment effect estimator evaluation.
  • Extracting labels from doctors' unstructured reports on lung X-ray images:
    1. Occurrences of specific labels are searched for in the text. E.g. if the word "edema" is found in the report, go to the next step. Because "edema" has synonyms, a special medical thesaurus called SNOMED CT is used to find synonyms and related terms.
    2. Negation classification is used to determine the absence of a disease, e.g. if the report contains "no edema" or "no evidence of edema". This requires labeled data. If there is no labeled data, then simple regex or dependency-parse rules are used.
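
The rule-based variant of the two steps above can be sketched with regular expressions. The synonym list and the negation phrases below are small hand-written assumptions; a real system would take synonyms from SNOMED CT and use a far richer set of rules.

```python
import re

# Hypothetical negation phrases; {term} is filled in with an escaped label term.
NEGATION_TEMPLATE = (
    r"\b(?:no|without|negative for)\s+(?:evidence of\s+|signs? of\s+)?{term}\b"
)

def label_report(report, terms):
    """Return 'negative', 'positive' or 'not mentioned' for a label and its synonyms."""
    text = report.lower()
    mentioned = False
    for term in terms:
        escaped = re.escape(term.lower())
        # Step 2: a negated mention decides the label in this simple sketch.
        if re.search(NEGATION_TEMPLATE.format(term=escaped), text):
            return "negative"
        # Step 1: plain mention of the term.
        if re.search(rf"\b{escaped}\b", text):
            mentioned = True
    return "positive" if mentioned else "not mentioned"

edema_terms = ["edema", "oedema"]  # assumed synonyms, not taken from SNOMED CT
print(label_report("No evidence of edema.", edema_terms))
print(label_report("Mild pulmonary edema is present.", edema_terms))
print(label_report("Lungs are clear.", edema_terms))
```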

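For the treatment effect quantities defined in this section (ARR, NNT, Average Treatment Effect), a small worked example may help. The trial numbers are hypothetical, and outcomes are coded as 1 when the event (e.g. a stroke) happened and 0 otherwise, so a negative Average Treatment Effect means the treatment lowers the risk.

```python
from statistics import fmean

def absolute_risk_reduction(control_risk, treatment_risk):
    """ARR: risk without treatment minus risk with treatment."""
    return control_risk - treatment_risk

def number_needed_to_treat(control_risk, treatment_risk):
    """NNT = 1 / ARR; only defined when the treatment reduces risk."""
    arr = absolute_risk_reduction(control_risk, treatment_risk)
    if arr <= 0:
        raise ValueError("treatment does not reduce risk, NNT is undefined")
    return 1.0 / arr

def average_treatment_effect(treated_outcomes, control_outcomes):
    """Difference between the mean outcome with and without treatment."""
    return fmean(treated_outcomes) - fmean(control_outcomes)

# Hypothetical trial: 5 of 100 untreated and 3 of 100 treated patients have the event.
control = [1] * 5 + [0] * 95
treated = [1] * 3 + [0] * 97

arr = absolute_risk_reduction(0.05, 0.03)
nnt = number_needed_to_treat(0.05, 0.03)
ate = average_treatment_effect(treated, control)
print(f"ARR={arr:.2f}, NNT={nnt:.0f}, ATE={ate:.2f}")
```
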
Applications of Deep Learning in Medicine:

AI for Financial Services

TODO

Random Notes

  • Train, dev, and test datasets:

    • The dev dataset helps prevent overfitting of NN parameters (weights and biases) to the train data.
    • The test dataset helps prevent overfitting of NN hyper-parameters (model architecture, number of layers, types of layers) to the train and dev data.
  • What stage are we at? Stages of an ML project:

    1. Individual contributor
    2. Delegation
    3. Digitization
    4. Big Data and Analytics
    5. Machine Learning
  • CRISP-DM model

    1. Business understanding
    2. Data understanding
    3. Data preparation
    4. Modeling
    5. Evaluation
    6. Deployment
  • Precision, recall, accuracy, sensitivity, specificity

precisionrecall.png
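
The five metrics listed above all come from the four cells of a binary confusion matrix. A minimal sketch with made-up counts (sensitivity is the same quantity as recall):

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics from true/false positive and true/false negative counts."""
    return {
        "precision": tp / (tp + fp),    # of predicted positives, how many are real
        "recall": tp / (tp + fn),       # of real positives, how many were found
        "sensitivity": tp / (tp + fn),  # another name for recall
        "specificity": tn / (tn + fp),  # of real negatives, how many were found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical confusion matrix for 100 examples.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```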

Log loss (cross entropy loss):

log-loss-cross-entropy.png log-loss-cross-entropy-graph.png
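
A sketch of binary log loss in plain Python, assuming the usual definition: the average of -(y * log(p) + (1 - y) * log(1 - p)) over the examples. The clipping constant `eps` is an implementation assumption to avoid log(0).

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average binary cross-entropy; y_true holds 0/1 labels, y_pred probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = log_loss([1, 0], [0.9, 0.1])  # small loss
unsure = log_loss([1], [0.5])                   # log(2)
confident_wrong = log_loss([1], [0.01])         # large loss
print(confident_right, unsure, confident_wrong)
```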

  • Most of the economic value is created by supervised learning.
  • tanh activation usually works better than sigmoid in hidden layers because its output is zero-centered, which keeps the inputs to the next layer centered. Sigmoid should generally be used only in the output layer for binary classification, because its output lies between 0 and 1 and can be read as a probability.
  • ReLU usually works better than sigmoid or tanh because, unlike theirs, its derivative does not approach 0 for large positive inputs (it is 0 for negative inputs).
  • Weights of the layers of a neural net should be initialized with small random numbers. If they are initialized with zeros, then all neurons of a layer will learn the same weights. If they are initialized with large numbers and sigmoid or tanh activation is used, the neurons saturate quickly and learning stalls.
  • Why deep neural nets work better than shallow ones for complex functions: to calculate the XOR of N inputs, a deep neural net needs a tree of depth about log(N) with on the order of N units, while a net with a single hidden layer needs on the order of 2^N units. A shallow net would be much bigger.
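
The saturation behavior mentioned in the activation notes above can be checked by evaluating the derivatives directly. A small sketch using only the standard library (the input values are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    """Derivative of sigmoid: s * (1 - s); at most 0.25, near 0 for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    """Derivative of tanh: 1 - tanh(x)^2; near 0 for large |x|."""
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 10.0):
    print(f"x={x}: sigmoid'={d_sigmoid(x):.2e}, tanh'={d_tanh(x):.2e}, relu'={d_relu(x)}")
```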

GPT:

Data Science Tools

Google Colab

Resources