- Describe bias and variance with examples.
- What is Empirical Risk Minimization?
- What is Union bound and Hoeffding's inequality?
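For reference, a sketch of the two results in the usual notation (Z_1, ..., Z_m i.i.d. Bernoulli(phi) with sample mean phi-hat):

```latex
% Union bound: for any events A_1, ..., A_k (not necessarily independent)
P(A_1 \cup \cdots \cup A_k) \le P(A_1) + \cdots + P(A_k)

% Hoeffding's inequality: Z_1, \dots, Z_m \sim \mathrm{Bernoulli}(\phi) i.i.d.,
% \hat{\phi} = \frac{1}{m}\sum_{i=1}^{m} Z_i, and any fixed \gamma > 0
P\big( |\hat{\phi} - \phi| > \gamma \big) \le 2\exp(-2\gamma^2 m)
```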
- Write the formulae for training error and generalization error. Point out the differences.
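In the common (CS229-style) notation for a binary classifier h over m training examples:

```latex
% Training (empirical) error: fraction of the m training examples misclassified
\hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{ h(x^{(i)}) \neq y^{(i)} \}

% Generalization error: probability of misclassifying a fresh draw from
% the data distribution \mathcal{D}
\varepsilon(h) = P_{(x, y) \sim \mathcal{D}}\big( h(x) \neq y \big)
```

The key difference: the first is a random quantity computed on a finite sample, the second an expectation over the whole distribution.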
- State the uniform convergence theorem and derive it.
- What is sample complexity bound of uniform convergence theorem?
- What is error bound of uniform convergence theorem?
- What is the bias-variance trade-off theorem?
- From the bias-variance trade-off, can you derive the bound on training set size?
- What is the VC dimension?
- What does the training set size depend on for a finite and infinite hypothesis set? Compare and contrast.
- What is the VC dimension for an n-dimensional linear classifier?
- How is the VC dimension of an SVM bounded even though the data may be projected into an infinite-dimensional space?
- Given that Empirical Risk Minimization is an NP-hard problem, how do the logistic regression and SVM losses work around it?
- Why are model selection methods needed?
- How do you do a trade-off between bias and variance?
- What are the different attributes that can be selected by model selection methods?
- Why is cross-validation required?
- Describe different cross-validation techniques.
- What is hold-out cross validation? What are its advantages and disadvantages?
- What is k-fold cross validation? What are its advantages and disadvantages?
- What is leave-one-out cross validation? What are its advantages and disadvantages?
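The three techniques above fit one sketch: hold-out is a single train/validation split, k-fold averages over k splits, and leave-one-out is the k = m extreme. A minimal standard-library sketch, where `fit` and `score` are illustrative placeholders for any learner's train/evaluate calls (not a specific library's API):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, fit, score):
    """Average the hold-out score over k train/validation splits."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on all folds except fold i, validate on fold i.
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        val = [data[j] for j in val_idx]
        model = fit(train)
        scores.append(score(model, val))
    return sum(scores) / k
```

With `k = len(data)` this reduces to leave-one-out cross-validation; with `k = 2` and only one of the two splits scored, it degenerates toward hold-out.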
- Why is feature selection required?
- Describe some feature selection methods.
- What is forward feature selection method? What are its advantages and disadvantages?
- What is backward feature selection method? What are its advantages and disadvantages?
- What are filter feature selection methods? Describe two of them.
- What is mutual information and KL divergence?
- Describe KL divergence intuitively.
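An intuition-first sketch in plain Python (discrete distributions as probability lists; the helper names are illustrative): D_KL(p‖q) is the expected extra surprise, in nats, from modeling data drawn from p with the distribution q, and mutual information is exactly the KL divergence between the joint and the product of marginals.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    Zero iff p == q; asymmetric, so it is not a true distance."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint, px, py):
    """I(X;Y) = D_KL( p(x,y) || p(x)p(y) ): how far the joint is from
    independence. Zero iff X and Y are independent."""
    return sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)
```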
- Describe the curse of dimensionality with examples.
- What is local constancy or smoothness prior or regularization?
- State the universal approximation theorem. What technique is used to prove it?
- What is a Borel measurable function?
- Given the universal approximation theorem, why can't an MLP still reach an arbitrarily small positive error?
- What is the mathematical motivation of Deep Learning as opposed to standard Machine Learning techniques?
- In standard Machine Learning vs. Deep Learning, how is the order of number of samples related to the order of regions that can be recognized in the function space?
- What are the reasons for choosing a deep model as opposed to a shallow model? (1. Number of regions: O(2^k) vs. O(k), where k is the number of training examples. 2. The number of linear regions carved out in the function space depends exponentially on the depth.)
- How does Deep Learning tackle the curse of dimensionality?
- How can the SVM optimization function be derived from the logistic regression optimization function?
- What is a large margin classifier?
- Why is an SVM an example of a large margin classifier?
- Since SVM is a large margin classifier, is it influenced by outliers? (Yes, if C is large; otherwise not)
- What is the role of C in SVM?
- In SVM, what is the angle between the decision boundary and theta?
- What is the mathematical intuition of a large margin classifier?
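A sketch of the usual formalization: the geometric margin of (w, b) on an example is its signed distance to the decision boundary, and the SVM maximizes the worst-case margin, which (after fixing the functional margin to 1) becomes a convex problem:

```latex
% Geometric margin of (w, b) on example (x^{(i)}, y^{(i)})
\gamma^{(i)} = y^{(i)} \left( \frac{w^{T} x^{(i)} + b}{\lVert w \rVert} \right)

% Large margin = maximize the minimum margin; equivalently:
\min_{w, b} \; \frac{1}{2} \lVert w \rVert^{2}
\quad \text{s.t.} \quad y^{(i)} \big( w^{T} x^{(i)} + b \big) \ge 1, \; i = 1, \dots, m
```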
- What is a kernel in SVM? Why do we use kernels in SVM?
- What is a similarity function in SVM? Why is it named so?
- How are the landmarks initially chosen in an SVM? How many and where?
- Can we apply the kernel trick to logistic regression? Why is it not used in practice then?
- What is the difference between logistic regression and SVM without a kernel? (Only in implementation – one is much more efficient and has good optimization packages)
- How does the SVM parameter C affect the bias/variance trade-off? (Remember C = 1/lambda; as lambda increases, variance decreases)
- How does the SVM kernel parameter sigma^2 affect the bias/variance trade off?
- Can any similarity function be used for SVM? (No, have to satisfy Mercer’s theorem)
- Logistic regression vs. SVMs: when to use which one? (Let n and m be the number of features and training samples respectively. If n is large relative to m, use logistic regression or SVM with a linear kernel. If n is small and m is intermediate, use SVM with a Gaussian kernel. If n is small and m is massive, create or add more features, then use logistic regression or SVM without a kernel.)
- What are the differences between “Bayesian” and “Frequentist” approaches to Machine Learning?
- Compare and contrast maximum likelihood and maximum a posteriori estimation.
- How do Bayesian methods do automatic feature selection?
- What do you mean by Bayesian regularization?
- When will you use Bayesian methods instead of Frequentist methods? (Small dataset, large feature set)
- What is L1 regularization?
- What is L2 regularization?
- Compare L1 and L2 regularization.
- Why does L1 regularization result in sparse models?
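One way to see the sparsity: for a single coefficient in a squared-error fit, the L1-penalized solution is the soft-thresholding operator, which snaps small coefficients to exactly zero, while the L2 solution only shrinks them multiplicatively. A minimal sketch of the two closed forms (assuming a one-dimensional least-squares problem with unpenalized solution `w_ls` and penalty strength `lam`; function names are illustrative):

```python
def l1_solution(w_ls, lam):
    """argmin_w (1/2)(w - w_ls)^2 + lam*|w|  ->  soft thresholding."""
    if w_ls > lam:
        return w_ls - lam
    if w_ls < -lam:
        return w_ls + lam
    return 0.0  # exact zero: this is where sparsity comes from

def l2_solution(w_ls, lam):
    """argmin_w (1/2)(w - w_ls)^2 + (lam/2)*w^2  ->  shrinkage."""
    return w_ls / (1.0 + lam)  # never exactly zero unless w_ls already is
```

Any coefficient whose unpenalized value falls inside [-lam, lam] is zeroed out by L1, which is exactly why lasso-style models end up sparse.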
- What are accuracy, sensitivity, specificity, ROC?
- What are precision and recall?
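A minimal sketch computing all of these from confusion-matrix counts (the function name is illustrative):

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "precision":   tp / (tp + fp),   # of predicted positives, how many are right
        "recall":      tp / (tp + fn),   # a.k.a. sensitivity, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }
```

The ROC curve plots the true positive rate (recall) against the false positive rate (1 - specificity) as the decision threshold varies.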
- Describe t-test in the context of Machine Learning.
- Describe the k-means algorithm.
- What is distortion function? Is it convex or non-convex?
- Tell me about the convergence of the distortion function.
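A minimal pure-Python sketch of Lloyd's algorithm for 2-D points, with the distortion function alongside (deterministic first-k initialization for simplicity; real implementations use random restarts):

```python
def distortion(points, centroids):
    """J = sum over points of squared distance to the nearest centroid."""
    return sum(min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
               for x, y in points)

def kmeans(points, k, iters=10):
    """Lloyd's algorithm. Each (assign, re-average) iteration cannot
    increase J, so J converges monotonically; but J is non-convex, so
    only a local optimum is guaranteed."""
    centroids = list(points[:k])  # deterministic init, for illustration only
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[j].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        centroids = [(sum(x for x, _ in cl) / len(cl),
                      sum(y for _, y in cl) / len(cl))
                     if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids
```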
- Topic: EM algorithm
- What is the Gaussian Mixture Model?
- Describe the EM algorithm intuitively.
- What are the two steps of the EM algorithm?
- Compare GMM vs GDA.
- Why do we need dimensionality reduction techniques? (data compression, speeds up learning algorithm and visualizing data)
- Why do we need PCA and what does it do? (PCA tries to find a lower-dimensional surface such that the sum of the squared projection errors is minimized)
- What is the difference between logistic regression and PCA?
- What are the two pre-processing steps that should be applied before doing PCA? (mean normalization and feature scaling)
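A minimal pure-Python sketch of those two steps (mean normalization, then scaling each feature to unit variance); without them, PCA's principal directions are dominated by whichever feature has the largest raw units:

```python
def standardize(X):
    """Mean-normalize and feature-scale the columns of X (a list of rows):
    x_j <- (x_j - mean_j) / std_j, so each feature has mean 0, variance 1."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    # `or 1.0` guards against zero-variance (constant) features.
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5 or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]
```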
- What is WORD2VEC?
- What is t-SNE? Why do we use PCA instead of t-SNE?
- What is sampled softmax?
- Why is it difficult to train an RNN with SGD?
- How do you tackle the problem of exploding gradients? (By gradient clipping)
- What is the problem of vanishing gradients? (The RNN tends not to remember much from the distant past)
- How do you tackle the problem of vanishing gradients? (By using LSTM)
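The exploding-gradient fix hinted at above can be illustrated directly; a minimal sketch of clipping by global norm on a flat list of gradient values (the function name is illustrative, not a specific framework's API):

```python
def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm.
    This bounds the SGD step size during exploding-gradient spikes
    without changing the gradient's direction."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm:
        return grads  # within budget: leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]
```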
- Explain the memory cell of a LSTM. (LSTM allows forgetting of data and using long memory when appropriate.)
- What type of regularization does one use in an LSTM?
- What is Beam Search?
- How to automatically caption an image? (CNN + LSTM)
- What is the difference between loss function, cost function and objective function?