- Describe bias and variance with examples.
- What is Empirical Risk Minimization?
- What is Union bound and Hoeffding's inequality?
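For reference, a sketch of the two results in the usual notation (Z_1, ..., Z_m i.i.d. Bernoulli(phi) with sample mean phi-hat):

```latex
% Union bound: for any events A_1, ..., A_k (not necessarily independent)
P(A_1 \cup \cdots \cup A_k) \le P(A_1) + \cdots + P(A_k)

% Hoeffding's inequality: Z_1, \dots, Z_m \sim \mathrm{Bernoulli}(\phi) i.i.d.,
% \hat{\phi} = \frac{1}{m}\sum_{i=1}^{m} Z_i, and any fixed \gamma > 0
P\big( |\hat{\phi} - \phi| > \gamma \big) \le 2\exp(-2\gamma^2 m)
```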
- Write the formulae for training error and generalization error. Point out the differences.
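In the common (CS229-style) notation for a binary classifier h over m training examples:

```latex
% Training (empirical) error: fraction of the m training examples misclassified
\hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{ h(x^{(i)}) \neq y^{(i)} \}

% Generalization error: probability of misclassifying a fresh draw from
% the data distribution \mathcal{D}
\varepsilon(h) = P_{(x, y) \sim \mathcal{D}}\big( h(x) \neq y \big)
```

The key difference: the first is a random quantity computed on a finite sample, the second an expectation over the whole distribution.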
- State the uniform convergence theorem and derive it.
- What is sample complexity bound of uniform convergence theorem?
- What is error bound of uniform convergence theorem?
- What is the bias-variance trade-off theorem?
- From the bias-variance trade-off, can you derive the bound on training set size?
- What is the VC dimension?
- What does the training set size depend on for a finite and infinite hypothesis set? Compare and contrast.
- What is the VC dimension for an n-dimensional linear classifier?
- How is the VC dimension of an SVM bounded even though the data may be projected into an infinite-dimensional space?
- Given that Empirical Risk Minimization is an NP-hard problem, how do the logistic regression and SVM losses work around it?
- Why are model selection methods needed?
- How do you do a trade-off between bias and variance?
- What are the different attributes that can be selected by model selection methods?
- Why is cross-validation required?
- Describe different cross-validation techniques.
- What is hold-out cross validation? What are its advantages and disadvantages?
- What is k-fold cross validation? What are its advantages and disadvantages?
- What is leave-one-out cross validation? What are its advantages and disadvantages?
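The three techniques above fit one sketch: hold-out is a single train/validation split, k-fold averages over k splits, and leave-one-out is the k = m extreme. A minimal standard-library sketch, where `fit` and `score` are illustrative placeholders for any learner's train/evaluate calls (not a specific library's API):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, fit, score):
    """Average the hold-out score over k train/validation splits."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on all folds except fold i, validate on fold i.
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        val = [data[j] for j in val_idx]
        model = fit(train)
        scores.append(score(model, val))
    return sum(scores) / k
```

With `k = len(data)` this reduces to leave-one-out cross-validation; with `k = 2` and only one of the two splits scored, it degenerates toward hold-out.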
- Why is feature selection required?
- Describe some feature selection methods.
- What is forward feature selection method? What are its advantages and disadvantages?
- What is backward feature selection method? What are its advantages and disadvantages?
- What are filter feature selection methods? Describe two of them.
- What is mutual information and KL divergence?
- Describe KL divergence intuitively.
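An intuition-first sketch in plain Python (discrete distributions as probability lists; the helper names are illustrative): D_KL(p‖q) is the expected extra surprise, in nats, from modeling data drawn from p with the distribution q, and mutual information is exactly the KL divergence between the joint and the product of marginals.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    Zero iff p == q; asymmetric, so it is not a true distance."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint, px, py):
    """I(X;Y) = D_KL( p(x,y) || p(x)p(y) ): how far the joint is from
    independence. Zero iff X and Y are independent."""
    return sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)
```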
- Describe the curse of dimensionality with examples.
- What is local constancy or smoothness prior or regularization?
- State the universal approximation theorem. What technique is used to prove it?
- What is a Borel measurable function?
- Given the universal approximation theorem, why can't an MLP still reach an arbitrarily small positive error?
- What is the mathematical motivation of Deep Learning as opposed to standard Machine Learning techniques?
- In standard Machine Learning vs. Deep Learning, how is the order of number of samples related to the order of regions that can be recognized in the function space?
- What are the reasons for choosing a deep model as opposed to a shallow model? (1. Number of regions: O(2^k) vs. O(k), where k is the number of training examples. 2. The number of linear regions carved out in the function space depends exponentially on the depth.)
- How does Deep Learning tackle the curse of dimensionality?
- How can the SVM optimization function be derived from the logistic regression optimization function?
- What is a large margin classifier?
- Why is an SVM an example of a large margin classifier?
- Since SVM is a large margin classifier, is it influenced by outliers? (Yes, if C is large; otherwise not)
- What is the role of C in SVM?
- In SVM, what is the angle between the decision boundary and theta?
- What is the mathematical intuition of a large margin classifier?
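A sketch of the usual formalization: the geometric margin of (w, b) on an example is its signed distance to the decision boundary, and the SVM maximizes the worst-case margin, which (after fixing the functional margin to 1) becomes a convex problem:

```latex
% Geometric margin of (w, b) on example (x^{(i)}, y^{(i)})
\gamma^{(i)} = y^{(i)} \left( \frac{w^{T} x^{(i)} + b}{\lVert w \rVert} \right)

% Large margin = maximize the minimum margin; equivalently:
\min_{w, b} \; \frac{1}{2} \lVert w \rVert^{2}
\quad \text{s.t.} \quad y^{(i)} \big( w^{T} x^{(i)} + b \big) \ge 1, \; i = 1, \dots, m
```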
- What is a kernel in SVM? Why do we use kernels in SVM?
- What is a similarity function in SVM? Why is it named so?
- How are the landmarks initially chosen in an SVM? How many and where?
- Can we apply the kernel trick to logistic regression? Why is it not used in practice then?
- What is the difference between logistic regression and SVM without a kernel? (Only in implementation – one is much more efficient and has good optimization packages)
- How does the SVM parameter C affect the bias/variance trade-off? (Remember C = 1/lambda; as lambda increases, variance decreases)
- How does the SVM kernel parameter sigma^2 affect the bias/variance trade off?
- Can any similarity function be used for SVM? (No, have to satisfy Mercer’s theorem)
- Logistic regression vs. SVMs: when to use which one? (Let n and m be the number of features and training samples respectively. If n is large relative to m, use logistic regression or SVM with a linear kernel. If n is small and m is intermediate, use SVM with a Gaussian kernel. If n is small and m is massive, create or add more features, then use logistic regression or SVM without a kernel.)
- What are the differences between “Bayesian” and “Frequentist” approaches to Machine Learning?
- Compare and contrast maximum likelihood and maximum a posteriori estimation.
- How do Bayesian methods do automatic feature selection?
- What do you mean by Bayesian regularization?
- When will you use Bayesian methods instead of Frequentist methods? (Small dataset, large feature set)
- What is L1 regularization?
- What is L2 regularization?
- Compare L1 and L2 regularization.
- Why does L1 regularization result in sparse models?
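One way to see the sparsity: for a single coefficient in a squared-error fit, the L1-penalized solution is the soft-thresholding operator, which snaps small coefficients to exactly zero, while the L2 solution only shrinks them multiplicatively. A minimal sketch of the two closed forms (assuming a one-dimensional least-squares problem with unpenalized solution `w_ls` and penalty strength `lam`; function names are illustrative):

```python
def l1_solution(w_ls, lam):
    """argmin_w (1/2)(w - w_ls)^2 + lam*|w|  ->  soft thresholding."""
    if w_ls > lam:
        return w_ls - lam
    if w_ls < -lam:
        return w_ls + lam
    return 0.0  # exact zero: this is where sparsity comes from

def l2_solution(w_ls, lam):
    """argmin_w (1/2)(w - w_ls)^2 + (lam/2)*w^2  ->  shrinkage."""
    return w_ls / (1.0 + lam)  # never exactly zero unless w_ls already is
```

Any coefficient whose unpenalized value falls inside [-lam, lam] is zeroed out by L1, which is exactly why lasso-style models end up sparse.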
- What are accuracy, sensitivity, specificity, ROC?
- What are precision and recall?
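A minimal sketch computing all of these from confusion-matrix counts (the function name is illustrative):

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "precision":   tp / (tp + fp),   # of predicted positives, how many are right
        "recall":      tp / (tp + fn),   # a.k.a. sensitivity, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }
```

The ROC curve plots the true positive rate (recall) against the false positive rate (1 - specificity) as the decision threshold varies.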
- Describe t-test in the context of Machine Learning.
- Describe the k-means algorithm.
- What is distortion function? Is it convex or non-convex?
- Tell me about the convergence of the distortion function.
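A minimal pure-Python sketch of Lloyd's algorithm for 2-D points, with the distortion function alongside (deterministic first-k initialization for simplicity; real implementations use random restarts):

```python
def distortion(points, centroids):
    """J = sum over points of squared distance to the nearest centroid."""
    return sum(min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
               for x, y in points)

def kmeans(points, k, iters=10):
    """Lloyd's algorithm. Each (assign, re-average) iteration cannot
    increase J, so J converges monotonically; but J is non-convex, so
    only a local optimum is guaranteed."""
    centroids = list(points[:k])  # deterministic init, for illustration only
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[j].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        centroids = [(sum(x for x, _ in cl) / len(cl),
                      sum(y for _, y in cl) / len(cl))
                     if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids
```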
- Topic: EM algorithm
- What is the Gaussian Mixture Model?
- Describe the EM algorithm intuitively.
- What are the two steps of the EM algorithm?
- Compare GMM vs GDA.
- Why do we need dimensionality reduction techniques? (data compression, speeds up learning algorithm and visualizing data)
- Why do we need PCA and what does it do? (PCA tries to find a lower-dimensional surface such that the sum of the squared projection errors is minimized)
- What is the difference between logistic regression and PCA?
- What are the two pre-processing steps that should be applied before doing PCA? (mean normalization and feature scaling)
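A minimal pure-Python sketch of those two steps (mean normalization, then scaling each feature to unit variance); without them, PCA's principal directions are dominated by whichever feature has the largest raw units:

```python
def standardize(X):
    """Mean-normalize and feature-scale the columns of X (a list of rows):
    x_j <- (x_j - mean_j) / std_j, so each feature has mean 0, variance 1."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    # `or 1.0` guards against zero-variance (constant) features.
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5 or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]
```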
- What is WORD2VEC?
- What is t-SNE? Why do we use PCA instead of t-SNE?
- What is sampled softmax?
- Why is it difficult to train an RNN with SGD?
- How do you tackle the problem of exploding gradients? (By gradient clipping)
- What is the problem of vanishing gradients? (The RNN tends not to remember much from the distant past)
- How do you tackle the problem of vanishing gradients? (By using LSTM)
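The exploding-gradient fix hinted at above can be illustrated directly; a minimal sketch of clipping by global norm on a flat list of gradient values (the function name is illustrative, not a specific framework's API):

```python
def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm.
    This bounds the SGD step size during exploding-gradient spikes
    without changing the gradient's direction."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm:
        return grads  # within budget: leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]
```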
- Explain the memory cell of a LSTM. (LSTM allows forgetting of data and using long memory when appropriate.)
- What type of regularization does one use in an LSTM?
- What is Beam Search?
- How to automatically caption an image? (CNN + LSTM)
- What is the difference between loss function, cost function and objective function?