\documentclass[12pt,letterpaper]{article}
\usepackage{amsmath} % just math
\usepackage{amssymb} % allow blackboard bold (aka N,R,Q sets)
\usepackage{ulem}
\usepackage{graphicx}
\usepackage{float}
\linespread{1.6} % double spaces lines
\usepackage[left=1in,top=1in,right=1in,bottom=1in,nohead]{geometry}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{floatrow}
\usepackage{blindtext}
\begin{document}
\setcounter{subsection}{1}
\begin{flushright}
\end{flushright}
\begin{flushleft}
\textbf{Eric Zounes} \\
\today \\
CS434: Assignment 2
\end{flushleft}
\section[1.]{Implementation Assignment}
In this assignment we implement a top-down greedy induction algorithm for learning decision trees. The data set describes each example (a monk) with the following six features: \\
$x_{1}$: head\_shape $\in \{$round, square, octagon$\}$ \\
$x_{2}$: body\_shape $\in \{$round, square, octagon$\}$ \\
$x_{3}$: is\_smiling $\in \{$yes, no$\}$ \\
$x_{4}$: holding $\in \{$sword, balloon, flag$\}$ \\
$x_{5}$: jacket\_color $\in \{$red, yellow, green, blue$\}$ \\
$x_{6}$: has\_tie $\in \{$yes, no$\}$ \\
The class label is a concept defined by the following rule: \\
(head\_shape $=$ body\_shape) or (jacket\_color $=$ red) \\
Equivalently, if $x_{1} = x_{2}$ or $x_{5} = 1$ (red), then $y = 1$. \\
As instructed, we learn (1) a decision stump and (2) a depth-2 decision tree from the training data, and include:
\begin{enumerate}
\item[1.] Both the learned stump and the depth-2 decision tree. To make grading easier, please provide the information gain of each selected test.
\item[2.] The training and testing error rate of the learned stump and depth-2 decision tree.
\end{enumerate}
Finally, please answer the following questions:
\begin{enumerate}
\item Given the formula for generating the class label, please provide as compact a decision tree as possible that correctly classifies all training examples.
\item Do you expect the top-down greedy induction algorithm to learn this optimal tree?
\begin{table}[htbp]
\centering
\begin{minipage}[b]{5in}
\centering
\begin{tabular}{ | l | l |}
\hline
Tree Depth & Error Rate \\ \hline
1 & 0.250 \\
2 & 0.301 \\
3 & 0.317 \\
\hline
\end{tabular}
\caption{Error rate on testing data}
\label{tab:test-error}
\end{minipage}
\begin{minipage}[b]{5in}
\centering
\begin{tabular}{ | l | l |}
\hline
Tree Depth & Error Rate \\ \hline
1 & 0.266 \\
2 & 0.287 \\
3 & 0.255 \\
\hline
\end{tabular}
\caption{Error rate on training data}
\label{tab:train-error}
\end{minipage}
\end{table}
\begin{figure}[h!]
\centering
\includegraphics[width=4in]{BPerfect.eps}
\caption{Ideal tree based on the class definition}
\end{figure}
\begin{figure}[h!]
\centering
\includegraphics[width=\textwidth]{D2.eps}
\caption{Depth-2 learned decision tree, with the decision stump as its root}
\end{figure}
\end{enumerate}
\subsection{Decision Stump}
The decision stump we obtained is the first level of our depth-2 tree. It splits on $x_{5}$ (jacket\_color) because that clause of the target concept can be tested in a single step, so it yields the largest information gain for a single test. The other clause, $x_{1} = x_{2}$, needs two tests to express, and after those tests the $x_{5}$ clause would be split across several branches. It therefore makes sense that the first test chosen was on the value of $x_{5}$.
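As an illustration only (a minimal sketch, not the code we actually submitted), the root test can be selected by computing the information gain of every single-feature test and keeping the best one; the \texttt{xs}/\texttt{ys} names and the integer-coded feature values are assumptions:
\begin{verbatim}
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys, feature):
    """Gain from splitting examples (xs, ys) on one categorical feature."""
    gain = entropy(ys)
    n = len(ys)
    for value in set(x[feature] for x in xs):
        subset = [y for x, y in zip(xs, ys) if x[feature] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def best_stump(xs, ys):
    """Return (feature index, gain) of the single test with highest gain."""
    gains = {f: information_gain(xs, ys, f) for f in range(len(xs[0]))}
    return max(gains.items(), key=lambda kv: kv[1])
\end{verbatim}
On the training data this selection should pick $x_{5}$, consistent with the stump reported above; growing the depth-2 tree simply repeats the same selection inside each branch.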
\subsection{Class Label Graph}
The ideal tree classifies using exactly the clauses of the rule. The best tree our algorithm could have learned would split first on the value of $x_{5}$, which immediately gives one pure leaf (jacket\_color $=$ red), and then split the remaining children on $x_{1}$ and $x_{2}$ in either order. When we ran our algorithm to a depth of 3, it chose $x_{2}$ for one of the siblings below the first level, and all of that node's children then split on $x_{1}$. Such a tree covers both clauses of the rule, so every true case ends up in a leaf node.
\subsection{Ideal Trees and the Top-down Greedy Induction Algorithm}
In general the greedy induction algorithm will not produce the optimal decision tree. It makes a locally optimal choice at each node, but those choices are not necessarily globally optimal, and many branches of the learned tree are not relevant to the classifying rule. Looking at the depth-2 tree produced by the algorithm, we see that it makes some correct choices, but it also partitions on several attributes that are not useful for recovering the rule.
\subsection{Testing and Training Error}
The difference between the testing and training error suggests over-fitting: the testing error increased with each level added after the root, with the root test being the most effective. This is not surprising given the small training set and the greedy, local nature of the test selection.
\section[2.]{Individual Assignment}
\begin{enumerate}
\item We have two identical bags. Bag A contains 4 red marbles and 6 black marbles. Bag B contains 6 red marbles and 6 black marbles. Now we randomly chose a bag and drew a marble from the chosen bag and it turns out to be black. What is the probability that the chosen bag is bag A? \\
To solve this problem we use Bayes' rule: \\
$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})}$ \\
Bag $\in \{A, B\}$ \\
Marble $\in \{$red, black$\}$ \\
We define the events: \\
$B =$ drew a black marble \\
$A =$ chose bag A \\
$P(A) = P(\bar{A}) = .5$, $P(B|A) = \frac{6}{10} = .6$, $P(B|\bar{A}) = \frac{6}{12} = .5$ \\
$P(A|B) = \frac{.6 \cdot .5}{.6 \cdot .5 + .5 \cdot .5} = \frac{.30}{.55} \approx 54.5\%$ \\
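As a quick numerical check (a sketch only, not part of the required answer):
\begin{verbatim}
# Bag A: 4 red, 6 black; bag B: 6 red, 6 black; each chosen with prob 1/2.
p_black_A, p_black_B, p_A = 6 / 10, 6 / 12, 0.5
posterior_A = p_black_A * p_A / (p_black_A * p_A + p_black_B * (1 - p_A))
print(posterior_A)   # 0.5454...  ~ 54.5%
\end{verbatim}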
\item Suppose we have class variable y and three attributes $x_{1}, x_{2}, x_{3}$ and we wish to calculate $P(y = 1 | x_{1} = u_{1}, x_{2} = u_{2}, x_{3} = u_{3})$, and we have no conditional independence information. \\
\begin{enumerate}
\item Which of the following sets of probabilities are sufficient for calculation? \\
\begin{enumerate}
\item $P(y = 1); P(x_{1} = u_{1} | y = 1); P( x_{2} = u_{2} | y = 1); P(x_{3} = u_{3} | y = 1)$ \\
\item $P(x_{1} = u_{1}, x_{2} = u_{2}, x_{3} = u_{3}); P(y = 1); P(x_{1} = u_{1}, x_{2} = u_{2}, x_{3} = u_{3} | y = 1)$ \\
\item $P(y = 1); P(y = 1| x_{1} = u_{1}); p(y = 1 | x_{2} = u_{2}); P(y = 1 | x_{3} = u_{3})$ \\
\end{enumerate}
The second set is sufficient: since we have no independence information we need the full joint probability of the features given the class, and with $P(y = 1)$, $P(x_{1} = u_{1}, x_{2} = u_{2}, x_{3} = u_{3})$, and $P(x_{1} = u_{1}, x_{2} = u_{2}, x_{3} = u_{3} | y = 1)$ we can apply Bayes' rule directly.\\
\item Now suppose we know that the variables $x_{1},x_{2},x_{3}$ are conditionally independent given the class variable Y. Which of the above 3 sets are sufficient now? \\
Now the first set is sufficient, since conditional independence lets us compute the probability of each feature separately given the class and combine those terms with the prior.\\
\end{enumerate}
\item (Naive Bayes Classifier) We will use the following training set to build a Naive Bayes classifier. A, B, and C are three binary attributes and Y is the target class label. \\[10mm]
\centering
\begin{tabular}{| l | l | l | l |}
\hline
A & B & C & Y \\ \hline
0 & 1 & 1 & 0 \\ \hline
1 & 1 & 1 & 0 \\ \hline
0 & 0 & 0 & 0 \\ \hline
1 & 1 & 0 & 1 \\ \hline
0 & 1 & 0 & 1 \\ \hline
1 & 0 & 1 & 1 \\ \hline
\end{tabular} \\[15mm]
\begin{enumerate}
\item Based on the training data, calculate the prior distribution for Y:P(Y), with and without Laplace smoothing. \\
$P(y = j) = \frac{1}{n} \sum_{i=1}^{n} I(y_{i} = j)$ \\
There are 6 examples and a binary class. So, we can say: \\
$\frac{1}{6} \sum_{i=1}^{6} I(y_{i} = 1)$ \\
$P(Y = 1) = \frac{3}{6} = .5$, and likewise $P(Y = 0) = .5$ \\
With Laplace smoothing ($k = 1$) we obtain: \\
$P(Y = 1) = \frac{3+1}{6+2 \cdot 1} = .5$ \\
Smoothing effectively adds one pseudo-example for each of the two class values, which leaves this prior unchanged. \\
\item Based on the training data, calculate the distributions $P(A|Y), P(B|Y)$ and $P(C|Y)$, with and without Laplace smoothing. \\
Each conditional is estimated from the three training examples of the corresponding class. Without Laplace smoothing: \\
$P(A = 1 | Y = 1) = \frac{2}{3}$, $P(B = 1 | Y = 1) = \frac{2}{3}$, $P(C = 1 | Y = 1) = \frac{1}{3}$ \\
$P(A = 1 | Y = 0) = \frac{1}{3}$, $P(B = 1 | Y = 0) = \frac{2}{3}$, $P(C = 1 | Y = 0) = \frac{2}{3}$ \\
With Laplace smoothing ($k = 1$, two values per attribute): \\
$P(A = 1 | Y = 1) = \frac{2+1}{3+2} = .6$, $P(B = 1 | Y = 1) = \frac{2+1}{3+2} = .6$, $P(C = 1 | Y = 1) = \frac{1+1}{3+2} = .4$ \\
$P(A = 1 | Y = 0) = \frac{1+1}{3+2} = .4$, $P(B = 1 | Y = 0) = \frac{2+1}{3+2} = .6$, $P(C = 1 | Y = 0) = \frac{2+1}{3+2} = .6$ \\
The complementary probabilities, e.g.\ $P(A = 0 | Y = 1)$, follow by subtracting from 1. \\
\item What prediction will the Naive Bayes classifier make for a new example $(A = 1, B = 0, C = 0)$, with and without Laplace smoothing? \\
To classify the new example we compare \\
$P(y = 1)\,P(A = 1 | y = 1)\,P(B = 0 | y = 1)\,P(C = 0 | y = 1)$ \\
and \\
$P(y = 0)\,P(A = 1 | y = 0)\,P(B = 0 | y = 0)\,P(C = 0 | y = 0)$ \\
Without smoothing: \\
$\frac{3}{6} \cdot \frac{2}{3} \cdot \frac{1}{3} \cdot \frac{2}{3} \approx .074$ for $y = 1$, versus
$\frac{3}{6} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot \frac{1}{3} \approx .019$ for $y = 0$, \\
so the classifier predicts $y = 1$. \\
With Laplace smoothing: \\
$\frac{4}{8} \cdot \frac{3}{5} \cdot \frac{2}{5} \cdot \frac{3}{5} = .072$ for $y = 1$, versus
$\frac{4}{8} \cdot \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{2}{5} = .032$ for $y = 0$, \\
so the prediction is again $y = 1$. These scores are reproduced by the short sketch after this list. \\
\end{enumerate}
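The counts above can be reproduced with a short script (a minimal sketch, assuming $k = 1$ Laplace smoothing with binary attributes; \texttt{naive\_bayes\_score} is a name chosen here for illustration):
\begin{verbatim}
# Training data from the table: rows are (A, B, C, Y).
data = [(0, 1, 1, 0), (1, 1, 1, 0), (0, 0, 0, 0),
        (1, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 1)]

def naive_bayes_score(a, b, c, y, k=1):
    """Unnormalized P(Y=y) * prod of P(attr | Y=y), Laplace constant k."""
    rows = [r for r in data if r[3] == y]
    score = (len(rows) + k) / (len(data) + 2 * k)     # prior, 2 class values
    for i, value in enumerate((a, b, c)):
        matches = sum(1 for r in rows if r[i] == value)
        score *= (matches + k) / (len(rows) + 2 * k)  # 2 values per attribute
    return score

# New example (A=1, B=0, C=0): compare the two class scores.
for k in (0, 1):                                      # k=0 means no smoothing
    s0 = naive_bayes_score(1, 0, 0, 0, k)
    s1 = naive_bayes_score(1, 0, 0, 1, k)
    label = "y = 1" if s1 > s0 else "y = 0"
    print(k, round(s0, 3), round(s1, 3), "-> predict", label)
\end{verbatim}
With and without smoothing the score for $y = 1$ is the larger one, matching the calculation in part (c).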
\item Decision tree learning \\
Given the following data set: \\[15mm]
\begin{tabular}{c | c | c || c }
V & W & X & Y \\ \hline
0 & 0 & 0 & 0 \\ \hline
0 & 1 & 0 & 1 \\ \hline
1 & 0 & 0 & 1 \\ \hline
1 & 1 & 0 & 0 \\ \hline
1 & 1 & 1 & 0 \\ \hline
\end{tabular} \\[15mm]
The task is to build a decision tree for classifying Y. \\
\begin{enumerate}
\item Compute the information gain of attributes X, V, and W respectively. \\
\begin{figure}[h!]
\centering
\includegraphics[width=4in]{T3.eps}
\caption{Candidate trees rooted at X, V, and W}
\end{figure}
For V: \\
$H(S) = -\frac{2}{5}\log_{2}\frac{2}{5} - \frac{3}{5}\log_{2}\frac{3}{5} \approx .97$ \\
$H(S_{V=0}) = -\frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} = 1.0$ \\
$H(S_{V=1}) = -\frac{1}{3}\log_{2}\frac{1}{3} - \frac{2}{3}\log_{2}\frac{2}{3} \approx .92$ \\
$IG(V) = .97 - \frac{2}{5} \cdot 1.0 - \frac{3}{5} \cdot .92 \approx .02$ \\
For W: \\
$H(S) \approx .97$ as above \\
$H(S_{W=0}) = -\frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} = 1.0$ \\
$H(S_{W=1}) = -\frac{1}{3}\log_{2}\frac{1}{3} - \frac{2}{3}\log_{2}\frac{2}{3} \approx .92$ \\
$IG(W) = .97 - \frac{2}{5} \cdot 1.0 - \frac{3}{5} \cdot .92 \approx .02$ \\
For X: \\
$H(S) \approx .97$ as above \\
$H(S_{X=1}) = 0$ (its single example has label 0) \\
$H(S_{X=0}) = -\frac{2}{4}\log_{2}\frac{2}{4} - \frac{2}{4}\log_{2}\frac{2}{4} = 1.0$ \\
$IG(X) = .97 - \frac{1}{5} \cdot 0 - \frac{4}{5} \cdot 1.0 \approx .17$ \\
X has the largest gain and is selected as the root test; these values can be double-checked with the short entropy sketch below. \\
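As an illustration (a minimal sketch; the helper name \texttt{H} is an assumption), the same gains can be computed directly from the class counts in each subset:
\begin{verbatim}
from math import log2

def H(pos, neg):
    """Entropy of a subset with `pos` positive and `neg` negative labels."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    return -(pos/total)*log2(pos/total) - (neg/total)*log2(neg/total)

base = H(2, 3)                                  # H(S)  ~ 0.971
gain_V = base - 2/5 * H(1, 1) - 3/5 * H(1, 2)   # ~ 0.020
gain_W = base - 2/5 * H(1, 1) - 3/5 * H(1, 2)   # ~ 0.020
gain_X = base - 1/5 * H(0, 1) - 4/5 * H(2, 2)   # ~ 0.171
print(gain_V, gain_W, gain_X)
\end{verbatim}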
\item Use information gain to select the tests and produce the full decision tree. \\
\begin{figure}[h!]
\centering
\includegraphics[width=6in]{FullT4.eps}
\caption{Full tree rooted at X}
\end{figure}
\item Consider the following two strategies for avoiding over-fitting. \\
\begin{enumerate}
\item The first strategy stops growing the tree when the information gain of the best test is less than a given threshold $\epsilon$. \\
When we grow the tree beyond the decision stump, splitting the $X = 0$ branch on either W or V yields zero information gain, because each split partitions those four examples into two equally mixed halves. With any positive threshold $\epsilon$ the first strategy therefore stops after the root test. Continuing to grow would classify the training data perfectly, but given the limited size of the data set and its apparent randomness, the deeper tree would most likely be over-fit. The resulting depth-1 tree correctly classifies the single $X = 1$ example and splits the remaining four evenly, so whichever label the $X = 0$ leaf predicts, two of the five training examples are misclassified (training error $2/5$).
\item The second strategy grows the full tree first and then prunes it bottom up: start from the lowest level of the tree and prune a sub-tree if the information gain of its test is less than a given threshold $\epsilon$. (Note that you should stop checking level $t$ if none of the sub-trees at level $t+1$ satisfies the pruning criterion.) \\
Let $\epsilon$ be $0.001$ for both cases; write down the resulting tree for each strategy and compare their training errors. \\
Looking at the lowest-level tests of the full tree, both have an information gain of 1.0, since each one perfectly separates its two remaining examples, while their parent's test has zero gain. With $\epsilon = 0.001$ the bottom-level tests are therefore not pruned, and by the stopping rule we never prune their parent, so the full tree is kept and has zero training error. Had we pruned the sub-tree above the two right-hand leaves instead, we would obtain the same tree as the first strategy. More generally, pruning on such a limited data set is fragile: how the data is separated into training and validation sets strongly affects the outcome, and there is not enough data to draw firm conclusions about the tree, so the unpruned full tree is unlikely to be practical on further test data.
\end{enumerate}
\item Discuss the advantages and disadvantages of the two strategies. \\
With this limited data set there are significant drawbacks to pruning: how the data is partitioned into training and validation sets can have a significant impact on the result of pruning. There also do not appear to be strong trends in the data beyond the root test, so both early stopping and pruning can leave decision trees with high error rates. The full tree rooted at X, with the second level split on W or V, has zero training error because its leaves classify the training data completely, whereas before that point only the decision stump makes a useful prediction. A sketch of the bottom-up pruning step is given below.
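A minimal sketch of the second strategy (bottom-up pruning), assuming the tree is stored as nested dictionaries; the field names and layout are chosen here only for illustration:
\begin{verbatim}
# A leaf is a plain class label; an internal node is a dict of the form
# {"gain": g, "majority": label, "children": {value: subtree}}.

def prune_bottom_up(node, eps):
    """Collapse a sub-tree to a leaf when its test's gain is below eps."""
    if not isinstance(node, dict):            # already a leaf
        return node
    node["children"] = {v: prune_bottom_up(child, eps)
                        for v, child in node["children"].items()}
    all_leaves = all(not isinstance(c, dict)
                     for c in node["children"].values())
    if all_leaves and node["gain"] < eps:
        return node["majority"]               # prune: predict majority label
    return node                               # keep; stop checking above here
\end{verbatim}
The first strategy needs no pruning pass at all: during growth one simply returns a majority-label leaf whenever the best available gain is below $\epsilon$.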
\end{enumerate}
\item Show that logistic regression learns a linear decision boundary. \\
Logistic regression models \\
$P(y = 1 | x) = \frac{1}{1 + e^{-w \cdot x}}$ and $P(y = 0 | x) = \frac{e^{-w \cdot x}}{1 + e^{-w \cdot x}}$ \\
so the log-odds are \\
$\log\frac{P(y = 1 | x)}{P(y = 0 | x)} = \log\frac{1/(1 + e^{-w \cdot x})}{e^{-w \cdot x}/(1 + e^{-w \cdot x})} = \log e^{w \cdot x} = w \cdot x = w_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m}$ \\
We predict $y = 1$ exactly when $P(y = 1 | x) > P(y = 0 | x)$, i.e.\ when the log-odds are positive. The decision boundary is therefore the set of points where $w_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m} = 0$, which is a linear equation in $x$, so the boundary is a hyperplane. \\
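To see the linearity concretely, the classification rule reduces to checking the sign of $w \cdot x$. A tiny sketch with made-up weights (the values below are assumptions, not learned from any data):
\begin{verbatim}
# Boundary in 2-D: w0 + w1*x1 + w2*x2 = 0, a straight line in the plane.
w0, w1, w2 = -1.0, 2.0, -3.0

def predict(x1, x2):
    """Logistic regression prediction: y = 1 exactly when w.x > 0."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

print(predict(3.0, 1.0))   # 1: this point lies on the positive side
print(predict(0.0, 0.0))   # 0: this point lies on the negative side
\end{verbatim}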
\end{enumerate}
\end{document}