-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathHW3-3.py
284 lines (228 loc) · 9.82 KB
/
HW3-3.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
"""
Using Python for Research Homework: Week 3, Case Study 3
In this case study, we will analyze a dataset consisting
of an assortment of wines classified as "high quality" and
"low quality" and will use k-Nearest Neighbors
classification to determine whether or not other
information about the wine helps us correctly guess whether
a new wine will be of high quality.
"""
# DO NOT EDIT
import numpy as np, random, scipy.stats as ss
def majority_vote_fast(votes):
mode, count = ss.mstats.mode(votes)
return mode
def distance(p1, p2):
return np.sqrt(np.sum(np.power(p2 - p1, 2)))
def find_nearest_neighbors(p, points, k=5):
distances = np.zeros(points.shape[0])
for i in range(len(distances)):
distances[i] = distance(p, points[i])
ind = np.argsort(distances)
return ind[:k]
def knn_predict(p, points, outcomes, k=5):
ind = find_nearest_neighbors(p, points, k)
return majority_vote_fast(outcomes[ind])[0]
"""
Exercise 1
Our first step is to import the dataset.
Instructions
Read in the data as a pandas dataframe using pd.read_csv.
The data can be found at
https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@[email protected]
"""
import pandas as pd
data = pd.read_csv('Wines.csv')
data
"""
Unnamed: 0 fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality color high_quality
0 0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5 red 0
1 1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5 red 0
2 2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5 red 0
3 3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6 red 1
4 4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5 red 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6492 4893 6.2 0.21 0.29 1.6 0.039 24.0 92.0 0.99114 3.27 0.50 11.2 6 white 1
6493 4894 6.6 0.32 0.36 8.0 0.047 57.0 168.0 0.99490 3.15 0.46 9.6 5 white 0
6494 4895 6.5 0.24 0.19 1.2 0.041 30.0 111.0 0.99254 2.99 0.46 9.4 6 white 1
6495 4896 5.5 0.29 0.30 1.1 0.022 20.0 110.0 0.98869 3.34 0.38 12.8 7 white 1
6496 4897 6.0 0.21 0.38 0.8 0.020 22.0 98.0 0.98941 3.26 0.32 11.8 6 white 1
6497 rows × 15 columns
Exercise 2
Next, we will inspect the dataset and perform some mild data
cleaning.
Instructions
In order to get all numeric data, we will change the color
column to an is_red column.
If color == 'red', we will encode a 1 for is_red
If color == 'white', we will encode a 0 for is_red
Create this new column, is_red.
Drop the color, quality, and high_quality columns as we will be
predict the quality of wine using numeric data in a later exercise
Store this all numeric data in a pandas dataframe called
numeric_data
"""
data["is_red"] = (data["color"] == "red").astype(int)
numeric_data = data.drop(["color", "high_quality", "quality"], axis=1)
numeric_data.groupby('is_red').count()
"""
Unnamed: 0 fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
is_red
0 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
1 1599 1599 1599 1599 1599 1599 1599 1599 1599 1599 1599 1599
Exercise 3
We want to ensure that each variable contributes equally to the kNN classifier, so we will need to scale the data by subtracting the mean of each variable (column) and dividing each variable (column) by its standard deviation. Then, we will use principal components to take a linear snapshot of the data from several different angles, with each snapshot ordered by how well it aligns with variation in the data. In this exercise, we will scale the numeric data and extract the first two principal components.
Instructions
Scale the data using the sklearn.preprocessing function scale()
on numeric_data.
Convert this to a pandas dataframe, and store as numeric_data.
Include the numeric variable names using the parameter
columns = numeric_data.columns.
Use the sklearn.decomposition module PCA() and store it as pca.
Use the fit_transform() function to extract the first two principal
components from the data, and store them as principal_components.
Note: You may get a DataConversionWarning, but you can safely
ignore it
"""
import sklearn.preprocessing
numeric_data = (numeric_data - np.mean(numeric_data)) / np.std(numeric_data)
scaled_data = sklearn.preprocessing.scale(numeric_data)
numeric_data = pd.DataFrame(scaled_data, columns= numeric_data.columns)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principal_components = pca.fit(numeric_data).transform(numeric_data)
principal_components
"""
array([[ 3.98310384, -0.33215565],
[ 3.9491386 , 0.33690522],
[ 3.92240676, 0.11564852],
...,
[-1.4907159 , -0.82577232],
[-1.67454778, -3.6065637 ],
[-1.88763545, -2.86071275]])
Exercise 4
In this exercise, we will plot the first two principal components
of the covariates in the dataset. The high and low quality wines
will be colored using red and blue, respectively.
Instructions
The first two principal components can be accessed using
principal_components[:,0] and principal_components[:,1]. Store
these as x and y respectively, and make a scatter plot of these
first two principal components.
How well are the two groups of wines separated by the first two
principal components?
"""
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.backends.backend_pdf import PdfPages
observation_colormap = ListedColormap(['red', 'blue'])
x = principal_components[:,0]
y = principal_components[:,1]
plt.plot(principal_components[:,0], principal_components[:,1])
plt.title("Principal Components of Wine")
plt.scatter(x, y, alpha = 0.2,
c = data['high_quality'], cmap = observation_colormap, edgecolors = 'none')
plt.xlim(-8, 8); plt.ylim(-8, 8)
plt.xlabel("Principal Component 1"); plt.ylabel("Principal Component 2")
plt.show()
"""
Exercise 5
In this exercise, we will create a function that calculates
the accuracy between predictions and outcomes.
Instructions
Create a function accuracy(predictions, outcomes) that takes two
lists of the same size as arguments and returns a single number,
which is the percentage of elements that are equal for the two
lists.
Use accuracy to compare the percentage of similar elements in the
x and y numpy arrays defined below.
Print your answer.
"""
import numpy as np
np.random.seed(1) # do not change
x = np.random.randint(0, 2, 1000)
y = np.random.randint(0 ,2, 1000)
def accuracy(predictions, outcomes):
return 100*np.mean(predictions == outcomes)
percentage = accuracy(x, y)
percentage
"""
51.5
Exercise 6
The dataset remains stored as data. Because most wines in the
dataset are classified as low quality, one very simple
classification rule is to predict that all wines are of low
quality. In this exercise, we determine the accuracy of this
simple rule.
Instructions
Use accuracy() to calculate how many wines in the dataset are of
low quality. Do this by using 0 as the first argument, and
data["high_quality"] as the second argument.
Print your result.
"""
print(accuracy(0, data["high_quality"]))
"""
36.69385870401724
Exercise 7
In this exercise, we will use the kNN classifier from
scikit-learn to predict the quality of wines in our dataset.
Instructions
Use knn.predict(numeric_data) to predict which wines are high
and low quality and store the result as library_predictions.
Use accuracy to find the accuracy of your predictions, using
library_predictions as the first argument and
data["high_quality"] as the second argument.
Print your answer. Is this prediction better than the simple
classifier in Exercise 6?
"""
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(numeric_data, data['high_quality'])
library_predictions = knn.predict(numeric_data)
#TODO: Use accuracy to find the accuracy of your predictions, using library_predictions
#as the first argument and data["high_quality"] as the second argument.
predictions_accuracy = accuracy(library_predictions, data["high_quality"])
#TODO: Print your answer. Is this prediction better than the simple classifier
#in Exercise 6?
print (predictions_accuracy) # Yes, this is better!
"""
84.20809604432816
Exercise 8
Unlike the scikit-learn function, our homemade kNN classifier
does not take any shortcuts in calculating which neighbors are
closest to each observation, so it is likely too slow to carry
out on the whole dataset. In this exercise, we will select a
subset of our data to use in our homemade kNN classifier.
Instructions
Fix the random generator using random.seed(123), and select 10
rows from the dataset using random.sample(range(n_rows), 10).
Store this selection as selection.
"""
n_rows = data.shape[0]
# Enter your code here.
random.seed(123)
selection = random.sample(range(n_rows), 10)
print(selection)
"""
[428, 2192, 714, 6299, 3336, 2183, 882, 312, 3105, 4392]
Exercise 9
We are now ready to use our homemade kNN classifier and
compare the accuracy of our results to the baseline.
Instructions
For each predictorp in predictors[selection], use
knn_predict(p, predictors[training_indices,:],
outcomes[training_indices], k=5) to predict the quality of
each wine in the prediction set, and store these predictions
as a np.array called my_predictions. Note that knn_predict is
already defined as in the Case 3 videos.
Using the accuracy function, compare these results to the
selected rows from the high_quality variable in data using
my_predictions as the first argument and
data.high_quality.iloc[selection] as the second argument.
Store these results as percentage.
Print your answer.
"""
my_predictions = np.array([knn_predict(p, predictors[training_indices,:], outcomes[training_indices], 5) for p in predictors[selection]])
percentage = accuracy(my_predictions, data.high_quality.iloc[selection])
print(percentage)
"""70.0"""