# Dimension Reduction
```{r,echo=FALSE}
knitr::opts_chunk$set(comment = NA, fig.path = "images/dr-", prompt = TRUE, collapse = TRUE,
tidy = TRUE)
```
Watch a video of this chapter: [Part 1](https://youtu.be/ts6UQnE6E1U) [Part 2](https://youtu.be/BSfw0rpyC2g) [Part 3](https://youtu.be/drNwEvEx3LY)
## Matrix data
The key aspect of matrix data is that every element of the matrix is the same type and represents the same kind of measurement. This is in contrast to a data frame, where every column of a data frame can potentially be of a different class.
Matrix data have some special statistical methods that can be applied to them. One category of statistical dimension reduction techniques is commonly called *principal components analysis* (PCA) or the *singular value decomposition* (SVD). These techniques generally are applied in situations where the rows of a matrix represent observations of some sort and the columns of the matrix represent features or variables (but this is by no means a requirement).
In an abstract sense, the SVD or PCA can be thought of as a way to approximate a high-dimensional matrix (i.e. one with a large number of columns) with a few low-dimensional matrices. So there's a bit of a data compression angle to it. We'll take a look at what's going on in this chapter.
First, we can simulate some matrix data. Here, we simulate some random Normal data in a matrix that has 40 rows and 10 columns.
```{r randomData,fig.height=5,fig.width=4}
set.seed(12345)
dataMatrix <- matrix(rnorm(400), nrow = 40)
image(1:10,1:40,t(dataMatrix)[,nrow(dataMatrix):1])
```
When confronted with matrix data, a quick and easy way to organize the data a bit is to apply a hierarchical clustering algorithm to it. Such a clustering can be visualized with the `heatmap()` function.
```{r,fig.cap="Heatmap of matrix data"}
heatmap(dataMatrix)
```
Not surprisingly, there aren't really any interesting patterns given that we just simulated random noise. At least it's good to know that the clustering algorithm won't pick up something when there's nothing there!
But now what if there were a pattern in the data? How would we discover it?
Let's first simulate some data that indeed does have a pattern. In the code below, we cycle through all the rows of the matrix and randomly add 3 to the last 5 columns of the matrix.
```{r,tidy=TRUE}
set.seed(678910)
for(i in 1:40){
    coinFlip <- rbinom(1,size=1,prob=0.5)
    ## If coin is heads add a common pattern to that row
    if(coinFlip){
        dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,3),each=5)
    }
}
```
Here's what the new data look like.
```{r,fig.cap="Matrix data with a pattern"}
image(1:10,1:40,t(dataMatrix)[,nrow(dataMatrix):1])
```
You can see that some of the rows on the right side of the matrix have higher values than on the left side.
Now what happens if we cluster the data?
```{r,fig.cap="Clustered data with pattern"}
heatmap(dataMatrix)
```
We can see from the dendrogram on top of the matrix (for the columns) that the columns pretty clearly split into two clusters, which is what we'd expect.
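Just to make that split explicit (a quick supplementary check, not part of the original analysis), we can cluster the columns ourselves and cut the dendrogram into two groups with `cutree()`. Given how the pattern was added, we'd expect columns 6 through 10 to land in one group and columns 1 through 5 in the other.
```{r}
## Cluster the columns and cut the tree into 2 groups
colClusters <- cutree(hclust(dist(t(dataMatrix))), k = 2)
colClusters
```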
## Patterns in rows and columns
In general, with matrix data, there may be patterns that occur across the rows and columns of the matrix. In the example above, we shifted the mean of some of the observations in columns 6 through 10. We can display this a bit more explicitly by looking at the row and column means of the data.
```{r,fig.width=12,fig.cap="Pattern in rows and columns",message=FALSE}
library(dplyr)
hh <- dist(dataMatrix) %>% hclust
dataMatrixOrdered <- dataMatrix[hh$order,]
par(mfrow=c(1,3))
## Complete data
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
## Show the row means
plot(rowMeans(dataMatrixOrdered),40:1,xlab="Row Mean",ylab="Row",pch=19)
## Show the column means
plot(colMeans(dataMatrixOrdered),xlab="Column",ylab="Column Mean",pch=19)
```
However, there may be other patterns beyond a simple mean shift, so more sophisticated methods will be needed. Furthermore, there may be multiple patterns layered on top of each other, so we need a method that can disentangle these patterns.
## Related problem
Here's another way to formulate the problem that matrix data present. Suppose you have multivariate observations
\[
X_1,\ldots,X_n
\]
so that each of the *n* observations has *m* features,
\[
X_1 = (X_{11},\ldots,X_{1m})
\]
Given this setup, the goal is to find a new set of variables/features that are uncorrelated and explain as much variance in the data as possible. Put another way, if you were to put all these multivariate observations together in one matrix, find the *best* matrix created with fewer variables (lower rank) that explains the original data.
The first goal is *statistical* in nature and the second goal is perhaps better characterized as *lossy data compression*.
## SVD and PCA
If *X* is a matrix with each variable in a column and each observation in a row then the SVD is a matrix decomposition that represents *X* as a matrix product of three matrices:
\[
X = UDV^\prime
\]
where the columns of *U* (the left singular vectors) are orthogonal, the columns of *V* (the right singular vectors) are orthogonal, and *D* is a diagonal matrix of singular values.
Principal components analysis (PCA) is simply an application of the SVD. The *principal components* are equal to the right singular vectors if you first scale the data by subtracting the column mean and dividing each column by its standard deviation (which can be done with the `scale()` function).
## Unpacking the SVD: *u* and *v*
The SVD can be computed in R using the `svd()` function. Here, we scale our original matrix data with the pattern in it and apply the svd.
```{r}
svd1 <- svd(scale(dataMatrixOrdered))
```
The `svd()` function returns a list containing three components named `u`, `d`, and `v`. The `u` and `v` components correspond to the matrices of left and right singular vectors, respectively, while the `d` component is a vector of singular values, corresponding to the diagonal of the matrix *D* described above.
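As a quick sanity check (an aside, not in the original text), we can inspect the dimensions of these components and verify that multiplying them back together reproduces the scaled data matrix, up to numerical error.
```{r}
## Inspect the components returned by svd()
str(svd1)
## U D V' should reconstruct the scaled data (the difference is essentially zero)
max(abs(scale(dataMatrixOrdered) - with(svd1, u %*% diag(d) %*% t(v))))
```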
Below we plot the first left and right singular vectors along with the original data.
```{r,fig.width=12,fig.cap="Components of SVD"}
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1], main = "Original Data")
plot(svd1$u[,1],40:1,ylab="Row",xlab="First left singular vector",pch=19)
plot(svd1$v[,1],xlab="Column",ylab="First right singular vector",pch=19)
```
You can see how the first left and right singular vectors pick up the mean shift in both the rows and columns of the matrix.
## SVD for data compression
If we believed that the first left and right singular vectors, call them $u_1$ and $v_1$, captured all of the variation in the data, then we could approximate the original data matrix with
\[
X \approx u_1 v_1^\prime
\]
Thus, we would reduce 400 numbers in the original matrix to 40 + 10 = 50 numbers in the compressed matrix, a nearly 90% reduction in information. Here's what the original data and the approximation would look like.
```{r,fig.width=8,fig.cap="Approximating a matrix"}
## Approximate original data with outer
## product of first singular vectors
approx <- with(svd1, outer(u[, 1], v[, 1]))
## Plot original data and approximated data
par(mfrow = c(1, 2))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1], main = "Original Matrix")
image(t(approx)[,nrow(approx):1], main = "Approximated Matrix")
```
Obviously, the two matrices are not identical, but the approximation seems reasonable in this case. This is not surprising given that there was only one real feature in the original data.
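If we want to put a rough number on "reasonable" (this is a supplementary sketch), we can compute the relative error of the best rank-1 approximation to the scaled data. Note that the best rank-1 approximation also multiplies in the first singular value $d_1$; the `image()` plot above omits it because the color scale is relative anyway.
```{r}
## Best rank-1 approximation includes the first singular value
rank1 <- with(svd1, d[1] * outer(u[, 1], v[, 1]))
## Relative error in the Frobenius norm
sqrt(sum((scale(dataMatrixOrdered) - rank1)^2) / sum(scale(dataMatrixOrdered)^2))
```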
## Components of the SVD - Variance explained
The statistical interpretation of the singular values is in terms of the variance in the data explained by the various components. The singular values produced by `svd()` are ordered from largest to smallest and, when squared, are proportional to the amount of variance explained by a given singular vector.
To show how this works, here's a very simple example. First, we'll simulate a "dataset" that just takes two values, 0 and 1.
```{r}
constantMatrix <- dataMatrixOrdered*0
for(i in 1:dim(dataMatrixOrdered)[1]) {
    constantMatrix[i,] <- rep(c(0,1),each=5)
}
```
Then we can take the SVD of this matrix and show the singular values as well as the proportion of variance explained.
```{r,fig.width=12,fig.cap="Variance explained"}
svd1 <- svd(constantMatrix)
par(mfrow=c(1,3))
image(t(constantMatrix)[,nrow(constantMatrix):1],main="Original Data")
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)
```
As we can see from the right-most plot, 100% of the variation in this "dataset" can be explained by the first singular value. Or, all of the variation in this dataset occurs in a single dimension. This is clear because all of the variation in the data occurs as you go from left to right across the columns. Otherwise, the values of the data are constant.
In the plot below, we plot the singular values (left) and the proportion of variance explained (right) for the slightly more complex dataset that we'd been using previously.
```{r,fig.cap="Variance explained by singular vectors",fig.width=8}
par(mfrow=c(1,2))
svd1 <- svd(scale(dataMatrixOrdered))
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)
```
We can see that the first component explains about 40% of all the variation in the data. In other words, even though there are 10 dimensions in the data, 40% of the variation in the data can be explained by a single dimension. That suggests that the data could be simplified quite a bit, a phenomenon we observed in the last section where it appeared the data could be reasonably approximated by the first left and right singular vectors.
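A common follow-up (not in the original analysis, but a useful habit) is to look at the cumulative proportion of variance explained, which tells us how many components we'd need to keep in order to retain a given fraction of the variation.
```{r}
## Cumulative proportion of variance explained by the singular vectors
cumsum(svd1$d^2) / sum(svd1$d^2)
```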
## Relationship to principal components
As we mentioned above, the SVD has a close connection to principal components analysis (PCA). PCA can be applied to the data by calling the `prcomp()` function in R. Here, we show that the first right singular vector from the SVD is equal to the first principal component vector returned by PCA.
```{r,fig.cap="Singular vectors and principal components"}
svd1 <- svd(scale(dataMatrixOrdered))
pca1 <- prcomp(dataMatrixOrdered,scale=TRUE)
plot(pca1$rotation[,1],svd1$v[,1],pch=19,xlab="Principal Component 1",ylab="Right Singular Vector 1")
abline(c(0,1))
```
Whether you call this procedure SVD or PCA really just depends on who you talk to. Statisticians and people with that kind of background will typically call it PCA while engineers and mathematicians will tend to call it SVD.
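As a purely numerical check (an aside), the two procedures agree beyond the plot above: the variances reported by `prcomp()` equal the squared singular values divided by *n* - 1, where *n* is the number of rows.
```{r}
## Compare PCA variances with scaled, squared singular values
n <- nrow(dataMatrixOrdered)
rbind(pca = pca1$sdev^2, svd = svd1$d^2 / (n - 1))
```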
## What if we add a second pattern?
Tracking a single pattern in a matrix is relatively straightforward, but typically there will be multiple layered patterns in a matrix of data. Here we add two patterns to a simulated dataset. One pattern simply adds a constant to the last 5 columns of data, while the other pattern adds an alternating pattern (every other column).
```{r}
set.seed(678910)
for(i in 1:40){
    coinFlip1 <- rbinom(1,size=1,prob=0.5)
    coinFlip2 <- rbinom(1,size=1,prob=0.5)
    if(coinFlip1) { ## Pattern 1
        dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,5),each=5)
    }
    if(coinFlip2) { ## Pattern 2
        dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,5),5)
    }
}
hh <- hclust(dist(dataMatrix))
dataMatrixOrdered <- dataMatrix[hh$order,]
```
Here is a plot of this new dataset along with the two different patterns.
```{r,fig.cap="Dataset with two patterns",fig.width=12}
svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1], main = "Data")
plot(rep(c(0,1),each=5),pch=19,xlab="Column",ylab="Pattern 1", main = "Block pattern")
plot(rep(c(0,1),5),pch=19,xlab="Column",ylab="Pattern 2", main = "Alternating pattern")
```
Now, of course the plot above shows the truth, which in general we will not know.
We can apply the SVD/PCA to this matrix and see how well the patterns are picked up.
```{r,fig.cap="SVD with two patterns", fig.width=12}
svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd2$v[,1],pch=19,xlab="Column",ylab="First right singular vector")
plot(svd2$v[,2],pch=19,xlab="Column",ylab="Second right singular vector")
```
We can see that the first right singular vector seems to pick up both the alternating pattern as well as the block/step pattern in the data. The second right singular vector seems to pick up a similar pattern.
When we look at the variance explained, we can see that the first singular vector picks up a little more than 50% of the variation in the data.
```{r,fig.cap="Variation explained by singular vectors",fig.width=8}
svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,2))
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Percent of variance explained",pch=19)
```
## Dealing with missing values
Missing values are a problem that plagues any data analysis and the analysis of matrix data is no exception. Most SVD and PCA routines simply cannot be applied if there are missing values in the dataset. In the event of missing data, there are typically a series of questions that should be asked:
* Determine the reason for the missing data; what is the *process* that led to the data being missing?
* Is the proportion of missing values so high as to invalidate any sort of analysis?
* Is there information in the dataset that would allow you to predict/infer the values of the missing data?
In the example below, we take our dataset and randomly insert some missing data.
```{r}
dataMatrix2 <- dataMatrixOrdered
## Randomly insert some missing data
dataMatrix2[sample(1:100,size=40,replace=FALSE)] <- NA
```
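Before doing anything else, it's worth quantifying how much is missing and where (a quick supplementary check).
```{r}
## Overall proportion of missing values
mean(is.na(dataMatrix2))
## Number of missing values in each column
colSums(is.na(dataMatrix2))
```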
If we try to apply the SVD on this matrix it won't work.
```{r,error=TRUE}
svd1 <- svd(scale(dataMatrix2))
```
Since in this case we know that the missing data appeared completely randomly in the data, it would make sense to try to impute the values so that we can run the SVD. Here, we use the `impute` package to do a k-nearest-neighbors imputation of the missing data. The `impute` package is available from the [Bioconductor project](http://bioconductor.org).
```{r}
library(impute)
dataMatrix2 <- impute.knn(dataMatrix2)$data
```
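For comparison, a much cruder baseline is to replace each missing value with its column mean. The sketch below (not the approach used in this chapter, shown only for illustration, with `dataMatrix3` as an arbitrary name) applies that idea to a fresh copy of the data with missing values inserted the same way; k-nearest-neighbors imputation generally does better because it borrows information from similar observations rather than treating each column in isolation.
```{r}
## Illustrative only: column-mean imputation on a fresh copy
dataMatrix3 <- dataMatrixOrdered
dataMatrix3[sample(1:100, size = 40, replace = FALSE)] <- NA
for (j in 1:ncol(dataMatrix3)) {
    miss <- is.na(dataMatrix3[, j])
    dataMatrix3[miss, j] <- mean(dataMatrix3[, j], na.rm = TRUE)
}
```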
Now we can compare how the SVD performs on the original dataset (no missing data) and the imputed dataset. Here, we plot the first right singular vector.
```{r, fig.width=8, fig.cap="SVD on original and imputed data"}
svd1 <- svd(scale(dataMatrixOrdered))
svd2 <- svd(scale(dataMatrix2))
par(mfrow=c(1,2))
plot(svd1$v[,1],pch=19,main="Original dataset")
plot(svd2$v[,1],pch=19,main="Imputed dataset")
```
We can see that the results are not identical but they are pretty close. Obviously, the missing data process was pretty simple in this case and is likely to be more complex in other situations.
## Example: Face data
In this example, we use some data that make up an image of a face and show how the SVD can be used to produce varying approximations to this "dataset". Here is the original data.
```{r,fig.cap="Face data"}
load("data/face.rda")
image(t(faceData)[,nrow(faceData):1])
```
If we take the SVD and plot the squared and normalized singular values, we can see that the data can be explained by just a few singular vectors, maybe 4 or 5.
```{r,fig.cap="Proportion of variance explained"}
svd1 <- svd(scale(faceData))
plot(svd1$d^2/sum(svd1$d^2),pch=19,xlab="Singular vector",ylab="Variance explained")
```
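To make "maybe 4 or 5" a little more concrete (a supplementary check), here is the cumulative proportion of variance explained by the first 10 singular vectors.
```{r}
## Cumulative proportion of variance explained, first 10 components
round(cumsum(svd1$d^2) / sum(svd1$d^2), 3)[1:10]
```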
Now we can start constructing approximations to the data using the left and right singular vectors. Here we create one using just the first left and right singular vectors.
```{r}
## Note that %*% is matrix multiplication
## Here svd1$d[1] is a constant
approx1 <- svd1$u[,1] %*% t(svd1$v[,1]) * svd1$d[1]
```
We can also create approximations using 5 and 10 singular vectors, which presumably would be better approximations.
```{r}
## In these examples we need to make the diagonal matrix out of d
approx5 <- svd1$u[,1:5] %*% diag(svd1$d[1:5]) %*% t(svd1$v[,1:5])
approx10 <- svd1$u[,1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[,1:10])
```
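Since the recipe for a rank-*k* approximation is the same each time, it can be convenient to wrap it in a small helper; `svdApprox()` below is just an illustrative sketch (not a function from any package) that reproduces the approximations we just computed.
```{r}
## Rank-k approximation from an SVD object: U[,1:k] D[1:k] V[,1:k]'
svdApprox <- function(s, k) {
    s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k, drop = FALSE])
}
approx1 <- svdApprox(svd1, 1)
approx5 <- svdApprox(svd1, 5)
approx10 <- svdApprox(svd1, 10)
```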
Now we can plot each one of these approximations along with the original data.
```{r,fig.height=4,fig.width=14,fig.cap="Face data approximations"}
par(mfrow=c(1,4))
image(t(approx1)[,nrow(approx1):1], main = "1 vector")
image(t(approx5)[,nrow(approx5):1], main = "5 vectors")
image(t(approx10)[,nrow(approx10):1], main = "10 vectors")
image(t(faceData)[,nrow(faceData):1], main = "Original data")
```
Here, the approximation using 1 singular vector is pretty poor, but using 5 gets us pretty close to the truth. Using 10 vectors doesn't seem to add much to the features, maybe just a few highlights. So 5 singular vectors is a reasonable approximation in this case.
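If we want to back up that visual impression with a number (again, just a supplementary sketch), we can compute the relative reconstruction error of each approximation; it should drop sharply from 1 vector to 5 vectors and only a little more from 5 to 10.
```{r}
## Relative Frobenius-norm error of each approximation to the scaled face data
relErr <- function(a) sqrt(sum((scale(faceData) - a)^2) / sum(scale(faceData)^2))
c(relErr(approx1), relErr(approx5), relErr(approx10))
```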
## Notes and further resources
* For PCA/SVD, the scale and units of the data matter
* PCs/SVs may mix real patterns, as we saw in the example with two overlaid patterns
* SVD can be computationally intensive for very large matrices
* [Advanced data analysis from an elementary point of view](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf)
* [Elements of statistical learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/)
* Alternatives and variations
* [Factor analysis](http://en.wikipedia.org/wiki/Factor_analysis)
* [Independent components analysis](http://en.wikipedia.org/wiki/Independent_component_analysis)
* [Latent semantic analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)