-
Notifications
You must be signed in to change notification settings - Fork 2
/
TODO
290 lines (225 loc) · 12.3 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
- NaArray objects:
1. Implement subassignment by Mindex. Add unit tests for subassignment.
2. Implement methods rowAnyNAs(), rowMins, rowMaxs(), rowMeans(), rowVars(),
and rowSds(). Add unit tests.
3. Add vignette for NaArray objects.
- Speed up more row summarization methods for SparseArray and NaArray objects
by supporting natively more operations in .Call entry point C_rowStats_SVT.
Note that doing this for rowAnyNAs(), rowMins(), rowMaxs(), rowSums(),
rowMeans(), rowVars(), and rowSds(), led to speedups of 35x, 6x, 6x, 20x,
18x, 5x, and 5x, respectively, on big SVT_SparseArray objects, or more.
It also reduced memory usage significantly.
- Implement which() methods for SparseArray and NaArray objects of type
logical.
- Implement fast nzvals() getter/setter for SVT_SparseMatrix objects and
fast nnavals() getter/setter for NaArray objects. Default methods work
fine but are not as fast as they could be.
- Add unit tests for is_nonzero() and the nz*() functions, and for is_nonna()
and the nna*() functions.
- Implement readMatrixMarket(), similar to Matrix::readMM() but should
return a COO_SparseMatrix object instead of a dgTMatrix object. Note that
the easy/lazy implementation could simply be
as(Matrix::readMM(.), COO_SparseMatrix)
However it wouldn't be really justified to introduce a new function just
for that. So hopefully a native implementation will improve efficiency
enough to be worth it and justify a dedicated function.
See https://math.nist.gov/MatrixMarket/formats.html
- To write a SparseMatrix object to a Matrix Market file, do we need a
dedicated writeMatrixMarket() function or should we just define a
Matrix::writeMM() method for SparseMatrix objects? That method could
simply do
Matrix::writeMM(as(x, "CsparseMatrix"))
- Support names() getter and setter on a 1D SparseArray or NaArray object
as a shortcut for 'dimnames()[[1L]]' and 'dimnames()[[1L]] <- value'
respectively. This is to mimic the R base array API.
- Add nzvals() methods for COO_SparseArray and SVT_SparseArray objects.
Uncomment nzvals() examples in vignette and SparseArray-class.Rd
- Add unit tests for nzwhich() and nzvals() methods for COO_SparseArray
and SVT_SparseArray objects.
- Fix rbind() between an SVT_SparseMatrix and an ordinary vector or matrix:
> rbind(SVT_SparseArray(dim=6:5), logical(5))
Error: C stack usage 7971988 is too close to the limit
> rbind(SVT_SparseArray(dim=6:5), matrix(ncol=5))
Error: C stack usage 7971988 is too close to the limit
- Parallelize more operations (with OpenMP) e.g. rowsum(). Right now only %*%,
crossprod(), tcrossprod(), and the col*() methods (matrixStats operations)
are parallelized.
- Implement coercion from Hits to SVT_SparseMatrix. The returned object
should be an integer SVT_SparseMatrix with only zeros and ones that is
the adjacency matrix of the bipartite graph represented by the Hits object.
It will be fully lacunar.
Note that in the case of a SelfHits object the result will be a square
SVT_SparseMatrix.
Question: should multiple edges between the same two nodes produce values
> 1 in the adjacency matrix? (Google this.) This means that the resulting
SVT_SparseMatrix won't necessarily be fully lacunar.
- Use "sparseness" instead of "sparsity" in the doc when referring to the
quality of being structurally sparse. Use "sparsity" to refer to the
number (>= 0 and <= 1) that measures how sparse an object is.
sparsity = 1 - density
- Subassignments like this need to work:
svt[ , , 1] <- svt[ , , 3, drop=FALSE]
svt[ , , 1] <- svt[ , , 3]
- Implement C_subassign_SVT_with_Rarray() and C_subassign_SVT_with_SVT().
- Speed up row selection: x[row_idx, ]
THIS in particular is VERY slow on a SVT_SparseArray object:
set.seed(123)
svt2 <- 0.5 * poissonSparseMatrix(170000, 5800, density=0.1)
dgcm2 <- as(svt2, "dgCMatrix")
system.time(svt2[-2, ]) # arghhh!!
# user system elapsed
# 4.606 0.180 4.804
system.time(dgcm2[-2, ])
# user system elapsed
# 0.267 0.244 0.513
Good news is that this is very fast:
system.time(svt2[2, ])
# user system elapsed
# 0.001 0.000 0.002
system.time(dgcm2[2, ])
# user system elapsed
# 0.165 0.000 0.165
- Support subsetting by a character matrix in .subset_SVT_by_Mindex().
- Support subsetting by a character vector in .subset_SVT_by_Lindex() in
the 1D case.
- Implement double-bracket subsetting of SVT_SparseArray objects. Both,
the 1D-style form (e.g. 'svt[[8]]') and the N-dimensional form (e.g.
'svt[[2,5,1]]' for a 3D object).
- Implement table() method for SVT_SparseArray objects of type logical,
integer, or raw (should it go in R/SparseArray-summarization.R?)
- Improve readSparseCSV() functionality by adding a few read.table-like
args to it. See https://github.com/Bioconductor/SparseArray/issues/5
for the details.
- We can't use base::apply() or base::asplit() on a SparseArray derivative
because they turn it into an ordinary (i.e. dense) array.
So turn apply() and asplit() into generic functions and implement methods
for SparseArray objects. These methods could either:
1. Mimic the behavior of base::apply() and base::asplit() but preverve
sparseness. However, it should be clarified in the man page for the
matrixStats methods that using apply() to compute stats on an object
with > 2 dimensions is not going to be as efficient as using
the 'dims' argument (almost all matrixStats methods support it). The
latter should be 10x or 100x faster when used on big objects.
2. Just fail with a friendly error message.
- Maybe implement svd() for SVT_SparseMatrix objects?
- Maybe implement chol() for symmetric positive-definite square
SVT_SparseMatrix objects?
- More SBT ("Sparse Buffer Tree") use cases:
1. Implement C helper _push_vals_to_SBT_by_Mindex(), and modify coercion
from COO_SparseArray to SVT_SparseArray to use that instead of
C_subassign_SVT_by_Mindex(). This will probably lead to a cleaner/simpler
implementation. But is it faster too?
2. Revisit implementation of C_subassign_SVT_by_Mindex() and
C_subassign_SVT_by_Lindex(): Can they use an SBT instead of the "extended
leaves" approach? E.g. they would use _push_vals_to_SBT_by_Mindex()
and _push_vals_to_SBT_by_Lindex(), respectively, then "merge" the SBT
with the original SVT. This will probably lead to a cleaner/simpler
implementation. But is it faster too?
3. Revisit implementation of C_readSparseCSV_as_SVT_SparseMatrix(): Try to
use an SBT instead of an ExtendableJaggedArray. Performance should not
be impacted. Then we can get rid of the ExtendableJaggedArray thing.
- Implement `|`, `&`, and `!` for "raw" SVT_SparseArray objects. Important:
they must perform bitwise operations like with "raw" vectors (see ?raw).
- To help implement the Kronecker product (see below): Introduce
arep_times(x, times) and arep_each(x, each) generics. The 'times'
and 'each' args must be integer vectors with the same length as 'dim(x)'.
They are multidimensional versions of 'rep(, times=t)' and 'rep(, each=e)'
that perform the replications along each dimension of the array.
For example, with x = matrix(letters[1:6], ncol=2):
> x
[,1] [,2]
[1,] "a" "d"
[2,] "b" "e"
[3,] "c" "f"
1. arep_times(x, times=c(2, 4)) returns:
> do.call(cbind, rep(list(do.call(rbind, rep(list(x), 2))), 4))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "a" "d" "a" "d" "a" "d" "a" "d"
[2,] "b" "e" "b" "e" "b" "e" "b" "e"
[3,] "c" "f" "c" "f" "c" "f" "c" "f"
[4,] "a" "d" "a" "d" "a" "d" "a" "d"
[5,] "b" "e" "b" "e" "b" "e" "b" "e"
[6,] "c" "f" "c" "f" "c" "f" "c" "f"
2. arep_each(x, each=c(2, 4)) returns:
> matrix(rep(t(matrix(rep(x, each=2), ncol=ncol(x))), each=4),
byrow=TRUE, ncol=ncol(x)*4)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "a" "a" "a" "a" "d" "d" "d" "d"
[2,] "a" "a" "a" "a" "d" "d" "d" "d"
[3,] "b" "b" "b" "b" "e" "e" "e" "e"
[4,] "b" "b" "b" "b" "e" "e" "e" "e"
[5,] "c" "c" "c" "c" "f" "f" "f" "f"
[6,] "c" "c" "c" "c" "f" "f" "f" "f"
Note that the arep_times() and arep_each() generics really belong to
the S4Arrays package (with the methods for ordinary arrays defined there
too).
Then kronecker(X, Y) (see below) can simply be obtained with:
arep_each(X, each=dim(Y)) * arep_times(Y, times=dim(X))
And this should be made the kronecker() method for Array objects.
- Implement kronecker() method (Kronecker product) for SVT_SparseArray objects.
A quick scan of BioC 3.18 software packages reveals that 25+ packages call
the kronecker() function. Would be interesting to know how many of them need
to do this on sparse objects?
Note that one way to go is to simply implement arep_times() and arep_each()
methods for SVT_SparseArray objects. This will give us kronecker() for
free (via the kronecker() method for Array objects defined in S4Arrays,
see above).
Good things to test (on SVT_SparseArray objects) once this is implemented:
1. First mixed-product property (mixing with element-wise array
multiplication a.k.a. "Hadamard product"):
With 4 arrays A, B, C, D, all with the same number of dimensions,
with A and C conformable, with B and D conformable, then:
kronecker(A, B) * kronecker(C, D)
must be the same as:
kronecker(A * C, B * D)
For example:
A <- array(1:60, dim=5:3)
C <- array(runif(60), dim=5:3)
B <- array(101:180, dim=c(2,10,4))
D <- array(runif(80), dim=c(2,10,4))
stopifnot(all.equal(kronecker(A, B) * kronecker(C, D),
kronecker(A * C, B * D)))
stopifnot(all.equal(kronecker(B, A) * kronecker(D, C),
kronecker(B * D, A * C)))
2. Second mixed-product property (mixing with matrix multiplication):
With 4 matrices A, B, C, D, with dimensions that allow one to do
A %*% C and B %*% D, then:
kronecker(A, B) %*% kronecker(C, D)
must be the same as:
kronecker(A %*% C, B %*% D)
For example:
A <- matrix(1:12, ncol=3)
C <- matrix(runif(18), nrow=3)
B <- matrix(101:120, ncol=5)
D <- matrix(runif(10), nrow=5)
stopifnot(all.equal(kronecker(A, B) %*% kronecker(C, D),
kronecker(A %*% C, B %*% D)))
stopifnot(all.equal(kronecker(B, A) %*% kronecker(D, C),
kronecker(B %*% D, A %*% C)))
See https://en.wikipedia.org/wiki/Kronecker_product for other properties
to test.
- Try to speed up SVT_SparseArray transposition by implementing a one-pass
approach that uses ExtendableJaggedArray intermediate buffers (offss, valss).
See src/readSparseCSV.c where this approach is already used.
Note that this will require that ExtendableJaggedArray structs are able
to support other types of columns (only support int at the moment).
- Support 'match(svt, table)' where 'svt' is an SVT_SparseArray object
and 'table' an atomic vector. This will give us 'svt %in% table' for free.
- Implement more matrixStats methods for SVT_SparseMatrix objects. Those
that are still missing and are actually used in Bioconductor are:
rowMeans2, rowSums2, rowRanks, rowQuantiles, rowMads, rowIQRs, rowAlls,
rowCumsums, rowWeightedMeans, rowAnyNAs) + corresponding col* methods.
- Implement more summarization methods for SVT_SparseArray objects.
See R/SparseArray-summarization.R
- Add unit tests for the SVT_SparseArray stuff.
TAKE A LOOK AT THE FOLLOWING POTENTIAL USE CASES:
- Hi-C sequencing data. See Sunduz Keles work (BioC2024) and Jacques Serizay
packages HiCExperiment and HiContacts.
- Support RcppML::nmf
- Support some of the SparseM (CRAN) operations.
- Go after dgCMatrix objects in ExperimentHub (query(eh, "dgCMatrix")),
convert them to SVT_SparseMatrix objects and try to do the things that
are usually done on them.
- Convert 8322787x1098 dgTMatrix (ExperimentHub resource EH5453) to
SVT_SparseMatrix and try to do the things that the curatedMetagenomicData
folks usually do on it.