-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WISH: rowTabulates/colTabulates for floating-point values #274
Comments
A benchmark: library(bench)
data <- matrix(runif(1000), nrow=20)
ties_orig <- function(x) {
r <- rowRanks(x, ties.method = "average", useNames=FALSE)
c(table(r,row(r)))
}
ties_new <- function(x) {
r <- rowRanks(x, ties.method = "average", useNames=FALSE)
r <- array(as.integer(as.factor(r)), dim=dim(r))
c(t(rowTabulates(r)))
}
b <- bench::mark(ties_orig(data), ties_new(data), memory=FALSE)
print(b)
plot(b) This gives:
|
@MLopez-Ibanez here is related issue from few years ago: #152 The speed for finding duplicates can be further improved if we use It should be faster compared with the examples you provided, and substantially so. However, if you manage to come up with a faster version - I would be interested, as that would improve the speed of some functions in matrixTests. |
I'm gonna start by focusing on this feature request; I assume you mean support for Because, I don't think the general case to support any types of floating-point values makes sense. This is because all real values are basically unique - the chance for getting two real values that are exactly identical should be very low. This is why we disallow > X <- matrix(sample(1:4, size=12, replace=TRUE), nrow=3)
> X
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 2 1 2 4
[3,] 1 1 3 1
> rowTabulates(X)
1 2 3 4
[1,] 4 0 0 0
[2,] 1 2 0 1
[3,] 3 0 1 0
> rowTabulates(as.double(X))
Error in rowTabulates(as.double(X)) :
Argument 'x' is not integer, logical, or raw: numeric That said, it sounds like your argument is that it makes sense to support ranks when ranks might take on "1/2" values, where the latter might appear from > R <- rowRanks(X, ties.method = "average")
> R
[,1] [,2] [,3] [,4]
[1,] 2.5 2.5 2.5 2.5
[2,] 2.5 1.0 2.5 4.0
[3,] 2.0 2.0 4.0 2.0
> rowTabulates(R) which effectively are unique values, i.e. there are no > rowTabulates(R)
Error in rowTabulates(R) :
Argument 'x' is not integer, logical, or raw: matrix A somewhat convoluted workaround would be: > R2 <- 2*R
> storage.mode(R2) <- "integer";
> T <- rowTabulates(R2)
> T
2 4 5 8
[1,] 0 0 4 0
[2,] 1 0 2 1
[3,] 0 3 0 1
> colnames(T) <- as.integer(colnames(T))/2
> T
1 2 2.5 4
[1,] 0 0 4 0
[2,] 1 0 2 1
[3,] 0 3 0 1 So, if it is just "ranks" you're after, this feature request becomes a question of how to support this special use case, while still protecting against misuse (the general real-value case). |
Another "numeric" use case, that could be allowed, is when the user asks to tabulate based on a fixed set of known rowTabulates(X, values = c(1.0, 2.0, 2.71, 3.14, 4.0)) The "ranks" case would then be a special case of this, where |
I've implemented support for tabulation of real-valued input, as I outlined above. Pleae, let me know ASAP if this is what you're after, because I'm very close to submit the next release to CRAN as soon as possible; they re-open for submission on 2025-01-08 (the CRAN Team asked for an update for other reasons). |
@HenrikBengtsson I am not the requester of this feature, but maybe it's a suitable place to ask about performance. I find that for ranks library(matrixStats)
rowRankTies <- function(r) {
res <- matrix(0, nrow = nrow(r), ncol = max(r))
colnames(res) <- 1:ncol(res)
for(i in seq_len(nrow(r))) {
res[i,] <- tabulate(r[i,], ncol(res))
}
res[,as.character(sort(unique(as.vector(r))))]
}
X <- matrix(rnorm(100*100000), ncol=100)
X[14,14] <- X[14,13]
R <- rowRanks(X, ties.method = "max")
system.time(r1 <- rowTabulates(R))
user system elapsed
1.356 0.011 1.367
system.time(r2 <- rowRankTies(R))
user system elapsed
0.244 0.018 0.262
all.equal(r1,r2)
[1] TRUE This is the main reasons why I used that custom function for matrixTests, instead of Does this seem right, or maybe I am missing something? |
@karoliskoncevicius , thanks for this. Would you mind moving the performance discussions (e.g. copy-paste your comments above) over to #137? That way we can keep this issue focused on the functionality. |
Sure. And sorry - was not aware there is a separate issue for this. |
Thanks for implementing this. It may be better to move this feature to a branch for now so you can get the CRAN fixes out of the way. It would take me some time to test it. (I'm surprised that the tests don't actually compare the result with |
Okay, I moved it to the feature/tabulate-numerics branch, and it's no longer part of the 'develop' branch (=next release). I also agree it's better to punt on this; although I'm pretty certain that these rank values can only take integers and 1/2 values, it might be that there are other ranking algorithms that produce other non-integer values (e.g. weighted ranks). So, it needs more thinking. PS. I see that CRAN re-opened for submission already today, so I'll go ahead a submit what's in the 'develop' branch. |
I think a better solution to the problem would be to have a function which calculates the row or columns ranks and tabulates them directly without temporary allocating the matrix of ranks. However, I suspect such a feature is out of scope for this package. |
Are you suggesting a function that calculates the ranks AND the ties? R is not very convenient for returning multiple values... |
It would be useful if
rowTabulates/colTabulates
could handlenumeric
values liketable
does.This has many uses. For example,
getAnywhere("friedman.test.default")
contains:(BTW, I wonder how much faster many operations in
base
,stats
andutils
would be if they usedmatrixStats
functions).The text was updated successfully, but these errors were encountered: