Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WISH: rowTabulates/colTabulates for floating-point values #274

Open
MLopez-Ibanez opened this issue Jan 5, 2025 · 12 comments
Open

WISH: rowTabulates/colTabulates for floating-point values #274

MLopez-Ibanez opened this issue Jan 5, 2025 · 12 comments

Comments

@MLopez-Ibanez
Copy link

It would be useful if rowTabulates/colTabulates could handle numeric values like table does.
This has many uses. For example, getAnywhere("friedman.test.default") contains:

r <- t(apply(y, 1L, rank))
TIES <- tapply(c(r), row(r), table)

(BTW, I wonder how much faster many operations in base, stats and utils would be if they used matrixStats functions).

@MLopez-Ibanez
Copy link
Author

A benchmark:

library(bench)
data <- matrix(runif(1000), nrow=20)
ties_orig <- function(x) {
  r <- rowRanks(x, ties.method = "average", useNames=FALSE)
  c(table(r,row(r)))
}

ties_new <- function(x) {
  r <- rowRanks(x, ties.method = "average", useNames=FALSE)
  r <- array(as.integer(as.factor(r)), dim=dim(r))
  c(t(rowTabulates(r)))
}

b <- bench::mark(ties_orig(data), ties_new(data), memory=FALSE)
print(b)
plot(b)

This gives:

# A tibble: 2 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>   <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 ties_orig(d… 2.32ms 2.65ms      341.        NA     0      171     0      502ms
2 ties_new(da… 2.64ms 3.02ms      308.        NA     2.03   152     1      493ms

@karoliskoncevicius
Copy link

karoliskoncevicius commented Jan 5, 2025

@MLopez-Ibanez here is related issue from few years ago: #152

The speed for finding duplicates can be further improved if we use tabulate() instead of table(). Here is an example from one of my packages: https://github.com/karoliskoncevicius/matrixTests/blob/master/R/utils.general.R#L12

It should be faster compared with the examples you provided, and substantially so. However, if you manage to come up with a faster version - I would be interested, as that would improve the speed of some functions in matrixTests.

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Jan 5, 2025

It would be useful if rowTabulates/colTabulates could handle numeric values like table does.

I'm gonna start by focusing on this feature request; I assume you mean support for numeric as in floating-point values. But, it sounds like you're only interested in the special case of "ranks".

Because, I don't think the general case to support any types of floating-point values makes sense. This is because all real values are basically unique - the chance for getting two real values that are exactly identical should be very low. This is why we disallow numeric values, because we consider any attempt trying to tabulate real values is a mistake or a thinko. Our design is to prevent such mistakes from propagating silently in analytical pipelines, and possibly never be noticed and potentially result in invalid scientific results. Better to stop them as soon as possible, which is why we get:

> X <- matrix(sample(1:4, size=12, replace=TRUE), nrow=3)
> X
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    1    2    4
[3,]    1    1    3    1

> rowTabulates(X)
     1 2 3 4
[1,] 4 0 0 0
[2,] 1 2 0 1
[3,] 3 0 1 0

> rowTabulates(as.double(X))
Error in rowTabulates(as.double(X)) : 
  Argument 'x' is not integer, logical, or raw: numeric

That said, it sounds like your argument is that it makes sense to support ranks when ranks might take on "1/2" values, where the latter might appear from ties.method = "average". To specific, something like:

> R <- rowRanks(X, ties.method = "average")
> R
     [,1] [,2] [,3] [,4]
[1,]  2.5  2.5  2.5  2.5
[2,]  2.5  1.0  2.5  4.0
[3,]  2.0  2.0  4.0  2.0
> rowTabulates(R)

which effectively are unique values, i.e. there are no $\epsilon &gt; 0$ differences to worry about. But, with the current built-in protection, we get:

> rowTabulates(R)
Error in rowTabulates(R) : 
  Argument 'x' is not integer, logical, or raw: matrix

A somewhat convoluted workaround would be:

> R2 <- 2*R
> storage.mode(R2) <- "integer";
> T <- rowTabulates(R2)
> T
     2 4 5 8
[1,] 0 0 4 0
[2,] 1 0 2 1
[3,] 0 3 0 1

> colnames(T) <- as.integer(colnames(T))/2
> T
     1 2 2.5 4
[1,] 0 0   4 0
[2,] 1 0   2 1
[3,] 0 3   0 1

So, if it is just "ranks" you're after, this feature request becomes a question of how to support this special use case, while still protecting against misuse (the general real-value case).

@HenrikBengtsson
Copy link
Owner

Another "numeric" use case, that could be allowed, is when the user asks to tabulate based on a fixed set of known values, e.g.

rowTabulates(X, values = c(1.0, 2.0, 2.71, 3.14, 4.0))

The "ranks" case would then be a special case of this, where values only take on perfect integers or 1/2 values. When not specifying values, it basically defaults to values = unique(X). For the latter case, which should only make sense for ranks, we could check for non-integer or non-1/2 values and throw an error. Such an approach would be feasible to implement, and I think it can be done without introducing a huge penalty. It would be specific to numeric values, so it would not affect the performance on the other, currently supported data types.

@HenrikBengtsson
Copy link
Owner

I've implemented support for tabulation of real-valued input, as I outlined above. Pleae, let me know ASAP if this is what you're after, because I'm very close to submit the next release to CRAN as soon as possible; they re-open for submission on 2025-01-08 (the CRAN Team asked for an update for other reasons).

@karoliskoncevicius
Copy link

@HenrikBengtsson I am not the requester of this feature, but maybe it's a suitable place to ask about performance.

I find that for ranks rowTabulates() is a lot slower compared to tabulate() in a loop. Here is a demo:

library(matrixStats)

rowRankTies <- function(r) {
  res <- matrix(0, nrow = nrow(r), ncol = max(r))
  colnames(res) <- 1:ncol(res)
  for(i in seq_len(nrow(r))) {
    res[i,] <- tabulate(r[i,], ncol(res))
  }
  res[,as.character(sort(unique(as.vector(r))))]
}

X <- matrix(rnorm(100*100000), ncol=100)
X[14,14] <- X[14,13]

R <- rowRanks(X, ties.method = "max")

system.time(r1 <- rowTabulates(R))
   user  system elapsed 
  1.356   0.011   1.367 

system.time(r2 <- rowRankTies(R))
   user  system elapsed 
  0.244   0.018   0.262 

all.equal(r1,r2)
[1] TRUE

This is the main reasons why I used that custom function for matrixTests, instead of rowTabulates().

Does this seem right, or maybe I am missing something?

@HenrikBengtsson
Copy link
Owner

@karoliskoncevicius , thanks for this. Would you mind moving the performance discussions (e.g. copy-paste your comments above) over to #137? That way we can keep this issue focused on the functionality.

@karoliskoncevicius
Copy link

Sure. And sorry - was not aware there is a separate issue for this.

@MLopez-Ibanez
Copy link
Author

I've implemented support for tabulation of real-valued input, as I outlined above. Pleae, let me know ASAP if this is what you're after, because I'm very close to submit the next release to CRAN as soon as possible; they re-open for submission on 2025-01-08 (the CRAN Team asked for an update for other reasons).

Thanks for implementing this. It may be better to move this feature to a branch for now so you can get the CRAN fixes out of the way. It would take me some time to test it.

(I'm surprised that the tests don't actually compare the result with c(table(r,row(r))) to make sure it is equal).

HenrikBengtsson added a commit that referenced this issue Jan 7, 2025
@HenrikBengtsson
Copy link
Owner

It may be better to move this feature to a branch for now so you can get the CRAN fixes out of the way. It would take me some time to test it.

Okay, I moved it to the feature/tabulate-numerics branch, and it's no longer part of the 'develop' branch (=next release).

I also agree it's better to punt on this; although I'm pretty certain that these rank values can only take integers and 1/2 values, it might be that there are other ranking algorithms that produce other non-integer values (e.g. weighted ranks). So, it needs more thinking.

PS. I see that CRAN re-opened for submission already today, so I'll go ahead a submit what's in the 'develop' branch.

@HenrikBengtsson HenrikBengtsson modified the milestones: 1.5.0, Next release Jan 7, 2025
@yaccos
Copy link
Contributor

yaccos commented Jan 8, 2025

I think a better solution to the problem would be to have a function which calculates the row or columns ranks and tabulates them directly without temporary allocating the matrix of ranks. However, I suspect such a feature is out of scope for this package.

@MLopez-Ibanez
Copy link
Author

I think a better solution to the problem would be to have a function which calculates the row or columns ranks and tabulates them directly without temporary allocating the matrix of ranks. However, I suspect such a feature is out of scope for this package.

Are you suggesting a function that calculates the ranks AND the ties? R is not very convenient for returning multiple values...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants