WISH: rowTabulates/colTabulates for floating-point values #274

MLopez-Ibanez · 2025-01-05T16:45:12Z

It would be useful if rowTabulates/colTabulates could handle numeric values like table does.
This has many uses. For example, getAnywhere("friedman.test.default") contains:

r <- t(apply(y, 1L, rank))
TIES <- tapply(c(r), row(r), table)

(BTW, I wonder how much faster many operations in base, stats and utils would be if they used matrixStats functions).

The text was updated successfully, but these errors were encountered:

MLopez-Ibanez · 2025-01-05T17:08:22Z

A benchmark:

library(bench)
data <- matrix(runif(1000), nrow=20)
ties_orig <- function(x) {
  r <- rowRanks(x, ties.method = "average", useNames=FALSE)
  c(table(r,row(r)))
}

ties_new <- function(x) {
  r <- rowRanks(x, ties.method = "average", useNames=FALSE)
  r <- array(as.integer(as.factor(r)), dim=dim(r))
  c(t(rowTabulates(r)))
}

b <- bench::mark(ties_orig(data), ties_new(data), memory=FALSE)
print(b)
plot(b)

This gives:

# A tibble: 2 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>   <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 ties_orig(d… 2.32ms 2.65ms      341.        NA     0      171     0      502ms
2 ties_new(da… 2.64ms 3.02ms      308.        NA     2.03   152     1      493ms

karoliskoncevicius · 2025-01-05T17:37:02Z

@MLopez-Ibanez here is related issue from few years ago: #152

The speed for finding duplicates can be further improved if we use tabulate() instead of table(). Here is an example from one of my packages: https://github.com/karoliskoncevicius/matrixTests/blob/master/R/utils.general.R#L12

It should be faster compared with the examples you provided, and substantially so. However, if you manage to come up with a faster version - I would be interested, as that would improve the speed of some functions in matrixTests.

HenrikBengtsson · 2025-01-05T23:47:10Z

It would be useful if rowTabulates/colTabulates could handle numeric values like table does.

I'm gonna start by focusing on this feature request; I assume you mean support for numeric as in floating-point values. But, it sounds like you're only interested in the special case of "ranks".

Because, I don't think the general case to support any types of floating-point values makes sense. This is because all real values are basically unique - the chance for getting two real values that are exactly identical should be very low. This is why we disallow numeric values, because we consider any attempt trying to tabulate real values is a mistake or a thinko. Our design is to prevent such mistakes from propagating silently in analytical pipelines, and possibly never be noticed and potentially result in invalid scientific results. Better to stop them as soon as possible, which is why we get:

> X <- matrix(sample(1:4, size=12, replace=TRUE), nrow=3)
> X
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    1    2    4
[3,]    1    1    3    1

> rowTabulates(X)
     1 2 3 4
[1,] 4 0 0 0
[2,] 1 2 0 1
[3,] 3 0 1 0

> rowTabulates(as.double(X))
Error in rowTabulates(as.double(X)) : 
  Argument 'x' is not integer, logical, or raw: numeric

That said, it sounds like your argument is that it makes sense to support ranks when ranks might take on "1/2" values, where the latter might appear from ties.method = "average". To specific, something like:

> R <- rowRanks(X, ties.method = "average")
> R
     [,1] [,2] [,3] [,4]
[1,]  2.5  2.5  2.5  2.5
[2,]  2.5  1.0  2.5  4.0
[3,]  2.0  2.0  4.0  2.0
> rowTabulates(R)

which effectively are unique values, i.e. there are no $\epsilon > 0$ differences to worry about. But, with the current built-in protection, we get:

> rowTabulates(R)
Error in rowTabulates(R) : 
  Argument 'x' is not integer, logical, or raw: matrix

A somewhat convoluted workaround would be:

> R2 <- 2*R
> storage.mode(R2) <- "integer";
> T <- rowTabulates(R2)
> T
     2 4 5 8
[1,] 0 0 4 0
[2,] 1 0 2 1
[3,] 0 3 0 1

> colnames(T) <- as.integer(colnames(T))/2
> T
     1 2 2.5 4
[1,] 0 0   4 0
[2,] 1 0   2 1
[3,] 0 3   0 1

So, if it is just "ranks" you're after, this feature request becomes a question of how to support this special use case, while still protecting against misuse (the general real-value case).

HenrikBengtsson · 2025-01-06T01:04:43Z

Another "numeric" use case, that could be allowed, is when the user asks to tabulate based on a fixed set of known values, e.g.

rowTabulates(X, values = c(1.0, 2.0, 2.71, 3.14, 4.0))

The "ranks" case would then be a special case of this, where values only take on perfect integers or 1/2 values. When not specifying values, it basically defaults to values = unique(X). For the latter case, which should only make sense for ranks, we could check for non-integer or non-1/2 values and throw an error. Such an approach would be feasible to implement, and I think it can be done without introducing a huge penalty. It would be specific to numeric values, so it would not affect the performance on the other, currently supported data types.

…les [#274]

HenrikBengtsson · 2025-01-06T17:56:48Z

I've implemented support for tabulation of real-valued input, as I outlined above. Pleae, let me know ASAP if this is what you're after, because I'm very close to submit the next release to CRAN as soon as possible; they re-open for submission on 2025-01-08 (the CRAN Team asked for an update for other reasons).

karoliskoncevicius · 2025-01-06T19:44:36Z

@HenrikBengtsson I am not the requester of this feature, but maybe it's a suitable place to ask about performance.

I find that for ranks rowTabulates() is a lot slower compared to tabulate() in a loop. Here is a demo:

library(matrixStats)

rowRankTies <- function(r) {
  res <- matrix(0, nrow = nrow(r), ncol = max(r))
  colnames(res) <- 1:ncol(res)
  for(i in seq_len(nrow(r))) {
    res[i,] <- tabulate(r[i,], ncol(res))
  }
  res[,as.character(sort(unique(as.vector(r))))]
}

X <- matrix(rnorm(100*100000), ncol=100)
X[14,14] <- X[14,13]

R <- rowRanks(X, ties.method = "max")

system.time(r1 <- rowTabulates(R))
   user  system elapsed 
  1.356   0.011   1.367 

system.time(r2 <- rowRankTies(R))
   user  system elapsed 
  0.244   0.018   0.262 

all.equal(r1,r2)
[1] TRUE

This is the main reasons why I used that custom function for matrixTests, instead of rowTabulates().

Does this seem right, or maybe I am missing something?

HenrikBengtsson · 2025-01-06T20:18:45Z

@karoliskoncevicius , thanks for this. Would you mind moving the performance discussions (e.g. copy-paste your comments above) over to #137? That way we can keep this issue focused on the functionality.

karoliskoncevicius · 2025-01-06T20:20:06Z

Sure. And sorry - was not aware there is a separate issue for this.

MLopez-Ibanez · 2025-01-07T10:28:10Z

I've implemented support for tabulation of real-valued input, as I outlined above. Pleae, let me know ASAP if this is what you're after, because I'm very close to submit the next release to CRAN as soon as possible; they re-open for submission on 2025-01-08 (the CRAN Team asked for an update for other reasons).

Thanks for implementing this. It may be better to move this feature to a branch for now so you can get the CRAN fixes out of the way. It would take me some time to test it.

(I'm surprised that the tests don't actually compare the result with c(table(r,row(r))) to make sure it is equal).

…ith doubles [#274]" This reverts commit 17a5ed3.

HenrikBengtsson · 2025-01-07T17:10:35Z

It may be better to move this feature to a branch for now so you can get the CRAN fixes out of the way. It would take me some time to test it.

Okay, I moved it to the feature/tabulate-numerics branch, and it's no longer part of the 'develop' branch (=next release).

I also agree it's better to punt on this; although I'm pretty certain that these rank values can only take integers and 1/2 values, it might be that there are other ranking algorithms that produce other non-integer values (e.g. weighted ranks). So, it needs more thinking.

PS. I see that CRAN re-opened for submission already today, so I'll go ahead a submit what's in the 'develop' branch.

yaccos · 2025-01-08T10:01:01Z

I think a better solution to the problem would be to have a function which calculates the row or columns ranks and tabulates them directly without temporary allocating the matrix of ranks. However, I suspect such a feature is out of scope for this package.

MLopez-Ibanez · 2025-01-08T18:00:04Z

I think a better solution to the problem would be to have a function which calculates the row or columns ranks and tabulates them directly without temporary allocating the matrix of ranks. However, I suspect such a feature is out of scope for this package.

Are you suggesting a function that calculates the ranks AND the ties? R is not very convenient for returning multiple values...

HenrikBengtsson added the feature request label Jan 6, 2025

HenrikBengtsson added a commit that referenced this issue Jan 6, 2025

Add support for using colTabulates() and rowTabulates() with doub…

17a5ed3

…les [#274]

HenrikBengtsson added this to the Next release milestone Jan 6, 2025

HenrikBengtsson added a commit that referenced this issue Jan 7, 2025

Revert "Add support for using colTabulates() and rowTabulates() w…

e6d29cd

…ith doubles [#274]" This reverts commit 17a5ed3.

HenrikBengtsson modified the milestones: 1.5.0, Next release Jan 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WISH: rowTabulates/colTabulates for floating-point values #274

WISH: rowTabulates/colTabulates for floating-point values #274

MLopez-Ibanez commented Jan 5, 2025

MLopez-Ibanez commented Jan 5, 2025

karoliskoncevicius commented Jan 5, 2025 •

edited

Loading

HenrikBengtsson commented Jan 5, 2025 •

edited

Loading

HenrikBengtsson commented Jan 6, 2025

HenrikBengtsson commented Jan 6, 2025

karoliskoncevicius commented Jan 6, 2025

HenrikBengtsson commented Jan 6, 2025

karoliskoncevicius commented Jan 6, 2025

MLopez-Ibanez commented Jan 7, 2025

HenrikBengtsson commented Jan 7, 2025

yaccos commented Jan 8, 2025

MLopez-Ibanez commented Jan 8, 2025

WISH: rowTabulates/colTabulates for floating-point values #274

WISH: rowTabulates/colTabulates for floating-point values #274

Comments

MLopez-Ibanez commented Jan 5, 2025

MLopez-Ibanez commented Jan 5, 2025

karoliskoncevicius commented Jan 5, 2025 • edited Loading

HenrikBengtsson commented Jan 5, 2025 • edited Loading

HenrikBengtsson commented Jan 6, 2025

HenrikBengtsson commented Jan 6, 2025

karoliskoncevicius commented Jan 6, 2025

HenrikBengtsson commented Jan 6, 2025

karoliskoncevicius commented Jan 6, 2025

MLopez-Ibanez commented Jan 7, 2025

HenrikBengtsson commented Jan 7, 2025

yaccos commented Jan 8, 2025

MLopez-Ibanez commented Jan 8, 2025

karoliskoncevicius commented Jan 5, 2025 •

edited

Loading

HenrikBengtsson commented Jan 5, 2025 •

edited

Loading