performance of pmin/pmax using .EACHI vs making an intermediate large allocation without .EACHI #4598

myoung3 opened this issue Jul 12, 2020 · 9 comments
Labels: documentation, non-equi joins


myoung3 commented Jul 12, 2020

I'm in the process of developing a function that takes values measured over intervals and averages those values onto new, non-aligned intervals, and I've noticed that pmin and pmax are causing performance issues in some settings. The title isn't exactly accurate, because I can reproduce the pmin/pmax performance issue even without using .EACHI, but I'll present this in the context of the interval-averaging function I'm writing to demonstrate why it matters.

First I'll create some data. x contains values measured over intervals (i.e. "value" is a time-integrated average and start/end denote the closed interval over which the average was taken); y contains new intervals over which I'd like to calculate averages from the data in x:


uid <- 1L:1000L
x <- CJ(id=uid,start=seq(1L,2000L,by=7L))
x[, end:=start+6L]
x[, value:=rnorm(.N)]

##specifically choose the averaging intervals in y to be smaller in length than the observation intervals in x
#this will result in a large intermediate allocation below in the non-equi join unless .EACHI is used.
y <- CJ(id=uid, start=seq(1L,1991L,by=3L))
y[,end:=start+9L]

Just to demonstrate what I'm trying to accomplish, here is the naive averaging approach, which starts by expanding the interval dataset into a time series. Obviously this is sub-optimal since it creates an unnecessarily large intermediate table.

##approach 0
x_expanded <- x[,list(id=id, time=start:end,value=value),by=1:nrow(x)]
x_expanded[, time2:=time]
setkey(x_expanded, id,time,time2)
z_0 <- foverlaps(y,x_expanded) #~6 million rows
out0 <- z_0[,list(avg_value=mean(value)),by=c("id","start","end")]

Instead, the better approach is to use a non-equi join combined with a weighted average computed over .EACHI from the join, where each matched row is weighted by the length of the overlap between its x interval and its y interval.
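As a quick aside, the weight used in j below, pmin(xend2,yend2)-pmax(xstart2,ystart2)+1L, is just the number of integer time points shared by two closed intervals. A minimal standalone illustration (made-up intervals, not part of the benchmark data):

#the closed intervals [1,7] and [5,10] share the three points 5, 6, 7
xs <- 1L; xe <- 7L
ys <- 5L; ye <- 10L
pmin(xe, ye) - pmax(xs, ys) + 1L
## [1] 3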


#first, copy interval columns and rename for clarity since 
 #I can never understand what a non-equi join does with the join columns in the return table:
setnames(x,c("start","end"),c("xstart","xend"))
setnames(y,c("start","end"),c("ystart","yend"))
x[, xstart2:=xstart]
x[, xend2:=xend]
y[, ystart2:=ystart]
y[, yend2:=yend]


#approach 1: the .EACHI approach:
time1 <- system.time({
out1 <- x[i=y,list(avg_value=weighted.mean(x=value,w=pmin(xend2,yend2)-pmax(xstart2,ystart2)+1L)),
           by=.EACHI, on=c("id","xend>=ystart","xstart<=yend"),nomatch=NULL]
})
setkey(out1, NULL) #for identical test below since the following approaches don't return a keyed table

#compare to approaches where an intermediate join table is created

#approach 2: intermediate table
#note that this isn't even using GFORCE optimization and it's still faster than the above approach
time2 <- system.time({
z <- x[i=y,on=c("id","xend>=ystart","xstart<=yend"),nomatch=NULL,allow.cartesian=TRUE]
z[, new_interval_length:=pmin(xend2,yend2)-pmax(xstart2,ystart2)+1L]
out2 <- z[, list(avg_value=weighted.mean(x=value,w=new_interval_length)),
          by=c("id","xend","xstart")]

})

#however, the performance issue comes back if you try to create the interval length variable in-place

#approach 3: intermediate table and interval length defined in place
time3 <- system.time({
  z <- x[i=y,on=c("id","xend>=ystart","xstart<=yend"),nomatch=NULL,allow.cartesian=TRUE]
  out3 <- z[, list(avg_value=weighted.mean(x=value,w=pmin(xend2,yend2)-pmax(xstart2,ystart2)+1L)),
            by=c("id","xend","xstart")]
  
})

The timings show that the .EACHI approach is slower, and this seems to be specifically because pmin/pmax are called inside the list() in j for every group. This is not ideal, because I'd like to use the .EACHI approach to reduce the memory usage of the averaging function I'm writing.


> identical(out1,out2) 
[1] TRUE 
> identical(out2,out3) 
[1] TRUE 

> time1
   user  system elapsed 
 11.712   1.152  12.261 
> time2
   user  system elapsed 
  5.324   1.016   4.554 
> time3
   user  system elapsed 
 12.740   0.980  12.155

So basically pmin/pmax seem to be the source of the slowness when they are passed directly as arguments to weighted.mean inside j.
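For anyone reproducing this, base R's profiler makes that attribution easy to check. A minimal sketch, assuming the renamed x/y tables from above are still in scope (the file name eachi_profile.out is arbitrary); most of the self time should land on pmin/pmax/weighted.mean rather than on data.table's grouping code:

Rprof("eachi_profile.out")
tmp <- x[i=y, list(avg_value=weighted.mean(x=value, w=pmin(xend2,yend2)-pmax(xstart2,ystart2)+1L)),
         by=.EACHI, on=c("id","xend>=ystart","xstart<=yend"), nomatch=NULL]
Rprof(NULL)
head(summaryRprof("eachi_profile.out")$by.self)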

In base R, the following two versions give basically the same timings for me:

system.time(
for(i in unique(z$id)){
  current_index <- z$id==i
  weighted.mean(x=z$value[current_index],
                w=pmin(z$xend2[current_index],z$yend2[current_index]) -
                  pmax(z$xstart2[current_index], z$ystart2[current_index]) + 1L)
}
)

system.time(
  for(i in unique(z$id)){
    current_index <- z$id==i
    new_interval_length <- pmin(z$xend2[current_index],z$yend2[current_index]) -
      pmax(z$xstart2[current_index], z$ystart2[current_index]) + 1L
    weighted.mean(x=z$value[current_index],
                  w=new_interval_length)
  }
)

myoung3 commented Jul 12, 2020

I should mention that defining new_interval_length as a local variable inside j in the .EACHI approach does not solve the issue:

time4 <- system.time({
  out1 <- x[i=y,list(avg_value={
    new_interval_length <- pmin(xend2,yend2)-pmax(xstart2,ystart2)+1L
    weighted.mean(x=value,w=new_interval_length)
    
  }),
            by=.EACHI, on=c("id","xend>=ystart","xstart<=yend"),nomatch=NULL]
})

time4 is basically the same as time1, since pmin/pmax are still called once per group.

@ColeMiller1

Thank you for the report - I can reproduce in both 1.12.8 and dev. It is always helpful to look at verbose = TRUE - this shows us that eval(j) is what slows down the by = .EACHI approach (note: the output below is truncated):

## by = .EACHI

## Making each group and running j (GForce FALSE) ... 
## memcpy contiguous groups took 0.050s for 664000 groups
## eval(j) took 10.931s for 664000 calls
## 11.3s elapsed (11.3s cpu) 

## after making the intermediate table

## Making each group and running j (GForce FALSE) ... 
##  memcpy contiguous groups took 0.045s for 664000 groups
##  eval(j) took 2.912s for 664000 calls
## 3.190s elapsed (3.170s cpu) 

This suggests that pmin and pmax are taking up some of the evaluation time. I think this makes sense: these functions do not need grouping, yet in this case we still call them 664,000 times. According to my timings, weighted.mean(...) alone takes about 3 seconds.

Taking a step back, it makes sense that extra calls take more time - here's an arbitrary example that adds some very lightweight operations but still takes twice as long:

library(data.table)
dt = data.table(grp = seq_len(664000L),
                val = rnorm(664000L))

bench::mark(dt[, .(val = val), by = grp],
            dt[, .(val = (val) + (2 - 2)), by = grp])
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                 min median `itr/sec` mem_alloc
#>   <bch:expr>                               <bch> <bch:>     <dbl> <bch:byt>
#> 1 dt[, .(val = val), by = grp]             254ms  258ms      3.87    16.8MB
#> 2 dt[, .(val = (val) + (2 - 2)), by = grp] 503ms  503ms      1.99    15.2MB
#> # ... with 1 more variable: `gc/sec` <dbl>

You may be able to improve the timings by writing a simplified function so that each call is more lightweight. This is middle of the road - faster than the original by = .EACHI, but there are still a lot of extra calls. The alternative function below works for this dataset, but if there are NAs or 0s in your weights it would likely be incorrect.

weighted.mean2 = function (val, xstart, xend, ystart, yend) {
  wts = ifelse(xend > yend, yend, xend) - ifelse(xstart > ystart, xstart, ystart)+1L
  return(sum(val * wts) / sum(wts))
}

time7 = system.time({
  out7 <- x[y,
            list(avg_value = weighted.mean2(value, xstart2, xend2, ystart2, yend2)),
            by=.EACHI, on=c("id","xend>=ystart","xstart<=yend"),nomatch=NULL, verbose = TRUE]
})

##Making each group and running j (GForce FALSE) ... 
##  memcpy contiguous groups took 0.057s for 664000 groups
##  eval(j) took 7.332s for 664000 calls
## 7.660s elapsed (7.610s cpu) 
setkey(out7, NULL)
all.equal(out1, out7)
## [1] TRUE
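If NAs were possible in value or in the interval columns, a slightly more defensive variant could drop incomplete rows first. This is a sketch only, not benchmarked, and weighted.mean3 is just a hypothetical name:

weighted.mean3 = function (val, xstart, xend, ystart, yend) {
  wts = pmin(xend, yend) - pmax(xstart, ystart) + 1L
  keep = !is.na(val) & !is.na(wts) & wts > 0   #drop missing values and non-positive weights
  sum(val[keep] * wts[keep]) / sum(wts[keep])
}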

Overall, it may be difficult to balance memory performance with actual function speed. The next step would be to look into Rcpp and I am going to make a separate issue related to it. I spent more time on it today but there seems to be a bug with columns from the i table.


myoung3 commented Jul 12, 2020

Thanks for the response. It seems like this really comes down to a j statement containing a mix of group-by operations and rowwise (vectorized) operations that could just as well be done without a by statement. In most contexts this doesn't matter, since you could first do the non-grouping operation once over all rows in one [.data.table call and then follow up with a grouping operation in a separate [.data.table call using by (sketched below). But in the context of .EACHI, both of these need to happen inside the by loop.
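A generic sketch of that two-call pattern (DT, g, a, b, and w are made-up names, not the benchmark data):

library(data.table)
DT <- data.table(g = rep(1:3, each = 4), a = rnorm(12), b = rnorm(12))
DT[, w := pmin(a, b)]                 #vectorized operation done once over all rows
DT[, list(total = sum(w)), by = g]    #the grouped pass just reuses the precomputed column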

If I had constructed my base R comparison correctly, this would have been clear. The following demonstrates the issue independently of data.table:

system.time(
  for(i in unique(z$id)){
    current_index <- z$id==i
    weighted.mean(x=z$value[current_index],
                  w=pmin(z$xend2[current_index],z$yend2[current_index]) -
                    pmax(z$xstart2[current_index], z$ystart2[current_index]) + 1L)
  }
)

system.time({
new_interval_length <- pmin(z$xend2,z$yend2) -
  pmax(z$xstart2, z$ystart2) + 1L
  for(i in unique(z$id)){
    current_index <- z$id==i
    weighted.mean(x=z$value[current_index],
                  w=new_interval_length[current_index])
  }
})

i.e., pmin and pmax have non-negligible per-call overhead.

That's not to say that there isn't opportunity for data.table to do better than base R.

It seems like it shouldn't be too hard to write a function that calculates interval intersection lengths (or the weighted average directly) with less overhead. I may also play around with this, but I probably won't get very far very quickly since I have limited C/C++ experience.

@ColeMiller1

@myoung3 could you include the timings in your example? One thing to note is that I do not believe your example is equivalent - for data.table, the grouping was by=c("id","xend","xstart").

length(unique(z$id))
## [1] 1000

uniqueN(z, c("id", "xstart", "xend"))
## [1] 664000


myoung3 commented Jul 12, 2020

@ColeMiller1 Oops yes I forgot to index over the intervals in y in that example.

If I were to actually do the task in base R it would take forever because of indexing and looping slowness. But I think the following demonstrates the point pretty concisely:

n <- 664000
v1 <- c(5L,8L)
v2 <- c(6L,7L)
v3 <- c(2L,1L)
v4 <- c(3L,2L)
> system.time(replicate(n, expr=pmin(v1,v2)-pmax(v3,v4)+1L))
   user  system elapsed 
  7.928   0.000   7.931 
> system.time(pmin(rep(v1,n),rep(v2,n))-pmax(rep(v3,n),rep(v4,n))+1L)
   user  system elapsed 
  0.020   0.000   0.019 

So there's nothing wrong with data.table here specifically. But this issue demonstrates that when it's necessary to perform an intermediate vectorized operation on columns resulting from a join within .EACHI groups, there is a time cost relative to not using .EACHI, because that vectorized operation is called once per group. In a sense, this is the defining "feature" of .EACHI: it avoids a large but unnecessary intermediate allocation by reusing the same allocation for each group, and as a result it's not possible to perform a vectorized operation over the entire intermediate table, because that table never exists in memory all at once. But it really highlights the per-call overhead of base R (or other) vectorized functions.

@ColeMiller1

I thought there was another post from you about using .Internal(pmin(TRUE, v1, v2)). I looked at it and it does seem like it would be faster. Using .Internal skips out on:

elts <- list(...)
if (length(elts) == 0L) 
    stop("no arguments")
if (all(vapply(elts, function(x) is.atomic(x) && !is.object(x), 
    NA))) {
    mmm <- .Internal(pmin(na.rm, ...))
    mostattributes(mmm) <- attributes(elts[[1L]])
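For what it's worth, base R also exports pmin.int()/pmax.int(), which are thin wrappers around that same .Internal call for plain atomic vectors, so they skip the argument checking and attribute handling shown above without calling .Internal() directly. A sketch of swapping them into the original j (out_int is a hypothetical name, not benchmarked here; it assumes the interval columns stay plain integers with no classes):

out_int <- x[i=y,
             list(avg_value = weighted.mean(x = value,
                    w = pmin.int(xend2, yend2) - pmax.int(xstart2, ystart2) + 1L)),
             by=.EACHI, on=c("id","xend>=ystart","xstart<=yend"), nomatch=NULL]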

Regardless, if you think that this is resolved, feel free to close the issue. I am not sure there is much more data.table can do. I will point out that data.table matches the base R loop's performance on the dummy data, which is pretty amazing:

n <- 664000
vals = c(1.1, 2.2)
v1 <- c(5L,8L)
v2 <- c(6L,7L)
v3 <- c(2L,1L)
v4 <- c(3L,2L)
system.time(for(i in seq_len(n)) {weighted.mean(vals, pmin(v1,v2)-pmax(v3,v4)+1L) })

##   user  system elapsed 
##  18.66    0.05   18.80 

time1 ## the original by = .EACHI example
##   user  system elapsed 
##  17.32    0.08   17.61 


myoung3 commented Jul 13, 2020

@ColeMiller1 Yes, I had a post on .Internal(pmin()), but I felt I should first study the source of pmin() to make sure it wasn't doing any important but time-consuming checks, so I deleted it until I was more confident that the time spent outside of .Internal(pmin, .) was actually being wasted. This probably is the solution, though.

I could close this, but first I'll ask whether there's any interest in including a discussion of this topic in the documentation. Are there other situations where people might use .EACHI but need to calculate intermediate variables that could be computed in a single vectorized operation if the entire dataset existed at once? If so, it might be worth discussing the performance trade-off between .EACHI and the intermediate-allocation approach (unless work is done to find optimized functions). But maybe this use case is so unusual it's not worth discussing. @mattdowle?

Yes, actually doing the indexing operations in base R on the real data takes ages. data.table does all the indexing basically instantly, which is just mind-blowing.


myoung3 commented Jul 13, 2020

Actually, I just searched through the vignettes and there is zero discussion of the memory benefits of using .EACHI. At the very least, it's probably worth mentioning the memory-allocation trick of .EACHI. I'm now not even sure where I read about how .EACHI works; maybe Stack Overflow or NEWS.

@ColeMiller1

I did not find any direct documentation about memory considerations for by = .EACHI either. As for the speed cost of realizing an intermediate table in order to perform a vectorized operation, that does seem relatively rare, as most of the time these operations are lightweight, such as +. I will ping @MichaelChirico just in case I missed something.

There are some related issues: #2181 (and closing PR #4398) and there's a link to this SO post.

Finally, below is an Rcpp approach. It's the fastest of the approaches, although I will note that NA or Inf values may cause issues without modification to the function. In summary, using it with data.table allows 664,000 groups to be ordered and processed in less than 2 seconds.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double custom_fx(NumericVector vals,
                        IntegerVector xstart,
                        IntegerVector xend,
                        IntegerVector ystart,
                        IntegerVector yend) {
    
    int n = vals.size();
    
    int wt_sum = 0;
    double val_sum = 0;
    
    int y_end = yend[0];
    int y_start = ystart[0];
    for (int i = 0; i < n; i++) {
        int wt = 1;
        
        if (xend[i] > y_end) {
            wt += y_end;
        } else {
            wt += xend[i];
        }
        
        if (xstart[i] > y_start) {
            wt -= xstart[i];
        } else {
            wt -= y_start;
        }
        
        val_sum += vals[i] * wt;
        wt_sum += wt;
    }    
    return(val_sum / wt_sum);
}

x[y,
  list(avg_value = custom_fx(value, xstart2, xend2, ystart2, yend2)),
  by=.EACHI, on=c("id","xend>=ystart","xstart<=yend"),nomatch=NULL, verbose = TRUE]

## Original by = .EACHI
##   user  system elapsed 
##  11.31    0.06   11.40

## Using C++
##   user  system elapsed 
##   1.94    0.02    1.96 
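
To reproduce this, the C++ function has to be compiled first. A minimal sketch, assuming the C++ chunk above is saved to a file named custom_fx.cpp (an arbitrary name):

library(Rcpp)
Rcpp::sourceCpp("custom_fx.cpp")  #compiles the file and exposes custom_fx() to R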

ColeMiller1 added the documentation and non-equi joins labels and removed the performance label on Jul 22, 2020