
on performing slower than double setkey #1232

Closed

MichaelChirico opened this issue Jul 18, 2015 · 8 comments

Comments
@MichaelChirico
Member

I recently integrated the new on functionality into some code of mine that was being dragged down by repetitive key switching (see here for some context), hoping it would (potentially) speed things up. I was quite surprised to find that the code actually ran about 30% slower (45 minutes instead of 35) using on.

I was able to reproduce this using large data.tables beefed up from @jangorecki's join_on tests:

library(data.table)
library(microbenchmark)  # for get_nanotime()

nn <- 1e6
mm <- 1e2

times <- 50L

set.seed(45L)
DT1 = data.table(x=sample(letters[1:3], nn, TRUE), y=sample(6:10, nn, TRUE),
                 a=sample(100, nn, TRUE), b=runif(nn))
DT2 = CJ(x=letters[1:3], y=6:10)[, mul := sample(20, 15)][sample(15L, mm, TRUE)]

times2 <- times1 <- numeric(times)
for (ii in 1:times) {
  # fresh, unkeyed copies each iteration so neither arm inherits a sort
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  cp1[cp2, on="x", allow.cartesian=TRUE]
  stp <- get_nanotime()
  times1[ii] <- stp - strt

  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  setkey(cp1, x)[setkey(cp2, x), allow.cartesian=TRUE]
  stp <- get_nanotime()
  times2[ii] <- stp - strt
}

> median(times1)/median(times2)
[1] 1.274535

So about 27% slower here. Maybe I'm misunderstanding the purpose of on, but I thought the double-setkey approach should basically be an upper bound on how long on takes. And indeed on is faster when the tables are smaller:

nn<-1e3

> median(times1)/median(times2)
[1] 0.9491699

So, roughly 5% faster when DT1 is smaller.

nn<-1e6; mm<-5
> median(times1)/median(times2)
[1] 0.9394226

Roughly 7% faster when DT2 is smaller.

@jangorecki
Member

I would advise not benchmarking the creation of the data.table objects, but only the isolated join operation. As it stands, the timings can pick up extra noise from that.

@MichaelChirico
Member Author

@jangorecki then how do we avoid the freebie given to the second operation? Namely, after the first run of setkey, DT1 and DT2 will already be sorted, no?

My thinking is that there's noise, but it should be more or less equal across operations. The real trouble is that it distorts the magnitude of the difference between the operations.

@jangorecki
Member

I would then benchmark manually instead of with the microbenchmark function.
Prepare unkeyed data and make a loop; inside the loop, copy the data, start timing with microbenchmark::get_nanotime(), proceed with setkey and the join, then stop timing. Do the same for on but without setting the key. Maybe someone else can share a better way to benchmark this.
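The loop described above could look roughly like this (a minimal sketch with made-up toy data, not the thread's benchmark; get_nanotime() comes from the microbenchmark package):

```r
library(data.table)
library(microbenchmark)  # provides get_nanotime()

set.seed(1L)
DT1 = data.table(x = sample(letters[1:3], 1e5, TRUE), v = runif(1e5))
DT2 = data.table(x = letters[1:3], mul = 1:3)

reps <- numeric(10L)
for (i in seq_along(reps)) {
  cp1 <- copy(DT1); cp2 <- copy(DT2)   # fresh, unkeyed copies every iteration
  t0 <- get_nanotime()
  setkey(cp1, x)[setkey(cp2, x)]       # time setkey + join together
  reps[i] <- get_nanotime() - t0       # elapsed nanoseconds
}
median(reps)
```

For the on= arm, drop the setkey() calls and time cp1[cp2, on = "x"] alone.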

@MichaelChirico
Member Author

@jangorecki thanks for the pointer, I've updated my timings.

@jangorecki
Member

Thanks for the nice timing. I ran your script and got 1.236222.
The commit related to that change is relatively small: f87a8aa22f836fdb0d5362f45889c72f63da7663. Maybe some readers will be able to track down the issue. The only additional overhead I see vs. the previous version is line 604: i = .shallow(i, retain.key = io).

@arunsrinivasan
Member

Advantages of on=:

  1. It's easier to tell on which columns the joins are performed just by looking at the syntax.

    X[Y, on="a"]

    Imagine doing setkey() somewhere at the top of your script/function and performing a join somewhere down the line; on= is right there as another argument.

  2. Relatively easier to grasp. No need to understand what setkey() does, or why we need it in order to join, etc.

  3. When you've set key on some columns, but want to join with another data.table based on another column, you don't have to re-key/re-order just for that join. Check this nice Q from @douglasclark on SO.

  4. Performance: Note that setkey() does two things:
    a) computes the order in which the data.table has to be sorted in increasing order and reorders it, and
    b) marks those columns as key columns in the form of sorted attribute.

    For most datasets it doesn't matter, but as your data gets larger, the time to reorder becomes the bottleneck compared to computing the order. There are cases where this may not be worth it. For example:

    require(data.table)
    set.seed(1L)
    X = setDT(lapply(1:10, function(x) sample(100, 2.6e7, TRUE)))[, id := sample(letters)]
    Y = data.table(id=letters, val = sample(26))
    
    print(object.size(X), units="Mb") # 1190.2 Mb
    print(object.size(Y), units="Mb") # 0 Mb
    
    X[Y, val := i.val, on="id"] # 0.8 seconds
    
    # using setkey
    setkey(X, id)[Y, val := i.val] # 3 seconds
  5. Sometimes it may just be crucial to retain the original order.
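Point 3 can be illustrated with a small sketch (toy data, not from the thread): a table keyed on one column can be joined on a different column via on= without disturbing its key or row order:

```r
library(data.table)

DT = data.table(a = c(3L, 1L, 2L), b = c("x", "y", "z"))
setkey(DT, a)                        # DT is now sorted and keyed by 'a'

lookup = data.table(b = c("y", "z"), val = c(10, 20))
res = DT[lookup, on = "b"]           # join on 'b'; no re-keying needed

key(DT)                              # still "a"
```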

In spite of all this, in most cases setkey() will speed up operations, because the data is sorted in RAM and is therefore quite cache efficient. So if you have huge datasets that you wouldn't be reordering often, keying can be worth it.
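A sketch of that trade-off (toy data, sizes chosen arbitrarily): setkey() pays the reordering cost once, and every subsequent keyed join reuses the sorted order:

```r
library(data.table)

set.seed(1L)
X = data.table(id = sample(letters, 1e5, TRUE), v = runif(1e5))
Y = data.table(id = letters, w = seq_along(letters))

setkey(X, id)   # pay the reordering cost once
res1 = X[Y]     # keyed join: no sorting needed here...
res2 = X[Y]     # ...or here; repeated joins amortise the setkey cost
```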

All in all, both setkey() and on= have their benefits and use cases. I myself am a hybrid user. It might take time and effort to figure out when and where each is the most performant. Personally, I find teaching on= usage for joins first makes more sense than setkey() + join, as it helps explain data.table's philosophy of performing joins as subsets.

Hope this helps.

PS: In spite of these points, it might be possible that the performance of on= joins could be improved. But we'd need significantly worse performance on a single join (compared to key-based joins) to isolate that case, I think.

@jangorecki
Member

Thanks for the detailed description. Can we turn this issue into an FR for documentation? Besides that, the issue will probably keep coming back in the future; the performance impact (where it may speed up and where it may slow down) is worth mentioning. Definitely not a high priority.

@arunsrinivasan arunsrinivasan mentioned this issue Aug 15, 2015
@arunsrinivasan
Member

@jangorecki noted down under #944
