
on performing slower than double setkey #1232

Closed

MichaelChirico opened this issue Jul 18, 2015 · 8 comments

Comments
@MichaelChirico
Member

I recently integrated the new on functionality into some code of mine that was being dragged down by repetitive key switching (see here for some context), hoping it would (potentially) speed things up. I was quite surprised to find that the code actually ran about 30% slower (45 minutes instead of 35) using on.

I was able to reproduce this using large data.tables beefed up from @jangorecki's join_on tests:

library(data.table)
library(microbenchmark)  # for get_nanotime()

nn <- 1e6
mm <- 1e2

times <- 50L

set.seed(45L)
DT1 = data.table(x=sample(letters[1:3], nn, TRUE), y=sample(6:10, nn, TRUE),
                 a=sample(100, nn, TRUE), b=runif(nn))
DT2 = CJ(x=letters[1:3], y=6:10)[, mul := sample(20, 15)][sample(15L, mm, TRUE)]

times2 <- times1 <- numeric(times)
for (ii in 1:times) {
  # fresh, unkeyed copies each iteration so neither arm inherits a sort
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  cp1[cp2, on="x", allow.cartesian=TRUE]
  stp <- get_nanotime()
  times1[ii] <- stp - strt

  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  setkey(cp1, x)[setkey(cp2, x), allow.cartesian=TRUE]
  stp <- get_nanotime()
  times2[ii] <- stp - strt
}

> median(times1)/median(times2)
[1] 1.274535

So about 27% slower here. Maybe I'm misunderstanding the purpose of on, but I thought the double-setkey approach should basically be an upper bound on how long on takes. And indeed on is faster when the tables are smaller:

nn<-1e3

> median(times1)/median(times2)
[1] 0.9491699

So, roughly 5% faster when DT1 is smaller.

nn<-1e6; mm<-5
> median(times1)/median(times2)
[1] 0.9394226

Roughly 7% faster when DT2 is smaller.

@jangorecki
Member

I would advise not benchmarking the creation of the data.table objects, but only the isolated join operation. As it stands, the timings can pick up extra noise from that.

@MichaelChirico
Member Author

@jangorecki then how do we avoid the freebie given to the second operation? Namely, after the first run of setkey, DT1 and DT2 will already be sorted, no?

My thinking is that there's noise, but it should be more or less equal across operations. The real trouble is that it distorts the magnitude of the difference between the operations.

@jangorecki
Member

I would then benchmark manually instead of with the microbenchmark function.
Prepare unkeyed data and make a loop; inside the loop, copy the data, start timing with microbenchmark::get_nanotime(), proceed with setkey and the join, then stop timing. Do the same for on but without setting the key. Maybe someone else can share a better way to benchmark this.
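The loop described above could look roughly like this (a minimal sketch with made-up toy data, not the thread's benchmark; get_nanotime() comes from the microbenchmark package):

```r
library(data.table)
library(microbenchmark)  # provides get_nanotime()

set.seed(1L)
DT1 = data.table(x = sample(letters[1:3], 1e5, TRUE), v = runif(1e5))
DT2 = data.table(x = letters[1:3], mul = 1:3)

reps <- numeric(10L)
for (i in seq_along(reps)) {
  cp1 <- copy(DT1); cp2 <- copy(DT2)   # fresh, unkeyed copies every iteration
  t0 <- get_nanotime()
  setkey(cp1, x)[setkey(cp2, x)]       # time setkey + join together
  reps[i] <- get_nanotime() - t0       # elapsed nanoseconds
}
median(reps)
```

For the on= arm, drop the setkey() calls and time cp1[cp2, on = "x"] alone.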

@MichaelChirico
Member Author

@jangorecki thanks for the pointer, I've updated my timings.

@jangorecki
Member

Thanks for the nice timing. I ran your script and got 1.236222.
The commit related to that change is relatively small: f87a8aa22f836fdb0d5362f45889c72f63da7663. Maybe some readers will be able to track down the issue. The only additional overhead I see vs. the previous version is line 604: i = .shallow(i, retain.key = io).

@arunsrinivasan
Member

Advantages of on=:

  1. It's easier to tell on which columns the joins are performed just by looking at the syntax.

    X[Y, on="a"]

    Imagine doing setkey() somewhere at the top of your script/function and performing a join somewhere down the line; on= is right there as another argument.

  2. Relatively easier to grasp. No need to understand what setkey() does, or why we need it in order to join, etc.

  3. When you've set key on some columns, but want to join with another data.table based on another column, you don't have to re-key/re-order just for that join. Check this nice Q from @douglasclark on SO.

  4. Performance: Note that setkey() does two things:
    a) computes the order in which the data.table has to be sorted in increasing order and reorders it, and
    b) marks those columns as key columns in the form of sorted attribute.

    For most datasets it doesn't matter, but as your data gets larger, the time to reorder becomes the bottleneck compared to computing the order. There are cases where this may not be worth it. For example:

    require(data.table)
    set.seed(1L)
    X = setDT(lapply(1:10, function(x) sample(100, 2.6e7, TRUE)))[, id := sample(letters)]
    Y = data.table(id=letters, val = sample(26))
    
    print(object.size(X), units="Mb") # 1190.2 Mb
    print(object.size(Y), units="Mb") # 0 Mb
    
    X[Y, val := i.val, on="id"] # 0.8 seconds
    
    # using setkey
    setkey(X, id)[Y, val := i.val] # 3 seconds
  5. Sometimes it may just be crucial to retain the original order.
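Point 3 can be illustrated with a small sketch (toy data, not from the thread): a table keyed on one column can be joined on a different column via on= without disturbing its key or row order:

```r
library(data.table)

DT = data.table(a = c(3L, 1L, 2L), b = c("x", "y", "z"))
setkey(DT, a)                        # DT is now sorted and keyed by 'a'

lookup = data.table(b = c("y", "z"), val = c(10, 20))
res = DT[lookup, on = "b"]           # join on 'b'; no re-keying needed

key(DT)                              # still "a"
```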

In spite of all this, in most cases setkey() will speed up operations, because the data is sorted in RAM and is therefore quite cache efficient. So if you have huge datasets that you wouldn't be reordering often, keying can be worth it.
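A sketch of that trade-off (toy data, sizes chosen arbitrarily): setkey() pays the reordering cost once, and every subsequent keyed join reuses the sorted order:

```r
library(data.table)

set.seed(1L)
X = data.table(id = sample(letters, 1e5, TRUE), v = runif(1e5))
Y = data.table(id = letters, w = seq_along(letters))

setkey(X, id)   # pay the reordering cost once
res1 = X[Y]     # keyed join: no sorting needed here...
res2 = X[Y]     # ...or here; repeated joins amortise the setkey cost
```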

All in all, both setkey() and on= have their benefits and use cases. I myself am a hybrid user. It might take time and effort to figure out when and where each is the most performant. Personally, I find teaching on= usage for joins first makes more sense than setkey() + join, as it helps explain data.table's philosophy of performing joins as subsets.

Hope this helps.

PS: In spite of these points, it might be possible that the performance of on= joins could be improved. But we'd need significantly worse performance on a single join (compared to key-based joins) to isolate that case, I think.

@jangorecki
Member

Thanks for the detailed description. Can we turn this issue into an FR for documentation? Besides that, the issue will probably keep coming back in the future; the performance impact (where it may speed up and where it may slow down) is worth mentioning. Definitely not a high priority.

@arunsrinivasan arunsrinivasan mentioned this issue Aug 15, 2015
@arunsrinivasan
Member

@jangorecki noted down under #944
