Increase threshold to trigger allow.cartesian error? #2455

MichaelChirico · 2017-11-01T04:42:39Z

One of the only times I find myself using allow.cartesian = TRUE is when I'm doing clustered bootstrap estimates, for example:

# sample to induce staggered group sizes
set.seed(203940)
x = as.data.table(iris)[sample(.N, .N, TRUE)]
setkey(x, Species)
x[ , .N, keyby = Species]
#       Species  N
# 1:     setosa 43
# 2: versicolor 59
# 3:  virginica 48
spec = unique(x$Species)
BB = 100
runs = sapply(seq_len(BB), function(ii) {
  tryCatch({
    #sample _groups_ with replacement
    x[.(sample(spec, length(spec), TRUE)), 
      lm(Sepal.Length ~ Sepal.Width)$coefficients]
    #success
    TRUE
    #failure
  }, error = function(x) FALSE)
})
mean(runs)
# [1] 0.63

This fails about 40% of the time, because the staggered group sizes means that, even though we pull the same number of groups at each iteration, the resulting number of rows often exceeds that of the original table. This is expected behavior, so it's somewhat bothersome to have to specify allow.cartesian, especially since the argument doesn't really capture what I'm trying to do (this is nothing near a Cartesian join).

Diagnosing a bit more, we see:

sizes = replicate(1e5, x[.(sample(spec, length(spec), TRUE)),
                         .N, allow.cartesian = TRUE])
table(sizes)
# sizes
# 129   134   139   144   145   150   155   161   166   177 
# 3684 11086 11084  3832 11073 22338 11007 10998 11219  3679

The number of rows never exceeds about 20% of the table size (of course this depends on the underlying group sizes). 1.2*(nrow(x) + nrow(i)) seems as good a threshold as any... not sure how useful this would be to other users, so just throwing it out there for now.

Could also consider basing the threshold on proximity to nrow(x)*nrow(i) (i.e., full Cartesian) instead of excess over nrow(x) + nrow(i), say, if it's more than 40% of the way to being Cartesian, throw the error? (that threshold would be 180 in this case, i.e., the same as 20% over the summative row total)

The text was updated successfully, but these errors were encountered:

jangorecki · 2017-11-01T09:00:36Z

You can now control it using options. Even when implementing proposed feature I think it should by default be 0%. I was depending on that in some applications, mostly because some SQL db models assumes no duplicated entries in key column.

MichaelChirico · 2017-11-01T10:28:29Z

You mean datatable.allow.cartesian?

I guess one version of implementing this would be to make that option more flexible, something like 0 = current behavior, 1 = allow full cartesian joins without error.

MichaelChirico · 2018-04-01T08:33:44Z

I was surprised giving this answer that I needed to trigger allow.cartesian = TRUE for a non-equi-join. This case seems natural for the number of rows to grow, doesn't it? Should allow.cartesian be TRUE by default in this case?

jangorecki · 2020-05-03T17:18:23Z

I think that allow.cartesian was introduced to prevent many-to-many joins, not really full cartesian join. I don't think the option to prevent full cartesian will be that useful, because full cartesian rarely happens by accident, while partial cartesian, many-to-many join, may happen quite commonly.
It may be good idea to turn allow.cartesian into double scalar, as you proposed, where 1.00 will mean real full cartesian. Note that allow.cartesian does not really cover many-to-many cases (test 49.1).

jangorecki · 2020-11-21T18:25:03Z

Related #4383

jan-glx · 2023-08-17T16:20:20Z

I am routinely bitten by this, because I forget to add .allow.cartesian=TRUE but it works fine until I use the same code on different data. However, I too think it should rather trigger as soon as there are any duplicate keys in both x and i (even if its the not the same values that are duplicated), because it often helps me to catch bugs early.
A warning would be more convenient though, and would avoid breaking existing code; the threshold for the existing error could then be raised to something that would significantly affect usability of the current R session.

sidenotes: I would prefer mult="many-to-many" or allow.many_to_many=TRUE over allow.cartesian (see also this comment
The datatable.allow.cartesian changes actual behavior and should not be used (apart from keeping unfixable pre 2013 old code working).

jangorecki · 2023-08-17T16:51:36Z

AFAIK detecting duplicates of values that are not being used for join (no matching value in another table) is an extra overhead, but please double check.

Does datatable.join.many option (possibly combined with allow.cartesian) proposed in #4370 resolves your use case?
Manual entry can be found in https://github.com/Rdatatable/data.table/pull/4370/files?file-filters%5B%5D=.Rd&show-viewed-files=true#diff-451c5f42e8d06446646d0a1dd8a8b5c4e252d744cfc88ac1f73022fa5c132658R20
If not then MRE would be useful.

jan-glx · 2023-08-17T23:01:07Z

Not sure to be honest. Not sure if my input can add anything here. Are you proposing that I switch from [ to mergelist in general, or would these parameters be implemented in [ as well?
In any case, considering only mult=all, from your link I gathered that default would be join.many=getOption("datatable.join.many", TRUE), i.e. don't check, don't warn, don't error. Setting it to FALSE would mean do check thoroughly (with some costs), don't warn, error if check was positive

don't check, don't warn, don't error by default is fine with me, having parameter to turn on "check" enables defensive programming (great) but I would probably be to lazy to type it every time.
Having a global option to set the default however is not good/useless because it not only changes output but the behavior thus I cannot turn it on (= set it to FALSE)) because that might break some packages (that did not develop with datatable.join.many=FALSE and thus did not set join.many=TRUE where necessary)
If a positive check would only lead to a warning (with the parameter better be named warn.many.to.many or so) i.e. only affecting output then it would be fine with me again although it would not protect from crashing R session anymore.
I feel like, ideally, there should be a parameter e.g. many.to.many = c("warn", "allow", "error") :
- "warn", does a check, warn if positive, don't error
- "error", does a check, don't warn, error if positive [not really needed, perhaps warn.many.to.many=TRUE is a better option]
- "allow", does no check, doesn't want, doesn't error
- setting it to"allow" explicitly turns off the check to a) turn off the warning in case one really wants a many-to-many join or to b) avoid the potential computation costs of the check because one is sure that there are not duplicates in both.
- to remain completely backwards compatible (apart from some reasonable warning messages), I would soft deprecate allow.cartesian, make the signature
  allow.cartesian=NULL meaning that the value is computed as many.to.many %in% "allow", while the default for many.to.many would be "warn" if (is.null(allow.cartesian)|| !allow.cartesian) and "allow"` otherwise
what is also certainly not fine for me is the current behavior of don't really check, never warn, sometimes error (as for OP)

mergelist looks cool btw (would have needed it a handful of times)

OP MRE demonstrates the main issue already.

jangorecki added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Apr 6, 2020

jangorecki changed the title ~~Q - Increase threshold to trigger allow.cartesian error?~~ Increase threshold to trigger allow.cartesian error? Apr 6, 2020

jangorecki mentioned this issue Apr 16, 2020

allow.cartesian should be more precise #4383

Open

jangorecki added the feature request label May 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase threshold to trigger allow.cartesian error? #2455

Increase threshold to trigger allow.cartesian error? #2455

MichaelChirico commented Nov 1, 2017 •

edited

Loading

jangorecki commented Nov 1, 2017

MichaelChirico commented Nov 1, 2017

MichaelChirico commented Apr 1, 2018

jangorecki commented May 3, 2020

jangorecki commented Nov 21, 2020

jan-glx commented Aug 17, 2023

jangorecki commented Aug 17, 2023 •

edited

Loading

jan-glx commented Aug 17, 2023

Increase threshold to trigger allow.cartesian error? #2455

Increase threshold to trigger allow.cartesian error? #2455

Comments

MichaelChirico commented Nov 1, 2017 • edited Loading

jangorecki commented Nov 1, 2017

MichaelChirico commented Nov 1, 2017

MichaelChirico commented Apr 1, 2018

jangorecki commented May 3, 2020

jangorecki commented Nov 21, 2020

jan-glx commented Aug 17, 2023

jangorecki commented Aug 17, 2023 • edited Loading

jan-glx commented Aug 17, 2023

MichaelChirico commented Nov 1, 2017 •

edited

Loading

jangorecki commented Aug 17, 2023 •

edited

Loading