Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase threshold to trigger allow.cartesian error? #2455

Open
MichaelChirico opened this issue Nov 1, 2017 · 8 comments
Open

Increase threshold to trigger allow.cartesian error? #2455

MichaelChirico opened this issue Nov 1, 2017 · 8 comments
Labels
feature request joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Nov 1, 2017

One of the only times I find myself using allow.cartesian = TRUE is when I'm doing clustered bootstrap estimates, for example:

# sample to induce staggered group sizes
set.seed(203940)
x = as.data.table(iris)[sample(.N, .N, TRUE)]
setkey(x, Species)
x[ , .N, keyby = Species]
#       Species  N
# 1:     setosa 43
# 2: versicolor 59
# 3:  virginica 48
spec = unique(x$Species)
BB = 100
runs = sapply(seq_len(BB), function(ii) {
  tryCatch({
    #sample _groups_ with replacement
    x[.(sample(spec, length(spec), TRUE)), 
      lm(Sepal.Length ~ Sepal.Width)$coefficients]
    #success
    TRUE
    #failure
  }, error = function(x) FALSE)
})
mean(runs)
# [1] 0.63

This fails about 40% of the time, because the staggered group sizes means that, even though we pull the same number of groups at each iteration, the resulting number of rows often exceeds that of the original table. This is expected behavior, so it's somewhat bothersome to have to specify allow.cartesian, especially since the argument doesn't really capture what I'm trying to do (this is nothing near a Cartesian join).

Diagnosing a bit more, we see:

sizes = replicate(1e5, x[.(sample(spec, length(spec), TRUE)),
                         .N, allow.cartesian = TRUE])
table(sizes)
# sizes
# 129   134   139   144   145   150   155   161   166   177 
# 3684 11086 11084  3832 11073 22338 11007 10998 11219  3679 

The number of rows never exceeds about 20% of the table size (of course this depends on the underlying group sizes). 1.2*(nrow(x) + nrow(i)) seems as good a threshold as any... not sure how useful this would be to other users, so just throwing it out there for now.

Could also consider basing the threshold on proximity to nrow(x)*nrow(i) (i.e., full Cartesian) instead of excess over nrow(x) + nrow(i), say, if it's more than 40% of the way to being Cartesian, throw the error? (that threshold would be 180 in this case, i.e., the same as 20% over the summative row total)

@jangorecki
Copy link
Member

You can now control it using options. Even when implementing proposed feature I think it should by default be 0%. I was depending on that in some applications, mostly because some SQL db models assumes no duplicated entries in key column.

@MichaelChirico
Copy link
Member Author

You mean datatable.allow.cartesian?

I guess one version of implementing this would be to make that option more flexible, something like 0 = current behavior, 1 = allow full cartesian joins without error.

@MichaelChirico
Copy link
Member Author

I was surprised giving this answer that I needed to trigger allow.cartesian = TRUE for a non-equi-join. This case seems natural for the number of rows to grow, doesn't it? Should allow.cartesian be TRUE by default in this case?

@jangorecki jangorecki added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Apr 6, 2020
@jangorecki jangorecki changed the title Q - Increase threshold to trigger allow.cartesian error? Increase threshold to trigger allow.cartesian error? Apr 6, 2020
@jangorecki
Copy link
Member

I think that allow.cartesian was introduced to prevent many-to-many joins, not really full cartesian join. I don't think the option to prevent full cartesian will be that useful, because full cartesian rarely happens by accident, while partial cartesian, many-to-many join, may happen quite commonly.
It may be good idea to turn allow.cartesian into double scalar, as you proposed, where 1.00 will mean real full cartesian. Note that allow.cartesian does not really cover many-to-many cases (test 49.1).

@jangorecki
Copy link
Member

Related #4383

@jan-glx
Copy link
Contributor

jan-glx commented Aug 17, 2023

I am routinely bitten by this, because I forget to add .allow.cartesian=TRUE but it works fine until I use the same code on different data. However, I too think it should rather trigger as soon as there are any duplicate keys in both x and i (even if its the not the same values that are duplicated), because it often helps me to catch bugs early.
A warning would be more convenient though, and would avoid breaking existing code; the threshold for the existing error could then be raised to something that would significantly affect usability of the current R session.

sidenotes: I would prefer mult="many-to-many" or allow.many_to_many=TRUE over allow.cartesian (see also this comment
The datatable.allow.cartesian changes actual behavior and should not be used (apart from keeping unfixable pre 2013 old code working).

@jangorecki
Copy link
Member

jangorecki commented Aug 17, 2023

AFAIK detecting duplicates of values that are not being used for join (no matching value in another table) is an extra overhead, but please double check.

Does datatable.join.many option (possibly combined with allow.cartesian) proposed in #4370 resolves your use case?
Manual entry can be found in https://github.com/Rdatatable/data.table/pull/4370/files?file-filters%5B%5D=.Rd&show-viewed-files=true#diff-451c5f42e8d06446646d0a1dd8a8b5c4e252d744cfc88ac1f73022fa5c132658R20
If not then MRE would be useful.

@jan-glx
Copy link
Contributor

jan-glx commented Aug 17, 2023

Not sure to be honest. Not sure if my input can add anything here. Are you proposing that I switch from [ to mergelist in general, or would these parameters be implemented in [ as well?
In any case, considering only mult=all, from your link I gathered that default would be join.many=getOption("datatable.join.many", TRUE), i.e. don't check, don't warn, don't error. Setting it to FALSE would mean do check thoroughly (with some costs), don't warn, error if check was positive

  • don't check, don't warn, don't error by default is fine with me, having parameter to turn on "check" enables defensive programming (great) but I would probably be to lazy to type it every time.

  • Having a global option to set the default however is not good/useless because it not only changes output but the behavior thus I cannot turn it on (= set it to FALSE)) because that might break some packages (that did not develop with datatable.join.many=FALSE and thus did not set join.many=TRUE where necessary)

  • If a positive check would only lead to a warning (with the parameter better be named warn.many.to.many or so) i.e. only affecting output then it would be fine with me again although it would not protect from crashing R session anymore.

  • I feel like, ideally, there should be a parameter e.g. many.to.many = c("warn", "allow", "error") :

    • "warn", does a check, warn if positive, don't error
    • "error", does a check, don't warn, error if positive [not really needed, perhaps warn.many.to.many=TRUE is a better option]
    • "allow", does no check, doesn't want, doesn't error
    • setting it to"allow" explicitly turns off the check to a) turn off the warning in case one really wants a many-to-many join or to b) avoid the potential computation costs of the check because one is sure that there are not duplicates in both.
    • to remain completely backwards compatible (apart from some reasonable warning messages), I would soft deprecate allow.cartesian, make the signature
      allow.cartesian=NULL meaning that the value is computed as many.to.many %in% "allow", while the default for many.to.many would be "warn" if (is.null(allow.cartesian)|| !allow.cartesian) and "allow"` otherwise
  • what is also certainly not fine for me is the current behavior of don't really check, never warn, sometimes error (as for OP)

mergelist looks cool btw (would have needed it a handful of times)

OP MRE demonstrates the main issue already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins
Projects
None yet
Development

No branches or pull requests

3 participants