Increase threshold to trigger allow.cartesian error? #2455
Comments
You can now control it using |
You mean I guess one version of implementing this would be to make that option more flexible, something like 0 = current behavior and 1 = allow full Cartesian joins without error.
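For what it's worth, here is one hypothetical reading of a "more flexible" option, sketched in Python as plain arithmetic rather than data.table code. It assumes the option is a number between 0 and 1 that interpolates linearly between the current limit, nrow(x) + nrow(i), and the full Cartesian size, nrow(x) * nrow(i). The function name, the parameter name `flexibility`, and the linear interpolation are all assumptions made for illustration; nothing in this thread specifies what intermediate values would mean.

```python
def join_row_limit(nrow_x: int, nrow_i: int, flexibility: float) -> float:
    """Hypothetical row limit for a join result.

    flexibility = 0 gives the current limit, nrow(x) + nrow(i).
    flexibility = 1 gives the full Cartesian size, nrow(x) * nrow(i).
    Values in between interpolate linearly (an assumption, not a spec).
    """
    if not 0.0 <= flexibility <= 1.0:
        raise ValueError("flexibility must be between 0 and 1")
    additive = nrow_x + nrow_i
    # A join can never return more than nrow_x * nrow_i rows; for tiny
    # tables the product can be below the sum, so never go under the sum.
    cartesian = max(nrow_x * nrow_i, additive)
    return additive + flexibility * (cartesian - additive)


# Hypothetical table sizes, chosen only for illustration.
for f in (0.0, 0.5, 1.0):
    print(f, join_row_limit(1000, 200, f))
```

With a setting like this, 0 keeps today's behavior and 1 never raises the error, since no join can return more than the full Cartesian product.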
I was surprised giving this answer that I needed to trigger |
I think that allow.cartesian was introduced to prevent many-to-many joins, not really full Cartesian joins. I don't think an option that only prevents full Cartesian joins would be that useful, because a full Cartesian join rarely happens by accident, while a partial Cartesian (many-to-many) join can happen quite commonly.
Related: #4383
I am routinely bitten by this, because I forget to add sidenotes: I would prefer |
AFAIK, detecting duplicates of values that are not used in the join (no matching value in the other table) adds extra overhead, but please double-check. Does the datatable.join.many option (possibly combined with allow.cartesian) proposed in #4370 resolve your use case?
Not sure to be honest. Not sure if my input can add anything here. Are you proposing that I switch from
The OP's MRE already demonstrates the main issue.
One of the only times I find myself using allow.cartesian = TRUE is when I'm doing clustered bootstrap estimates, for example:
This fails about 40% of the time, because the staggered group sizes mean that, even though we pull the same number of groups at each iteration, the resulting number of rows often exceeds that of the original table. This is expected behavior, so it's somewhat bothersome to have to specify allow.cartesian, especially since the argument doesn't really capture what I'm trying to do (this is nothing near a Cartesian join).
Diagnosing a bit more, we see:
The number of rows never exceeds the table size by more than about 20% (of course this depends on the underlying group sizes).
1.2*(nrow(x) + nrow(i)) seems as good a threshold as any... not sure how useful this would be to other users, so just throwing it out there for now.
Could also consider basing the threshold on proximity to nrow(x)*nrow(i) (i.e., full Cartesian) instead of excess over nrow(x) + nrow(i), say, if it's more than 40% of the way to being Cartesian, throw the error? (That threshold would be 180 in this case, i.e., the same as 20% over the summative row total.)
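To make the two candidate rules concrete, here is a minimal Python sketch of the arithmetic only (not data.table code). It assumes "20% over the summative total" means a limit of 1.2 * (nrow(x) + nrow(i)) and reads "40% of the way to being Cartesian" as 0.4 * nrow(x) * nrow(i); that second reading is a guess. The function names and the table sizes are hypothetical and are not the ones from the bootstrap example above.

```python
def exceeds_additive(result_rows: int, nrow_x: int, nrow_i: int,
                     slack: float = 1.2) -> bool:
    """True if the join result is larger than slack * (nrow_x + nrow_i)."""
    return result_rows > slack * (nrow_x + nrow_i)


def exceeds_cartesian_fraction(result_rows: int, nrow_x: int, nrow_i: int,
                               fraction: float = 0.4) -> bool:
    """True if the join result is more than `fraction` of a full
    Cartesian join, i.e. fraction * nrow_x * nrow_i rows."""
    return result_rows > fraction * nrow_x * nrow_i


# Hypothetical sizes: x has 1000 rows, i has 200 rows, and the join
# returns 1500 rows.
print(exceeds_additive(1500, 1000, 200))            # compares with 1.2 * 1200
print(exceeds_cartesian_fraction(1500, 1000, 200))  # compares with 0.4 * 200000
```

The two rules scale very differently: the additive limit grows linearly with table size, while the Cartesian-fraction limit grows with the product, so for large tables the second rule would only catch joins that are already enormous.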