You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
After reading #6717, I've gone back to the drawing board to see if we are missing something with how multiple works.
I am now convinced that we have gotten multiple slightly wrong, making it more painful than it should be. I think the intentions are right, the execution of what it warns on is just slightly off.
There are 4 types of joins worth considering in this discussion:
1-to-1
many-to-1
1-to-many
many-to-many
With equi-joins, multiple will warn on 1-to-many and many-to-many, i.e. anything with *-to-many where something in x matches multiple rows in y.
My big proposal here is that:
Warning on many-to-many is correct.
Warning on 1-to-many is wrong.
I think that many-to-1 and 1-to-many are both very useful and safe joins. But more importantly, these are essentially symmetrical styles of joins, so it doesn't make much sense to warn on one of these and not the other. Most of the time, you actually see these listed online as just1-to-many, with the opposite being inferred, because they are so similar. It's just a matter of how you frame your question.
Our original thought process was:
It is confusing for users when a join results in more rows than there are in x.
This is what can happen in 1-to-many and many-to-many, which explains our current behavior, but I think a more appropriate assertion is:
It is confusing for users when a join can result in a (possibly cartesian) explosion of nrow(x) * nrow(y) rows.
This can only happen with many-to-many, and is worth guarding against because it typically means there is a misspecified join. This assertion is supported by three trustworthy references online:
Each record in both tables can relate to none or any number of records in the other table. These relationships require a third table, called an associate or linking table, because relational systems cannot directly accommodate the relationship.
Stata's merge guide which does m:m merges slightly differently than we do, but the principle still stands, I think:
Because m:m merges are such a bad idea, we are not going to show you an example. If you think
that you need an m:m merge, then you probably need to work with your data so that you can use a
1:m or m:1 merge.
FileMaker Pro's help docs which discuss how many-to-many relationships can always be replaced with a 3rd table that creates two 1-to-many relationships to that table.
I do think that multiple = "error" is still quite useful. It's really great when you want to enforce that you have a 1-to-1 or many-to-1 style join where y is a kind of lookup table that contains extra information you want to bind on to x without changing its rows. That is a different use case than what we are discussing here.
Implementing what I'm suggesting here would silence the join warnings seen here: #6717 (comment)
I think that that user is doing "everything right," i.e. there is nothing wrong with those joins. Even the three-way join of:
is just two 1:many joins, which I'm now convinced are normal and expected join types.
So how would we do this? I think we need an argument specifically targeting many-to-many joins, like:
#> @param many Handling of many-to-many joins. These typically indicate#> a structural problem in your join or data.#> #> - `"ignore"`, allow many-to-many joins.#> - `"warning"`, warn on many-to-many joins.#> - `"error"`, error on many-to-many joins.#> #> If `NULL`, will default to `"warning"` for equi joins.#> #> Not applicable for non-equi joins (NOT SURE ABOUT THIS).many=NULL/ c("ignore", "warning", "error")
We'd then change multiple to have an unconditional default of "all", and we'd quickly deprecate "warning" because it would no longer be meaningful in any way (you'd just use "error" if you were worried about that case).
And of course I'd have to figure out how to exactly detect the many-many case on the vctrs side.
This framing is also supported by a great data.table discussion about allow.cartesianRdatatable/data.table#4383. The a) case there aligns with what multiple = "error" catches, and their b) case aligns with what many would catch. They've essentially come to the same conclusion that I have about what is needed to guard against these effectively.
I also think this is something I'd eventually be comfortable making an error by default, which was one of our goals with multiple's default originally.
The text was updated successfully, but these errors were encountered:
@hannes I've ended up implementing this in #6753 as a new relationship argument that allows you to specify the expected relationship between the keys of x and y, like "one_to_many" or "one_to_one". It errors if you specify a relationship that doesn't actually hold when performing the join.
The default for equality joins is "warn_many_to_many", which doesn't assume any known relationship, but will check for a many-to-many relationship along the way and will warn if it finds one, encouraging you to look back at your data or set relationship = "many_to_many" to make the relationship explicit if that is expected.
After reading #6717, I've gone back to the drawing board to see if we are missing something with how
multiple
works.I am now convinced that we have gotten
multiple
slightly wrong, making it more painful than it should be. I think the intentions are right, the execution of what it warns on is just slightly off.There are 4 types of joins worth considering in this discussion:
1-to-1
many-to-1
1-to-many
many-to-many
With equi-joins,
multiple
will warn on1-to-many
andmany-to-many
, i.e. anything with*-to-many
where something inx
matches multiple rows iny
.My big proposal here is that:
many-to-many
is correct.1-to-many
is wrong.I think that
many-to-1
and1-to-many
are both very useful and safe joins. But more importantly, these are essentially symmetrical styles of joins, so it doesn't make much sense to warn on one of these and not the other. Most of the time, you actually see these listed online as just1-to-many
, with the opposite being inferred, because they are so similar. It's just a matter of how you frame your question.Our original thought process was:
This is what can happen in
1-to-many
andmany-to-many
, which explains our current behavior, but I think a more appropriate assertion is:This can only happen with
many-to-many
, and is worth guarding against because it typically means there is a misspecified join. This assertion is supported by three trustworthy references online:IBM's database relationship guide which says this about
many-to-many
, suggesting they aren't a good idea:Stata's merge guide which does
m:m
merges slightly differently than we do, but the principle still stands, I think:FileMaker Pro's help docs which discuss how many-to-many relationships can always be replaced with a 3rd table that creates two 1-to-many relationships to that table.
I do think that
multiple = "error"
is still quite useful. It's really great when you want to enforce that you have a1-to-1
ormany-to-1
style join wherey
is a kind of lookup table that contains extra information you want to bind on tox
without changing its rows. That is a different use case than what we are discussing here.Implementing what I'm suggesting here would silence the join warnings seen here:
#6717 (comment)
I think that that user is doing "everything right," i.e. there is nothing wrong with those joins. Even the three-way join of:
is just two
1:many
joins, which I'm now convinced are normal and expected join types.So how would we do this? I think we need an argument specifically targeting
many-to-many
joins, like:We'd then change
multiple
to have an unconditional default of"all"
, and we'd quickly deprecate"warning"
because it would no longer be meaningful in any way (you'd just use"error"
if you were worried about that case).And of course I'd have to figure out how to exactly detect the
many-many
case on the vctrs side.This framing is also supported by a great data.table discussion about
allow.cartesian
Rdatatable/data.table#4383. Thea)
case there aligns with whatmultiple = "error"
catches, and theirb)
case aligns with whatmany
would catch. They've essentially come to the same conclusion that I have about what is needed to guard against these effectively.I also think this is something I'd eventually be comfortable making an error by default, which was one of our goals with
multiple
's default originally.The text was updated successfully, but these errors were encountered: