Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On multiple matches, and how we can improve #6731

Closed
DavisVaughan opened this issue Feb 16, 2023 · 2 comments · Fixed by #6753
Closed

On multiple matches, and how we can improve #6731

DavisVaughan opened this issue Feb 16, 2023 · 2 comments · Fixed by #6753
Milestone

Comments

@DavisVaughan
Copy link
Member

DavisVaughan commented Feb 16, 2023

After reading #6717, I've gone back to the drawing board to see if we are missing something with how multiple works.

I am now convinced that we have gotten multiple slightly wrong, making it more painful than it should be. I think the intentions are right, the execution of what it warns on is just slightly off.

There are 4 types of joins worth considering in this discussion:

  • 1-to-1
  • many-to-1
  • 1-to-many
  • many-to-many

With equi-joins, multiple will warn on 1-to-many and many-to-many, i.e. anything with *-to-many where something in x matches multiple rows in y.

My big proposal here is that:

  • Warning on many-to-many is correct.
  • Warning on 1-to-many is wrong.

I think that many-to-1 and 1-to-many are both very useful and safe joins. But more importantly, these are essentially symmetrical styles of joins, so it doesn't make much sense to warn on one of these and not the other. Most of the time, you actually see these listed online as just 1-to-many, with the opposite being inferred, because they are so similar. It's just a matter of how you frame your question.

Our original thought process was:

It is confusing for users when a join results in more rows than there are in x.

This is what can happen in 1-to-many and many-to-many, which explains our current behavior, but I think a more appropriate assertion is:

It is confusing for users when a join can result in a (possibly cartesian) explosion of nrow(x) * nrow(y) rows.

This can only happen with many-to-many, and is worth guarding against because it typically means there is a misspecified join. This assertion is supported by three trustworthy references online:

  • IBM's database relationship guide which says this about many-to-many, suggesting they aren't a good idea:

    Each record in both tables can relate to none or any number of records in the other table. These relationships require a third table, called an associate or linking table, because relational systems cannot directly accommodate the relationship.

  • Stata's merge guide which does m:m merges slightly differently than we do, but the principle still stands, I think:

    Because m:m merges are such a bad idea, we are not going to show you an example. If you think
    that you need an m:m merge, then you probably need to work with your data so that you can use a
    1:m or m:1 merge.

  • FileMaker Pro's help docs which discuss how many-to-many relationships can always be replaced with a 3rd table that creates two 1-to-many relationships to that table.


I do think that multiple = "error" is still quite useful. It's really great when you want to enforce that you have a 1-to-1 or many-to-1 style join where y is a kind of lookup table that contains extra information you want to bind on to x without changing its rows. That is a different use case than what we are discussing here.


Implementing what I'm suggesting here would silence the join warnings seen here:
#6717 (comment)

I think that that user is doing "everything right," i.e. there is nothing wrong with those joins. Even the three-way join of:

student_term <- left_join(student, term, by = join_by(student.id))
left_join(student_term, course, by = join_by(student.id, term))

is just two 1:many joins, which I'm now convinced are normal and expected join types.


So how would we do this? I think we need an argument specifically targeting many-to-many joins, like:

#> @param many Handling of many-to-many joins. These typically indicate
#>   a structural problem in your join or data.
#> 
#>   - `"ignore"`, allow many-to-many joins.
#>   - `"warning"`, warn on many-to-many joins.
#>   - `"error"`, error on many-to-many joins.
#> 
#>   If `NULL`, will default to `"warning"` for equi joins.
#> 
#>   Not applicable for non-equi joins (NOT SURE ABOUT THIS).

many = NULL / c("ignore", "warning", "error")

We'd then change multiple to have an unconditional default of "all", and we'd quickly deprecate "warning" because it would no longer be meaningful in any way (you'd just use "error" if you were worried about that case).

And of course I'd have to figure out how to exactly detect the many-many case on the vctrs side.

This framing is also supported by a great data.table discussion about allow.cartesian Rdatatable/data.table#4383. The a) case there aligns with what multiple = "error" catches, and their b) case aligns with what many would catch. They've essentially come to the same conclusion that I have about what is needed to guard against these effectively.

I also think this is something I'd eventually be comfortable making an error by default, which was one of our goals with multiple's default originally.

@hannes
Copy link
Contributor

hannes commented Feb 17, 2023

Only warning on many-to-many makes a lot of sense to me, too!

@DavisVaughan
Copy link
Member Author

@hannes I've ended up implementing this in #6753 as a new relationship argument that allows you to specify the expected relationship between the keys of x and y, like "one_to_many" or "one_to_one". It errors if you specify a relationship that doesn't actually hold when performing the join.

The default for equality joins is "warn_many_to_many", which doesn't assume any known relationship, but will check for a many-to-many relationship along the way and will warn if it finds one, encouraging you to look back at your data or set relationship = "many_to_many" to make the relationship explicit if that is expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants