On `multiple` matches, and how we can improve #6731

DavisVaughan · 2023-02-16T14:05:12Z

After reading #6717, I've gone back to the drawing board to see if we are missing something with how multiple works.

I am now convinced that we have gotten multiple slightly wrong, making it more painful than it should be. I think the intentions are right, the execution of what it warns on is just slightly off.

There are 4 types of joins worth considering in this discussion:

1-to-1
many-to-1
1-to-many
many-to-many

With equi-joins, multiple will warn on 1-to-many and many-to-many, i.e. anything with *-to-many where something in x matches multiple rows in y.

My big proposal here is that:

Warning on many-to-many is correct.
Warning on 1-to-many is wrong.

I think that many-to-1 and 1-to-many are both very useful and safe joins. But more importantly, these are essentially symmetrical styles of joins, so it doesn't make much sense to warn on one of these and not the other. Most of the time, you actually see these listed online as just 1-to-many, with the opposite being inferred, because they are so similar. It's just a matter of how you frame your question.

Our original thought process was:

It is confusing for users when a join results in more rows than there are in x.

This is what can happen in 1-to-many and many-to-many, which explains our current behavior, but I think a more appropriate assertion is:

It is confusing for users when a join can result in a (possibly cartesian) explosion of nrow(x) * nrow(y) rows.

This can only happen with many-to-many, and is worth guarding against because it typically means there is a misspecified join. This assertion is supported by three trustworthy references online:

IBM's database relationship guide which says this about many-to-many, suggesting they aren't a good idea:

Each record in both tables can relate to none or any number of records in the other table. These relationships require a third table, called an associate or linking table, because relational systems cannot directly accommodate the relationship.
Stata's merge guide which does m:m merges slightly differently than we do, but the principle still stands, I think:

Because m:m merges are such a bad idea, we are not going to show you an example. If you think
that you need an m:m merge, then you probably need to work with your data so that you can use a
1:m or m:1 merge.
FileMaker Pro's help docs which discuss how many-to-many relationships can always be replaced with a 3rd table that creates two 1-to-many relationships to that table.

I do think that multiple = "error" is still quite useful. It's really great when you want to enforce that you have a 1-to-1 or many-to-1 style join where y is a kind of lookup table that contains extra information you want to bind on to x without changing its rows. That is a different use case than what we are discussing here.

Implementing what I'm suggesting here would silence the join warnings seen here:
#6717 (comment)

I think that that user is doing "everything right," i.e. there is nothing wrong with those joins. Even the three-way join of:

student_term <- left_join(student, term, by = join_by(student.id))
left_join(student_term, course, by = join_by(student.id, term))

is just two 1:many joins, which I'm now convinced are normal and expected join types.

So how would we do this? I think we need an argument specifically targeting many-to-many joins, like:

#> @param many Handling of many-to-many joins. These typically indicate
#>   a structural problem in your join or data.
#> 
#>   - `"ignore"`, allow many-to-many joins.
#>   - `"warning"`, warn on many-to-many joins.
#>   - `"error"`, error on many-to-many joins.
#> 
#>   If `NULL`, will default to `"warning"` for equi joins.
#> 
#>   Not applicable for non-equi joins (NOT SURE ABOUT THIS).

many = NULL / c("ignore", "warning", "error")

We'd then change multiple to have an unconditional default of "all", and we'd quickly deprecate "warning" because it would no longer be meaningful in any way (you'd just use "error" if you were worried about that case).

And of course I'd have to figure out how to exactly detect the many-many case on the vctrs side.

This framing is also supported by a great data.table discussion about allow.cartesian Rdatatable/data.table#4383. The a) case there aligns with what multiple = "error" catches, and their b) case aligns with what many would catch. They've essentially come to the same conclusion that I have about what is needed to guard against these effectively.

I also think this is something I'd eventually be comfortable making an error by default, which was one of our goals with multiple's default originally.

The text was updated successfully, but these errors were encountered:

hannes · 2023-02-17T12:43:26Z

Only warning on many-to-many makes a lot of sense to me, too!

DavisVaughan · 2023-02-22T18:44:32Z

@hannes I've ended up implementing this in #6753 as a new relationship argument that allows you to specify the expected relationship between the keys of x and y, like "one_to_many" or "one_to_one". It errors if you specify a relationship that doesn't actually hold when performing the join.

The default for equality joins is "warn_many_to_many", which doesn't assume any known relationship, but will check for a many-to-many relationship along the way and will warn if it finds one, encouraging you to look back at your data or set relationship = "many_to_many" to make the relationship explicit if that is expected.

eipi10 mentioned this issue Feb 16, 2023

Is it possible to set an option to globally turn off the multiple matches warning on joins? #6717

Closed

This was referenced Feb 21, 2023

Advise on how to silence the multiple = "warning" message #6269

Merged

Implement relationship for vec_locate_matches() r-lib/vctrs#1791

Merged

Implement relationship and refine join match warning #6753

Merged

DavisVaughan added this to the 1.1.1 milestone Feb 23, 2023

DavisVaughan closed this as completed in #6753 Feb 24, 2023

cynthiahqy mentioned this issue Nov 10, 2023

Feature request: Could the relationship argument for multiple matches check by group? #6963

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

On `multiple` matches, and how we can improve #6731

On `multiple` matches, and how we can improve #6731

DavisVaughan commented Feb 16, 2023 •

edited

Loading

hannes commented Feb 17, 2023

DavisVaughan commented Feb 22, 2023

On multiple matches, and how we can improve #6731

On multiple matches, and how we can improve #6731

Comments

DavisVaughan commented Feb 16, 2023 • edited Loading

hannes commented Feb 17, 2023

DavisVaughan commented Feb 22, 2023

On `multiple` matches, and how we can improve #6731

On `multiple` matches, and how we can improve #6731

DavisVaughan commented Feb 16, 2023 •

edited

Loading