Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement relationship for vec_locate_matches() #1791

Merged
merged 3 commits into from
Feb 22, 2023

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Feb 21, 2023

Related to tidyverse/dplyr#6731

@lionel- if you just want to do a high level review, that would be ok with me

This PR implements relationship, a new argument for vec_locate_matches() that supersedes and expands upon multiple = "error" / "warning".

relationship is based on database table relationships and will get exposed in dplyr as well. It has a number of options:

  • "none" (will be the default for dplyr inequality joins)
  • "one_to_one"
  • "one_to_many"
  • "many_to_one"
  • "many_to_many"
  • "warn_many_to_many" (will be the default for dplyr equality joins, and likely rolling joins)

The dplyr default will be NULL, which will choose between "none" and "warn_many_to_many" automatically.

relationship adds a constraint on the matching procedure, for example, "many_to_one" says that:

  • Each element of needles can match at most 1 value in haystack
  • Each element of haystack can match any number of values in needles

This "many_to_one" option is then a perfect replacement for multiple = "error", which I really like because now multiple is purely about optionally reducing the number of matches, not about erroring.

Notably, relationship is not a uniqueness check on the inputs up front. It is instead a check on the actual matched results. I think this better aligns with the SQL definitions. It means that you can do relationship = "one_to_one" and have duplicate needles and not get an error if they don't match anything. If that is a problem then you are supposed to use no_match to handle the "unmatched needles" case (or unmatched on the dplyr side):

x <- c(1, 1, 2)

# fine
vec_locate_matches(x, 2, relationship = "one_to_one")
#>   needles haystack
#> 1       1       NA
#> 2       2       NA
#> 3       3        1

# error
vec_locate_matches(x, 1, relationship = "one_to_one")
#> Error in `vec_locate_matches()`:
#> ! Each element can match at most 1 observation.
#> ✖ The element at location 1 has multiple matches.

# handling unmatched `needles`
vec_locate_matches(x, 2, relationship = "one_to_one", no_match = "error")
#> Error in `vec_locate_matches()`:
#> ! Each element must have a match.
#> ✖ The element at location 1 does not have a match.

I've gone ahead and silently soft-deprecated multiple = "error" and multiple = "warning" by removing them from the documentation and by moving all their tests to a deprecation section.

And silently soft-deprecate `multiple = "error"` and `multiple = "warning"`
Comment on lines -111 to -113
#' - `"warning"` throws a warning if multiple matches are detected, but
#' otherwise falls back to `"all"`.
#' - `"error"` throws an error if multiple matches are detected.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed from the docs. No formal deprecation with deprecate_soft() yet, I'd like to remove usage of them from dplyr and ivs first

Comment on lines +130 to +135
#' - `"many_to_many"` expects:
#' - Each value in `needles` matches any number of values in `haystack`.
#' - Each value in `haystack` matches any number of values in `needles`.
#'
#' This performs no checks, and is identical to `"none"`, but is provided to
#' allow you to be explicit about this relationship if you know it exists.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made sure to note that many_to_many doesn't really do anything

Comment on lines +1678 to 1685
// Check is always needed for `multiple = "all"`.
// This also handles `relationship` options too, since if `multiple` is
// `"any"`, `"first"`, or `"last"`, we can't invalidate a `relationship`.
bool check_multiple_needles =
multiple == VCTRS_MULTIPLE_all ||
// TODO: Remove deprecated support for `multiple = "error"/"warning"`
multiple == VCTRS_MULTIPLE_error ||
multiple == VCTRS_MULTIPLE_warning;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figuring out if we need to check for multiple matches in needles is much simpler than for haystack.

We basically just need to check if multiple = "all", regardless of the value of relationship.

If multiple = "all", we have to check for multiple matches in needles no matter what because we do a sort at the end if we detect multiple matches.

If multiple %in% c("any", "first", "last") then we don't need to check for multiple matches in needles, because by definition they will be reduced down to 1 match through multiple.

Comment on lines +1869 to +1884
// TODO: Remove deprecated support for `multiple = "error"/"warning"`
case VCTRS_MULTIPLE_error:
stop_matches_multiple(
loc_first_multiple_needles,
needles_arg,
haystack_arg,
error_call
);
case VCTRS_MULTIPLE_warning: {
warn_matches_multiple(
loc_first_multiple_needles,
needles_arg,
haystack_arg,
error_call
);
break;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Squeezing in some backwards compatible support for multiple = "error" / "warning" here in the "we detected multiple matches in needles" case

Comment on lines +1867 to +1868
case VCTRS_MULTIPLE_all:
break;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case corresponds to how check_multiple_needles is defined, where multiple = "all" is the main case where we want to know if there are multiple matches in needles, but we don't error or warn or anything.

Comment on lines +2015 to +2057
if (track_matches_haystack) {
if (check_multiple_haystack) {
// `true` if a match already existed
any_multiple_haystack = v_detect_matches_haystack[loc_haystack];

if (any_multiple_haystack) {
loc_first_multiple_haystack = loc_haystack;

switch (relationship) {
case VCTRS_RELATIONSHIP_one_to_one:
stop_matches_relationship_one_to_one(
loc_first_multiple_haystack,
"haystack",
needles_arg,
haystack_arg,
error_call
);
case VCTRS_RELATIONSHIP_one_to_many:
stop_matches_relationship_one_to_many(
loc_first_multiple_haystack,
needles_arg,
haystack_arg,
error_call
);
case VCTRS_RELATIONSHIP_warn_many_to_many: {
if (any_multiple_needles) {
warn_matches_relationship_many_to_many(
loc_first_multiple_needles,
loc_first_multiple_haystack,
needles_arg,
haystack_arg,
error_call
);
}

// We know there are multiple and don't need to continue checking
check_multiple_haystack = false;

// Only continue tracking if needed for `remaining`
track_matches_haystack = retain_remaining_haystack;

break;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basic idea here is that as we process the haystack matches, we record the location in haystack that the match corresponds to in v_detect_matches_haystack. If we see that location more than once, that corresponds to a multiple match and we take the appropriate action based on relationship.

If the "action" is to just throw a warning, then we throw it once and after that we typically no longer need to check for multiple haystack matches.

The exception is if remaining = "error" / <int> is set, in which case we still have to track usage of all of the haystack locations to figure out which locations weren't used at all (i.e. they are "unmatched" in dplyr join terminology). So that is what track_matches_haystack = retain_remaining_haystack; is for.

Comment on lines +187 to +201
(expect_error(vec_locate_matches(c(2, 1), c(1, 1), relationship = "one_to_one"))
)
Output
<error/vctrs_error_matches_relationship_one_to_one>
Error in `vec_locate_matches()`:
! Each element can match at most 1 observation.
x The element at location 2 has multiple matches.
Code
(expect_error(vec_locate_matches(c(1, 1), c(1, 2), relationship = "one_to_one"))
)
Output
<error/vctrs_error_matches_relationship_one_to_one>
Error in `vec_locate_matches()`:
! Each element can match at most 1 observation.
x The element at location 1 has multiple matches.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is one place I think it is confusing that we have needles_arg = "" and haystack_arg = "" as defaults. One of these errors is related to needles and one is related to haystack but the errors are the same.

I think we should default them to "needles" and "haystack" and then completely disallow the empty string of "" as a possible arg name in vec_locate_matches(). This would also greatly simplify my error generation code on the R side

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think taking "" args is probably a remnant of the early days before we had developed our approach to error reporting. We probably didn't have error_call arguments at the time, so we were hiding the call and it made sense to hide the arguments as well.

Now that we show the call, we should probably start naming the arguments by default as you suggest.

Comment on lines +225 to +235
# `relationship` handles warn-many-to-many case

Code
(expect_warning(vec_locate_matches(c(1, 2, 1), c(1, 2, 2), relationship = "warn_many_to_many"))
)
Output
<warning/vctrs_warning_matches_relationship_many_to_many>
Warning in `vec_locate_matches()`:
Detected an unexpected many-to-many relationship.
x The element at location 2 has multiple matches.
x The element at location 1 has multiple matches.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The many-to-many warning is another place the needles_arg / haystack_arg defaults are particularly confusing. The first line relates to needles and the second is for haystack

@DavisVaughan DavisVaughan requested a review from lionel- February 21, 2023 23:44
Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a nice improvement!

#' This performs no checks, and is identical to `"none"`, but is provided to
#' allow you to be explicit about this relationship if you know it exists.
#'
#' - `"warn_many_to_many"` doesn't assume there is any known relationship, but
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should warn be a suffix instead of a prefix? I think that's what I'd have expected to make it clearer it's a variant.

Copy link
Member Author

@DavisVaughan DavisVaughan Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't really decide, because it's not like "many_to_many" errors and "many_to_many_warn" would just downgrade that error to a warning.

They do fairly different things:

  • "many_to_many" is looking for violations of the many-to-many relationship, i.e. many-to-many is expected. But that relationship can't be violated so it ends up being an expressive no-op (it makes more sense for "many_to_one" and the other variants)

  • "warn_many_to_many" is looking for the existance of a many-to-many relationship, i.e. many-to-many is not expected, and it warns if it finds one.

So they didn't really feel like variants to me

I'm not picky about this though, so if you still feel like the suffix is better then I'll change it, but I'll wait to hear what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion either way, we can go with your first intuition.

Comment on lines +187 to +201
(expect_error(vec_locate_matches(c(2, 1), c(1, 1), relationship = "one_to_one"))
)
Output
<error/vctrs_error_matches_relationship_one_to_one>
Error in `vec_locate_matches()`:
! Each element can match at most 1 observation.
x The element at location 2 has multiple matches.
Code
(expect_error(vec_locate_matches(c(1, 1), c(1, 2), relationship = "one_to_one"))
)
Output
<error/vctrs_error_matches_relationship_one_to_one>
Error in `vec_locate_matches()`:
! Each element can match at most 1 observation.
x The element at location 1 has multiple matches.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think taking "" args is probably a remnant of the early days before we had developed our approach to error reporting. We probably didn't have error_call arguments at the time, so we were hiding the call and it made sense to hide the arguments as well.

Now that we show the call, we should probably start naming the arguments by default as you suggest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants