Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't resolve columns with tidyselect when tbl cannot be materialized #529

Merged
merged 5 commits into from
Mar 12, 2024

Conversation

yjunechoe
Copy link
Collaborator

@yjunechoe yjunechoe commented Mar 11, 2024

Summary

{tidyselect} requires a data frame present in order to resolve the column selections against. Therefore, this PR ensures that when create_agent(tbl = ) receives a table that cannot (yet) materialize, the validation steps dont't/can't use tidyselect.

This preserves v0.11 behavior, where use of tidyselect triggers materialization of the tbl and thus always errors if tbl cannot (yet) materialize:

# remotes::install_github("rstudio/[email protected]")
create_agent(~blah) %>% rows_distinct(columns = starts_with("x"))
#> Error: object 'blah' not found

In the PR, this now errors with a slightly more informative message:

create_agent(~blah) %>% rows_distinct(columns = starts_with("x"))
#> Error in `rows_distinct()`:
#> ! Cannot use tidyselect helpers for an undefined lazy tbl.
#> Caused by error:
#> ! object 'blah' not found
#> Run `rlang::last_trace()` to see where the error occurred.

Relatedly, the PR also allows the bypassing of tidyselect when tidyselect is not strictly necessary (e.g., when columns is a character vector). This resolves #528 by not forwarding the c(...) character vector to tidyselect at all.

agent1 <- create_agent(~undefined_df) |>
  col_exists(c("cyl", "vs", "am", "gear", "carb"))
get_agent_report(agent1, display_table = FALSE)[, 1:3]
#> # A tibble: 5 × 3
#>       i type       columns
#>   <int> <chr>      <chr>  
#> 1     1 col_exists cyl    
#> 2     2 col_exists vs     
#> 3     3 col_exists am     
#> 4     4 col_exists gear   
#> 5     5 col_exists carb

In sum, {tidyselect} support in columns is a lot more limited when the tbl cannot be materialized, but this is inevitable. This behavior already existed in v0.11 and it's just a genuine tradeoff users will have to make if they want an extreme version of laziness for their tbl.

Remaining consideration with rows_distinct()

Unfortunately, there is one regression with rows_distinct(), because in v0.12 we have the columns = everything() default, whereas in v0.11 we had the columns = NULL default. When the tbl cannot be materialized, rows_distinct() errors on v0.12 but passes through in v0.11.

More concretely put, this used to work in v0.11 but not anymore in v0.12:

agent <- create_agent(~x) %>% 
  rows_distinct()
x <- mtcars
interrogate(agent)

This behavior actually doesn't have anything to do with {tidyselect} though: rows_distinct() is exceptional in that columns = NULL becomes distinct({{ NULL }}) in interrogate(), and distinct(NULL) happens to have a special behavior of checking for row-level distinctness across all columns. In other words, NULL is never resolved to a column selection, which is also why rows_distinct() used to not show anything in the columns column of the agent report.

I have yet to add this behavior of rows_distinct() back in. Adding back support for this should be simple (we'd just intercept everything() and pass down NULL instead) - it'd just require being explicit about this exceptionalism of rows_distinct() in the code.

Also, just to reiterate, this extreme laziness will always work if columns is supplied as character vector or vars(); it's just the dynamic tidyselect expression that requires tbl to be materialize-able.

Let me know what you think!

Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@rich-iannone
Copy link
Member

Just wanted to say thanks and don't worry about the weirdness of rows_distinct()/rows_complete() (it could always be fixed up later).

@rich-iannone rich-iannone merged commit ade8a31 into rstudio:main Mar 12, 2024
12 of 13 checks passed
@yjunechoe yjunechoe deleted the tidyselect-lazy-tbl branch March 12, 2024 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conflicting validation report when using YAML in {targets} compared to interactive mode
2 participants