Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collapse_index on a broken index gives difficult-to-debug errors #56

Closed
md0u80c9 opened this issue Mar 25, 2018 · 13 comments
Closed

collapse_index on a broken index gives difficult-to-debug errors #56

md0u80c9 opened this issue Mar 25, 2018 · 13 comments

Comments

@md0u80c9
Copy link

Hi,

This is really feedback from an 'I shot myself in the foot and spent ages working out why/how' thing. I don't have a reprex per se - I suspect one could be made if I think about it for long enough...Apologies that I haven't had time to explore the issue further (yet).

I managed to produce a broken tibbletime index. The way I think I did this was by left joining two tibbletime tables. Because the tables started out with the same name, I renamed the index on the LHS before joining (so far so good - so our intended index is on the LHS with a new name, and the old index on the RHS has the original name and is still present but now may contain NULLs).

However I then tried to call dplyr::mutate with collapse_index on the old index name to make the collapsed index field by mistake rather than the new name.

The error you get is: Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed.

Clearly that is a long way away from where the error truly lies (ie. using an invalid index).

My thoughts:

  • Is this an issue with joining two tibbletime tables with valid indices? Is the index on the RHS invalidated (ie. no longer recognised as a tibbletime index) as it should?
  • Is this possible with a non-indexed date/time field containing NULLs? Would that result in the same error? Or would collapse_index spot that we're using an invalid index in those circumstances?
  • If collapse_index is able to work on any columns (which I doubt), does collapse_index need to check for the presence of NULLs before working, and give a more descriptive error?

Hope this makes sense. If I get the time I'll try to make a reprex for you of this is in action if it doesn't make sense from the above!

@DavisVaughan
Copy link
Collaborator

I think there are two problems here.

library(tibbletime)
library(dplyr)

data(FB)

FB_time <- as_tbl_time(FB, date)

head(FB_time)
#> # A time tibble: 6 x 8
#> # Index: date
#>   symbol date        open  high   low close     volume adjusted
#>   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
#> 1 FB     2013-01-02  27.4  28.2  27.4  28.0  69846400.     28.0
#> 2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600.     27.8
#> 3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400.     28.8
#> 4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800.     29.4
#> 5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300.     29.1
#> 6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700.     30.6

FB_time2 <- rename(FB_time, date2 = date)

# Notice how "date" is the index but "date2" is the column name
FB_time2
#> # A time tibble: 1,008 x 8
#> # Index: date
#>    symbol date2       open  high   low close     volume adjusted
#>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
#>  1 FB     2013-01-02  27.4  28.2  27.4  28.0  69846400.     28.0
#>  2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600.     27.8
#>  3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400.     28.8
#>  4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800.     29.4
#>  5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300.     29.1
#>  6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700.     30.6
#>  7 FB     2013-01-10  30.6  31.5  30.3  31.3  95316400.     31.3
#>  8 FB     2013-01-11  31.3  32.0  31.1  31.7  89598000.     31.7
#>  9 FB     2013-01-14  32.1  32.2  30.6  31.0  98892800.     31.0
#> 10 FB     2013-01-15  30.6  31.7  29.9  30.1 173242600.     30.1
#> # ... with 998 more rows

# Can't find the "date" index
collapse_by(FB_time2)
#> Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'double'.

# This is your problem. NA values mess up the collapse_by()
FB_time$date[1] <- NA
collapse_by(FB_time)
#> Error in mutate_impl(.data, dots): Evaluation error: missing value where TRUE/FALSE needed.

Created on 2018-03-26 by the reprex package (v0.2.0).

@DavisVaughan
Copy link
Collaborator

For the first problem of changing the name of the index, I think that every function that could run into this problem calls get_index_quo() at some point. I think it would make sense to let get_index_quo() assert that the index is still a valid column name, and error out otherwise.

@DavisVaughan
Copy link
Collaborator

Additionally, I am going to support rename() as a dplyr verb. This means that if the user does date2=date like above, the resulting object is no longer a tbl_time, but is returned as a tibble because the index column (date) has been "removed".

@md0u80c9
Copy link
Author

Is renaming the index column name itself bad? If we can detect it is a rename, would it be possible to honour the renamed index variable instead?

@DavisVaughan
Copy link
Collaborator

The only way to possibly detect it is if it came from a dplyr::rename(). Trying to capture a manual rename from a user would be pretty hard I think, and introduce a lot of randomness i might not be able to control. I would rather be consistent across the board rather than just whitelist dplyr::rename() as I think even that might be difficult.

DavisVaughan pushed a commit that referenced this issue Mar 26, 2018
…names. This catches manual renaming or removal of an index and throws an error when the user next uses a tibbletime function requiring the index. Also refactor getter assertions. See #56
@md0u80c9
Copy link
Author

That’s reasonable. Is it worth /possible to warn the user that performing dplyr rename on a tbl_time outputs a tibble not a tbl_time?

Doing an explicit as_tibble forced downcast would be a lot better than coming back to the code later and trying to guess why the index doesn’t work!

@DavisVaughan
Copy link
Collaborator

I can see the value in both, and I kind of changed my mind after a bit of thinking. I added a branch that implements rename() with support for renaming the index column. It required a bit of tidyeval magic but I think it works nicely. Can you see if it works for you?

devtools::install_github("business-science/tibbletime", ref = "fix/valid-index-quo")

@DavisVaughan
Copy link
Collaborator

DavisVaughan commented Mar 26, 2018

If it seems like it works nicely, I'll use it. If not, I'll revert back to returning a tibble. I can include a warning, but the way it currently works is to use the same system as the case where the user does select(FB_tbl_time, open). That returns a tibble too, not a tbl_time. That does not currently output a warning. Should it? It seems like the user knows what they are doing.

@md0u80c9
Copy link
Author

Hmm it might be worth warning if the output will be downcast just to avoid difficult-to-spot errors later.

If you are re-reading the code later seeing code which reads:

Dplyr::select(as_tibble(myTblTime) , foo, bar) gives you a hint you’re about to just get a tibble back,

Whereas:

Dplyr::select(myTblTime, foo, bar)

Doesn’t intuitively tell you that you’re about to lose time indexing.

Just a “warning - dplyr::select will downcast ‘myTblTime’ to a tibble. If you intend to do this, force downcast myTblTime using as_tibble” would suffice as a reminder.

@md0u80c9
Copy link
Author

(Only saying this having had a really fun session debugging my code on Sunday as you can guess!)

Dplyr code feels really obvious at the time. But things like tidyeval can make an innocuous-looking dplyr statement very tricky indeed, so might be worth just enforcing good practice.

@DavisVaughan
Copy link
Collaborator

One thing i worry about is being too noisy, especially when programming with this. I do the select() thing all the time inside package functions, knowing that it will drop the tbl_time class, but i dont want the user to see that because i am just doing it for some intermediate calculation. I could obviously opt for some kind of verbose = TRUE argument in select.tbl_time() but that feels clumsy given how clean the dplyr interface is

@md0u80c9
Copy link
Author

I think I'd compare this to doing a dplyr::join where if you fail to specify all the join parameters - you get a message which tells you what else has been used to make the join.

I think if you can downcast the object type to a tibble before using with dplyr this is a sensible way to omit the message whilst acknowledging that you know you're only going to get a tibble and not a tbl_time.

I wouldn't be keen on the verbose = TRUE because it doesn't cleanly indicate what it's hiding and I agree it would be clumsy. as_tibble(tbl_time) for the input feels cleaner and makes object type in = object type out, which reads more logically.

Perhaps the message should be cleaner: 'dplyr::select downcasts the tbl_time to a tibble. See X for more information' where X is a help page which advises which operations would take in a tbl_time and output a tibble, and advise the coding acknowledgement.

@md0u80c9
Copy link
Author

Tested with the new version this evening. We get a warning that the index has gone. I have then just reapplied a new index to the renamed column. All looks good so far :-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants