collapse_index on a broken index gives difficult-to-debug errors #56

md0u80c9 · 2018-03-25T10:25:19Z

Hi,

This is really feedback from an 'I shot myself in the foot and spent ages working out why/how' thing. I don't have a reprex per se - I suspect one could be made if I think about it for long enough...Apologies that I haven't had time to explore the issue further (yet).

I managed to produce a broken tibbletime index. The way I think I did this was by left joining two tibbletime tables. Because the tables started out with the same name, I renamed the index on the LHS before joining (so far so good - so our intended index is on the LHS with a new name, and the old index on the RHS has the original name and is still present but now may contain NULLs).

However I then tried to call dplyr::mutate with collapse_index on the old index name to make the collapsed index field by mistake rather than the new name.

The error you get is: Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed.

Clearly that is a long way away from where the error truly lies (ie. using an invalid index).

My thoughts:

Is this an issue with joining two tibbletime tables with valid indices? Is the index on the RHS invalidated (ie. no longer recognised as a tibbletime index) as it should?
Is this possible with a non-indexed date/time field containing NULLs? Would that result in the same error? Or would collapse_index spot that we're using an invalid index in those circumstances?
If collapse_index is able to work on any columns (which I doubt), does collapse_index need to check for the presence of NULLs before working, and give a more descriptive error?

Hope this makes sense. If I get the time I'll try to make a reprex for you of this is in action if it doesn't make sense from the above!

DavisVaughan · 2018-03-26T12:19:10Z

I think there are two problems here.

library(tibbletime)
library(dplyr)

data(FB)

FB_time <- as_tbl_time(FB, date)

head(FB_time)
#> # A time tibble: 6 x 8
#> # Index: date
#>   symbol date        open  high   low close     volume adjusted
#>   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
#> 1 FB     2013-01-02  27.4  28.2  27.4  28.0  69846400.     28.0
#> 2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600.     27.8
#> 3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400.     28.8
#> 4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800.     29.4
#> 5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300.     29.1
#> 6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700.     30.6

FB_time2 <- rename(FB_time, date2 = date)

# Notice how "date" is the index but "date2" is the column name
FB_time2
#> # A time tibble: 1,008 x 8
#> # Index: date
#>    symbol date2       open  high   low close     volume adjusted
#>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
#>  1 FB     2013-01-02  27.4  28.2  27.4  28.0  69846400.     28.0
#>  2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600.     27.8
#>  3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400.     28.8
#>  4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800.     29.4
#>  5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300.     29.1
#>  6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700.     30.6
#>  7 FB     2013-01-10  30.6  31.5  30.3  31.3  95316400.     31.3
#>  8 FB     2013-01-11  31.3  32.0  31.1  31.7  89598000.     31.7
#>  9 FB     2013-01-14  32.1  32.2  30.6  31.0  98892800.     31.0
#> 10 FB     2013-01-15  30.6  31.7  29.9  30.1 173242600.     30.1
#> # ... with 998 more rows

# Can't find the "date" index
collapse_by(FB_time2)
#> Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'double'.

# This is your problem. NA values mess up the collapse_by()
FB_time$date[1] <- NA
collapse_by(FB_time)
#> Error in mutate_impl(.data, dots): Evaluation error: missing value where TRUE/FALSE needed.

Created on 2018-03-26 by the reprex package (v0.2.0).

DavisVaughan · 2018-03-26T12:31:38Z

For the first problem of changing the name of the index, I think that every function that could run into this problem calls get_index_quo() at some point. I think it would make sense to let get_index_quo() assert that the index is still a valid column name, and error out otherwise.

DavisVaughan · 2018-03-26T12:48:57Z

Additionally, I am going to support rename() as a dplyr verb. This means that if the user does date2=date like above, the resulting object is no longer a tbl_time, but is returned as a tibble because the index column (date) has been "removed".

md0u80c9 · 2018-03-26T13:04:05Z

Is renaming the index column name itself bad? If we can detect it is a rename, would it be possible to honour the renamed index variable instead?

DavisVaughan · 2018-03-26T13:10:17Z

The only way to possibly detect it is if it came from a dplyr::rename(). Trying to capture a manual rename from a user would be pretty hard I think, and introduce a lot of randomness i might not be able to control. I would rather be consistent across the board rather than just whitelist dplyr::rename() as I think even that might be difficult.

…See #56

…names. This catches manual renaming or removal of an index and throws an error when the user next uses a tibbletime function requiring the index. Also refactor getter assertions. See #56

md0u80c9 · 2018-03-26T13:38:17Z

That’s reasonable. Is it worth /possible to warn the user that performing dplyr rename on a tbl_time outputs a tibble not a tbl_time?

Doing an explicit as_tibble forced downcast would be a lot better than coming back to the code later and trying to guess why the index doesn’t work!

DavisVaughan · 2018-03-26T13:51:16Z

I can see the value in both, and I kind of changed my mind after a bit of thinking. I added a branch that implements rename() with support for renaming the index column. It required a bit of tidyeval magic but I think it works nicely. Can you see if it works for you?

devtools::install_github("business-science/tibbletime", ref = "fix/valid-index-quo")

DavisVaughan · 2018-03-26T13:54:57Z

If it seems like it works nicely, I'll use it. If not, I'll revert back to returning a tibble. I can include a warning, but the way it currently works is to use the same system as the case where the user does select(FB_tbl_time, open). That returns a tibble too, not a tbl_time. That does not currently output a warning. Should it? It seems like the user knows what they are doing.

md0u80c9 · 2018-03-26T14:12:20Z

Hmm it might be worth warning if the output will be downcast just to avoid difficult-to-spot errors later.

If you are re-reading the code later seeing code which reads:

Dplyr::select(as_tibble(myTblTime) , foo, bar) gives you a hint you’re about to just get a tibble back,

Whereas:

Dplyr::select(myTblTime, foo, bar)

Doesn’t intuitively tell you that you’re about to lose time indexing.

Just a “warning - dplyr::select will downcast ‘myTblTime’ to a tibble. If you intend to do this, force downcast myTblTime using as_tibble” would suffice as a reminder.

md0u80c9 · 2018-03-26T14:15:04Z

(Only saying this having had a really fun session debugging my code on Sunday as you can guess!)

Dplyr code feels really obvious at the time. But things like tidyeval can make an innocuous-looking dplyr statement very tricky indeed, so might be worth just enforcing good practice.

DavisVaughan · 2018-03-26T14:22:45Z

One thing i worry about is being too noisy, especially when programming with this. I do the select() thing all the time inside package functions, knowing that it will drop the tbl_time class, but i dont want the user to see that because i am just doing it for some intermediate calculation. I could obviously opt for some kind of verbose = TRUE argument in select.tbl_time() but that feels clumsy given how clean the dplyr interface is

md0u80c9 · 2018-03-26T21:01:01Z

I think I'd compare this to doing a dplyr::join where if you fail to specify all the join parameters - you get a message which tells you what else has been used to make the join.

I think if you can downcast the object type to a tibble before using with dplyr this is a sensible way to omit the message whilst acknowledging that you know you're only going to get a tibble and not a tbl_time.

I wouldn't be keen on the verbose = TRUE because it doesn't cleanly indicate what it's hiding and I agree it would be clumsy. as_tibble(tbl_time) for the input feels cleaner and makes object type in = object type out, which reads more logically.

Perhaps the message should be cleaner: 'dplyr::select downcasts the tbl_time to a tibble. See X for more information' where X is a help page which advises which operations would take in a tbl_time and output a tibble, and advise the coding acknowledgement.

md0u80c9 · 2018-03-27T19:00:01Z

Tested with the new version this evening. We get a warning that the index has gone. I have then just reapplied a new index to the renamed column. All looks good so far :-).

DavisVaughan pushed a commit that referenced this issue Mar 26, 2018

Support dplyr::rename() to catch renamed indices and return tibbles. …

5115412

…See #56

DavisVaughan closed this as completed Dec 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

collapse_index on a broken index gives difficult-to-debug errors #56

collapse_index on a broken index gives difficult-to-debug errors #56

md0u80c9 commented Mar 25, 2018

DavisVaughan commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018 •

edited

Loading

md0u80c9 commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

md0u80c9 commented Mar 27, 2018

collapse_index on a broken index gives difficult-to-debug errors #56

collapse_index on a broken index gives difficult-to-debug errors #56

Comments

md0u80c9 commented Mar 25, 2018

DavisVaughan commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018 • edited Loading

md0u80c9 commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

DavisVaughan commented Mar 26, 2018

md0u80c9 commented Mar 26, 2018

md0u80c9 commented Mar 27, 2018

DavisVaughan commented Mar 26, 2018 •

edited

Loading