(How) should we explicitly support partial version histories? #352
Comments
We may also want to address our inability to specify by signal/measurement when we don't have version data available. E.g., we have today's version data for signal A but only yesterday's for B. I'm not sure we can do this without user help or assumptions on data; e.g., I don't think epidatr can tell us that we received data from a provider but it was all the same for some version.
Not sure I follow what "add/change/remove this measurement" means. Also, we'd need this column for every data column separately. The problem cases I ran into came out of
I think it would be sufficient for
but I'm not sure when
would be meaningfully different from the other two? Anyways, I think everything but
Here's an example where explicit NA / missing row would differ from LOCF.
library(tibble)
library(epiprocess)
#>
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#>
#> filter
x1 = tribble(~geo_value, ~time_value, ~version, ~x,
1L, 1L, 2L, 10L,
) %>% as_epi_archive()
x2 = tribble(~geo_value, ~time_value, ~version, ~x,
# value was updated to an explicit NA or the corresponding row no longer appears
# in a snapshot of the data set:
1L, 1L, 5L, NA_integer_,
) %>% as_epi_archive()
y1 = tribble(~geo_value, ~time_value, ~version, ~y,
1L, 1L, 2L, 8L,
) %>% as_epi_archive()
y2 = tribble(~geo_value, ~time_value, ~version, ~y,
# no updates = we kept the old value of 8L
#
# unrelated measurement to make archive construction easier:
1L, 1000L, 5L, 10000L,
) %>% as_epi_archive()
xy1 = epix_merge(x1, y1)
xy2 = epix_merge(x2, y2)
xy_na = epix_rbind(xy1, xy2, sync="na")
xy_locf = epix_rbind(xy1, xy2, sync="locf")
xy_desired = tribble(~geo_value, ~time_value, ~version, ~x, ~y,
1L, 1L, 2L, 10L, 8L,
1L, 1L, 5L, NA_integer_, 8L,
1L, 1000L, 5L, NA_integer_, 10000L,
) %>% as_epi_archive()
xy_na$DT
#> geo_value time_value version x y
#> 1: 1 1 2 10 8
#> 2: 1 1 5 NA NA
#> 3: 1 1000 5 NA 10000
xy_locf$DT
#> geo_value time_value version x y
#> 1: 1 1 2 10 8
#> 2: 1 1000 5 NA 10000
xy_desired$DT
#> geo_value time_value version x y
#> 1: 1 1 2 10 8
#> 2: 1 1 5 NA 8
#> 3: 1 1000 5 NA 10000
Created on 2023-07-27 with reprex v2.0.2
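To make the difference concrete, here is a rough base-R sketch of what each of the three archives above would report for the (geo_value = 1, time_value = 1) measurement as of version 5. It assumes snapshots are rebuilt by taking, for each (geo_value, time_value), the most recent update row with version at or before the requested one. `snapshot_as_of()` is a hypothetical helper written for this illustration, not an epiprocess function, and the data frames are copied by hand from the printed output above.

```r
# Update rows copied from the three archives printed above.
xy_na <- data.frame(
  geo_value = c(1L, 1L, 1L), time_value = c(1L, 1L, 1000L),
  version = c(2L, 5L, 5L), x = c(10L, NA, NA), y = c(8L, NA, 10000L)
)
xy_locf <- data.frame(
  geo_value = c(1L, 1L), time_value = c(1L, 1000L),
  version = c(2L, 5L), x = c(10L, NA), y = c(8L, 10000L)
)
xy_desired <- data.frame(
  geo_value = c(1L, 1L, 1L), time_value = c(1L, 1L, 1000L),
  version = c(2L, 5L, 5L), x = c(10L, NA, NA), y = c(8L, 8L, 10000L)
)

# Hypothetical helper (assumption, not an epiprocess function): for each
# (geo_value, time_value), keep the most recent row with version <= v.
snapshot_as_of <- function(updates, v) {
  kept <- updates[updates$version <= v, ]
  kept <- kept[order(kept$geo_value, kept$time_value, kept$version), ]
  is_latest <- !duplicated(kept[c("geo_value", "time_value")], fromLast = TRUE)
  kept[is_latest, c("geo_value", "time_value", "x", "y")]
}

# Compare the (geo_value = 1, time_value = 1) measurement as of version 5.
archives <- list(xy_na = xy_na, xy_locf = xy_locf, xy_desired = xy_desired)
for (name in names(archives)) {
  snap <- snapshot_as_of(archives[[name]], 5L)
  cat(name, "\n")
  print(snap[snap$time_value == 1L, ], row.names = FALSE)
}
```

Under that assumption, `xy_na` loses the old `y` (both `x` and `y` come back NA), `xy_locf` loses the update to `x` (it still reports 10 and 8), and only `xy_desired` reports `x` as NA with `y` still 8.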
I meant sort of like a git diff of one snapshot vs. the previous one.
Yes. I think that would also be the case with the
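A minimal base-R sketch of the "git diff of one snapshot vs. the previous one" idea, for a single measurement column. `diff_snapshots()` and its status labels are hypothetical names made up for this illustration, not part of epiprocess; the point is only that row presence is tracked separately from the value, so an explicit NA can be told apart from a row that has disappeared.

```r
# Hypothetical helper (assumption, not part of epiprocess): label each
# measurement of `value_col` as "added", "changed", "removed" or "unchanged"
# when going from snapshot `old` to snapshot `new`.
diff_snapshots <- function(old, new, value_col,
                           keys = c("geo_value", "time_value")) {
  old <- old[c(keys, value_col)]
  new <- new[c(keys, value_col)]
  # Presence markers: after the full join these are NA where a row is absent.
  old$in_old <- TRUE
  new$in_new <- TRUE
  both <- merge(old, new, by = keys, all = TRUE, suffixes = c(".old", ".new"))
  both <- both[do.call(order, unname(as.list(both[keys]))), ]
  in_old <- !is.na(both$in_old)
  in_new <- !is.na(both$in_new)
  v_old <- both[[paste0(value_col, ".old")]]
  v_new <- both[[paste0(value_col, ".new")]]
  # Two values count as the same if both are NA or both are equal non-NAs.
  same <- (is.na(v_old) & is.na(v_new)) |
    (!is.na(v_old) & !is.na(v_new) & v_old == v_new)
  both$status <- ifelse(!in_old, "added",
    ifelse(!in_new, "removed", ifelse(same, "unchanged", "changed")))
  both[c(keys, paste0(value_col, c(".old", ".new")), "status")]
}

# Made-up snapshots: time_value 2 becomes an explicit NA, time_value 3
# disappears, and time_value 4 is new.
old <- data.frame(geo_value = 1L, time_value = 1:3, x = c(10L, 20L, 30L))
new <- data.frame(geo_value = 1L, time_value = c(1L, 2L, 4L), x = c(10L, NA, 40L))
print(diff_snapshots(old, new, "x"))
```

Here the explicit NA at time_value 2 is labeled "changed" while the vanished row at time_value 3 is labeled "removed", which are the two cases a plain NA in the value column cannot tell apart. As noted earlier in the thread, a status like this would be needed for every data column separately.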
In developing epix_rbind(), @dsweber2 identified some gnarly edge cases because NAs are overloaded in epi_archives, especially if we allow epi_archives holding partial version histories. We don't currently allow partial version histories, and some other operations (besides the epix_merge() output ambiguity) may malfunction if we try to use them.
Augmenting the epi_archive format may help disambiguate sources of missingness in diff data.
One potential scheme: add an extra column per signal (like NA codes in the API) indicating for each NA measurement, whether
Or maybe something like one of the original epi_archive formats considered: flagging every measurement (NA or otherwise) with:
Some things to think about:
issues queries from epidatr, for example. And we need to think about various epix_*() functions as well.