(How) should we explicitly support partial version histories? #352

brookslogan · 2023-07-26T16:48:59Z

In developing epix_rbind() @dsweber2 identified some gnarly edge cases because NAs are overloaded in epi_archives, especially if we allow epi_archives holding partial version histories. We don't currently allow partial version histories, and some other operations (besides the epix_merge() output ambiguity) may malfunction if we try to use them.

Augmenting the epi_archive format may help disambiguate sources of missingness in diff data.

One potential scheme: add an extra column per signal (like NA codes in the API) indicating for each NA measurement, whether

it's an explicit NA
it's a missing row, or
it should be LOCF'd from the previous version

Or maybe something like one of the original epi_archive formats considered: flagging every measurement (NA or otherwise) with:

add this measurement
change this measurement
remove this measurement

Some things to think about:

How do we determine these flags? I'm not sure we get add/change/remove information out of issues queries from epidatr, for example. And we need to think about various epix_*() functions as well.
Would the user ever need to specify these flags, or would we be able to add these on automatically?
What if we try combining version histories randomly from different parts of history, or haphazardly alternating between doing rbinds and merges to arrive at a full archive? Will we need some extra metadata? Will it blow the entire thing up? Can we put some restrictions in to disallow any tricky cases?
Performance implications: time and memory cost of dealing with this. Do/can we make it opt-in if it's expensive?

The text was updated successfully, but these errors were encountered:

brookslogan · 2023-07-26T17:13:46Z

We may also want to address our inability to specify by signal/measurement when we don't have version data available. E.g., we have today's version data for signal A but only yesterday's for B. I'm not sure we can do this without user help or assumptions on data; e.g., I don't think epidatr can tell us that we received data from a provider but it was all the same for some version.

dsweber2 · 2023-07-26T21:27:39Z

Not sure I follow what the add/change/remove this measurement mean. Also, we'd need this column for every data column separately.

The problem cases I ran into came out of epix_merge; it arises when there are time_values in one column not present in another in that particular sub-archive. As far as cases I've thought about, we can more or less build the needed behavior into epix_rbind, and let the user decide whether they want to overwrite or not. The only edge case that would escape that I can think of:

y2 <- tibble(
        ~geo_value, ~time_value, ~version, ~y_value,
        "g1",           1L,           5, NA,
        "g1",           1L,           6, "YB6",
      )
x2 <- tibble(
        ~geo_value, ~time_value, ~version, ~y_value,
        "g1",           1L,           4:6, paste0("XA", 4:6),
      )
x1 <- tibble(
        ~geo_value, ~time_value, ~version, ~x_value,
        "g1",           1L,           1:3, paste0("XA", 1:3),
      )
y1 <- tibble(
        ~geo_value, ~time_value, ~version, ~y_value,
        "g1",           1L,           1:3, paste0("YA", 1:3),
      )
first <- epix_merge(x1, y1)
second <- epix_merge(x2, y2)
epix_rbind(first, second)

y here has an odd history, where there is initially data, which changes to an NA, but is then re-released with a value at a later issue. Then a "real" NA would be overwritten (specifically, the one for version=5), in addition to the non-real one. This seems niche to me, so I'm not sure if we'd want to support it.

I think it would be sufficient for epix_merge to output an extra column per input column indicating whether it added a given NA; this could be a boolean, so it'd be pretty space efficient. This would cover

it's an explicit NA, it's a missing row

but I'm not sure when

it should be LOCF'd from the previous version

would be meaningfully different from the other two?

Anyways, I think everything but epix_rbind would be able to ignore this flag and treat NA's as NA's regardless (more or less). Then its up to epix_rbind whether to overwrite the ones where this flag is true, though I have a hard time seeing why a user wouldn't want to.

brookslogan · 2023-07-27T23:55:05Z

Here's an example where explicit NA / missing row would differ from LOCF.

epix_merge() would need to consider the version range of the archive to decide whether things should be flagged as the latter or the former
- (We'd probably need to have some rule to make this work like "archives can contain data for partial issue ranges, but it must cover the same issue range for every epikey and value column".... or something like that.)
epix_rbind() would need to consider these flags, performing some LOCF, unless we lazily do this LOCF later when snapshots are requested in epix_as_of() or epix_slide().
I'm wondering if we should just make these post-merge flags part of the archive format to begin with. ~~If we do, I'm thinking we might be able to get rid of extra parameters in epix_rbind() that would otherwise be providing the same information at a different point in time.~~ [On second thought, I'm unsure. I'm having trouble wrapping my mind around this situation, and feel like the "types of NAs" might be a useful clarifying concept if we could make it work. Or maybe forbidding some types of NAs so we can make more assumptions...]

library(tibble)
library(epiprocess)
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
x1 = tribble(~geo_value, ~time_value, ~version,           ~x,
                     1L,          1L,       2L,          10L,
) %>% as_epi_archive()
x2 = tribble(~geo_value, ~time_value, ~version,           ~x,
# value was updated to an explicit NA or the corresponding row no longer appears
# in a snapshot of the data set:
                     1L,          1L,       5L,  NA_integer_,
) %>% as_epi_archive()
y1 = tribble(~geo_value, ~time_value, ~version,           ~y,
                     1L,          1L,       2L,           8L,
) %>% as_epi_archive()
y2 = tribble(~geo_value, ~time_value, ~version,           ~y,
# no updates = we kept the old value of 8L
#
# unrelated measurement to make archive construction easier:
                     1L,       1000L,       5L,       10000L,
) %>% as_epi_archive()
xy1 = epix_merge(x1, y1)
xy2 = epix_merge(x2, y2)
xy_na = epix_rbind(xy1, xy2, sync="na")
xy_locf = epix_rbind(xy1, xy2, sync="locf")
xy_desired = tribble(~geo_value, ~time_value, ~version,           ~x,     ~y,
                             1L,          1L,       2L,          10L,     8L,
                             1L,          1L,       5L,  NA_integer_,     8L,
                             1L,       1000L,       5L,  NA_integer_, 10000L,
) %>% as_epi_archive()
xy_na$DT
#>    geo_value time_value version  x     y
#> 1:         1          1       2 10     8
#> 2:         1          1       5 NA    NA
#> 3:         1       1000       5 NA 10000
xy_locf$DT
#>    geo_value time_value version  x     y
#> 1:         1          1       2 10     8
#> 2:         1       1000       5 NA 10000
xy_desired$DT
#>    geo_value time_value version  x     y
#> 1:         1          1       2 10     8
#> 2:         1          1       5 NA     8
#> 3:         1       1000       5 NA 10000

^{Created on 2023-07-27 with reprex v2.0.2}

brookslogan · 2023-07-28T16:35:20Z

Not sure I follow what the add/change/remove this measurement mean.

I meant sort of like a git diff of one snapshot vs. the previous one.

Add = there was no value/row for this epikey-time-signal in the previous snapshot, but there is one in the current. The value could be an explicit NA.
Change = there was a value/row, but the value's different in the current snapshot. The value could be an explicit NA.
- If the first version we have available for a measurement is a "change", then we're working with a partial archive that's missing some data from previous versions, and we can't produce as_of data for the current version without epix_rbinding on the missing history.
Remove = there was a value/row in the previous snapshot, but in the current, there's no value/row. (Implicit NA.)

Also, we'd need this column for every data column separately.

Yes. I think that would also be the case with the epix_merge() output flags you proposed. (Maybe there's some chance ALTREPs would automatically save us space here in the simple case where the flag is the same for every row of the archive, or we could try to force ALTREPs if this is the case.)

brookslogan added bug Something isn't working question Further information is requested performance labels Jul 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(How) should we explicitly support partial version histories? #352

(How) should we explicitly support partial version histories? #352

brookslogan commented Jul 26, 2023

brookslogan commented Jul 26, 2023

dsweber2 commented Jul 26, 2023

brookslogan commented Jul 27, 2023 •

edited

Loading

brookslogan commented Jul 28, 2023

(How) should we explicitly support partial version histories? #352

(How) should we explicitly support partial version histories? #352

Comments

brookslogan commented Jul 26, 2023

brookslogan commented Jul 26, 2023

dsweber2 commented Jul 26, 2023

brookslogan commented Jul 27, 2023 • edited Loading

brookslogan commented Jul 28, 2023

brookslogan commented Jul 27, 2023 •

edited

Loading