Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(How) should we explicitly support partial version histories? #352

Open
brookslogan opened this issue Jul 26, 2023 · 4 comments
Open

(How) should we explicitly support partial version histories? #352

brookslogan opened this issue Jul 26, 2023 · 4 comments
Labels
bug Something isn't working performance question Further information is requested

Comments

@brookslogan
Copy link
Contributor

In developing epix_rbind() @dsweber2 identified some gnarly edge cases because NAs are overloaded in epi_archives, especially if we allow epi_archives holding partial version histories. We don't currently allow partial version histories, and some other operations (besides the epix_merge() output ambiguity) may malfunction if we try to use them.

Augmenting the epi_archive format may help disambiguate sources of missingness in diff data.

One potential scheme: add an extra column per signal (like NA codes in the API) indicating for each NA measurement, whether

  • it's an explicit NA
  • it's a missing row, or
  • it should be LOCF'd from the previous version

Or maybe something like one of the original epi_archive formats considered: flagging every measurement (NA or otherwise) with:

  • add this measurement
  • change this measurement
  • remove this measurement

Some things to think about:

  • How do we determine these flags? I'm not sure we get add/change/remove information out of issues queries from epidatr, for example. And we need to think about various epix_*() functions as well.
  • Would the user ever need to specify these flags, or would we be able to add these on automatically?
  • What if we try combining version histories randomly from different parts of history, or haphazardly alternating between doing rbinds and merges to arrive at a full archive? Will we need some extra metadata? Will it blow the entire thing up? Can we put some restrictions in to disallow any tricky cases?
  • Performance implications: time and memory cost of dealing with this. Do/can we make it opt-in if it's expensive?
@brookslogan brookslogan added bug Something isn't working question Further information is requested performance labels Jul 26, 2023
@brookslogan
Copy link
Contributor Author

We may also want to address our inability to specify by signal/measurement when we don't have version data available. E.g., we have today's version data for signal A but only yesterday's for B. I'm not sure we can do this without user help or assumptions on data; e.g., I don't think epidatr can tell us that we received data from a provider but it was all the same for some version.

@dsweber2
Copy link
Contributor

Not sure I follow what the add/change/remove this measurement mean. Also, we'd need this column for every data column separately.

The problem cases I ran into came out of epix_merge; it arises when there are time_values in one column not present in another in that particular sub-archive. As far as cases I've thought about, we can more or less build the needed behavior into epix_rbind, and let the user decide whether they want to overwrite or not. The only edge case that would escape that I can think of:

y2 <- tibble(
        ~geo_value, ~time_value, ~version, ~y_value,
        "g1",           1L,           5, NA,
        "g1",           1L,           6, "YB6",
      )
x2 <- tibble(
        ~geo_value, ~time_value, ~version, ~y_value,
        "g1",           1L,           4:6, paste0("XA", 4:6),
      )
x1 <- tibble(
        ~geo_value, ~time_value, ~version, ~x_value,
        "g1",           1L,           1:3, paste0("XA", 1:3),
      )
y1 <- tibble(
        ~geo_value, ~time_value, ~version, ~y_value,
        "g1",           1L,           1:3, paste0("YA", 1:3),
      )
first <- epix_merge(x1, y1)
second <- epix_merge(x2, y2)
epix_rbind(first, second)

y here has an odd history, where there is initially data, which changes to an NA, but is then re-released with a value at a later issue. Then a "real" NA would be overwritten (specifically, the one for version=5), in addition to the non-real one. This seems niche to me, so I'm not sure if we'd want to support it.

I think it would be sufficient for epix_merge to output an extra column per input column indicating whether it added a given NA; this could be a boolean, so it'd be pretty space efficient. This would cover

it's an explicit NA, it's a missing row

but I'm not sure when

it should be LOCF'd from the previous version

would be meaningfully different from the other two?

Anyways, I think everything but epix_rbind would be able to ignore this flag and treat NA's as NA's regardless (more or less). Then its up to epix_rbind whether to overwrite the ones where this flag is true, though I have a hard time seeing why a user wouldn't want to.

@brookslogan
Copy link
Contributor Author

brookslogan commented Jul 27, 2023

Here's an example where explicit NA / missing row would differ from LOCF.

  • epix_merge() would need to consider the version range of the archive to decide whether things should be flagged as the latter or the former
    • (We'd probably need to have some rule to make this work like "archives can contain data for partial issue ranges, but it must cover the same issue range for every epikey and value column".... or something like that.)
  • epix_rbind() would need to consider these flags, performing some LOCF, unless we lazily do this LOCF later when snapshots are requested in epix_as_of() or epix_slide().
  • I'm wondering if we should just make these post-merge flags part of the archive format to begin with. If we do, I'm thinking we might be able to get rid of extra parameters in epix_rbind() that would otherwise be providing the same information at a different point in time. [On second thought, I'm unsure. I'm having trouble wrapping my mind around this situation, and feel like the "types of NAs" might be a useful clarifying concept if we could make it work. Or maybe forbidding some types of NAs so we can make more assumptions...]
library(tibble)
library(epiprocess)
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
x1 = tribble(~geo_value, ~time_value, ~version,           ~x,
                     1L,          1L,       2L,          10L,
) %>% as_epi_archive()
x2 = tribble(~geo_value, ~time_value, ~version,           ~x,
# value was updated to an explicit NA or the corresponding row no longer appears
# in a snapshot of the data set:
                     1L,          1L,       5L,  NA_integer_,
) %>% as_epi_archive()
y1 = tribble(~geo_value, ~time_value, ~version,           ~y,
                     1L,          1L,       2L,           8L,
) %>% as_epi_archive()
y2 = tribble(~geo_value, ~time_value, ~version,           ~y,
# no updates = we kept the old value of 8L
#
# unrelated measurement to make archive construction easier:
                     1L,       1000L,       5L,       10000L,
) %>% as_epi_archive()
xy1 = epix_merge(x1, y1)
xy2 = epix_merge(x2, y2)
xy_na = epix_rbind(xy1, xy2, sync="na")
xy_locf = epix_rbind(xy1, xy2, sync="locf")
xy_desired = tribble(~geo_value, ~time_value, ~version,           ~x,     ~y,
                             1L,          1L,       2L,          10L,     8L,
                             1L,          1L,       5L,  NA_integer_,     8L,
                             1L,       1000L,       5L,  NA_integer_, 10000L,
) %>% as_epi_archive()
xy_na$DT
#>    geo_value time_value version  x     y
#> 1:         1          1       2 10     8
#> 2:         1          1       5 NA    NA
#> 3:         1       1000       5 NA 10000
xy_locf$DT
#>    geo_value time_value version  x     y
#> 1:         1          1       2 10     8
#> 2:         1       1000       5 NA 10000
xy_desired$DT
#>    geo_value time_value version  x     y
#> 1:         1          1       2 10     8
#> 2:         1          1       5 NA     8
#> 3:         1       1000       5 NA 10000

Created on 2023-07-27 with reprex v2.0.2

@brookslogan
Copy link
Contributor Author

Not sure I follow what the add/change/remove this measurement mean.

I meant sort of like a git diff of one snapshot vs. the previous one.

  • Add = there was no value/row for this epikey-time-signal in the previous snapshot, but there is one in the current. The value could be an explicit NA.
  • Change = there was a value/row, but the value's different in the current snapshot. The value could be an explicit NA.
    • If the first version we have available for a measurement is a "change", then we're working with a partial archive that's missing some data from previous versions, and we can't produce as_of data for the current version without epix_rbinding on the missing history.
  • Remove = there was a value/row in the previous snapshot, but in the current, there's no value/row. (Implicit NA.)

Also, we'd need this column for every data column separately.

Yes. I think that would also be the case with the epix_merge() output flags you proposed. (Maybe there's some chance ALTREPs would automatically save us space here in the simple case where the flag is the same for every row of the archive, or we could try to force ALTREPs if this is the case.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants