Consider separating out or unifying measurement nonexistence and NA-ness in archives #110

Open
brookslogan opened this issue Jun 19, 2022 · 0 comments
Labels
enhancement New feature or request P3 very low priority

Comments

@brookslogan
Contributor

Very low-priority for now unless we run into it testing epiprocess and epipredict on COVID-19 data. While it seemed important for NoroSTAT acquisition in delphi-epidata due to the format of the web site involved, it may not matter for most prepared epi data sets.

Currently, we partially treat snapshot-row/measurement nonexistence and NA-ness as different; for example, we can produce snapshots with rows that have all NA measurements. However, epi_archives sometimes mix up row/measurement nonexistence and NA-ness. E.g.:

  • There's no built-in way to express row/measurement removals. Before the initial version of a data set row, as_of will not produce a row for that measurement. But if the measurement is removed in some later version, as_of will produce NAs for it from that version until the next version in which it is added back, if any.
  • Merges can "create" NA measurements for earlier time values and versions of one signal because other signals have update data there. This doesn't make sense if we treat nonexistence and NA-ness as different, but would if we treat them as the same.
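
To make the two cases above concrete, here is a minimal sketch in plain Python rather than R. It is a toy model, not the epiprocess implementation: the tuple layout, the `as_of` and `merged_as_of` helpers, and the use of `None` as a stand-in for NA are all assumptions made for illustration.

```python
NA = None  # stand-in for R's NA (assumption for this sketch)

# Toy update data: (geo_value, time_value, version, value)
cases_updates = [
    ("ca", 1, 2, 10.0),  # first reported in version 2
    ("ca", 1, 4, NA),    # "removed" in version 4; can only be recorded as NA
    ("ca", 1, 6, 12.0),  # added back in version 6
]
deaths_updates = [
    ("ca", 1, 1, 0.0),   # first reported in version 1
]

def as_of(updates, version):
    """Latest value per (geo_value, time_value) among updates with version <= `version`."""
    latest = {}
    for geo, t, v, value in sorted(updates, key=lambda u: u[2]):
        if v <= version:
            latest[(geo, t)] = value
    return latest

def merged_as_of(x_updates, y_updates, version):
    """Full outer join of the two snapshots; a key missing on one side is filled with NA."""
    x, y = as_of(x_updates, version), as_of(y_updates, version)
    return {key: (x.get(key, NA), y.get(key, NA)) for key in sorted(x.keys() | y.keys())}

# Before the initial version: no row at all.
print(as_of(cases_updates, 1))
# After the "removal": a row exists, but its measurement is NA.
print(as_of(cases_updates, 5))
# Merge at version 1: the cases measurement did not exist yet, but shows up as NA.
print(merged_as_of(cases_updates, deaths_updates, 1))
```

In this toy model, the snapshot at version 1 has no cases row, the snapshot at version 5 has a cases row holding NA, and the merged snapshot at version 1 reports an NA cases value that neither input archive ever recorded.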

On a related note, in development of compactification in #97 and #101, the following situation was discussed: the update data contains a measurement with an initial value of NA; should this row be omitted? If it is omitted, then (unless a merge reintroduces an NA for that measurement) it will be treated as nonexistent, and the user may get errors trying to get measurements as of some version that they expect to exist.
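
The compactification question can be sketched the same way. This is a hypothetical toy in plain Python, not the code from #97 or #101: `compactify`, its `drop_initial_na` flag, and the tuple layout are invented here, and `None` stands in for NA.

```python
NA = None  # stand-in for R's NA (assumption for this sketch)
_MISSING = object()  # marks "no earlier update kept for this key"

def as_of(updates, version):
    """Latest value per (geo_value, time_value) among updates with version <= `version`."""
    latest = {}
    for geo, t, v, value in sorted(updates, key=lambda u: u[2]):
        if v <= version:
            latest[(geo, t)] = value
    return latest

def compactify(updates, drop_initial_na):
    """Drop updates that repeat the previously kept value for the same key.

    If `drop_initial_na` is true, an NA with no earlier kept update is also
    treated as redundant, i.e. NA is treated like nonexistence.
    """
    kept, last = [], {}
    for geo, t, v, value in sorted(updates, key=lambda u: u[2]):
        key = (geo, t)
        prev = last.get(key, _MISSING)
        if prev is _MISSING:
            redundant = drop_initial_na and value is NA
        else:
            redundant = value == prev
        if not redundant:
            kept.append((geo, t, v, value))
            last[key] = value
    return kept

updates = [
    ("ca", 1, 2, NA),   # initial value is NA
    ("ca", 1, 3, NA),   # unchanged
    ("ca", 1, 5, 7.0),  # first non-NA value
]

keep_na = compactify(updates, drop_initial_na=False)
drop_na = compactify(updates, drop_initial_na=True)

print(as_of(keep_na, 3))  # row present, measurement is NA
print(as_of(drop_na, 3))  # no row: the measurement looks nonexistent
```

With the initial NA kept, a request as of version 3 returns a row with an NA measurement; with it omitted, the same request returns no row for that key, which is the situation where a user might hit an unexpected error.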

A couple of potential approaches to tracking them separately:

  • Augment archive update data with a logical column for row/measurement existence flags. (SQL-based version here.)
  • Have separate tables/archives tracking the removal of measurements (and/or other objects that specify some regular existence pattern, plus a removal table/archive for exceptions). This should be more space-efficient, as we don't need to add any columns to the original archive.
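
As a rough illustration of the first approach (an existence flag in the update data), here is a toy sketch in plain Python. The fifth `exists` field and the `as_of_with_flags` helper are hypothetical and only show how a flag would let a snapshot tell "removed" apart from "present but NA"; nothing here reflects an actual epiprocess or SQL design.

```python
NA = None  # stand-in for R's NA (assumption for this sketch)

# Toy update data with an existence flag:
# (geo_value, time_value, version, value, exists)
flagged_updates = [
    ("ca", 1, 2, 10.0, True),   # first reported in version 2
    ("ca", 1, 4, NA,   False),  # removed in version 4
    ("ca", 1, 6, NA,   True),   # added back in version 6, value not yet known
]

def as_of_with_flags(updates, version):
    """Snapshot as of `version`, omitting keys whose latest update marks them as removed."""
    latest = {}
    for geo, t, v, value, exists in sorted(updates, key=lambda u: u[2]):
        if v <= version:
            latest[(geo, t)] = (value, exists)
    return {key: value for key, (value, exists) in latest.items() if exists}

print(as_of_with_flags(flagged_updates, 3))  # row present with a value
print(as_of_with_flags(flagged_updates, 5))  # row removed: no entry
print(as_of_with_flags(flagged_updates, 6))  # row present, measurement is NA
```

The second approach would keep the original four-field updates unchanged and store the version 4 removal in a separate table, which is where the space saving would come from.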

Either approach would complicate archive creation and existing operations.

Instead, we might think to unify nonexistence and NA-ness in some sense, which might simplify reasoning for various functions dealing with archives (e.g., #88). We'd need NA to represent both, but still might need some idea of "nonexistence" to catch requests for unrecognized geo&additkey values, or when merging data sets with mismatched existing geo&additkey sets.

@brookslogan brookslogan added enhancement New feature or request P3 very low priority labels Jun 19, 2022