Consider separating out or unifying measurement nonexistence and NA-ness in archives #110

brookslogan · 2022-06-19T08:24:33Z

Very low priority for now unless we run into it when testing epiprocess and epipredict on COVID-19 data. It seemed important for NoroSTAT acquisition in delphi-epidata because of the format of the source web site, but it may not matter for most prepared epi data sets.

Currently, we partially treat snapshot-row/measurement nonexistence and NA-ness as different; for example, we can produce snapshots with rows that have all NA measurements. However, epi_archives sometimes mix up row/measurement nonexistence and NA-ness. E.g.:

There's no built-in way to express row/measurement removals. (For versions before the initial version of a data set row, as_of will not produce a row for that measurement. But if the row is removed in some later version, as_of will produce NAs from that version until the next version in which it is added back, if any.)
Merges can "create" NA measurements for earlier time values and versions of one signal due to having update data there for other signals. This doesn't make sense if we treat nonexistence and NA-ness as different, but would if we treat them as the same.
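The two behaviors above can be illustrated with a toy model. This is a minimal Python sketch, not epiprocess code: the tuple layout, the `as_of` and `merge_as_of` helpers, and the example data are all assumptions made for illustration, with `None` standing in for NA. The merge is shown at the snapshot level for simplicity.

```python
# Toy update data: (geo, time_value, version, value). None plays the role of NA.
# Hypothetical scenario: the row is first reported in version 2, removed by the
# source in version 3 (which the archive can only record as NA), and added back
# in version 5.
updates_x = [
    ("ca", 1, 2, 10.0),
    ("ca", 1, 3, None),
    ("ca", 1, 5, 12.0),
]

# A second signal that already has data for the same key in version 1.
updates_y = [("ca", 1, 1, 5.0)]


def as_of(updates, version):
    """Return {(geo, time_value): value} from the latest update at or before `version`."""
    latest = {}
    for geo, time_value, ver, value in sorted(updates, key=lambda u: u[2]):
        if ver <= version:
            latest[(geo, time_value)] = value
    return latest


def merge_as_of(updates_a, updates_b, version):
    """Full-outer-join two signals' snapshots; keys missing on one side become None."""
    a = as_of(updates_a, version)
    b = as_of(updates_b, version)
    return {key: (a.get(key), b.get(key)) for key in a.keys() | b.keys()}


snapshots_x = {v: as_of(updates_x, v) for v in range(1, 6)}
merged_v1 = merge_as_of(updates_x, updates_y, 1)
```

In this sketch, `snapshots_x[1]` has no row at all, while `snapshots_x[3]` and `snapshots_x[4]` have a row whose value is `None`. The merged snapshot at version 1 reports `None` for the first signal even though that measurement did not exist yet.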

On a related note, in development of compactification in #97 and #101, the following situation was discussed: the update data contains a measurement with an initial value of NA; should this row be omitted? If it is omitted, then (unless a merge reintroduces an NA for that measurement) it will be treated as nonexistent, and the user may get errors trying to get measurements as of some version that they expect to exist.
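A minimal Python sketch of that situation follows. It is not the compactification code from #97 or #101: the `compactify` helper, its `drop_initial_na` switch, and the data are hypothetical, and `None` stands in for NA.

```python
def as_of(updates, version):
    """Return {(geo, time_value): value} from the latest update at or before `version`."""
    latest = {}
    for geo, time_value, ver, value in sorted(updates, key=lambda u: u[2]):
        if ver <= version:
            latest[(geo, time_value)] = value
    return latest


def compactify(updates, drop_initial_na):
    """Drop updates that repeat the last kept value for the same key.

    If `drop_initial_na` is True, a key with no kept update yet is treated as
    if its value were already None, so leading None updates are dropped too.
    """
    kept = []
    last = {}  # key -> last kept value
    for geo, time_value, ver, value in sorted(updates, key=lambda u: u[2]):
        key = (geo, time_value)
        if key in last:
            redundant = last[key] == value
        else:
            redundant = drop_initial_na and value is None
        if not redundant:
            kept.append((geo, time_value, ver, value))
            last[key] = value
    return kept


# Hypothetical update data whose first reported value is NA.
updates = [("ca", 1, 1, None), ("ca", 1, 2, None), ("ca", 1, 3, 7.0)]

keep_na = compactify(updates, drop_initial_na=False)
drop_na = compactify(updates, drop_initial_na=True)
```

Here `as_of(keep_na, 1)` still contains the key with a `None` value, while `as_of(drop_na, 1)` is empty, so a lookup for that measurement as of version 1 would fail.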

A couple of potential approaches to tracking separately:

Augment archive update data with a logical column for row/measurement existence flags. (SQL-based version here.)
Keep separate tables/archives that track removal of measurements (and/or other objects that specify some regular existence pattern, plus a removal table/archive for exceptions). This should be more space-efficient, as we don't need to add any columns to the original archive.
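Both approaches can be sketched in Python on toy data. Everything here is an assumption for illustration (the tuple layouts, the helper names, and the rule that a removal at the same version as a value update wins); it is neither the SQL-based version nor epiprocess code. `None` stands in for NA.

```python
# Approach 1: an extra logical "exists" column on the update data.
flagged_updates = [
    ("ca", 1, 2, 10.0, True),
    ("ca", 1, 3, None, False),  # removal
    ("ca", 1, 4, None, True),   # added back with a genuinely NA value
]


def as_of_flagged(updates, version):
    """Snapshot that omits keys whose latest update says the row does not exist."""
    latest = {}
    for geo, time_value, ver, value, exists in sorted(updates, key=lambda u: u[2]):
        if ver <= version:
            latest[(geo, time_value)] = (value, exists)
    return {key: value for key, (value, exists) in latest.items() if exists}


# Approach 2: unchanged update data plus a separate removal table.
value_updates = [("ca", 1, 2, 10.0), ("ca", 1, 4, None)]
removals = [("ca", 1, 3)]  # (geo, time_value, version at which the row was removed)


def as_of_with_removals(value_updates, removals, version):
    """Snapshot that omits keys whose latest removal is not older than their latest value."""
    latest = {}  # key -> (version, value)
    for geo, time_value, ver, value in sorted(value_updates, key=lambda u: u[2]):
        if ver <= version:
            latest[(geo, time_value)] = (ver, value)
    removed_at = {}  # key -> latest removal version at or before `version`
    for geo, time_value, ver in removals:
        if ver <= version:
            key = (geo, time_value)
            removed_at[key] = max(ver, removed_at.get(key, ver))
    return {
        key: value
        for key, (ver, value) in latest.items()
        if removed_at.get(key, float("-inf")) < ver
    }
```

Under these assumptions both representations give the same snapshots: a value of 10.0 as of version 2, no row as of version 3, and a row with `None` as of version 4. The removal table leaves the original update data untouched, which is where the space saving would come from.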

Tracking these separately would complicate both archive creation and existing operations:

Creation from snapshots and creation from updates would no longer be the same thing: snapshots simultaneously express existing values and nonexistence/removals, while updates only express existing values. This may, however, be a chance to take advantage of "Consider validating sensible key values in epi_archive, valid ops for & performance improvements from nonunique keys" (#89), if implemented.
Merges would need to deal with removal tables (e.g., store a list of removal tables, or merge them).
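The snapshot/update asymmetry can be seen in a small Python sketch that diffs two consecutive snapshots. The `diff_snapshots` helper and the dictionary representation are hypothetical, with `None` standing in for NA.

```python
def diff_snapshots(old, new, version):
    """Compare two snapshots ({(geo, time_value): value}) taken at consecutive versions.

    Returns (updates, removals): updates are new or changed values, removals are
    keys present in `old` but absent from `new`.
    """
    updates = sorted(
        (geo, time_value, version, value)
        for (geo, time_value), value in new.items()
        if (geo, time_value) not in old or old[(geo, time_value)] != value
    )
    removals = sorted(
        (geo, time_value, version)
        for (geo, time_value) in old
        if (geo, time_value) not in new
    )
    return updates, removals


snapshot_v1 = {("ca", 1): 10.0, ("ca", 2): 11.0}
snapshot_v2 = {("ca", 1): 10.0, ("ca", 3): None}

updates, removals = diff_snapshots(snapshot_v1, snapshot_v2, version=2)
```

In this example the new NA-valued row for time value 3 appears as an ordinary update, while the disappearance of time value 2 can only be recorded in the removals list. Update data alone has nowhere to put it.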

Instead, we might think to unify nonexistence and NA-ness in some sense, which might simplify reasoning for various functions dealing with archives (e.g., #88). We'd need NA to represent both, but still might need some idea of "nonexistence" to catch requests for unrecognized geo&additkey values, or when merging data sets with mismatched existing geo&additkey sets.

brookslogan added the enhancement (New feature or request) and P3 very low priority labels on Jun 19, 2022
