
Add support for disk-backed updating archives, with instances for common data sources #87

Open
brookslogan opened this issue May 31, 2022 · 4 comments
Labels
enhancement New feature or request P1 medium priority

Comments

@brookslogan
Contributor

Common data sources:

  • directory of full/windowed snapshots (might make us rethink indexing by unique observations, as we actually only need to index full snapshots by the version, and reliably-windowed snapshots by (version,time_value))
  • WayBack Machine archives and/or mementos
  • GitHub and/or git
  • delphi.epidata
@brookslogan added the enhancement (New feature or request) and P1 medium priority labels and removed the P2 low priority label on May 31, 2022
@dshemetov
Contributor

dshemetov commented May 31, 2022

Seems like we need an updating version of the memoise package. I'm surprised no off-the-shelf solution exists for this yet; or maybe one does and we just haven't found it.

I may be misreading the issue though.

@brookslogan
Contributor Author

We may not need something that smart. The first idea is just to require the user to assign a name to each updating archive, along with values for all the non-version parameters, plus maybe, if versions aren't datetimes, a function to translate from Sys.time() to the expected version. No hashing of requested argument values would be involved. (On the other hand, common query arguments might not be that hard or slow to hash, so maybe we should just think of this as a variant on memoise.)
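A rough sketch of what that registration interface could look like, to make the idea concrete. Every name here (`updating_archive_spec`, `fetch_params`, `expected_version`) is hypothetical, not an existing epiprocess API:

```r
# Hypothetical sketch: a named updating-archive spec bundling the fixed
# non-version query parameters with a rule mapping the current time to the
# version we expect the data host to have available.
updating_archive_spec <- function(name, fetch_params,
                                  expected_version = function(now) as.Date(now)) {
  stopifnot(is.character(name), length(name) == 1L)
  structure(
    list(
      name = name,                         # used to locate the on-disk cache
      fetch_params = fetch_params,         # all non-version query arguments
      expected_version = expected_version  # Sys.time() -> version translation
    ),
    class = "updating_archive_spec"
  )
}

# Example: a source whose versions are dates and whose data lags by one day.
spec <- updating_archive_spec(
  name = "jhu_csse_cases",
  fetch_params = list(source = "jhu-csse", signal = "confirmed_incidence_num"),
  expected_version = function(now) as.Date(now) - 1L
)
spec$expected_version(Sys.time())
```

No hashing is involved: the user-chosen `name` alone identifies the cache entry, which keeps the lookup trivial at the cost of making the user responsible for not reusing a name with different parameters.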

Then there would need to be a bit of code to request (for delphi.epidata, in one or a small number of requests) the update data or snapshots from the latest recorded version up to the version expected to be available at the current date. It would take the existing archive data, overwrite the latest recorded version with the re-requested version (since this latest version might have been subject to some overwrites), append the version update data beyond that, and update what the "latest recorded version" should be considered to be. There are probably some interface considerations here to try to share logic between the different types of data hosts.
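The overwrite-and-append step above can be sketched in base R. A plain data frame of (version, time_value, value) rows stands in for an archive's update table, and `fetch_updates` stands in for a host-specific request function; none of this is existing epiprocess code:

```r
# Hypothetical sketch of the refresh step. We re-request from the latest
# recorded version onward, because that version may have been partially
# overwritten at the source since we last fetched it.
refresh_archive <- function(archive_df, fetch_updates, expected_version) {
  latest <- max(archive_df$version)
  new_rows <- fetch_updates(from_version = latest, to_version = expected_version)
  kept <- archive_df[archive_df$version < latest, , drop = FALSE]
  rbind(kept, new_rows)  # overwrite `latest`, append everything beyond it
}

# Toy host: pretend each version v reports value v for time_value 1.
fake_fetch <- function(from_version, to_version) {
  v <- seq(from_version, to_version)
  data.frame(version = v, time_value = 1L, value = v)
}

# Version 3 in our cache holds a stale value (99) that the host has since fixed.
archive <- data.frame(version = 1:3, time_value = 1L, value = c(1, 2, 99))
refreshed <- refresh_archive(archive, fake_fetch, expected_version = 5L)
refreshed$value  # version 3's stale 99 is replaced by the re-fetched value
```

The host-specific part is all inside `fetch_updates`; the merge logic itself is shared, which is where the interface-sharing opportunity mentioned above would live.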

This is a little specific to maintaining archives of version data, so I'm not sure we can expect an existing package to cover it. Most of the implementation work probably lies in the data-host-specific code.

@dshemetov
Contributor

dshemetov commented Jun 6, 2022

I'm realizing I don't know what "updating archive" means here. Is it a technical reference to epi_archive, or does it mean (what I thought it meant) a cache that can read contents partially from the cache and partially from the API, dynamically depending on the request (e.g., if the cache contains part of the requested data, the API fills in the missing piece)?

@brookslogan
Contributor Author

Updating here would just mean taking a cached (or to-be-cached) archive, asking the API for any hotfixed/overwritten/clobbered versions plus any additional versions, saving those overwrites and additions to the cache, and returning the updated archive.

This interacts with exploring alternative backends for the archive, like dplyr-compatible data-frame-like objects or polars. Some dplyr backends and/or polars support accessing data from disk, and might support updating disk-backed data.

You pointed to cachem some other time; we could check whether it helps here.
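cachem exposes a simple key-value get/set interface over memory- or disk-backed stores. As a rough stand-in, the kind of disk-backed get/set we would need can be sketched in base R; cachem's `cache_disk()` provides this same shape, plus eviction, size limits, and pruning that this toy version omits:

```r
# Minimal disk-backed key-value store, sketching the get/set interface
# that cachem's cache_disk() provides. Each key becomes one .rds file
# under `dir`; keys here are assumed to be filesystem-safe names.
disk_cache <- function(dir = tempfile("archive_cache_")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- function(key) file.path(dir, paste0(key, ".rds"))
  list(
    set    = function(key, value) saveRDS(value, path(key)),
    exists = function(key) file.exists(path(key)),
    get    = function(key) if (file.exists(path(key))) readRDS(path(key)) else NULL
  )
}

cache <- disk_cache()
cache$set("jhu_csse_cases", data.frame(version = 1:3, value = c(1, 2, 99)))
cache$exists("jhu_csse_cases")     # TRUE
nrow(cache$get("jhu_csse_cases"))  # 3
```

Note that cachem only handles the storage side; the version-aware overwrite-and-append refresh logic discussed above would still have to live on top of it.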
