
Add support for disk-backed updating archives, with instances for common data sources #87

Open
brookslogan opened this issue May 31, 2022 · 4 comments
Labels
enhancement New feature or request P1 medium priority

Comments

@brookslogan
Contributor

Common data sources:

  • directory of full/windowed snapshots (might make us rethink indexing by unique observations, as we actually only need to index full snapshots by the version, and reliably-windowed snapshots by (version,time_value))
  • WayBack Machine archives and/or mementos
  • GitHub and/or git
  • delphi.epidata
@brookslogan added the enhancement (New feature or request) and P1 medium priority labels and removed the P2 low priority label on May 31, 2022
@dshemetov
Contributor

dshemetov commented May 31, 2022

Seems like we need an updating version of the memoise package. I'm surprised no off-the-shelf solution exists for this yet; or maybe one does and we just haven't found it.

I may be misreading the issue though.

@brookslogan
Contributor Author

We may not need something that smart. The first idea is just to require the user to assign a name to each updating archive, along with values for all the non-version parameters, plus maybe, if versions aren't datetimes, a function to translate from Sys.time() to the expected version. No hashing of requested argument values would be involved. (On the other hand, common query arguments might not be that hard or slow to hash, so maybe we should just think of this as a variant on memoise.)
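A rough sketch of what that registration interface could look like, to make the idea concrete. Every name here (`updating_archive_spec`, `fetch_params`, `expected_version`) is hypothetical, not an existing epiprocess API:

```r
# Hypothetical sketch: a named updating-archive spec bundling the fixed
# non-version query parameters with a rule mapping the current time to the
# version we expect the data host to have available.
updating_archive_spec <- function(name, fetch_params,
                                  expected_version = function(now) as.Date(now)) {
  stopifnot(is.character(name), length(name) == 1L)
  structure(
    list(
      name = name,                         # used to locate the on-disk cache
      fetch_params = fetch_params,         # all non-version query arguments
      expected_version = expected_version  # Sys.time() -> version translation
    ),
    class = "updating_archive_spec"
  )
}

# Example: a source whose versions are dates and whose data lags by one day.
spec <- updating_archive_spec(
  name = "jhu_csse_cases",
  fetch_params = list(source = "jhu-csse", signal = "confirmed_incidence_num"),
  expected_version = function(now) as.Date(now) - 1L
)
spec$expected_version(Sys.time())
```

No hashing is involved: the user-chosen `name` alone identifies the cache entry, which keeps the lookup trivial at the cost of making the user responsible for not reusing a name with different parameters.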

Then there would need to be a bit of code to request (for delphi.epidata, in one or a small number of requests) the update data or snapshots from the latest recorded version up to the version expected to be available at the current date. It would take the existing archive data, overwrite the latest recorded version with the re-requested version (since this latest version might have been subject to some overwrites), append the version update data beyond that, and update what the "latest recorded version" should be considered to be. There are probably some interface considerations here to try to share logic between the different types of data hosts.
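The overwrite-and-append step above can be sketched in base R. A plain data frame of (version, time_value, value) rows stands in for an archive's update table, and `fetch_updates` stands in for a host-specific request function; none of this is existing epiprocess code:

```r
# Hypothetical sketch of the refresh step. We re-request from the latest
# recorded version onward, because that version may have been partially
# overwritten at the source since we last fetched it.
refresh_archive <- function(archive_df, fetch_updates, expected_version) {
  latest <- max(archive_df$version)
  new_rows <- fetch_updates(from_version = latest, to_version = expected_version)
  kept <- archive_df[archive_df$version < latest, , drop = FALSE]
  rbind(kept, new_rows)  # overwrite `latest`, append everything beyond it
}

# Toy host: pretend each version v reports value v for time_value 1.
fake_fetch <- function(from_version, to_version) {
  v <- seq(from_version, to_version)
  data.frame(version = v, time_value = 1L, value = v)
}

# Version 3 in our cache holds a stale value (99) that the host has since fixed.
archive <- data.frame(version = 1:3, time_value = 1L, value = c(1, 2, 99))
refreshed <- refresh_archive(archive, fake_fetch, expected_version = 5L)
refreshed$value  # version 3's stale 99 is replaced by the re-fetched value
```

The host-specific part is all inside `fetch_updates`; the merge logic itself is shared, which is where the interface-sharing opportunity mentioned above would live.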

This is a little specific to maintaining archives of version data, so I'm not sure we can expect an existing package to cover it. Most of the implementation work probably lies in the data-host-specific code.

@dshemetov
Contributor

dshemetov commented Jun 6, 2022

I'm realizing I don't know what "updating archive" means here. Is it a technical reference to epi_archive, or does it mean (what I thought it meant) a cache that can read contents partially from the cache and partially from the API, dynamically depending on the request (e.g., if the cache contains part of the requested data, the API fills in the missing piece)?

@brookslogan
Contributor Author

Updating here would just mean taking a cached (or to-be-cached) archive, asking the API for any hotfixed/overwritten/clobbered versions plus any additional versions, saving those overwrites and additions to the cache, and returning the updated archive.

This interacts with exploring alternative backends for the archive, like dplyr-compatible data-frame-like objects or polars. Some dplyr backends and/or polars support accessing data from disk, and might support updating disk-backed data.

You pointed to cachem some other time; we could check whether it helps here.
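cachem exposes a simple key-value get/set interface over memory- or disk-backed stores. As a rough stand-in, the kind of disk-backed get/set we would need can be sketched in base R; cachem's `cache_disk()` provides this same shape, plus eviction, size limits, and pruning that this toy version omits:

```r
# Minimal disk-backed key-value store, sketching the get/set interface
# that cachem's cache_disk() provides. Each key becomes one .rds file
# under `dir`; keys here are assumed to be filesystem-safe names.
disk_cache <- function(dir = tempfile("archive_cache_")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- function(key) file.path(dir, paste0(key, ".rds"))
  list(
    set    = function(key, value) saveRDS(value, path(key)),
    exists = function(key) file.exists(path(key)),
    get    = function(key) if (file.exists(path(key))) readRDS(path(key)) else NULL
  )
}

cache <- disk_cache()
cache$set("jhu_csse_cases", data.frame(version = 1:3, value = c(1, 2, 99)))
cache$exists("jhu_csse_cases")     # TRUE
nrow(cache$get("jhu_csse_cases"))  # 3
```

Note that cachem only handles the storage side; the version-aware overwrite-and-append refresh logic discussed above would still have to live on top of it.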
