Add option to compactify issue data in `epi_archive` #62

brookslogan · 2022-03-29T23:47:56Z

`epi_archive`s can be formed from a conglomeration of full snapshots, issue data with duplicate re-reporting, and/or minimal patch-like issues. Some space (and possibly time; I'm unsure) can be saved by removing rows that merely repeat the last observation carried forward (LOCF) from previous issues. Space can be essential if we are attempting in-memory analysis.

Proposal: introduce a constructor argument `compactify`:

- `TRUE`: remove unnecessary rows while giving the same LOCF results. Make sure to maintain the same `max_issue` value as the original data.
- `FALSE`: leave the data as-is.
- default value (say, `NULL`): same as `TRUE`, except message the user if this actually changed the data, and tell them how to silence the message.
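To make the proposed semantics concrete, here is a minimal, language-agnostic sketch in Python rather than R. It is not the package's implementation: the function name `compactify_rows`, the tuple layout `(geo_value, time_value, version, value)`, and the wording of the message are all assumptions made for illustration. It drops any row whose value equals the value from the previous version of the same observation, and it records the maximum version before dropping anything so that it can be preserved alongside the compacted rows.

```python
import warnings
from itertools import groupby


def compactify_rows(rows, compactify=None):
    """Hypothetical sketch of LOCF-based compactification.

    rows: list of (geo_value, time_value, version, value) tuples.
    compactify: True  -> drop LOCF-redundant rows silently
                False -> leave the rows as-is
                None  -> like True, but warn if any rows were dropped
    Returns (rows, max_version); max_version comes from the ORIGINAL rows.
    """
    # Record the latest version up front so it survives row removal.
    max_version = max((r[2] for r in rows), default=None)
    if compactify is False:
        return list(rows), max_version

    # Sort so versions of the same observation are adjacent and in order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    kept = []
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        previous_value = object()  # sentinel that never equals a real value
        for row in group:
            if row[3] != previous_value:
                # The value changed (or this is the first report): keep it.
                kept.append(row)
                previous_value = row[3]

    n_removed = len(ordered) - len(kept)
    if compactify is None and n_removed > 0:
        warnings.warn(
            f"compactify removed {n_removed} redundant row(s); pass "
            "compactify=True to silence this message, or compactify=False "
            "to keep the data as-is."
        )
    return kept, max_version


rows = [
    ("ca", "2022-01-01", "2022-01-02", 10),
    ("ca", "2022-01-01", "2022-01-03", 10),  # same as previous version
    ("ca", "2022-01-01", "2022-01-04", 12),
    ("ca", "2022-01-02", "2022-01-03", 7),
    ("ca", "2022-01-02", "2022-01-04", 7),   # same as previous version
]
kept, max_version = compactify_rows(rows, compactify=True)
print(len(rows), "->", len(kept), "rows; max version", max_version)
```

Filling forward from the kept rows reproduces the same value for every version of every observation, which is the sense in which the LOCF results are unchanged.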

Use cases:

- A user inputs full snapshot data; compactifying prevents using space quadratic in the number of snapshots. (A further enhancement would be to work directly off of a directory of snapshot files or something similar.)
- We are working off of a covidcast data source that historically did not use diff-based issues and/or has many full re-issues. (E.g., repeating the analysis here shows that 79% of the covidcast jhu-csse state-level case issue data is duplicates, despite the shift to diff-based routine issues. This only reduces what `object.size` reports as 40MB--50MB down to ~10MB, but at the county level it might matter more.)
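The quadratic growth in the first use case can be illustrated with a small sketch. The revision pattern here is purely hypothetical and is not taken from any covidcast signal: one full snapshot per day, where each snapshot contains every date reported so far, and each value is revised exactly once, the day after it is first reported.

```python
def count_rows(n_days):
    """Count rows stored as full snapshots vs. after dropping LOCF repeats.

    Assumed (hypothetical) pattern: the snapshot for day `version` contains
    time values 1..version, and each value is revised once, one day after
    its first report.
    """
    full = 0
    compact = 0
    last_seen = {}  # time_value -> most recently stored value
    for version in range(1, n_days + 1):
        for time_value in range(1, version + 1):
            value = 10 * time_value + (1 if version > time_value else 0)
            full += 1
            if last_seen.get(time_value) != value:
                # Only store a row when the value differs from the last one.
                compact += 1
                last_seen[time_value] = value
    return full, compact


for n in (10, 100, 1000):
    full, compact = count_rows(n)
    print(n, full, compact)
# prints:
# 10 55 19
# 100 5050 199
# 1000 500500 1999
```

Under these assumptions the snapshot storage grows as n(n+1)/2 while the compacted storage grows as 2n-1, so the fraction of redundant rows approaches 100% as snapshots accumulate.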

ryantibs · 2022-04-05T00:22:21Z

@lcbrooks Thanks for the idea! I think it sounds like a good idea.

Not sure if compactify would be the name I would have come up with. Is that a somewhat commonly-used verb? Or something you made up?

brookslogan · 2022-04-13T23:43:44Z

Regarding the naming: I just didn't want to call it compress (this is not applying a general-purpose compression algorithm but something archive-specific) and couldn't think of anything better. Not sure where I picked up the word, but it's in some dictionaries and appears to be used in math and physics theory in a limited number of contexts.

kenmawer · 2022-05-13T22:49:30Z

I can help with this.

kenmawer · 2022-06-01T20:52:22Z

I completed this and added tests for compactify. However, it still needs reviewing, so feel free to suggest improvements, such as if you want the printing to be more informative or have some way to prevent invalid entries for the compactify variable.

brookslogan · 2022-06-06T19:14:25Z

Is this in a PR already? If so, could you please point me to it; otherwise, could you open a PR?

brookslogan added the good first issue label Apr 13, 2022

dajmcdon assigned kenmawer May 13, 2022

brookslogan added performance P2 low priority labels May 31, 2022

brookslogan mentioned this issue Jun 17, 2022

km-compactify_rectify2 #107

Closed

brookslogan linked a pull request Jun 21, 2022 that will close this issue

Km compactify rectify #101

Merged

brookslogan mentioned this issue Jul 25, 2022

Km compactify rectify #101

Merged

brookslogan closed this as completed in #101 Jul 29, 2022
