Add option to compactify issue data in epi_archive #62

Closed
brookslogan opened this issue Mar 29, 2022 · 5 comments · Fixed by #101
Labels
good first issue Good for newcomers P2 low priority performance

Comments

@brookslogan
Contributor

An epi_archive can be built from any mix of full snapshots, issue data that re-reports unchanged values, and minimal patch-like issues. Some space (unsure about time) can be saved by removing rows whose values match the last observation carried forward (LOCF) from previous issues. Space can be essential if we are attempting in-memory analysis.

Proposal: introduce a constructor argument compactify:

  • TRUE: remove redundant rows so that LOCF gives the same results. Make sure to maintain the same max_issue value as the original data.
  • FALSE: leave the data as-is.
  • default value (say, NULL): same as TRUE, except message the user if this actually changed the data and tell them how to silence the message.
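
A minimal sketch of the three proposed behaviors, written in Python purely for illustration (the package itself is in R, and this is not its implementation). The names `Row`, `compactify_rows`, and `build_archive`, the dictionary return value, and the use of `warnings.warn` to stand in for an R message are all assumptions made for this example.

```python
import warnings
from typing import NamedTuple


class Row(NamedTuple):
    geo: str
    time_value: str  # ISO date the measurement refers to
    issue: str       # ISO date the value was reported
    value: float


def compactify_rows(rows):
    """Drop rows whose value equals the LOCF value from earlier issues."""
    kept = []
    last_value = {}  # (geo, time_value) -> most recently kept value
    for row in sorted(rows, key=lambda r: (r.geo, r.time_value, r.issue)):
        key = (row.geo, row.time_value)
        if key in last_value and last_value[key] == row.value:
            continue  # redundant: LOCF already reproduces this value
        last_value[key] = row.value
        kept.append(row)
    return kept


def build_archive(rows, compactify=None):
    """Hypothetical constructor: compactify is True, False, or None (default)."""
    rows = list(rows)
    # Record max_issue from the original data, before any rows are dropped.
    max_issue = max((r.issue for r in rows), default=None)
    if compactify is False:
        kept = rows
    else:
        kept = compactify_rows(rows)
        if compactify is None and len(kept) < len(rows):
            warnings.warn(
                f"Removed {len(rows) - len(kept)} redundant rows; "
                "pass compactify=True or compactify=False to silence this."
            )
    return {"rows": kept, "max_issue": max_issue}


example = [
    Row("ca", "2022-01-01", "2022-01-02", 10.0),
    Row("ca", "2022-01-01", "2022-01-03", 10.0),  # re-reports 10.0: redundant
    Row("ca", "2022-01-01", "2022-01-04", 12.0),  # revision: kept
    Row("ca", "2022-01-01", "2022-01-05", 12.0),  # re-reports 12.0: redundant
]
archive = build_archive(example, compactify=True)
```

In this sketch the last issue of `example` is dropped as redundant, which is why `max_issue` is stored separately rather than recomputed from the remaining rows.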

Use cases:

  • A user inputs full snapshot data; compactifying prevents using space quadratic in the number of snapshots. (A further enhancement would be to work directly off of a directory of snapshot files or something similar.)
  • We are working off of a covidcast data source that historically did not use diff-based issues and/or has many full re-issues. (E.g., repeating the analysis here gives covidcast jhu-csse state-level case issue data at 79% duplicates, despite the shift to routine issues being diff-based. This only reduces what object.size reports from 40MB--50MB down to ~10MB, but at the county level it might matter more.)
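
To illustrate the quadratic-space point in the first use case, here is a small Python sketch on synthetic data (not the jhu-csse data mentioned above). It assumes one location, one new day per snapshot, and no revisions; the function names `stack_snapshots` and `compact_row_count` are hypothetical.

```python
def stack_snapshots(n_snapshots):
    """Snapshot number i reports days 1..i; values are never revised.

    Returns (day, issue, value) rows for all snapshots stacked together.
    """
    return [
        (day, issue, float(day))
        for issue in range(1, n_snapshots + 1)
        for day in range(1, issue + 1)
    ]


def compact_row_count(rows):
    """Count rows that remain after dropping those matching LOCF."""
    last_value = {}  # day -> most recently kept value
    kept = 0
    for day, issue, value in sorted(rows):  # sorts by day, then issue
        if day in last_value and last_value[day] == value:
            continue  # same as the previous issue's value: redundant
        last_value[day] = value
        kept += 1
    return kept


n = 100
stacked = stack_snapshots(n)
total = len(stacked)               # n * (n + 1) / 2 rows
compact = compact_row_count(stacked)  # one row per day when nothing is revised
duplicate_fraction = (total - compact) / total
```

Under these assumptions the stacked snapshots grow as n(n+1)/2 rows while the compact form grows linearly in n; real data with revisions would fall somewhere in between.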
@ryantibs
Member

ryantibs commented Apr 5, 2022

@lcbrooks Thanks for the idea! I think it sounds good.

I'm not sure compactify is the name I would have come up with. Is that a somewhat commonly used verb, or something you made up?

@brookslogan
Contributor Author

Regarding the naming: I just didn't want to call it compress (this does not apply a general-purpose compression algorithm but something archive-specific) and couldn't think of anything better. Not sure where I picked up the word, but it's in some dictionaries and appears to be used in a limited number of contexts in math and physics theory.

@brookslogan brookslogan added the good first issue Good for newcomers label Apr 13, 2022
@kenmawer
Contributor

I can help with this.

@kenmawer
Contributor

kenmawer commented Jun 1, 2022

I completed this and added tests for compactify. However, it still needs review, so feel free to suggest improvements, for example making the printed message more informative or adding a way to reject invalid values of the compactify argument.

@brookslogan
Contributor Author

Is this in a PR already? If so, could you please point me to it; otherwise, could you open a PR?

@brookslogan brookslogan linked a pull request Jun 21, 2022 that will close this issue
3 participants