[export] reduce numeric precision to reduce dataset size by ~30% #1512
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
+ Coverage   69.90%   70.09%   +0.19%
==========================================
  Files          75       75
  Lines        7888     7923      +35
  Branches     1933     1938       +5
==========================================
+ Hits         5514     5554      +40
+ Misses       2088     2085       -3
+ Partials      286      284       -2
View full report in Codecov by Sentry.
This feels like something where some unit tests would be appropriate?
Nice to see the reduction in dataset sizes!
68a28d7 to d1b0cca
Updated the code a bit and added unit tests.
Based on PR feedback in <#1512 (comment)>. This better conveys that confidences for numeric traits must have exactly 2 elements.
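The two-element rule for numeric confidences can be illustrated with a small check. This is only a sketch, not the code from the PR: the function name `validate_numeric_confidence` and the error messages are hypothetical.

```python
def validate_numeric_confidence(confidence):
    """Return [lower, upper] if `confidence` is a valid numeric interval.

    Hypothetical helper: a numeric trait's confidence is assumed to be a
    sequence of exactly two numbers.
    """
    if not isinstance(confidence, (list, tuple)) or len(confidence) != 2:
        raise ValueError("numeric confidence must have exactly 2 elements")
    for bound in confidence:
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(bound, bool) or not isinstance(bound, (int, float)):
            raise ValueError("numeric confidence bounds must be numbers")
    return list(confidence)


print(validate_numeric_confidence([2016.1, 2016.8]))
```

Rejecting anything other than a two-element sequence up front makes a malformed interval fail loudly at export time instead of reaching the output JSON.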
d1b0cca to 3f2f82c
Using fewer sig figs (or decimal places, as appropriate) is just fine for Auspice's usage and helps reduce the output JSON size. Tested on the zika dataset, this reduces the (minified, gzipped) JSON by 20%, from 175kB to 139kB. This refactor also adds more thorough error checking and enforces that entropy values are found together with confidence values where appropriate. (Note that the previous usage of `is_valid` was bogus as it only works for string values.)
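To make the size effect concrete, here is a minimal sketch of rounding to significant figures before serialising. It is not the implementation in this PR: `round_sig_figs`, the default of 3 significant figures and the sample values are all assumptions chosen for illustration.

```python
import json


def round_sig_figs(value: float, sig_figs: int = 3) -> float:
    """Round `value` to `sig_figs` significant figures (hypothetical helper)."""
    # Scientific-notation formatting keeps `sig_figs` digits in the mantissa;
    # converting back to float discards everything beyond them.
    return float(f"{value:.{sig_figs - 1}e}")


# Made-up values standing in for one node's entropy and confidences
full = {"entropy": 0.123456789, "confidence": {"A": 0.987654321, "B": 0.012345679}}
reduced = {
    "entropy": round_sig_figs(full["entropy"]),
    "confidence": {k: round_sig_figs(v) for k, v in full["confidence"].items()},
}

# The rounded version serialises to fewer characters
print(len(json.dumps(full)), len(json.dumps(reduced)))
```

Because the saving applies to every numeric value on every node, small per-value reductions like this add up across a whole tree.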
This extends the work in the previous commit to reduce the sig. figs / decimal places of all numeric node attrs. The file size of the zika dataset is reduced to 120kB which, combined with the previous commit, is a 31% reduction compared with Augur 24.4.0.
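Applying a reduction to every numeric node attr amounts to walking the nested tree structure. The sketch below assumes a simplified Auspice-style tree (`node_attrs` and `children` keys) and uses a single fixed number of decimal places throughout; the PR itself chooses sig figs or decimal places as appropriate per attribute, which this sketch does not attempt. The name `round_floats` is hypothetical.

```python
def round_floats(obj, ndigits: int = 3):
    """Recursively round every float in a nested JSON-like structure.

    Hypothetical helper: ints, strings and other values are left untouched.
    """
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, dict):
        return {key: round_floats(value, ndigits) for key, value in obj.items()}
    if isinstance(obj, list):
        return [round_floats(item, ndigits) for item in obj]
    return obj


# A tiny made-up tree in the assumed shape
tree = {
    "name": "root",
    "node_attrs": {"div": 0.0012345678},
    "children": [
        {
            "name": "tip_1",
            "node_attrs": {
                "div": 0.0045678912,
                "num_date": {
                    "value": 2016.4567891,
                    "confidence": [2016.1234567, 2016.7654321],
                },
            },
        }
    ],
}

print(round_floats(tree))
```

The recursion is fine for a sketch, but a very deep tree could exceed Python's recursion limit, so an explicit stack would be the safer choice in real code.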
The previous restriction to the highest 4 values was motivated by keeping the eventual Auspice dataset small. That restriction is now part of `augur export v2` so we can now report them all here. This results in slightly larger node-data files (a 1.5% increase for the zika analysis) but will produce more thorough data for any scripts / non-Auspice usage.
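The "highest 4 values" restriction described above can be sketched as follows. This is an illustration only, assuming a categorical trait's confidence is a mapping from value to probability; `keep_top_confidences` and the sample data are hypothetical, not the code in `augur export v2`.

```python
def keep_top_confidences(confidence: dict, n: int = 4) -> dict:
    """Keep only the `n` most probable entries (hypothetical helper)."""
    ranked = sorted(confidence.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Made-up confidences for an inferred "country" trait
confidence = {
    "brazil": 0.50,
    "colombia": 0.20,
    "mexico": 0.15,
    "haiti": 0.10,
    "panama": 0.05,
}

print(keep_top_confidences(confidence))
```

Applying the cut at export time rather than in the node-data file keeps the full distribution available to other scripts while still bounding the size of the Auspice dataset.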
3f2f82c to 5f1a538
Docs CI failures due to sphinx-doc/sphinx-argparse#56, but all tests pass. Merging this PR now as per this advice on Slack:
Description of proposed changes
Reduces the precision of numerical values in the Auspice JSON (node attrs, confidence & entropy values) whilst keeping enough precision so that the rendering / display in Auspice is unchanged.
Motivated by this Slack thread.
It'd be good to get a few different eyes on this PR as it affects every single dataset Augur produces.
Related work
The output of `augur traits` doesn't currently distinguish between a (terminal) node with a provided value vs an inferred one. They both look something like
We could reduce the JSON size a bit more if we were able to know the value wasn't inferred and thus drop the confidence & entropy fields from the export. (This may require some small Auspice changes as well.)
Checklist