Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[export] reduce numeric precision to reduce dataset size by ~30% #1512

Merged
merged 5 commits into from
Jul 15, 2024

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Jul 2, 2024

Description of proposed changes

Reduces the precision of numerical values in the Auspice JSON (node attrs, confidence & entropy values) whilst keeping enough precision so that the rendering / display in Auspice is unchanged.

Motivated by this slack thread

It'd be good to get a few different eyes on this PR as it affects every single dataset Augur produces.

Related work

The output of augur traits doesn't currently distinguish between a (terminal) node with a provided value vs an inferred one. They both look something like

{
    "value": "foo",
    "confidence": {"foo": 1.0},
    "entropy": -1.000088900581841e-12
}

We could reduce the JSON size a bit more if we were able to know the value wasn't inferred and thus drop the confidence & entropy fields from the export. (This may require some small Auspice changes as well.)

Checklist

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

@jameshadfield jameshadfield requested a review from a team July 2, 2024 09:18
Copy link

codecov bot commented Jul 2, 2024

Codecov Report

Attention: Patch coverage is 93.47826% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.09%. Comparing base (eeb6237) to head (5f1a538).
Report is 162 commits behind head on master.

Files with missing lines Patch % Lines
augur/export_v2.py 95.55% 2 Missing ⚠️
augur/traits.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
+ Coverage   69.90%   70.09%   +0.19%     
==========================================
  Files          75       75              
  Lines        7888     7923      +35     
  Branches     1933     1938       +5     
==========================================
+ Hits         5514     5554      +40     
+ Misses       2088     2085       -3     
+ Partials      286      284       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@genehack genehack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like something where some unit tests would be appropriate?

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see the reduction in dataset sizes!

augur/export_v2.py Outdated Show resolved Hide resolved
augur/export_v2.py Outdated Show resolved Hide resolved
augur/export_v2.py Outdated Show resolved Hide resolved
@jameshadfield jameshadfield force-pushed the james/reduce-numeric-accuracy branch from 68a28d7 to d1b0cca Compare July 5, 2024 00:48
@jameshadfield
Copy link
Member Author

Updated the code a bit and added unit tests

augur/export_v2.py Outdated Show resolved Hide resolved
jameshadfield added a commit that referenced this pull request Jul 10, 2024
Based on PR feedback in <#1512 (comment)>

This better conveys that confidences for numeric traits must have 2 and
only 2 elements.
CHANGES.md Outdated Show resolved Hide resolved
Using fewer sig figs (or decimal places, as appropriate) is just fine
for Auspice's usage and helps reduce the output JSON size. Testing on
the zika dataset this reduces the (minified, gzipped) JSON by 20% from
175kB to 139kB.

This refactor also uses more thorough error checking and enforcement
that entropy values are found together with confidence values where
appropriate. (Note that the previous usage of `is_valid` was bogus as it
only works for string values.)
This extends the work in the previous commit to reduce the sig. figs /
decimal places of all numeric node attrs. The file size of the zika
dataset is reduced to 120kB which, when combined with the previous
commit, is a 31% reduction cf. Augur 24.4.0
The previous restriction to the highest 4 values was motivated by
keeping the eventual Auspice dataset small. That restriction is now part
of `augur export v2` so we can now report them all here. This results in
slightly larger node-data files (a 1.5% increase for the zika analysis)
but will produce more thorough data for any scripts / non-Auspice usage.
Based on PR feedback in <#1512 (comment)>

This better conveys that confidences for numeric traits must have 2 and
only 2 elements.
@jameshadfield
Copy link
Member Author

Docs CI failures due to sphinx-doc/sphinx-argparse#56, but all tests pass.

Merging this PR now as per this advice on Slack:

we added the docs check not too long ago and now it's giving a false positive. I'd just ignore the failure for [this] PR. We should probably add some temporary pin but that can be a separate PR.

@jameshadfield jameshadfield merged commit ee1e4ae into master Jul 15, 2024
21 of 23 checks passed
@jameshadfield jameshadfield deleted the james/reduce-numeric-accuracy branch July 15, 2024 21:39
@victorlin victorlin mentioned this pull request Jul 15, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants