[export] reduce numeric precision to reduce dataset size by ~30% #1512

jameshadfield · 2024-07-02T09:17:15Z

Description of proposed changes

Reduces the precision of numerical values in the Auspice JSON (node attrs, confidence & entropy values) whilst keeping enough precision so that the rendering / display in Auspice is unchanged.

Motivated by this slack thread

It'd be good to get a few different eyes on this PR as it affects every single dataset Augur produces.

Related work

The output of augur traits doesn't currently distinguish between a (terminal) node with a provided value vs an inferred one. They both look something like

{
    "value": "foo",
    "confidence": {"foo": 1.0},
    "entropy": -1.000088900581841e-12
}

We could reduce the JSON size a bit more if we were able to know the value wasn't inferred and thus drop the confidence & entropy fields from the export. (This may require some small Auspice changes as well.)

Checklist

Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

codecov · 2024-07-02T09:29:36Z

Codecov Report

Attention: Patch coverage is 93.47826% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.09%. Comparing base (eeb6237) to head (5f1a538).
Report is 162 commits behind head on master.

Files with missing lines	Patch %	Lines
augur/export_v2.py	95.55%	2 Missing ⚠️
augur/traits.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
+ Coverage   69.90%   70.09%   +0.19%     
==========================================
  Files          75       75              
  Lines        7888     7923      +35     
  Branches     1933     1938       +5     
==========================================
+ Hits         5514     5554      +40     
+ Misses       2088     2085       -3     
+ Partials      286      284       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

genehack

This feels like something where some unit tests would be appropriate?

joverlee521

Nice to see the reduction in dataset sizes!

augur/export_v2.py

jameshadfield · 2024-07-05T00:55:02Z

Updated the code a bit and added unit tests

augur/export_v2.py

Based on PR feedback in <#1512 (comment)> This better conveys that confidences for numeric traits must have 2 and only 2 elements.

CHANGES.md

Using fewer sig figs (or decimal places, as appropriate) is just fine for Auspice's usage and helps reduce the output JSON size. Testing on the zika dataset this reduces the (minified, gzipped) JSON by 20% from 175kB to 139kB. This refactor also uses more thorough error checking and enforcement that entropy values are found together with confidence values where appropriate. (Note that the previous usage of `is_valid` was bogus as it only works for string values.)

This extends the work in the previous commit to reduce the sig. figs / decimal places of all numeric node attrs. The file size of the zika dataset is reduced to 120kB which, when combined with the previous commit, is a 31% reduction cf. Augur 24.4.0

The previous restriction to the highest 4 values was motivated by keeping the eventual Auspice dataset small. That restriction is now part of `augur export v2` so we can now report them all here. This results in slightly larger node-data files (a 1.5% increase for the zika analysis) but will produce more thorough data for any scripts / non-Auspice usage.

Based on PR feedback in <#1512 (comment)> This better conveys that confidences for numeric traits must have 2 and only 2 elements.

jameshadfield · 2024-07-15T21:39:20Z

Docs CI failures due to sphinx-doc/sphinx-argparse#56, but all tests pass.

Merging this PR now as per this advice on Slack:

we added the docs check not too long ago and now it's giving a false positive. I'd just ignore the failure for [this] PR. We should probably add some temporary pin but that can be a separate PR.

jameshadfield requested a review from a team July 2, 2024 09:18

genehack reviewed Jul 2, 2024

View reviewed changes

joverlee521 requested changes Jul 2, 2024

View reviewed changes

augur/export_v2.py Outdated Show resolved Hide resolved

augur/export_v2.py Outdated Show resolved Hide resolved

augur/export_v2.py Outdated Show resolved Hide resolved

jameshadfield force-pushed the james/reduce-numeric-accuracy branch from 68a28d7 to d1b0cca Compare July 5, 2024 00:48

genehack approved these changes Jul 8, 2024

View reviewed changes

augur/export_v2.py Outdated Show resolved Hide resolved

joverlee521 approved these changes Jul 9, 2024

View reviewed changes

jameshadfield added a commit that referenced this pull request Jul 10, 2024

[export] refactor types

179cf65

Based on PR feedback in <#1512 (comment)> This better conveys that confidences for numeric traits must have 2 and only 2 elements.

jameshadfield force-pushed the james/reduce-numeric-accuracy branch from d1b0cca to 3f2f82c Compare July 10, 2024 03:37

genehack approved these changes Jul 10, 2024

View reviewed changes

victorlin reviewed Jul 15, 2024

View reviewed changes

CHANGES.md Outdated Show resolved Hide resolved

jameshadfield added 5 commits July 16, 2024 09:06

[export] refactor types

0d866c2

Based on PR feedback in <#1512 (comment)> This better conveys that confidences for numeric traits must have 2 and only 2 elements.

changelog

5f1a538

jameshadfield force-pushed the james/reduce-numeric-accuracy branch from 3f2f82c to 5f1a538 Compare July 15, 2024 21:09

jameshadfield merged commit ee1e4ae into master Jul 15, 2024
21 of 23 checks passed

jameshadfield deleted the james/reduce-numeric-accuracy branch July 15, 2024 21:39

victorlin mentioned this pull request Jul 15, 2024

Pin sphinx-argparse <0.5.0 #1543

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[export] reduce numeric precision to reduce dataset size by ~30% #1512

[export] reduce numeric precision to reduce dataset size by ~30% #1512

jameshadfield commented Jul 2, 2024 •

edited by genehack

Loading

codecov bot commented Jul 2, 2024 •

edited

Loading

genehack left a comment

joverlee521 left a comment

jameshadfield commented Jul 5, 2024

jameshadfield commented Jul 15, 2024

[export] reduce numeric precision to reduce dataset size by ~30% #1512

[export] reduce numeric precision to reduce dataset size by ~30% #1512

Conversation

jameshadfield commented Jul 2, 2024 • edited by genehack Loading

Description of proposed changes

Related work

Checklist

codecov bot commented Jul 2, 2024 • edited Loading

Codecov Report

genehack left a comment

Choose a reason for hiding this comment

joverlee521 left a comment

Choose a reason for hiding this comment

jameshadfield commented Jul 5, 2024

jameshadfield commented Jul 15, 2024

jameshadfield commented Jul 2, 2024 •

edited by genehack

Loading

codecov bot commented Jul 2, 2024 •

edited

Loading