Support for using tdigests to compute approximate percentiles. #8983
Where does 1000 come from?
IIRC, the reference implementation uses `delta=100` for "acceptable" error bounds, and we chose `1000` to check whether that improves the accuracy.
Right, this is basically a magic number that defines the "compression" level of the resulting digest: the larger the number, the more precise the result. Internally, it puts a bound on the maximum number of centroid clusters a digest will have. So with 1000, you'd end up with at most 1000 centroids (two doubles each). As mithun points out, values of ~100 were commonly used in the source paper for millions of input values.
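To make the role of the compression parameter concrete, here is a minimal Python sketch. It is not the implementation in this PR: the names `scale_k`, `build_digest`, and `estimate_quantile` are hypothetical, and the arcsine scale function is one of the choices described in the t-digest literature, assumed here for illustration. It clusters sorted input in one greedy pass and shows that `delta` caps the number of (mean, weight) centroids regardless of input size. This simple pass only guarantees a looser cap of about `2 * delta`; a tighter merging strategy may do better.

```python
import math
import random


def scale_k(q, delta):
    # Arcsine scale function (an assumption for this sketch): maps a
    # quantile q in [0, 1] onto [0, delta], stretching the tails so that
    # clusters near q=0 and q=1 stay small.
    return delta * (math.asin(2.0 * q - 1.0) / math.pi + 0.5)


def build_digest(values, delta):
    """Greedily cluster sorted values into (mean, weight) centroids."""
    data = sorted(values)
    n = len(data)
    centroids = []
    if n == 0:
        return centroids
    cur_sum, cur_w = data[0], 1
    seen = 0  # total weight of centroids already closed
    k_left = scale_k(0.0, delta)
    for x in data[1:]:
        q_right = (seen + cur_w + 1) / n
        if scale_k(q_right, delta) - k_left <= 1.0:
            # Adding x keeps the cluster within one unit of k: absorb it.
            cur_sum += x
            cur_w += 1
        else:
            # Close the current cluster and start a new one at x.
            centroids.append((cur_sum / cur_w, cur_w))
            seen += cur_w
            k_left = scale_k(seen / n, delta)
            cur_sum, cur_w = x, 1
    centroids.append((cur_sum / cur_w, cur_w))
    return centroids


def estimate_quantile(centroids, q):
    """Interpolate linearly between centroid means by cumulative weight."""
    total = sum(w for _, w in centroids)
    target = q * total
    cum = 0.0
    prev_mean, prev_mid = None, None
    for mean, w in centroids:
        mid = cum + w / 2.0  # rank at the middle of this centroid
        if target <= mid:
            if prev_mean is None:
                return mean
            frac = (target - prev_mid) / (mid - prev_mid)
            return prev_mean + frac * (mean - prev_mean)
        prev_mean, prev_mid = mean, mid
        cum += w
    return centroids[-1][0]


random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(100_000)]
exact_median = sorted(values)[len(values) // 2]

for delta in (100, 1000):
    digest = build_digest(values, delta)
    approx = estimate_quantile(digest, 0.5)
    print(delta, len(digest), abs(approx - exact_median))
```

Running it prints, for each `delta`, the centroid count and the absolute error of the estimated median against the exact one, which is a quick way to see the size/accuracy trade-off on a given data set.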
Since tdigest does not give any absolute guarantees on precision, I figured bumping it to a default of 1000 seemed reasonable, since that's a worst-case resulting column size of ~32KB.
Note: I'll be producing some data comparing tdigest performance against exact percentiles and Spark's internal approximation (which uses a different kind of digest).