Add cardinality estimation proposal #7264
In the case of measurement removal, we will maintain a similar counter for removed measurements.
However, when a measurement is removed, all series for that measurement are removed, so we need to get a count of the number of series belonging to the measurement, in order to increase the removed series counter too.
We can do this by consulting the measurement hash index in the TSI and looking up the `len(series)` value, which tells us how many series belong to the measurement.
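A minimal Go sketch of the bookkeeping described above, under stated assumptions: `cardinalityCounters`, `seriesPerMeasurement`, and `removeMeasurement` are hypothetical names, and a plain map stands in for the TSI measurement hash index and its `len(series)` value. It is not the proposal's actual implementation.

```go
package main

import "fmt"

// cardinalityCounters is a hypothetical holder for the two "removed" counters.
type cardinalityCounters struct {
	removedMeasurements uint64
	removedSeries       uint64
}

// seriesPerMeasurement is a hypothetical stand-in for the TSI measurement
// hash index: it maps a measurement name to its len(series) value.
type seriesPerMeasurement map[string]uint64

// removeMeasurement drops a measurement and bumps both counters. The series
// counter grows by the number of series that belonged to the measurement.
// It returns false if the measurement is unknown, so counters are never
// incremented twice for the same removal.
func (c *cardinalityCounters) removeMeasurement(idx seriesPerMeasurement, name string) bool {
	n, ok := idx[name]
	if !ok {
		return false
	}
	delete(idx, name)
	c.removedMeasurements++
	c.removedSeries += n
	return true
}

func main() {
	idx := seriesPerMeasurement{"cpu": 3, "mem": 2}
	var c cardinalityCounters
	c.removeMeasurement(idx, "cpu")
	fmt.Println(c.removedMeasurements, c.removedSeries) // prints "1 3"
}
```

The lookup happens before the delete, since the series count is no longer available once the measurement entry is gone.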
Is `len(series)` already computed in there, or should we just use the count from the existing measurement sketch before deletion?
The exact number of series is available in the TSI file. See the measurements block here.
One question I have about HLL++: how's the accuracy when the count is low, e.g. < 1k, < 10k, < 100k?
@pauldix good question. Initially I was going to prototype something, empirically determine the accuracy for low cardinalities, and find a cut-off point where we switch to counting. But looking at some of the existing analyses out there, I think the error rate is pretty good for HLL++. For example, see pages 5 and 6 of this analysis. It shows that even for low cardinalities the error rate stays [...]

I should probably have added something about this in the document.

I still think we could consider switching to exact counting at certain cardinality levels. When I was thinking about this before, I figured we could come up with a very loose upper bound on cardinality by simply summing the number of series in all indices for a database. If this value is [...]

If we went down this road, we would probably have to consider how we handle the count/estimate approaches in the query language: whether we have different functions for them, or a dynamic switch within the same function.
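The exact-versus-estimate switch discussed above could look roughly like the following Go sketch. Everything here is an assumption for illustration: the function names are hypothetical, the threshold is a caller-supplied parameter because the thread does not state a value, and the strict `<` comparison is a guess at the intended rule.

```go
package main

import "fmt"

// looseUpperBound sums the per-index series counts for a database. The sum
// is only an upper bound on true cardinality, since nothing here removes
// series that might be counted in more than one index.
func looseUpperBound(perIndexSeries []uint64) uint64 {
	var total uint64
	for _, n := range perIndexSeries {
		total += n
	}
	return total
}

// useExactCount reports whether the loose upper bound falls below the
// (hypothetical) threshold, in which case exact counting would be used
// instead of the HLL++ estimate.
func useExactCount(perIndexSeries []uint64, threshold uint64) bool {
	return looseUpperBound(perIndexSeries) < threshold
}

func main() {
	perIndex := []uint64{400, 350, 200}
	fmt.Println(looseUpperBound(perIndex), useExactCount(perIndex, 1000)) // prints "950 true"
}
```

Because the bound can only overestimate, a database that passes the check is guaranteed to be below the threshold, while one that fails it may still be small.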
@e-dard Overall lgtm. Two issues though:
@benbjohnson thanks for comments.
```
╔═══════════Sketches═══════════╗
║ ┌──────────────────────────┐ ║
║ │ Sketch Count <uint32>    │ ║
```
Why do we need to store this?
Since sketches are a fixed size, the count would tell us how far to seek to reach the dictionary if we added further sketch types to the Sketches block in the future. But since @benbjohnson pointed out that the Sketches block should come after the dictionary anyway, maybe we don't need this?
Wouldn't the sketch count always be two also?
At the moment, yeah. We don't need the count.
@jwilder @benbjohnson addressed comments in 0a3e879.
When we move to a new index format, and as we begin to consider efficient support for databases with billions of series, measuring series and measurement cardinality exactly will become impractical.
This PR introduces a proposal for how we could do cardinality estimation for series and measurements in a database.
It only covers two use cases: estimating the total number of series in a database and estimating the total number of measurements.
/cc @benbjohnson @jwilder @pauldix