Add cardinality estimation proposal #7264

e-dard · 2016-09-07T16:17:54Z

When we move to a new index format, and as we begin to consider efficient support for databases with billions of series, measuring series and measurement cardinality exactly will become impractical.

This PR introduces a proposal for how we could do cardinality estimation for series and measurements in a database.

It covers only two use cases: estimating the total number of series and the total number of measurements in a database.

/cc @benbjohnson @jwilder @pauldix

pauldix · 2016-09-07T16:44:43Z

docs/tsm/TSI_CARDINALITY_PROPOSAL.md

+In the case of measurement removal, we will maintain a similar counter for removed measurements.
+However, when a measurement is removed, all series for that measurement are removed, so we need to get a count of the number of series belonging to the measurement, in order to increase the removed series counter too.
+We can do this by consulting the measurement hash index in the TSI
+and looking up the `len(series)` value, which tells us how many series belong to the measurement.


Is len(series) already computed in there or should we just use the count from the existing measurement sketch before deletion?

The exact number of series is available in the TSI file. See the measurements block here.
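The bookkeeping described in the quoted diff can be sketched roughly as follows. This is only an illustration: `measurementEntry`, `tombstoneCounters`, and `removeMeasurement` are hypothetical names, and a plain Go map stands in for the measurement hash index in the TSI file.

```go
package main

import "fmt"

// measurementEntry is a hypothetical stand-in for an entry in the TSI
// measurement hash index. Only the series count matters here.
type measurementEntry struct {
	seriesN uint64 // plays the role of len(series) in the proposal
}

// tombstoneCounters tracks how many measurements and series have been removed.
type tombstoneCounters struct {
	removedMeasurements uint64
	removedSeries       uint64
}

// removeMeasurement looks up the measurement's series count, adds it to the
// removed-series counter, and increments the removed-measurement counter.
// It reports false if the measurement is not in the index.
func (c *tombstoneCounters) removeMeasurement(index map[string]measurementEntry, name string) bool {
	e, ok := index[name]
	if !ok {
		return false
	}
	c.removedMeasurements++
	c.removedSeries += e.seriesN
	delete(index, name)
	return true
}

func main() {
	index := map[string]measurementEntry{
		"cpu": {seriesN: 120},
		"mem": {seriesN: 30},
	}
	var c tombstoneCounters
	c.removeMeasurement(index, "cpu")
	fmt.Println(c.removedMeasurements, c.removedSeries)
}
```

The point of the lookup is that removing one measurement implicitly removes all of its series, so both counters have to move together.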

pauldix · 2016-09-07T16:51:29Z

One question I have about HLL++: how's the accuracy when the count is low? Like < 1k, < 10k, < 100k.

e-dard · 2016-09-07T16:59:41Z

@pauldix good question. Initially I was going to prototype something, empirically determine the accuracy for low cardinalities, and find a cut-off point where we switch to exact counting. Looking at some of the existing analyses out there, though, I think the error rate is pretty good for HLL++.

For example, see pages 5 and 6 of this analysis. It shows that even for low cardinalities the error rate stays <1%. This uses a slightly higher precision than I had in mind but it's a perfectly usable precision in production.

I should probably have added something about this in the document. I still think we could consider switching to exact-counting at certain cardinality levels. When I was thinking about this before, I figured we could come up with a very loose upper-bound on cardinality by simply summing the number of series in all indices for a database. If this value is < N then we would do exact counting.

If we went down this road we would probably have to consider how the query language handles the count and estimate approaches: whether we have different functions for them, or a dynamic switch behind the same function.
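The threshold idea above could look something like this minimal sketch. `indexInfo`, `seriesUpperBound`, and `useExactCount` are hypothetical names, and the threshold N is left as a parameter because no value was proposed.

```go
package main

import "fmt"

// indexInfo is a hypothetical summary of one index belonging to a database.
type indexInfo struct {
	seriesN uint64 // number of series recorded in this index
}

// seriesUpperBound sums the per-index series counts. A series that appears in
// several indices is counted more than once, so this is only a loose upper
// bound on the database's true series cardinality.
func seriesUpperBound(indices []indexInfo) uint64 {
	var total uint64
	for _, idx := range indices {
		total += idx.seriesN
	}
	return total
}

// useExactCount reports whether the upper bound is below the threshold n,
// in which case exact counting is assumed to be cheap enough.
func useExactCount(indices []indexInfo, n uint64) bool {
	return seriesUpperBound(indices) < n
}

func main() {
	indices := []indexInfo{{seriesN: 4000}, {seriesN: 2500}, {seriesN: 1500}}
	fmt.Println(seriesUpperBound(indices))      // sum of the per-index counts
	fmt.Println(useExactCount(indices, 10000))  // bound is below 10000
	fmt.Println(useExactCount(indices, 5000))   // bound is not below 5000
}
```

Because the bound can only overestimate, a database that passes the check is guaranteed to have fewer than N series, though some small databases may still fall back to estimation.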

benbjohnson · 2016-09-08T15:14:31Z

@e-dard Overall lgtm. Two issues though:

1. The terms inside the series dictionary rely on being low offset values so we can take advantage of variable length integer encoding. I would suggest moving the sketches after the Series Dictionary.
2. The WAL is meant to be an append-only file so having sketches at the beginning wouldn't make sense. We can either recompute sketches on startup or we can snapshot sketches periodically within the WAL and recompute for a subset of the WAL.
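To illustrate the first point, the snippet below uses Go's standard `encoding/binary` unsigned varint encoding to show how the encoded size grows with the offset value. Whether the TSI format uses exactly this encoding is an assumption here, and `uvarintLen` is a helper introduced only for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uvarintLen returns how many bytes Go's unsigned varint encoding needs for v.
func uvarintLen(v uint64) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutUvarint(buf, v)
}

func main() {
	// Each varint byte carries 7 bits of payload, so larger offsets cost more bytes.
	for _, offset := range []uint64{100, 16384, 2097152} {
		fmt.Printf("offset %d -> %d byte(s)\n", offset, uvarintLen(offset))
	}
}
```

Anything placed ahead of the dictionary shifts every term offset upward by its size, which is why fixed-size sketches in front of it would cost extra bytes per reference.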

e-dard · 2016-09-08T16:50:09Z

@benbjohnson thanks for the comments.

1. Ah yes, I completely overlooked that. I'll update the schema with a new position for the sketches, after the dictionary.
2. I also overlooked this! I'll have a think over the weekend. I guess recalculating all WAL sketches on startup wouldn't be too bad, given we're typically only talking about 25MB of data per WAL, right? Periodically snapshotting the sketch and then reading the latest snapshot back into memory on startup could be another option too, as each sketch is small (~16KB).
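The snapshot option could be sketched as below. Everything here is hypothetical: the entry layout, the names, and in particular the use of an exact set of series keys as a stand-in for an HLL++ sketch, chosen only to keep the example short and checkable. A WAL without snapshots degenerates to a full recompute.

```go
package main

import "fmt"

type entryKind int

const (
	seriesEntry   entryKind = iota // a new series key was written
	snapshotEntry                  // a serialized copy of the sketch at this point
)

// walEntry is a hypothetical append-only WAL record.
type walEntry struct {
	kind     entryKind
	key      string   // set for seriesEntry
	snapshot []string // set for snapshotEntry; stand-in for serialized sketch bytes
}

// recoverSketch rebuilds the sketch on startup. It loads the most recent
// snapshot, if any, and replays only the series entries written after it.
// It returns the sketch and the number of entries that had to be replayed.
func recoverSketch(wal []walEntry) (map[string]struct{}, int) {
	sketch := make(map[string]struct{})
	start := 0
	for i := len(wal) - 1; i >= 0; i-- {
		if wal[i].kind == snapshotEntry {
			for _, k := range wal[i].snapshot {
				sketch[k] = struct{}{}
			}
			start = i + 1
			break
		}
	}
	replayed := 0
	for _, e := range wal[start:] {
		if e.kind == seriesEntry {
			sketch[e.key] = struct{}{}
			replayed++
		}
	}
	return sketch, replayed
}

func main() {
	wal := []walEntry{
		{kind: seriesEntry, key: "cpu,host=a"},
		{kind: seriesEntry, key: "cpu,host=b"},
		{kind: snapshotEntry, snapshot: []string{"cpu,host=a", "cpu,host=b"}},
		{kind: seriesEntry, key: "cpu,host=c"},
		{kind: seriesEntry, key: "cpu,host=a"},
	}
	sketch, replayed := recoverSketch(wal)
	fmt.Println(len(sketch), replayed)
}
```

Since snapshots are appended like any other entry, the WAL stays append-only, and startup cost is bounded by the distance to the latest snapshot.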

jwilder · 2016-09-09T19:43:07Z

docs/tsm/TSI_CARDINALITY_PROPOSAL.md

+```
+╔═══════════Sketches═══════════╗
+║ ┌──────────────────────────┐ ║
+║ │   Sketch Count <uint32>  │ ║


Why do we need to store this?

Since sketches are a fixed size, storing the count would tell us how far to seek to reach the dictionary if we added further sketch types to the Sketches block in the future. But since @benbjohnson pointed out that the Sketches block should come after the dictionary anyway, maybe we don't need this?

Wouldn't the sketch count always be two also?

At the moment, yeah. We don't need the count.

e-dard · 2016-09-12T13:31:10Z

@jwilder @benbjohnson addressed comments in 0a3e879.

e-dard · 2016-09-13T14:50:19Z

@jwilder updated the approach for dealing with tombstones in 13a11e8.

Add cardinality estimation proposal

633c6c1

e-dard added RFC area/tsm labels Sep 7, 2016

toddboom added the 2 - Working label Sep 7, 2016

pauldix reviewed Sep 7, 2016

jwilder removed the 2 - Working label Sep 9, 2016

jwilder reviewed Sep 9, 2016

Address comments in proposal

0a3e879

Update how tombstone sketches work

13a11e8

jwilder mentioned this pull request Sep 13, 2016

Support High Cardinality Tags and Series #7151

Closed

e-dard mentioned this pull request Oct 11, 2016

Add optimised HyperLogLog++ implementation #7450

Merged

jwilder added the area/tsi label Apr 4, 2017

jsternberg closed this Apr 23, 2018

e-dard deleted the er-hll-proposal branch January 7, 2019 11:59
