Add cardinality estimation proposal #7264

Closed
wants to merge 3 commits into from

e-dard (Contributor) commented Sep 7, 2016

When we move to a new index format, and as we begin to consider efficient support for databases with billions of series, measuring series and measurement cardinality exactly will become impractical.

This PR introduces a proposal for how we could do cardinality estimation for series and measurements in a database.

It only covers two use cases: estimating the total number of series and the total number of measurements in a database.

/cc @benbjohnson @jwilder @pauldix

In the case of measurement removal, we will maintain a similar counter for removed measurements.
However, when a measurement is removed, all series for that measurement are removed, so we need to get a count of the number of series belonging to the measurement, in order to increase the removed series counter too.
We can do this by consulting the measurement hash index in the TSI
and looking up the `len(series)` value, which tells us how many series belong to the measurement.

Member commented:

Is `len(series)` already computed in there, or should we just use the count from the existing measurement sketch before deletion?

Contributor Author commented:

The exact number of series is available in the TSI file. See the measurements block here.
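
As a rough illustration of the counter approach in the excerpt above, here is a minimal Go sketch. Everything in it is hypothetical: `measurementIndex` stands in for the TSI measurement block (which holds the exact series count per measurement), and subtracting the removed counter from the sketch estimate, clamped at zero, is an assumption about how the counters would be applied rather than something the proposal spells out here.

```go
package main

import "fmt"

// measurementIndex is a hypothetical stand-in for the TSI measurement block:
// it maps a measurement name to the exact number of series it owns.
type measurementIndex map[string]uint64

// removalCounters tracks how many measurements and series have been removed.
type removalCounters struct {
	removedMeasurements uint64
	removedSeries       uint64
}

// removeMeasurement records the removal of a measurement and all of its
// series. It returns false, leaving the counters unchanged, if the
// measurement is not in the index.
func (c *removalCounters) removeMeasurement(idx measurementIndex, name string) bool {
	n, ok := idx[name]
	if !ok {
		return false
	}
	c.removedMeasurements++
	c.removedSeries += n // every series of the measurement goes with it
	delete(idx, name)
	return true
}

// adjusted subtracts a removal counter from a sketch estimate (assumed
// usage). The estimate is approximate, so the result is clamped at zero.
func adjusted(estimate, removed uint64) uint64 {
	if removed > estimate {
		return 0
	}
	return estimate - removed
}

func main() {
	idx := measurementIndex{"cpu": 1200, "mem": 300}
	var c removalCounters

	c.removeMeasurement(idx, "cpu")
	fmt.Println(c.removedMeasurements, c.removedSeries) // 1 1200
	fmt.Println(adjusted(1510, c.removedSeries))        // 310
}
```

The point of the lookup is that removing one measurement has to bump two counters, and only the index knows how many series the measurement had.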

pauldix (Member) commented Sep 7, 2016

One question I have about HLL++: how's the accuracy when the count is low, e.g. < 1k, < 10k, < 100k?

e-dard (Contributor Author) commented Sep 7, 2016

@pauldix good question. Initially I was going to prototype something, empirically determine the accuracy for low cardinalities, and find a cut-off point where we switch to counting. Looking at some of the existing analyses out there, though, I think the error rate is pretty good for HLL++.

For example, see pages 5 and 6 of this analysis. It shows that even for low cardinalities the error rate stays <1%. This uses a slightly higher precision than I had in mind but it's a perfectly usable precision in production.
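
For a feel of how precision trades memory against accuracy, here is a small Go sketch. It uses the usual HyperLogLog approximation that the relative standard error is about 1.04/sqrt(m) for m = 2^p registers. The precisions listed are illustrative only, since this thread does not state which precision is intended, and the one-byte-per-register size is an assumption (packed encodings are smaller). Note also that this formula describes the dense estimator; at low cardinalities HLL++ relies on a sparse representation and linear counting, so the figures below are not a statement about the low range asked about above.

```go
package main

import (
	"fmt"
	"math"
)

// registers returns the number of registers m = 2^p for precision p.
func registers(p uint) uint64 {
	return 1 << p
}

// stdError returns the approximate relative standard error of a dense
// HyperLogLog sketch with 2^p registers: 1.04 / sqrt(m).
func stdError(p uint) float64 {
	return 1.04 / math.Sqrt(float64(registers(p)))
}

// sizeBytes returns the sketch size assuming one byte per register
// (an assumption; real implementations may pack registers more tightly).
func sizeBytes(p uint) uint64 {
	return registers(p)
}

func main() {
	// Illustrative precisions only.
	for _, p := range []uint{12, 14, 16} {
		fmt.Printf("p=%d registers=%d size=%dB stderr=%.4f\n",
			p, registers(p), sizeBytes(p), stdError(p))
	}
}
```

At p = 14 this gives 16384 registers and a standard error of roughly 0.8%.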

I should probably have added something about this in the document. I still think we could consider switching to exact-counting at certain cardinality levels. When I was thinking about this before, I figured we could come up with a very loose upper-bound on cardinality by simply summing the number of series in all indices for a database. If this value is < N then we would do exact counting.
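
The switch described above could be sketched in Go as follows. The function names and the threshold value are hypothetical; the only idea taken from the comment is that summing per-index series counts gives a loose upper bound (a series present in several indices is counted more than once), and that exact counting is used when this bound is below N.

```go
package main

import "fmt"

// seriesUpperBound sums the series counts of all indices in a database.
// A series that appears in more than one index is counted repeatedly, so
// the result is an upper bound on the true cardinality, not the cardinality.
func seriesUpperBound(perIndex []uint64) uint64 {
	var total uint64
	for _, n := range perIndex {
		total += n
	}
	return total
}

// useExactCount reports whether the upper bound is below the threshold,
// in which case exact counting is assumed to be cheap enough.
func useExactCount(perIndex []uint64, threshold uint64) bool {
	return seriesUpperBound(perIndex) < threshold
}

func main() {
	const threshold = 1000 // hypothetical value of N

	small := []uint64{400, 350, 200}
	large := []uint64{900, 800}

	fmt.Println(seriesUpperBound(small), useExactCount(small, threshold)) // 950 true
	fmt.Println(seriesUpperBound(large), useExactCount(large, threshold)) // 1700 false
}
```

Because the bound can only overestimate, the check never picks exact counting for a database whose true cardinality is at or above N; it may, however, fall back to estimation for a database that is actually below N.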

If we went down this road, we would probably have to consider how to handle the count and estimate approaches in the query language: whether we have different functions for them, or a dynamic switch within the same function.

benbjohnson (Contributor) commented

@e-dard Overall LGTM. Two issues though:

  1. The terms inside the series dictionary rely on having low offset values so that we can take advantage of variable-length integer encoding. I would suggest moving the sketches after the Series Dictionary.
  2. The WAL is meant to be an append-only file, so having sketches at the beginning wouldn't make sense. We can either recompute sketches on startup, or snapshot sketches periodically within the WAL and recompute for a subset of the WAL.

e-dard (Contributor Author) commented Sep 8, 2016

@benbjohnson thanks for the comments.

  1. Ah yes, I completely overlooked that. I'll update the schema with a new position for the sketches after the dictionary.
  2. I also overlooked this! I'll have a think over the weekend. I guess recalculating all WAL sketches on startup wouldn't be too bad, given that we're typically only talking about 25MB of data per WAL, right? Periodically snapshotting the sketch and then reading the latest snapshot back into memory on startup could be another option too, as each sketch is small (~16KB).
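
The snapshot option in item 2 could look roughly like the Go sketch below. All types here are hypothetical: `walEntry` is not the real WAL record format, and `setSketch` is an exact set used as a stand-in for an HLL++ sketch so the example stays short. A real snapshot would hold the serialized sketch rather than a list of keys.

```go
package main

import "fmt"

type entryKind int

const (
	seriesEntry   entryKind = iota // a new series key was written
	snapshotEntry                  // a snapshot of the sketch state
)

// walEntry is a hypothetical append-only WAL record.
type walEntry struct {
	kind     entryKind
	key      string   // set for seriesEntry
	snapshot []string // set for snapshotEntry (stand-in for sketch bytes)
}

// setSketch is an exact stand-in for a cardinality sketch.
type setSketch map[string]struct{}

func (s setSketch) add(key string) { s[key] = struct{}{} }
func (s setSketch) count() int     { return len(s) }

// rebuild restores the most recent snapshot, if any, and replays only the
// series entries written after it. It returns the sketch and the number of
// series entries that had to be replayed.
func rebuild(wal []walEntry) (setSketch, int) {
	sk := setSketch{}
	start := 0
	// Scan backwards for the latest snapshot.
	for i := len(wal) - 1; i >= 0; i-- {
		if wal[i].kind == snapshotEntry {
			for _, k := range wal[i].snapshot {
				sk.add(k)
			}
			start = i + 1
			break
		}
	}
	replayed := 0
	for _, e := range wal[start:] {
		if e.kind == seriesEntry {
			sk.add(e.key)
			replayed++
		}
	}
	return sk, replayed
}

func main() {
	wal := []walEntry{
		{kind: seriesEntry, key: "cpu,host=a"},
		{kind: seriesEntry, key: "cpu,host=b"},
		{kind: snapshotEntry, snapshot: []string{"cpu,host=a", "cpu,host=b"}},
		{kind: seriesEntry, key: "mem,host=a"},
		{kind: seriesEntry, key: "cpu,host=a"},
	}
	sk, replayed := rebuild(wal)
	fmt.Println(sk.count(), replayed) // 3 2
}
```

With no snapshot in the WAL, `rebuild` falls back to replaying every entry, which corresponds to the recompute-on-startup option.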

```
╔═══════════Sketches═══════════╗
║ ┌──────────────────────────┐ ║
║ │ Sketch Count <uint32>    │ ║
```

Contributor commented:

Why do we need to store this?

Contributor Author commented:

Since sketches are a fixed size, the count would tell us how far to seek to reach the dictionary if we added further sketch types to the Sketches block in the future. Since @benbjohnson pointed out that the Sketches block should come after the dictionary anyway, maybe we don't need this?

Contributor commented:

Wouldn't the sketch count always be two anyway?

Contributor Author commented:

At the moment, yeah. We don't need the count.

e-dard (Contributor Author) commented Sep 12, 2016

@jwilder @benbjohnson addressed comments in 0a3e879.

e-dard (Contributor Author) commented Sep 13, 2016

@jwilder updated the approach for dealing with tombstones in 13a11e8.
