Add high cardinality requirements doc #7175
Conversation
> ### Performance
>
> 1. The index must be able to support 1B+ series without exhausting RAM
> 2. Startup times must not exceed 5 mins
I'd prefer this to be even lower.
I agree and I think we should aim to have it lower. I used 5 mins because with one restart in a year, that would be the amount of downtime allowed for five 9s.
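(For reference: five nines availability is 99.999% uptime, which works out to 365.25 × 24 × 60 × (1 − 0.99999) ≈ 5.3 minutes of allowable downtime per year, so a single 5-minute restart would consume essentially the whole budget.)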
Updated this to 1 min.
The index should also be able to show tag keys by measurement, and to show tag values by key and measurement (in InfluxQL terms, the statements below).
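These map onto the existing InfluxQL metadata statements, for example:

```sql
SHOW TAG KEYS FROM cpu
SHOW TAG VALUES FROM cpu WITH KEY = "host"
```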
In terms of the performance targets, we should probably have a reference machine/architecture that we expect to meet them on.
The perf targets are tricky. It might be easier to quantify things other than the length of time for startup or query planning. For example, query planning shouldn't require sorting anything (i.e. things should already be in sorted order). Or, if planning a query against a section of the index that is cold, it shouldn't require more than N disk seeks (for some value of N that makes sense). Likewise, startup shouldn't require more than opening the file handles for TSM and index files and reading the WAL to load up in-memory structures. Then also say that the WAL should be no larger than M bytes, which would be the thing that impacts startup time.
The performance targets are high-level requirements that would meet a user's needs. They should be verifiable through testing to determine whether we are meeting them or not. I think latency targets for startup make sense because that directly affects the end user (see #6250). I'd rather not have them be too specific to how the index is implemented, as this document will drive design ideas.
There isn't anything in this document around renaming measurements, tags or values. I know it's not relevant to the problem we're trying to solve, but it might be worth being sympathetic to making this possible in the future when thinking about implementation. Just a thought.

@e-dard That's a good point, but I think that could broaden the scope too much. Renaming also affects the TSM file indexes.
> 6. `SELECT count(value) FROM cpu WHERE host = 'server-01' AND location = 'us-east1' GROUP BY host`
> 7. `DROP MEASUREMENT cpu`
> 8. `DROP SERIES cpu WHERE time > now() - 1h`
> 9. `DROP SERIES cpu WHERE host = 'server-01'`
Can we add some other queries here, which have suffered from performance problems in the past? The execution time of the queries below, for example, should be fast, and should degrade gracefully with the amount of data we store. Whether we can achieve O(1), or something worse like ~O(n log n), will probably depend on the TSI implementation.

- `SELECT first(value) FROM cpu`
- `SELECT value FROM cpu ORDER BY ASC LIMIT 1`
- `SELECT last(value) FROM cpu`
- `SELECT value FROM cpu ORDER BY DESC LIMIT 1`
These queries are similar to the ones already listed in how they would interact with the index. I was trying to come up with some scenarios where the index would be used and stressed in different ways.

For example, `SELECT first(value) FROM cpu` needs to access the index in essentially the same way as `DROP MEASUREMENT cpu`, in that the index would need to be queried to determine all series keys for `cpu` and then process those series. The `first` and `last` functions, versus `ORDER BY DESC`, are more of a query engine thing than an index issue, because all four queries would hit the index to return all series for `cpu` and then let the query engine figure out the `first`, `last`, etc.

Looking at the current scenarios, I think a regex scenario is missing and should be added, as well as different boolean logic for tags (as opposed to just `AND`), which would stress how we merge series sets in the index. Something like the queries sketched below.
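As an illustration of those two gaps (the specific queries here are suggestions, not taken from the requirements doc):

```sql
-- Regex scenario: the index must resolve tag values by pattern match
SELECT count(value) FROM cpu WHERE host =~ /^server-0[0-9]$/

-- Boolean logic beyond AND: requires union-merging two series sets in the index
SELECT count(value) FROM cpu WHERE host = 'server-01' OR location = 'us-east1'
```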
> For example, `SELECT first(value) FROM cpu` needs to access the index in essentially the same way as `DROP MEASUREMENT cpu`, in that the index would need to be queried to determine all series keys for `cpu` and then process those series.

Without meaning to jump too far into implementation details in the requirements doc, would it not be possible to maintain the first/last value for `value` within the index, so we don't need to scan any series keys at all? I guess it's hard to go down that path and still provide a drop-in replacement for the current index.
One thing we may want to add later to the query language is a "starts with" operator. Starts-with matching is quite useful for doing auto-completion inside of a UI, and it can be optimized much better than regex matches.
We could also rewrite simple regular expressions with a trailing wildcard into starts-with matches, as sketched below.
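For example (the `STARTS WITH` syntax here is hypothetical, used only to illustrate the rewrite):

```sql
-- An anchored regex with a trailing wildcard...
SELECT count(value) FROM cpu WHERE host =~ /^server-.*/

-- ...could be planned as a prefix scan, as if written with a
-- hypothetical operator: WHERE host STARTS WITH 'server-'
```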
LGTM 👍
One other thing I think we might want to add to this. For finding series, measurements, tag keys and tag values, we may want to have some method for returning a list bounded by some rough period of time. For example, if a user keeps their data around for a long time and they have a bunch of hosts or docker container IDs that are old, often they'll probably only want to return the set of items that have been written to in the last 24 hours. Or if they're going back in time, they wouldn't want to see everything for all time, just the relevant entries for that time range. I don't think it needs to be accurate down to the second; even 24 hours would probably be good enough, just to narrow the space of items that show up. If accuracy is needed, then the underlying series could be queried to see if they have data in that range. What do you guys think? Something like the hypothetical query below.
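For instance (hypothetical syntax, since `SHOW` statements don't currently accept a time range in the `WHERE` clause):

```sql
-- Only return hosts whose series received writes in the last 24 hours
SHOW TAG VALUES FROM cpu WITH KEY = "host" WHERE time > now() - 24h
```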
No further action specific to the 1.3 milestone. Closing this issue. We will track ongoing TSI work separately. |
This is a start at a requirements doc for #7151.
cc @benbjohnson @e-dard @pauldix