Add high cardinality requirements doc #7175
Conversation
> ### Performance
>
> 1. The index must be able to support 1B+ series without exhausting RAM
> 2. Startup times must not exceed 5 mins
I'd prefer this to be even lower.
I agree and I think we should aim to have it lower. I used 5 mins because with one restart in a year, that would be the amount of downtime allowed for five 9s.
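(For reference: five nines availability is 99.999% uptime, which works out to 365.25 × 24 × 60 × (1 − 0.99999) ≈ 5.3 minutes of allowable downtime per year, so a single 5-minute restart would consume essentially the whole budget.)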
Updated this to 1 min.
The index should also be able to show tag keys by measurement, and to show tag values by key and measurement (in InfluxQL terms, the statements below).
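These map onto the existing InfluxQL metadata statements, for example:

```sql
SHOW TAG KEYS FROM cpu
SHOW TAG VALUES FROM cpu WITH KEY = "host"
```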
In terms of the performance targets, we should probably have a reference machine/architecture that we expect to meet them on.
The perf targets are tricky. It might be easier to quantify things other than the length of time for startup or query planning. For example, query planning shouldn't require sorting anything (i.e. things should already be in sorted order). Or, if planning a query against a section of the index that is cold, it shouldn't require more than N disk seeks (for some value of N that makes sense). Likewise, startup shouldn't require more than opening the file handles for TSM and index files and reading the WAL to load up in-memory structures. Then also say that the WAL should be no larger than M bytes, which would be the thing that impacts startup time.
The performance targets are high-level requirements that would meet a user's needs. They should be verifiable through testing to determine whether we are meeting them or not. I think latency targets for startup make sense because that directly affects the end user (see #6250). I'd rather not have them be too specific to how the index is implemented, as this document will drive design ideas.
There isn't anything in this document around renaming measurements, tags or values. I know it's not relevant to the problem we're trying to solve, but it might be worth being sympathetic to making this possible in the future when thinking about implementation. Just a thought.

@e-dard That's a good point, but I think that could broaden the scope too much. Renaming also affects the TSM file indexes.
> 6. `SELECT count(value) FROM cpu WHERE host = 'server-01' AND location = 'us-east1' GROUP BY host`
> 7. `DROP MEASUREMENT cpu`
> 8. `DROP SERIES cpu WHERE time > now() - 1h`
> 9. `DROP SERIES cpu WHERE host = 'server-01'`
Can we add some other queries here, which have suffered from performance problems in the past? The execution time of the queries below, for example, should be fast, and should degrade gracefully with the amount of data we store. Whether we can achieve O(1), or something worse like ~O(n log n), will probably depend on the TSI implementation.

- `SELECT first(value) FROM cpu`
- `SELECT value FROM cpu ORDER BY ASC LIMIT 1`
- `SELECT last(value) FROM cpu`
- `SELECT value FROM cpu ORDER BY DESC LIMIT 1`
These queries are similar to the ones already listed in how they would interact with the index. I was trying to come up with some scenarios where the index would be used and stressed in different ways.

For example, `SELECT first(value) FROM cpu` needs to access the index in essentially the same way as `DROP MEASUREMENT cpu`, in that the index would need to be queried to determine all series keys for `cpu` and then process those series. The `first` and `last` functions, versus `ORDER BY DESC`, are more of a query engine thing than an index issue, because all four queries would hit the index to return all series for `cpu` and then let the query engine figure out the `first`, `last`, etc.

Looking at the current scenarios, I think a regex scenario is missing and should be added, as well as different boolean logic for tags (as opposed to just `AND`), which would stress how we merge series sets in the index. Something like the queries sketched below.
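As an illustration of those two gaps (the specific queries here are suggestions, not taken from the requirements doc):

```sql
-- Regex scenario: the index must resolve tag values by pattern match
SELECT count(value) FROM cpu WHERE host =~ /^server-0[0-9]$/

-- Boolean logic beyond AND: requires union-merging two series sets in the index
SELECT count(value) FROM cpu WHERE host = 'server-01' OR location = 'us-east1'
```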
> For example, `SELECT first(value) FROM cpu` needs to access the index in essentially the same way as `DROP MEASUREMENT cpu`, in that the index would need to be queried to determine all series keys for `cpu` and then process those series.

Without meaning to jump too far into implementation details in the requirements doc, would it not be possible to maintain the first/last value for `value` within the index, so we don't need to scan any series keys at all? I guess it's hard to go down that path and still provide a drop-in replacement for the current index.
One thing we may want to add later to the query language is a "starts with" operator. Starts-with matching is quite useful for doing auto-completion inside of a UI, and it can be optimized much better than regex matches.
We could also rewrite simple regular expressions with a trailing wildcard into starts-with matches, as sketched below.
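For example (the `STARTS WITH` syntax here is hypothetical, used only to illustrate the rewrite):

```sql
-- An anchored regex with a trailing wildcard...
SELECT count(value) FROM cpu WHERE host =~ /^server-.*/

-- ...could be planned as a prefix scan, as if written with a
-- hypothetical operator: WHERE host STARTS WITH 'server-'
```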
LGTM 👍
One other thing I think we might want to add to this. For finding series, measurements, tag keys and tag values, we may want to have some method for returning a list bounded by some rough period of time. For example, if a user keeps their data around for a long time and they have a bunch of hosts or docker container IDs that are old, often they'll probably only want to return the set of items that have been written to in the last 24 hours. Or if they're going back in time, they wouldn't want to see everything for all time, just the relevant entries for that time range. I don't think it needs to be accurate down to the second; even 24 hours would probably be good enough, just to narrow the space of items that show up. If accuracy is needed, then the underlying series could be queried to see if they have data in that range. What do you guys think? Something like the hypothetical query below.
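For instance (hypothetical syntax, since `SHOW` statements don't currently accept a time range in the `WHERE` clause):

```sql
-- Only return hosts whose series received writes in the last 24 hours
SHOW TAG VALUES FROM cpu WITH KEY = "host" WHERE time > now() - 24h
```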
No further action specific to the 1.3 milestone. Closing this issue. We will track ongoing TSI work separately. |
This is a start at a requirements doc for #7151.
cc @benbjohnson @e-dard @pauldix