Allow indexing data in order with multiple indexing clients #1650
Will review with @pquentin. Can you comment on the urgency/priority for this?
Yes: this enhancement has a high priority for the TSDB team, as the ingestion slowdown that we are getting with TSDB is in the way of getting TSDB adopted. This slowdown has two main causes as far as we know: the way TSDB handles IDs, which the TSDB team is working on, and the index sort that TSDB configures (_tsid, then descending @timestamp).
What would you want Rally to do here? The use cases you've laid out made sense, but I don't know how to support them with multiple indexing clients, which are also used to get better indexing speed. The TSDB track has a single corpus with a single big 130G file with 123M lines, with timestamps ranging from …
While this is a kind of worst-case scenario, by definition we can't index in order with 8 indexing clients that are indexing in parallel. I think the best possible outcome in theory would be that each indexing client retrieves data sequentially from the file, so that we have 8 bulk operations of 5K documents in parallel at all times while still reading the file sequentially. We can't do that in Rally today, however, and it's also not clear it's doable without introducing a client-side bottleneck. A quick and dirty approach that seems worth trying is to have an alternate corpus with 8 files, each of them going from …
I don't know much about the Rally internals, so apologies if my suggestion is naive, but could Rally leverage the offset table to precompute an ordered queue of the bulk requests that need to run? Then each of the driver's workers would pull from this queue to learn the offset range for the next bulk request it needs to run. I wouldn't expect it to introduce more client-side bottlenecks compared to today, while providing stronger guarantees that the file gets ingested in order compared to independently ingesting multiple ordered files.
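For illustration, here is a minimal sketch of that queue-based idea. The offset-table format, bulk size, and the send_bulk helper are assumptions made up for the example, not Rally's actual internals:

```python
import multiprocessing
from queue import Empty

BULK_SIZE = 10_000  # documents per bulk request (assumed)


def load_offset_table(path):
    # Hypothetical offset table: one byte offset per line of the documents file.
    with open(path) as f:
        return [int(line) for line in f]


def build_bulk_queue(offsets, file_size, bulk_size=BULK_SIZE):
    # Ordered queue of (start, end) byte ranges, one per bulk request,
    # preserving the order of the documents file.
    queue = multiprocessing.Queue()
    for start_line in range(0, len(offsets), bulk_size):
        end_line = start_line + bulk_size
        end = offsets[end_line] if end_line < len(offsets) else file_size
        queue.put((offsets[start_line], end))
    return queue


def send_bulk(body):
    # Placeholder: a real implementation would send `body` to the _bulk API.
    pass


def indexing_client(queue, docs_path):
    # Each worker pulls the next range and sends it as a single bulk request,
    # so bulk requests are started in roughly file (i.e. timestamp) order.
    with open(docs_path, "rb") as f:
        while True:
            try:
                start, end = queue.get_nowait()
            except Empty:
                return
            f.seek(start)
            send_bulk(f.read(end - start))
```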
Is there a benefit to using multiple clients if a single file is being ingested sequentially? Could a single client be used?
Yes: Elasticsearch's threading model for ingestion consists of using one thread per target shard of the bulk request, which is why more clients than cores are needed to keep all CPUs busy.
While I understand why we need more clients than the number of cores, I'm not following this specific argument: if you have a single shard to index into, how can you use more than one CPU? Per your first sentence, only one thread could be active since we have a single shard in that cluster. On the other hand, with a single-node cluster and a single index but N shards, I see how you could use N CPUs.
Sorry if my phrasing was unclear. Elasticsearch uses one thread per shard per bulk request. So you need concurrent bulk requests (e.g. via multiple clients) to use more than one CPU per shard.
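To make the concurrency point concrete, a hedged sketch of several clients sending bulk requests at once with the async Python client (the endpoint and index name are assumptions, and the body format is left to the caller):

```python
import asyncio
from elasticsearch import AsyncElasticsearch


async def send_bulks(bodies, num_clients=8):
    # With a single target shard, each in-flight bulk request is handled by one
    # write thread, so several concurrent requests are needed before more than
    # one CPU can be busy for that shard.
    es = AsyncElasticsearch("http://localhost:9200")  # assumed endpoint
    queue: asyncio.Queue = asyncio.Queue()
    for body in bodies:
        queue.put_nowait(body)

    async def client():
        while True:
            try:
                body = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            # 8.x client API; "tsdb-test" is a hypothetical index name.
            await es.bulk(operations=body, index="tsdb-test")

    try:
        await asyncio.gather(*(client() for _ in range(num_clients)))
    finally:
        await es.close()
```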
So one core assumption of Rally is that all operations (including bulk indexing) behave the same at the top level:
Reading your suggestion more closely, I realize that instead of splitting by the number of indexing clients (8 in my example) we could split by the number of bulk requests (116_633_698 lines / 10_000 bulk size, that is 11_664 offsets) and reuse the same mmap approach. Just as above, each client can work out independently which parts of the file it needs to read. Not only does this look possible without a big rearchitecture, it should also be fast enough. Thanks! I'll try to figure out how much time that would take to implement.
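A rough sketch of that split, reusing the hypothetical offset table and send_bulk helper from the earlier sketch: each client takes every Nth bulk-sized range, so no coordination is needed and all clients advance through the file together. This is an illustration of the idea, not Rally's implementation:

```python
import mmap

BULK_SIZE = 10_000  # assumed


def client_bulk_ranges(offsets, file_size, client_id, total_clients, bulk_size=BULK_SIZE):
    # Bulk request i covers lines [i * bulk_size, (i + 1) * bulk_size); client
    # `client_id` handles bulks client_id, client_id + total_clients, ... so
    # every client stays close to the same position in the file at any time.
    num_bulks = (len(offsets) + bulk_size - 1) // bulk_size
    for bulk_index in range(client_id, num_bulks, total_clients):
        start = offsets[bulk_index * bulk_size]
        end_line = (bulk_index + 1) * bulk_size
        end = offsets[end_line] if end_line < len(offsets) else file_size
        yield start, end


def index_with_mmap(docs_path, offsets, file_size, client_id, total_clients):
    # Memory-map the documents file and send each of this client's ranges as
    # one bulk request, in file order.
    with open(docs_path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for start, end in client_bulk_ranges(offsets, file_size, client_id, total_clients):
            send_bulk(mm[start:end])  # send_bulk as in the previous sketch
```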
Given that the code change approach is likely to take time to implement, we agreed offline to try the alternate corpus approach first, except that we're going to use 16 indexing clients since our benchmarks run on an m5d.4xlarge instance, which has 16 vCPUs. Here's the list of steps to add such a new TSDB corpus:
* Add a split corpora to TSDB to allow sequential indexing (as detailed here: elastic/rally#1650)
* Rename corpus:split to corpus:split16
After a false start, I was able to implement the hacky approach to rearrange the file and index roughly in order. On my laptop, which is much less stable than a true benchmarking environment, here's the difference in distribution at 13% indexing vs. 18% with 16 clients:

  "2021-04-28T17:00:00.000Z": 2557494,
"2021-04-28T18:00:00.000Z": 3742052,
"2021-04-28T19:00:00.000Z": 3704665,
- "2021-04-28T20:00:00.000Z": 1979124,
+ "2021-04-28T20:00:00.000Z": 3693782,
- "2021-04-28T21:00:00.000Z": 1241240,
+ "2021-04-28T21:00:00.000Z": 2985247,
"2021-04-28T22:00:00.000Z": 1233969,
- "2021-04-28T23:00:00.000Z": 389828,
+ "2021-04-28T23:00:00.000Z": 1228967,
- "2021-04-29T00:00:00.000Z": 0,
+ "2021-04-29T00:00:00.000Z": 1243619,
- "2021-04-29T01:00:00.000Z": 0,
+ "2021-04-29T02:00:00.000Z": 169134,
- "2021-04-29T02:00:00.000Z": 0,
+ "2021-04-29T01:00:00.000Z": 1238765,
"2021-04-29T03:00:00.000Z": 0,
"2021-04-29T04:00:00.000Z": 0,
"2021-04-29T05:00:00.000Z": 0,
"2021-04-29T06:00:00.000Z": 0,
"2021-04-29T07:00:00.000Z": 0,
"2021-04-29T08:00:00.000Z": 0,
"2021-04-29T09:00:00.000Z": 0,
"2021-04-29T10:00:00.000Z": 0,
"2021-04-29T11:00:00.000Z": 0,
"2021-04-29T12:00:00.000Z": 0,
"2021-04-29T13:00:00.000Z": 0,
"2021-04-29T14:00:00.000Z": 0,
"2021-04-29T15:00:00.000Z": 0,
"2021-04-29T16:00:00.000Z": 0,
"2021-04-29T17:00:00.000Z": 9 (Yes, I should investigate the lone 9 at the end.) This did improve indexing speed, but surprisingly it improved all configurations, not only the time-series benchmarks where index sorting is applied: @martijnvg will investigate more I believe. In parallel, I've also worked towards implementing the proper change in Rally, but it requires touching the most complex part of Rally. So far I was able to open two pull requests towards that goal: It's also OK if I don't finish this given the hacky approach solves the immediate problem. |
Thanks a lot @pquentin. This work is increasing our confidence in the TSDB overhead that is reported in nightly benchmarks and has already helped highlight that the way Elasticsearch performs merging is not good at maintaining the original ordering of the data. This will help us move forward with changes that rely on ingestion order like elastic/elasticsearch#92456, elastic/elasticsearch#92684, and I'm sure many more in the future.
This sounds right. If ingesting data in order proves too hard to do in Rally, I think we'll want to do a similar hack for the Logging track, where the ingestion order could also affect ingestion rates and query performance.
This won't be needed as the elastic/logs (and elastic/security) tracks completely sidestep this logic by generating their own data with dynamic timestamps and thus already index in order.
Is it like the hack we did for TSDB, where each client sends data in order but data may not get ingested in global order if one client goes a bit faster than another for some reason, or do the Logging and Security tracks have stronger guarantees around global order?
No, the guarantee is much stronger, as the timestamps are generated on demand just before indexing directly in the hot code path. Despite the stronger guarantee, the design of those tracks is more or less considered a mistake now as:
If you need that stronger guarantee, we can look into doing this for the TSDB track too.
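For illustration only, generating timestamps on demand just before indexing (as described above) could look something like the sketch below; this is a simplified stand-in, not the actual elastic/logs or elastic/security implementation:

```python
import json
from datetime import datetime, timezone, timedelta


def bulk_bodies(documents, bulk_size=10_000, start=None, step=timedelta(milliseconds=10)):
    # Stamp monotonically increasing timestamps onto documents right before
    # they are sent, so ingestion order and timestamp order match by construction.
    ts = start or datetime.now(timezone.utc)
    body = []
    for doc in documents:
        doc["@timestamp"] = ts.isoformat()
        ts += step
        body.append(json.dumps({"index": {}}))
        body.append(json.dumps(doc))
        if len(body) >= 2 * bulk_size:
            yield "\n".join(body) + "\n"
            body = []
    if body:
        yield "\n".join(body) + "\n"
```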
Thanks for the pointers!
Your above comment states that this design is considered a mistake, so I'm unsure how you feel about this approach vs. the current hack. I don't know how good the current hack is at giving good approximate global ordering, but in general, having better guarantees around it to better mimic what happens in the real world feels like a desirable feature.
I've discussed this offline quickly with @dliappis and then with @martijnvg, and we agreed that the split16 corpus will do for now, as the elastic/logs way would have too many drawbacks. In elastic/rally-tracks#378 I fixed the remaining blocker and made sure we don't index a few documents from the end first. (Which was the issue mentioned in #1650 (comment).)

Initially I wanted to implement this approach in Rally directly, but was only able to merge #1659 before having to move on. This is still satisfying as it simplifies the existing code significantly, and I now understand this core and complicated part of Rally much better. (I also opened #1666 to document my findings.)

If we need this behavior for other tracks, we can always reuse the TSDB approach: https://github.com/elastic/rally-tracks/tree/master/tsdb. The README explains how to run _tools/split.py to generate the wanted file. If we find ourselves needing this in many places, then it will be time to implement this in Rally itself, but for now there's nothing actionable left in this issue, so I'll close it.
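The authoritative version of _tools/split.py lives in the rally-tracks repository. Purely as an illustration of the idea, a script along these lines could rearrange a line-delimited corpus so that when N clients each read a contiguous 1/N slice, all clients advance through the original timeline together (bulk size, client count, and file handling are assumptions, and the real script may work differently):

```python
BULK_SIZE = 10_000   # assumed
NUM_CLIENTS = 16     # assumed


def split(src_path, dst_path, bulk_size=BULK_SIZE, num_clients=NUM_CLIENTS):
    # Loads the whole file in memory; fine for a sketch, not for a 130G corpus.
    with open(src_path, "rb") as src:
        lines = src.readlines()
    bulks = [lines[i:i + bulk_size] for i in range(0, len(lines), bulk_size)]
    with open(dst_path, "wb") as dst:
        for client in range(num_clients):
            # Client `client`'s contiguous slice gets bulks client, client + N,
            # client + 2N, ... in their original (time) order, so at any point
            # during indexing all clients work on roughly the same time range.
            for bulk_index in range(client, len(bulks), num_clients):
                dst.writelines(bulks[bulk_index])
```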
When indexing data with multiple clients, Rally may not ingest data in the same order as the input file. This is probably fine when ingesting non-time-based datasets, as the order in which documents get ingested should not be too relevant. However, this is problematic for time-based data, as we would expect different performance characteristics depending on whether time-based data gets ingested in (mostly) increasing time order vs. not:

* Indexes sorted on @timestamp involve more reordering at flush and merge time when data gets ingested in random order, making ingestion possibly slower than what users would observe in practice. This currently affects TSDB data streams, which sort data on _tsid then descending @timestamp, and would be relevant for some other datasets, including Logging datasets that we're contemplating updating to be sorted by descending @timestamp by default.
* Range queries on @timestamp are more costly when data gets ingested in random order. When data gets ingested in order, there are higher chances that segments either fully match or don't match a given range filter at all, which is something that Lucene can handle more efficiently than a partially matching segment. On the other hand, if indexing happens in a somewhat random order, then most segments are likely to partially match a range filter on @timestamp, which would make search performance worse than what users would observe in practice.

Having a way to make Rally honor the order in which documents occur in the input file would help create tracks that have more fidelity to the performance characteristics that users would observe in production.