Adding a split cache in Searchers #3857
Conversation
@@ -327,6 +343,9 @@ impl Handler<PackagedSplitBatch> for Uploader {
counters.num_staged_splits.fetch_add(split_metadata_list.len() as u64, Ordering::SeqCst);

let mut packaged_splits_and_metadata = Vec::with_capacity(batch.splits.len());

event_broker.publish(ReportSplitsRequest { report_splits });
Why do we report the splits after a successful upload rather than a successful publish?
There were a bunch of reasons that are not necessarily relevant, to be honest.
There were two original motivations:
- Getting more info in the ReportSplit message to eventually make it possible to improve on our cache logic: I strongly suspect we might want to be smart enough to prioritize caching mature splits. (Aside: interestingly, the number of merge ops before split maturity should not matter much. The idea there is that a split with merge ops n+1 is m times larger than a split with maturity n, and its life expectancy is also m times larger... Overall you get roughly the same bang for the buck by downloading one or the other; a quick numeric illustration follows this list.)
- One of our customers oscillated about the need to make sure we hit the cache every single time. Putting the report before publish makes it possible to enrich the configuration to change the behavior. For instance, someone who really cares about caching everything could use a config flag to block publishing until at least one node has a copy in its cache.
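To make the bang-for-the-buck point above concrete (illustrative numbers only, not from this PR): if a split at maturity n weighs 1 GB and typically survives 1 hour before it is merged away, then with m = 10 a split at maturity n+1 weighs about 10 GB and survives about 10 hours. Downloading the bigger split costs ten times the bytes, but the cached copy stays useful ten times longer, so the payoff per downloaded byte, roughly lifetime divided by size, comes out to about 1 hour per GB in either case.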
There was another place that I considered, but at that point I did not have access to the list of split ids, only to them one by one. I thought it was a bit sad to increase the number of RPCs.
These are not great reasons. I'll see if I can move stuff to the MetastoreEventPublisher.
Ah no, the thing we are missing in the publisher is the storage URI.
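For context, here is a rough sketch of what building that report could look like on the uploader side, assuming the ReportSplit message carries a split id together with the storage URI discussed above. The types, field names, and the helper function are inferred from the diff for illustration only, not copied from the actual Quickwit code:

```rust
// Hypothetical message types, modeled on the diff above; the real
// Quickwit definitions may differ.
struct ReportSplit {
    storage_uri: String,
    split_id: String,
}

struct ReportSplitsRequest {
    report_splits: Vec<ReportSplit>,
}

/// Builds the request published on the event broker after a successful upload.
/// The storage URI is the piece of information the MetastoreEventPublisher
/// lacks, which is why the report is emitted from the uploader instead.
fn make_report_splits_request<I>(storage_uri: &str, split_ids: I) -> ReportSplitsRequest
where
    I: IntoIterator<Item = String>,
{
    let report_splits = split_ids
        .into_iter()
        .map(|split_id| ReportSplit {
            storage_uri: storage_uri.to_string(),
            split_id,
        })
        .collect();
    ReportSplitsRequest { report_splits }
}
```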
Co-authored-by: Adrien Guillo <[email protected]>
Searcher split cache
Quickwit includes a split cache. It can be useful for specific workloads.
The split cache stores entire split files on disk. It works under a set of configurable constraints that bound how much data is kept on disk.
Searchers get tipped off by indexers about the existence of splits (for which they have the best affinity).
They might also learn about the existence of splits upon read requests.
Each searcher is then in charge of maintaining an in-memory data structure with a bounded list of the splits it knows about and their scores.
The current admission/eviction strategy is simple LRU logic.
When a split that is not already in the cache is accessed, we consider downloading it.
If the limits have been reached, we only proceed to eviction if one of the splits currently in the cache has been accessed less recently.
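Below is a minimal sketch of this admission/eviction decision, assuming an in-memory map from split id to a last-access timestamp, a size, and a cached flag, plus configurable byte and count limits. All type and field names, and the exact comparison rule, are illustrative rather than taken from the actual implementation:

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Sketch only: bookkeeping for a split the searcher knows about,
/// whether or not it currently lives in the on-disk cache.
struct KnownSplit {
    last_accessed: Instant,
    num_bytes: u64,
    in_cache: bool,
}

/// Outcome of considering a split access.
enum Admission {
    /// Already cached: nothing to do beyond the LRU touch.
    AlreadyCached,
    /// Room left under the limits: download without evicting.
    Download,
    /// Limits reached: download after evicting this split id.
    DownloadAndEvict(String),
    /// Limits reached and no cached split is a better eviction candidate.
    Skip,
}

struct SplitCacheState {
    /// Bounded in the real implementation; unbounded here for brevity.
    known_splits: HashMap<String, KnownSplit>,
    max_num_splits: usize,
    max_num_bytes: u64,
    cached_bytes: u64,
}

impl SplitCacheState {
    /// Called when a split is accessed at `access_time`.
    fn on_split_access(&mut self, split_id: &str, access_time: Instant) -> Admission {
        let Some(candidate) = self.known_splits.get_mut(split_id) else {
            // Unknown split: not tracked yet, nothing to decide.
            return Admission::Skip;
        };
        let previous_access = candidate.last_accessed;
        candidate.last_accessed = access_time;
        if candidate.in_cache {
            return Admission::AlreadyCached;
        }
        let candidate_num_bytes = candidate.num_bytes;
        let num_cached = self.known_splits.values().filter(|s| s.in_cache).count();
        let within_limits = num_cached < self.max_num_splits
            && self.cached_bytes + candidate_num_bytes <= self.max_num_bytes;
        if within_limits {
            return Admission::Download;
        }
        // Limits reached: pick the least recently accessed cached split and
        // evict it only if it is older than the candidate's previous access.
        let lru_cached = self
            .known_splits
            .iter()
            .filter(|(_, split)| split.in_cache)
            .min_by_key(|(_, split)| split.last_accessed);
        match lru_cached {
            Some((lru_id, lru_split)) if lru_split.last_accessed < previous_access => {
                Admission::DownloadAndEvict(lru_id.clone())
            }
            _ => Admission::Skip,
        }
    }
}
```

The bookkeeping performed on an actual download or eviction (updating cached_bytes, the in_cache flags, and the files on disk) is left out of this sketch.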