
[Searchable Snapshots / Segment Replication] Custom query routing for performance improvements #7436

Closed
kotwanikunal opened this issue May 4, 2023 · 2 comments

@kotwanikunal
Member

kotwanikunal commented May 4, 2023

Is your feature request related to a problem? Please describe.
Searchable snapshots make it possible to query indices stored in snapshot repositories by fetching only the data a query needs, on demand. The feature uses a node-level file cache to hold downloaded pieces of data, called blocks, and to track which blocks are used frequently. Repeated or similar queries see better latency when they are routed to the same shard, because the blocks they need are more likely to already be in that shard's cache.

Segment replication provides an alternative replication mechanism between nodes by copying Lucene segment files from the primary shard to its replicas. Indexing requests are processed by the primary shard first, so querying the primary can provide better consistency.

As performance improvements, we would like to achieve the following -

  • Route queries to primary shards for better query consistency on segment-replication-enabled indices
  • Route queries to specific shards to maximize cache efficiency for searchable snapshot indices

Describe the solution you'd like
Enable custom routing of queries to maximize performance for the above use cases.
As a first step, additional preferences for query routing were added as a part of #7375

Describe alternatives you've considered

  • N/A

Additional context

@kotwanikunal added the enhancement and untriaged labels May 4, 2023
@kotwanikunal self-assigned this May 4, 2023
@kotwanikunal
Member Author

Why is the default mechanism not sufficient for SegRep/Searchable Snapshots?

Searchable snapshots make it possible to query indices stored in snapshot repositories by fetching only the data a query needs, on demand. A node-level file cache holds downloaded pieces of data, called blocks, and tracks which blocks are used frequently. Repeated or similar queries see better latency when routed to the same shard, because the blocks they need are more likely to already be cached.

Segment replication provides an alternative replication mechanism between nodes by copying Lucene segment files from the primary shard to its replicas. Indexing requests are first processed by the primary shard. When the primary refreshes, it creates new Lucene segments and opens a reader to make the newly indexed documents searchable. An asynchronous event then copies the newly created segments to the replicas over the network, and only once this completes are the documents searchable on the replicas. The time this takes is referred to as replication lag, and depending on cluster configuration the lag may be unacceptable for sensitive queries. Routing search requests to primaries provides better read-after-write consistency and avoids the replication lag for such queries.

Routing preferences that custom routing needs to account for (a configuration sketch follows this list)

  1. Adaptive Replica Selection
    • An adaptive mechanism that takes in the following constraints:
      • Response time of prior requests between the coordinating node and the eligible node
      • How long the eligible node took to run previous searches
      • Queue size of the eligible node’s search threadpool
    • Ranks the nodes on the above criteria and sorts the shard routing according to the node rankings
  2. Awareness Attributes
    • Takes the configured awareness attributes and ranks shards based on them
  3. Weighted Shard Routing
    • Weights per awareness attribute control the distribution of search or HTTP traffic across zones; commonly used for zonal deployments, heterogeneous instances, and routing traffic away from a zone during a zonal failure
    • Assigns a weight per awareness attribute and orders shards according to the assigned weights
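
As a rough illustration of how the awareness-based mechanisms above are configured today, here is a minimal Python sketch using requests. The cluster endpoint and zone names are assumptions, and the weighted routing request shape should be verified against the OpenSearch cluster awareness documentation.

```python
import requests

OPENSEARCH = "http://localhost:9200"  # assumed local cluster endpoint

# Awareness attributes: rank shard copies by the "zone" attribute.
requests.put(
    f"{OPENSEARCH}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}},
)

# Weighted shard routing: shift search traffic away from zone-c.
# Zone names are illustrative; check the request body against the docs.
requests.put(
    f"{OPENSEARCH}/_cluster/routing/awareness/zone/weights",
    json={"weights": {"zone-a": "1", "zone-b": "1", "zone-c": "0"}, "_version": -1},
)
```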

These routing mechanisms are bypassed when one of the following preferences is set at query time (see the sketch after this list):

  1. Specific Shards
  2. Specific Nodes
  3. Primary/Primary First
  4. Replica/Replica First
  5. Local
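
For reference, a minimal sketch of how these query-time preferences are supplied; the endpoint, index name, and node ID are placeholders.

```python
import requests

OPENSEARCH = "http://localhost:9200"  # assumed cluster endpoint
INDEX = "my-index"                    # hypothetical index name

# Each of these preference values bypasses the adaptive/awareness/weighted ranking above.
for preference in (
    "_shards:0,1",         # specific shards
    "_only_nodes:node-1",  # specific nodes (node ID is a placeholder)
    "_primary_first",      # primary / primary first
    "_replica_first",      # replica / replica first
    "_local",              # shard copies on the coordinating node, if available
):
    resp = requests.get(f"{OPENSEARCH}/{INDEX}/_search", params={"preference": preference})
    print(preference, resp.status_code)
```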

Test/Performance Considerations

  1. Too many requests routed to a single node (SegRep/Searchable Snapshots)
  2. Primary shards being overloaded by both indexing and querying operations (SegRep)

Proposed Solution

Segment Replication

  • For segment replication, the client should be able to route requests to the primary shard for stronger consistency.
  • This can be achieved by setting the _primary preference on search queries (see the sketch below this list).
  • Pros:
    • Simple solution
    • Client can specify the preference when needed
  • Cons:
    • Nodes holding primaries will work harder, since both indexing and query requests are routed to the same nodes/shards.
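
A minimal sketch of the client-side usage, assuming a local cluster and a hypothetical segment-replication-enabled index:

```python
import requests

OPENSEARCH = "http://localhost:9200"  # assumed cluster endpoint
INDEX = "segrep-index"                # hypothetical segment-replication-enabled index

# Route the search to primary shards only, trading extra load on the primary
# nodes for read-after-write consistency (replicas may still be waiting on
# segment copy).
resp = requests.get(f"{OPENSEARCH}/{INDEX}/_search", params={"preference": "_primary"})
print(resp.json()["_shards"])
```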

Searchable Snapshots

To achieve high file cache efficiency, repeated requests for a particular shard of the index need to be routed to the same shard copy. This can be achieved with the following solutions, where Solutions 1 and 2 can serve as precursors to Solution 3 -

  1. Use primary as the default preference
  • The default routing of requests for searchable snapshot indices should have a _primary preference
  • This ensures that only the primary copy of each shard answers queries for that shard, leading to high cache efficiency
  • Pros:
    • Simple, straightforward solution
    • Can achieve high cache efficiency
  • Cons:
    • The user is unaware of the under-the-hood change; it skips any adaptive/weighted routing settings the user might have enabled
    • A hybrid (data + search) node might see utilization imbalance and become a hot node if it holds the primaries for searchable snapshot indices and also serves primaries for local indices (indexing + searching)

  2. Alternative: hash the search request and use the hash key as the string preference

  • A string-based preference key, supplied as the search preference, provides stickiness to particular shard copies and is used where a session needs to be maintained across search operations
  • The query is routed to one copy of each shard, which may or may not be the primary; the copy chosen is a function of the supplied preference string (the same key leads to the same shard choices)
  • With this approach, we can create a hash of the user request and implicitly supply it as the query preference, so identical queries are served by the same shards (see the first sketch after these solutions)
  • Pros:
    • Simple solution
    • Requests will be evenly distributed across shards
    • Repeated queries will have a high cache efficiency
  • Cons:
    • Any minor change to the query leads to a different shard preference, which reduces cache efficiency
    • The implicit nature conflicts with user-supplied adaptive/weighted routing settings
  3. Add an adaptive algorithm with support for file cache stats
  • Introduce a new adaptive algorithm that ranks nodes on the factors mentioned above, plus additional attributes such as the node's file cache size and free space within the cache (from the node stats API), to make better decisions about where the request should be routed (see the second sketch after these solutions)
  • This might lead to additional repository calls, but it can balance requests more evenly (possibly at the cost of cache efficiency)
  • Pros:
    • Reduced likelihood of hot nodes or hot shards
    • Can function with other adaptive attributes which the user has currently set
  • Cons:
    • Might not achieve as high a cache efficiency as Solution 1 and can cause higher query latency due to its adaptive nature
    • The adaptive algorithm needs to be designed and tuned carefully to weigh the file cache node stats correctly while also preventing unnecessary block fetches
    • Can cause thrashing, since a block might be re-downloaded across nodes, evicting other blocks needed for a preferred shard
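
Sketch for Solution 2: derive a stable string preference from the normalized query body so identical queries keep hitting the same shard copies. This is an illustrative Python sketch only; the endpoint, index name, and helper are hypothetical, and in practice the hashing would happen implicitly on the coordinating node rather than in the client.

```python
import hashlib
import json
import requests

OPENSEARCH = "http://localhost:9200"    # assumed cluster endpoint
INDEX = "my-searchable-snapshot-index"  # hypothetical index name

def query_preference(body: dict) -> str:
    """Map a query body to a stable preference string.

    The same (normalized) query always hashes to the same string, so the same
    shard copies are chosen and their file caches stay warm; any change to the
    query produces a different hash and therefore different shard choices."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

body = {"query": {"term": {"status": "published"}}}
resp = requests.post(
    f"{OPENSEARCH}/{INDEX}/_search",
    params={"preference": query_preference(body)},
    json=body,
)
print(resp.json()["hits"]["total"])
```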

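Sketch for Solution 3: a toy ranking function that blends the existing adaptive signals with file cache headroom. All field names, weights, and values are placeholders, not the actual node stats schema or the real adaptive replica selection formula.

```python
from dataclasses import dataclass

@dataclass
class NodeSignals:
    # Illustrative inputs only; these do not mirror the real node stats fields.
    avg_response_ms: float        # coordinator-observed response time (EWMA)
    search_queue_size: int        # queue length of the node's search threadpool
    file_cache_free_ratio: float  # free space in the node's file cache (0..1)

def rank(node: NodeSignals) -> float:
    """Lower is better; weights are placeholders that would need tuning."""
    return (
        1.0 * node.avg_response_ms
        + 10.0 * node.search_queue_size
        - 50.0 * node.file_cache_free_ratio
    )

nodes = {
    "node-a": NodeSignals(avg_response_ms=12.0, search_queue_size=3, file_cache_free_ratio=0.10),
    "node-b": NodeSignals(avg_response_ms=15.0, search_queue_size=1, file_cache_free_ratio=0.60),
}
best = min(nodes, key=lambda name: rank(nodes[name]))
print(f"route to {best}")  # node-b: slightly slower, but far more cache headroom
```
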
Moving ahead with Solution 1 for Searchable Snapshots as the next step.

@kotwanikunal
Member Author

Created an issue for the pending task: #7593
