Correct service time parameter in ARS formula #70283

Merged (7 commits) on Mar 18, 2021

Conversation

@jtibshirani commented Mar 11, 2021

The adaptive replica selection algorithm implements the C3 algorithm
for ranking nodes. The formula defines service time as a quantity 1/muBarS.
Our implementation accidentally plugs in service time for muBarS instead of
1/muBarS. This commit corrects the formula and adds invariant tests to
confirm it behaves as expected.
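
For readers unfamiliar with the formula, here is a minimal, self-contained sketch of the corrected rank. It assumes the measured service-time moving average is used directly as 1/muBarS; the variable names follow the snippets quoted later in this thread, but the class and method are illustrative, not the actual Elasticsearch code.

// Illustrative only: mirrors the corrected formula discussed in this PR.
final class RankSketch {
    static double rank(double rS,                     // EWMA of response time
                       double serviceTime,            // EWMA of service time, treated as 1/muBarS
                       double qHatS,                  // estimated queue size
                       double queueAdjustmentFactor) {
        double muBarSInverse = serviceTime;
        return rS - muBarSInverse + Math.pow(qHatS, queueAdjustmentFactor) * muBarSInverse;
    }
}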

This change also fixes a bug in how we adjust node statistics. To ensure that
nodes with high ranks occasionally get selected, every time we select a
'winner' node, we average its stats with the node's stats and add it to the
moving average. For service time, we were accidentally overwriting the whole
moving average with the new stats, which caused the ranks to adjust too
quickly. This issue has a much bigger impact now that the formula correctly
incorporates service time, and is important to fix so the behavior remains
reasonable.
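
To make the second fix concrete, here is a hedged before/after sketch of the service-time adjustment. It borrows the names used in the diff hunk quoted further down (stats for the node being adjusted, minStats for the best-ranked 'winner'); the 'before' line is a guess at the shape of the bug based on the description above, and the surrounding code is simplified, not the real implementation.

// Before (the bug described above): the node's service-time estimate was simply
// replaced with the newly computed value, so one adjustment could swing the rank sharply.
stats.serviceTime = (minStats.serviceTime + stats.serviceTime) / 2;

// After (the fix): the averaged value is fed into a moving average seeded with the
// node's current estimate, so the rank adjusts gradually instead of jumping.
ExponentiallyWeightedMovingAverage avgServiceTime =
    new ExponentiallyWeightedMovingAverage(ResponseCollectorService.ALPHA, stats.serviceTime);
avgServiceTime.addValue((minStats.serviceTime + stats.serviceTime) / 2);
stats.serviceTime = avgServiceTime.getAverage();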

Fixes #65838.

@jtibshirani (author) commented Mar 11, 2021

As mentioned in the issue, for moderate queue sizes the existing behavior roughly ranks based on response time, so in many cases we may not see a real difference with this change. However (based on my understanding of the formula), there are possible improvements; a rough numeric illustration follows this comment:

  • The ranking could be more responsive, since we now better incorporate non-local information about each node's service time and queue size
  • When nodes have larger queue sizes, we'll always make the right selection, whereas before we might pick a node with worse service time

Note that I have not run large scale end-to-end tests to assess the impact. I was hoping for feedback here: what level of end-to-end testing do we think is appropriate for this change? The ARS algorithm has a few moving parts (for example stats adjustments), so it can be hard to fully understand the behavior with only unit tests.
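
A rough numeric illustration of the points above, using made-up values (they assume response and service times are tracked in nanoseconds and a queueAdjustmentFactor of 3; none of the numbers come from the PR):

double rS = 2_000_000;          // EWMA of response time
double serviceTime = 5_000_000; // EWMA of service time, used as 1/muBarS
double qHatS = 10;              // estimated queue size

// Old formula: both service-time terms are negligible, so the rank is effectively
// just the response time (roughly 2,000,000.0002 here).
double oldRank = rS - (1.0 / serviceTime) + Math.pow(qHatS, 3) / serviceTime;

// Corrected formula: the queue term dominates as queues grow (roughly 4,997,000,000 here),
// so a node with a worse service time or deeper queue can no longer win on response time alone.
double newRank = rS - serviceTime + Math.pow(qHatS, 3) * serviceTime;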

@jtibshirani marked this pull request as ready for review March 11, 2021 01:36
@jtibshirani added the :Search/Search (Search-related issues that do not fall into other categories), >bug, v7.13.0, and v8.0.0 labels on Mar 11, 2021
@elasticmachine added the Team:Search (Meta label for search team) label on Mar 11, 2021
@elasticmachine (Collaborator): Pinging @elastic/es-search (Team:Search)

@jtibshirani marked this pull request as draft March 11, 2021 17:22
ClusterService clusterService = ClusterServiceUtils.createClusterService(threadPool);
ResponseCollectorService collector = new ResponseCollectorService(clusterService);
Map<String, Long> outstandingRequests = new HashMap<>();
@jtibshirani (author) commented on this diff:

When a 'winner' node is selected, we update the local 'outstanding requests' map to increment its value by 1. This only affects the local copy; it doesn't update global statistics. While this has a purpose in the ARS logic, it doesn't make sense in the context of this test. So I removed the shared outstandingRequests map here and added another test that explicitly checks this behavior.
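
For context, a hypothetical illustration of the bookkeeping described above (not the actual selection code): once a node is chosen as the winner, only the caller's local outstanding-requests map is incremented.

Map<String, Long> outstandingRequests = new HashMap<>();
String winnerNodeId = "node_a"; // stand-in for whichever node ranked best
// The increment touches only this local map; no shared or global statistics change.
outstandingRequests.merge(winnerNodeId, 1L, Long::sum);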

@jtibshirani marked this pull request as ready for review March 15, 2021 23:29
@jimczi reviewed and left a comment:

Great catch @jtibshirani! The formula looks much better now ;).

// Seed a fresh moving average with the node's current service-time estimate,
// then nudge it toward the winner (minStats) by adding the midpoint of the two values.
ExponentiallyWeightedMovingAverage avgServiceTime = new ExponentiallyWeightedMovingAverage(
ResponseCollectorService.ALPHA, stats.serviceTime);
avgServiceTime.addValue((minStats.serviceTime + stats.serviceTime) / 2);
final long updatedService = (long) avgServiceTime.getAverage();
Review comment from a contributor (on the diff above):

As discussed offline, the "adjustment" seems odd. We should think about updating the node statistics explicitly. That's not in the scope of this PR, but it would be a good follow-up.

Review comment from a contributor:
@jtibshirani
I'm curious, in NodeStatistics:

final ExponentiallyWeightedMovingAverage queueSize;
final ExponentiallyWeightedMovingAverage responseTime;
double serviceTime;

why does serviceTime not also use the ExponentiallyWeightedMovingAverage type, rather than being adjusted here? It seems the effect would be the same.
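
For reference, a sketch of what this suggestion would look like; the constructor and initial values are guesses, not the actual NodeStatistics class.

// Hypothetical variant where serviceTime is itself a moving average, so the
// 'winner' adjustment could just call addValue(...) instead of rebuilding an
// average by hand. Constructor arguments are illustrative.
final class NodeStatistics {
    final ExponentiallyWeightedMovingAverage queueSize;
    final ExponentiallyWeightedMovingAverage responseTime;
    final ExponentiallyWeightedMovingAverage serviceTime; // instead of a bare double

    NodeStatistics(double alpha, double initialQueue, double initialResponse, double initialService) {
        this.queueSize = new ExponentiallyWeightedMovingAverage(alpha, initialQueue);
        this.responseTime = new ExponentiallyWeightedMovingAverage(alpha, initialResponse);
        this.serviceTime = new ExponentiallyWeightedMovingAverage(alpha, initialService);
    }
}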

@jtibshirani (author) commented Mar 16, 2021

Thanks @jimczi for the review! I also tagged @henningandersen to get input from the distributed area.

@henningandersen reviewed and left a comment:

Thanks for looking into this. I have a question, probably just me needing this clarification.

Also, I wonder if we could add an integration test to validate that the overall mechanism works as we expect: for instance, doing two searches in a row, we would want to see them hit two different nodes (I think).


// The final formula
// Old (incorrect): the variable named muBarS actually held the service time, so this divided by it.
// double rank = rS - (1.0 / muBarS) + (Math.pow(qHatS, queueAdjustmentFactor) / muBarS);
// return rank;
// New (corrected): the measured service time is treated as 1/muBarS.
return rS - muBarSInverse + Math.pow(qHatS, queueAdjustmentFactor) * muBarSInverse;
Review comment from a contributor:

This change looks good. But I wonder whether the clientNum above represents the right number? As far as I can see, it is the number of nodes that this node has received a search response from. Compared to the paper, this sounds more like "servers" than "clients". For instance, in setups with a dedicated coordinating tier, this could be somewhat dynamic, and the number of clients could differ between the nodes in the tier.

I wonder if you know the reasoning behind how the clientNum is derived?
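
For readers following along, the C3 paper defines the queue estimate roughly as qHat_s = 1 + (outstanding requests to s) * (number of clients) + (queue size reported by s), which is where clientNum enters the formula. A hedged sketch of that expression (variable names are illustrative; the actual Elasticsearch expression may differ):

// Queue-size estimate in the spirit of the C3 paper; clientNum is the value whose
// derivation is being questioned in this comment.
double qHatS = 1 + outstandingRequests * clientNum + queueSize;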

@jtibshirani (author) replied on Mar 17, 2021:

I also found this surprising and don't understand the reasoning. I am guessing it is meant as a loose approximation to the number of clients, since by default every node can serve as a coordinating node. Since we were dividing by service time before, this clientNum approximation didn't have a big impact.

I can see a couple options. We could avoid making changes to clientNum in this PR to keep it well-scoped, while recognizing that it may give too much weight to the queue size factor. Or we could always set clientNum to 1 for now, which can underestimate the queue size, but is simpler and makes the calculation more predictable.

In any case, I will make sure we track this through an issue or other means.

Review comment from a contributor:

My worry is primarily that if we split changes here, we end up with some customers seeing multiple changes to the behavior over releases.

On the other hand, my intuition on this matter is not really strong enough that I think it is a show-stopper to merge this first and then deal with the other part later. It seems likely enough that clientNum will stabilize around a similar number over time for "client" nodes in the cluster. It might be too high (or low), though, and building up the numClients could take time after restarts.

The dedicated role for coordinating nodes might come in handy here.

@jtibshirani (author) replied on Mar 18, 2021:

This change identified a few aspects of ARS that could use improvement (including the 'stats adjustment' @jimczi mentioned above), so there will likely be more changes even apart from clientNum. To me it seems okay to introduce fixes/improvements incrementally instead of assembling a single large update to the algorithm.

@jtibshirani (author) commented:
Quoting @henningandersen's review comment above: "I wonder if we could add an integration test to validate that the overall mechanism works as we expect it"

I added SearchReplicaSelectionIT with some basic checks on end-to-end behavior.

@henningandersen reviewed and left a comment:

LGTM.



jtibshirani added a commit that referenced this pull request Apr 2, 2021
When computing a node's ARS rank, we use the number of outstanding search
requests to the node. If there are no connections to the node, we consider
there to be 1 outstanding request. This isn't accurate; the default should be 0,
indicating no outstanding requests. The ARS rank we return in node stats
actually uses 0 instead of 1.

This small fix lets us remove a test workaround. It also ensures the ARS ranks
we return in node stats match the ranks we use to select shards during search.

Follow-up to #70283.
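
A minimal sketch of the intent of that follow-up, with a hypothetical map lookup (not the actual code):

// Before the follow-up: a node with no recorded connections counted as having one
// outstanding request.
long outstandingBefore = outstandingRequests.getOrDefault(nodeId, 1L);
// After: the default is 0, matching the rank already reported in node stats.
long outstandingAfter = outstandingRequests.getOrDefault(nodeId, 0L);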
Merging this pull request closes: ARS implementation inverts the priority of servicetime (#65838)