Correct service time parameter in ARS formula #70283
Conversation
As mentioned in the issue, for moderate queue sizes the existing behavior roughly ranks nodes based on response time, so in many cases we may not see a real difference with this change. However, based on my understanding of the formula, there are possible improvements.
Note that I have not run large-scale end-to-end tests to assess the impact. I was hoping for feedback here: what level of end-to-end testing do we think is appropriate for this change? The ARS algorithm has a few moving parts (for example, stats adjustments), so it can be hard to fully understand the behavior with only unit tests.
Pinging @elastic/es-search (Team:Search)
Force-pushed from c1ec0ee to 70c00bc
ClusterService clusterService = ClusterServiceUtils.createClusterService(threadPool);
ResponseCollectorService collector = new ResponseCollectorService(clusterService);
Map<String, Long> outstandingRequests = new HashMap<>();
When a 'winner' node is selected, we update the local 'outstanding requests' map to increment its value by 1. This only affects the local copy; it doesn't update global statistics. While this has a purpose in the ARS logic, it doesn't make sense in the context of this test, so I removed the shared outstandingRequests map here and added another test that explicitly checks this behavior.
Great catch @jtibshirani! The formula looks much better now ;).
ExponentiallyWeightedMovingAverage avgServiceTime = new ExponentiallyWeightedMovingAverage(
    ResponseCollectorService.ALPHA, stats.serviceTime);
avgServiceTime.addValue((minStats.serviceTime + stats.serviceTime) / 2);
final long updatedService = (long) avgServiceTime.getAverage();
As discussed offline, the "adjustment" seems odd. We should think about updating the node statistics explicitly. That's out of scope for this PR, but it would be a good follow-up.
@jtibshirani I'm curious: in NodeStatistics we have

final ExponentiallyWeightedMovingAverage queueSize;
final ExponentiallyWeightedMovingAverage responseTime;
double serviceTime;

Why doesn't serviceTime also use the type ExponentiallyWeightedMovingAverage, rather than being adjusted here? It seems the effect would be the same.
Thanks @jimczi for the review! I also tagged @henningandersen to get input from the distributed area.
Thanks for looking into this. I have a question; probably I just need some clarification.

Also, I wonder if we could add an integration test to validate that the overall mechanism works as we expect. For instance, after doing two searches in a row, we would want to see them hit two different nodes (I think).
// The final formula
double rank = rS - (1.0 / muBarS) + (Math.pow(qHatS, queueAdjustmentFactor) / muBarS);
return rank;
return rS - muBarSInverse + Math.pow(qHatS, queueAdjustmentFactor) * muBarSInverse;
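To make the before/after concrete, here is a rough standalone sketch (not the Elasticsearch source; the inputs are made up) showing that with the corrected formula, a node with a higher service time receives a worse (higher) rank:

```java
// Hedged sketch: the corrected C3 rank with hypothetical inputs.
// In C3, muBarS is the service *rate*, so 1/muBarS is the measured service
// time. The old code divided by service time (treating it as muBarS);
// the fix multiplies by it.
public class RankSketch {
    static double rank(double rS, double serviceTime,
                       double qHatS, double queueAdjustmentFactor) {
        double muBarSInverse = serviceTime; // service time plays the role of 1/muBarS
        return rS - muBarSInverse
            + Math.pow(qHatS, queueAdjustmentFactor) * muBarSInverse;
    }

    public static void main(String[] args) {
        // Same response time and queue estimate, different service times:
        // the slower node gets the higher (worse) rank.
        double fast = rank(100.0, 10.0, 2.0, 3.0); // 100 - 10 + 8*10 = 170.0
        double slow = rank(100.0, 50.0, 2.0, 3.0); // 100 - 50 + 8*50 = 450.0
        System.out.println(fast < slow); // true
    }
}
```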
This change looks good. But I wonder if the clientNum above represents the right number? As far as I can see, it is the number of nodes that this node has gotten a search response from. Compared to the paper, this sounds more like "servers" than "clients". For instance, in setups with a dedicated coordinating tier, this could be somewhat dynamic and the number of clients could differ between the nodes in the tier.

I wonder if you know the reasoning behind how clientNum is derived?
I also found this surprising and don't understand the reasoning. I am guessing it is meant as a loose approximation to the number of clients, since by default every node can serve as a coordinating node. Since we were dividing by service time before, this clientNum approximation didn't have a big impact.

I can see a couple of options. We could avoid making changes to clientNum in this PR to keep it well-scoped, while recognizing that it may give too much weight to the queue size factor. Or we could always set clientNum to 1 for now, which can underestimate the queue size but is simpler and makes the calculation more predictable.

In any case, I will make sure we track this through an issue or other means.
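For context, the queue estimate in the C3 paper is q_hat = 1 + outstanding * n + q, where n is the number of clients. A rough sketch (illustrative names; not the Elasticsearch implementation) of how the choice of clientNum scales the estimate:

```java
// Hedged sketch of the C3 queue estimate: q_hat = 1 + outstanding * n + q.
// Names and values are illustrative, not taken from the Elasticsearch source.
public class QueueEstimateSketch {
    static double qHat(long outstandingRequests, int clientNum, double queueEwma) {
        return 1.0 + outstandingRequests * clientNum + queueEwma;
    }

    public static void main(String[] args) {
        // With clientNum approximated by the number of responding nodes (say 10),
        // each outstanding request is scaled up sharply; with clientNum = 1 it
        // is counted once, which can underestimate the true queue.
        System.out.println(qHat(2, 10, 5.0)); // 26.0
        System.out.println(qHat(2, 1, 5.0));  // 8.0
    }
}
```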
My worry is primarily that if we split changes here, we end up with some customers seeing multiple changes to the behavior over releases.

On the other hand, my intuition on this matter is not really strong enough that I think it is a show-stopper to merge this first and then deal with the other part later. It seems likely enough that clientNum will stabilize around a similar number over time for "client" nodes in the cluster. It might be too high (or low) though, and building up the numClients could take time after restarts.

The dedicated role for coordinating nodes might come in handy here.
This change identified a few aspects of ARS that could use improvement (including the 'stats adjustment' @jimczi mentioned above). So there will likely be more changes even apart from clientNum. To me it seems okay to introduce fixes/improvements incrementally instead of assembling a single large update to the algorithm.
LGTM.
When computing a node’s ARS rank, we use the number of outstanding search requests to the node. If there are no connections to the node, we consider there to be 1 outstanding request. This isn’t accurate; the default should be 0, to indicate no outstanding requests. In fact, the ARS rank we return in node stats already uses 0 instead of 1. This small fix lets us remove a test workaround, and ensures the ARS ranks we return in node stats match the ranks we use to select shards during search. Follow-up to #70283.
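A minimal sketch of the default-value change (hypothetical names; not the actual Elasticsearch code):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: a node missing from the outstanding-requests map should
// count as 0 outstanding requests, not 1.
public class OutstandingSketch {
    static long outstandingFor(Map<String, Long> outstanding, String nodeId) {
        // Before the fix, the missing-entry default was effectively 1.
        return outstanding.getOrDefault(nodeId, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> outstanding = new HashMap<>();
        outstanding.put("node-a", 3L);
        System.out.println(outstandingFor(outstanding, "node-a")); // 3
        System.out.println(outstandingFor(outstanding, "node-b")); // 0
    }
}
```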
The adaptive replica selection algorithm implements the C3 algorithm for ranking nodes. The formula defines service time as the quantity 1/muBarS. Our implementation accidentally plugs in service time for muBarS instead of 1/muBarS. This commit corrects the formula and adds invariant tests to confirm it behaves as expected.
This change also fixes a bug in how we adjust node statistics. To ensure that nodes with high ranks occasionally get selected, every time we select a 'winner' node, we average the node's stats with the best node's stats and add the result to the moving average. For service time, we were accidentally overwriting the whole moving average with the new value, which caused the ranks to adjust too quickly. This issue has a much bigger impact now that the formula correctly incorporates service time, so it is important to fix to keep the behavior reasonable.
Fixes #65838.
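The service-time adjustment bug can be illustrated with a minimal moving-average sketch (assumed alpha and values; this is not the Elasticsearch ExponentiallyWeightedMovingAverage class):

```java
// Hedged sketch: folding an adjusted value into an existing moving average
// moves it gradually, while rebuilding the average from the adjusted value
// (the bug) discards all history and jumps immediately.
public class EwmaSketch {
    private final double alpha;
    private double average;

    EwmaSketch(double alpha, double initial) {
        this.alpha = alpha;
        this.average = initial;
    }

    void addValue(double v) {
        average = alpha * v + (1 - alpha) * average;
    }

    double getAverage() {
        return average;
    }

    public static void main(String[] args) {
        double winnerServiceTime = 100.0;
        double bestServiceTime = 10.0;
        double adjustment = (bestServiceTime + winnerServiceTime) / 2; // 55.0

        // Buggy behavior: the average is replaced outright by the adjustment.
        EwmaSketch overwritten = new EwmaSketch(0.3, adjustment);

        // Fixed behavior: the adjustment is folded into the existing average.
        EwmaSketch adjusted = new EwmaSketch(0.3, winnerServiceTime);
        adjusted.addValue(adjustment);

        System.out.println(overwritten.getAverage()); // jumps straight to the adjustment
        System.out.println(adjusted.getAverage());    // moves only partway toward it
    }
}
```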