-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce CCR stats endpoint #32350
Introduce CCR stats endpoint #32350
Conversation
This commit introduces the CCR stats endpoint which provides shard-level stats on the status of CCR follower tasks.
Pinging @elastic/es-distributed |
for (final Map.Entry<Integer, TransportCcrStatsAction.TaskResponse> shard : index.getValue().entrySet()) { | ||
builder.startObject(Integer.toString(shard.getKey())); | ||
{ | ||
final ShardFollowNodeTask.Status status = shard.getValue().status(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we use the status toXContent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there are slight differences in what we present on the endpoint and the status x-content. For example: the endpoint doesn't present the index metadata version, and presents successful/failed fetches and bulk operations as total/successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re index metadata version - why not present everything we have? the purpose of status object is to report information about the running task, which I think is identical with the purpose of this api?
presents successful/failed fetches and bulk operations as total/successful.
We can change the status class to track these if you feel it's more natural. It's not a part of the task run time behavior so nothing stops us to align these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re index metadata version - why not present everything we have?
This does not seem useful from an end-user perspective but we can add in the interest of simplification.
We can change the status class to track these if you feel it's more natural. It's not a part of the task run time behavior so nothing stops us to align these.
Tracking successful/failed feels more natural internally, presenting total/successful feels more natural but we can present successful/failed in the interest of simplification.
I pushed e3fc87a.
Thanks @jasontedor . I asked a quick question before doing a full review. I'm looking at this under the following assumptions:
Let me know if you agree. |
* ccr: Number of utilities for writing gradle integration tests (elastic#32282) Determine the minimum gradle version based on the wrapper (elastic#32226) Enable FIPS JVM in CI (elastic#32330) Build: Fix jarHell error I caused by last backport Use shadow plugin in ccr/qa Fix JarHell on X-Pack protocol
Your assumptions are correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jasontedor . I left some comments. I also miss a test for the TransportCcrStatsAction (the only thing I see that does that's not covered elsewhere is the matching of tasks) and a rest layer test.
for (final Map.Entry<Integer, TransportCcrStatsAction.TaskResponse> shard : index.getValue().entrySet()) { | ||
builder.startObject(Integer.toString(shard.getKey())); | ||
{ | ||
final ShardFollowNodeTask.Status status = shard.getValue().status(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re index metadata version - why not present everything we have? the purpose of status object is to report information about the running task, which I think is identical with the purpose of this api?
presents successful/failed fetches and bulk operations as total/successful.
We can change the status class to track these if you feel it's more natural. It's not a part of the task run time behavior so nothing stops us to align these.
|
||
private String indexName; | ||
|
||
public void setIndexName(final String indexName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should follow the same patterns as an indices stats request - i.e., we should except a list of indices or a wild card pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will require some effort. I will look into this tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleskes I pushed 2ee14e3. I am not a fan of it since the TasksRequest#match
method is now a lie so that we can defer index resolution to the transport action. This is a limitation of the tasks API that puts us here. I do not think we should block CCR on addressing this so we have this workaround for now. I dislike it but maybe it's tolerable for now to give us an API consistent with others. Let me know what you think.
private final ReleasableLock readLock; | ||
private final ReleasableLock writeLock; | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the locking semantics - read/ write for stats updates and fully synced for class execution. I think we should stick with one. I understand you did it to make sure stats reading is consistent. My vote goes to the simplicity of synchronized access and making the status creation method synchronized as well. It's super light.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
listener.onResponse(new TaskResponse(task.getFollowShardId(), task.getStatus())); | ||
} | ||
|
||
public static class TaskResponse implements Writeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not together with the other request/response object for this action? I'd rather have them together (here or in the action)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 42c309d.
@@ -490,7 +491,7 @@ public void testFollowIndex_lowMaxTranslogBytes() throws Exception { | |||
ShardFollowNodeTask.Status status = (ShardFollowNodeTask.Status) taskInfo.getStatus(); | |||
assertThat(status, notNullValue()); | |||
assertThat("incorrect global checkpoint " + shardFollowTaskParams, | |||
status.getFollowerGlobalCheckpoint(), | |||
status.followerGlobalCheckpoint(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering (I don't care much) - why did you move from the get* method naming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For an indication there is no setter and the field is otherwise immutable.
* elastic/ccr: (57 commits) ShardFollowNodeTask should fetch operation once (elastic#32455) Do not expose hard-deleted docs in Lucene history (elastic#32333) Tests: Fix convert error tests to use fixed value (elastic#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374) REST high-level client: parse back _ignored meta field (elastic#32362) [CI] Mute DocumentSubsetReaderTests testSearch Reject follow request if following setting not enabled on follower (elastic#32448) TEST: testDocStats should always use forceMerge (elastic#32450) TEST: avoid merge in testSegmentMemoryTrackedInBreaker TEST: Avoid deletion in FlushIT AwaitsFix IndexShardTests#testDocStats Painless: Add method type to method. (elastic#32441) Remove reference to non-existent store type (elastic#32418) [TEST] Mute failing FlushIT test Fix ordering of bootstrap checks in docs (elastic#32417) [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints Validate source of an index in LuceneChangesSnapshot (elastic#32288) [TEST] Mute failing testConvertLongHexError bump lucene version after backport Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390) ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extra iterations @jasontedor . Code so far LGTM. I would still love to see a rest test but I presume this is just still in the works.
* where we can decide on the basis of the cluster state whether or not the task matches any of the index patterns in the | ||
* request. | ||
*/ | ||
return task instanceof ShardFollowNodeTask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this can be an assertion now? we filter for the task instance type in the transport action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would support throwing UnsupportedOperationException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed b9c5429.
private volatile long leaderGlobalCheckpoint; | ||
|
||
private volatile long leaderMaxSeqNo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these don't need to be volatile any more, now that we read under lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. With a little extra synchronization, it is not needed on any of the variables here. I pushed 45aba55.
e -> handleFailure(e, retryCounter, () -> sendShardChangesRequest(from, maxOperationCount, maxRequiredSeqNo, retryCounter))); | ||
response -> { | ||
synchronized (ShardFollowNodeTask.this) { | ||
// noinspection NonAtomicOperationOnVolatileField |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment about these not needing to be volatile, meaning we can get rid of this suppression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 4fc75d5.
|
||
public RestCcrStatsAction(final Settings settings, final RestController controller) { | ||
super(settings); | ||
controller.registerHandler(RestRequest.Method.GET, "/_xpack/ccr/_stats", this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heads up - this will change (or can be changed now)
* ccr: (24 commits) Remove _xpack from CCR APIs (elastic#32563) TEST: Avoid merges in testRecoveryWithOutOfOrderDelete Logging: Make node name consistent in logger (elastic#31588) Mute SSLTrustRestrictionsTests on JDK 11 Increase max chunk size to 256Mb for repo-azure (elastic#32101) Docs: Fix README upgrade mention (elastic#32313) Changed ReindexRequest to use Writeable.Reader (elastic#32401) Mute KerberosAuthenticationIT Fix AutoIntervalDateHistogram.testReduce random failures (elastic#32301) fix no=>not typo (elastic#32463) Mute QueryProfilerIT#testProfileMatchesRegular() HLRC: Add delete watch action (elastic#32337) High-level client: fix clusterAlias parsing in SearchHit (elastic#32465) Fix calculation of orientation of polygons (elastic#27967) [Kerberos] Add missing javadocs (elastic#32469) [Kerberos] Remove Kerberos bootstrap checks (elastic#32451) Make get all app privs requires "*" permission (elastic#32460) Switch security to new style Requests (elastic#32290) Switch security spi example to new style Requests (elastic#32341) Painless: Add PainlessConstructor (elastic#32447) ...
@bleskes Would you take another look? |
This commit introduces the CCR stats endpoint which provides shard-level stats on the status of CCR follower tasks.
Update the endpoint and the unit tests to reflect the changes brought in [1] [1] elastic/elasticsearch#32350
Update the endpoint and the unit tests to reflect the changes brought in [1] [1] elastic/elasticsearch#32350
Update the endpoint and the unit tests to reflect the changes brought in [1]. Also simplify filtering ccr-stats related documents, similarly to the work done in [2]. [1] elastic/elasticsearch#32350 [2] 9b8974f Relates #546
This commit introduces the CCR stats endpoint which provides shard-level stats on the status of CCR follower tasks.