Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce CCR stats endpoint #32350

Merged
merged 21 commits into from
Aug 3, 2018
Merged

Introduce CCR stats endpoint #32350

merged 21 commits into from
Aug 3, 2018

Conversation

jasontedor
Copy link
Member

This commit introduces the CCR stats endpoint which provides shard-level stats on the status of CCR follower tasks.

This commit introduces the CCR stats endpoint which provides shard-level
stats on the status of CCR follower tasks.
@jasontedor jasontedor added >enhancement review :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Jul 25, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

for (final Map.Entry<Integer, TransportCcrStatsAction.TaskResponse> shard : index.getValue().entrySet()) {
builder.startObject(Integer.toString(shard.getKey()));
{
final ShardFollowNodeTask.Status status = shard.getValue().status();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we use the status toXContent?

Copy link
Member Author

@jasontedor jasontedor Jul 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there are slight differences in what we present on the endpoint and the status x-content. For example: the endpoint doesn't present the index metadata version, and presents successful/failed fetches and bulk operations as total/successful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re index metadata version - why not present everything we have? the purpose of status object is to report information about the running task, which I think is identical with the purpose of this api?

presents successful/failed fetches and bulk operations as total/successful.

We can change the status class to track these if you feel it's more natural. It's not a part of the task run time behavior so nothing stops us to align these.

Copy link
Member Author

@jasontedor jasontedor Jul 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re index metadata version - why not present everything we have?

This does not seem useful from an end-user perspective but we can add in the interest of simplification.

We can change the status class to track these if you feel it's more natural. It's not a part of the task run time behavior so nothing stops us to align these.

Tracking successful/failed feels more natural internally, presenting total/successful feels more natural but we can present successful/failed in the interest of simplification.

I pushed e3fc87a.

@bleskes
Copy link
Contributor

bleskes commented Jul 25, 2018

Thanks @jasontedor . I asked a quick question before doing a full review. I'm looking at this under the following assumptions:

  1. We will review what we really expose in follow up discussions and potentially change it. This PR is about the infra to expose stats.
  2. I believe we need an index level summary (shards are an implementation detail). This is IMO also out of scope (and has some tricky aspects).

Let me know if you agree.

* ccr:
  Number of utilities for writing gradle integration tests (elastic#32282)
  Determine the minimum gradle version based on the wrapper (elastic#32226)
  Enable FIPS JVM in CI (elastic#32330)
  Build: Fix jarHell error I caused by last backport
  Use shadow plugin in ccr/qa
  Fix JarHell on X-Pack protocol
@jasontedor
Copy link
Member Author

Your assumptions are correct.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jasontedor . I left some comments. I also miss a test for the TransportCcrStatsAction (the only thing I see that does that's not covered elsewhere is the matching of tasks) and a rest layer test.

for (final Map.Entry<Integer, TransportCcrStatsAction.TaskResponse> shard : index.getValue().entrySet()) {
builder.startObject(Integer.toString(shard.getKey()));
{
final ShardFollowNodeTask.Status status = shard.getValue().status();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re index metadata version - why not present everything we have? the purpose of status object is to report information about the running task, which I think is identical with the purpose of this api?

presents successful/failed fetches and bulk operations as total/successful.

We can change the status class to track these if you feel it's more natural. It's not a part of the task run time behavior so nothing stops us to align these.


private String indexName;

public void setIndexName(final String indexName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow the same patterns as an indices stats request - i.e., we should except a list of indices or a wild card pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will require some effort. I will look into this tomorrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bleskes I pushed 2ee14e3. I am not a fan of it since the TasksRequest#match method is now a lie so that we can defer index resolution to the transport action. This is a limitation of the tasks API that puts us here. I do not think we should block CCR on addressing this so we have this workaround for now. I dislike it but maybe it's tolerable for now to give us an API consistent with others. Let me know what you think.

private final ReleasableLock readLock;
private final ReleasableLock writeLock;

{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the locking semantics - read/ write for stats updates and fully synced for class execution. I think we should stick with one. I understand you did it to make sure stats reading is consistent. My vote goes to the simplicity of synchronized access and making the status creation method synchronized as well. It's super light.

Copy link
Member Author

@jasontedor jasontedor Jul 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no problem. I pushed 9b5aa54 and 54f7d3c.

listener.onResponse(new TaskResponse(task.getFollowShardId(), task.getStatus()));
}

public static class TaskResponse implements Writeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not together with the other request/response object for this action? I'd rather have them together (here or in the action)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 42c309d.

@@ -490,7 +491,7 @@ public void testFollowIndex_lowMaxTranslogBytes() throws Exception {
ShardFollowNodeTask.Status status = (ShardFollowNodeTask.Status) taskInfo.getStatus();
assertThat(status, notNullValue());
assertThat("incorrect global checkpoint " + shardFollowTaskParams,
status.getFollowerGlobalCheckpoint(),
status.followerGlobalCheckpoint(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering (I don't care much) - why did you move from the get* method naming?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For an indication there is no setter and the field is otherwise immutable.

* elastic/ccr: (57 commits)
  ShardFollowNodeTask should fetch operation once (elastic#32455)
  Do not expose hard-deleted docs in Lucene history (elastic#32333)
  Tests: Fix convert error tests to use fixed value (elastic#32415)
  IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374)
  REST high-level client: parse back _ignored meta field (elastic#32362)
  [CI] Mute DocumentSubsetReaderTests testSearch
  Reject follow request if following setting not enabled on follower (elastic#32448)
  TEST: testDocStats should always use forceMerge (elastic#32450)
  TEST: avoid merge in testSegmentMemoryTrackedInBreaker
  TEST: Avoid deletion in FlushIT
  AwaitsFix IndexShardTests#testDocStats
  Painless: Add method type to method. (elastic#32441)
  Remove reference to non-existent store type (elastic#32418)
  [TEST] Mute failing FlushIT test
  Fix ordering of bootstrap checks in docs (elastic#32417)
  [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints
  Validate source of an index in LuceneChangesSnapshot (elastic#32288)
  [TEST] Mute failing testConvertLongHexError
  bump lucene version after backport
  Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390)
  ...
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra iterations @jasontedor . Code so far LGTM. I would still love to see a rest test but I presume this is just still in the works.

* where we can decide on the basis of the cluster state whether or not the task matches any of the index patterns in the
* request.
*/
return task instanceof ShardFollowNodeTask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this can be an assertion now? we filter for the task instance type in the transport action.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would support throwing UnsupportedOperationException.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed b9c5429.

private volatile long leaderGlobalCheckpoint;

private volatile long leaderMaxSeqNo;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these don't need to be volatile any more, now that we read under lock.

Copy link
Member Author

@jasontedor jasontedor Aug 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. With a little extra synchronization, it is not needed on any of the variables here. I pushed 45aba55.

e -> handleFailure(e, retryCounter, () -> sendShardChangesRequest(from, maxOperationCount, maxRequiredSeqNo, retryCounter)));
response -> {
synchronized (ShardFollowNodeTask.this) {
// noinspection NonAtomicOperationOnVolatileField
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment about these not needing to be volatile, meaning we can get rid of this suppression.

Copy link
Member Author

@jasontedor jasontedor Aug 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 4fc75d5.


public RestCcrStatsAction(final Settings settings, final RestController controller) {
super(settings);
controller.registerHandler(RestRequest.Method.GET, "/_xpack/ccr/_stats", this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heads up - this will change (or can be changed now)

* ccr: (24 commits)
  Remove _xpack from CCR APIs (elastic#32563)
  TEST: Avoid merges in testRecoveryWithOutOfOrderDelete
  Logging: Make node name consistent in logger (elastic#31588)
  Mute SSLTrustRestrictionsTests on JDK 11
  Increase max chunk size to 256Mb for repo-azure (elastic#32101)
  Docs: Fix README upgrade mention (elastic#32313)
  Changed ReindexRequest to use Writeable.Reader (elastic#32401)
  Mute KerberosAuthenticationIT
  Fix AutoIntervalDateHistogram.testReduce random failures (elastic#32301)
  fix no=>not typo (elastic#32463)
  Mute QueryProfilerIT#testProfileMatchesRegular()
  HLRC: Add delete watch action (elastic#32337)
  High-level client: fix clusterAlias parsing in SearchHit (elastic#32465)
  Fix calculation of orientation of polygons (elastic#27967)
  [Kerberos] Add missing javadocs (elastic#32469)
  [Kerberos] Remove Kerberos bootstrap checks (elastic#32451)
  Make get all app privs requires "*" permission (elastic#32460)
  Switch security to new style Requests (elastic#32290)
  Switch security spi example to new style Requests (elastic#32341)
  Painless: Add PainlessConstructor (elastic#32447)
  ...
@jasontedor
Copy link
Member Author

@bleskes Would you take another look?

@jasontedor jasontedor merged commit d640c9d into elastic:ccr Aug 3, 2018
jasontedor added a commit that referenced this pull request Aug 3, 2018
This commit introduces the CCR stats endpoint which provides shard-level
stats on the status of CCR follower tasks.
@jasontedor jasontedor deleted the ccr-stats branch August 3, 2018 13:30
dliappis added a commit to dliappis/rally that referenced this pull request Aug 6, 2018
Update the endpoint and the unit tests to reflect the changes brought
in [1]

[1] elastic/elasticsearch#32350
dliappis added a commit to dliappis/rally that referenced this pull request Aug 6, 2018
Update the endpoint and the unit tests to reflect the changes brought
in [1]

[1] elastic/elasticsearch#32350
dliappis added a commit to elastic/rally that referenced this pull request Aug 7, 2018
Update the endpoint and the unit tests to reflect the changes brought
in [1].

Also simplify filtering ccr-stats related documents,
similarly to the work done in [2].

[1] elastic/elasticsearch#32350
[2] 9b8974f

Relates #546
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants