Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ILM wait for active shards on rolled index in a separate step #50718

Merged

Conversation

andreidan
Copy link
Contributor

@andreidan andreidan commented Jan 7, 2020

After we rollover the index we wait for the configured number of shards for the rolled index to become active (based on the index.write.wait_for_active_shards setting which might be present in a template, or otherwise in the default case, for the primaries to become active).
This wait might be long due to disk watermarks being tripped, replicas not being able to spring to life due to cluster nodes reconfiguration and others and the RolloverStep might not complete successfully due to this inherent transient situation, albeit the rolled index having been created.

Relates to #48183 and #44135

@andreidan andreidan added :Data Management/ILM+SLM Index and Snapshot lifecycle management WIP labels Jan 7, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@andreidan
Copy link
Contributor Author

@dakrone what are your thoughts on this?

@dakrone
Copy link
Member

dakrone commented Jan 7, 2020

I took a brief look. I think maybe we should separate this and support the parameter on the regular Rollover API (right now it's only accessible from the Transport layer I think?). Once that is available. I think splitting the steps on the ILM side sounds like a good idea to me. We need to make sure to default to the current behavior for the Rollover API though, to make sure we don't accidentally make a breaking change.

@andreidan
Copy link
Contributor Author

andreidan commented Jan 8, 2020

@dakrone the regular Rollover API supports `wait_for_active_shards" as a query parameter and the rollover client api supports it as well ( https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/admin/indices/rollover/RolloverRequestBuilder.java#L101 )

Are you thinking of exposing this as a parameter in the ILM Rollover action configuration or something else?
ie.

"hot": {                      
        "actions": {
          "rollover": {             
            "max_size": "50GB",
            "max_age": "30d",
            "wait_for_active_shards": 2
          }
        }

I'm not sure we should do this as we didn't see interest into something like this and it also requires validating which we can't perform (the value must not be larger than the number of primaries + replicas but this is a policy definition that can be applied to any index - and then the templates for the rolled index can make this even more complex)

@dakrone
Copy link
Member

dakrone commented Jan 8, 2020

Are you thinking of exposing this as a parameter in the ILM Rollover action configuration or something else?

Ah no! I didn't realize we had already exposed it (since only the setter was added in the actual change). If we already support it then splitting the two sounds good to me.

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

@andreidan andreidan added v8.0.0 and removed WIP labels Jan 15, 2020
@andreidan andreidan requested a review from dakrone January 15, 2020 13:26
@andreidan
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Andrei, I left a couple of comments!

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

// if the rollover was not performed on a write index alias, the alias will be moved to the new index and it will be the only
// index this alias points to
List<IndexMetaData> indices = alias.getIndices();
assert indices.size() == 1 : "when performing rollover on alias with is_write_index = false the alias must point to only " +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assertion doesn't stand in a CCR environment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coming up with the fix by parsing the number from the index name and finding the rolled index by "max number"

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments, thanks Andrei


Alias alias = (Alias) aliasOrIndex;
IndexMetaData aliasWriteIndex = alias.getWriteIndex();
String rolledIndexName;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String rolledIndexName;
final String rolledIndexName;

Alias alias = (Alias) aliasOrIndex;
IndexMetaData aliasWriteIndex = alias.getWriteIndex();
String rolledIndexName;
String waitForActiveShardsSettingValue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String waitForActiveShardsSettingValue;
final String waitForActiveShardsSettingValue;

// Index must have been since deleted
logger.debug("unable to find the index that was rolled over from [{}] as part of lifecycle action [{}]", index.getName(),
getKey().getAction());
return new Result(false, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we return the above log message as part of the "info" ToXContent object here? Otherwise this could stick in false forever and the user would never know why

*/
static int parseIndexNameCounter(String indexName) {
int numberIndex = indexName.lastIndexOf("-");
assert numberIndex != -1 : "no separator '-' found";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this assert and throw a regular error

static int parseIndexNameCounter(String indexName) {
int numberIndex = indexName.lastIndexOf("-");
assert numberIndex != -1 : "no separator '-' found";
return Integer.parseInt(indexName.substring(numberIndex + 1, indexName.endsWith(">") ? indexName.length() - 1 :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should catch the NumberFormatException and format it into a nicer, more human readable exception. It's possible to hit this if someone were to take an index foo-000003 and snapshot it, then restore it with a different name (like foo-000003-restored for example), this would flip out but it wouldn't be clear why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've kept the exception handling light here as this step follows closely the RolloverStep (which would not succeed in the example you illustrated as the source index would not match the ^.*-\d+$ pattern). Its visibility is package default solely for testing purposes.
I guess renames and such could occur between steps, so I'll add the exception handling and remove the assumptions.

static final ParseField TARGET_ACTIVE_SHARDS_COUNT = new ParseField("target_active_shards_count");
static final ParseField ENOUGH_SHARDS_ACTIVE = new ParseField("enough_shards_active");
static final ParseField MESSAGE = new ParseField("message");
static final ConstructingObjectParser<Info, Void> PARSER = new ConstructingObjectParser<>("wait_for_active_shards_step_info",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this PARSER is never actually used?

this.enoughShardsActive = enoughShardsActive;

if (enoughShardsActive) {
message = "The target of [" + targetActiveShardsCount + "] are active. Don't need to wait anymore.";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, but we should keep these lowercase, and I don't think we need a trailing period (these are more like error messages than sentences)

if (enoughShardsActive) {
message = "The target of [" + targetActiveShardsCount + "] are active. Don't need to wait anymore.";
} else {
message = "Waiting for [" + targetActiveShardsCount + "] shards to become active, but only [" + currentActiveShardsCount +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same about lowercasing/trailing-period here.

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

@andreidan andreidan requested a review from dakrone January 20, 2020 14:13
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Andrei!

@andreidan andreidan merged commit 457a92f into elastic:master Jan 22, 2020
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Jan 22, 2020
…c#50718)

After we rollover the index we wait for the configured number of shards for the
rolled index to become active (based on the index.write.wait_for_active_shards
setting which might be present in a template, or otherwise in the default case,
for the primaries to become active).
This wait might be long due to disk watermarks being tripped, replicas not
being able to spring to life due to cluster nodes reconfiguration and others
and, the RolloverStep might not complete successfully due to this inherent
transient situation, albeit the rolled index having been created.

(cherry picked from commit 457a92f)
Signed-off-by: Andrei Dan <[email protected]>
andreidan added a commit that referenced this pull request Jan 22, 2020
#51296)

After we rollover the index we wait for the configured number of shards for the
rolled index to become active (based on the index.write.wait_for_active_shards
setting which might be present in a template, or otherwise in the default case,
for the primaries to become active).
This wait might be long due to disk watermarks being tripped, replicas not
being able to spring to life due to cluster nodes reconfiguration and others
and, the RolloverStep might not complete successfully due to this inherent
transient situation, albeit the rolled index having been created.

(cherry picked from commit 457a92f)
Signed-off-by: Andrei Dan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management v7.7.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants