Executing a SLM policy that is already taking a snapshot times out (504 error code) #45594

Closed · Fixed by #45727
jen-huang opened this issue Aug 15, 2019 · 5 comments
Labels: >bug, :Data Management/ILM+SLM Index and Snapshot lifecycle management


Steps

  1. Create a repository with a very slow maximum snapshot write rate, e.g. max_snapshot_bytes_per_sec set to 1kb
  2. Create a policy that uses the slow repo
  3. Execute that policy and observe that a snapshot is in progress and that the policy information includes in-progress details
  4. Execute the policy again before the first snapshot has finished, and observe that the request hangs and eventually times out with a 504 response code (see the request sketch after this list)
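For reference, steps 3–4 boil down to calling the SLM execute endpoint twice in quick succession; a minimal sketch, assuming the policy from step 2 is named slow-test (the name is illustrative):

# Step 3: execute the policy once; a snapshot starts writing to the slow repo
PUT /_slm/policy/slow-test/_execute

# Step 4: execute the same policy again while that snapshot is still running;
# this request hangs and eventually fails with a 504
PUT /_slm/policy/slow-test/_execute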
jen-huang added the >bug and :Data Management/ILM+SLM Index and Snapshot lifecycle management labels on Aug 15, 2019
elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-features

dakrone (Member) commented Aug 15, 2019:

Can you describe how you reproduced this in more detail? When I run through the same steps, the second execute call returns with the snapshot name, which then fails (as expected) with a warning in the logs and is registered as a failed snapshot in the SLM policy:

[elasticsearch] [2019-08-15T08:35:23,107][INFO ][o.e.x.s.SnapshotLifecycleTask] [node-0] snapshot lifecycle policy [daily-snapshots] issuing create snapshot [production-snap-2019.08.15-sdycfgceq1ozhvqhbdhyfg]
[elasticsearch] [2019-08-15T08:35:23,145][INFO ][o.e.s.SnapshotsService   ] [node-0] snapshot [repo:production-snap-2019.08.15-sdycfgceq1ozhvqhbdhyfg/eXVrbzxrRHmceBWpy70iPQ] started
[elasticsearch] [2019-08-15T08:35:23,188][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [.slm-history-1-2019.08] creating index, cause [auto(bulk api)], templates [.slm-history], shards [1]/[0], mappings [_doc]


[elasticsearch] [2019-08-15T08:35:27,333][INFO ][o.e.x.s.SnapshotLifecycleTask] [node-0] snapshot lifecycle policy [daily-snapshots] issuing create snapshot [production-snap-2019.08.15-e4coq5clrdaatbhcbzb93q]
[elasticsearch] [2019-08-15T08:35:27,356][WARN ][o.e.s.SnapshotsService   ] [node-0] [repo][production-snap-2019.08.15-e4coq5clrdaatbhcbzb93q] failed to create snapshot
[elasticsearch] org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [repo:production-snap-2019.08.15-e4coq5clrdaatbhcbzb93q]  a snapshot is already running
[elasticsearch] 	at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:286) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
[elasticsearch] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
[elasticsearch] 	at java.lang.Thread.run(Thread.java:834) [?:?]

Repo/policy I'm using:

PUT /_snapshot/repo
{
  "type": "fs",
  "settings": {
    "location": "repo",
    "max_snapshot_bytes_per_sec": "10b"
  }
}

PUT /_slm/policy/daily-snapshots
{
  "schedule": "1 2 3 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "repo",
  "config": {
    "indices": ["foo-*", "important"],
    "ignore_unavailable": true,
    "include_global_state": false
  }
}

jen-huang (Author) commented Aug 15, 2019:

@dakrone The approximate total size of my indices is 22mb, and I set my repo's max_snapshot_bytes_per_sec to 1kb. Here is the sequence of my requests:

# Repo configuration
# GET /_snapshot/slow-repo
{
  "slow-repo" : {
    "type" : "fs",
    "settings" : {
      "location" : "test",
      "max_snapshot_bytes_per_sec" : "1kb"
    }
  }
}

# Policy configuration
# GET /_slm/policy/slow-test
{
  "slow-test" : {
    "version" : 1,
    "modified_date_millis" : 1565885533675,
    "policy" : {
      "name" : "slow-test",
      "schedule" : "0 0 0 ? * 7",
      "repository" : "slow-repo",
      "config" : { }
    },
    "next_execution_millis" : 1566000000000
  }
}

# Executing policy the first time
# PUT /_slm/policy/slow-test/_execute
{
  "snapshot_name" : "slow-test-_wl2wwbpseudgpqlbosnnq"
}

# Checking policy information after executing - in progress information is listed
# GET /_slm/policy/slow-test
{
  "slow-test" : {
    "version" : 1,
    "modified_date_millis" : 1565885533675,
    "policy" : {
      "name" : "slow-test",
      "schedule" : "0 0 0 ? * 7",
      "repository" : "slow-repo",
      "config" : { }
    },
    "last_success" : {
      "snapshot_name" : "slow-test-_wl2wwbpseudgpqlbosnnq",
      "time" : 1565885694939
    },
    "next_execution_millis" : 1566000000000,
    "in_progress" : {
      "name" : "slow-test-_wl2wwbpseudgpqlbosnnq",
      "uuid" : "9kEheisWQ5i_4IlT0AamMg",
      "state" : "STARTED",
      "start_time_millis" : 1565885694668
    }
  }
}

# Executing the policy a second time - it times out
# PUT /_slm/policy/slow-test/_execute
{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}

The failure from the second execution does not show up in the policy information or the ES logs until I delete the snapshot that is currently in progress.
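For anyone reproducing this: the stuck in-progress snapshot can be aborted by deleting it with the regular delete snapshot API, a sketch using the repo and snapshot names from the output above:

# Deleting an in-progress snapshot aborts it
DELETE /_snapshot/slow-repo/slow-test-_wl2wwbpseudgpqlbosnnq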

original-brownbear (Member) commented:

@dakrone I think I might know what's going on here.
The transport handler for executing a policy runs on the snapshot thread pool. So if a snapshot is already running and keeping that pool fully busy, the execute request has no free thread to run on and simply hangs until it times out.
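A quick way to confirm that theory while reproducing (a suggestion, not part of the original report) is to watch the snapshot thread pool with the cat thread pool API; if every thread in the pool is active while the snapshot runs, the execute request has nowhere to run:

# Show per-node utilization of the snapshot thread pool
GET /_cat/thread_pool/snapshot?v&h=node_name,name,active,queue,size,rejected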

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 20, 2019
* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes elastic#45594
original-brownbear (Member) commented:

@jen-huang I opened #45727 with a suggested fix. If you still have the reproducer set up for this, feel free to try it out :)

original-brownbear added a commit that referenced this issue Aug 20, 2019
* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes #45594
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 20, 2019
…45727)

* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes elastic#45594
original-brownbear added a commit that referenced this issue Aug 20, 2019
…45748)

* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes #45594