Executing a SLM policy that is already taking a snapshot times out (504 error code) #45594

Closed · Fixed by #45727
jen-huang opened this issue Aug 15, 2019 · 5 comments
Labels: >bug, :Data Management/ILM+SLM Index and Snapshot lifecycle management


Steps

  1. Create a repository with a very slow maximum snapshot write rate, e.g. max_snapshot_bytes_per_sec set to 1kb
  2. Create a policy that uses the slow repo
  3. Execute that policy and observe that a snapshot is in progress and that the policy information includes in-progress details
  4. Execute the policy again before the first snapshot has finished, and observe that the request hangs and eventually times out with a 504 response code (see the request sketch after this list)
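For reference, steps 3–4 boil down to calling the SLM execute endpoint twice in quick succession; a minimal sketch, assuming the policy from step 2 is named slow-test (the name is illustrative):

# Step 3: execute the policy once; a snapshot starts writing to the slow repo
PUT /_slm/policy/slow-test/_execute

# Step 4: execute the same policy again while that snapshot is still running;
# this request hangs and eventually fails with a 504
PUT /_slm/policy/slow-test/_execute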
jen-huang added the >bug and :Data Management/ILM+SLM Index and Snapshot lifecycle management labels on Aug 15, 2019
elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-features

dakrone (Member) commented Aug 15, 2019:

Can you describe how you reproduced this in more detail? When I run through the same steps, the second execute call returns with the snapshot name, which then fails (as expected) with a warning in the logs and is registered as a failed snapshot in the SLM policy:

[elasticsearch] [2019-08-15T08:35:23,107][INFO ][o.e.x.s.SnapshotLifecycleTask] [node-0] snapshot lifecycle policy [daily-snapshots] issuing create snapshot [production-snap-2019.08.15-sdycfgceq1ozhvqhbdhyfg]
[elasticsearch] [2019-08-15T08:35:23,145][INFO ][o.e.s.SnapshotsService   ] [node-0] snapshot [repo:production-snap-2019.08.15-sdycfgceq1ozhvqhbdhyfg/eXVrbzxrRHmceBWpy70iPQ] started
[elasticsearch] [2019-08-15T08:35:23,188][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [.slm-history-1-2019.08] creating index, cause [auto(bulk api)], templates [.slm-history], shards [1]/[0], mappings [_doc]


[elasticsearch] [2019-08-15T08:35:27,333][INFO ][o.e.x.s.SnapshotLifecycleTask] [node-0] snapshot lifecycle policy [daily-snapshots] issuing create snapshot [production-snap-2019.08.15-e4coq5clrdaatbhcbzb93q]
[elasticsearch] [2019-08-15T08:35:27,356][WARN ][o.e.s.SnapshotsService   ] [node-0] [repo][production-snap-2019.08.15-e4coq5clrdaatbhcbzb93q] failed to create snapshot
[elasticsearch] org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [repo:production-snap-2019.08.15-e4coq5clrdaatbhcbzb93q]  a snapshot is already running
[elasticsearch] 	at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:286) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
[elasticsearch] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
[elasticsearch] 	at java.lang.Thread.run(Thread.java:834) [?:?]

Repo/policy I'm using:

PUT /_snapshot/repo
{
  "type": "fs",
  "settings": {
    "location": "repo",
    "max_snapshot_bytes_per_sec": "10b"
  }
}

PUT /_slm/policy/daily-snapshots
{
  "schedule": "1 2 3 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "repo",
  "config": {
    "indices": ["foo-*", "important"],
    "ignore_unavailable": true,
    "include_global_state": false
  }
}

jen-huang (Author) commented Aug 15, 2019:

@dakrone The approximate total size of my indices is 22mb, and I set my repo's max_snapshot_bytes_per_sec to 1kb. Here is the sequence of my requests:

# Repo configuration
# GET /_snapshot/slow-repo
{
  "slow-repo" : {
    "type" : "fs",
    "settings" : {
      "location" : "test",
      "max_snapshot_bytes_per_sec" : "1kb"
    }
  }
}

# Policy configuration
# GET /_slm/policy/slow-test
{
  "slow-test" : {
    "version" : 1,
    "modified_date_millis" : 1565885533675,
    "policy" : {
      "name" : "slow-test",
      "schedule" : "0 0 0 ? * 7",
      "repository" : "slow-repo",
      "config" : { }
    },
    "next_execution_millis" : 1566000000000
  }
}

# Executing policy the first time
# PUT /_slm/policy/slow-test/_execute
{
  "snapshot_name" : "slow-test-_wl2wwbpseudgpqlbosnnq"
}

# Checking policy information after executing - in progress information is listed
# GET /_slm/policy/slow-test
{
  "slow-test" : {
    "version" : 1,
    "modified_date_millis" : 1565885533675,
    "policy" : {
      "name" : "slow-test",
      "schedule" : "0 0 0 ? * 7",
      "repository" : "slow-repo",
      "config" : { }
    },
    "last_success" : {
      "snapshot_name" : "slow-test-_wl2wwbpseudgpqlbosnnq",
      "time" : 1565885694939
    },
    "next_execution_millis" : 1566000000000,
    "in_progress" : {
      "name" : "slow-test-_wl2wwbpseudgpqlbosnnq",
      "uuid" : "9kEheisWQ5i_4IlT0AamMg",
      "state" : "STARTED",
      "start_time_millis" : 1565885694668
    }
  }
}

# Executing the policy a second time - it times out
# PUT /_slm/policy/slow-test/_execute
{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}

The failure from the second execution does not show up in the policy information or the ES logs until I delete the snapshot that is currently in progress.
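For anyone reproducing this: the stuck in-progress snapshot can be aborted by deleting it with the regular delete snapshot API, a sketch using the repo and snapshot names from the output above:

# Deleting an in-progress snapshot aborts it
DELETE /_snapshot/slow-repo/slow-test-_wl2wwbpseudgpqlbosnnq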

original-brownbear (Member) commented:

@dakrone I think I might know what's going on here.
The transport handler for executing a policy runs on the snapshot thread pool. So if a snapshot is already running and keeping that pool fully busy, the execute request has no free thread to run on and simply hangs until it times out.
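A quick way to confirm that theory while reproducing (a suggestion, not part of the original report) is to watch the snapshot thread pool with the cat thread pool API; if every thread in the pool is active while the snapshot runs, the execute request has nowhere to run:

# Show per-node utilization of the snapshot thread pool
GET /_cat/thread_pool/snapshot?v&h=node_name,name,active,queue,size,rejected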

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 20, 2019
* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes elastic#45594
original-brownbear (Member) commented:

@jen-huang I opened #45727 with a suggested fix. If you still have the reproducer set up for this, feel free to try it out :)

original-brownbear added a commit that referenced this issue Aug 20, 2019
* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes #45594
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 20, 2019
…45727)

* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes elastic#45594
original-brownbear added a commit that referenced this issue Aug 20, 2019
…45748)

* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes #45594