Impossible to stop Snap task #1482
Comments
@iwankgb, thanks for logging this issue. Would you mind sharing the script you use to reproduce it?
I think I managed to debug it. Locking the mutex in control/available_plugin.go:findLatestPool seems to solve this issue:

func (ap *availablePlugins) findLatestPool(pType, name string) (strategy.Pool, serror.SnapError) {
	ap.RLock()
	defer ap.RUnlock()
	// see if there exists a pool at all which matches name version.
	var latest strategy.Pool

For reproduction and debugging I used the script below:

#!/bin/bash
snapteld -l 1 -t 0 -o '/tmp/snapteld.log'
sleep 3
snaptel plugin load snap-plugin-collector-docker
snaptel plugin load snap-plugin-processor-tag
snaptel plugin load snap-plugin-publisher-influxdb
for j in {1..1000}
do
    echo "----------- iteration ${j} -----------------"
    id=`snaptel task create -t task.yml | grep ID | awk '{print $2}'`
    echo "Task ${id} started...."
    sleep 15
    for i in {1..5}
    do
        snaptel task stop ${id}
    done
    echo "Task stopped"
done

with task:

---
version: 1
schedule:
  type: "simple"
  interval: "1s"
max-failures: 10
workflow:
  collect:
    metrics:
      /intel/docker/*/stats/cgroups/*: {}
    process:
      -
        plugin_name: "tag"
        config:
          tags: "test:issue"
        publish:
          -
            plugin_name: "influxdb"
            config:
              host: "10.91.126.126"
              port: 8086
              database: "test"
              user: "admin"
              password: "admin"
              https: false
    publish:
      -
        plugin_name: "influxdb"
        config:
          host: "localhost"
          port: 8086
          database: "test"
          user: "admin"
          password: "admin"
          https: false

The repeated calls to stop the task and the general structure of the task manifest were added to mirror the exact steps from @iwankgb's experiments.
I applied the mitigation that @marcin-krolik developed to the Snap installation we use; unfortunately, it has not solved the issue.
It seems my previous finding does not really solve this issue. There is a new observation, though: when the hang occurs, the publisher does not appear to be killed. In every experiment I ran with different publishers (influxdb and file), the publisher binary was still visible in the list of active processes. All other plugins (collector, processor) were killed as expected.
This issue is easier to reproduce by running several tasks in parallel and stopping them after several seconds.
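The parallel reproduction described above could be driven by a small Go program along these lines. This is only a sketch under assumptions: the helper names (`runner`, `execRunner`, `parseTaskID`, `startAndStop`), the task count, and the wait time are hypothetical, and it assumes that `snaptel task create` prints a line of the form `ID: <id>`, as the `grep ID | awk '{print $2}'` pipeline in the script above implies. If a stop call hangs rather than returning an error, the program will hang with it, which is itself a signal that the issue was hit.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
	"sync"
	"time"
)

// runner executes a command and returns its combined output.
// It is injected so the logic can be exercised without a running snapteld.
type runner func(name string, args ...string) (string, error)

// execRunner runs the real binary through os/exec.
func execRunner(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

// parseTaskID returns the second field of the first line containing "ID",
// roughly what `grep ID | awk '{print $2}'` does in the shell script.
func parseTaskID(output string) string {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "ID") {
			continue
		}
		if fields := strings.Fields(line); len(fields) >= 2 {
			return fields[1]
		}
	}
	return ""
}

// startAndStop creates n tasks in parallel, waits, then stops them in
// parallel. It returns the IDs of tasks whose stop command returned an error.
func startAndStop(run runner, manifest string, n int, wait time.Duration) []string {
	ids := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out, err := run("snaptel", "task", "create", "-t", manifest)
			if err == nil {
				ids[i] = parseTaskID(out)
			}
		}(i)
	}
	wg.Wait()
	time.Sleep(wait)

	var mu sync.Mutex
	var failed []string
	for _, id := range ids {
		if id == "" {
			continue // task creation failed, nothing to stop
		}
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if _, err := run("snaptel", "task", "stop", id); err != nil {
				mu.Lock()
				failed = append(failed, id)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return failed
}

func main() {
	// Hypothetical values: 5 tasks, stopped after 5 seconds.
	failed := startAndStop(execRunner, "task.yml", 5, 5*time.Second)
	fmt.Println("tasks that failed to stop:", failed)
}
```

Injecting the `runner` keeps the orchestration testable with a fake command, while `execRunner` talks to the real `snaptel` binary.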
My reproduction method is available via the following steps:
Removed double RLock in (p *pool) Eligible().
Replaced the read lock with a write lock (RLock/RUnlock -> Lock/Unlock) in (p *pool) IncRestartCount.
Fixes #1482, removed unsafe double RLock
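To illustrate the locking rules behind that change, here is a minimal sketch. It is not Snap's actual `pool` code: the struct fields, the `restartsLeft` helper, and the `hammer` function are hypothetical. Go's `sync.RWMutex` documentation states that recursive read-locking is prohibited, because a writer queued between two `RLock` calls on the same goroutine blocks the second `RLock` while itself waiting for the first to be released. The sketch therefore takes the read lock exactly once in `Eligible` and uses the exclusive lock for the mutation in `IncRestartCount`.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical stand-in, not Snap's actual strategy pool.
type pool struct {
	sync.RWMutex
	restartCount int
	maxRestarts  int
}

// restartsLeft reads state without locking.
// Callers must already hold the lock (read or write).
func (p *pool) restartsLeft() bool {
	return p.restartCount < p.maxRestarts
}

// Eligible takes the read lock exactly once. Calling another method that
// also calls RLock from here would be a recursive read lock, which can
// deadlock when a writer is queued between the two RLock calls.
func (p *pool) Eligible() bool {
	p.RLock()
	defer p.RUnlock()
	return p.restartsLeft()
}

// IncRestartCount mutates state, so it needs the exclusive lock.
func (p *pool) IncRestartCount() {
	p.Lock()
	defer p.Unlock()
	p.restartCount++
}

// hammer runs n concurrent writers and n concurrent readers against the
// pool and returns the final restart count.
func hammer(p *pool, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); p.IncRestartCount() }()
		go func() { defer wg.Done(); _ = p.Eligible() }()
	}
	wg.Wait()
	p.RLock()
	defer p.RUnlock()
	return p.restartCount
}

func main() {
	p := &pool{maxRestarts: 3}
	fmt.Println(hammer(p, 100), p.Eligible())
}
```

Splitting the logic into an unlocked helper plus a locking wrapper is one common way to avoid nested `RLock` calls; running the program with `go run -race` is a quick check that the increment is properly guarded.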
We have a task that runs for approximately 2 minutes and is then stopped. The task is run in a loop for a configurable number of iterations. Sometimes the task cannot be stopped.
@marcin-krolik suggested that it might be related to #1454.
Log file is available at: https://gist.github.com/iwankgb/03ca6ac4d18cd1247fd22905645b3dbb