This repository has been archived by the owner on Nov 8, 2022. It is now read-only.

Impossible to stop Snap task #1482

Closed
iwankgb opened this issue Jan 20, 2017 · 6 comments

Comments

@iwankgb
Contributor

iwankgb commented Jan 20, 2017

We have a task that runs for approximately 2 minutes and is then stopped. The task is run in a loop for a configurable number of iterations. Sometimes it is not possible to stop the task.

  • OS version
[root@*********** ~]# uname -a
Linux ******* 4.9.2-1.el7.elrepo.x86_64 #1 SMP Tue Jan 10 11:30:19 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Snap version
[root@*********** ~]# snapteld --version
snapteld version 1.0.0
  • Environment details (virtual, physical, etc.) - bare metal
  • Steps to reproduce - keep creating and stopping Snap task.
  • Expected results: task stopped
  • Actual results: the task is not stopped and is still reported as Running; one of the plugins keeps running despite an attempt to kill it
[root@*********** ~]# snaptel task list
ID 					 NAME 						 STATE 		 HIT 	 MISS 	 FAIL 	 CREATED 		 LAST FAILURE
dc2810f5-76e2-4c91-a4b4-08918d55df8f 	 serenity                                  	 Running 	 434 	 26 	 1 	 1:17PM 1-20-2017 	 CollectMetrics call error : could not obtain load metric: no history stored: could not calculate current load
[root@*********** ~]# ps ax | grep [s]nap
16600 ?        Ssl    0:13 /opt/snap/sbin/snapteld
32595 ?        Sl     0:01 /tmp/478065240/snap-plugin-publisher-influxdb {"LogLevel":5,"PingTimeoutDuration":10000000000,"NoDaemon":false,"Pprof":false}

@marcin-krolik suggested that it might be related to #1454.

Log file is available at: https://gist.github.com/iwankgb/03ca6ac4d18cd1247fd22905645b3dbb

@candysmurf
Contributor

candysmurf commented Jan 20, 2017

@iwankgb, thanks for logging this issue. Would you mind sharing the script you use so that the issue can be reproduced easily?

@marcin-krolik
Collaborator

marcin-krolik commented Jan 24, 2017

I think I managed to debug it. Taking the read lock on the mutex in control/available_plugin.go:findLatestPool

func (ap *availablePlugins) findLatestPool(pType, name string) (strategy.Pool, serror.SnapError) {
	ap.RLock()
	defer ap.RUnlock()
	// see if there exists a pool at all which matches name version.
	var latest strategy.Pool

seems to solve this issue.
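
The locking pattern in that snippet can be sketched in isolation. This is only an illustration under stated assumptions, not Snap's actual code: `poolRegistry`, `add` and `latest` are hypothetical names, and the integer versions stand in for real plugin pools. The read path takes `RLock`, the write path takes `Lock`, so concurrent lookups cannot observe the map while it is being modified.

```go
package main

import (
	"fmt"
	"sync"
)

// poolRegistry is a hypothetical stand-in for Snap's availablePlugins:
// a map of loaded plugin versions guarded by a read/write mutex.
type poolRegistry struct {
	sync.RWMutex
	versions map[string][]int // plugin key -> loaded versions
}

// add mutates the map, so it takes the exclusive (write) lock.
func (r *poolRegistry) add(key string, version int) {
	r.Lock()
	defer r.Unlock()
	r.versions[key] = append(r.versions[key], version)
}

// latest only reads the map, so the shared (read) lock is sufficient.
func (r *poolRegistry) latest(key string) (int, bool) {
	r.RLock()
	defer r.RUnlock()
	vs := r.versions[key]
	if len(vs) == 0 {
		return 0, false
	}
	best := vs[0]
	for _, v := range vs[1:] {
		if v > best {
			best = v
		}
	}
	return best, true
}

func main() {
	r := &poolRegistry{versions: map[string][]int{}}
	var wg sync.WaitGroup
	// Writers and readers run concurrently, as they would while tasks
	// are created and stopped in a loop.
	for v := 1; v <= 50; v++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			r.add("publisher:influxdb", v)
			r.latest("publisher:influxdb")
		}(v)
	}
	wg.Wait()
	v, ok := r.latest("publisher:influxdb")
	fmt.Println(v, ok)
}
```

Running a sketch like this with `go run -race` is one way to check that no unguarded access remains.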

For reproduction and debugging I used the script below:

#!/bin/bash
snapteld -l 1 -t 0 -o '/tmp/snapteld.log'
sleep 3
snaptel plugin load snap-plugin-collector-docker
snaptel plugin load snap-plugin-processor-tag
snaptel plugin load snap-plugin-publisher-influxdb
for j in {1..1000}
do
    echo "----------- iteration ${j} -----------------"
    id=$(snaptel task create -t task.yml | grep ID | awk '{print $2}')
    echo "Task ${id} started...."
    sleep 15

    for i in {1..5}
    do
      snaptel task stop ${id}
    done
    echo "Task stopped"
done

with task:

---
  version: 1
  schedule: 
    type: "simple"
    interval: "1s"
  max-failures: 10
  workflow: 
    collect: 
      metrics: 
        /intel/docker/*/stats/cgroups/*: {}
      process:
        -
          plugin_name: "tag"
          config:
            tags: "test:issue"
          publish:
            -
              plugin_name: "influxdb"
              config:
                host: "10.91.126.126"
                port: 8086
                database: "test"
                user: "admin"
                password: "admin"
                https: false
      publish:
        - 
          plugin_name: "influxdb"
          config: 
             host: "localhost"
             port: 8086
             database: "test"
             user: "admin"
             password: "admin"
             https: false

The repeated calls to stop the task and the general structure of the task manifest are there to simulate the exact steps from @iwankgb's experiments.

@iwankgb
Contributor Author

iwankgb commented Jan 25, 2017

I applied the mitigation that @marcin-krolik developed to the Snap installation that we use; unfortunately, it has not solved the issue.

@marcin-krolik
Collaborator

It seems my previous finding does not really solve this issue. There is a new observation though: when the hang occurs, the publisher is not killed. In every experiment I ran with different publishers (influxdb and file), the publisher binary was still visible in the list of active processes. All other plugins (collector, processor) were killed as expected.

@katarzyna-z
Contributor

It is easier to reproduce this issue by running several tasks in parallel and then stopping them after several seconds.
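
A minimal sketch of the "stop several tasks at once" step, assuming the task IDs are passed on the command line and that `snaptel task stop <id>` exits non-zero on failure (that exit-code behavior is an assumption, not something confirmed in this thread). The `runner` type, `execRunner` and `stopAll` are hypothetical helpers introduced only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sync"
)

// runner executes one CLI invocation. It is a hypothetical seam that
// lets the stop logic be exercised without a real snaptel binary.
type runner func(name string, args ...string) error

// execRunner runs the command for real and waits for it to finish.
func execRunner(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

// stopAll issues "snaptel task stop <id>" for every ID concurrently
// and returns the IDs whose stop command reported an error.
func stopAll(run runner, ids []string) []string {
	var (
		mu     sync.Mutex
		failed []string
		wg     sync.WaitGroup
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := run("snaptel", "task", "stop", id); err != nil {
				mu.Lock()
				failed = append(failed, id)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return failed
}

func main() {
	ids := os.Args[1:] // task IDs created beforehand, e.g. by a shell loop
	failed := stopAll(execRunner, ids)
	fmt.Printf("%d of %d stop requests failed: %v\n", len(failed), len(ids), failed)
}
```

A stop command that succeeds does not prove the plugin processes are gone, so the output of `snaptel task list` and `ps` still needs to be checked afterwards.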

katarzyna-z added a commit to katarzyna-z/snap that referenced this issue Feb 16, 2017
@katarzyna-z
Contributor

katarzyna-z commented Feb 16, 2017

To reproduce the issue with my method, follow these steps:

  1. get test.sh - script to download plugins, task manifests and run test
$ wget https://gist.githubusercontent.com/katarzyna-z/0d1fb8c3bfad134e494bd57390293f89/raw/56c17feb6c468c99f91484464763337432cbda35/test.sh
  2. chmod 777 test.sh
  3. launch test.sh

katarzyna-z added a commit to katarzyna-z/snap that referenced this issue Feb 17, 2017
katarzyna-z added a commit to katarzyna-z/snap that referenced this issue Feb 24, 2017
Removed double RLock in (p *pool) Eligible()

Replaced lock for reading with lock for writing (RLock/RUnlock -> Lock/Unlock) in (p *pool) IncRestartCount
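
The two changes named in that commit message can be illustrated with a reduced model. This is a sketch under assumptions, not the committed code: the `pool` type, its fields and the `restartsLocked` helper are hypothetical. Go's `sync.RWMutex` documentation prohibits recursive read locking, because a second `RLock` in the same goroutine can block forever once another goroutine is waiting in `Lock`; and a method that modifies state needs `Lock`, since `RLock` lets several goroutines in at once.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical, reduced model of a plugin pool.
type pool struct {
	mu           sync.RWMutex
	restartCount int
	maxRestarts  int
}

// restartsLocked reads the counter without locking.
// The caller must already hold p.mu.
func (p *pool) restartsLocked() int { return p.restartCount }

// Eligible takes the read lock exactly once and then uses the
// lock-free helper, instead of calling another method that would
// RLock again while a writer may be queued.
func (p *pool) Eligible() bool {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.restartsLocked() < p.maxRestarts
}

// IncRestartCount mutates state, so it takes the write lock.
func (p *pool) IncRestartCount() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.restartCount++
}

func main() {
	p := &pool{maxRestarts: 100}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.Eligible()
			p.IncRestartCount()
		}()
	}
	wg.Wait()
	fmt.Println("eligible after 100 restarts:", p.Eligible())
}
```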
PatrykMatyjasek added a commit that referenced this issue Feb 24, 2017