
Multiple slicers running with the same ex_id #454

Closed
godber opened this issue Jul 12, 2017 · 6 comments


godber commented Jul 12, 2017

We had a recent incident where a teraslice job reading from Kafka and writing to Elasticsearch started failing. This left us with duplicate slicers for the same ex_id, as shown below:

curl -Ss  teraslice-master:5678/txt/slicers
ex_id                                 workers_available  workers_active  failed  queued  processed  subslice_by_key
------------------------------------  -----------------  --------------  ------  ------  ---------  ---------------
558e9d1e-053c-4e7e-8147-5e7c779a3c15  0                  1               2       10001   99323      0
b76938b9-1a24-43f0-8add-4423ef08326b  0                  1               0       10001   174        0
b76938b9-1a24-43f0-8add-4423ef08326b  0                  1               0       10001   174        0
558e9d1e-053c-4e7e-8147-5e7c779a3c15  0                  1               0       10001   177        0
50161028-f91b-4c1c-a489-9fe78929a578  0                  0               1       10001   45385      0
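
A quick way to spot this condition from the same endpoint (assuming the /txt/slicers column layout shown above, with ex_id in the first column and two header lines) is to look for repeated ids:

# Rough check: print any ex_id that appears more than once in /txt/slicers
curl -Ss teraslice-master:5678/txt/slicers | tail -n +3 | awk '{print $1}' | sort | uniq -d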

godber commented Jul 12, 2017

We know that we lost network connectivity to some nodes during this period, and we believe this left one of the teraslice worker nodes without network for at least several minutes. Kimbro reports that the logs indicate at least two slicers were running on the node that had lost its network connection.


kstaken commented Jul 12, 2017

It appears this incident was the result of a node temporarily losing network connectivity. While there was no connectivity, the slicers crashed with a "no living connection" error, which I suspect was because the state cluster was unreachable. I believe they were then restarted automatically, resulting in the issue above.

At the time of this incident there were four jobs running. Two of them ended up in failed status (the ones with slicers on the affected node) and two in failing status (I suspect from losing the workers on that node).


godber commented Aug 29, 2017

I have attempted to reproduce this manually but have not yet had any luck. I do have something worth reporting, though. First off, the technique for reproducing this requires the creation of a persistent job:

{
  "name": "Data generator to noop",
  "lifecycle": "persistent",
  "analytics": false,
  "operations": [
    {
      "_op": "elasticsearch_data_generator",
      "type": "events",
      "size": 1000
    },
    {
      "_op": "noop"
    }
  ]
}

I launch the integration tests and let them complete so I have a TS master, a TS worker, and an ES node, then I run the job with this:

curl -XPOST -Ss localhost:45678/jobs -d@spec/fixtures/jobs/gen-noop.json
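
As a sanity check before introducing any failures (assuming the integration-test master is the one listening on localhost:45678, as in the curl above), the /txt/slicers endpoint from the first comment should show a single row for the new job's ex_id:

# Expect exactly one slicer row for the freshly submitted job
curl -Ss localhost:45678/txt/slicers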

The next thing you need is a way to partition the worker that runs the slicer from the rest of the machines. Fortunately, recent versions of Docker allow you to disconnect a container from a network. The integration tests create a network named integrationtests_default, and the worker container is named integrationtests_teraslice-worker_1:

docker network disconnect integrationtests_default integrationtests_teraslice-worker_1

You can then reconnect the same worker using the following command:

docker network connect integrationtests_default integrationtests_teraslice-worker_1
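
To make the outage window repeatable, the two commands can be wrapped in a small script; the network and container names are the ones above, and the 150 second sleep is an arbitrary choice intended to be long enough for the slicer's backend requests to time out:

#!/bin/bash
# Reproduction helper sketch: partition the worker container, wait long
# enough for requests to the state cluster to fail, then reconnect it.
NETWORK=integrationtests_default
CONTAINER=integrationtests_teraslice-worker_1

docker network disconnect "$NETWORK" "$CONTAINER"
sleep 150
docker network connect "$NETWORK" "$CONTAINER"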

This did not result in a duplicate slicer in the two cases I've tried, but it did result in a log entry claiming a new slicer was being launched right before the job got shut down:

$ docker-compose logs -f
...
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":100,"module":"elasticsearch_backend","level":50,"msg":"Error: Request Timeout after 120000ms\n    at /usr/local/lib/node_modules/terafoundation/node_modules/elasticsearch/src/lib/transport.js:336:15\n    at null.<anonymous> (/usr/local/lib/node_modules/terafoundation/node_modules/elasticsearch/src/lib/transport.js:365:7)\n    at Timer.listOnTimeout (timers.js:92:15)","time":"2017-08-29T19:54:16.919Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":100,"module":"elasticsearch_backend","level":50,"msg":"Error: Request Timeout after 120000ms\n    at /usr/local/lib/node_modules/terafoundation/node_modules/elasticsearch/src/lib/transport.js:336:15\n    at null.<anonymous> (/usr/local/lib/node_modules/terafoundation/node_modules/elasticsearch/src/lib/transport.js:365:7)\n    at Timer.listOnTimeout (timers.js:92:15)","time":"2017-08-29T19:54:16.919Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":106,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"worker","worker_id":"172.23.0.4__11","level":50,"msg":"worker 172.23.0.4__11 has disconnected from slicer ex_id 32e58581-fb8e-4ea0-b271-165e8c89dbcd transport close","time":"2017-08-29T19:54:21.373Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":1,"level":30,"msg":"slicer has exited, id: 10, code: 255, signal: null","time":"2017-08-29T19:54:21.374Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":1,"level":30,"msg":"launching a new slicer, id: 12","time":"2017-08-29T19:54:21.378Z","v":0}
teraslice-master_1  | {"name":"teraslice","hostname":"51fa18860875","pid":84,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"worker","worker_id":"172.23.0.3__8","level":50,"msg":"worker 172.23.0.3__8 has disconnected from slicer ex_id 32e58581-fb8e-4ea0-b271-165e8c89dbcd transport close","time":"2017-08-29T19:54:21.382Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":120,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"slicer","level":50,"msg":"Slicer for ex_id: 32e58581-fb8e-4ea0-b271-165e8c89dbcd runtime error led to a restart, terminating job with failed status, please use the recover api to return slicer to a consistent state","time":"2017-08-29T19:54:23.820Z","v":0}
teraslice-master_1  | {"name":"teraslice","hostname":"51fa18860875","pid":10,"module":"jobs_service","level":50,"msg":"slicer for ex_id: 32e58581-fb8e-4ea0-b271-165e8c89dbcd has failed to initialize","time":"2017-08-29T19:54:23.829Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":120,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"slicer","level":30,"msg":"slicer for job: 32e58581-fb8e-4ea0-b271-165e8c89dbcd has received a shutdown notice","time":"2017-08-29T19:54:23.847Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":120,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"slicer","level":30,"msg":"slicer is now shutting down stores and exiting","time":"2017-08-29T19:54:24.354Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":1,"level":30,"msg":"slicer has exited, id: 12, code: 0, signal: null","time":"2017-08-29T19:54:24.377Z","v":0}
teraslice-master_1  | {"name":"teraslice","hostname":"51fa18860875","pid":84,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"worker","worker_id":"172.23.0.3__8","level":30,"msg":"shutting down","time":"2017-08-29T19:54:24.831Z","v":0}
teraslice-master_1  | {"name":"teraslice","hostname":"51fa18860875","pid":84,"module":"analytics_storage","level":30,"msg":"shutting down.","time":"2017-08-29T19:54:24.832Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":106,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"worker","worker_id":"172.23.0.4__11","level":30,"msg":"shutdown sequence initiated, but is still processing. Will force shutdown in 60 seconds","time":"2017-08-29T19:54:24.844Z","v":0}
teraslice-master_1  | {"name":"teraslice","hostname":"51fa18860875","pid":1,"level":30,"msg":"worker has exited, id: 8, code: 0, signal: null","time":"2017-08-29T19:54:24.865Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":106,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"worker","worker_id":"172.23.0.4__11","slice_id":"62e9a19d-7fc7-402b-8d08-2e76690c36b8","level":30,"msg":"completed slice:  { slice_id: '62e9a19d-7fc7-402b-8d08-2e76690c36b8',\n  request: 1000,\n  slicer_id: 0,\n  slicer_order: 262 }","time":"2017-08-29T19:54:25.339Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":106,"ex_id":"32e58581-fb8e-4ea0-b271-165e8c89dbcd","module":"worker","worker_id":"172.23.0.4__11","level":30,"msg":"shutting down","time":"2017-08-29T19:54:25.847Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":106,"module":"analytics_storage","level":30,"msg":"shutting down.","time":"2017-08-29T19:54:25.849Z","v":0}
teraslice-worker_1  | {"name":"teraslice","hostname":"24f7aa4bd766","pid":1,"level":30,"msg":"worker has exited, id: 11, code: 0, signal: null","time":"2017-08-29T19:54:25.883Z","v":0}
^CERROR: Aborting.


godber commented Aug 29, 2017

Note on the above ... the job had the following failed status:

          "_status": "failed",
          "_has_errors": "true",
          "_failureReason": "node_master where slicer resided has disconnected"


godber commented Aug 30, 2017

Here is the sequence of events from above, distilled down to just the slicer-related entries (w is from the worker node, A is from the master):

w  | {"pid":1,"msg":"slicer has exited, id: 10, code: 255, signal: null"}
w  | {"pid":1,"msg":"launching a new slicer, id: 12"}
w  | {"pid":120,"module":"slicer","msg":"Slicer for ex_id: 32e58581-fb8e-4ea0-b271-165e8c89dbcd runtime error led to a restart, terminating job with failed status, please use the recover api to return slicer to a consistent state"}
A  | {"pid":10,"module":"jobs_service","msg":"slicer for ex_id: 32e58581-fb8e-4ea0-b271-165e8c89dbcd has failed to initialize"}
w  | {"pid":120,"module":"slicer","msg":"slicer for job: 32e58581-fb8e-4ea0-b271-165e8c89dbcd has received a shutdown notice"}
w  | {"pid":120,"module":"slicer","msg":"slicer is now shutting down stores and exiting"}
w  | {"pid":1,"msg":"slicer has exited, id: 12, code: 0, signal: null"}


godber commented Aug 30, 2017

From master.js in terafoundation, it looks like any process that dies will be restarted, subject to some conditions; in this particular case, slicers are restarted:

    cluster.on('exit', function(worker, code, signal) {
        var type = worker.assignment ? worker.assignment : 'worker';
        logger.info(`${type} has exited, id: ${worker.id}, code: ${code}, signal: ${signal}`);
        if (!shuttingDown && shouldProcessRestart(code, signal)) {
            var envConfig = determineWorkerENV(config, worker);
            var newWorker = cluster.fork(envConfig);
            logger.info(`launching a new ${type}, id: ${newWorker.id}`);
            logger.debug(`new worker configuration: ${JSON.stringify(envConfig)}`);

            _.assign(cluster.workers[newWorker.id], envConfig)

        }
    });

The call to determineWorkerENV() sets process.env.__process_restart to true, which triggers the following in the slicer:

if (process.env.__process_restart) {
    var errMsg = `Slicer for ex_id: ${ex_id} runtime error led to a restart, terminating job with failed status, please use the recover api to return slicer to a consistent state`;
    logger.error(errMsg);
    messaging.send({message: 'slicer:error:terminal', error: errMsg, ex_id: ex_id})
}

So it looks like it's trying to prevent itself from starting (which it does in this case). But perhaps this slicer:error:terminal message isn't making it back to the master for some reason.

godber closed this as completed in cc34a42 on Sep 5, 2017
godber added a commit that referenced this issue Sep 5, 2017
prevent cleanUpNode fn from duplicating jobs resolves #454