Multiple slicers running with the same ex_id #454
We know that we lost network to some nodes during this period, which we believe left one of the teraslice worker nodes without network for at least several minutes. Kimbro reports that the logs indicate that at least two slicers were running on the node that didn't have network.
It appears this incident was the result of a node temporarily losing network connectivity. While there was no connectivity, the slicers crashed with a "no living connection" error, which I suspect means the state cluster was unreachable. I believe they were then restarted automatically, resulting in the issue above. At the time of this incident there were 4 jobs running: 2 of them ended up in failed status (the ones with slicers on the affected node) and 2 in failing status (I suspect from losing the workers on the node).
I have attempted to manually reproduce this but have not yet had luck. I do have something worth reporting though. First off, the technique for reproducing this requires the creation of a persistent job:

```json
{
    "name": "Data generator to noop",
    "lifecycle": "persistent",
    "analytics": false,
    "operations": [
        {
            "_op": "elasticsearch_data_generator",
            "type": "events",
            "size": 1000
        },
        {
            "_op": "noop"
        }
    ]
}
```

I launch the integration tests and let them complete so I have a TS master, a TS worker, and an ES node, then I run the job with this:

```sh
curl -XPOST -Ss localhost:45678/jobs -d@spec/fixtures/jobs/gen-noop.json
```

The next thing you need is a way to partition the worker that runs the slicer from the rest of the machines. Fortunately, recent versions of docker allow you to disconnect a container from a network. The integration tests get a named network; disconnect the worker's container from it, and you can then reconnect the same worker with the corresponding `docker network connect` command.
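The exact names depend on the integration-test environment, so take these as placeholders (check `docker network ls` and `docker ps` for the real network and container names), but the partition/heal cycle looks roughly like this:

```sh
# Placeholder network/container names -- substitute the ones created by the
# integration tests (see `docker network ls` and `docker ps`).
docker network disconnect tests_default teraslice_worker_1

# ...wait long enough for the slicer to hit connection errors, then reconnect:
docker network connect tests_default teraslice_worker_1
```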
This did not result in a duplicate slicer in the two cases I've tried, but it did result in a log entry that claimed to be creating a slicer right before the job got shut down:
Note on the above ... the job had the following failed status:
Here is the sequence of events from above distilled down to just the slicer-related entries:
From the `cluster.on('exit', ...)` handler:

```js
cluster.on('exit', function(worker, code, signal) {
    var type = worker.assignment ? worker.assignment : 'worker';
    logger.info(`${type} has exited, id: ${worker.id}, code: ${code}, signal: ${signal}`);
    if (!shuttingDown && shouldProcessRestart(code, signal)) {
        var envConfig = determineWorkerENV(config, worker);
        var newWorker = cluster.fork(envConfig);
        logger.info(`launching a new ${type}, id: ${newWorker.id}`);
        logger.debug(`new worker configuration: ${JSON.stringify(envConfig)}`);
        _.assign(cluster.workers[newWorker.id], envConfig)
    }
});
```

The call to `determineWorkerENV` appears to be what sets `__process_restart` for a restarted slicer, which leads to this check in the slicer:

```js
if (process.env.__process_restart) {
    var errMsg = `Slicer for ex_id: ${ex_id} runtime error led to a restart, terminating job with failed status, please use the recover api to return slicer to a consistent state`;
    logger.error(errMsg);
    messaging.send({message: 'slicer:error:terminal', error: errMsg, ex_id: ex_id})
}
```

so it looks like it's trying to prevent itself from starting (which it does in this case). But perhaps this
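For clarity, here is a minimal standalone sketch of the same technique (illustration only, not teraslice's actual code), showing how `cluster.fork(env)` can tag a restarted process so the replacement can refuse to become a second slicer:

```js
// Illustration only -- not teraslice's code. A crashed worker is restarted with
// an extra env var, and the replacement bails out when it sees the flag.
const cluster = require('cluster');

if (cluster.isMaster) {
    cluster.fork();
    cluster.on('exit', (worker, code, signal) => {
        if (code !== 0) {
            // The env object passed here is merged into process.env of the new child.
            cluster.fork({ __process_restart: 'true' });
        }
    });
} else if (process.env.__process_restart) {
    // The restarted child sees the flag and terminates instead of doing real work.
    console.error('restarted after a crash, refusing to start another slicer');
    process.exit(0);
} else {
    console.log('doing normal slicer work');
    // Simulate a crash so the master forks a flagged replacement.
    setTimeout(() => process.exit(1), 500);
}
```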
prevent cleanUpNode fn from duplicating jobs resolves #454
We had a recent incident where a teraslice job reading from Kafka and writing to ES started failing. This left us with duplicate slicers on the same ex_id, as shown below: