Teraslice jobs fail to start on k8s 1.27.4, ts-exc starts but workers don't #3428

busma13 opened this issue Oct 6, 2023 · 4 comments

busma13 commented Oct 6, 2023

I had an issue running Teraslice in minikube, possibly related to the version of k8s I'm using. Here are the teraslice-master logs. I've removed or abbreviated what didn't look relevant to me.
kubectl -n ts-dev1 logs teraslice-master-6f65f6bcc4-5mt99 | bunyan
[2023-10-04T22:26:45.037Z]  INFO: teraslice/7 on teraslice-master-6f65f6bcc4-5mt99: Service starting (assignment=node_master)
...
(skipping setup, ES, asset deployment, etc)
...
[2023-10-04T22:34:14.911Z] DEBUG: example-data-generator-job/14 on teraslice-master-6f65f6bcc4-5mt99: enqueueing execution to be processed (queue size 0) (assignment=cluster_master, module=execution_service, worker_id=WvOOtRs_, active=true, analytics=true, performance_metrics=false, autorecover=false, lifecycle=once, max_retries=3, probation_window=300000, slicers=1, workers=2, stateful=false, labels=null, env_vars={}, targets=[], ephemeral_storage=true, pod_spec_override={}, volumes=[], job_id=5aee9a70-65ea-47ed-be53-ead76c096b4d, _context=ex, _created=2023-10-04T22:34:14.895Z, _updated=2023-10-04T22:34:14.895Z, ex_id=d5222f35-ad6a-4269-8b4e-e846636e44d4, metadata={}, _status=pending, _has_errors=false, _slicer_stats={}, _failureReason="")
    assets: [
      "65ee07b97850ce15e78068224febff5c5deb9ae7",
      "2b4f08ae993293c44af418d2f9ec3746d98039ae",
      "00183d8c533503f4acba7aa001931563a001791f"
    ]
    --
    operations: [
      {
        "_op": "data_generator",
        "_encoding": "json",
        "_dead_letter_action": "throw",
        "json_schema": null,
        "size": 5000000,
        "start": null,
        "end": null,
        "format": null,
        "stress_test": false,
        "date_key": "created",
        "set_id": null,
        "id_start_key": null
      },
      {
        "_op": "example",
        "_encoding": "json",
        "_dead_letter_action": "none",
        "type": "string"
      },
      {
        "_op": "delay",
        "_encoding": "json",
        "_dead_letter_action": "throw",
        "ms": 30000
      },
      {
        "_op": "elasticsearch_bulk",
        "_encoding": "json",
        "_dead_letter_action": "throw",
        "size": 5000,
        "connection": "default",
        "index": "terak8s-example-data",
        "type": "events",
        "delete": false,
        "update": false,
        "update_retry_on_conflict": 0,
        "update_fields": [],
        "upsert": false,
        "create": false,
        "script_file": "",
        "script": "",
        "script_params": {},
        "api_name": "elasticsearch_sender_api"
      }
    ]
    --
    apis: [
      {
        "_name": "elasticsearch_sender_api",
        "_encoding": "json",
        "_dead_letter_action": "throw",
        "size": 5000,
        "connection": "default",
        "index": "terak8s-example-data",
        "type": "events",
        "delete": false,
        "update": false,
        "update_retry_on_conflict": 0,
        "update_fields": [],
        "upsert": false,
        "create": false,
        "script_file": "",
        "script": "",
        "script_params": {},
        "_op": "elasticsearch_bulk"
      }
    ]
[2023-10-04T22:34:15.230Z]  INFO: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: Scheduling execution: d5222f35-ad6a-4269-8b4e-e846636e44d4 (assignment=cluster_master, module=execution_service, worker_id=WvOOtRs_)
[2023-10-04T22:34:15.273Z] DEBUG: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: execution allocating slicer (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_, apiVersion=batch/v1, kind=Job)
    metadata: {
        ...
    }
     --
    spec: {
        ...
    }
[2023-10-04T22:34:15.284Z] DEBUG: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: k8s slicer job submitted (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_, kind=Job, apiVersion=batch/v1, status={})
    metadata: {
        ...
    }
     --
    spec: {
        ...
    }
[2023-10-04T22:34:15.289Z] DEBUG: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: waiting for pod matching: controller-uid=undefined (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_)
(repeats 10 times)
...
[2023-10-04T22:34:25.673Z]  INFO: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: execution d5222f35-ad6a-4269-8b4e-e846636e44d4 is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_)
[2023-10-04T22:34:26.375Z] DEBUG: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: waiting for pod matching: controller-uid=undefined (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_)
(repeats 48 more times)
...
[2023-10-04T22:35:16.021Z]  WARN: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: Failed to provision execution d5222f35-ad6a-4269-8b4e-e846636e44d4 (assignment=cluster_master, module=execution_service, worker_id=WvOOtRs_)
[2023-10-04T22:35:16.059Z]  WARN: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: Calling stopExecution on execution: d5222f35-ad6a-4269-8b4e-e846636e44d4 to clean up k8s resources. (assignment=cluster_master, module=execution_service, worker_id=WvOOtRs_)
[2023-10-04T22:35:16.063Z]  INFO: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: k8s._deleteObjByExId: d5222f35-ad6a-4269-8b4e-e846636e44d4 execution_controller jobs deleting: ts-exc-example-data-generator-job-5aee9a70-65ea (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_)
[2023-10-04T22:35:16.072Z]  INFO: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: k8s._deleteObjByExId: d5222f35-ad6a-4269-8b4e-e846636e44d4 worker deployments has already been deleted (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_)
[2023-10-04T22:35:16.205Z] DEBUG: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: execution d5222f35-ad6a-4269-8b4e-e846636e44d4 finished, shutting down execution (assignment=cluster_master, module=execution_service, worker_id=WvOOtRs_)
[2023-10-04T22:35:16.209Z]  INFO: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: k8s._deleteObjByExId: d5222f35-ad6a-4269-8b4e-e846636e44d4 execution_controller jobs has already been deleted (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_)
[2023-10-04T22:35:16.212Z]  INFO: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: k8s._deleteObjByExId: d5222f35-ad6a-4269-8b4e-e846636e44d4 worker deployments has already been deleted (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=WvOOtRs_)
[2023-10-04T22:35:16.297Z]  INFO: teraslice/14 on teraslice-master-6f65f6bcc4-5mt99: client d5222f35-ad6a-4269-8b4e-e846636e44d4 disconnected { reason: 'client namespace disconnect' } (assignment=cluster_master, module=messaging:server, worker_id=WvOOtRs_)

_Originally posted by @busma13 in https://github.com/terascope/teraslice/issues/3427#issuecomment-1749016492_
            
ref: #3427  

busma13 commented Oct 6, 2023

Changing the k8s version to 1.23.17 works, so it is likely an API change.


busma13 commented Oct 6, 2023

In v1.23.17:
@@@@@@@@@ k8s index.js jobResult.spec.selector.matchLabels: {"controller-uid":"cce11921-bca1-41d2-9e62-a7541ba65a40"}

In v1.27.3:
@@@@@@@@@ k8s index.js jobResult.spec.selector.matchLabels: {"batch.kubernetes.io/controller-uid":"dd0a0705-a83c-4d75-9744-47ab7746303f"}
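
For context: Kubernetes 1.27 switched the Job controller's generated selector to the prefixed batch.kubernetes.io/controller-uid label key, while older clusters (and Teraslice's current lookup) use the bare controller-uid key. A small illustration, using the values from the debug output above, of why the master then logs controller-uid=undefined:

    // Illustration only: the shape of jobResult.spec.selector.matchLabels on each version.
    const oldSelector = { 'controller-uid': 'cce11921-bca1-41d2-9e62-a7541ba65a40' };                     // k8s <= 1.26
    const newSelector = { 'batch.kubernetes.io/controller-uid': 'dd0a0705-a83c-4d75-9744-47ab7746303f' }; // k8s >= 1.27

    // The current lookup only knows the bare key, so on 1.27 it yields undefined:
    console.log(oldSelector['controller-uid']); // 'cce11921-...'
    console.log(newSelector['controller-uid']); // undefined -> "waiting for pod matching: controller-uid=undefined"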


godber commented Oct 6, 2023

Specifically, this message:

waiting for pod matching: controller-uid=undefined

Coming from here ...

https://github.com/terascope/teraslice/blob/master/packages/teraslice/lib/cluster/services/cluster/backends/kubernetes/k8s.js#L104

called from here:

async function allocateSlicer(ex) {
    const execution = cloneDeep(ex);
    execution.slicer_port = 45680;

    const exJobResource = new K8sResource(
        'jobs', 'execution_controller', context.sysconfig.teraslice, execution, logger
    );
    const exJob = exJobResource.resource;

    logger.debug(exJob, 'execution allocating slicer');
    const jobResult = await k8s.post(exJob, 'job');
    logger.debug(jobResult, 'k8s slicer job submitted');

    const controllerUid = jobResult.spec.selector.matchLabels['controller-uid'];

    const pod = await k8s.waitForSelectedPod(
        `controller-uid=${controllerUid}`,
        null,
        context.sysconfig.teraslice.slicer_timeout
    );

    logger.debug(`Slicer is using IP: ${pod.status.podIP}`);
    execution.slicer_hostname = `${pod.status.podIP}`;

    return execution;
}

So we can't "find" the execution controller the way we used to anymore.
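
A minimal sketch of one possible workaround, assuming the surrounding allocateSlicer code stays as above (illustrative only, not necessarily what the eventual fix does): read whichever controller-uid key the returned Job's selector actually carries and build the pod selector from that same key.

    // Hypothetical fallback inside allocateSlicer(), after jobResult comes back from k8s.post():
    const matchLabels = (jobResult.spec.selector && jobResult.spec.selector.matchLabels) || {};
    const labelKey = 'batch.kubernetes.io/controller-uid' in matchLabels
        ? 'batch.kubernetes.io/controller-uid' // k8s >= 1.27
        : 'controller-uid';                    // k8s <= 1.26
    const controllerUid = matchLabels[labelKey];

    const pod = await k8s.waitForSelectedPod(
        `${labelKey}=${controllerUid}`,
        null,
        context.sysconfig.teraslice.slicer_timeout
    );

Pods created by the Job should still carry the legacy controller-uid label on 1.27, but matching on the key the API server actually reports keeps the selector and the lookup in sync across versions.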


busma13 commented Oct 10, 2023

ref: #3429

@godber godber changed the title Running Minikube k8s example breaks on k8s 1.27.4 Teraslice jobs fail to start on k8s 1.27.4, ts-exc starts but workers don't Oct 11, 2023
@busma13 busma13 closed this as completed Oct 20, 2023