
Many random errors on CPU generic workers #630

Open
Tracked by #311
eu9ene opened this issue May 24, 2024 · 3 comments
Labels: bug (Something is broken or not correct), taskcluster (Issues related to the Taskcluster implementation of the training pipeline)

Comments

@eu9ene added the bug and taskcluster labels on May 24, 2024
bhearsum (Collaborator) commented:

The common thread here seems to be that we are running out of some sort of resource. Examples include:

[task 2024-05-23T23:56:57.158Z] [34/12:fasttext_filter] OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 32: Resource temporarily unavailable
[task 2024-05-24T04:15:45.087Z] [24/4:deescape-special-chars] Error: can't start new thread
[task 2024-05-24T01:12:29.457Z] [81/12:fasttext_filter] OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 32: Resource temporarily unavailable
[task 2024-05-24T01:12:29.457Z] [81/12:fasttext_filter] OpenBLAS blas_thread_init: RLIMIT_NPROC 1031641 current, 1031641 max
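
These failures are typical of a numerical library trying to spawn one thread per visible core inside a container whose pids budget is nearly exhausted. A common mitigation, sketched below, is to cap the BLAS/OpenMP thread pools before the library loads. The environment variables are real OpenBLAS/OpenMP knobs, but the cap of 4 threads is an illustrative choice, not a value taken from this issue:

```python
import os

# Cap the thread pools *before* importing numpy, otherwise OpenBLAS
# has already sized its pool to the number of visible cores.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")  # illustrative cap
os.environ.setdefault("OMP_NUM_THREADS", "4")

import numpy as np  # deliberately imported after the env setup

# A matrix multiply exercises the BLAS thread pool; with the caps above
# it creates at most 4 worker threads instead of one per core.
a = np.random.rand(1024, 1024)
b = np.random.rand(1024, 1024)
print((a @ b).sum())
```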

I suspect podman is enforcing some resource limits. For example, by default there's a limit of 2048 pids in the container:

--pids-limit=limit

Tune the container’s pids limit. Set to -1 to have unlimited pids for the container. The default is 2048 on systems that support “pids” cgroup controller.

(From https://docs.podman.io/en/latest/markdown/podman-run.1.html.)
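
One way to confirm whether such a limit is active is to read the pids cgroup from inside the task container. A minimal sketch, assuming standard Linux cgroup mount points (which of the two paths exists depends on whether the host uses cgroup v2 or v1):

```python
from pathlib import Path

def current_pids_limit() -> str:
    """Return the container's pids limit, or 'max' if unlimited."""
    for path in (
        Path("/sys/fs/cgroup/pids.max"),       # cgroup v2
        Path("/sys/fs/cgroup/pids/pids.max"),  # cgroup v1
    ):
        if path.exists():
            return path.read_text().strip()
    return "unknown (no pids controller mounted)"

print("pids limit:", current_pids_limit())
# In a default podman container this would be expected to print 2048;
# with --pids-limit=-1 it would print 'max'.
```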

I'm not sure if we're doing this intentionally or not; I'll look into it.

bhearsum (Collaborator) commented Jul 2, 2024:

taskcluster/taskcluster#7120 has been filed for this.

bhearsum (Collaborator) commented Jul 4, 2024:

> taskcluster/taskcluster#7120 has been filed for this.

This is fixed. We need to wait for it to be released and pick up a new version of generic-worker for the CPU workers before we can call this ticket fixed. I may wait to update that image until some of the other things blocking generic-worker for CPU workers are dealt with, in case we have other fixes that need to get into the image.
