🐛 Bug

CI-TPU tests are constantly failing because of resource issues.

Common reasons are:

- TPU resources are not available at the time of the run. The log contains a message like `No resources found in default namespace.`
- The TPU machine is taken away while the tests are running. This shows up as an incomplete log plus a message like `Exited with code exit status 1, CircleCI received exit code 1`

In addition, the TPU test logs are currently not very useful for contributors trying to debug failures.

For example, in #12151 (CI-TPU testing link), `test_trainer_config_device_ids`, `test_accelerator_tpu`, and `test_set_devices_if_none_tpu` were all passing. However, multiple tests unrelated to that PR failed with logs mentioning `Cannot replicate if number of devices (1) is different from 8` or `can't move model to device`. All of the failing tests appear to fit models after the three PR-relevant tests have run, but the logs do not help explain why these unrelated tests fail.

To Reproduce

Submit a PR and run the CI-TPU tests. Example failures are in https://app.circleci.com/pipelines/github/PyTorchLightning/pytorch-lightning?branch=pull%2F12151&filter=all

Expected behavior

- TPU test results (pass/fail) are deterministic. The resource issues need to be looked at.
- Clearer contributor guidelines on how to work with TPU testing. Examples include: when `@pl_multi_process_test` should be used in TPU tests, how to optimize TPU test performance to avoid timeouts, and how to tell a timeout apart from the TPU resource being taken away mid-run.

cc @carmocca @akihironitta @Borda @kaushikb11 @rohitgr7
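
As a possible starting point for telling the failure modes described above apart, here is a minimal sketch of a log classifier. It is illustrative only: `classify_tpu_log` and the category names are hypothetical and not part of the repository, the matched strings are the ones quoted in this issue, and the "incomplete log" check assumes that a finished run prints a pytest-style summary line. It does not attempt to detect timeouts, since no timeout message is quoted here.

```python
import re

# Substrings quoted in this issue; matching on them is a heuristic only.
NO_RESOURCES = "No resources found in default namespace"
EXIT_STATUS_1 = "Exited with code exit status 1"
DEVICE_ERRORS = (
    "Cannot replicate if number of devices",
    "can't move model to device",
)

# Assumption: a run that finished prints a pytest-style summary line,
# e.g. "== 2 failed, 5 passed in 31.20s ==".
PYTEST_SUMMARY = re.compile(r"=+ .*\b(passed|failed|errors?)\b.* in [\d.]+s")


def classify_tpu_log(log: str) -> str:
    """Return a coarse, hypothetical failure category for a CI-TPU log."""
    if NO_RESOURCES in log:
        # No TPU could be allocated for the job.
        return "tpu-unavailable"
    finished = PYTEST_SUMMARY.search(log) is not None
    if EXIT_STATUS_1 in log and not finished:
        # The log stops before any test summary: the TPU may have been
        # taken away mid-run (this is a guess, not a certainty).
        return "tpu-preempted-suspected"
    if any(msg in log for msg in DEVICE_ERRORS):
        # The run finished, but tests hit device-related errors.
        return "device-error"
    return "unknown"
```

For instance, `classify_tpu_log("No resources found in default namespace.")` returns `"tpu-unavailable"`. A check like this could only label a failure; it would not explain why unrelated tests see a different device count.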