
TPU CI process flakiness #12335

Closed
DuYicong515 opened this issue Mar 15, 2022 · 2 comments
Assignees
Labels
accelerator: tpu Tensor Processing Unit ci Continuous Integration tests

Comments

Contributor

DuYicong515 commented Mar 15, 2022

🐛 Bug

CI-TPU tests are constantly failing because of resource issues.
Common causes are:

  1. TPU resources are not available. Log messages look like No resources found in default namespace.
  2. The TPU machine is taken away during test execution. This shows up as an incomplete log plus a message like Exited with code exit status 1, CircleCI received exit code 1.
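The first failure mode is in principle retryable, while a genuine test failure is not. A CI wrapper could distinguish the two roughly as follows — a minimal hypothetical sketch, not anything in the current CI scripts; run_tests and the log-matching string are assumptions:

```python
import time

# Assumed marker for the "no TPU available" failure mode described above.
RESOURCE_ERROR_MARKER = "No resources found in default namespace"

def run_with_retry(run_tests, max_retries=3, delay=1.0):
    """Retry only infrastructure failures, never genuine test failures.

    run_tests() is assumed to return (ok: bool, log: str).
    """
    for _ in range(max_retries):
        ok, log = run_tests()
        if ok:
            return True, log
        if RESOURCE_ERROR_MARKER not in log:
            return False, log      # real test failure: surface immediately
        time.sleep(delay)          # no TPU available: back off and retry
    return False, log              # gave up after max_retries attempts

# Simulated runner: fails once with a resource error, then passes.
attempts = []
def fake_run_tests():
    attempts.append(1)
    if len(attempts) == 1:
        return False, "No resources found in default namespace"
    return True, "all tests passed"

print(run_with_retry(fake_run_tests, delay=0.0))  # (True, 'all tests passed')
```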

In addition, the TPU test logs are currently not very useful for contributors trying to debug issues.
For example, in #12151 (CI-TPU testing link), test_trainer_config_device_ids, test_accelerator_tpu, and test_set_devices_if_none_tpu all passed. However, multiple tests unrelated to that PR failed with logs such as Cannot replicate if number of devices (1) is different from 8 or errors about being unable to move the model to the device. All of the failed tests appear to fit models after the three PR-relevant tests have run, but the logs give little insight into why these unrelated tests fail.

To Reproduce

Submit a PR and run CI-TPU tests. Example failures are in https://app.circleci.com/pipelines/github/PyTorchLightning/pytorch-lightning?branch=pull%2F12151&filter=all

Expected behavior

  1. TPU test results should be deterministic; the resource issues need to be investigated.
  2. Provide clearer contributor guidelines on TPU testing, e.g. when @pl_multi_process_test should be used in TPU tests, how to optimize TPU test performance to avoid timeouts, and how to distinguish a timeout from the resource being taken away mid-run.
  3. Improve the TPU CI log messages.
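For context, the pattern behind a decorator like @pl_multi_process_test can be sketched as below. This is a simplified, hypothetical stand-in, not pytorch-lightning's actual implementation: run the test body in a child process so that a TPU runtime crash kills only the child, and map an abnormal exit to a distinguishable infrastructure error instead of a bare exit code 1.

```python
import functools
import multiprocessing

def multi_process_test(fn):
    """Simplified stand-in (assumption: NOT the real pl_multi_process_test).

    Runs fn in a forked child process; assumes a POSIX platform where the
    "fork" start method is available, so closures need no pickling.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        ctx = multiprocessing.get_context("fork")
        queue = ctx.Queue()

        def child(q):
            try:
                fn(*args, **kwargs)
                q.put(("passed", None))
            except Exception as exc:      # a genuine test failure
                q.put(("failed", repr(exc)))

        proc = ctx.Process(target=child, args=(queue,))
        proc.start()
        proc.join(timeout=120)
        if queue.empty():
            # Child died without reporting (e.g. TPU taken away, hard kill):
            # report as an infrastructure error, not a plain test failure.
            return ("infra-error", proc.exitcode)
        return queue.get()
    return wrapper

@multi_process_test
def my_tpu_test():
    assert 1 + 1 == 2

print(my_tpu_test())  # ('passed', None)
```

Isolating each test in its own process also explains why one crashed test can otherwise poison later tests in the same process, as seen in the unrelated failures above.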

cc @carmocca @akihironitta @Borda @kaushikb11 @rohitgr7

@kaushikb11 kaushikb11 self-assigned this Mar 15, 2022
@kaushikb11 kaushikb11 added this to the 1.7 milestone Mar 15, 2022
@kaushikb11 kaushikb11 added ci Continuous Integration accelerator: tpu Tensor Processing Unit tests labels Mar 15, 2022
@akihironitta
Contributor

Also, it would be very nice to see logs in real time.

Note that, currently, we can see logs from the TPU CI job only after it succeeds or fails (which usually takes 20+ minutes), since they are dumped at:
https://github.com/PyTorchLightning/pytorch-lightning/blob/8b4abe4edb6912abeb3906c48bb822a3681b08c4/.circleci/config.yml#L96-L99
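The behaviour described here, output visible only after the job finishes, is what happens when the CI captures the whole log and dumps it at the end. Streaming it line by line instead could be sketched like this (a hypothetical illustration; the command is a placeholder, not the actual CI invocation):

```python
import subprocess
import sys

# Placeholder command standing in for the TPU test run in the CI job.
cmd = [sys.executable, "-c", "print('test 1 passed'); print('test 2 passed')"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, text=True)
seen = []
for line in proc.stdout:        # yields lines as they are produced,
    print(line, end="")         # so they show up in real time
    seen.append(line.rstrip("\n"))
proc.wait()
```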

@carmocca carmocca removed this from the pl:1.7 milestone Jul 19, 2022
@carmocca
Contributor

Closing in favor of #13720

@carmocca carmocca closed this as not planned Jul 19, 2022