
TPU CI process flakiness #12335

Closed
DuYicong515 opened this issue Mar 15, 2022 · 2 comments
Assignees
Labels
accelerator: tpu Tensor Processing Unit ci Continuous Integration tests

Comments

Contributor

DuYicong515 commented Mar 15, 2022

🐛 Bug

CI-TPU tests are constantly failing because of resource issues.
Common causes are:

  1. TPU resources are not available. Log messages look like No resources found in default namespace.
  2. The TPU machine is taken away during test execution. This shows up as an incomplete log plus a message like Exited with code exit status 1, CircleCI received exit code 1.
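The first failure mode is in principle retryable, while a genuine test failure is not. A CI wrapper could distinguish the two roughly as follows — a minimal hypothetical sketch, not anything in the current CI scripts; run_tests and the log-matching string are assumptions:

```python
import time

# Assumed marker for the "no TPU available" failure mode described above.
RESOURCE_ERROR_MARKER = "No resources found in default namespace"

def run_with_retry(run_tests, max_retries=3, delay=1.0):
    """Retry only infrastructure failures, never genuine test failures.

    run_tests() is assumed to return (ok: bool, log: str).
    """
    for _ in range(max_retries):
        ok, log = run_tests()
        if ok:
            return True, log
        if RESOURCE_ERROR_MARKER not in log:
            return False, log      # real test failure: surface immediately
        time.sleep(delay)          # no TPU available: back off and retry
    return False, log              # gave up after max_retries attempts

# Simulated runner: fails once with a resource error, then passes.
attempts = []
def fake_run_tests():
    attempts.append(1)
    if len(attempts) == 1:
        return False, "No resources found in default namespace"
    return True, "all tests passed"

print(run_with_retry(fake_run_tests, delay=0.0))  # (True, 'all tests passed')
```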

In addition, the TPU test logs are currently not very useful for contributors trying to debug issues.
For example, in #12151 (CI-TPU testing link), test_trainer_config_device_ids, test_accelerator_tpu, and test_set_devices_if_none_tpu all passed. However, multiple tests unrelated to that PR failed with logs such as Cannot replicate if number of devices (1) is different from 8 or errors about being unable to move the model to the device. All of the failed tests appear to fit models after the three PR-relevant tests have run, but the logs give little insight into why these unrelated tests fail.

To Reproduce

Submit a PR and run CI-TPU tests. Example failures are in https://app.circleci.com/pipelines/github/PyTorchLightning/pytorch-lightning?branch=pull%2F12151&filter=all

Expected behavior

  1. TPU test results should be deterministic; the resource issues need to be investigated.
  2. Provide clearer contributor guidelines on TPU testing, e.g. when @pl_multi_process_test should be used in TPU tests, how to optimize TPU test performance to avoid timeouts, and how to distinguish a timeout from the resource being taken away mid-run.
  3. Improve the TPU CI log messages.
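For context, the pattern behind a decorator like @pl_multi_process_test can be sketched as below. This is a simplified, hypothetical stand-in, not pytorch-lightning's actual implementation: run the test body in a child process so that a TPU runtime crash kills only the child, and map an abnormal exit to a distinguishable infrastructure error instead of a bare exit code 1.

```python
import functools
import multiprocessing

def multi_process_test(fn):
    """Simplified stand-in (assumption: NOT the real pl_multi_process_test).

    Runs fn in a forked child process; assumes a POSIX platform where the
    "fork" start method is available, so closures need no pickling.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        ctx = multiprocessing.get_context("fork")
        queue = ctx.Queue()

        def child(q):
            try:
                fn(*args, **kwargs)
                q.put(("passed", None))
            except Exception as exc:      # a genuine test failure
                q.put(("failed", repr(exc)))

        proc = ctx.Process(target=child, args=(queue,))
        proc.start()
        proc.join(timeout=120)
        if queue.empty():
            # Child died without reporting (e.g. TPU taken away, hard kill):
            # report as an infrastructure error, not a plain test failure.
            return ("infra-error", proc.exitcode)
        return queue.get()
    return wrapper

@multi_process_test
def my_tpu_test():
    assert 1 + 1 == 2

print(my_tpu_test())  # ('passed', None)
```

Isolating each test in its own process also explains why one crashed test can otherwise poison later tests in the same process, as seen in the unrelated failures above.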

cc @carmocca @akihironitta @Borda @kaushikb11 @rohitgr7

@kaushikb11 kaushikb11 self-assigned this Mar 15, 2022
@kaushikb11 kaushikb11 added this to the 1.7 milestone Mar 15, 2022
@kaushikb11 kaushikb11 added ci Continuous Integration accelerator: tpu Tensor Processing Unit tests labels Mar 15, 2022
@akihironitta
Contributor

Also, it would be very nice to see logs in real time.

Note that, currently, we can see logs from the TPU CI job only after it succeeds or fails (which usually takes 20+ minutes), since they are dumped at:
https://github.com/PyTorchLightning/pytorch-lightning/blob/8b4abe4edb6912abeb3906c48bb822a3681b08c4/.circleci/config.yml#L96-L99
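The behaviour described here, output visible only after the job finishes, is what happens when the CI captures the whole log and dumps it at the end. Streaming it line by line instead could be sketched like this (a hypothetical illustration; the command is a placeholder, not the actual CI invocation):

```python
import subprocess
import sys

# Placeholder command standing in for the TPU test run in the CI job.
cmd = [sys.executable, "-c", "print('test 1 passed'); print('test 2 passed')"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, text=True)
seen = []
for line in proc.stdout:        # yields lines as they are produced,
    print(line, end="")         # so they show up in real time
    seen.append(line.rstrip("\n"))
proc.wait()
```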

@carmocca carmocca removed this from the pl:1.7 milestone Jul 19, 2022
@carmocca
Contributor

Closing in favor of #13720

@carmocca carmocca closed this as not planned Jul 19, 2022