Improve CI speed #7992

wangkuiyi · 2018-01-31T02:39:29Z

Our CI has been running slowly recently. Qing-qing, Yu Yang, Helin, Chen Xi, Ya-ming, Yi-bing, and I discussed this issue; here is what we learned and what we are going to do:

A. Reduce the number of SM architectures

We are building for many SM architectures in the CI: https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cuda.cmake.
According to Qing-qing's experiment in "[Speed up compiling]: reduce the NVCC compiling (some .cu operators can be compiled by G++)" #5491, nvcc runs faster if we generate code for fewer SM architectures.

Helin is going to configure the CI system to generate code for only one SM architecture when checking PRs, and for all SM architectures in the nightly build of the develop branch.
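A minimal sketch of the PR-versus-nightly split, assuming the build script passes `-gencode` flags to nvcc. The architecture lists and the names `ALL_SM_ARCHS`, `PR_SM_ARCHS`, and `archs_for_build` are hypothetical; the real list lives in cmake/cuda.cmake, and the single PR architecture would have to match the GPUs in the CI machines.

```python
# Hypothetical architecture lists, for illustration only.
# The real list is defined in cmake/cuda.cmake.
ALL_SM_ARCHS = ["30", "35", "50", "52", "60", "61", "70"]
PR_SM_ARCHS = ["60"]  # assumed to match the GPU in the CI machine


def archs_for_build(is_nightly):
    """Pick the SM architectures to compile for: all of them for the
    nightly build, a single one for PR checks."""
    return ALL_SM_ARCHS if is_nightly else PR_SM_ARCHS


def gencode_flags(sm_archs):
    """Turn a list of SM versions into nvcc -gencode flags."""
    return [f"-gencode arch=compute_{sm},code=sm_{sm}" for sm in sm_archs]


if __name__ == "__main__":
    print("PR build:     ", " ".join(gencode_flags(archs_for_build(False))))
    print("Nightly build:", " ".join(gencode_flags(archs_for_build(True))))
```

Each extra entry adds one more device-code generation pass per .cu file, which is why trimming the list for PR checks should shorten the nvcc stage.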

B. Migrate the CI system to two servers

We are running four TeamCity agents on four GPU desktops, each with one GPU and a desktop-level CPU (a few cores). We have two idle servers, each with 6 GPUs and a powerful CPU with 56 cores.

Helin will migrate the CI system to the servers.

C. Distribute unit tests to multiple GPUs

Our CI system runs unit tests by calling ctest -j N, where N is the number of processes that run unit tests in parallel. However, all these N processes are using the same GPU.

Qing-qing is going to study whether we can make cmake/ctest use more than one GPU.
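One possible approach, sketched under assumptions: give each parallel test process its own `CUDA_VISIBLE_DEVICES` value, assigned round-robin. The helper names `env_for_worker` and `assign_tests` are hypothetical, and this is not necessarily how the change would be made in cmake/ctest.

```python
import os


def env_for_worker(worker_index, num_gpus, base_env=None):
    """Return an environment for one test process that exposes a single GPU,
    chosen round-robin from num_gpus devices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(worker_index % num_gpus)
    return env


def assign_tests(test_names, num_gpus):
    """Spread tests over GPUs round-robin; returns {gpu_id: [test names]}."""
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for i, name in enumerate(test_names):
        assignment[i % num_gpus].append(name)
    return assignment


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    tests = ["test_a", "test_b", "test_c", "test_d", "test_e"]
    for gpu, names in assign_tests(tests, num_gpus=2).items():
        print(f"GPU {gpu}: {names}")
```

Round-robin ignores how long each test takes, so a few slow tests can still leave some GPUs idle; it is only a starting point.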

D. Add an environment variable to distinguish unit tests from regression tests

Currently, both unit tests and regression tests run on the CI server for every PR. They should be distinguished: only unit tests should run for every PR, while nightly builds should run all tests. We should add an environment variable to control this.
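A minimal sketch of such a switch. The variable name `RUN_REGRESSION_TESTS` and the function `should_run` are hypothetical; nothing in this issue fixes the actual name or mechanism.

```python
import os


def should_run(test_kind, env=None):
    """Decide whether a test should run in the current CI job.

    Unit tests always run. Regression tests run only when the
    (hypothetical) RUN_REGRESSION_TESTS variable is set to "1",
    which the nightly build would do.
    """
    env = os.environ if env is None else env
    if test_kind == "unit":
        return True
    return env.get("RUN_REGRESSION_TESTS", "0") == "1"


if __name__ == "__main__":
    for kind in ("unit", "regression"):
        print(kind, "->", "run" if should_run(kind) else "skip")
```

Defaulting to "skip" keeps PR checks fast even if a job forgets to set the variable; only the nightly job has to opt in.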


putcn · 2018-01-31T04:46:05Z

Action item B is done: 198 and 199 have been added to the CI pool.

Yancey1989 · 2018-02-08T05:35:07Z

After discussing with @dzhwinter, we have another simple idea.
We can cache the third-party dependencies in a Docker image based on paddle:latest-dev, so that we don't need to rebuild them for each PR.

Maybe the steps are as follows:

Check out the code and check whether the cmake files under cmake/external or the Dockerfile in the root folder have changed. If so:
1. Rebuild a Docker image named paddle:teamcity that contains only the third-party dependencies.
2. Push paddle:teamcity to Docker Hub.
Then build and run all the unit tests with paddle:teamcity.
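The change check in the first step could look roughly like this. It is a sketch that assumes the CI job already has the list of files changed by the PR; the function name `needs_thirdparty_rebuild` is hypothetical.

```python
def needs_thirdparty_rebuild(changed_files):
    """Return True if any changed path affects the cached third-party image:
    the root Dockerfile or anything under cmake/external/."""
    for path in changed_files:
        if path == "Dockerfile" or path.startswith("cmake/external/"):
            return True
    return False


if __name__ == "__main__":
    # Paths are relative to the repository root.
    changed = ["Dockerfile", "README.md"]
    if needs_thirdparty_rebuild(changed):
        print("rebuild and push paddle:teamcity, then run the tests")
    else:
        print("reuse the existing paddle:teamcity image")
```

When the check returns False, the job can skip straight to building and testing with the existing paddle:teamcity image.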

qingqing01 mentioned this issue Jan 31, 2018

Make parallel tests bind to different GPU. #8010

Merged

helinwang closed this as completed in #8010 Feb 1, 2018
