
[RFC] Improving continuous integration #4234

Closed
hcho3 opened this issue Mar 8, 2019 · 9 comments
hcho3 commented Mar 8, 2019

Now that we have two sponsors funding the continuous integration (CI) infrastructure (https://xgboost-ci.net), we should discuss ways to improve it.

Decouple builds from tests

Currently, we have a single Jenkins stage in which XGBoost is both built and tested. We should split this stage into two, one for builds and another for tests. The benefits of decoupling compilation from test runs are:

  • Eliminate redundant compilation: GPU code is slow to compile, and right now we compile XGBoost many times over. Instead, we could compile XGBoost only once per CUDA target.
  • Test cross-version CUDA support, e.g. whether an XGBoost package built with CUDA 8.x also runs on a machine with CUDA 10.x.
  • Save intermediate artifacts in an S3 bucket: if the tests pass, we can deploy the built artifacts immediately.
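A split along these lines could look roughly like the following declarative Jenkinsfile fragment. This is only a sketch: the stage names, agent labels, build script, and S3 bucket are hypothetical, not the project's actual configuration.

```groovy
pipeline {
  agent none
  stages {
    stage('Build CUDA 9.0') {
      agent { label 'linux && cpu' }            // hypothetical agent label
      steps {
        sh 'tests/ci_build/build_gpu.sh 9.0'    // hypothetical build script
        // Save the build output so the test stage does not recompile
        stash name: 'xgboost-cuda9.0', includes: 'build/**'
      }
    }
    stage('Test CUDA 9.0') {
      agent { label 'linux && gpu' }
      steps {
        unstash 'xgboost-cuda9.0'
        sh 'pytest tests/python-gpu'
        // If tests pass, copy the artifact to S3 for deployment
        // (bucket name is hypothetical):
        // sh 'aws s3 cp xgboost.whl s3://xgboost-ci-artifacts/'
      }
    }
  }
}
```

Note that `stash`/`unstash` only carries artifacts within one pipeline run; for the cross-version CUDA check, the test stage could simply run on an agent with a newer CUDA runtime than the one used to build.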

Add Windows target

Progress has been slow on this front. The main challenge is getting Jenkins to launch Windows workers and send them remote commands. We've run into issues compiling XGBoost on Windows a few times (#4139, #3869), so it would be nice to detect such problems early. In addition, we want to build Python wheels automatically (so far I've been building the Windows wheel manually).
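Once Windows workers are attached to Jenkins, the wheel build could hang off a dedicated stage. A sketch, assuming a `windows` agent label and illustrative CMake flags (neither is confirmed project configuration):

```groovy
stage('Build Python wheel (Windows)') {
  agent { label 'windows' }   // hypothetical agent label
  steps {
    // Build the native library with the MSVC generator,
    // then package the Python wheel from python-package/
    bat 'cmake .. -G "Visual Studio 14 2015 Win64" -DUSE_CUDA=ON'  // flags illustrative
    bat 'cmake --build . --config Release'
    bat 'cd python-package && python setup.py bdist_wheel'
    // Keep the wheel as a build artifact instead of uploading it by hand
    archiveArtifacts artifacts: 'python-package/dist/*.whl'
  }
}
```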

Migrate Python and Java tests to Jenkins

Regular performance tests

We should run a suite of performance tests on a regular basis (say, every two weeks). This way, we can detect performance regressions early.
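Jenkins can drive such a schedule with a cron trigger. A sketch, where the trigger spec, benchmark driver script, and output path are all assumptions for illustration:

```groovy
pipeline {
  agent { label 'linux && gpu' }   // hypothetical agent label
  triggers {
    // Run on roughly the 1st and 15th of each month (~every two weeks)
    cron('H H 1,15 * *')
  }
  stages {
    stage('Performance benchmarks') {
      steps {
        sh 'python tests/benchmark/benchmark.py'              // hypothetical driver
        // Archive timings so successive runs can be compared for regressions
        archiveArtifacts artifacts: 'benchmark_results/*.json' // hypothetical path
      }
    }
  }
}
```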

@dmlc/xgboost-committer


hcho3 commented Mar 11, 2019

ETA for the first pull request for this RFC: end of this week (March 15, 2019)

hcho3 commented Apr 3, 2019

It turns out that it is possible to compile CUDA code on machines without NVIDIA GPUs: nvcc only needs the CUDA toolkit installed at compile time, not a physical GPU device.

The benefit is that we can use more powerful CPU instances (e.g. c5d.18xlarge) to compile CUDA code faster.
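Concretely, the build stage could be pinned to a CPU-only agent while only the test stages require GPU agents. A sketch; the agent label and CMake flags are illustrative assumptions:

```groovy
stage('Build CUDA (CPU-only instance)') {
  // e.g. backed by c5d.18xlarge; the label itself is hypothetical
  agent { label 'linux && cpu-large' }
  steps {
    // nvcc compiles fine without a GPU present; flags are illustrative
    sh 'cmake .. -DUSE_CUDA=ON'
    sh 'make -j$(nproc)'
    // Hand the binaries off to a GPU agent for the actual test stage
    stash name: 'cuda-build', includes: 'build/**'
  }
}
```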

cc @trivialfis @RAMitchell

terrytangyuan (Member) commented

Definitely +1 for running performance tests regularly. Probably more frequent than two weeks though.


hcho3 commented Apr 4, 2019

@terrytangyuan Any suggestions for performance tests?

terrytangyuan (Member) commented

@terrytangyuan Any suggestions for performance tests?

Not off the top of my head. I've only done this for internal datasets before, but here we should probably pick some good public (Kaggle?) datasets. Note that performance tests cover both statistical and computational performance, so we may want to consider use cases from both perspectives.


hcho3 commented Apr 4, 2019

@terrytangyuan Thanks. Let me think this over during the weekend.

RAMitchell (Member) commented

Would something like this be okay for performance benchmarking? We have a more polished NVIDIA version that could be open-sourced. All of the dataset loading/processing is automatic, and it runs in a Docker container.

trivialfis (Member) commented

Preferably move R tests to Jenkins where we can cache all built dependencies. ;-)


hcho3 commented Apr 15, 2019

@trivialfis Yes, yes, yes! Docker is a great invention.
