Fix quantile tests running on multi-gpus #8775

Merged: 8 commits into dmlc:master from fix-mgpu-quantile-tests, Feb 13, 2023


rongou (Contributor) commented Feb 9, 2023

As reported in #8710 (comment).

Closes #8782.

rongou (Contributor Author) commented Feb 9, 2023

@trivialfis @hcho3 I'm surprised these tests are not run in the mgpu buildkite pipeline. Do we need to enable them?

hcho3 (Collaborator) commented Feb 9, 2023

We run all gtests using a single GPU only. This was done to save CI cost.

rongou (Contributor Author) commented Feb 9, 2023

Ah I see. So what does the multi-gpu pipeline do? :)

hcho3 (Collaborator) commented Feb 9, 2023

@rongou It runs the Python tests that carry mgpu markers.

rongou (Contributor Author) commented Feb 10, 2023

Can we also run the multi-GPU C++ tests there? For example, we could name the two tests here something like MGPUQuantile, and then it's just a matter of running testxgboost --gtest_filter=MGPU*.
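To illustrate how a naming prefix would let one filter pick out only the multi-GPU tests, here is a rough Python sketch of the selection logic. It only approximates GoogleTest's filter syntax (":"-separated positive patterns, optionally followed by "-" and negative patterns) using fnmatch, and the test names are hypothetical, not the actual names in the XGBoost test suite.

```python
from fnmatch import fnmatchcase


def select_tests(full_names, gtest_filter):
    """Approximate --gtest_filter matching on 'Suite.Test' names.

    Everything before the first '-' is a ':'-separated list of positive
    patterns; everything after it is a list of negative patterns.
    fnmatch also accepts [seq] classes, which GoogleTest does not, so
    this is an approximation only.
    """
    positive, _, negative = gtest_filter.partition("-")
    pos = [p for p in positive.split(":") if p] or ["*"]
    neg = [p for p in negative.split(":") if p]
    return [
        name
        for name in full_names
        if any(fnmatchcase(name, p) for p in pos)
        and not any(fnmatchcase(name, p) for p in neg)
    ]


# Hypothetical test names, assuming the two multi-GPU tests were renamed.
names = [
    "MGPUQuantile.AllReduceBasic",
    "MGPUQuantile.SameOnAllWorkers",
    "GPUQuantile.Basic",
    "Quantile.Basic",
]

selected = select_tests(names, "MGPU*")
print(selected)
```

With this convention the single-GPU job could run with a negative filter such as -MGPU* so the two pipelines do not overlap.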

hcho3 (Collaborator) commented Feb 10, 2023

@rongou That's one option. Alternatively, we can create a separate gtest binary for the multi-GPU tests.

trivialfis (Member) commented

The C++ multi-GPU tests were removed once we had the dask tests. However, it seems we might need to bring them back, since some of them are being moved back from dask to C++ with the development of federated learning.

trivialfis (Member) commented

Opened an issue for tracking: #8782.

hcho3 (Collaborator) commented Feb 10, 2023

See my latest commit. It runs the quantile tests using 4 GPUs.

trivialfis (Member) left a comment

Thank you for the quick CI update, @hcho3.

@rongou Could you please fix the error on the new CI test?

rongou (Contributor Author) commented Feb 10, 2023

@hcho3 Is this a transient error, or is something wrong with the configuration?

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'xgb-ci.gpu11.0' already exists in the registry with id '492475357299'

hcho3 (Collaborator) commented Feb 10, 2023

@rongou The error is expected. It means that the Docker cache already has the Docker image for this CI pipeline. Can you take a look and find out why the gtest binary is crashing?

rongou (Contributor Author) commented Feb 10, 2023

@hcho3 Is there any way to get more information from the run? Right now it just says:

Error: The command exited with status 135

I can't reproduce this on my local machine with 2 GPUs.
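One hint that can be read from the status code alone: shells conventionally report 128 + N when a process is terminated by signal N. Assuming the CI runner follows that convention, 135 corresponds to signal 7, which is SIGBUS on Linux x86-64. The helper below is a hypothetical sketch for decoding such a status; it says nothing about what actually triggered the signal in this run.

```python
import signal


def describe_exit_status(status):
    """Decode a shell-style exit status (128 + N means 'killed by signal N')."""
    if status > 128:
        signum = status - 128
        try:
            # Signal numbering is platform-specific; this uses the local table.
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with code {status}"


# On Linux x86-64, signal 7 is SIGBUS.
print(describe_exit_status(135))
```

A bus error usually points at a bad memory access or a failed memory mapping rather than an ordinary test assertion failure, which would fit a crash that only appears in the CI environment.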

hcho3 (Collaborator) commented Feb 10, 2023

Is it using NCCL?

rongou (Contributor Author) commented Feb 10, 2023

Yes.

hcho3 (Collaborator) commented Feb 10, 2023

I'll try to debug on my end.

@hcho3 hcho3 force-pushed the fix-mgpu-quantile-tests branch from 87d06c5 to 4e9e6c5 Compare February 12, 2023 06:28
@hcho3 hcho3 merged commit ed91e77 into dmlc:master Feb 13, 2023
@rongou rongou deleted the fix-mgpu-quantile-tests branch September 25, 2023 16:42
Development

Successfully merging this pull request may close these issues.

Multi-GPU for CPP test.
3 participants