Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tests][dask] fix workers without data test (fixes #5537) #5544

Merged
merged 3 commits into from
Nov 21, 2022

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Oct 16, 2022

Changes tests/python_package_test/test_dask.py::test_training_succeeds_even_if_some_workers_do_not_have_any_data to have three workers and data in only two of them.

Fixes #5537.

@jmoralez jmoralez changed the title WIP: [tests][dask] fix workers without data test [tests][dask] fix workers without data test (fixes #5537) Oct 16, 2022
@jmoralez jmoralez marked this pull request as ready for review October 16, 2022 04:20
@jmoralez jmoralez requested a review from jameslamb as a code owner October 16, 2022 04:20
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Before I review, could you please explain, either in the description or in a comment here, what the root cause of the timeouts observed in #5537 were and how this fixes it?

The diff has a lot more changes in it than I'd expected from the description "have three workers and data in only two of them". For example, it's not obvious to me how that type of change would require bringing in new imports from sklearn.metrics.

I'm especially interested in the change where we're now passing an explicit list of workers= to client.persist(). Was the root problem just that "collection_to_single_partition() was resulting in, for example, dX being all on the first worker and dy being all on the second worker"? If so, couldn't the same effect be accomplished with the existing 2-worker cluster?

@jmoralez
Copy link
Collaborator Author

jmoralez commented Nov 9, 2022

what the root cause of the timeouts observed in #5537 were

I couldn't reproduce the timeouts locally.

how this fixes it

The purpose of the test was to check that in a distributed training setting, having a worker without data wouldn't cause a problem, that's why I changed it to having three workers and only two with data, so distributed training is performed (because of the two workers having data), whereas previously it was probably falling back to just local training, and the definition of "succeeds" is that we achieve a good metric value (hence the sklearn.metrics import).

@jameslamb jameslamb self-requested a review November 10, 2022 03:25
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously it was probably falling back to just local training

OHHHHHHHHHHHH. Thank you so much for that explanation. Makes sense to me!

Let's merge this then 😁

@jameslamb
Copy link
Collaborator

Will merge this whenever we get CI working again 😭

Hopefully if/when #5514 is merged.

@jameslamb
Copy link
Collaborator

thanks so much for fixing this @jmoralez !

@jameslamb jameslamb merged commit 93f2da4 into master Nov 21, 2022
@jameslamb jameslamb deleted the test-worker-without-data branch November 21, 2022 14:27
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ci] [dask] fix test on "only one worker has data" training
2 participants