Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix hanging trainings #132

Merged
merged 2 commits into from
Jan 27, 2020
Merged

fix hanging trainings #132

merged 2 commits into from
Jan 27, 2020

Conversation

CodingCat
Copy link
Member

we observed infinitely hanging trainings with low probability when training a single model in Spark app and high probability when training multiple models

basically the tracker cannot figure out the topology of workers in the case of hanging

@CodingCat CodingCat requested a review from trivialfis January 7, 2020 14:16
@CodingCat
Copy link
Member Author

there are many changes not fully validated in the master branch, we are syncing with upstream (since 0.82) very very slowly...expecting to see more issues

@CodingCat
Copy link
Member Author

@trivialfis looks like some CI env problem? python3 not found?

@hcho3
Copy link
Contributor

hcho3 commented Jan 7, 2020

I'll take a look. I've seen the same problem in the CI for dmlc-core

@hcho3
Copy link
Contributor

hcho3 commented Jan 7, 2020

See dmlc/dmlc-core#589

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what's happening here. Basically to me it's: Some code is removed, some debug messages are changed and not using tracker to print.

It would be nice if you have tests to specify expected behavior.

@CodingCat
Copy link
Member Author

I don't understand what's happening here. Basically to me it's: Some code is removed, some debug messages are changed and not using tracker to print.

It would be nice if you have tests to specify expected behavior.

basically I am reverting the change at e3d51d3#diff-837b172455c2d356f95a6d61ab505595R277

cc: @chenqin

@trivialfis
Copy link
Member

trivialfis commented Jan 8, 2020

Also, from your description:

basically the tracker cannot figure out the topology of workers in the case of hanging

The changes here doesn't seen to be related to tracker. I would really appreciate a small test that shows the expected behavior, maybe just a unittest in single node env. Any code without unittest is considered legacy.

@CodingCat
Copy link
Member Author

Also, from your description:

basically the tracker cannot figure out the topology of workers in the case of hanging

The changes here does seen to be related to tracker. I would really appreciate a small test that shows the expected behavior, maybe just a unittest in single node env. Any code without unittest is considered legacy.

no, the changes I reverted brings wrong information about worker's identity, that makes tracker fall into a dead loop at https://github.com/dmlc/dmlc-core/blob/master/tracker/dmlc_tracker/tracker.py#L105

I think the problem is that the changes at e3d51d3#diff-837b172455c2d356f95a6d61ab505595R277 claims it is to enable all tests according to the titlein PR, however, it turns out the reverting of the changes doesn't impact anything , tests are still passing

I don't have bandwidth to dig into more about why the original changes are messing up port (most suspicious entity) , maybe it is as simple as some value initialization problem especially for multi model trains...

I just put it here as you asked me what the rabit issue we are working on...this is the one we found some time ago

@chenqin
Copy link
Contributor

chenqin commented Jan 8, 2020

The description was not very clear how this might fix the problem.
Previous code change related to the weird issue socket time_wait issue I observed while reconnectlink method demonstrated from time to time.

If the author is confident this is the fix, please go ahead and merge.

@CodingCat
Copy link
Member Author

CodingCat commented Jan 8, 2020

The description was not very clear how this might fix the problem.
Previous code change related to the weird issue socket time_wait issue I observed while reconnectlink method demonstrated from time to time.

If the author is confident this is the fix, please go ahead and merge.

I think there are two questions

(1) why in 0.8x we never see this problem?

my answer is 0.8x rabit dependency does not have your change

(2) why I reverted your change nothing related the weird issue socket time_wait issue I observed while reconnectlink method demonstrated from time to time. as you claimed is showing up? <= this is the question for you

@CodingCat
Copy link
Member Author

CodingCat commented Jan 8, 2020

The description was not very clear how this might fix the problem.

As I said, I did not go into details, the fixing logic here is simple, if any new bugs shows up, just revert anything new and suspicious

@trivialfis
Copy link
Member

Em, so no one can reason about the code?

@CodingCat
Copy link
Member Author

Em, so no one can reason about the code?

we have tried the fixed version in Dec, and everything goes fine

when I try to debug this issue, I have found many changes which are lack of explanation of why and no verification in master branch. We should higher the bar of accepting code in Rabit

at this point, my personal strategy to move things forward is, " if any new bugs shows up, just revert anything new and suspicious/proved to be problematic" (imagine your service down after a new deployment, you will roll back the deployment or spend days to debug?)

@RAMitchell
Copy link
Member

I think you need a test at the very least. Otherwise we are just reverting progress of others and we can't move forward. If there is a test you can revert it and then @chenqin has a chance to modify his code to find the root cause.

@trivialfis
Copy link
Member

@chenqin Would you please help adding a test or explanation for the removed code? I just wanna get the code base a little bit trust worthy.

@chenqin
Copy link
Contributor

chenqin commented Jan 9, 2020

@chenqin Would you please help adding a test or explanation for the removed code? I just wanna get the code base a little bit trust worthy.

After what @CodingCat more detailed explanation, I felt he might be right on finding problematic PR. But yeah, I can work on getting a test runs with help from @CodingCat on how to run multiple models with xgb-spark.

@CodingCat
Copy link
Member Author

I think you need a test at the very least. Otherwise we are just reverting progress of others and we can't move forward. If there is a test you can revert it and then @chenqin has a chance to modify his code to find the root cause.

the removed code was added to "enable test" now it turns out that without the added code, the tests (including the ones added after this PR) can still pass , shall the thing still be there? and I don't take bugs as progress honestly

for now the most important progress is releasing a better 1.0, this is a bug blocking the release IMO...

for anyone wants to try, use https://github.com/CodingCat/xgboost4j-spark-scalability/blob/master/src/main/scala/me/codingcat/xgboost4j/PureXGBoostLearner.scala to run with 0.9, and set repeated to be, say 50, you will see it almost stuck at somewhere every time

@CodingCat
Copy link
Member Author

CodingCat commented Jan 9, 2020 via email

@RAMitchell
Copy link
Member

Seems reasonable to get this merged for 1.0

@trivialfis
Copy link
Member

trivialfis commented Jan 20, 2020

Spent some time on understanding the code base of rabit. It would be nice to have written form of communication protocol between tracker and workers, also recovery consensus. I had to match those function calls one by one ...

Out of curious, @chenqin those code used for "smoothing the tracker", is the root problem in network bandwidth in hardware, or in the single threaded implementation of tracker?

@chenqin
Copy link
Contributor

chenqin commented Jan 20, 2020

Spent some time on understanding the code base of rabit. It would be nice to have to written form of communication protocol between tracker and workers, also recovery consensus. I had to match those function calls one by one ...

Out of curious, @chenqin those code used for "smoothing the tracker", is the root problem in network bandwidth in hardware, or in the single threaded implementation of tracker?

@trivialfis I have been rushing for planing and submission deadline. Will take a time to root cause bug raised in this issue latter. For the smooth traffic part, this is a fix to resolve rabit "hang" issue caused by combination of c++ assert and kill spark executor (error message unable to connect to tracker after one worker dead) as well as rabit internal reset and reconnect to tracker if any worker lost connection to tracker.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock the next release for XGBoost. We may want to consider JSON RPC for rabit in the future.

caused by combination of c++ assert and kill spark executor

There might be a better way to resolve this.

@trivialfis
Copy link
Member

@CodingCat @chenqin @RAMitchell @hcho3 Is there anything you want in this PR, or rabit in general for the next XGBoost release?

@hcho3
Copy link
Contributor

hcho3 commented Jan 27, 2020

I have no objection for this PR to be merged.

@CodingCat
Copy link
Member Author

thanks folks!

@CodingCat CodingCat merged commit 6e56395 into master Jan 27, 2020
@CodingCat CodingCat deleted the fix_hanging branch January 27, 2020 17:12
@hcho3 hcho3 mentioned this pull request Jan 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants