Replace select() with poll() in SelectHandler #57

Closed
wants to merge 1 commit into from

keanpantraw

I use xgboost4j-spark to train several XGBoost models sequentially, one after another, using 2 workers with 4 threads each.
I found that after training 5-7 models the process aborts with the error described in dmlc/xgboost#2736.
This is because SelectHandler uses select() to watch open sockets.

When launching multiple distributed XGBoost jobs within
one Spark job, it's pretty easy to run beyond FD_SETSIZE=1024
on Linux, because fd numbers keep increasing and they are consumed by regular file descriptors too, not just sockets.

@nitinkak001

I am getting this error while running XGBoost (0.72) through the pyspark (2.3) shell. It happens when I run the job twice: the first time it succeeds, and the second time the same job on the same data fails. I have tested this around 5 times now, and it is only the second run that fails. Has there been any progress on this issue?

CodingCat pushed a commit that referenced this pull request Oct 22, 2018
* fix error in #57, clean up comments and naming

* include missing packages, disable recovery tests for now

* disable local_recover tests until we have a bug fix

* support larger cluster

* fix lint, merge with master
@CodingCat
Member

Closing due to the merge of #73.

@CodingCat CodingCat closed this Oct 22, 2018
@chenqin
Contributor

chenqin commented Oct 22, 2018

Thanks for your contribution, @frenzykryger! Your PR has been very helpful.

CodingCat pushed a commit that referenced this pull request Oct 26, 2018
* fix error in #57, clean up comments and naming

* include missing packages, disable recovery tests for now

* disable local_recover tests until we have a bug fix

* support larger cluster

* fix lint, merge with master

* fix mac osx test failure in dmlc/xgboost#3818

* Update allreduce_robust.cc