[rabit improvement] support rabit worker set/get configs from tracker #94

chenqin · 2019-06-13T03:09:59Z

Native rabit checkpoint restore was failing in XGBoost.

To keep a restarted worker from running allreduce before it loads its checkpoint, we plan to save these configs in dmlc-tracker when the training job first starts (e.g., the number of columns of the partitioned training data set).

If a worker fails, instead of calling allreduce and breaking rabit's recovery assumption, it can fetch the configs from the tracker. This allows the checkpoint to load correctly and training to resume at the right iteration number.
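A rough sketch of the intended flow, not the code in this PR: `TrackerConfigStore`, `resolve_num_columns`, the `num_columns` key, and the `allreduce_max` callback are all hypothetical names made up for illustration, and the tracker is modeled as an in-process dictionary rather than a real dmlc-tracker connection. It also assumes the column count is agreed on via a max-style allreduce.

```python
# Hypothetical sketch: none of these names come from rabit or dmlc-tracker.
from typing import Callable, Dict, Optional


class TrackerConfigStore:
    """Stand-in for the tracker: a key/value store that outlives workers."""

    def __init__(self) -> None:
        self._configs: Dict[str, str] = {}

    def set_config(self, key: str, value: str) -> None:
        self._configs[key] = value

    def get_config(self, key: str) -> Optional[str]:
        return self._configs.get(key)


def resolve_num_columns(
    tracker: TrackerConfigStore,
    local_columns: int,
    allreduce_max: Callable[[int], int],
) -> int:
    """Return the global column count, preferring the value saved on the tracker."""
    cached = tracker.get_config("num_columns")
    if cached is not None:
        # Restarted worker: skip the collective call so that nothing runs
        # before the checkpoint is loaded.
        return int(cached)
    # First start: all workers take part in the allreduce, then save the result.
    total = allreduce_max(local_columns)
    tracker.set_config("num_columns", str(total))
    return total
```

On first start the worker goes through the collective call and saves the result; after a restart the same function returns the saved value without touching allreduce.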

If the tracker dies, the training job dies anyway. We might leverage Spark HDFS checkpoints and recover the entire cluster from there.

More detail here
dmlc/xgboost#4250 (comment)


Chen Qin and others added 10 commits June 10, 2019 14:00

support run rabit tests as xgboost subproject using xgboost/dmlc-core

ed06620

support tracker config set/get

dddcac7

remove redudant printf

2fac91b

remove redudant printf

f5a9727

Merge branch 'master' of https://github.com/chenqin/rabit

acc011d

add c++0x declaration

e391238

log allreduce/broadcast caller, engine should track caller stack for investigation

a9d7331

tracker support binary config format

2a28e5e

Revert "tracker support binary config format"

2c322f3

This reverts commit 2a28e5e.

remove caller, prototype fetch allreduce/broadcast results from resbuf

0249ae6

chenqin closed this Jun 21, 2019
