[jvm-packages] Checkpointing performance issue in XGBoost4J-Spark #3946
IMO, there is an unnecessary reshuffle after each checkpoint. Is there any info around memory utilization?
@cq not true, the extra shuffle happens every 100 iterations in this case.
@chenqin and @CodingCat, do you need any other "benchmark"?
Okay, I think I can add some info to this issue from my side, with a different experimentation dataset: 2B rows, 26 features (including some numerical ones), doing regression with max depth 20 trees and an MAE evaluator, checkpointing every 10 iterations. [2] shows the per-iteration time slowing from around 6 minutes to 13 minutes. What's also interesting is that at the same time the driver-side memory pressure keeps increasing with every checkpoint (green line in [1]).
Hello, has this question raised any interest? We are willing to investigate and contribute, and for that wanted to check first that no update/fix already exists. Is anyone watching/involved?
@CodingCat Is anyone in your org looking at this issue?
No, please go ahead, we do not have the bandwidth to take care of it for now.
I have started investigating the issue, and it looks like I found one possible cause of it: the prediction caches are not initialized after the first checkpoint.
@bourbaki thanks for the analysis, and I agree with it. Regarding how to resolve it, I think those two solutions have their own problems, as you said. The ideal fix should be more fundamental (changing how we do checkpointing and adding more native APIs). I have some ideas and am about to start my work, but I just noticed what you added. Please let me know your proposal before you submit the code to avoid wasted effort; I am actually in a hurry to get this fixed due to an internally raised request.
@bourbaki if you are dedicated to this, I am more than happy to assign this to you and work with you to get it fixed and merged.
@CodingCat Yes. I would be glad to help you with it. I agree with all of the objections :). What approach would you suggest?
@bourbaki sorry for the late reply, I have been really busy these days. My suggestion would be:

2.1 save the prediction results with Spark
2.2 expose an API to directly call the Init function in `Predictor`
2.3 when starting training from a checkpoint, load back the data saved in 2.1 (maybe directly through the HDFS API) and use an adapted version of `PredLoopSpecialize` to update the cache with the loaded Dataset

The most important thing we need to be careful about is guaranteeing deterministic partitioning. What do you think?
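To make 2.1 and 2.3 a bit more concrete, here is a rough Spark-side sketch of one possible reading of the proposal; the function names, the partition-keyed text format, and the choice of `saveAsTextFile` are illustrative assumptions only, and the native call from step 2.2 is omitted because it does not exist yet:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// 2.1: at each checkpoint, persist the current per-row margins keyed by
// partition id, so the layout stays deterministic when training resumes.
def savePredictions(preds: RDD[(Int, Array[Float])], path: String): Unit =
  preds.map { case (part, margins) => s"$part\t${margins.mkString(",")}" }
       .saveAsTextFile(path)

// 2.3: when resuming from a checkpoint, load the margins saved in 2.1; each
// worker would then push its partition's margins into the predictor cache
// through the new native API from step 2.2 (not implemented yet, so not shown).
def loadPredictions(sc: SparkContext, path: String): RDD[(Int, Array[Float])] =
  sc.textFile(path).map { line =>
    val Array(part, margins) = line.split("\t", 2)
    (part.toInt, margins.split(",").map(_.toFloat))
  }
```

The key constraint, as noted above, is that the partitioning must be deterministic so that the loaded margins line up with the same rows on each worker after the restart.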
It sounds good. But there is one thing that bothers me with the current design. When training goes according to plan, we could only copy the booster to the driver and then restart the training from the cached state.
Yes, but in that way we need some way to pass a message from the driver to the executors saying "please move on", which leads to more complexity in the messaging system for training. If we want to pursue that approach, we need to extend the Rabit layer to allow the tracker and workers to communicate more than "start/stop/I died/please reborn", etc.
But we still need to save the prediction results to allow training to recover from a checkpoint, which does not reduce the complexity I talked about earlier.
@bourbaki are you working on this?
ping @bourbaki
Sorry for the late reply. I have been busy for the last couple of weeks. I have started to work on it and am planning to spend some time on it this week.
@bourbaki are you still working on this? If you are too busy, I can take it over.
@CodingCat Sorry, I am too busy right now. You can take it over.
Hello @CodingCat. We allocated significant time in July-August to fix the checkpointing performance issue in XGBoost4J-Spark (it is a blocker for us). So I am wondering what the current status of your work is, and whether you are preparing any pull requests now or whether I could take this over. Thank you.
I am currently working on it.
@CodingCat could you please share the status of your work and confirm the approach you decided to take? This issue blocks us (Criteo). We have allocated resources to fix it. Let us know how we can contribute, or how you can hand it over if you have higher priorities.
It's still WIP on my side (half finished). I will share the design in the form of an RFC within this month (actually I have already shared some of it in this thread) and the code later.
Update: basically, I have run the prototype fix with our production data and everything looks fine. Now it comes down to the cost of the fix. The fix under test takes the path of rebuilding the prediction cache for each checkpoint interval, and it still slows down the first iteration of each interval. However, a more fundamental fix requires adding or changing APIs, and the added API might only be consumed by the jvm package, so we need to evaluate whether it is worth that (or whether we have a better design option). Let me start a new design discussion thread on this topic later.
Thank you for the update. I am glad you have a fix, and from your description I think I have a similar fix being tested on a production dataset too. I am looking forward to you sharing notes/patches so we can compare notes. By the way, my fix is published as a branch in our fork. I can link it here if you are interested.
@CodingCat do you have any updates on your side? I want to prepare an RFC with my ideas for potential fixes. What do you think?
I just noticed a weird performance issue when using checkpoints in the Spark wrapper for XGBoost.
Before the first checkpoint, the learning time is constant per tree, but after the first checkpoint, the time needed to learn a new tree is much higher, and it grows with the number of trees.
Here is the parameter map used by XGBoost:
Map(alpha -> 0.0, min_child_weight -> 1.0, sample_type -> uniform, base_score -> 0.5, colsample_bylevel -> 1.0, grow_policy -> depthwise, skip_drop -> 0.0, lambda_bias -> 0.0, silent -> 0, scale_pos_weight -> 1.0, seed -> 0, features_col -> features, num_early_stopping_rounds -> 0, label_col -> label, num_workers -> 5, subsample -> 1.0, lambda -> 1.0, max_depth -> 5, probability_col -> probability, raw_prediction_col -> rawPrediction, tree_limit -> 0, custom_eval -> null, rate_drop -> 0.0, max_bin -> 16, train_test_ratio -> 1.0, use_external_memory -> false, objective -> binary:logistic, eval_metric -> logloss, num_round -> 500, timeout_request_workers -> 1800000, missing -> NaN, checkpoint_path -> viewfs://root/user/XXX/YYYY/checkpoint, tracker_conf -> TrackerConf(0,python), tree_method -> auto, max_delta_step -> 0.0, eta -> 0.3, colsample_bytree -> 1.0, normalize_type -> tree, custom_obj -> null, gamma -> 0.0, sketch_eps -> 0.03, nthread -> 4, prediction_col -> prediction, checkpoint_interval -> 100)
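For context, a minimal sketch of how the checkpoint-related parameters in this map would be passed to the XGBoost4J-Spark estimator; only a subset of the parameters is spelled out, and the training DataFrame name is an assumption:

```scala
import org.apache.spark.sql.DataFrame
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Sketch: only the parameters relevant to checkpointing are shown here;
// the full map used for the runs above also sets depth, sampling, threads, etc.
def buildClassifier(): XGBoostClassifier =
  new XGBoostClassifier(Map(
    "objective"           -> "binary:logistic",
    "eval_metric"         -> "logloss",
    "num_round"           -> 500,
    "num_workers"         -> 5,
    "checkpoint_path"     -> "viewfs://root/user/XXX/YYYY/checkpoint",
    "checkpoint_interval" -> 100 // a checkpoint (and the observed slowdown) every 100 trees
  ))

// trainingDF is assumed to have "features" and "label" columns, matching
// features_col / label_col in the parameter map above.
def train(trainingDF: DataFrame) = buildClassifier().fit(trainingDF)
```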
Attached is a small graph showing this issue on the same dataset with different checkpointing intervals. Please disregard the discrepancy between the two runs, as they might not have been launched with the same multithreading parameters.
A quick profiling session (using async-profiler) before and after the checkpoint might give a hint to the problem. Before the checkpoint, I see many calls to `tree::CQHistMaker` or `tree::GlobalProposalHistMaker`, but after the checkpoint, the C++ calls are all to `CPUPredictor::PredLoopSpecialize`. I do not know whether this means that the method used for learning is not the same after the checkpoint, or that the time needed to evaluate the trees is so long that the profiler only sees that. I could upload some flamegraphs, but I would first need to make sure they are actually representative of what is going on.