Revert #4529 #5008
Conversation
This reverts commit 4d6590b.
Please keep this branch open. We will work on it together. :-)
@trivialfis Sure, I'll keep the branch. For now, can you review this PR as it is? The criterion should be whether the codebase after the revert is legible or not. (Can you understand all parts of the updater? Is it clear what each function does? Is it easy to understand which objects are being modified? etc.) Once the updater becomes legible and clear, then we can start adding improvements.
Yup. Will do a full verification on whether there are implicit conflicts and regressions, with proper tests. I want to make the whole XGBoost code base as robust as possible.
I realize the importance of legibility and organization. If developers cannot understand the code, they cannot improve it or debug it. Some ideas:
I realize that there is always a subjective element when it comes to legibility and maintainability. However, having a good list of guidelines would help.
Shall we consider putting this in https://xgboost.readthedocs.io/en/latest/contrib/index.html?
Working on this. |
With a small number of threads, the reverted PR boosted performance by close to 2x, so this revert will roughly double the training time on many datasets (tested with dense data, not sparse). After some thought, I still believe reverting is the right thing to do.
For future reference, the optimizations applied in the original PR are primarily three-fold:
- Block based parallelism. This way the granularity can be fine tuned instead of fixing on rows.
- Node based parallelism. One more level of parallelism. Not sure if this is really helpful.
- Preallocating memory for partitioning. Memory usage effectively goes up, but performance is also better because we no longer search the heap for a memory block.
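The first and third points above can be sketched roughly as follows. This is a hypothetical illustration, not XGBoost's actual API: the names `RowBlock`, `MakeBlocks`, and `PartitionBuffer` are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Block-based parallelism: split the row range into fixed-size blocks so the
// work granularity can be tuned, instead of always parallelizing per row.
struct RowBlock {
  std::size_t begin;
  std::size_t end;  // half-open interval [begin, end)
};

std::vector<RowBlock> MakeBlocks(std::size_t n_rows, std::size_t block_size) {
  std::vector<RowBlock> blocks;
  for (std::size_t b = 0; b < n_rows; b += block_size) {
    blocks.push_back({b, std::min(b + block_size, n_rows)});
  }
  return blocks;
}

// Preallocated partition buffers: the left/right index buffers are sized once
// up front, so the hot partitioning loop performs no per-call heap allocation.
struct PartitionBuffer {
  std::vector<unsigned> left, right;
  explicit PartitionBuffer(std::size_t n_rows) : left(n_rows), right(n_rows) {}
};
```

With a tunable block size, a scheduler can trade off load balance against per-task overhead, which is the point of moving away from fixed per-row granularity.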
So if it's all good stuff, why do we want to revert it? Mostly because the optimization PR tried to prepare for a task-based execution model, which can exploit even more parallelism given the right scheduler. But with blocks + explicit closures (named tasks in the original code) + many copied-and-pasted code blocks, the code became messy and difficult to read. The graph-based computation is not that difficult to implement if we have something similar to TBB or CUDA Graphs.
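As a toy sketch of the graph-based execution model mentioned above (in the spirit of a TBB flow graph or CUDA Graph, but not using any real TBB API): tasks declare their dependencies, and a scheduler runs each task once all of its predecessors have finished. This version executes sequentially in topological order; a real scheduler would run independent tasks in parallel.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A task wraps a callable plus the indices of the tasks it depends on.
struct Task {
  std::function<void()> fn;
  std::vector<std::size_t> deps;  // indices of prerequisite tasks
};

// Repeatedly sweep the task list, running any task whose dependencies are all
// done, until no further progress can be made (i.e. the graph is exhausted).
void RunGraph(std::vector<Task>& tasks) {
  std::vector<bool> done(tasks.size(), false);
  bool progress = true;
  while (progress) {
    progress = false;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
      if (done[i]) continue;
      bool ready = true;
      for (std::size_t d : tasks[i].deps) ready = ready && done[d];
      if (ready) {
        tasks[i].fn();
        done[i] = true;
        progress = true;
      }
    }
  }
}
```

The value of the abstraction is that dependency bookkeeping lives in one scheduler rather than being hand-coded into every updater function, which is what made the reverted code hard to follow.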
@trivialfis Should we consider using TBB?
@hcho3 Sorry, but I don't think I will spend a lot of time trying to optimize the CPU hist beyond usual maintenance, as my priority is GPU. Sometimes I might create PRs in my spare time for possible extended algorithms that I'm interested in (like the unifying leaf index PR). Hence the above note for future contributors (or our future selves). I hope you can understand. But to answer your question, this requires experiments. If we can restore the performance from before this PR with clean code, that's already a big win. I'm curious why LightGBM can have faster computation time on CPU, as their code is quite simple and without any exhaustive optimization.
You can use it as code comments if you like. |
@trivialfis Don't worry about the CPU hist. I am trying to gather support from my org to optimize the memory usage and performance of the CPU hist. Stay tuned. Can you give me the benchmark you used to test performance, so that I can try to restore the performance later? |
That sounds awesome! |
@trivialfis I'm thinking of using gbm-bench
I think Rory posted a link for gbm-bench in NVIDIA's repo. I also have a public fork that comes with a branch.
Yup. 1 minute late. I'm on my phone. :-)
Hi @hcho3, @trivialfis, it is news to me that my changes have been reverted; I found out by accident just now, because I wasn't mentioned in the PR originally... I measured performance before and after the revert on a c5.metal AWS instance:
It looks like quite a large difference in performance on many-core systems. I'm interested in these optimizations (and I also know people who find them quite valuable). I'm ready to refactor the code according to your comments above, and to add tests and documentation, to get the optimizations back into master. What is your opinion? Is this the full list of what I need to do?
@SmirnovEgorRu Not pinging you was an oversight; we should do better with communication. If you can find ways to reintroduce these performance improvements without largely increasing the amount of code or obscuring readability, that would be welcome.
@SmirnovEgorRu Seconding @RAMitchell, I apologize for not mentioning you in this move. I will do better with communication going forward. Let me make this clear: it is my intent to bring back your optimization work before the 1.0 release. As for what you can do to help, you should give us the list of datasets you think we should use to measure performance. I have spent a fair amount of time on performance benchmarking in my current tenure at AWS (Amazon SageMaker), and I now feel more confident about setting up a regular, automatic benchmark suite. Once the suite is set up, we can make a good trade-off between performance and code legibility. (Hopefully we can get both.) Do you have a concrete idea about refactoring? If so, you should draft an RFC (Request for Comments) that outlines what each construct does. See an example RFC at https://discuss.tvm.ai/t/unifying-object-protocol-in-the-stack/4273. If you don't yet have an idea, I will review #4529 and post an RFC myself by the end of this year.
@hcho3 @RAMitchell, good, thank you! I will work on the RFC and will provide my benchmarks.
@hcho3, does the option of refactoring the existing code all at once work for you? It is much easier for me to make all the changes in one commit.
@SmirnovEgorRu Is it possible to use multiple commits? (Having a single pull request is fine.) Only the latest commit needs to pass CI. Then I can review one commit at a time. Also, an RFC summarizing the new constructs and abstractions would help a lot.
@SmirnovEgorRu given that the core problem was code complexity, I would highly recommend making PRs as small as possible. Can you find a way to incrementally introduce changes? I would start with smaller changes that are simple and give good value in terms of performance. |
I described my plans for restoring the performance, and for splitting the work into PRs, in issue #5104.
After #4529, the 'hist' updater became quite unwieldy and currently few developers understand it. In particular, the `CreateTasksForXXX` functions have created multiple layers of indirection, making it unclear which objects are being computed/modified. This PR reverts #4529 in an attempt to simplify the 'hist' updater. I plan to incorporate parts of #4529 at a later date, under the following conditions:
Also closes #4679.
cc @chenqin @trams