
Revert #4529 #5008

Merged
merged 2 commits into from
Nov 12, 2019

Conversation

@hcho3 (Collaborator) commented Nov 4, 2019

After #4529, the 'hist' updater became quite unwieldy and currently few developers understand it. In particular, CreateTasksForXXX functions have created multiple layers of indirection, making it unclear which objects are being computed/modified.

This PR reverts #4529 in an attempt to simplify the 'hist' updater. I plan to incorporate parts of #4529 at a later date, under the following conditions:

  • A test suite to detect regressions in performance and correctness must be set up.
  • All new abstractions should be clearly documented, with unit tests.

Also closes #4679.

cc @chenqin @trams

@trivialfis (Member)

Please keep this branch open. We will work on it together. :-)

@hcho3 (Collaborator Author) commented Nov 4, 2019

@trivialfis Sure, I'll keep the branch. For now, can you review this PR as it is? The criteria should be whether the codebase after the revert is legible or not. (Can you understand all parts of the updater? Is it clear what each function does? Is it easy to understand which objects are being modified? etc). Once the updater becomes legible and clear, then we can start adding improvements.

@trivialfis (Member)

Yup. I will do a full verification of whether there are implicit conflicts or regressions, with proper tests. I want to make the whole XGBoost codebase as robust as possible.

@hcho3 (Collaborator Author) commented Nov 4, 2019

I realize the importance of legibility and organization. If developers cannot understand the code, they cannot improve it or debug it.

Some ideas:

  • No big, monolithic functions that try to do lots of things. Break down a big function into a combination of smaller functions and have the original function call them.
  • If possible, hide complex implementation details inside an abstraction (a class).
  • Each function should try to modify only a small number of objects, and these objects should be passed via the function arguments. The use of many class variables to pass states between different class functions is an anti-pattern, since it's liable to create spaghetti code. If the argument list becomes too long, create a new class to consolidate arguments.
  • Do not perform complex manipulation on public class members directly. Instead, create public methods (functions) to manipulate them.

I realize that there is always a subjective element when it comes to legibility and maintainability. However, having a good list of guidelines would help.

@chenqin (Contributor) commented Nov 5, 2019

> I realize the importance of legibility and organization. […] However, having a good list of guidelines would help.

Shall we consider putting this in https://xgboost.readthedocs.io/en/latest/contrib/index.html?

@trivialfis (Member)

Working on this.

@trivialfis (Member) left a comment

@hcho3

The reverted PR boosted performance close to 2x, even with a small number of threads, so this revert will roughly double the training time on many datasets (tested with dense data, not sparse). After some thought, I still believe reverting is the right thing to do.

For future reference, the optimizations applied in the original PR were primarily three-fold:

  • Block-based parallelism. This way the granularity can be fine-tuned instead of being fixed at the row level.
  • Node-based parallelism. One more level of parallelism. Not sure if this is really helpful.
  • Preallocating memory for partitioning. Memory usage effectively goes up, but performance also improves because we no longer search the heap for memory blocks.

So if it's all good stuff, why do we want to revert it? Mostly because the optimization PR tried to prepare for a task-based execution model, which can exploit even more parallelism given the right scheduler. But with blocks + explicit closures (named "tasks" in the original code) + many copied-and-pasted code blocks, the code became messy and difficult to read. Graph-based computation is not that difficult to implement if we have something similar to TBB or CUDA Graphs.

@hcho3 (Collaborator Author) commented Nov 12, 2019

@trivialfis Should we consider using TBB?

@trivialfis (Member) commented Nov 12, 2019

@hcho3 Sorry, but I don't think I will spend a lot of time trying to optimize the CPU hist beyond usual maintenance, as my priority is the GPU. Sometimes I might create PRs in my spare time for possible extended algorithms that I'm interested in (like the unifying leaf index PR). Hence the above note for future contributors (or our future selves). Hope you can understand.

But to answer your question, this requires experiments. If we can restore the pre-revert performance with clean code, then that's already a big win. I'm curious why LightGBM can achieve faster computation time on CPU, as their code is quite simple and without any exhaustive optimization.

@trivialfis (Member)

You can use it as code comments if you like.

@hcho3 (Collaborator Author) commented Nov 12, 2019

@trivialfis Don't worry about the CPU hist. I am trying to gather support from my org to optimize the memory usage and performance of the CPU hist. Stay tuned.

Can you give me the benchmark you used to test performance, so that I can try to restore the performance later?

@hcho3 hcho3 merged commit f4e7b70 into dmlc:master Nov 12, 2019
@hcho3 hcho3 deleted the simplify_hist branch November 12, 2019 17:35
@trivialfis (Member)

That sounds awesome!

@hcho3 hcho3 restored the simplify_hist branch November 12, 2019 17:35
@hcho3 hcho3 deleted the simplify_hist branch November 12, 2019 17:36
@hcho3 (Collaborator Author) commented Nov 12, 2019

@trivialfis I'm thinking of using gbm-bench.

@trivialfis (Member)

I think Rory posted a link to gbm-bench in NVIDIA's repo. I also have a public fork that comes with a branch `add-url` for using the URL dataset (extremely sparse).

@trivialfis (Member) commented Nov 12, 2019

Yup. 1 minute late. I'm on my phone. :-)

@SmirnovEgorRu (Contributor)

Hi @hcho3, @trivialfis, it is news to me that my changes were reverted; I found out by accident just now, because I wasn't mentioned in the PR originally...
Anyway, your comments look reasonable. Code complexity is a downside of low-level optimizations.

I measured performance before and after the code revert on a c5.metal AWS instance:

|                            | Abalone | Letters | Mortgage | Higgs1m | Airline | MSRank |
|----------------------------|---------|---------|----------|---------|---------|--------|
| Time after reverting, sec  | 10.0    | 144.6   | 53.4     | 123.5   | 133.2   | 332.2  |
| Time before reverting, sec | 0.7     | 10.3    | 18.3     | 15.5    | 55.3    | 99.1   |
| Difference, times          | 14.3    | 14.0    | 2.9      | 8.0     | 2.4     | 3.4    |

That is quite a large difference in performance on many-core systems. I'm interested in these optimizations (and I also know people who find them quite valuable). I'm ready to refactor the code according to your comments above and add tests and documentation, to get the optimizations back into master. What is your opinion? Is that the full list of what I need to do?

@RAMitchell (Member)

@SmirnovEgorRu Not pinging you was an oversight; we should do better with communication. If you can find ways to reintroduce these performance improvements without greatly increasing the amount of code or obscuring readability, that would be welcome.

@hcho3 (Collaborator Author) commented Dec 6, 2019

@SmirnovEgorRu Seconding @RAMitchell, I apologize for not mentioning you in this move. I will do better with communication going forward.

Let me make this clear: it is my intent to bring back your optimization work before the 1.0 release. As for what you can do to help, please give us the list of datasets you think we should use to measure performance. I have spent a fair amount of time on performance benchmarking in my current tenure at AWS (Amazon SageMaker), and I now feel more confident about setting up a regular, automatic benchmark suite. Once the suite is set up, we can make a good trade-off between performance and code legibility. (Hopefully we can get both.)

Do you have a concrete idea about the refactoring? If so, you should draft an RFC (Request for Comments) that outlines what each construct does. See an example RFC at https://discuss.tvm.ai/t/unifying-object-protocol-in-the-stack/4273. If you don't yet have an idea, I will review #4529 and post an RFC myself by the end of this year.

@SmirnovEgorRu (Contributor)

@hcho3 @RAMitchell, good, thank you! I will work on the RFC and provide my benchmarks.
I hope to bring results soon.

@SmirnovEgorRu (Contributor)

@hcho3, does the option of refactoring the existing code all at once work for you? For me it is much easier to make all the changes in one commit.
If that is hard for you to review, I will think about how to split the code, but unfortunately there are many code dependencies.

@hcho3 (Collaborator Author) commented Dec 6, 2019

@SmirnovEgorRu Is it possible to use multiple commits? (Having a single pull request is fine.) Only the latest commit needs to pass CI; then I can review one commit at a time. Also, an RFC summarizing the new constructs and abstractions would help a lot.

@RAMitchell (Member)

@SmirnovEgorRu Given that the core problem was code complexity, I would highly recommend making the PRs as small as possible. Can you find a way to introduce the changes incrementally? I would start with smaller changes that are simple and give good value in terms of performance.

@SmirnovEgorRu (Contributor)

I have laid out my plan for restoring the performance optimizations, split across several PRs, in issue #5104.
Let's move the discussion there.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020
Development

Successfully merging this pull request may close these issues.

[BLOCKING] Per-node sync slows down distributed training with 'hist'
5 participants