
Revert #4529 #5008

Merged
merged 2 commits into from
Nov 12, 2019

Conversation

@hcho3 (Collaborator) commented Nov 4, 2019

After #4529, the 'hist' updater became quite unwieldy and currently few developers understand it. In particular, CreateTasksForXXX functions have created multiple layers of indirection, making it unclear which objects are being computed/modified.

This PR reverts #4529 in an attempt to simplify the 'hist' updater. I plan to incorporate parts of #4529 at a later date, under the following conditions:

  • A test suite to detect regressions in performance and correctness must be set up.
  • All new abstractions should be clearly documented, with unit tests.

Also closes #4679.

cc @chenqin @trams

@trivialfis (Member)

Please keep this branch open. We will work on it together. :-)

@hcho3 (Collaborator Author) commented Nov 4, 2019

@trivialfis Sure, I'll keep the branch. For now, can you review this PR as it is? The criteria should be whether the codebase after the revert is legible or not. (Can you understand all parts of the updater? Is it clear what each function does? Is it easy to understand which objects are being modified? etc). Once the updater becomes legible and clear, then we can start adding improvements.

@trivialfis (Member)

Yup. I will do a full verification of whether there are implicit conflicts or regressions, with proper tests. I want to make the whole XGBoost codebase as robust as possible.

@hcho3 (Collaborator Author) commented Nov 4, 2019

I realize the importance of legibility and organization. If developers cannot understand the code, they cannot improve it or debug it.

Some ideas:

  • No big, monolithic functions that try to do lots of things. Break down a big function into a combination of smaller functions and have the original function call them.
  • If possible, hide complex implementation details inside an abstraction (a class).
  • Each function should try to modify only a small number of objects, and these objects should be passed via the function arguments. The use of many class variables to pass states between different class functions is an anti-pattern, since it's liable to create spaghetti code. If the argument list becomes too long, create a new class to consolidate arguments.
  • Do not perform complex manipulation on public class members directly. Instead, create public methods (functions) to manipulate them.

I realize that there is always a subjective element when it comes to legibility and maintainability. However, having a good list of guidelines would help.

@chenqin (Contributor) commented Nov 5, 2019

> I realize the importance of legibility and organization. […] However, having a good list of guidelines would help.

Shall we consider putting this in https://xgboost.readthedocs.io/en/latest/contrib/index.html?

@trivialfis (Member)

Working on this.

@trivialfis (Member) left a comment

@hcho3

The reverted PR boosted performance close to 2x, even with a small number of threads, so this revert will roughly double the training time on many datasets (tested with dense data, not sparse). After some thought, I still believe reverting is the right thing to do.

For future reference, the optimizations applied in the original PR were primarily three-fold:

  • Block-based parallelism. This way the granularity can be fine-tuned instead of being fixed at the row level.
  • Node-based parallelism. One more level of parallelism. Not sure if this is really helpful.
  • Preallocating memory for partitioning. Memory usage effectively goes up, but performance also improves because we no longer search the heap for memory blocks.

So if it's all good stuff, why do we want to revert it? Mostly because the optimization PR tried to prepare for a task-based execution model, which can exploit even more parallelism given the right scheduler. But with blocks + explicit closures (named "tasks" in the original code) + many copied-and-pasted code blocks, the code became messy and difficult to read. Graph-based computation is not that difficult to implement if we have something similar to TBB or CUDA Graphs.

@hcho3 (Collaborator Author) commented Nov 12, 2019

@trivialfis Should we consider using TBB?

@trivialfis (Member) commented Nov 12, 2019

@hcho3 Sorry, but I don't think I will spend a lot of time trying to optimize the CPU hist beyond usual maintenance, as my priority is the GPU. Sometimes I might create PRs in my spare time for possible extended algorithms that I'm interested in (like the unifying leaf index PR). Hence the above note for future contributors (or our future selves). Hope you can understand.

But to answer your question, this requires experiments. If we can restore the pre-revert performance with clean code, then that's already a big win. I'm curious why LightGBM can achieve faster computation time on CPU, as their code is quite simple and without any exhaustive optimization.

@trivialfis (Member)

You can use it as code comments if you like.

@hcho3 (Collaborator Author) commented Nov 12, 2019

@trivialfis Don't worry about the CPU hist. I am trying to gather support from my org to optimize the memory usage and performance of the CPU hist. Stay tuned.

Can you give me the benchmark you used to test performance, so that I can try to restore the performance later?

@hcho3 hcho3 merged commit f4e7b70 into dmlc:master Nov 12, 2019
@hcho3 hcho3 deleted the simplify_hist branch November 12, 2019 17:35
@trivialfis (Member)

That sounds awesome!

@hcho3 hcho3 restored the simplify_hist branch November 12, 2019 17:35
@hcho3 hcho3 deleted the simplify_hist branch November 12, 2019 17:36
@hcho3 (Collaborator Author) commented Nov 12, 2019

@trivialfis I'm thinking of using gbm-bench.

@trivialfis (Member)

I think Rory posted a link to gbm-bench in NVIDIA's repo. I also have a public fork that comes with a branch `add-url` for using the URL dataset (extremely sparse).

@trivialfis (Member) commented Nov 12, 2019

Yup. 1 minute late. I'm on my phone. :-)

@SmirnovEgorRu (Contributor)

Hi @hcho3, @trivialfis, it is news to me that my changes were reverted; I found out by accident just now, because I wasn't mentioned in the PR originally...
Anyway, your comments look reasonable. Code complexity is a downside of low-level optimizations.

I measured performance before and after the code revert on a c5.metal AWS instance:

|                            | Abalone | Letters | Mortgage | Higgs1m | Airline | MSRank |
|----------------------------|---------|---------|----------|---------|---------|--------|
| Time after reverting, sec  | 10.0    | 144.6   | 53.4     | 123.5   | 133.2   | 332.2  |
| Time before reverting, sec | 0.7     | 10.3    | 18.3     | 15.5    | 55.3    | 99.1   |
| Difference, times          | 14.3    | 14.0    | 2.9      | 8.0     | 2.4     | 3.4    |

That is quite a large difference in performance on many-core systems. I'm interested in these optimizations (and I also know people who find them quite valuable). I'm ready to refactor the code according to your comments above and add tests and documentation, to get the optimizations back into master. What is your opinion? Is that the full list of what I need to do?

@RAMitchell (Member)

@SmirnovEgorRu Not pinging you was an oversight; we should do better with communication. If you can find ways to reintroduce these performance improvements without greatly increasing the amount of code or obscuring readability, that would be welcome.

@hcho3 (Collaborator Author) commented Dec 6, 2019

@SmirnovEgorRu Seconding @RAMitchell, I apologize for not mentioning you in this move. I will do better with communication going forward.

Let me make this clear: it is my intent to bring back your optimization work before the 1.0 release. As for what you can do to help, please give us the list of datasets you think we should use to measure performance. I have spent a fair amount of time on performance benchmarking in my current tenure at AWS (Amazon SageMaker), and I now feel more confident about setting up a regular, automatic benchmark suite. Once the suite is set up, we can make a good trade-off between performance and code legibility. (Hopefully we can get both.)

Do you have a concrete idea about the refactoring? If so, you should draft an RFC (Request for Comments) that outlines what each construct does. See an example RFC at https://discuss.tvm.ai/t/unifying-object-protocol-in-the-stack/4273. If you don't yet have an idea, I will review #4529 and post an RFC myself by the end of this year.

@SmirnovEgorRu (Contributor)

@hcho3 @RAMitchell, good, thank you! I will work on the RFC and provide my benchmarks.
I hope to bring results soon.

@SmirnovEgorRu (Contributor)

@hcho3, does the option of refactoring the existing code all at once work for you? For me it is much easier to make all the changes in one commit.
If that is hard for you to review, I will think about how to split the code, but unfortunately there are many code dependencies.

@hcho3 (Collaborator Author) commented Dec 6, 2019

@SmirnovEgorRu Is it possible to use multiple commits? (Having a single pull request is fine.) Only the latest commit needs to pass CI; then I can review one commit at a time. Also, an RFC summarizing the new constructs and abstractions would help a lot.

@RAMitchell (Member)

@SmirnovEgorRu Given that the core problem was code complexity, I would highly recommend making the PRs as small as possible. Can you find a way to introduce the changes incrementally? I would start with smaller changes that are simple and give good value in terms of performance.

@SmirnovEgorRu (Contributor)

I have laid out my plan for restoring the performance optimizations, split across several PRs, in issue #5104.
Let's move the discussion there.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020
Development

Successfully merging this pull request may close these issues.

[BLOCKING] Per-node sync slows down distributed training with 'hist'
5 participants