Upgrades XGBoost to 1.3.1 #107

Merged
merged 6 commits into from
Mar 23, 2021

Craigacp
Member

Description

Bumps XGBoost from 1.0.0 to 1.3.2. Also plumbs through additional XGBoost parameters, so now the booster type and tree method are controllable. Exposes the logging verbosity as a replacement for the deprecated silent parameter.
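
For illustration only (this is not code from the PR): a minimal sketch of how a booster type, tree method, verbosity level and thread count might be mapped onto XGBoost's string-keyed parameters. The class, enum and method names are hypothetical. Only the parameter keys (`booster`, `tree_method`, `verbosity`, `nthread`) and their values come from XGBoost's documented parameter set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not Tribuo's actual trainer code. */
final class XgbParamSketch {

    /** Values accepted by XGBoost's "booster" parameter. */
    enum BoosterType {
        GBTREE("gbtree"), GBLINEAR("gblinear"), DART("dart");
        final String paramName;
        BoosterType(String paramName) { this.paramName = paramName; }
    }

    /** A subset of the values accepted by XGBoost's "tree_method" parameter. */
    enum TreeMethod {
        AUTO("auto"), EXACT("exact"), APPROX("approx"), HIST("hist");
        final String paramName;
        TreeMethod(String paramName) { this.paramName = paramName; }
    }

    /** XGBoost's "verbosity" levels, which replace the deprecated "silent" flag. */
    enum Verbosity {
        SILENT(0), WARNING(1), INFO(2), DEBUG(3);
        final int level;
        Verbosity(int level) { this.level = level; }
    }

    /** Builds a parameter map of the kind XGBoost4J's training call accepts. */
    static Map<String, Object> buildParams(BoosterType booster, TreeMethod treeMethod,
                                           Verbosity verbosity, int numThreads) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("booster", booster.paramName);
        params.put("tree_method", treeMethod.paramName);
        params.put("verbosity", verbosity.level);
        params.put("nthread", numThreads);
        return params;
    }

    public static void main(String[] args) {
        // prints {booster=gbtree, tree_method=hist, verbosity=1, nthread=4}
        System.out.println(buildParams(BoosterType.GBTREE, TreeMethod.HIST, Verbosity.WARNING, 4));
    }
}
```

Using enums rather than raw strings keeps invalid booster or tree-method names out of the map at compile time, which is one plausible reason to expose them as typed options.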

Craigacp changed the title from "Upgrades XGBoost to 1.3.2" to "Upgrades XGBoost to 1.3.1" on Mar 3, 2021
@Craigacp
Member Author

Craigacp commented Mar 3, 2021

I rolled this back to 1.3.1 as that version has a macOS binary in Maven Central, and the fixes in 1.3.2 are Python or Solaris related.

@nezda
Contributor

nezda commented Mar 18, 2021

I'm trying this out locally on macOS (11.2.1). Seems to work fine. With silent false I see a full path on someone else's machine which seems odd:

[13:29:22] INFO: /Users/nanzhu/code/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 24 extra nodes, 0 pruned nodes, max_depth=6

Also multi-threading doesn't seem to be working (just based on looking at CPU usage - tried numThreads 8), however it did train 5x faster than the previous version! The training time (and CPU usage) didn't seem to be impacted by the numThreads parameter.

@Craigacp
Member Author

Craigacp commented Mar 18, 2021

I believe the JVM builds of XGBoost for macOS are currently compiled by the developers rather than by the CI, and the logger prints the source file's path and line number with each logging message. So I think that's probably compiled by https://github.com/CodingCat, who does a lot of the JVM work in XGBoost. This is probably changing in the next few releases: I've been talking to them about adding Windows support to the Maven Central builds, and there is a CI build for macOS too - dmlc/xgboost#6630 (comment). If that lands in the upcoming 1.4.0 release (and that release happens before Tribuo's 4.1 release) we'll update to that version to make things easier for our users.

I'm a little worried about the multithreading aspect, I'll see if I can replicate it. Roughly how big a problem were you using?

@Craigacp
Member Author

Craigacp commented Mar 18, 2021

Ah, so I'd missed the threading issue because internally we build XGBoost4j with OpenMP turned on for Windows, macOS and Linux, but it's not turned on in the builds provided by dmlc in Maven Central for macOS. I'll open an issue upstream as it looks like they build the Python macOS whl with OpenMP turned on.

@Craigacp
Member Author

On Linux with this branch, building a 500-tree model on MNIST (using Tribuo's default parameters) takes 30s with 1 thread and 10s with 6 threads (on an Intel Core i7-8700), so the Tribuo side of things is definitely passing down the right parameters.

  public float xbgAlpha = 0.0f;
- @Option(longName = "xgb-min-weight", usage = "Minimum sum of instance weights needed in a leaf (default 1, range [0,inf]).")
+ @Option(longName = "xgb-min-weight", usage = "Minimum sum of instance weights needed in a leaf (range [0,inf]).")
Member

How does one specify inf here? Float.MAX_VALUE? That maybe should be documented.

Member Author

Well, if you do specify inf I don't think it can make any splits, so it probably goes pop. I should check the XGBoost docs again; this comes from their CLI docs, but I don't think that line has been updated since XGBoost 0.7-ish.

Member

I guess this same question works better on xgb-max-depth, which has the same range.
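
On the question of how "inf" could be written for a float-valued option, a small sketch of what plain Java does may help. It assumes the option string ends up in `Float.parseFloat`, which is an assumption about the option parser and not something confirmed in this thread. The class and method names are hypothetical.

```java
/** Hypothetical sketch of float option parsing; the real option parser may differ. */
final class InfinityOptionSketch {

    /** Parses a raw option string with the JDK's standard float parser. */
    static float parseFloatOption(String raw) {
        return Float.parseFloat(raw.trim());
    }

    public static void main(String[] args) {
        float inf = parseFloatOption("Infinity");
        // The literal string "Infinity" parses to positive infinity.
        System.out.println(Float.isInfinite(inf));              // prints true
        // Infinity is strictly larger than the largest finite float.
        System.out.println(inf > Float.MAX_VALUE);              // prints true
        // Float.MAX_VALUE itself is finite, so it is not the same thing as inf.
        System.out.println(Float.isInfinite(Float.MAX_VALUE));  // prints false
    }
}
```

An integer-valued option such as a maximum depth has no infinity representation in Java, so an "inf" upper bound there can only mean "no practical limit".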

  public float xgbSubsample = 1.0f;
- @Option(longName = "xgb-num-threads", usage = "Number of threads to use (default 4, range (1, num hw threads)).")
+ @Option(longName = "xgb-num-threads", usage = "Number of threads to use (range (1, num hw threads)).")
Member

Doesn't this need a default value?

Member Author

OLCUT auto-generates it.

Member

Sorry, I meant doesn't the actual variable need a default value. Or at least, shouldn't it be required? Will xgboost crash if you give it zero threads or use all hardware threads?

Member Author

If you give it zero it uses all hardware threads. I should note that in the usage.
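
The "zero means all hardware threads" behaviour described here can be sketched as follows. This is only an illustration of the semantics, not XGBoost's or Tribuo's code: the class and method names are hypothetical, and rejecting negative values is an assumption of this sketch.

```java
/** Hypothetical sketch of the "0 = use all hardware threads" convention. */
final class ThreadCountSketch {

    /**
     * Resolves the requested thread count against the available hardware threads.
     * A request of 0 is treated as "use every hardware thread".
     */
    static int effectiveThreads(int requested, int hardwareThreads) {
        if (requested < 0) {
            // Assumption of this sketch: negative counts are treated as a configuration error.
            throw new IllegalArgumentException("Thread count must be >= 0, found " + requested);
        }
        return requested == 0 ? hardwareThreads : requested;
    }

    public static void main(String[] args) {
        int hardwareThreads = Runtime.getRuntime().availableProcessors();
        System.out.println("0 resolves to " + effectiveThreads(0, hardwareThreads) + " threads");
        System.out.println("6 resolves to " + effectiveThreads(6, hardwareThreads) + " threads");
    }
}
```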

Member

@JackSullivan JackSullivan left a comment

This looks good overall, there are a couple of UI nitpicks in XGBoostOptions I'd like clarification on. See inline comments.

Member

@JackSullivan JackSullivan left a comment

Looks good to me

JackSullivan merged commit 892afa1 into main on Mar 23, 2021
Craigacp deleted the xgboost-upgrade branch on March 23, 2021 17:44
@nezda
Contributor

nezda commented Mar 26, 2021

> Ah, so I'd missed the threading issue because internally we build XGBoost4j with OpenMP turned on for Windows, macOS and Linux, but it's not turned on in the builds provided by dmlc in Maven Central for macOS. I'll open an issue upstream as it looks like they build the Python macOS whl with OpenMP turned on.

I got multi-threading working on macOS by following the directions at https://xgboost.readthedocs.io/en/latest/jvm/index.html#enabling-openmp-for-mac-os, then building XGBoost and this project with <xgboost.version>1.4.0-SNAPSHOT</xgboost.version> - quite the time saver 👍

@Craigacp
Member Author

Excellent. Looks like the XGBoost developers will discuss turning it on in XGBoost builds after the upcoming 1.4.0 release. I'm hopeful that the 1.4.0 release will include Windows binaries though, which will make things much simpler for Tribuo.
