[RFC] High-level interface for external memory. #11000
This document concerns the design of a future high-level external memory interface for XGBoost. The closest existing examples are the data loaders in deep learning libraries, and there's no standardized interface. I want to tap into your wisdom and experience to build a new and general interface.
Background
In the last few months, we have significantly improved the external memory implementation, and it's inching toward becoming production-ready. The improvement is an essential milestone for XGBoost: memory capacity has been a pain point for users for a long time. Since it is not a stochastic algorithm, XGBoost has to load all the data into main memory for efficient model training, and this constraint restricts how much data a user can feed into the training algorithm.
With the help of the external memory version of XGBoost, we can now fetch data from a different (presumably larger) memory location, be it host memory for GPU or persistent memory for CPU. During training, XGBoost constructs a cache in external memory and fetches the data from this cache. To build this cache, users first define a custom iterator that loads raw data in batches:
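A condensed sketch of such an iterator follows; the two `.npy` batch files and the cache prefix are illustrative, and the exact `next`/`reset` signatures may differ slightly across XGBoost versions:

```python
import os
from typing import Callable, List, Tuple

import numpy as np
import xgboost


class Iterator(xgboost.DataIter):
    """Yields the dataset to XGBoost one batch at a time."""

    def __init__(self, file_paths: List[Tuple[str, str]]) -> None:
        self._file_paths = file_paths  # (X_path, y_path) pairs, one per batch
        self._it = 0
        # XGBoost builds its external-memory cache under this prefix.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data: Callable) -> bool:
        """Called by XGBoost to fetch the next batch; False ends the pass."""
        if self._it == len(self._file_paths):
            return False
        X_path, y_path = self._file_paths[self._it]
        input_data(data=np.load(X_path), label=np.load(y_path))
        self._it += 1
        return True

    def reset(self) -> None:
        """Called by XGBoost at the end of each pass over the data."""
        self._it = 0


# The iterator is handed to the DMatrix, which materializes the on-disk
# cache; training then streams batches from that cache.
it = Iterator([("X-0.npy", "y-0.npy"), ("X-1.npy", "y-1.npy")])
Xy = xgboost.DMatrix(it)
booster = xgboost.train({"tree_method": "hist"}, Xy)
```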
For a full example, please see https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py .
The iterator is considered the native interface. We want to build something at a higher level that can be composed with other libraries and downstream projects. With this new input schema, we are getting closer to what deep learning libraries do, like the data loader from PyTorch. However, since XGBoost is mainly built upon the idea that it can access all the data during training, the existing interfaces are biased toward the scikit-learn-like `fit` method, where a user specifies all the data up front, either as a dataframe or as an array-like object. The mismatch between the two interfaces means scikit-learn utilities like cross-validation and calibration are no longer applicable. As a result, we need to design a new interface or paradigm for batch-based algorithms.
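For contrast, here is a minimal sketch of the in-memory `fit` path that the scikit-learn ecosystem assumes; the synthetic data is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
import xgboost

# The entire dataset is materialized as a single in-memory array.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))
y = (X[:, 0] > 0).astype(np.int64)

# Everything is handed over in one call ...
clf = xgboost.XGBClassifier(n_estimators=10)
clf.fit(X, y)

# ... which is what lets utilities like cross-validation slice X and y
# freely. There is no direct analogue when data arrives batch by batch.
scores = cross_val_score(xgboost.XGBClassifier(n_estimators=10), X, y, cv=3)
```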
Due to the flexible nature of DL architectures, there's no standard way to perform these wrapper tasks, and hence, it's unlikely that we can simply adopt the tools from the DL community. In addition, due to the size of modern-day DL models, cross-validation is not a particularly important topic for DL training.
Features
I want to list a set of features the new interface might possess. I will move from XGBoost-specific feature requests to more general wishes. This list is up for discussion and is not a hard requirement. You are more than welcome to make any changes.
Bindings
I don't have a strong preference for any particular language. Any design paradigm is welcome, and we can generalize it to other languages.
Comments

@richardliaw: Hey @trivialfis, this is a great path forward. A couple of immediate thoughts:

On our side, we'd be very interested to see Ray Data's training iterator API work well with the external memory support; something like the sketch after this exchange.

@trivialfis: @richardliaw Thank you for reaching out. The distributed training is quite similar; see https://github.com/dmlc/xgboost/blob/master/demo/guide-python/distributed_extmem_basic.py .

This is not applicable to XGBoost, unfortunately. We still need to access the entire dataset to train each tree instead of using a part of the data to fit a part of the tree. The difference is that we can now fetch the data on demand.
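A speculative sketch of the kind of Ray Data adapter @richardliaw describes, assuming Ray Data's `iter_batches` API and XGBoost's `DataIter` protocol; `RayDataIter` and `label_column` are illustrative names rather than an agreed-upon design:

```python
import ray
import xgboost


class RayDataIter(xgboost.DataIter):
    """Adapts a Ray Dataset to XGBoost's batch-iterator protocol."""

    def __init__(self, dataset: "ray.data.Dataset", label_column: str) -> None:
        self._dataset = dataset
        self._label_column = label_column
        self._batches = None
        # XGBoost stages its external-memory cache under this prefix.
        super().__init__(cache_prefix="./ray-cache")
        self.reset()

    def reset(self) -> None:
        # Start a fresh pass over the Ray Dataset.
        self._batches = iter(self._dataset.iter_batches(batch_format="pandas"))

    def next(self, input_data) -> bool:
        try:
            batch = next(self._batches)
        except StopIteration:
            return False
        y = batch.pop(self._label_column)  # split the label off the batch
        input_data(data=batch, label=y)
        return True


# Hypothetical usage:
#   ds = ray.data.read_parquet("s3://bucket/training-data")
#   Xy = xgboost.DMatrix(RayDataIter(ds, label_column="label"))
#   booster = xgboost.train({"tree_method": "hist"}, Xy)
```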