[RFC] High-level interface for external memory. #11000
This document concerns the design of a future high-level external memory interface for XGBoost. The closest existing examples are the data loaders in deep learning libraries, and there's no standardized interface. I want to tap into your wisdom and experience to build a new and general interface.
Background
In the last few months, we have significantly improved the external memory implementation, and it's inching toward becoming production-ready. The improvement is an essential milestone for XGBoost: memory capacity has been a pain point for users for a long time. Since it is not a stochastic algorithm, XGBoost has to load all the data into main memory for efficient model training, and this constraint restricts how much data a user can feed into the training algorithm.
With the help of the external memory version of XGBoost, we can now fetch data from a different (presumably larger) memory location, be it host memory for GPU or persistent memory for CPU. During training, XGBoost constructs a cache in external memory and fetches the data from this cache. To build this cache, users first define a custom iterator that loads raw data in batches:
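A condensed sketch of such an iterator follows; the two `.npy` batch files and the cache prefix are illustrative, and the exact `next`/`reset` signatures may differ slightly across XGBoost versions:

```python
import os
from typing import Callable, List, Tuple

import numpy as np
import xgboost


class Iterator(xgboost.DataIter):
    """Yields the dataset to XGBoost one batch at a time."""

    def __init__(self, file_paths: List[Tuple[str, str]]) -> None:
        self._file_paths = file_paths  # (X_path, y_path) pairs, one per batch
        self._it = 0
        # XGBoost builds its external-memory cache under this prefix.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data: Callable) -> bool:
        """Called by XGBoost to fetch the next batch; False ends the pass."""
        if self._it == len(self._file_paths):
            return False
        X_path, y_path = self._file_paths[self._it]
        input_data(data=np.load(X_path), label=np.load(y_path))
        self._it += 1
        return True

    def reset(self) -> None:
        """Called by XGBoost at the end of each pass over the data."""
        self._it = 0


# The iterator is handed to the DMatrix, which materializes the on-disk
# cache; training then streams batches from that cache.
it = Iterator([("X-0.npy", "y-0.npy"), ("X-1.npy", "y-1.npy")])
Xy = xgboost.DMatrix(it)
booster = xgboost.train({"tree_method": "hist"}, Xy)
```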
For a full example, please see https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py .
The iterator is considered the native interface. We want to build something at a higher level that can be composed with other libraries and downstream projects. With this new input schema, we are getting closer to what deep learning libraries do, like the data loader from PyTorch. However, since XGBoost is mainly built upon the idea that it can access all the data during training, the existing interfaces are biased toward the scikit-learn-like `fit` method, where a user specifies all the data up front, either as a dataframe or as an array-like object. The mismatch between the two interfaces means scikit-learn utilities like cross-validation and calibration are no longer applicable. As a result, we need to design a new interface or paradigm for batch-based algorithms.
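For contrast, here is a minimal sketch of the in-memory `fit` path that the scikit-learn ecosystem assumes; the synthetic data is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
import xgboost

# The entire dataset is materialized as a single in-memory array.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))
y = (X[:, 0] > 0).astype(np.int64)

# Everything is handed over in one call ...
clf = xgboost.XGBClassifier(n_estimators=10)
clf.fit(X, y)

# ... which is what lets utilities like cross-validation slice X and y
# freely. There is no direct analogue when data arrives batch by batch.
scores = cross_val_score(xgboost.XGBClassifier(n_estimators=10), X, y, cv=3)
```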
Due to the flexible nature of DL architectures, there's no standard way to perform these wrapper tasks, and hence, it's unlikely that we can simply adopt the tools from the DL community. In addition, due to the size of modern-day DL models, cross-validation is not a particularly important topic for DL training.
Features
I want to list a set of features the new interface might possess. I will move from XGBoost-specific feature requests to more general wishes. This list is up for discussion and is not a hard requirement. You are more than welcome to make any changes.
Bindings
I don't have a strong preference for any particular language. Any design paradigm is welcome, and we can generalize it to other languages.
Comments

@richardliaw: Hey @trivialfis, this is a great path forward. A couple of immediate thoughts:

On our side, we'd be very interested to see Ray Data's training iterator API work well with the external memory support; something like the sketch after this exchange.

@trivialfis: @richardliaw Thank you for reaching out. The distributed training is quite similar; see https://github.com/dmlc/xgboost/blob/master/demo/guide-python/distributed_extmem_basic.py .

This is not applicable to XGBoost, unfortunately. We still need to access the entire dataset to train each tree instead of using a part of the data to fit a part of the tree. The difference is that we can now fetch the data on demand.
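A speculative sketch of the kind of Ray Data adapter @richardliaw describes, assuming Ray Data's `iter_batches` API and XGBoost's `DataIter` protocol; `RayDataIter` and `label_column` are illustrative names rather than an agreed-upon design:

```python
import ray
import xgboost


class RayDataIter(xgboost.DataIter):
    """Adapts a Ray Dataset to XGBoost's batch-iterator protocol."""

    def __init__(self, dataset: "ray.data.Dataset", label_column: str) -> None:
        self._dataset = dataset
        self._label_column = label_column
        self._batches = None
        # XGBoost stages its external-memory cache under this prefix.
        super().__init__(cache_prefix="./ray-cache")
        self.reset()

    def reset(self) -> None:
        # Start a fresh pass over the Ray Dataset.
        self._batches = iter(self._dataset.iter_batches(batch_format="pandas"))

    def next(self, input_data) -> bool:
        try:
            batch = next(self._batches)
        except StopIteration:
            return False
        y = batch.pop(self._label_column)  # split the label off the batch
        input_data(data=batch, label=y)
        return True


# Hypothetical usage:
#   ds = ray.data.read_parquet("s3://bucket/training-data")
#   Xy = xgboost.DMatrix(RayDataIter(ds, label_column="label"))
#   booster = xgboost.train({"tree_method": "hist"}, Xy)
```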