[CUDA] GPU training fails with (split_indices_block_size_data_partition) > (0) on Ubuntu 22.04 #6727
Comments
Have you tried using:
Interesting suggestion, but I'm not able to get that solution working in a Docker container either. FWIW, my local build hits the lightgbm.basic.LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0) at ~/LightGBM/src/treelearner/cuda/cuda_data_partition.cpp, line 280 . error when used in code. This remains true if I install LightGBM inside a Docker container with something like

RUN python3.11 -m pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm

and then copy the installed package outside of the container into my local environment.
The build will succeed even if the CUDA versions do not match (driver, LightGBM build, etc.), and the mismatch will only cause an error when you attempt to train a model. That is probably the cause, I reckon.

For a quick test (5 to 10 minutes of build time), you can try the following Dockerfile (tested and working on 2 hosts, a PC and a laptop). After the build completes, run it (8888 is the default Jupyter port; unless your host is already using that port, it should work):

docker run --gpus all -p 8888:8888 <image-name>

Then open http://localhost:8888/lab in your browser and you can test your code. Note: the notebook is set up with no password for testing, so you might want to change that if it works.

If you are still encountering problems, check with nvidia-smi that Docker and the host both report the same CUDA Version. This is from my host:

FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
#################################################################################################################
# Global
#################################################################################################################
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive
#################################################################################################################
# SYSTEM
#################################################################################################################
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        curl \
        bzip2 \
        ca-certificates \
        libglib2.0-0 \
        libxext6 \
        libsm6 \
        libxrender1 \
        git \
        gnupg \
        swig \
        vim \
        mercurial \
        subversion \
        python3-dev \
        python3-pip \
        python3-setuptools \
        ocl-icd-opencl-dev \
        cmake \
        libboost-dev \
        libboost-system-dev \
        libboost-filesystem-dev \
        gcc \
        g++ && \
    # Remove old CMake and install the latest version
    apt-get remove -y cmake && \
    curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.28.5/cmake-3.28.5-linux-x86_64.tar.gz | tar -xz -C /usr/local --strip-components=1 && \
    # Install Node.js 18.x
    curl -fsSL https://deb.nodesource.com/setup_18.x | bash - && \
    apt-get update && \
    apt-get install -y nodejs=18.20.4*
# Add OpenCL ICD files for LightGBM
RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
#################################################################################################################
# ML libraries and dependencies
#################################################################################################################
RUN pip3 install --upgrade pip
RUN pip3 install jupyterlab==4.2.5
RUN pip3 install scikit-learn==1.5.2
RUN pip3 install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm
RUN apt-get autoremove -y && apt-get clean && \
    rm -rf /var/lib/apt/lists/*
CMD ["jupyter-lab", "--ip=0.0.0.0", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''", "--no-browser"]

Lastly, the LightGBM install command is kept at the end so that you can switch between installation methods without rebuilding the earlier layers, which makes debugging the LightGBM installation much faster. If you are missing any system package, just add it in the SYSTEM section; only that layer and the ones after it will be rebuilt. Hopefully this gets you a working LightGBM on your machine.

For completeness, I used the following to test the build:

import numpy as np
from sklearn.model_selection import train_test_split
import lightgbm as lgb
np.random.seed(42)
n_samples = 500 * 10000
n_features = 51
X = np.random.rand(n_samples, n_features).astype(np.float32)
y = np.random.rand(n_samples).astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = np.ascontiguousarray(X_train, dtype=np.float32)
y_train = np.ascontiguousarray(y_train, dtype=np.float32)
new_lgb_train = lgb.Dataset(X_train, label=y_train)
cuda_params = {
    'objective': 'regression',
    'boosting_type': 'dart',
    'colsample_bytree': 0.7,
    'learning_rate': 0.01,
    'max_depth': 7,
    'subsample': 0.7,
    'n_jobs': 32,
    'num_leaves': 63,
    'verbose': 1,
    'device': 'cuda',
    'force_row_wise': True
}
gbm_cuda = lgb.train(cuda_params,
                     new_lgb_train,
                     num_boost_round=100)
gbm_cuda.predict(X_test)
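
To tell an installation problem apart from a problem with a particular dataset or parameter set, it can help to run the same training once per device and record what fails. The following is a minimal sketch, not part of LightGBM: `probe_devices` and `make_lightgbm_trainer` are hypothetical helper names, and the reduced parameter set is an assumption chosen only to keep the probe fast.

```python
from typing import Callable, Dict, Iterable, Optional


def probe_devices(
    train: Callable[[str], object],
    devices: Iterable[str] = ("cuda", "cpu"),
) -> Dict[str, Optional[str]]:
    """Call train(device) for each device and collect the outcome.

    Returns a dict mapping each device name to None on success, or to
    the error message if training raised. Hypothetical helper.
    """
    results: Dict[str, Optional[str]] = {}
    for device in devices:
        try:
            train(device)
            results[device] = None
        except Exception as exc:  # LightGBMError derives from Exception
            results[device] = f"{type(exc).__name__}: {exc}"
    return results


def make_lightgbm_trainer(X, y, num_boost_round: int = 10) -> Callable[[str], object]:
    """Build a train(device) callable around lgb.train."""
    # Imported lazily so probe_devices itself has no third-party dependency.
    import lightgbm as lgb

    def train(device: str):
        # Assumed minimal parameters; swap in your own to test a real config.
        params = {"objective": "regression", "device": device, "verbose": -1}
        return lgb.train(params, lgb.Dataset(X, label=y),
                         num_boost_round=num_boost_round)

    return train
```

With the arrays from the script above, `probe_devices(make_lightgbm_trainer(X_train, y_train))` should map `cpu` to `None` on a working install. An error string for `cuda` alone then points at the CUDA build or driver setup rather than at the data.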
I'm not sure I'm following. Is the purpose of building LightGBM inside a Docker container to solve this supposed mismatch between the NVIDIA driver and CUDA version? Or is it to identify such a mismatch in the first place so that I can solve it locally? The project I am working on doesn't use Docker, so even if this works, I still need to be able to use LightGBM with CUDA outside of the container. This feels like a pretty circuitous workaround for a pretty common OS and GPU combination. It would be great to hear from a current maintainer whether the issue I originally posted above is reproducible (and therefore an actual bug), or something particular to my system.
Using Docker is primarily a way to avoid conflicts between NVIDIA drivers and CUDA versions. Comparing the nvidia-smi output between Docker and the host will identify a mismatch, which you can then solve locally by installing the correct version. The main reason, though, is that I work with multiple ML frameworks and libraries, and Docker helps manage conflicting dependencies without risking issues on my host system, so I suggested what worked for me. Anyway, I hope a current maintainer can provide a solution for you soon. Cheers!
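
The comparison described above can also be scripted. This is a rough sketch under stated assumptions: it relies on the `CUDA Version: X.Y` field in the `nvidia-smi` header (the highest CUDA version the driver supports) and on the `release X.Y` text printed by `nvcc --version`. The function names are made up for illustration.

```python
import re
import shutil
import subprocess
from typing import List, Optional, Tuple

Version = Tuple[int, int]


def parse_driver_cuda_version(nvidia_smi_output: str) -> Optional[Version]:
    """Extract the 'CUDA Version: X.Y' field from the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_toolkit_version(nvcc_output: str) -> Optional[Version]:
    """Extract 'release X.Y' from the output of `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_newer_than_driver(driver: Version, toolkit: Version) -> bool:
    """True when the toolkit targets a newer CUDA than the driver reports."""
    return toolkit > driver


def run_tool(cmd: List[str]) -> Optional[str]:
    """Return the tool's stdout, or None if the tool is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def check_host() -> Optional[bool]:
    """Compare driver and toolkit on this machine; None if either is unknown."""
    smi, nvcc = run_tool(["nvidia-smi"]), run_tool(["nvcc", "--version"])
    driver = parse_driver_cuda_version(smi) if smi else None
    toolkit = parse_toolkit_version(nvcc) if nvcc else None
    if driver is None or toolkit is None:
        return None
    return toolkit_newer_than_driver(driver, toolkit)
```

Running `check_host()` both inside the container and on the host gives two results to compare. A `True` result means the toolkit is newer than what the driver reports, which is generally the kind of mismatch worth fixing first.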
I was able to fix this issue, though I did change a few things all at once so I can't be 100% sure what precisely did it. Here's what I did:
Things I'm pretty sure of:
- Downgrading gcc and g++ to version 10:
    sudo apt install gcc-10 g++-10
    export CC=/usr/bin/gcc-10
    export CXX=/usr/bin/g++-10
    export CUDA_ROOT=/usr/local/cuda
    ln -s /usr/bin/gcc-10 $CUDA_ROOT/bin/gcc
    ln -s /usr/bin/g++-10 $CUDA_ROOT/bin/g++
- Upgrading cuda-toolkit and NVIDIA drivers
- Downgrading CMake to 3.28:
    sudo snap refresh cmake --channel=3.28/stable
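
Before rebuilding, it may be worth confirming which compiler and CMake the build will actually pick up. The sketch below is only an illustration: the helper names are hypothetical, and the target versions (GCC 10, CMake 3.28) are simply the ones reported in the comment above, not documented LightGBM requirements.

```python
import os
import re
import shutil
import subprocess
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Optional[Version]:
    """Return the first dotted number in text, e.g. '3.28.5' -> (3, 28, 5)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    return tuple(int(p) for p in match.group(0).split(".")) if match else None


def tool_version(cmd: List[str]) -> Optional[Version]:
    """Run cmd and parse its stdout; None if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    out = subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    return parse_version(out)


def matches_reported_fix(gcc: Optional[Version], cmake: Optional[Version]) -> bool:
    """True if GCC major is 10 and CMake is 3.28.x, as in the comment above."""
    if gcc is None or cmake is None:
        return False
    return gcc[0] == 10 and cmake[:2] == (3, 28)


def current_toolchain() -> Dict[str, Optional[Version]]:
    """Versions of the compiler named by $CC (default: gcc) and of cmake."""
    cc = os.environ.get("CC", "gcc")
    return {
        # -dumpfullversion prints the full version on GCC 7 and later;
        # -dumpversion is the fallback for older compilers.
        "gcc": tool_version([cc, "-dumpfullversion", "-dumpversion"]),
        "cmake": tool_version(["cmake", "--version"]),
    }
```

Calling `matches_reported_fix(**current_toolchain())` in the same shell used for the build shows whether the `CC` export and the snap channel change took effect.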
Description
My issue is very similar to #6705, though I believe I can rule out the compute capability of my GPUs as the issue, since it is >8 (NVIDIA RTX A6000).
Reproducible example
Environment info
LightGBM version or commit hash:
Commit: 5151fe8
Command(s) you used to install LightGBM
I followed the instructions here, with a slight modification.
Note that only the inclusion of --target _lightgbm worked for me. Otherwise I encountered the same issue as reported in #5089.
OS: Ubuntu 22.04.5 LTS
Cuda version:
NVIDIA driver version: 535.183.01
GPU: NVIDIA RTX A6000
Python version: 3.11.9
Additional information
Traceback from the example script: