
[BUG] forest_inference_demo.ipynb is broken #6008

Closed
jameslamb opened this issue Aug 5, 2024 · 1 comment
Labels
? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments


jameslamb commented Aug 5, 2024

Describe the bug

The forest_inference_demo.ipynb notebook is broken: loading an XGBoost model with FIL fails.

I've observed this behavior on the 24.08 release of cuml and all its dependencies. I suspect it's a problem on 24.10 as well, but haven't tested that yet.

Steps/Code to reproduce bug

Created a conda environment and installed cuml, jupyterlab, and xgboost into it.

setup

Ran the following from the root of the repo, on a machine with V100s and CUDA 12.2.

conda env create \
    --name cuml-cu12-dev \
    --file ./conda/environments/all_cuda-125_arch-x86_64.yaml

source activate cuml-cu12-dev

conda install \
    -c conda-forge \
    -c rapidsai-nightly \
    -c rapidsai \
    --yes \
        cuml=24.8.* \
        jupyterlab

Then launched JupyterLab.

jupyter lab --ip 0.0.0.0 --port 1234

Ran the cells in notebooks/forest_inference_demo.ipynb in order.

This call to ForestInference.load()

fil_model = ForestInference.load(

Fails like this:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[12], line 1
----> 1 fil_model = ForestInference.load(
      2     filename=model_path,
      3     algo='BATCH_TREE_REORG',
      4     output_class=True,
      5     threshold=0.50,
      6     model_type='xgboost'
      7 )

File fil.pyx:1033, in cuml.fil.fil.ForestInference.load()

File fil.pyx:212, in cuml.fil.fil.TreeliteModel.from_filename()

RuntimeError: Failed to load xgb.model (basic_string::_M_replace_aux)

This same error can be seen in the most recent run of this notebook in the CI for rapidsai/docker: https://github.com/rapidsai/docker/actions/runs/10244736365/job/28356773321#step:9:15

Expected behavior

Expected this notebook to run end-to-end without error.

Environment details (please complete the following information):

output of 'conda info'
     active environment : cuml-cu12-dev
    active env location : /raid/jlamb/miniforge/envs/cuml-cu12-dev
            shell level : 1
       user config file : /home/nfs/jlamb/.condarc
 populated config files : /raid/jlamb/miniforge/.condarc
                          /home/nfs/jlamb/.condarc
          conda version : 23.7.4
    conda-build version : 24.5.1
         python version : 3.10.12.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.2=0
                          __glibc=2.31=0
                          __linux=5.4.0=0
                          __unix=0=0
       base environment : /raid/jlamb/miniforge  (writable)
      conda av data dir : /raid/jlamb/miniforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /raid/jlamb/miniforge/pkgs
                          /home/nfs/jlamb/.conda/pkgs
       envs directories : /raid/jlamb/miniforge/envs
                          /home/nfs/jlamb/.conda/envs
               platform : linux-64
             user-agent : conda/23.7.4 requests/2.32.3 CPython/3.10.12 Linux/5.4.0-182-generic ubuntu/20.04.6 glibc/2.31
                UID:GID : 10349:10004
             netrc file : None
           offline mode : False
output of 'nvidia-smi'
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   31C    P0              41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   35C    P0              43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0              43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Additional context

This was only noticed because of a CI failure over in rapidsai/docker: rapidsai/docker#699 (comment).

Ideally, it would be caught in cuml's CI. As of this writing, this notebook is not tested in CI:

SKIPPING: ./forest_inference_demo.ipynb (suspected Dask usage, not currently automatable)

(build link)

This notebook has been running in rapidsai/docker CI for a while. It passed on 24.08 as recently as 2 weeks ago.

Testing cuml/forest_inference_demo.ipynb
Completed cuml/forest_inference_demo.ipynb with 1 warnings and 0 errors

(build link)

So I suspect this is a result of a recent change. Maybe some mix of these:

@jameslamb added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Aug 5, 2024

hcho3 commented Aug 5, 2024

The error is likely due to a change in XGBoost. Starting with version 2.1.0, XGBoost defaults to the UBJSON format when saving models.

Treelite 4.3 supports UBJSON, but regrettably FIL has not yet been updated to recognize the UBJSON format, hence the error. Let me prepare a pull request.
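Until FIL learns UBJSON, one possible stopgap is to save the model with an explicit `.json` extension, since XGBoost's `Booster.save_model()` selects the serialization format from the file suffix. A minimal sketch of that rename; `json_model_path` is a hypothetical helper, not part of the notebook or any library:

```python
from pathlib import Path

def json_model_path(path: str) -> str:
    # Hypothetical helper: XGBoost picks the serialization format from the
    # extension passed to Booster.save_model(). Since XGBoost 2.1.0 the
    # default (used for unrecognized extensions such as the notebook's
    # "xgb.model") is UBJSON, which this FIL release cannot parse; an
    # explicit ".json" suffix selects the older JSON format instead.
    return str(Path(path).with_suffix(".json"))

# e.g. bst.save_model(json_model_path("xgb.model")) would write "xgb.json",
# which Treelite/FIL can already read.
```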
