[BUG] SIGABRT in CUML RF, out of bounds memory usage #4046
Comments
Same kind of run, SIGABRT again. This time dmesg says:
Basically the cuML RF is not stable/usable as-is. This time the parameters were:
and data was
Tagging @vinaydes, who's looking into the issue.
Without a reproducer it is going to be difficult to debug this one. I created the following snippet to try to reproduce the crash. However, running it on two different GPUs (Titan V, RTX 3070 Ti) gave me no crash.
import sys
import numpy as np
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as cumlRFClassifier
import time
N_REPS = 50
# (91457, 331)
# OrderedDict([('output_type', 'numpy'), ('random_state', 840607124), ('verbose', False),
# ('n_estimators', 200), ('n_bins', 128), ('split_criterion', 1), ('max_depth', 18),
# ('max_leaves', 1024), ('max_features', 'auto'), ('min_samples_leaf', 1),
# ('min_samples_split', 10), ('min_impurity_decrease', 0.0)])
rf_params = {
    'n_estimators' : 200,
    'split_criterion' : 1,
    'bootstrap' : True,
    'max_samples' : 1.0,
    'max_depth' : 18,
    'max_leaves' : 1024,
    'max_features' : 'auto',
    'n_bins' : 128,
    'min_samples_leaf' : 1,
    'min_samples_split' : 10,
    'min_impurity_decrease' : 0.0,
    'accuracy_metric' : 'mse',
    'max_batch_size' : 128,
    'random_state' : 840607124,
    'n_streams' : 4,
    'output_type' : 'numpy',
    'verbose' : False
}
dataset_params = {
    'n_samples' : 91457,
    'n_features' : 331,
    'n_informative' : 60,
    'n_redundant' : 0,
    'n_repeated' : 0,
    'n_classes' : 2,
    'n_clusters_per_class' : 5,
    'weights' : None,
    'flip_y' : 0.1,
    'class_sep' : 1.0,
    'hypercube' : True,
    'shift' : 0.0,
    'scale' : 1.0,
    'shuffle' : True,
    'random_state' : None
}
start = time.time()
for i in range(N_REPS):
    [X, y] = make_classification(**dataset_params)
    X = np.float32(X)
    y = np.int32(y)
    cuml_cls = cumlRFClassifier(**rf_params)
    cuml_cls.fit(X, y)
end = time.time()
print('Time to fit = ', end - start)
# (91457, 492)
# OrderedDict([('output_type', 'numpy'), ('random_state', 620152258), ('verbose', False),
# ('n_estimators', 300), ('n_bins', 128), ('split_criterion', 0), ('max_depth', 20),
# ('max_leaves', 1024), ('max_features', 'auto'), ('min_samples_leaf', 1),
# ('min_samples_split', 2), ('min_impurity_decrease', 0.0)])
dataset_params['n_features'] = 492
rf_params['random_state'] = 620152258
rf_params['split_criterion'] = 0
rf_params['max_depth'] = 20
rf_params['min_samples_split'] = 2
start = time.time()
for i in range(N_REPS):
    [X, y] = make_classification(**dataset_params)
    X = np.float32(X)
    y = np.int32(y)
    cuml_cls = cumlRFClassifier(**rf_params)
    cuml_cls.fit(X, y)
end = time.time()
print('Time to fit = ', end - start)
# (91457, 498)
# OrderedDict([('output_type', 'numpy'), ('random_state', 1037940298), ('verbose', False),
# ('n_estimators', 300), ('n_bins', 128), ('split_criterion', 1), ('max_depth', 16),
# ('max_leaves', 1024), ('max_features', 'auto'), ('min_samples_leaf', 1),
# ('min_samples_split', 10), ('min_impurity_decrease', 0.0)])
dataset_params['n_features'] = 498
rf_params['random_state'] = 1037940298
rf_params['n_estimators'] = 300
rf_params['split_criterion'] = 1
rf_params['min_samples_split'] = 10
start = time.time()
for i in range(N_REPS):
    [X, y] = make_classification(**dataset_params)
    X = np.float32(X)
    y = np.int32(y)
    cuml_cls = cumlRFClassifier(**rf_params)
    cuml_cls.fit(X, y)
end = time.time()
print('Time to fit = ', end - start)
Can you try to run this snippet on the machine where you see the crash? Is any cuML RF code crashing for you, or is it observed only for certain values of the parameters?
OK, will try to set up a repro. It happens quite often. New ones:
@vinaydes thanks for your patience. I've been trying out various cuML things and got back to this because it still happens all the time. This is a repro for me, but it does not fail every time I run it. It might not fail for you at all, since the bug seems to involve sometimes accessing the wrong memory.
gives
This is not a rare situation. As I mentioned, when changing various data and parameters for the model, this happens within about 100 fits. It's a fatal crash, so it makes the algorithm impossible to rely on until it is fixed. FYI, this is on the latest nightly as of 2 nights ago.
Relevant parts from an install, from doing
Here's another one that also doesn't crash every time, but does most of the time:
https://0xdata-public.s3.amazonaws.com/jon/sigabrt2.pkl.zip
I tried changing various parameters back to defaults, but none mattered until I reached
So
does not crash. But that just means unlimited leaves. However, a max_depth of 18 allows up to 262144 leaves, so limiting the leaves to 1024 shouldn't violate some condition that happens to be unprotected. And going back to defaults for
doesn't crash. But this still crashes, suggesting the experimental backend doesn't matter / isn't the cause.
However, this does not crash:
So it seems max_batch_size matters even when use_experimental_backend = False, which makes no sense given the documentation that says:
And this still crashes:
so it seems unrelated to streams.
Another odd thing is that I got those parameters by calling get_params() on the model that crashed on me. The max_batch_size is always 4096, but the documentation says the default is 128, so something is wrong in either the docs or the code. sigabrt.pkl is the same, with max_batch_size automatically set to 4096 according to get_params(), even though the default is supposed to be 128.
Thanks @pseudotensor. I'll run your samples and see if I can reproduce the issue.
The default batch size was recently changed from 128 to 4096. You are using nightly cuML but probably looking at the stable release documentation. The nightly documentation reflects the correct default value of 4096: https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest.
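For reference, here is a minimal sketch (assuming a working cuML install) of how to check which default an installed build actually uses, by instantiating an unfitted estimator and reading the `get_params()` output discussed above:

```python
# Minimal check of the effective max_batch_size default (sketch, not from the thread).
from cuml.ensemble import RandomForestClassifier

params = RandomForestClassifier().get_params()
# Recent nightlies report 4096 here; the older stable docs said 128.
print(params['max_batch_size'])
```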
@vinaydes OK, but with 4096 I get all of the above SIGABRTs. Only when using 128 do I happen to not hit them. So it seems the new default exposes major issues. I gave many reproducible examples.
Got it, I am investigating the issue.
Thanks to @venkywonka, we seem to have reached the cause of this issue. For now, to unblock you, I would suggest leaving the `max_leaves` parameter at its default (unlimited). @RAMitchell This issue is related to how we are updating `n_leaves`.
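To make the suggested workaround concrete, here is a hedged sketch based on the observations in this thread (it reuses the `rf_params`, `X`, and `y` from the reproducer snippet above, and assumes -1 means an uncapped leaf count as in the cuML docs):

```python
# Sketch of the settings reported as stable in this thread; not an official recommendation.
from cuml.ensemble import RandomForestClassifier as cumlRFClassifier

safe_params = dict(rf_params)
safe_params['max_leaves'] = -1       # leave the leaf count uncapped (default)
safe_params['max_batch_size'] = 128  # the pre-change default, reported above as not crashing

cuml_cls = cumlRFClassifier(**safe_params)
cuml_cls.fit(X, y)
```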
@pseudotensor Thanks for your patience. PR #4126 should fix this issue. Please test it in your setup once the PR is merged.
Great, thanks! Will this be part of the 21.08 release, or is it too late for that? It seems like a critical bug whose fix needs to be in 21.08.
Yes, it should be part of 21.08.
Fixes issue #4046. In `nodeSplitKernel`, each thread calls `leafBasedOnParams()`, which reads the global variable `n_leaves`. Different threads from the same threadblock read `n_leaves` at different times, and between two such reads its value can be changed by another threadblock. Thus one or a few threads might conclude that `max_leaves` has been reached, while the rest of the threads conclude otherwise. This caused a crash while partitioning the samples. In the solution provided here, instead of every thread reading `n_leaves`, only one thread per threadblock reads the value and broadcasts it to every other thread via shared memory. This ensures complete agreement on the `max_leaves` criterion among the threads of a threadblock. Performance results to be posted shortly.
Authors:
- Vinay Deshpande (https://github.com/vinaydes)
Approvers:
- Venkat (https://github.com/venkywonka)
- Rory Mitchell (https://github.com/RAMitchell)
- Thejaswi. N. S (https://github.com/teju85)
URL: #4126
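For intuition, the following is an illustrative Numba CUDA sketch of the pattern the PR describes, not the actual cuML `nodeSplitKernel` (the kernel and variable names here are made up): one thread per block takes a single snapshot of the global counter and broadcasts it through shared memory, so every thread in the block evaluates the `max_leaves` condition against the same value.

```python
# Illustrative only: one thread per block reads a global counter once and
# broadcasts it via shared memory so the whole block agrees on the comparison.
import numpy as np
from numba import cuda, int32

@cuda.jit
def below_max_leaves(n_leaves, max_leaves, flags):
    snapshot = cuda.shared.array(shape=1, dtype=int32)
    if cuda.threadIdx.x == 0:
        # Single read of the value that other blocks may be updating concurrently.
        snapshot[0] = n_leaves[0]
    cuda.syncthreads()  # make the snapshot visible to every thread in the block
    tid = cuda.grid(1)
    if tid < flags.size:
        # All threads of a block now reach the same conclusion about max_leaves.
        flags[tid] = 1 if snapshot[0] < max_leaves else 0

n_leaves = cuda.to_device(np.array([100], dtype=np.int32))
flags = cuda.device_array(256, dtype=np.uint8)
below_max_leaves[4, 64](n_leaves, 1024, flags)
print(flags.copy_to_host()[:8])
```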
Confirming that the fix was just in time for 21.08 and was just merged. Thanks!!!
This issue has been labeled
This was closed a long time ago: #4046 (comment)
Describe the bug
SIGABRT, seemingly from an out-of-bounds memory access.
Steps/Code to reproduce bug
Unknown, but the parameters were just Kaggle Paribas with various frequency-encoding features added to get to
(91457, 331)
size.
parameters
For a binary classification problem.
No messages in the console at all, even though it was run in debug mode with verbose=4. All I got was SIGABRT, and in
dmesg
this:
Expected behavior
Not to crash; be more stable.
Environment details (please complete the following information):
conda_list.txt.zip
Additional context
If I hit it again I will try to produce a repro, but I expect just various testing on NVIDIA's side will reveal it. I've only been using cuML RF for a day and already hit this after (maybe) 200 fits on small data.