Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Support for sample_weight parameter in LogisticRegression #3572

Merged

Conversation

viclafargue
Copy link
Contributor

@viclafargue viclafargue commented Mar 2, 2021

Closes #3559

This PR adds the sample_weight and class_weight parameters to the LogisticRegression estimator.

@viclafargue viclafargue requested review from a team as code owners March 2, 2021 17:15
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Mar 2, 2021
@viclafargue viclafargue changed the title [WIP] sample_weight for LogisticRegression [WIP] sample_weight parameter for LogisticRegression Mar 3, 2021
@viclafargue viclafargue changed the title [WIP] sample_weight parameter for LogisticRegression [WIP] Support for sample_weight parameter in LogisticRegression Mar 3, 2021
Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ looks very clean, most comments were on the Python side (also looks clean, mostly minor stuff)

python/cuml/linear_model/logistic_regression.pyx Outdated Show resolved Hide resolved
python/cuml/solvers/qn.pyx Outdated Show resolved Hide resolved
python/cuml/solvers/qn.pyx Outdated Show resolved Hide resolved
python/cuml/test/test_linear_model.py Outdated Show resolved Hide resolved
python/cuml/test/test_linear_model.py Outdated Show resolved Hide resolved
python/cuml/test/test_linear_model.py Show resolved Hide resolved
python/cuml/test/test_linear_model.py Show resolved Hide resolved
@dantegd dantegd added feature request New feature or request non-breaking Non-breaking change labels Mar 5, 2021
@viclafargue viclafargue requested a review from a team as a code owner March 12, 2021 16:29
@github-actions github-actions bot added the CMake label Mar 12, 2021
@viclafargue
Copy link
Contributor Author

This PR requires a RAFT PR to be approved and merged for CI to pass : raft:#168 .

Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just 2 very small things, otherwise it looks great

@@ -126,6 +126,15 @@ class LogisticRegression(Base,
If False, the model expects that you have centered the data.
class_weight: None
Custom class weighs are currently not supported.
class_weight: dict or 'balanced', default=None
By default all classes have a weight one. However, a dictionnary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By default all classes have a weight one. However, a dictionnary
By default all classes have a weight one. However, a dictionary

if class_weight == 'balanced':
self.class_weight_ = 'balanced'
else:
classes = list(class_weight.keys())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but my comment applies more so that the classes that the dictionary of weights has (as keys) have to coincide with the classes of the operation self.classes_ = cp.unique(y_m), no? i.e if the user passes a class in fit that was not in the dict of class_weights (when it was passed as a dict) it would be an error, which would be consistent with what Scikit does if I’m not mistaken https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/utils/class_weight.py#L67

@dantegd dantegd added the 4 - Waiting on Author Waiting for author to respond to review label Mar 18, 2021
@viclafargue
Copy link
Contributor Author

viclafargue commented Mar 18, 2021

@dantegd In my understanding there's two cases:

  • class_weights does not specify a weight for all of the unique labels provided in fit -> not an issue, they will take a value of 1 as a default
  • some weight(s) are explicitly specified in class_weights but aren't present in the list of unique labels -> should raise an error

Is this right, or am I mistaken somewhere? If I'm right, thanks for noticing this and providing the link, I'll fix this.

@dantegd
Copy link
Member

dantegd commented Mar 18, 2021

@viclafargue yes indeed that's the case and thanks for writing it more clear than I did :). For reference we can see how that works in scikit indeed:

>>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>> weights = {0: 0.2, 1:0.4}
>>> model1 = LogisticRegression(class_weight=weights)
>>> a = np.random.rand(8).reshape(4,2)
>>> a
array([[0.05872595, 0.7262419 ],
       [0.78006228, 0.47405287],
       [0.52005636, 0.63693384],
       [0.66651034, 0.18605195]])
>>> b = np.array([0, 0, 0, 1])
>>> model1.fit(a, b)
LogisticRegression(class_weight={0: 0.2, 1: 0.4})
>>> model1.__dict__
{'penalty': 'l2', 'dual': False, 'tol': 0.0001, 'C': 1.0, 'fit_intercept': True, 'intercept_scaling': 1, 'class_weight': {0: 0.2, 1: 0.4}, 'random_state': None, 'solver': 'lbfgs', 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None, 'n_features_in_': 2, 'classes_': array([0, 1]), 'coef_': array([[ 0.04954003, -0.10064803]]), 'intercept_': array([-0.38777167]), 'n_iter_': array([6], dtype=int32)}
>>> c = np.array([0, 2, 0, 1])
>>> model1.fit(a, c)
LogisticRegression(class_weight={0: 0.2, 1: 0.4})
>>> model1.__dict__
{'penalty': 'l2', 'dual': False, 'tol': 0.0001, 'C': 1.0, 'fit_intercept': True, 'intercept_scaling': 1, 'class_weight': {0: 0.2, 1: 0.4}, 'random_state': None, 'solver': 'lbfgs', 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None, 'n_features_in_': 2, 'classes_': array([0, 1, 2]), 'coef_': array([[-0.13812798,  0.08768523],
       [ 0.00804634, -0.10698594],
       [ 0.13008164,  0.0193007 ]]), 'intercept_': array([-0.25710905, -0.26129142,  0.51840047]), 'n_iter_': array([6], dtype=int32)}
>>> d = np.array([0, 2, 0, 2])
>>> model1.fit(a, d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 1407, in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 665, in _logistic_regression_path
    class_weight_ = compute_class_weight(class_weight,
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/utils/class_weight.py", line 68, in compute_class_weight
    raise ValueError("Class label {} not present.".format(c))
ValueError: Class label 1 not present.
>>> 

@JohnZed JohnZed changed the title [WIP] Support for sample_weight parameter in LogisticRegression [REVIEW] Support for sample_weight parameter in LogisticRegression Mar 18, 2021
Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@viclafargue to merge could you check that we're pinning raft to the latest possible commit, and add an xfail to the non monotonic sil score test if it hasn't been done already in branch-0.19

@dantegd dantegd added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Waiting on Author Waiting for author to respond to review labels Mar 19, 2021
@github-actions github-actions bot removed the CMake label Mar 23, 2021
@codecov-io
Copy link

Codecov Report

Merging #3572 (a15e538) into branch-0.19 (c2f246a) will increase coverage by 0.13%.
The diff coverage is 88.23%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #3572      +/-   ##
===============================================
+ Coverage        80.87%   81.00%   +0.13%     
===============================================
  Files              228      228              
  Lines            17630    17811     +181     
===============================================
+ Hits             14258    14428     +170     
- Misses            3372     3383      +11     
Flag Coverage Δ
dask 45.13% <7.35%> (+0.17%) ⬆️
non-dask 73.31% <88.23%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/cuml/solvers/qn.pyx 97.20% <85.71%> (-0.43%) ⬇️
python/cuml/linear_model/logistic_regression.pyx 89.16% <88.88%> (+0.20%) ⬆️
python/cuml/linear_model/linear_regression.pyx 88.23% <0.00%> (-3.53%) ⬇️
python/cuml/dask/solvers/cd.py 100.00% <0.00%> (ø)
python/cuml/common/numba_utils.py 0.00% <0.00%> (ø)
python/cuml/internals/global_settings.py 100.00% <0.00%> (ø)
python/cuml/dask/preprocessing/encoders.py 100.00% <0.00%> (ø)
python/cuml/metrics/_ranking.py 98.57% <0.00%> (+0.02%) ⬆️
python/cuml/preprocessing/encoders.py 95.10% <0.00%> (+0.02%) ⬆️
python/cuml/common/array_descriptor.py 98.24% <0.00%> (+0.03%) ⬆️
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2f246a...a15e538. Read the comment docs.

@dantegd
Copy link
Member

dantegd commented Mar 23, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5bb564e into rapidsai:branch-0.19 Mar 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support sample_weights in logistic regression
3 participants