[REVIEW] Decorator to generate docstrings with autodetection of parameters #2635

Merged
merged 36 commits into from
Aug 19, 2020
Changes from 35 commits
83b3cd7
FEA New version of docstring decorator with autodetection of parameters
dantegd Aug 3, 2020
848eb68
FEA Change all linear_models to use decorator
dantegd Aug 4, 2020
4b1532b
FEA Add capability of decorator to skip generating parameters header
dantegd Aug 4, 2020
5991996
FEA Change cluster estimators to use decorator
dantegd Aug 4, 2020
6446a37
FEA Add new insert_to_docstring decorator and multiple updates
dantegd Aug 5, 2020
97e74c7
Merge remote-tracking branch 'original/branch-0.15' into 015-fea-doc-…
dantegd Aug 5, 2020
b362e68
FEA More improvements and more classes migrated
dantegd Aug 5, 2020
9874510
FIX Undo temporary changes
dantegd Aug 5, 2020
6ec313f
DOC Added entry to changelog
dantegd Aug 6, 2020
534884f
FIX Small fixes for test_fit, for some reason it was not detecting mu…
dantegd Aug 6, 2020
05d67c6
Merge remote-tracking branch 'original/branch-0.15' into 015-fea-doc-…
dantegd Aug 10, 2020
416e22c
Merge branch-0.15 into 015-fea-doc-deco2
dantegd Aug 12, 2020
de5d5e4
FIX Multiple fixes and improvements from PR feedback
dantegd Aug 12, 2020
6ea94b5
DOC Added further clarification to decorator docstring
dantegd Aug 12, 2020
3f4ca19
Update python/cuml/common/doc_utils.py
dantegd Aug 12, 2020
4aa01dd
Update python/cuml/common/doc_utils.py
dantegd Aug 12, 2020
eedb9f5
Update python/cuml/common/doc_utils.py
dantegd Aug 12, 2020
656a4a6
Update python/cuml/common/doc_utils.py
dantegd Aug 12, 2020
7a7b381
Update python/cuml/naive_bayes/naive_bayes.py
dantegd Aug 12, 2020
648d756
Update python/cuml/common/base.pyx
dantegd Aug 12, 2020
153a532
Update python/cuml/common/base.pyx
dantegd Aug 12, 2020
5264b45
Update python/cuml/naive_bayes/naive_bayes.py
dantegd Aug 12, 2020
2aaf10a
Merge remote-tracking branch 'original/branch-0.15' into 015-fea-doc-…
dantegd Aug 14, 2020
f8cfc61
FIX Incorporate multiple fixes from review to remove sphinx warnings
dantegd Aug 19, 2020
46f1c04
Merge branch-0.15 into 015-fea-doc-deco2
dantegd Aug 19, 2020
c9c73de
Merge remote into local
dantegd Aug 19, 2020
86ee107
ENH Add env variable and autodetection of ipython to see when to buil…
dantegd Aug 19, 2020
b815082
ENH Add pickled models to gitignore and autogenerate check to insert …
dantegd Aug 19, 2020
a788a91
FIX PEP8 fixes
dantegd Aug 19, 2020
e705acb
FIX DOC Fix example of KMeans
dantegd Aug 19, 2020
e3a605f
FIX Remove check for generating docstrings based on feedback
dantegd Aug 19, 2020
75e228a
Merge branch-0.15 into 015-fea-doc-deco2
dantegd Aug 19, 2020
02e4c08
FIX remove accidental line break
dantegd Aug 19, 2020
5b4af21
FIX Remove stray variable in init
dantegd Aug 19, 2020
2202aa4
FIX Remove stray variable in conf.py
dantegd Aug 19, 2020
0e11828
ENH Add prepend parameters decorator option
dantegd Aug 19, 2020
3 changes: 3 additions & 0 deletions .gitignore
@@ -29,6 +29,9 @@ log
dask-worker-space/
tmp/

## files pickled in notebook when ran during python docstring generation
docs/source/*.model

## eclipse
.project
.cproject
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -24,6 +24,7 @@
- PR #2594: Confidence intervals for ARIMA forecasts
- PR #2607: Add support for probability estimates in SVC
- PR #2618: SVM class and sample weights
- PR #2635: Decorator to generate docstrings with autodetection of parameters
- PR #2270: Multi class MNMG RF
- PR #2661: CUDA-11 support for single-gpu code
- PR #2322: Sparse FIL forests with 8-byte nodes
6 changes: 6 additions & 0 deletions docs/source/api.rst
@@ -201,6 +201,12 @@ Mini Batch SGD Regressor
.. autoclass:: cuml.MBSGDRegressor
:members:

Multinomial Naive Bayes
-----------------------

.. autoclass:: cuml.MultinomialNB
:members:

Stochastic Gradient Descent
---------------------------

3 changes: 2 additions & 1 deletion python/cuml/__init__.py
@@ -50,6 +50,8 @@
from cuml.metrics.cluster.adjustedrandindex import adjusted_rand_score
from cuml.metrics.regression import r2_score

from cuml.naive_bayes.naive_bayes import MultinomialNB
JohnZed marked this conversation as resolved.

from cuml.neighbors.nearest_neighbors import NearestNeighbors

from cuml.preprocessing.LabelEncoder import LabelEncoder
@@ -84,7 +86,6 @@

from cuml.common.memory_utils import set_global_output_type, using_output_type


# Version configuration

from ._version import get_versions
29 changes: 14 additions & 15 deletions python/cuml/cluster/dbscan.pyx
@@ -30,6 +30,7 @@ from libc.stdlib cimport calloc, malloc, free

from cuml.common.array import CumlArray
from cuml.common.base import Base
from cuml.common.doc_utils import generate_docstring
from cuml.common.handle cimport cumlHandle
from cuml.common import input_to_cuml_array

@@ -204,19 +205,17 @@ class DBSCAN(Base):
if self.max_mbytes_per_batch is None:
self.max_mbytes_per_batch = 0

@generate_docstring(skip_parameters_heading=True)
Contributor

This may be difficult, but I noticed in the output that the order of the parameters no longer matches the order of the function signature, which looks a bit ugly in the final output. Matching the order may be hard to accomplish given the way it's currently implemented; it would require some significant parsing.

Member Author

I want to avoid the parsing to reduce the time, as I mentioned above. Parameters appearing out of order might be a minor sacrifice to keep this as fast as possible. One of the trickier parts has been balancing the capabilities of the decorator against its impact on the module's import time.
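
For context on the ordering concern in this thread, here is a minimal sketch of one way a decorator can detect parameters without parsing the existing docstring: iterating over `inspect.signature(func).parameters` yields names in signature order. This is not the PR's implementation in `doc_utils.py`; `autodoc_sketch`, `_CANNED`, and `Demo` are hypothetical names used only for illustration.

```python
import inspect

# Hypothetical canned descriptions keyed by parameter name (illustration only).
_CANNED = {
    "X": (
        "X : array-like (device or host) shape = (n_samples, n_features)\n"
        "    Dense matrix (floats or doubles)."
    ),
    "sample_weight": (
        "sample_weight : array-like (device or host) shape = (n_samples,)\n"
        "    The weights for each observation in X."
    ),
}


def autodoc_sketch(func):
    """Append canned parameter entries to func.__doc__ in signature order."""
    # Signature.parameters is an ordered mapping, so the docstring text
    # itself never has to be parsed to recover the order.
    entries = [
        _CANNED[name]
        for name in inspect.signature(func).parameters
        if name in _CANNED
    ]
    if entries:
        summary = inspect.cleandoc(func.__doc__ or "")
        func.__doc__ = "\n".join(
            [summary, "", "Parameters", "----------", *entries, ""]
        )
    return func


class Demo:
    @autodoc_sketch
    def fit(self, X, sample_weight=None):
        """Compute k-means clustering with X."""
        return self


print(Demo.fit.__doc__)
```

The trade-off discussed in the thread still applies to a sketch like this: entries already written by hand in a docstring are not reordered, since that would require parsing the existing text.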

def fit(self, X, out_dtype="int32"):
"""
Perform DBSCAN clustering from features.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features).
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy
out_dtype: dtype Determines the precision of the output labels array.
default: "int32". Valid values are { "int32", np.int32,
"int64", np.int64}. When the number of samples exceed
"int64", np.int64}.

"""
self._set_n_features_in(X)
self._set_output_type(X)
@@ -321,21 +320,21 @@ class DBSCAN(Base):

return self

@generate_docstring(skip_parameters_heading=True,
return_values={'name': 'preds',
'type': 'dense',
'description': 'Cluster labels',
'shape': '(n_samples, 1)'})
def fit_predict(self, X, out_dtype="int32"):
"""
Performs clustering on input_gdf and returns cluster labels.
Performs clustering on X and returns cluster labels.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy

Returns
-------
y : cuDF Series, shape (n_samples)
cluster labels
out_dtype: dtype Determines the precision of the output labels array.
default: "int32". Valid values are { "int32", np.int32,
"int64", np.int64}.

"""
self.fit(X, out_dtype)
return self.labels_
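
The `return_values` dictionaries passed to `generate_docstring` in this file carry a name, a type, a description, and optionally a shape. As a rough sketch of how such a dictionary could be turned into a numpydoc-style `Returns` section, consider the hypothetical helper `render_returns` below. This is an assumption for illustration: the actual rendering lives in `doc_utils.py` and is not shown in this diff, and the expansion of `'dense'` into a type string is made up here.

```python
# Hypothetical mapping from the short type key to display text (illustration only).
_TYPE_TEXT = {"dense": "array-like (device or host)"}


def render_returns(spec):
    """Render a return_values-style dict as a numpydoc 'Returns' section."""
    type_text = _TYPE_TEXT.get(spec["type"], spec["type"])
    header = f"{spec['name']} : {type_text}"
    if "shape" in spec:
        header += f", shape = {spec['shape']}"
    return "\n".join(["Returns", "-------", header, "    " + spec["description"]])


spec = {
    "name": "preds",
    "type": "dense",
    "description": "Cluster labels",
    "shape": "(n_samples, 1)",
}
print(render_returns(spec))
```

Keeping the specification as a plain dictionary means the same text can be reused by several methods (`fit_predict`, `predict`) without repeating the full section in each docstring.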
103 changes: 24 additions & 79 deletions python/cuml/cluster/kmeans.pyx
@@ -31,6 +31,7 @@ from libc.stdlib cimport calloc, malloc, free

from cuml.common.array import CumlArray
from cuml.common.base import Base
from cuml.common.doc_utils import generate_docstring
from cuml.common.handle cimport cumlHandle
from cuml.common import input_to_cuml_array
from cuml.cluster.kmeans_utils cimport *
@@ -140,7 +141,7 @@ class KMeans(Base):
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float = KMeans(n_clusters=2)
kmeans_float.fit(b)

print("labels:")
@@ -306,21 +307,11 @@ class KMeans(Base):
params.n_init = <int>self.n_init
self._params = params

@generate_docstring()
JohnZed marked this conversation as resolved.
def fit(self, X, sample_weight=None):
"""
Compute k-means clustering with X.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features).
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy

sample_weight : array-like (device or host) shape = (n_samples,), default=None # noqa
The weights for each observation in X. If None, all observations
are assigned equal weight.

"""
self._set_n_features_in(X)
self._set_output_type(X)
@@ -407,21 +398,14 @@ class KMeans(Base):
del(sample_weight_m)
return self

@generate_docstring(return_values={'name': 'preds',
'type': 'dense',
'description': 'Cluster indexes',
'shape': '(n_samples, 1)'})
def fit_predict(self, X, sample_weight=None):
"""
Compute cluster centers and predict cluster index for each sample.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features).
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy

sample_weight : array-like (device or host) shape = (n_samples,), default=None # noqa
The weights for each observation in X. If None, all observations
are assigned equal weight.

"""
return self.fit(X, sample_weight=sample_weight).labels_

@@ -522,43 +506,29 @@ class KMeans(Base):
del(sample_weight_m)
return self._labels_.to_output(out_type), inertia

@generate_docstring(return_values={'name': 'preds',
'type': 'dense',
'description': 'Cluster indexes',
'shape': '(n_samples, 1)'})
def predict(self, X, convert_dtype=False, sample_weight=None):
"""
Predict the closest cluster each sample in X belongs to.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features).
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy

Returns
-------
labels : array
Which cluster each datapoint belongs to.
"""

labels, _ = self._predict_labels_inertia(X,
convert_dtype=convert_dtype,
sample_weight=sample_weight)
return labels

@generate_docstring(return_values={'name': 'X_new',
'type': 'dense',
'description': 'Transformed data',
'shape': '(n_samples, n_clusters)'})
def transform(self, X, convert_dtype=False):
"""
Transform X to a cluster-distance space.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features).
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy

convert_dtype : bool, optional (default = False)
When set to True, the transform method will, when necessary,
convert the input to the data type which was used to train the
model. This will increase memory used for the method.
"""

out_type = self._get_output_type(X)
@@ -614,54 +584,29 @@ class KMeans(Base):
del(X_m)
return preds.to_output(out_type)

@generate_docstring(return_values={'name': 'score',
'type': 'float',
'description': 'Opposite of the value \
of X on the K-means \
objective.'})
def score(self, X, y=None, sample_weight=None, convert_dtype=True):
"""
Opposite of the value of X on the K-means objective.

Parameters
----------
X : array-like (device or host) shape (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features).
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy
y : Ignored
Not used, present here for API consistency by convention.
sample_weight : array-like (device or host) of shape (n_samples,),
default=None. Acceptable formats: cuDF DataFrame, NumPy ndarray,
Numba device ndarray, cuda array interface compliant array like
CuPy.
convert_dtype : bool, optional (default = False)
When set to True, the transform method will, when necessary,
convert the input to the data type which was used to train the
model. This will increase memory used for the method.


Returns
-------
score: float
Opposite of the value of X on the K-means objective.
"""

return -1 * self._predict_labels_inertia(
X, convert_dtype=convert_dtype,
sample_weight=sample_weight)[1]

@generate_docstring(return_values={'name': 'X_new',
'type': 'dense',
'description': 'Transformed data',
'shape': '(n_samples, n_clusters)'})
def fit_transform(self, X, convert_dtype=False):
"""
Compute clustering and transform X to cluster-distance space.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features).
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy

convert_dtype : bool, optional (default = False)
When set to True, the fit_transform method will automatically
convert the input to the data type which was used to train the
model. This will increase memory used for the method.

"""
return self.fit(X).transform(X, convert_dtype=convert_dtype)
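
The `score` docstring in this file describes the result as the opposite of the value of X on the K-means objective, and the method returns `-1 *` the inertia from `_predict_labels_inertia`. Assuming the objective is the usual (optionally weighted) sum of squared distances from each sample to its nearest centroid, a pure-Python sketch of the arithmetic looks like this. `kmeans_score_sketch` is a hypothetical name and does not reflect the GPU implementation.

```python
def kmeans_score_sketch(X, centers, sample_weight=None):
    """Return the negated sum of squared distances to the nearest center."""
    if sample_weight is None:
        sample_weight = [1.0] * len(X)
    inertia = 0.0
    for point, weight in zip(X, sample_weight):
        # Squared Euclidean distance from this point to its closest center.
        nearest = min(
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        )
        inertia += weight * nearest
    return -inertia


X = [(0.0, 0.0), (2.0, 0.0)]
centers = [(0.0, 0.0), (3.0, 0.0)]
print(kmeans_score_sketch(X, centers))
```

Negating the inertia follows the convention that a higher score is better, so a tighter clustering yields a score closer to zero.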

40 changes: 13 additions & 27 deletions python/cuml/common/base.pyx
@@ -29,6 +29,7 @@ import inspect
from cudf.core import Series as cuSeries
from cudf.core import DataFrame as cuDataFrame
from cuml.common.array import CumlArray
from cuml.common.doc_utils import generate_docstring
from cupy import ndarray as cupyArray
from numba.cuda import devicearray as numbaArray
from numpy import ndarray as numpyArray
@@ -336,26 +337,16 @@ class RegressorMixin:

_estimator_type = "regressor"

@generate_docstring(return_values={'name': 'score',
'type': 'float',
'description': 'R^2 of self.predict(X) '
'wrt. y.'})
def score(self, X, y, **kwargs):
"""Scoring function for regression estimators
"""
Scoring function for regression estimators

Returns the coefficient of determination R^2 of the prediction.

Parameters
----------
X : array-like (device or host) shape = (n_samples, n_features)
Test samples on which we predict
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy
y : array-like (device or host) shape = (n_samples, n_features)
Ground truth values for predict(X)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy

Returns
-------
score : float
R^2 of self.predict(X) wrt. y.
"""
from cuml.metrics.regression import r2_score

@@ -373,21 +364,16 @@ class ClassifierMixin:

_estimator_type = "classifier"

@generate_docstring(return_values={'name': 'score',
'type': 'float',
'description': 'Accuracy of \
self.predict(X) wrt. y \
(fraction where y == \
pred_y)'})
def score(self, X, y, **kwargs):
"""
Scoring function for classifier estimators based on mean accuracy.

Parameters
----------
X : [cudf.DataFrame]
Test samples on which we predict
y : [cudf.Series, device array, or numpy array]
Ground truth values for predict(X)

Returns
-------
score : float
Accuracy of self.predict(X) wrt. y (fraction where y == pred_y)
"""
from cuml.metrics.accuracy import accuracy_score
from cuml.common import input_to_dev_array
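
The two `score` docstrings in this file refer to R^2 (the coefficient of determination) for regressors and mean accuracy for classifiers. A small pure-Python sketch of both definitions follows. It is independent of `cuml.metrics`, whose `r2_score` and `accuracy_score` are what the mixins import, and the function names `r2_sketch` and `accuracy_sketch` are hypothetical.

```python
def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    # Raises ZeroDivisionError if y_true is constant (SS_tot == 0).
    return 1.0 - ss_res / ss_tot


def accuracy_sketch(y_true, y_pred):
    """Fraction of samples where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


print(r2_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(r2_sketch([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
print(accuracy_sketch([0, 1, 1, 0], [0, 1, 0, 0]))
```

A perfect prediction gives an R^2 of 1, while always predicting the mean of `y` gives 0; accuracy is simply the fraction where `y == pred_y`, as the classifier docstring states.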