chore: Manual extraction of enum types and boundaries (#28)
* Manual extraction of enum types and boundaries

* style: apply automatic fixes of linters

Co-authored-by: duklin <[email protected]>
Co-authored-by: Lars Reimann <[email protected]>
3 people authored Dec 13, 2021
1 parent adccbca commit ce5e079
Showing 54 changed files with 4,270 additions and 0 deletions.
164 changes: 164 additions & 0 deletions refined_types/CategorizedDocstrings.md
@@ -0,0 +1,164 @@
# Scikit-learn docstrings

The following is a proposal for categorizing the parts of scikit-learn docstrings that are relevant to the task of extracting refined types (enums and boundaries).
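
As a concrete sketch of the target of this extraction, the hypothetical dataclasses below show one way an extracted enum or boundary could be modelled. The names (`EnumType`, `BoundaryType`) are illustrative assumptions, not part of this repository.

```
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EnumType:
    """A parameter restricted to a fixed set of string literals."""
    variants: list[str]
    default: str | None = None


@dataclass
class BoundaryType:
    """A numeric parameter restricted to an interval."""
    base_type: str  # e.g. "int" or "float"
    min: float
    max: float
    min_inclusive: bool = True
    max_inclusive: bool = True


# `algorithm : {'SAMME', 'SAMME.R'}, default='SAMME.R'` could become:
algorithm = EnumType(variants=["SAMME", "SAMME.R"], default="SAMME.R")

# `max_fpr : float > 0 and <= 1, default=None` could become:
max_fpr = BoundaryType(base_type="float", min=0.0, max=1.0,
                       min_inclusive=False, max_inclusive=True)
```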

## Enums
```
algorithm : {'SAMME', 'SAMME.R'}, default='SAMME.R'
algorithm : {"auto", "full", "elkan"}, default="auto"
algorithm : {'arpack', 'randomized'}, default='randomized'
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
average : {'micro', 'macro', 'samples', 'weighted'} or None, default='macro'
average : {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'
analyzer : {'word', 'char', 'char_wb'} or callable, default='word'
categories : 'auto' or a list of array-like, default='auto'
class_weight : dict or 'balanced', default=None
class_weight : dict, list of dict or "balanced", default=None
class_weight : dict, {class_label: weight} or "balanced", default=None
class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, default=None
criterion : {"gini", "entropy"}, default="gini"
criterion : {'friedman_mse', 'squared_error', 'mse', 'mae'}, default='friedman_mse'
decode_error : {'strict', 'ignore', 'replace'}, default='strict'
decision_function_shape : {'ovo', 'ovr'}, default='ovr'
drop : {'first', 'if_binary'} or a array-like of shape (n_features,), default=None
error_score : 'raise' or numeric, default=np.nan
gamma : {'scale', 'auto'} or float, default='scale'
handle_unknown : {'error', 'ignore'}, default='error'
init : estimator or 'zero', default=None
init : {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++'
input : {'filename', 'file', 'content'}, default='content'
kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
loss : {'squared_error', 'absolute_error', 'huber', 'quantile'}, default='squared_error'
max_features : {'auto', 'sqrt', 'log2'}, int or float
max_features : int, float or {"auto", "sqrt", "log2"}, default=None
max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"
multioutput : {'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average'
multioutput : {'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average'
multi_class : {'raise', 'ovr', 'ovo'}, default='raise'
multi_class : {'auto', 'ovr', 'multinomial'}, default='auto'
norm : {'l1', 'l2'}, default='l2'
normalize : {'true', 'pred', 'all'}, default=None
n_components : int, float or 'mle', default=None
order : {'C', 'F'}, default='C'
penalty : {'l2', 'l1', 'elasticnet'}, default='l2'
penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
precompute : 'auto', bool or array-like of shape (n_features, n_features)
selection : {'cyclic', 'random'}, default='cyclic'
splitter : {"best", "random"}, default="best"
strip_accents : {'ascii', 'unicode'}, default=None
stop_words : {'english'}, list, default=None
solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
solver : {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs'}, default='auto'
svd_solver : {'auto', 'full', 'arpack', 'randomized'}, default='auto'
weights : {'linear', 'quadratic'}, default=None
weights : {'uniform', 'distance'} or callable, default='uniform'
zero_division : "warn", 0 or 1, default="warn"
```
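
Lines in this category follow a fairly regular `name : {...}, default=...` shape, so a first pass could be a small regex. The sketch below assumes that shape and is not the extractor used in this repository:

```
import re


def parse_enum_line(line):
    # Split `name : rest` on the first " : " separator.
    name, _, rest = line.partition(" : ")
    # The variants are the comma-separated, quoted items inside `{...}`.
    set_match = re.search(r"\{([^}]*)\}", rest)
    if set_match is None:
        return None
    variants = [v.strip().strip("'\"") for v in set_match.group(1).split(",")]
    # The default value, if any, sits at the end of the line.
    default_match = re.search(r"default=['\"]?([^'\"]+)['\"]?$", rest)
    default = default_match.group(1) if default_match else None
    return name, variants, default


print(parse_enum_line("algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'"))
# ('algorithm', ['auto', 'ball_tree', 'kd_tree', 'brute'], 'auto')
```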

## Multi-part Enums
```
loss : str, default='hinge'
The possible options are 'hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron', or a regression loss: 'squared_error', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'
learning_rate : str, default='optimal'
- 'constant': `eta = eta0`
- 'optimal': `eta = 1.0 / (alpha * (t + t0))` where t0 is chosen by a heuristic proposed by Leon Bottou
- 'invscaling': `eta = eta0 / pow(t, power_t)`
- 'adaptive': eta = eta0, as long as the training keeps decreasing
strategy : str, default='mean'
If "mean", then replace missing values using the mean along each column
If "median", then replace missing values using the median along each column
If "most_frequent", then replace missing using the most frequent value along each column
If "constant", then replace missing values with fill_value
```
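
Here the type line only says `str`; the admissible values appear as quoted literals in the description body. A naive sketch is to collect every quoted literal from the body. Note that this over-collects (any quoted word is picked up), so a real extractor would need filtering:

```
import re

body = (
    "The possible options are 'hinge', 'log', 'modified_huber', "
    "'squared_hinge', 'perceptron', or a regression loss: 'squared_error', "
    "'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'"
)

# Every single-quoted literal in the description body.
options = re.findall(r"'([^']+)'", body)
print(options)
# ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron',
#  'squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive']
```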


## Boundaries
```
ccp_alpha : non-negative float, default=0.0
max_fpr : float > 0 and <= 1, default=None
quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)
```
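
Boundary phrasings are far less regular than enum sets, so a rule-based lookup is a plausible first sketch. The interval encodings below are assumptions read off the lines above, as `(min, min_inclusive, max, max_inclusive)` tuples:

```
import math

# Intervals for boundary phrasings seen above, encoded as
# (min, min_inclusive, max, max_inclusive).
RULES = {
    "non-negative float": (0.0, True, math.inf, False),
    "float > 0 and <= 1": (0.0, False, 1.0, True),
}


def boundary_for(phrase):
    # Returns None for phrasings without a rule, e.g. the `quantile_range`
    # constraint, which relates two values and is not a single interval.
    return RULES.get(phrase)


print(boundary_for("float > 0 and <= 1"))
# (0.0, False, 1.0, True)
```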

## Two-part boundaries
The boundary is defined in the body of the docstring.
```
validation_fraction : float, default=0.1
Must be between 0 and 1
C : float, default=1.0
Inverse of regularization strength; must be a positive float
verbose : int, default=0
For the liblinear and lbfgs solvers set verbose to any positive number for verbosity
l1_ratio : float, default=0.5
The ElasticNet mixing parameter, with ``0 <= l1_ratio <= 1``
C : float, default=1.0
Must be strictly positive
tol : float, default=0.0
Must be of range [0.0, infinity)
n_splits : int, default=5
Must be at least 2
```
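
Since the interval lives on a description line rather than on the type line, a sketch extractor would first pair each header with its description before applying phrase rules like the ones above. The pairing logic below assumes description lines are indented in the raw docstring text:

```
def split_entries(text):
    # Group each unindented `name : type` header with the indented
    # description lines that follow it.
    entries, current = [], None
    for line in text.splitlines():
        if line and not line[0].isspace():
            if current:
                entries.append(current)
            current = [line]
        elif current and line.strip():
            current.append(line.strip())
    if current:
        entries.append(current)
    return entries


raw = """validation_fraction : float, default=0.1
    Must be between 0 and 1
C : float, default=1.0
    Must be strictly positive"""

for header, description in split_entries(raw):
    print(header, "->", description)
# validation_fraction : float, default=0.1 -> Must be between 0 and 1
# C : float, default=1.0 -> Must be strictly positive
```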

## Boundaries in specific cases
```
max_samples : int or float, default=None
If float, then draw `max_samples * X.shape[0]` samples. Thus, `max_samples` should be in the interval `(0.0, 1.0]`.
max_df : float or int, default=1.0
If float in range [0.0, 1.0]
min_df : float or int, default=1
If float in range of [0.0, 1.0]
max_df : float in range [0.0, 1.0] or int, default=1.0
min_df : float in range [0.0, 1.0] or int, default=1
iterated_power : int or 'auto', default='auto'
Must be of range [0, infinity)
test_size : float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split
train_size : float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split
```
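
In this category the interval constrains only one branch of a union type (the float branch), which suggests attaching the boundary to a base type instead of to the whole parameter. A hypothetical extension of the earlier `BoundaryType` sketch:

```
from dataclasses import dataclass


@dataclass
class ConditionalBoundary:
    """A boundary that applies only when the value has a given type."""
    applies_to: str  # e.g. "float" in `max_df : float or int`
    min: float
    max: float
    min_inclusive: bool
    max_inclusive: bool


# `max_samples : int or float` with "should be in the interval (0.0, 1.0]":
max_samples = ConditionalBoundary("float", 0.0, 1.0, False, True)
```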
218 changes: 218 additions & 0 deletions refined_types/docstrings.txt
@@ -0,0 +1,218 @@
criterion : {"gini", "entropy"}, default="gini"

splitter : {"best", "random"}, default="best"

max_features : int, float or {"auto", "sqrt", "log2"}, default=None

class_weight : dict, list of dict or "balanced", default=None

ccp_alpha : non-negative float, default=0.0

multioutput : {'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average'

average : {'micro', 'macro', 'samples', 'weighted'} or None, default='macro'

max_fpr : float > 0 and <= 1, default=None

multi_class : {'raise', 'ovr', 'ovo'}, default='raise'

normalize : {'true', 'pred', 'all'}, default=None

multioutput : {'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average'

multioutput : {'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average'

average : {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'

zero_division : "warn", 0 or 1, default="warn"

multioutput : {'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average'

average : {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'

zero_division : "warn", 0 or 1, default="warn"

zero_division : "warn", 0 or 1, default="warn"

weights : {'linear', 'quadratic'}, default=None

average : {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'

zero_division : "warn", 0 or 1, default="warn"

algorithm : {'SAMME', 'SAMME.R'}, default='SAMME.R'

loss : {'squared_error', 'absolute_error', 'huber', 'quantile'}, default='squared_error'

criterion : {'friedman_mse', 'squared_error', 'mse', 'mae'}, default='friedman_mse'

init : estimator or 'zero', default=None

max_features : {'auto', 'sqrt', 'log2'}, int or float

validation_fraction : float, default=0.1
Must be between 0 and 1

criterion : {"gini", "entropy"}, default="gini"

max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"

class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, default=None

max_samples : int or float, default=None
If float, then draw `max_samples * X.shape[0]` samples. Thus, `max_samples` should be in the interval `(0.0, 1.0]`.

criterion : {"gini", "entropy"}, default="gini"

max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"

class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, default=None

max_samples : int or float, default=None
If float, then draw `max_samples * X.shape[0]` samples. Thus, `max_samples` should be in the interval `(0.0, 1.0]`.

init : {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++'

algorithm : {"auto", "full", "elkan"}, default="auto"

input : {'filename', 'file', 'content'}, default='content'

decode_error : {'strict', 'ignore', 'replace'}, default='strict'

strip_accents : {'ascii', 'unicode'}, default=None

stop_words : {'english'}, list, default=None

analyzer : {'word', 'char', 'char_wb'} or callable, default='word'

max_df : float in range [0.0, 1.0] or int, default=1.0

min_df : float in range [0.0, 1.0] or int, default=1

input : {'filename', 'file', 'content'}, default='content'

decode_error : {'strict', 'ignore', 'replace'}, default='strict'

strip_accents : {'ascii', 'unicode'}, default=None

analyzer : {'word', 'char', 'char_wb'} or callable, default='word'

stop_words : {'english'}, list, default=None

max_df : float or int, default=1.0
If float in range [0.0, 1.0]

min_df : float or int, default=1
If float in range of [0.0, 1.0]

norm : {'l1', 'l2'}, default='l2'

solver : {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs'}, default='auto'

penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'

C : float, default=1.0
Inverse of regularization strength; must be a positive float

class_weight : dict or 'balanced', default=None

solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'

multi_class : {'auto', 'ovr', 'multinomial'}, default='auto'

verbose : int, default=0
For the liblinear and lbfgs solvers set verbose to any positive number for verbosity

l1_ratio : float, default=0.5
The ElasticNet mixing parameter, with ``0 <= l1_ratio <= 1``

loss : str, default='hinge'
The possible options are 'hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron', or a regression loss: 'squared_error', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'

penalty : {'l2', 'l1', 'elasticnet'}, default='l2'

l1_ratio : float, default=0.5
The ElasticNet mixing parameter, with ``0 <= l1_ratio <= 1``

learning_rate : str, default='optimal'
- 'constant': `eta = eta0`
- 'optimal': `eta = 1.0 / (alpha * (t + t0))` where t0 is chosen by a heuristic proposed by Leon Bottou
- 'invscaling': `eta = eta0 / pow(t, power_t)`
- 'adaptive': eta = eta0, as long as the training keeps decreasing

validation_fraction : float, default=0.1
Must be between 0 and 1

class_weight : dict, {class_label: weight} or "balanced", default=None

precompute : 'auto', bool or array-like of shape (n_features, n_features)

selection : {'cyclic', 'random'}, default='cyclic'

l1_ratio : float, default=0.5
The ElasticNet mixing parameter, with ``0 <= l1_ratio <= 1``

selection : {'cyclic', 'random'}, default='cyclic'

strategy : str, default='mean'
If "mean", then replace missing values using the mean along each column
If "median", then replace missing values using the median along each column
If "most_frequent", then replace missing using the most frequent value along each column
If "constant", then replace missing values with fill_value

C : float, default=1.0
Must be strictly positive

kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'

gamma : {'scale', 'auto'} or float, default='scale'

class_weight : dict or 'balanced', default=None

decision_function_shape : {'ovo', 'ovr'}, default='ovr'

order : {'C', 'F'}, default='C'

quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)

categories : 'auto' or a list of array-like, default='auto'

drop : {'first', 'if_binary'} or a array-like of shape (n_features,), default=None

handle_unknown : {'error', 'ignore'}, default='error'

n_splits : int, default=5
Must be at least 2

n_splits : int, default=5
Must be at least 2

error_score : 'raise' or numeric, default=np.nan

n_splits : int, default=5
Must be at least 2

error_score : 'raise' or numeric, default=np.nan

error_score : 'raise' or numeric, default=np.nan

test_size : float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split

train_size : float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split

algorithm : {'arpack', 'randomized'}, default='randomized'

n_components : int, float or 'mle', default=None

svd_solver : {'auto', 'full', 'arpack', 'randomized'}, default='auto'

tol : float, default=0.0
Must be of range [0.0, infinity)

iterated_power : int or 'auto', default='auto'
Must be of range [0, infinity)

weights : {'uniform', 'distance'} or callable, default='uniform'

algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
