Merge branch 'branch-0.18' into fea-ext-svm-multiclass
tfeher committed Dec 11, 2020
2 parents dcc9caf + 2e4388d commit d57fa0b
Showing 86 changed files with 6,823 additions and 501 deletions.
39 changes: 34 additions & 5 deletions CHANGELOG.md
@@ -1,12 +1,20 @@
# cuML 0.18.0 (Date TBD)

## New Features

## Improvements

## Bug Fixes
- PR #3279: Correct pure virtual declaration in manifold_inputs_t

# cuML 0.17.0 (Date TBD)

## New Features
- PR #3164: Expose silhouette score in Python
- PR #3214: Correct flaky silhouette score test by setting atol
- PR #3160: Least Angle Regression (experimental)
- PR #2659: Add initial max inner product sparse knn
- PR #3092: Multiclass meta estimator wrappers and multiclass SVC
- PR #2836: Refactor UMAP to accept sparse inputs
- PR #3186: Add gain to RF JSON dump
- PR #3126: Experimental versions of GPU accelerated Kernel and Permutation SHAP

## Improvements
- PR #3077: Improve runtime for test_kmeans
@@ -20,6 +28,7 @@
- PR #2956: Follow cuML array conventions in ARIMA and remove redundancy
- PR #3000: Pin cmake policies to cmake 3.17 version, bump project version to 0.17
- PR #3083: Improving test_make_blobs testing time
- PR #3223: Increase default SVM kernel cache to 2000 MiB
- PR #2906: Moving `linalg` decomp to RAFT namespaces
- PR #2988: FIL: use tree-per-class reduction for GROVE_PER_CLASS_FEW_CLASSES
- PR #2996: Removing the max_depth restriction for switching to the batched backend
@@ -34,7 +43,7 @@
- PR #3115: Speeding up MNMG UMAP testing
- PR #3112: Speed test_array
- PR #3111: Adding Cython to Code Coverage
- PR #3129: Update notebooks README
- PR #3129: Update notebooks README
- PR #3002: Update flake8 Config To With Per File Settings
- PR #3135: Add QuasiNewton tests
- PR #3040: Improved Array Conversion with CumlArrayDescriptor and Decorators
@@ -47,8 +56,17 @@
- PR #3155: Eliminate unnecessary warnings from random projection test
- PR #3176: Add probabilistic SVM tests with various input array types
- PR #3180: FIL: `blocks_per_sm` support in Python
- PR #3186: Add gain to RF JSON dump
- PR #3219: Update CI to use XGBoost 1.3.0 RCs
- PR #3221: Update contributing doc for label support
- PR #3177: Make Multinomial Naive Bayes inherit from `ClassifierMixin` and use it for score
- PR #3241: Updating RAFT to latest
- PR #3240: Minor doc updates

## Bug Fixes
- PR #3164: Expose silhouette score in Python
- PR #3258: Revert silhouette_score Python exposure due to memory issue
- PR #3218: Specify dependency branches in conda dev environment to avoid pip resolver issue
- PR #3196: Disable ascending=false path for sortColumnsPerRow
- PR #3051: MNMG KNN Cl&Re fix + multiple improvements
- PR #3179: Remove unused metrics.cu file
@@ -81,11 +99,22 @@
- PR #3152: Fix access to attributes of individual NB objects in dask NB
- PR #3156: Force local conda artifact install
- PR #3162: Removing accidentally checked in debug file
- PR #3191: Fix __repr__ function for preprocessing models
- PR #3175: Fix gtest pinned cmake version for build from source option
- PR #3182: Fix a bug in MSE metric calculation
- PR #3187: Update docstring to document behavior of `bootstrap=False`
- PR #3215: Add a missing `__syncthreads()`
- PR #3246: Fix MNMG KNN doc (adding batch_size)
- PR #3185: Add documentation for Distributed TFIDF Transformer
- PR #3190: Fix Attribute error on ICPA #3183 and PCA input type
- PR #3208: Fix EXITCODE override in notebook test script

- PR #3250: Fixing label binarizer bug with multiple partitions
- PR #3214: Correct flaky silhouette score test by setting atol
- PR #3216: Ignore splits that do not satisfy constraints
- PR #3239: Fix intermittent dask random forest failure
- PR #3243: Avoid unnecessary split for degenerate case where all labels are identical
- PR #3245: Rename `rows_sample` -> `max_samples` to be consistent with sklearn's RF
- PR #3282: Add secondary test to kernel explainer pytests for stability in Volta

# cuML 0.16.0 (23 Oct 2020)

13 changes: 12 additions & 1 deletion CONTRIBUTING.md
@@ -41,10 +41,21 @@ into three categories:
### A note related to our CI process
After you have started a PR (refer to step 6 in the previous section), every time you do a `git push <yourRemote> <pr-branch>`, it triggers a new CI run on all the commits thus far. Even though GPUCI has mechanisms to deal with this to a certain extent, if you keep `push`ing too frequently, it might just clog our GPUCI servers and slow down every PR and conda package generation! So, please be mindful of this and try not to do many frequent pushes.

To quantify this, the average check in our CI takes between 25 and 32 minutes on our servers. The GPUCI infrastructure has limited resources, so if the servers get overwhelmed, every current active PR will not be able to correctly schedule CI.
To quantify this, the average check in our CI takes between 80 and 90 minutes on our servers. The GPUCI infrastructure has limited resources, so if the servers get overwhelmed, every current active PR will not be able to correctly schedule CI.

Remember, if you are unsure about anything, don't hesitate to comment on issues and ask for clarifications!

### Managing PR labels

Each PR must be labeled according to whether it is a "breaking" or "non-breaking" change (using Github labels). This is used to highlight changes that users should know about when upgrading.

For cuML, a "breaking" change is one that modifies the public, non-experimental, Python API in a
non-backward-compatible way. The C++ API does not have an expectation of backward compatibility at this
time, so changes to it are not typically considered breaking. Backward-compatible API changes to the Python
API (such as adding a new keyword argument to a function) do not need to be labeled.

Additional labels must be applied to indicate whether the change is a feature, improvement, bugfix, or documentation change. See the shared RAPIDS documentation for these labels: https://github.com/rapidsai/kb/issues/42.
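
As an illustration only (not part of this commit), the labeling rule above can be sketched as a small Python function. The function name, its parameters, and the exact label strings for the change kind are hypothetical; only "breaking" and "non-breaking" are quoted from the text, and the real label names are defined in the shared RAPIDS documentation linked above.

```python
# Hypothetical helper: the names and the kind label strings are assumptions,
# chosen to mirror the wording of the CONTRIBUTING.md text.
KINDS = {"feature", "improvement", "bugfix", "documentation"}


def pr_labels(kind, touches_public_python_api=False,
              experimental=False, backward_compatible=True):
    """Return the two labels a PR would carry under the rule described above."""
    if kind not in KINDS:
        raise ValueError(f"unknown change kind: {kind!r}")
    # "Breaking" requires all three: public Python API, non-experimental,
    # and not backward compatible. C++-only changes are not typically breaking.
    breaking = (touches_public_python_api
                and not experimental
                and not backward_compatible)
    return ["breaking" if breaking else "non-breaking", kind]
```

For example, `pr_labels("improvement", touches_public_python_api=True, backward_compatible=False)` yields a "breaking" label, while the same change to an experimental API does not.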

### Seasoned developers

Once you have gotten your feet wet and are more comfortable with the code, you
7 changes: 6 additions & 1 deletion README.md
@@ -108,7 +108,10 @@ repo](https://github.com/rapidsai/notebooks-contrib).
| | Epsilon-Support Vector Regression (SVR) | |
| **Time Series** | Holt-Winters Exponential Smoothing | |
| | Auto-regressive Integrated Moving Average (ARIMA) | Supports seasonality (SARIMA) |
| **Other** | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |
| **Model Explanation** | SHAP Kernel Explainer | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental) |
| | SHAP Permutation Explainer | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental) |
| **Other** | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |

---

## Installation
@@ -127,6 +130,8 @@ Please see our [guide for contributing to cuML](CONTRIBUTING.md).

## References

The RAPIDS team has a number of blogs with deeper technical dives and examples. [You can find them here on Medium.](https://medium.com/rapids-ai/tagged/machine-learning)

For additional details on the technologies behind cuML, as well as a broader overview of the Python Machine Learning landscape, see [_Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence_ (2020)](https://arxiv.org/abs/2002.04803) by Sebastian Raschka, Joshua Patterson, and Corey Nolet.

Please consider citing this when using cuML in a project. You can use the citation BibTeX:
6 changes: 3 additions & 3 deletions ci/gpu/build.sh
@@ -53,7 +53,7 @@ gpuci_conda_retry install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvid
"dask-cudf=${MINOR_VERSION}" \
"dask-cuda=${MINOR_VERSION}" \
"ucx-py=${MINOR_VERSION}" \
"xgboost=1.2.0dev.rapidsai${MINOR_VERSION}" \
"xgboost=1.3.0dev.rapidsai${MINOR_VERSION}" \
"rapids-build-env=${MINOR_VERSION}.*" \
"rapids-notebook-env=${MINOR_VERSION}.*" \
"rapids-doc-env=${MINOR_VERSION}.*"
@@ -70,8 +70,8 @@ fi

gpuci_logger "Install the master version of dask and distributed"
set -x
pip install "git+https://github.com/dask/distributed.git" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git" --upgrade --no-deps
pip install "git+https://github.com/dask/distributed.git@master" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git@master" --upgrade --no-deps
set +x

gpuci_logger "Check compiler versions"
22 changes: 11 additions & 11 deletions conda/environments/cuml_dev_cuda10.1.yml
@@ -6,15 +6,15 @@ channels:
- conda-forge
dependencies:
- cudatoolkit=10.1
- rapids-build-env=0.17
- rapids-notebook-env=0.17
- rapids-doc-env=0.17
- cudf=0.17.*
- rmm=0.17.*
- libcumlprims=0.17.*
- dask-cudf=0.17.*
- dask-cuda=0.17.*
- ucx-py=0.17.*
- rapids-build-env=0.18
- rapids-notebook-env=0.18
- rapids-doc-env=0.18
- cudf=0.18.*
- rmm=0.18.*
- libcumlprims=0.18.*
- dask-cudf=0.18.*
- dask-cuda=0.18.*
- ucx-py=0.18.*
- dask-ml
- doxygen>=1.8.20
- libfaiss>=1.6.3
@@ -25,8 +25,8 @@ dependencies:
- pip
- pip:
- sphinx_markdown_tables
- git+https://github.com/dask/dask.git
- git+https://github.com/dask/distributed.git
- git+https://github.com/dask/dask.git@master
- git+https://github.com/dask/distributed.git@master

# rapids-build-env, notebook-env and doc-env meta packages are defined in
# https://docs.rapids.ai/maintainers/depmgmt/
22 changes: 11 additions & 11 deletions conda/environments/cuml_dev_cuda10.2.yml
@@ -6,15 +6,15 @@ channels:
- conda-forge
dependencies:
- cudatoolkit=10.2
- rapids-build-env=0.17
- rapids-notebook-env=0.17
- rapids-doc-env=0.17
- cudf=0.17.*
- rmm=0.17.*
- libcumlprims=0.17.*
- dask-cudf=0.17.*
- dask-cuda=0.17.*
- ucx-py=0.17.*
- rapids-build-env=0.18
- rapids-notebook-env=0.18
- rapids-doc-env=0.18
- cudf=0.18.*
- rmm=0.18.*
- libcumlprims=0.18.*
- dask-cudf=0.18.*
- dask-cuda=0.18.*
- ucx-py=0.18.*
- dask-ml
- doxygen>=1.8.20
- libfaiss>=1.6.3
@@ -25,8 +25,8 @@ dependencies:
- pip
- pip:
- sphinx_markdown_tables
- git+https://github.com/dask/dask.git
- git+https://github.com/dask/distributed.git
- git+https://github.com/dask/dask.git@master
- git+https://github.com/dask/distributed.git@master

# rapids-build-env, notebook-env and doc-env are defined in
# https://docs.rapids.ai/maintainers/depmgmt/
22 changes: 11 additions & 11 deletions conda/environments/cuml_dev_cuda11.0.yml
@@ -6,15 +6,15 @@ channels:
- conda-forge
dependencies:
- cudatoolkit=11.0
- rapids-build-env=0.17
- rapids-notebook-env=0.17
- rapids-doc-env=0.17
- cudf=0.17.*
- rmm=0.17.*
- libcumlprims=0.17.*
- dask-cudf=0.17.*
- dask-cuda=0.17.*
- ucx-py=0.17.*
- rapids-build-env=0.18
- rapids-notebook-env=0.18
- rapids-doc-env=0.18
- cudf=0.18.*
- rmm=0.18.*
- libcumlprims=0.18.*
- dask-cudf=0.18.*
- dask-cuda=0.18.*
- ucx-py=0.18.*
- dask-ml
- doxygen>=1.8.20
- libfaiss>=1.6.3
@@ -25,8 +25,8 @@ dependencies:
- pip
- pip:
- sphinx_markdown_tables
- git+https://github.com/dask/dask.git
- git+https://github.com/dask/distributed.git
- git+https://github.com/dask/dask.git@master
- git+https://github.com/dask/distributed.git@master

# rapids-build-env, notebook-env and doc-env are defined in
# https://docs.rapids.ai/maintainers/depmgmt/
5 changes: 4 additions & 1 deletion cpp/CMakeLists.txt
@@ -18,7 +18,7 @@ set (CMAKE_FIND_NO_INSTALL_PREFIX TRUE FORCE)

cmake_minimum_required(VERSION 3.14...3.17 FATAL_ERROR)

project(CUML VERSION 0.17.0 LANGUAGES C CXX CUDA)
project(cuML VERSION 0.18.0 LANGUAGES C CXX CUDA)

##############################################################################
# - build type ---------------------------------------------------------------
@@ -395,6 +395,8 @@ if(BUILD_CUML_CPP_LIBRARY)
src/datasets/make_regression.cu
src/dbscan/dbscan.cu
src/decisiontree/decisiontree.cu
src/explainer/kernel_shap.cu
src/explainer/permutation_shap.cu
src/fil/fil.cu
src/fil/infer.cu
src/glm/glm.cu
@@ -418,6 +420,7 @@ if(BUILD_CUML_CPP_LIBRARY)
src/pca/pca.cu
src/randomforest/randomforest.cu
src/random_projection/rproj.cu
src/solver/lars.cu
src/solver/solver.cu
src/spectral/spectral.cu
src/svm/svc.cu
2 changes: 1 addition & 1 deletion cpp/bench/sg/fil.cu
@@ -146,7 +146,7 @@ std::vector<Params> getInputs() {
set_rf_params(p.rf, // Output RF parameters
1, // n_trees, just a placeholder value, anyway changed below
true, // bootstrap
1.f, // rows_sample
1.f, // max_samples
1234, // seed
8); // n_streams

2 changes: 1 addition & 1 deletion cpp/bench/sg/rf_classifier.cu
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@ std::vector<Params> getInputs() {
set_rf_params(p.rf, // Output RF parameters
500, // n_trees
true, // bootstrap
1.f, // rows_sample
1.f, // max_samples
1234, // seed
8); // n_streams

2 changes: 1 addition & 1 deletion cpp/bench/sg/rf_regressor.cu
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,7 @@ std::vector<RegParams> getInputs() {
set_rf_params(p.rf, // Output RF parameters
500, // n_trees
true, // bootstrap
1.f, // rows_sample
1.f, // max_samples
1234, // seed
8); // n_streams

2 changes: 1 addition & 1 deletion cpp/cmake/Dependencies.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ else(DEFINED ENV{RAFT_PATH})

ExternalProject_Add(raft
GIT_REPOSITORY https://github.com/rapidsai/raft.git
GIT_TAG eebd0e306624b419168b2cd5cd7aa44ebaec51f1
GIT_TAG f75d7b437bf1da3df749108161b8a0505fb6b7b3
PREFIX ${RAFT_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
10 changes: 5 additions & 5 deletions cpp/include/cuml/ensemble/randomforest.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ struct RF_params {
* Control bootstrapping.
* If bootstrapping is set to true, bootstrapped samples are used for building
* each tree. Bootstrapped sampling is done by randomly drawing
* round(rows_sample * n_samples) number of samples with replacement. More on
* round(max_samples * n_samples) number of samples with replacement. More on
* bootstrapping:
* https://en.wikipedia.org/wiki/Bootstrap_aggregating
* If boostrapping is set to false, whole dataset is used to build each
@@ -70,7 +70,7 @@ struct RF_params {
/**
* Ratio of dataset rows used while fitting each tree.
*/
float rows_sample;
float max_samples;
/**
* Decision tree training hyper parameter struct.
*/
@@ -88,10 +88,10 @@ struct RF_params {
};

void set_rf_params(RF_params& params, int cfg_n_trees = 1,
bool cfg_bootstrap = true, float cfg_rows_sample = 1.0f,
bool cfg_bootstrap = true, float cfg_max_samples = 1.0f,
int cfg_seed = -1, int cfg_n_streams = 8);
void set_all_rf_params(RF_params& params, int cfg_n_trees, bool cfg_bootstrap,
float cfg_rows_sample, int cfg_seed, int cfg_n_streams,
float cfg_max_samples, int cfg_seed, int cfg_n_streams,
DecisionTree::DecisionTreeParams cfg_tree_params);
void validity_check(const RF_params rf_params);
void print(const RF_params rf_params);
@@ -190,7 +190,7 @@ RF_params set_rf_class_obj(int max_depth, int max_leaves, float max_features,
int n_bins, int split_algo, int min_samples_leaf,
int min_samples_split, float min_impurity_decrease,
bool bootstrap_features, bool bootstrap, int n_trees,
float rows_sample, int seed,
float max_samples, int seed,
CRITERION split_criterion, bool quantile_per_tree,
int cfg_n_streams, bool use_experimental_backend,
int max_batch_size);
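
The doc comment on `RF_params` in `randomforest.hpp` describes bootstrapped sampling as drawing `round(max_samples * n_samples)` rows with replacement. A minimal Python sketch of that rule follows. It is not cuML's implementation: the function name is hypothetical, the `(0, 1]` range check is an assumption, and the random number generator differs from whatever the library uses on the GPU.

```python
import random


def bootstrap_row_indices(n_samples, max_samples=1.0, seed=None):
    """Draw row indices with replacement (hypothetical helper, not cuML code)."""
    # Assumption: max_samples is a ratio of the dataset, so it lies in (0, 1].
    if not 0.0 < max_samples <= 1.0:
        raise ValueError("max_samples must be in (0, 1]")
    # Round half up; Python's built-in round() rounds halves to even instead.
    n_draw = int(max_samples * n_samples + 0.5)
    rng = random.Random(seed)
    # Sampling with replacement: the same row index may appear more than once.
    return [rng.randrange(n_samples) for _ in range(n_draw)]
```

With `n_samples=10` and `max_samples=0.5`, each tree would be fit on 5 row indices, possibly with repeats.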