Merge branch 'branch-0.18' into fea-ext-svm-multiclass

rapidsai · Dec 11, 2020 · d57fa0b
2 parents dcc9caf + 2e4388d
Showing 86 changed files with 6,823 additions and 501 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,12 +1,20 @@
+# cuML 0.18.0 (Date TBD)
+
+## New Features
+
+## Improvements
+
+## Bug Fixes
+- PR #3279: Correct pure virtual declaration in manifold_inputs_t
+
 # cuML 0.17.0 (Date TBD)
 
 ## New Features
-- PR #3164: Expose silhouette score in Python
-- PR #3214: Correct flaky silhouette score test by setting atol
+- PR #3160: Least Angle Regression (experimental)
 - PR #2659: Add initial max inner product sparse knn
 - PR #3092: Multiclass meta estimator wrappers and multiclass SVC
 - PR #2836: Refactor UMAP to accept sparse inputs
-- PR #3186: Add gain to RF JSON dump
+- PR #3126: Experimental versions of GPU accelerated Kernel and Permutation SHAP
 
 ## Improvements
 - PR #3077: Improve runtime for test_kmeans
@@ -20,6 +28,7 @@
 - PR #2956: Follow cuML array conventions in ARIMA and remove redundancy
 - PR #3000: Pin cmake policies to cmake 3.17 version, bump project version to 0.17
 - PR #3083: Improving test_make_blobs testing time
+- PR #3223: Increase default SVM kernel cache to 2000 MiB
 - PR #2906: Moving `linalg` decomp to RAFT namespaces
 - PR #2988: FIL: use tree-per-class reduction for GROVE_PER_CLASS_FEW_CLASSES
 - PR #2996: Removing the max_depth restriction for switching to the batched backend
@@ -34,7 +43,7 @@
 - PR #3115: Speeding up MNMG UMAP testing
 - PR #3112: Speed test_array
 - PR #3111: Adding Cython to Code Coverage
-- PR #3129:  Update notebooks README
+- PR #3129: Update notebooks README
 - PR #3002: Update flake8 Config To With Per File Settings
 - PR #3135: Add QuasiNewton tests
 - PR #3040: Improved Array Conversion with CumlArrayDescriptor and Decorators
@@ -47,8 +56,17 @@
 - PR #3155: Eliminate unnecessary warnings from random projection test
 - PR #3176: Add probabilistic SVM tests with various input array types
 - PR #3180: FIL: `blocks_per_sm` support in Python
+- PR #3186: Add gain to RF JSON dump
+- PR #3219: Update CI to use XGBoost 1.3.0 RCs
+- PR #3221: Update contributing doc for label support
+- PR #3177: Make Multinomial Naive Bayes inherit from `ClassifierMixin` and use it for score
+- PR #3241: Updating RAFT to latest
+- PR #3240: Minor doc updates
 
 ## Bug Fixes
+- PR #3164: Expose silhouette score in Python
+- PR #3258: Revert silhouette_score Python exposure due to memory issue
+- PR #3218: Specify dependency branches in conda dev environment to avoid pip resolver issue
 - PR #3196: Disable ascending=false path for sortColumnsPerRow
 - PR #3051: MNMG KNN Cl&Re fix + multiple improvements
 - PR #3179: Remove unused metrics.cu file
@@ -81,11 +99,22 @@
 - PR #3152: Fix access to attributes of individual NB objects in dask NB
 - PR #3156: Force local conda artifact install
 - PR #3162: Removing accidentally checked in debug file
+- PR #3191: Fix __repr__ function for preprocessing models
 - PR #3175: Fix gtest pinned cmake version for build from source option
 - PR #3182: Fix a bug in MSE metric calculation
+- PR #3187: Update docstring to document behavior of `bootstrap=False`
+- PR #3215: Add a missing `__syncthreads()`
+- PR #3246: Fix MNMG KNN doc (adding batch_size)
+- PR #3185: Add documentation for Distributed TFIDF Transformer
 - PR #3190: Fix Attribute error on ICPA #3183 and PCA input type
 - PR #3208: Fix EXITCODE override in notebook test script
-
+- PR #3250: Fixing label binarizer bug with multiple partitions
+- PR #3214: Correct flaky silhouette score test by setting atol
+- PR #3216: Ignore splits that do not satisfy constraints
+- PR #3239: Fix intermittent dask random forest failure
+- PR #3243: Avoid unnecessary split for degenerate case where all labels are identical
+- PR #3245: Rename `rows_sample` -> `max_samples` to be consistent with sklearn's RF
+- PR #3282: Add secondary test to kernel explainer pytests for stability in Volta
 
 # cuML 0.16.0 (23 Oct 2020)
 

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -41,10 +41,21 @@ into three categories:
 ### A note related to our CI process
 After you have started a PR (refer to step 6 in the previous section), every time you do a `git push <yourRemote> <pr-branch>`, it triggers a new CI run on all the commits thus far. Even though GPUCI has mechanisms to deal with this to a certain extent, if you keep `push`ing too frequently, it might just clog our GPUCI servers and slow down every PR and conda package generation! So, please be mindful of this and try not to do many frequent pushes.
 
-To quantify this, the average check in our CI takes between 25 and 32 minutes on our servers. The GPUCI infrastructure has limited resources, so if the servers get overwhelmed, every current active PR will not be able to correctly schedule CI.
+To quantify this, the average check in our CI takes between 80 and 90 minutes on our servers. The GPUCI infrastructure has limited resources, so if the servers get overwhelmed, every current active PR will not be able to correctly schedule CI.
 
 Remember, if you are unsure about anything, don't hesitate to comment on issues and ask for clarifications!
 
+### Managing PR labels
+
+Each PR must be labeled according to whether it is a "breaking" or "non-breaking" change (using Github labels). This is used to highlight changes that users should know about when upgrading.
+
+For cuML, a "breaking" change is one that modifies the public, non-experimental, Python API in a
+non-backward-compatible way. The C++ API does not have an expectation of backward compatibility at this
+time, so changes to it are not typically considered breaking. Backward-compatible API changes to the Python
+API (such as adding a new keyword argument to a function) do not need to be labeled.
+
+Additional labels must be applied to indicate whether the change is a feature, improvement, bugfix, or documentation change. See the shared RAPIDS documentation for these labels: https://github.com/rapidsai/kb/issues/42.
+
 ### Seasoned developers
 
 Once you have gotten your feet wet and are more comfortable with the code, you

diff --git a/README.md b/README.md
@@ -108,7 +108,10 @@ repo](https://github.com/rapidsai/notebooks-contrib).
 |  | Epsilon-Support Vector Regression (SVR) | |
 | **Time Series** | Holt-Winters Exponential Smoothing | |
 |  | Auto-regressive Integrated Moving Average (ARIMA) | Supports seasonality (SARIMA) |
-| **Other** | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |
+| **Model Explanation**                                 | SHAP Kernel Explainer                                                                                                               | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental)                                                                                                                                               |
+|                                                       | SHAP Permutation Explainer                       | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental)                                                                                                                                                |
+| **Other**                                             | K-Nearest Neighbors (KNN) Search                                                                                                          | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |
+
 ---
 
 ## Installation
@@ -127,6 +130,8 @@ Please see our [guide for contributing to cuML](CONTRIBUTING.md).
 
 ## References
 
+The RAPIDS team has a number of blogs with deeper technical dives and examples. [You can find them here on Medium.](https://medium.com/rapids-ai/tagged/machine-learning)
+
 For additional details on the technologies behind cuML, as well as a broader overview of the Python Machine Learning landscape, see [_Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence_ (2020)](https://arxiv.org/abs/2002.04803) by Sebastian Raschka, Joshua Patterson, and Corey Nolet.
 
 Please consider citing this when using cuML in a project. You can use the citation BibTeX:

diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh
@@ -53,7 +53,7 @@ gpuci_conda_retry install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvid
       "dask-cudf=${MINOR_VERSION}" \
       "dask-cuda=${MINOR_VERSION}" \
       "ucx-py=${MINOR_VERSION}" \
-      "xgboost=1.2.0dev.rapidsai${MINOR_VERSION}" \
+      "xgboost=1.3.0dev.rapidsai${MINOR_VERSION}" \
       "rapids-build-env=${MINOR_VERSION}.*" \
       "rapids-notebook-env=${MINOR_VERSION}.*" \
       "rapids-doc-env=${MINOR_VERSION}.*"
@@ -70,8 +70,8 @@ fi
 
 gpuci_logger "Install the master version of dask and distributed"
 set -x
-pip install "git+https://github.com/dask/distributed.git" --upgrade --no-deps
-pip install "git+https://github.com/dask/dask.git" --upgrade --no-deps
+pip install "git+https://github.com/dask/distributed.git@master" --upgrade --no-deps
+pip install "git+https://github.com/dask/dask.git@master" --upgrade --no-deps
 set +x
 
 gpuci_logger "Check compiler versions"

diff --git a/conda/environments/cuml_dev_cuda10.1.yml b/conda/environments/cuml_dev_cuda10.1.yml
@@ -6,15 +6,15 @@ channels:
 - conda-forge
 dependencies:
 - cudatoolkit=10.1
-- rapids-build-env=0.17
-- rapids-notebook-env=0.17
-- rapids-doc-env=0.17
-- cudf=0.17.*
-- rmm=0.17.*
-- libcumlprims=0.17.*
-- dask-cudf=0.17.*
-- dask-cuda=0.17.*
-- ucx-py=0.17.*
+- rapids-build-env=0.18
+- rapids-notebook-env=0.18
+- rapids-doc-env=0.18
+- cudf=0.18.*
+- rmm=0.18.*
+- libcumlprims=0.18.*
+- dask-cudf=0.18.*
+- dask-cuda=0.18.*
+- ucx-py=0.18.*
 - dask-ml
 - doxygen>=1.8.20
 - libfaiss>=1.6.3
@@ -25,8 +25,8 @@ dependencies:
 - pip
 - pip:
     - sphinx_markdown_tables
-    - git+https://github.com/dask/dask.git
-    - git+https://github.com/dask/distributed.git
+    - git+https://github.com/dask/dask.git@master
+    - git+https://github.com/dask/distributed.git@master
 
 # rapids-build-env, notebook-env and doc-env meta packages are defined in
 # https://docs.rapids.ai/maintainers/depmgmt/

diff --git a/conda/environments/cuml_dev_cuda10.2.yml b/conda/environments/cuml_dev_cuda10.2.yml
@@ -6,15 +6,15 @@ channels:
 - conda-forge
 dependencies:
 - cudatoolkit=10.2
-- rapids-build-env=0.17
-- rapids-notebook-env=0.17
-- rapids-doc-env=0.17
-- cudf=0.17.*
-- rmm=0.17.*
-- libcumlprims=0.17.*
-- dask-cudf=0.17.*
-- dask-cuda=0.17.*
-- ucx-py=0.17.*
+- rapids-build-env=0.18
+- rapids-notebook-env=0.18
+- rapids-doc-env=0.18
+- cudf=0.18.*
+- rmm=0.18.*
+- libcumlprims=0.18.*
+- dask-cudf=0.18.*
+- dask-cuda=0.18.*
+- ucx-py=0.18.*
 - dask-ml
 - doxygen>=1.8.20
 - libfaiss>=1.6.3
@@ -25,8 +25,8 @@ dependencies:
 - pip
 - pip:
     - sphinx_markdown_tables
-    - git+https://github.com/dask/dask.git
-    - git+https://github.com/dask/distributed.git
+    - git+https://github.com/dask/dask.git@master
+    - git+https://github.com/dask/distributed.git@master
 
 # rapids-build-env, notebook-env and doc-env are defined in
 # https://docs.rapids.ai/maintainers/depmgmt/

diff --git a/conda/environments/cuml_dev_cuda11.0.yml b/conda/environments/cuml_dev_cuda11.0.yml
@@ -6,15 +6,15 @@ channels:
 - conda-forge
 dependencies:
 - cudatoolkit=11.0
-- rapids-build-env=0.17
-- rapids-notebook-env=0.17
-- rapids-doc-env=0.17
-- cudf=0.17.*
-- rmm=0.17.*
-- libcumlprims=0.17.*
-- dask-cudf=0.17.*
-- dask-cuda=0.17.*
-- ucx-py=0.17.*
+- rapids-build-env=0.18
+- rapids-notebook-env=0.18
+- rapids-doc-env=0.18
+- cudf=0.18.*
+- rmm=0.18.*
+- libcumlprims=0.18.*
+- dask-cudf=0.18.*
+- dask-cuda=0.18.*
+- ucx-py=0.18.*
 - dask-ml
 - doxygen>=1.8.20
 - libfaiss>=1.6.3
@@ -25,8 +25,8 @@ dependencies:
 - pip
 - pip:
     - sphinx_markdown_tables
-    - git+https://github.com/dask/dask.git
-    - git+https://github.com/dask/distributed.git
+    - git+https://github.com/dask/dask.git@master
+    - git+https://github.com/dask/distributed.git@master
 
 # rapids-build-env, notebook-env and doc-env are defined in
 # https://docs.rapids.ai/maintainers/depmgmt/

diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
@@ -18,7 +18,7 @@ set (CMAKE_FIND_NO_INSTALL_PREFIX TRUE FORCE)
 
 cmake_minimum_required(VERSION 3.14...3.17 FATAL_ERROR)
 
-project(CUML VERSION 0.17.0 LANGUAGES C CXX CUDA)
+project(cuML VERSION 0.18.0 LANGUAGES C CXX CUDA)
 
 ##############################################################################
 # - build type ---------------------------------------------------------------
@@ -395,6 +395,8 @@ if(BUILD_CUML_CPP_LIBRARY)
     src/datasets/make_regression.cu
     src/dbscan/dbscan.cu
     src/decisiontree/decisiontree.cu
+    src/explainer/kernel_shap.cu
+    src/explainer/permutation_shap.cu
     src/fil/fil.cu
     src/fil/infer.cu
     src/glm/glm.cu
@@ -418,6 +420,7 @@ if(BUILD_CUML_CPP_LIBRARY)
     src/pca/pca.cu
     src/randomforest/randomforest.cu
     src/random_projection/rproj.cu
+    src/solver/lars.cu
     src/solver/solver.cu
     src/spectral/spectral.cu
     src/svm/svc.cu

diff --git a/cpp/bench/sg/fil.cu b/cpp/bench/sg/fil.cu
@@ -146,7 +146,7 @@ std::vector<Params> getInputs() {
   set_rf_params(p.rf,  // Output RF parameters
                 1,  // n_trees, just a placeholder value, anyway changed below
                 true,  // bootstrap
-                1.f,   // rows_sample
+                1.f,   // max_samples
                 1234,  // seed
                 8);    // n_streams
 

diff --git a/cpp/bench/sg/rf_classifier.cu b/cpp/bench/sg/rf_classifier.cu
@@ -86,7 +86,7 @@ std::vector<Params> getInputs() {
   set_rf_params(p.rf,  // Output RF parameters
                 500,   // n_trees
                 true,  // bootstrap
-                1.f,   // rows_sample
+                1.f,   // max_samples
                 1234,  // seed
                 8);    // n_streams
 

diff --git a/cpp/bench/sg/rf_regressor.cu b/cpp/bench/sg/rf_regressor.cu
@@ -88,7 +88,7 @@ std::vector<RegParams> getInputs() {
   set_rf_params(p.rf,  // Output RF parameters
                 500,   // n_trees
                 true,  // bootstrap
-                1.f,   // rows_sample
+                1.f,   // max_samples
                 1234,  // seed
                 8);    // n_streams
 

diff --git a/cpp/cmake/Dependencies.cmake b/cpp/cmake/Dependencies.cmake
@@ -39,7 +39,7 @@ else(DEFINED ENV{RAFT_PATH})
 
   ExternalProject_Add(raft
     GIT_REPOSITORY    https://github.com/rapidsai/raft.git
-    GIT_TAG           eebd0e306624b419168b2cd5cd7aa44ebaec51f1
+    GIT_TAG           f75d7b437bf1da3df749108161b8a0505fb6b7b3
     PREFIX            ${RAFT_DIR}
     CONFIGURE_COMMAND ""
     BUILD_COMMAND     ""

diff --git a/cpp/include/cuml/ensemble/randomforest.hpp b/cpp/include/cuml/ensemble/randomforest.hpp
@@ -60,7 +60,7 @@ struct RF_params {
    * Control bootstrapping.
    * If bootstrapping is set to true, bootstrapped samples are used for building
    * each tree. Bootstrapped sampling is done by randomly drawing
-   * round(rows_sample * n_samples) number of samples with replacement. More on
+   * round(max_samples * n_samples) number of samples with replacement. More on
    * bootstrapping:
    *     https://en.wikipedia.org/wiki/Bootstrap_aggregating
    * If boostrapping is set to false, whole dataset is used to build each
@@ -70,7 +70,7 @@ struct RF_params {
   /**
    * Ratio of dataset rows used while fitting each tree.
    */
-  float rows_sample;
+  float max_samples;
   /**
    * Decision tree training hyper parameter struct.
    */
@@ -88,10 +88,10 @@ struct RF_params {
 };
 
 void set_rf_params(RF_params& params, int cfg_n_trees = 1,
-                   bool cfg_bootstrap = true, float cfg_rows_sample = 1.0f,
+                   bool cfg_bootstrap = true, float cfg_max_samples = 1.0f,
                    int cfg_seed = -1, int cfg_n_streams = 8);
 void set_all_rf_params(RF_params& params, int cfg_n_trees, bool cfg_bootstrap,
-                       float cfg_rows_sample, int cfg_seed, int cfg_n_streams,
+                       float cfg_max_samples, int cfg_seed, int cfg_n_streams,
                        DecisionTree::DecisionTreeParams cfg_tree_params);
 void validity_check(const RF_params rf_params);
 void print(const RF_params rf_params);
@@ -190,7 +190,7 @@ RF_params set_rf_class_obj(int max_depth, int max_leaves, float max_features,
                            int n_bins, int split_algo, int min_samples_leaf,
                            int min_samples_split, float min_impurity_decrease,
                            bool bootstrap_features, bool bootstrap, int n_trees,
-                           float rows_sample, int seed,
+                           float max_samples, int seed,
                            CRITERION split_criterion, bool quantile_per_tree,
                            int cfg_n_streams, bool use_experimental_backend,
                            int max_batch_size);
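
Reviewer note on the `rows_sample` -> `max_samples` rename: the header comment above describes bootstrapped sampling as drawing `round(max_samples * n_samples)` rows with replacement, and says the whole dataset is used when bootstrapping is off. A minimal Python sketch of that rule follows, assuming nothing beyond what the comment states. The helper name `bootstrap_row_indices` and its signature are hypothetical and are not part of the cuML API; the actual sampling is done in the C++/CUDA sources, which this diff does not show.

```python
import math
import random


def bootstrap_row_indices(n_samples, max_samples=1.0, bootstrap=True, seed=None):
    """Pick the row indices used to fit one tree.

    Hypothetical helper for illustration only; it mirrors the rule in the
    RF_params header comment, not the real cuML implementation.
    """
    if not bootstrap:
        # Per the header comment, the whole dataset is used for each tree.
        return list(range(n_samples))

    # Round half up, like C's round() for non-negative values.
    # (Python's built-in round() rounds halves to even, which differs.)
    n_draw = math.floor(max_samples * n_samples + 0.5)

    rng = random.Random(seed)
    # Sampling with replacement: the same row may be drawn more than once.
    return [rng.randrange(n_samples) for _ in range(n_draw)]


if __name__ == "__main__":
    rows = bootstrap_row_indices(n_samples=10, max_samples=0.8, seed=1234)
    print(len(rows), "rows drawn for this tree")
```

With `max_samples=1.0` (the default in `set_rf_params`), each tree draws as many rows as the dataset has, but with replacement, so individual trees still see different subsets.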