Add codespell as a linter #5265

Merged 14 commits on Mar 14, 2023

7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -18,6 +18,13 @@ repos:
types_or: [python, cython]
exclude: thirdparty
additional_dependencies: [flake8-force]
- repo: https://github.com/codespell-project/codespell
rev: v2.2.2
hooks:
- id: codespell
additional_dependencies: [tomli]
args: ["--toml", "pyproject.toml"]
exclude: (?x)^(.*stemmer.*|.*stop_words.*|^CHANGELOG.md$)
- repo: local
hooks:
- id: no-deprecationwarning
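
With the hook above in place, the spelling check can also be run on demand rather than only at commit time. A minimal invocation, assuming `pre-commit` is installed and using the `codespell` hook id from the snippet above:

```bash
# Run only the codespell hook against every file in the repository.
# Illustrative command; assumes pre-commit is installed in the current environment.
pre-commit run codespell --all-files
```
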
4 changes: 2 additions & 2 deletions BUILD.md
@@ -61,7 +61,7 @@ $ ./build.sh cuml --singlegpu # build the cuML python package without M
$ ./build.sh --ccache # use ccache to cache compilations, speeding up subsequent builds
```

By default, Ninja is used as the cmake generator. To override this and use (e.g.) `make`, define the `CMAKE_GENERATOR` environment variable accodingly:
By default, Ninja is used as the cmake generator. To override this and use (e.g.) `make`, define the `CMAKE_GENERATOR` environment variable accordingly:
```bash
CMAKE_GENERATOR='Unix Makefiles' ./build.sh
```
@@ -123,7 +123,7 @@ If using a conda environment (recommended), then cmake can be configured appropr
$ cmake .. -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
```

Note: The following warning message is dependent upon the version of cmake and the `CMAKE_INSTALL_PREFIX` used. If this warning is displayed, the build should still run succesfully. We are currently working to resolve this open issue. You can silence this warning by adding `-DCMAKE_IGNORE_PATH=$CONDA_PREFIX/lib` to your `cmake` command.
Note: The following warning message is dependent upon the version of cmake and the `CMAKE_INSTALL_PREFIX` used. If this warning is displayed, the build should still run successfully. We are currently working to resolve this open issue. You can silence this warning by adding `-DCMAKE_IGNORE_PATH=$CONDA_PREFIX/lib` to your `cmake` command.
```
Cannot generate a safe runtime search path for target ml_test because files
in some directories may conflict with libraries in implicit directories:
15 changes: 13 additions & 2 deletions CONTRIBUTING.md
@@ -29,9 +29,9 @@ into three categories:
2. Find an issue to work on. The best way is to look for the [good first issue](https://github.com/rapidsai/cuml/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
or [help wanted](https://github.com/rapidsai/cuml/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels
3. Comment on the issue saying you are going to work on it.
4. Get familar with the developer guide relevant for you:
4. Get familiar with the developer guide relevant for you:
* For C++ developers it is available here [DEVELOPER_GUIDE.md](wiki/cpp/DEVELOPER_GUIDE.md)
* For Python developers, a [Python DEVELOPER_GUIDE.md](wiki/python/DEVELOPER_GUIDE.md) is availabe as well.
* For Python developers, a [Python DEVELOPER_GUIDE.md](wiki/python/DEVELOPER_GUIDE.md) is available as well.
5. Code! Make sure to update unit tests!
6. When done, [create your pull request](https://github.com/rapidsai/cuml/compare).
7. Verify that CI passes all [status checks](https://help.github.com/articles/about-status-checks/), or fix if needed.
@@ -88,6 +88,16 @@ To skip the checks temporarily, use `git commit --no-verify` or its short form
_Note_: If the auto-formatters' changes affect each other, you may need to go
through multiple iterations of `git commit` and `git add -u`.

cuML also uses [codespell](https://github.com/codespell-project/codespell) to find spelling
mistakes, and this check is run as part of the pre-commit hook. To apply the suggested spelling
fixes, you can run `codespell -i 3 -w .` from the command-line in the cuML root directory.
This will bring up an interactive prompt to select which spelling fixes to apply.

If you want to ignore errors highlighted by codespell, you can:
* Add the word to the `ignore-words-list` in `pyproject.toml` to exclude it for all of cuML (see the sketch after this list)
* Exclude the entire file from spellchecking by adding it to the `exclude` regex in `.pre-commit-config.yaml`
* Ignore only specific lines as shown in https://github.com/codespell-project/codespell/issues/1212#issuecomment-654191881
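
As a rough sketch of the first option: codespell reads its configuration from a `[tool.codespell]` table in `pyproject.toml`, which is what the hook's `--toml pyproject.toml` argument points at. The entries below are placeholders, not cuML's actual list:

```toml
# Hypothetical codespell section in pyproject.toml; the ignored words are
# examples only -- see the repository's pyproject.toml for the real list.
[tool.codespell]
ignore-words-list = "nd,te"
```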

### Summary of pre-commit hooks

The pre-commit hooks configured for this repository consist of a number of
@@ -102,6 +112,7 @@ please see the `.pre-commit-config.yaml` file.
- _`#include` syntax checker_: Ensures consistent syntax for C++ `#include` statements.
- _Copyright header checker and auto-formatter_: Ensures the copyright headers
of files are up-to-date and in the correct format.
- `codespell`: Checks for spelling mistakes

### Managing PR labels

4 changes: 2 additions & 2 deletions ci/checks/black_lists.sh
@@ -1,10 +1,10 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION.
# Copyright (c) 2019-2023, NVIDIA CORPORATION.
##########################################
# cuML black listed function call Tester #
##########################################

# PR_TARGET_BRANCH is set by the CI enviroment
# PR_TARGET_BRANCH is set by the CI environment

git checkout --quiet $PR_TARGET_BRANCH

4 changes: 2 additions & 2 deletions cpp/CMakeLists.txt
@@ -1,5 +1,5 @@
#=============================================================================
# Copyright (c) 2018-2022, NVIDIA CORPORATION.
# Copyright (c) 2018-2023, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -99,7 +99,7 @@ message(VERBOSE "CUML_CPP: Build and statically link FAISS library: ${CUML_USE_F
message(VERBOSE "CUML_CPP: Build and statically link Treelite library: ${CUML_USE_TREELITE_STATIC}")

set(CUML_ALGORITHMS "ALL" CACHE STRING "Experimental: Choose which algorithms are built into libcuml++.so. Can specify individual algorithms or groups in a semicolon-separated list.")
message(VERBOSE "CUML_CPP: Building libcuml++ with algoriths: '${CUML_ALGORITHMS}'.")
message(VERBOSE "CUML_CPP: Building libcuml++ with algorithms: '${CUML_ALGORITHMS}'.")

# Set RMM logging level
set(RMM_LOGGING_LEVEL "INFO" CACHE STRING "Choose the logging level.")
4 changes: 2 additions & 2 deletions cpp/bench/sg/dataset.cuh
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -37,7 +37,7 @@ namespace Bench {
* by every Benchmark's Params structure.
*/
struct DatasetParams {
/** number of rows in the datset */
/** number of rows in the dataset */
int nrows;
/** number of cols in the dataset */
int ncols;
4 changes: 2 additions & 2 deletions cpp/examples/symreg/symreg_example.cpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -32,7 +32,7 @@
#include <rmm/device_scalar.hpp>
#include <rmm/device_uvector.hpp>

// Namspace alias
// Namespace alias
namespace cg = cuml::genetic;

#ifndef CUDA_RT_CALL
4 changes: 2 additions & 2 deletions cpp/include/cuml/cluster/hdbscan.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -497,7 +497,7 @@ void compute_core_dists(const raft::handle_t& handle,
* @brief Compute the map from final, normalize labels to the labels in the CondensedHierarchy
*
* @param[in] handle raft handle for resource reuse
* @param[in] condensed_tree the Condensed Hiearchy object
* @param[in] condensed_tree the Condensed Hierarchy object
* @param[in] n_leaves number of leaves in the input data
* @param[in] cluster_selection_method cluster selection method
* @param[out] inverse_label_map rmm::device_uvector of size 0. It will be resized during the
4 changes: 2 additions & 2 deletions cpp/include/cuml/ensemble/randomforest.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -71,7 +71,7 @@ struct RF_params {
* round(max_samples * n_samples) number of samples with replacement. More on
* bootstrapping:
* https://en.wikipedia.org/wiki/Bootstrap_aggregating
* If boostrapping is set to false, whole dataset is used to build each
* If bootstrapping is set to false, whole dataset is used to build each
* tree.
*/
bool bootstrap;
4 changes: 2 additions & 2 deletions cpp/include/cuml/fil/multi_sum.cuh
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2022, NVIDIA CORPORATION.
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -20,7 +20,7 @@
template parameters: data [T]ype, reduction [R]adix
function parameters:
@data[] holds one value per thread in shared memory
@n_groups is the number of indendent reductions
@n_groups is the number of independent reductions
@n_values is the size of each individual reduction,
that is the number of values to be reduced to a single value
function returns: one sum per thread, for @n_groups first threads.
6 changes: 3 additions & 3 deletions cpp/include/cuml/genetic/genetic.h
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2022, NVIDIA CORPORATION.
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -87,7 +87,7 @@ void symRegPredict(const raft::handle_t& handle,
* @param handle cuML handle
* @param input device pointer to feature matrix
* @param n_rows number of rows of the feature matrix
* @param params host struct containg training hyperparameters
* @param params host struct containing training hyperparameters
* @param best_prog The best program obtained during training. Inferences are made using this
* @param output device pointer to output probability(in col major format)
*/
@@ -104,7 +104,7 @@ void symClfPredictProbs(const raft::handle_t& handle,
* @param handle cuML handle
* @param input device pointer to feature matrix
* @param n_rows number of rows of the feature matrix
* @param params host struct containg training hyperparameters
* @param params host struct containing training hyperparameters
* @param best_prog Best program obtained after training
* @param output Device pointer to output predictions
*/
4 changes: 2 additions & 2 deletions cpp/include/cuml/genetic/program.h
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -36,7 +36,7 @@ struct program {
* Now take the resulting 1D array and reverse it.
*
* @note The pointed memory buffer is NOT owned by this class and further it
* is assumed to be a zero-copy (aka pinned memory) buffer, atleast in
* is assumed to be a zero-copy (aka pinned memory) buffer, at least in
* this initial version
*/

10 changes: 5 additions & 5 deletions cpp/include/cuml/manifold/tsne.h
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -121,8 +121,8 @@ struct TSNEParams {
* @param[out] Y The column-major final embedding in device memory
* @param[in] n Number of rows in data X.
* @param[in] p Number of columns in data X.
* @param[in] knn_indices Array containing nearest neighors indices.
* @param[in] knn_dists Array containing nearest neighors distances.
* @param[in] knn_indices Array containing nearest neighbors indices.
* @param[in] knn_dists Array containing nearest neighbors distances.
* @param[in] params Parameters for TSNE model
* @param[out] kl_div (optional) KL divergence output
*
@@ -155,8 +155,8 @@ void TSNE_fit(const raft::handle_t& handle,
* @param[in] nnz The number of non-zero entries in the CSR.
* @param[in] n Number of rows in data X.
* @param[in] p Number of columns in data X.
* @param[in] knn_indices Array containing nearest neighors indices.
* @param[in] knn_dists Array containing nearest neighors distances.
* @param[in] knn_indices Array containing nearest neighbors indices.
* @param[in] knn_dists Array containing nearest neighbors distances.
* @param[in] params Parameters for TSNE model
* @param[out] kl_div (optional) KL divergence output
*
16 changes: 8 additions & 8 deletions cpp/include/cuml/manifold/umap.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -143,9 +143,9 @@ void fit_sparse(const raft::handle_t& handle,
* Dense transform
*
* @param[in] handle: raft::handle_t
* @param[in] X: pointer to input array to be infered
* @param[in] n: n_samples of input array to be infered
* @param[in] d: n_features of input array to be infered
* @param[in] X: pointer to input array to be inferred
* @param[in] n: n_samples of input array to be inferred
* @param[in] d: n_features of input array to be inferred
* @param[in] orig_X: pointer to original training array
* @param[in] orig_n: number of rows in original training array
* @param[in] embedding: pointer to embedding created during training
@@ -168,10 +168,10 @@ void transform(const raft::handle_t& handle,
* Sparse transform
*
* @param[in] handle: raft::handle_t
* @param[in] indptr: pointer to index pointer array of input array to be infered
* @param[in] indices: pointer to index array of input array to be infered
* @param[in] data: pointer to data array of input array to be infered
* @param[in] nnz: number of stored values of input array to be infered
* @param[in] indptr: pointer to index pointer array of input array to be inferred
* @param[in] indices: pointer to index array of input array to be inferred
* @param[in] data: pointer to data array of input array to be inferred
* @param[in] nnz: number of stored values of input array to be inferred
* @param[in] n: n_samples of input array
* @param[in] d: n_features of input array
* @param[in] orig_x_indptr: pointer to index pointer array of original training array
6 changes: 3 additions & 3 deletions cpp/include/cuml/metrics/metrics.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -177,7 +177,7 @@ double adjusted_rand_index(const raft::handle_t& handle,
*
* The KL divergence tells us how well the probability distribution Q
* approximates the probability distribution P
* It is often also used as a 'distance metric' between two probablity ditributions (not symmetric)
* It is often also used as a 'distance metric' between two probability distributions (not symmetric)
*
* @param handle: raft::handle_t
* @param y: Array of probabilities corresponding to distribution P
@@ -192,7 +192,7 @@ double kl_divergence(const raft::handle_t& handle, const double* y, const double
*
* The KL divergence tells us how well the probability distribution Q
* approximates the probability distribution P
* It is often also used as a 'distance metric' between two probablity ditributions (not symmetric)
* It is often also used as a 'distance metric' between two probability distributions (not symmetric)
*
* @param handle: raft::handle_t
* @param y: Array of probabilities corresponding to distribution P
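
For reference, the quantity described in these docstrings is the standard Kullback-Leibler divergence (stated here as a reminder, not as part of the diff):

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \log \frac{p_i}{q_i}
$$
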
4 changes: 2 additions & 2 deletions cpp/include/cuml/neighbors/knn.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -139,7 +139,7 @@ void knn_classify(raft::handle_t& handle,
/**
* @brief Flat C++ API function to perform a knn regression using
* a given a vector of label arrays. This supports multilabel
* regression by clasifying on multiple label arrays. Note that
* regression by classifying on multiple label arrays. Note that
* each label is classified independently, as is done in scikit-learn.
*
* @param[in] handle RAFT handle
4 changes: 2 additions & 2 deletions cpp/include/cuml/tree/decisiontree.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -56,7 +56,7 @@ struct DecisionTreeParams {
*/
CRITERION split_criterion;
/**
* Minimum impurity decrease required for spliting a node. If the impurity decrease is below this
* Minimum impurity decrease required for splitting a node. If the impurity decrease is below this
* value, node is leafed out. Default is 0.0
*/
float min_impurity_decrease = 0.0f;
4 changes: 2 additions & 2 deletions cpp/include/cuml/tsa/arima_common.h
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2022, NVIDIA CORPORATION.
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -315,7 +315,7 @@ struct ARIMAMemory {

if (r <= 5) {
// Note: temp mem for the direct Lyapunov solver grows very quickly!
// This solver is used iff the condition above is satisifed
// This solver is used iff the condition above is satisfied
append_buffer<assign>(I_m_AxA_dense, r * r * r * r * batch_size);
append_buffer<assign>(I_m_AxA_batches, batch_size);
append_buffer<assign>(I_m_AxA_inv_dense, r * r * r * r * batch_size);
8 changes: 4 additions & 4 deletions cpp/scripts/gitutils.py
@@ -1,4 +1,4 @@
# Copyright (c) 2019-2021, NVIDIA CORPORATION.
# Copyright (c) 2019-2023, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -85,7 +85,7 @@ def repo_version_major_minor():
def determine_merge_commit(current_branch="HEAD"):
"""
When running outside of CI, this will estimate the target merge commit hash
of `current_branch` by finding a common ancester with the remote branch
of `current_branch` by finding a common ancestor with the remote branch
'branch-{major}.{minor}' where {major} and {minor} are determined from the
repo version.
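
The common-ancestor lookup described here amounts to something like the following git call (an illustrative sketch; the branch name is hypothetical, since the real one is derived from the repo version):

```bash
# Find the merge base (common ancestor) of the current branch and the
# inferred release branch; "branch-23.04" is a made-up example name.
git merge-base HEAD origin/branch-23.04
```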

@@ -211,8 +211,8 @@ def modifiedFiles(pathFilter=None):
If inside a CI-env (ie. TARGET_BRANCH and COMMIT_HASH are defined, and
current branch is "current-pr-branch"), then lists out all files modified
between these 2 branches. Locally, TARGET_BRANCH will try to be determined
from the current repo version and finding a coresponding branch named
'branch-{major}.{minor}'. If this fails, this functino will list out all
from the current repo version and finding a corresponding branch named
'branch-{major}.{minor}'. If this fails, this function will list out all
the uncommitted files in the current branch.

Such utility function is helpful while putting checker scripts as part of