forked from rapidsai/cuml
sync with upstream #32
Merged
Removes `-g` from the compile commands generated by distutils to compile Cython files. This will make our container images, conda packages, and python wheels smaller.
Closes #4054 Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4179
Summary of the changes:
- Remove some unused print functions
- Move validity checks into parameter construction, so parameters are checked by default
- Remove the Node_ID_info struct; we can just use a std::pair
- Move builder_base.cuh into builder.cuh
- Remove node.cuh. Use InstanceRange to store this information.
- Builder.train() directly returns a DT::TreeMetaDataNode<DataT, LabelT> object
- computeQuantiles is made into a pure function. Some weird usages of smart pointers removed.
- Unused DataInfo struct removed
- DecisionTree class member variables removed, member functions made into pure functions (static)
- Some unnecessary RandomForest member variables removed, destructor removed
- Some instances of new/delete changed to use std containers
- Tests for instance counts moved from python to gtest
- Change indexing type from 32-bit integers to std::size_t
- Test FIL predictions against RF predictions; fixes a case where ties in multi-class prediction are broken inconsistently in RF's CPU predictor

Authors: - Rory Mitchell (https://github.com/RAMitchell) Approvers: - Venkat (https://github.com/venkywonka) - Vinay Deshpande (https://github.com/vinaydes) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4166
Closes #4153. Authors: - Micka (https://github.com/lowener) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4163
Change the error type when trying to predict before fitting SVM to match sklearn. Fixes #4192 Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4198
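A minimal sketch of the behaviour described above, assuming the sklearn error in question is `NotFittedError` (which in scikit-learn derives from both `ValueError` and `AttributeError`). `ToyClassifier` and the local `NotFittedError` stand-in are hypothetical and are not cuML's SVM code:

```python
class NotFittedError(ValueError, AttributeError):
    """Local stand-in for sklearn.exceptions.NotFittedError (same base classes)."""


class ToyClassifier:
    """Hypothetical estimator used only to illustrate the error contract."""

    def fit(self, X, y):
        # The "model" is just the most frequent label seen during fit.
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        # Fitted attributes end with an underscore; their absence means
        # fit() was never called.
        if not hasattr(self, "majority_"):
            raise NotFittedError(
                "This ToyClassifier instance is not fitted yet. "
                "Call 'fit' before using this estimator."
            )
        return [self.majority_ for _ in X]
```

Because the exception derives from `ValueError` and `AttributeError`, callers that already catch either of those keep working.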
This is a continuation of PR #1763, #4053, and #4079, to add Categorical Naive Bayes. This is supposed to be merged after #4079. Linking issue #1666. Authors: - Micka (https://github.com/lowener) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #4150
…ghbors Estimator (#4178) This pull request partially solves [[FEA] #3461](#3461). This quick fix enables cuML's NearestNeighbor estimator to gracefully accept sklearn's `n_jobs` parameter as a pass-through. Its purpose is to allow Imbalanced-Learn samplers to rely on cuML's NearestNeighbor estimator without producing an error when setting the estimator's `n_jobs` parameter via `.set_params(**{"n_jobs": self.n_jobs})` [1](https://github.com/scikit-learn-contrib/imbalanced-learn/blob/edf6eae2c00f7fa6d76ee381f5b625155061a725/imblearn/over_sampling/_adasyn.py#L112) Authors: - https://github.com/NV-jpt Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4178
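The pass-through idea can be sketched as follows. `ToyNeighbors` is a hypothetical class, not cuML's estimator; it only shows how a declared but unused `n_jobs` parameter lets the `set_params` call quoted above succeed:

```python
class ToyNeighbors:
    """Hypothetical estimator that only mimics the parameter handling."""

    def __init__(self, n_neighbors=5, n_jobs=None):
        self.n_neighbors = n_neighbors
        # Accepted for scikit-learn API compatibility; never used for computation.
        self.n_jobs = n_jobs

    def get_params(self):
        return {"n_neighbors": self.n_neighbors, "n_jobs": self.n_jobs}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            # Unknown names are still rejected, as scikit-learn does.
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r} for ToyNeighbors")
            setattr(self, name, value)
        return self


# The call pattern used by the imbalanced-learn sampler:
nn = ToyNeighbors().set_params(**{"n_jobs": 4})
```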
Fixes the old build instructions for `cuml` cc @dantegd Authors: - https://github.com/shaneding Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4200
Authors: - Dante Gama Dessavre (https://github.com/dantegd) Approvers: - Victor Lafargue (https://github.com/viclafargue) - Corey J. Nolet (https://github.com/cjnolet) URL: #4205
…o distance metrics (#4155) -- This PR depends on RAFT PR - rapidsai/raft#306 -- Adds cpp & python interfaces for these distance metrics with pytest support for each of them. -- also remove redundant commented code in canberra distance metric Authors: - Mahesh Doijade (https://github.com/mdoijade) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #4155
Simplify the type check Authors: - Nanthini (https://github.com/Nanthini10) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4190
…rparts (#4130) This looks to me like a typo, and may be problematic and confusing if the `n_rows` and `n_cols` members from the base class instead of the ones from the derived class are accessed. Signed-off-by: Yitao Li <[email protected]> Authors: - Yitao Li (https://github.com/yitao-li) Approvers: - Micka (https://github.com/lowener) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4130
This spares consumers from having to add Thrust explicitly when consuming cuML in CMake. cc @shaneding Authors: - Dante Gama Dessavre (https://github.com/dantegd) Approvers: - Robert Maynard (https://github.com/robertmaynard) URL: #4209
This doesn't include treelite import (export). That will come in #4041 Authors: - https://github.com/levsnv Approvers: - Andy Adinets (https://github.com/canonizer) - Robert Maynard (https://github.com/robertmaynard) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4092
Fixes #3764,#2518 To do: - post charts confirming the improvement in accuracy - address python tests - benchmark Authors: - Rory Mitchell (https://github.com/RAMitchell) Approvers: - Vinay Deshpande (https://github.com/vinaydes) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4191
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4211
This PR ⬇️
* fixes #4193 and fixes #4194, which relate to an API incompatibility with dask-ml GridSearchCV
* changes the behaviour of cuML RF in the following cases:
  * In the not-so-uncommon case when `n_bins` > number of rows in the training sample, instead of throwing an error and exiting, the estimator prints a warning and sets `n_bins` to the number of training samples.
  * When `.predict()` is called using `float64` data, instead of throwing an error asking the user to explicitly specify `predict_model="CPU"` and rerun, a warning is displayed and prediction implicitly falls back from the default GPU-based prediction to CPU-based prediction.
  * Corresponding tests that capture the warnings above were added
* the estimators now accept both numbers and strings as input for the `split_criterion` parameter, in parity with sklearn's API, which takes strings as the criterion.
* the `split_algo` and `use_experimental_backend` parameters of the estimator class have now been completely removed from both documentation and warnings after deprecation in previous releases (from both single-GPU and dask RF).
* the `num_classes` parameter of the predict and score methods has similarly been removed

Authors: - Venkat (https://github.com/venkywonka) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) - Rory Mitchell (https://github.com/RAMitchell) URL: #4207
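A rough sketch of two of the behaviours listed above: clamping `n_bins` with a warning, and accepting `split_criterion` as either a number or a string. The helper names and the name-to-code table are hypothetical; the codes cuML actually uses may differ:

```python
import warnings

# Hypothetical name-to-code table, for illustration only.
CRITERION_CODES = {"gini": 0, "entropy": 1, "mse": 2}


def resolve_split_criterion(criterion):
    """Return an integer code for a criterion given as a name or a code."""
    if isinstance(criterion, str):
        key = criterion.lower()
        if key not in CRITERION_CODES:
            raise ValueError(f"Unknown split criterion: {criterion!r}")
        return CRITERION_CODES[key]
    if criterion in CRITERION_CODES.values():
        return int(criterion)
    raise ValueError(f"Unknown split criterion: {criterion!r}")


def clamp_n_bins(n_bins, n_rows):
    """Warn and fall back to n_rows when n_bins exceeds the sample count."""
    if n_bins > n_rows:
        warnings.warn(
            f"n_bins ({n_bins}) is larger than the number of training rows "
            f"({n_rows}); using n_bins={n_rows} instead.",
            UserWarning,
        )
        return n_rows
    return n_bins
```

Warning instead of raising keeps grid searches running when one candidate setting happens to be too large for a small training fold.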
When we make a new cuml version, we also need to bump the rapids-cmake version at the same time. Otherwise we will get the previous release's dependencies by mistake. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4213
This PR adds support for missing observations and for padding at the start of variable-length batches. Example: ![missing_obs_0](https://user-images.githubusercontent.com/17441062/125832072-1ff903c9-088e-4d77-9b17-be365890d982.png)

Note: I had to change the ARIMA tests because I used a different method than statsmodels (which is used as a reference in tests) to compute the initial parameter estimation. They cut all missing observations for their initial least-squares estimation, and I decided to fill them with naive replacements instead. This keeps the temporal relationships in the data and gives a much better initial estimate and often a better fit in the end, according to some MASE measurements I made. So I updated the integration test to use the MASE and pass if we are approximately the same _or better_ than statsmodels.

Authors: - Louis Sugy (https://github.com/Nyrio) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Tamas Bela Feher (https://github.com/tfeher) - Dante Gama Dessavre (https://github.com/dantegd) - Ray Douglass (https://github.com/raydouglass) URL: #4058
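For reference, the MASE-based pass criterion mentioned above could look roughly like this. It uses the usual definition of MASE (forecast MAE scaled by the in-sample one-step naive forecast MAE); the function names and the 5% tolerance are assumptions, not the values used in the cuML tests:

```python
def mase(y_train, y_true, y_pred):
    """Mean absolute scaled error of a forecast.

    The scale is the mean absolute error of the one-step naive forecast
    (previous value) on the training series.
    """
    naive_mae = sum(abs(b - a) for a, b in zip(y_train, y_train[1:])) / (
        len(y_train) - 1
    )
    forecast_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return forecast_mae / naive_mae


def same_or_better(candidate_mase, reference_mase, tolerance=0.05):
    """Pass when the candidate is at most `tolerance` worse than the reference."""
    return candidate_mase <= reference_mase * (1.0 + tolerance)
```

A one-sided check like this lets the test pass when the new initial estimate leads to a better fit than the reference, which a symmetric closeness check would reject.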
Forward-merge `branch-21.08` into `branch-21.10`
[gpuCI] Forward-merge branch-21.10 to branch-21.12 [skip gpuci]
* Adds the Poisson impurity criterion to RF, in parity with scikit-learn's RF regressor [[here](https://scikit-learn.org/stable/modules/tree.html#regression-criteria)]

EDIT:
* Also adds C++ level testing for RF objective function gains of Poisson and Gini.

Authors: - Venkat (https://github.com/venkywonka) Approvers: - Rory Mitchell (https://github.com/RAMitchell) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4156
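As an illustration of the criterion, here is a small pure-Python sketch of the half Poisson deviance impurity as defined in the scikit-learn documentation linked above, together with the gain of a candidate split. It is not the cuML kernel; the function names are hypothetical, and it assumes non-negative targets with a positive mean in every node:

```python
import math


def half_poisson_deviance(y):
    """Node impurity: mean of y*log(y/mean) - y + mean, with 0*log(0) = 0."""
    mean = sum(y) / len(y)
    if mean <= 0:
        raise ValueError("Poisson criterion needs a positive mean in the node")
    total = 0.0
    for value in y:
        if value < 0:
            raise ValueError("Poisson criterion needs non-negative targets")
        log_term = value * math.log(value / mean) if value > 0 else 0.0
        total += log_term - value + mean
    return total / len(y)


def split_gain(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        len(left) / n * half_poisson_deviance(left)
        + len(right) / n * half_poisson_deviance(right)
    )
    return half_poisson_deviance(parent) - weighted_children
```

A node whose targets are all equal has zero impurity, so a split that separates `[1, 1, 5, 5]` into `[1, 1]` and `[5, 5]` recovers the full parent impurity as gain.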
[gpuCI] Forward-merge branch-21.10 to branch-21.12 [skip gpuci]
The 2.1.0 version of Treelite incorporates the following major improvements:
* dmlc/treelite#311
* dmlc/treelite#302
* dmlc/treelite#303
* dmlc/treelite#296

In particular, dmlc/treelite#311 is a critical follow-up to #4191 and addresses a performance regression. Requires rapidsai/integration#353

Authors: - Philip Hyunsu Cho (https://github.com/hcho3) Approvers: - Jordan Jacobelli (https://github.com/Ethyling) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4220
[gpuCI] Forward-merge branch-21.10 to branch-21.12 [skip gpuci]
Benchmarks show that RF performs consistently better with pinned host memory, while DBSCAN is sometimes better and sometimes not (within the margin of error), so this PR uses pinned host memory by default for both of these algorithms. KMeans and LARS are ignored for now, as both show slightly better perf with pinned host memory but only with an increasing number of columns. Since this would need more analysis and a decision on whether a heuristic is needed for selecting memory, it is deferred to 21.12.

Here are the raw numbers:

1. LARS

Normal memory:
```
{'lars': {(100000, 10): 0.12429666519165039, (100000, 100): 0.015396833419799805, (100000, 250): 0.015408039093017578, (250000, 10): 0.00986933708190918, (250000, 100): 0.023822546005249023, (250000, 250): 0.03715157508850098, (500000, 10): 0.013423442840576172, (500000, 100): 0.044762372970581055, (500000, 250): 0.07782578468322754}
```
Pinned memory:
```
{'lars': {(100000, 10): 0.12958097457885742, (100000, 100): 0.01501011848449707, (100000, 250): 0.016597509384155273, (250000, 10): 0.01801013946533203, (250000, 100): 0.022644996643066406, (250000, 250): 0.037090301513671875, (500000, 10): 0.020437955856323242, (500000, 100): 0.044635772705078125, (500000, 250): 0.07696056365966797}
```

2. RFR

Normal memory:
```
'rfr': {(100000, 10): 1.1951744556427002, (100000, 100): 5.099738359451294, (100000, 250): 11.32804536819458, (250000, 10): 2.0097765922546387, (250000, 100): 9.109776496887207, (250000, 250): 21.058837890625, (500000, 10): 3.3387184143066406, (500000, 100): 15.802990436553955, (500000, 250): 36.80855870246887}
```
Pinned memory:
```
'rfr': {(100000, 10): 1.1727137565612793, (100000, 100): 4.804195880889893, (100000, 250): 11.621357917785645, (250000, 10): 1.8899295330047607, (250000, 100): 9.16961407661438, (250000, 250): 21.12194561958313, (500000, 10): 3.2937560081481934, (500000, 100): 15.66197681427002, (500000, 250): 36.6080117225647}
```

3. KMeans

Normal memory:
```
{(100000, 10): 0.11008882522583008, (100000, 100): 0.15475797653198242, (100000, 250): 0.15683507919311523, (250000, 10): 0.18775177001953125, (250000, 100): 0.25696277618408203, (250000, 250): 0.40389132499694824, (500000, 10): 0.4578282833099365, (500000, 100): 0.3917391300201416, (500000, 250): 0.6426849365234375}
```
Pinned memory:
```
'kmeans': {(100000, 10): 0.11982870101928711, (100000, 100): 0.16992664337158203, (100000, 250): 0.1021108627319336, (250000, 10): 0.16021251678466797, (250000, 100): 0.31025242805480957, (250000, 250): 0.298201322555542, (500000, 10): 0.21084189414978027, (500000, 100): 0.50473952293396, (500000, 250): 0.6191830635070801}
```

4. DBSCAN

Normal memory:
```
'dbscan': {(100000, 10): 0.4957292079925537, (100000, 100): 0.8680248260498047, (100000, 250): 1.585218906402588, (250000, 10): 4.52524995803833, (250000, 100): 7.175846099853516, (250000, 250): 12.135416269302368, (500000, 10): 26.427770853042603, (500000, 100): 37.57275915145874, (500000, 250): 57.98261737823486}}
```
Pinned memory:
```
'dbscan': {(100000, 10): 0.49578166007995605, (100000, 100): 0.8678708076477051, (100000, 250): 1.5854766368865967, (250000, 10): 4.526952505111694, (250000, 100): 7.172863006591797, (250000, 250): 12.145166397094727, (500000, 10): 26.422622680664062, (500000, 100): 37.56665277481079, (500000, 250): 58.02563738822937}}
```

Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Rory Mitchell (https://github.com/RAMitchell) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4215
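To compare the two memory modes from raw dictionaries like the ones above, a small helper can compute the normal-to-pinned time ratio per shape. The sketch below uses three of the RFR entries copied from the numbers above; the helper name is hypothetical, and the reading that these are wall-clock seconds keyed by `(n_rows, n_cols)` is an assumption:

```python
# Three RFR entries copied from the raw numbers above.
rfr_normal = {
    (100000, 10): 1.1951744556427002,
    (100000, 250): 11.32804536819458,
    (500000, 250): 36.80855870246887,
}
rfr_pinned = {
    (100000, 10): 1.1727137565612793,
    (100000, 250): 11.621357917785645,
    (500000, 250): 36.6080117225647,
}


def pinned_speedup(normal, pinned):
    """Ratio normal/pinned per shape; values above 1 mean pinned was faster."""
    return {shape: normal[shape] / pinned[shape] for shape in normal}


ratios = pinned_speedup(rfr_normal, rfr_pinned)
for shape, ratio in sorted(ratios.items()):
    print(shape, f"{ratio:.3f}")
```

Single runs this close together are hard to separate from noise, so the ratios are best read as a rough indication rather than a measured speedup.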
[gpuCI] Forward-merge branch-21.10 to branch-21.12 [skip gpuci]
Changes to be in-line with: rapidsai/cudf#9286 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4229
[gpuCI] Forward-merge branch-21.10 to branch-21.12 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Changes to be in-line with: rapidsai/cudf#9734 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) - AJ Schmidt (https://github.com/ajschmidt8) URL: #4390
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
cc @robertmaynard @quasiben @raydouglass Authors: - Dante Gama Dessavre (https://github.com/dantegd) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #4392
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Authors: - Philip Hyunsu Cho (https://github.com/hcho3) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4398
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
This PR uses Project Flash to build the cuML Python package, mirroring what the C++ flow looks like. Note: this is currently changed only for the CUDA 11.0 GPU test, since that one uses Python 3.7. To do the other jobs, we need to build the Python package twice on the CPU job.
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4396
Suggest using LinearSVM when the user chooses to use the linear kernel in SVM. The reason is that LinearSVM uses a specialized faster solver. Closes #1664 Also partially addresses #2857 Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4382
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4373
There were actually 2 minor issues that prevented `UMAPAlgo::Optimize::find_params_ab()` from being ASAN-clean:
- One is the memory leaks, of course
- Another is the `malloc()`-`delete` mismatch: only memory allocated using `new` or equivalent should be freed with operator `delete` or `delete[]`

Another issue that was also addressed here: exception safety (i.e., by using `make_unique` from C++14)

Signed-off-by: Yitao Li <[email protected]> Authors: - Yitao Li (https://github.com/yitao-li) Approvers: - Zach Bjornson (https://github.com/zbjornson) - Corey J. Nolet (https://github.com/cjnolet) URL: #4405
P_sum is equal to n. See #2622 where I made this change once before. #4208 changed it back while consolidating code. Authors: - Zach Bjornson (https://github.com/zbjornson) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #4425
This PR separates the Decision tree kernels into separate Translation Units (TU) and explicitly instantiates templates. This is helpful in 2 ways:
1. Refactoring top-level RF/DT code now would not require recompilation of the kernels
2. Since they are separated into different TUs and linked, they can leverage build parallelism (4x improvement in rebuild times after touching kernel definitions)

Comparison of rebuilding by running `time ./build.sh libcuml -v -n PARALLEL_LEVEL=20` after touching RF kernels: (Note: using `--ccache` doesn't matter here, assuming after touching RF kernels the state of the code-base is completely new and not part of ccache's hashed index)

<details><summary>This PR</summary>

```
real 0m20.054s
user 2m28.436s
sys 0m14.241s
```

</details>
<details><summary>branch-21.12</summary>

```
real 1m21.197s
user 2m5.751s
sys 0m6.050s
```

</details>

Some other changes include renaming and reorganizing files, pruning headers and cleaning up some code.

Things to do:
- [x] split DT Kernels
- [x] benchmark for regressions

Authors: - Venkat (https://github.com/venkywonka) Approvers: - Rory Mitchell (https://github.com/RAMitchell) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4299
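The 4x figure follows from the two `real` timings quoted above. A quick way to check it, assuming the usual `XmY.YYYs` format printed by the shell's `time` (the helper name is hypothetical):

```python
import re


def parse_time(text):
    """Convert a shell `time` value such as '1m21.197s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"Unrecognised time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` timings quoted above.
this_pr = parse_time("0m20.054s")
baseline = parse_time("1m21.197s")
speedup = baseline / this_pr
print(f"rebuild speedup: {speedup:.2f}x")
```

The same helper applied to the `user` lines shows where the gain comes from: total CPU time is similar on both branches, but on this PR it is spread across parallel compile jobs.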
Answers #4203 Just set in stone the warning filter for "Numerical issues". Authors: - Victor Lafargue (https://github.com/viclafargue) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4408
Closes #4047 Authors: - Philip Hyunsu Cho (https://github.com/hcho3) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #4402
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4389
* FIX Remove hard sklearn imports
* FIX Missing whitespace
* FIX minor error
* FIX PEP8 fixes
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Authors: - Divye Gala (https://github.com/divyegala) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #4313
Update `ucx-py` version on release using `rvc` Authors: - Jordan Jacobelli (https://github.com/Ethyling) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #4411
This PR updates the pinnings of the conda environment for CUDA 11.5 to use 22.02 packages. This resolves conflicts between a5e7cfb and #4364. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) - AJ Schmidt (https://github.com/ajschmidt8) URL: #4450
Authors: - Corey J. Nolet (https://github.com/cjnolet) - Dante Gama Dessavre (https://github.com/dantegd) Approvers: - Divye Gala (https://github.com/divyegala) - Dante Gama Dessavre (https://github.com/dantegd) - Jordan Jacobelli (https://github.com/Ethyling) URL: #4302
No description provided.