Expand hypothesis testing to all linear models #4974

Draft
wants to merge 10,000 commits into
base: branch-23.02


csadorf
Contributor

@csadorf csadorf commented Nov 4, 2022

Should be merged after #4952 and #4973.

daxiongshu and others added 30 commits February 22, 2022 18:29
This PR adds a `get_params()` member function to `TargetEncoder`. Hopefully this resolves issue rapidsai#4574.
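
For context, a minimal sketch of the scikit-learn style `get_params()` convention that this change refers to. `TinyEncoder` and its constructor arguments are hypothetical stand-ins for illustration only; this is not cuML's `TargetEncoder` implementation.

```python
import inspect


class TinyEncoder:
    """Hypothetical estimator used only to illustrate get_params()."""

    def __init__(self, n_folds=4, smooth=0.0, seed=42):
        # Store every constructor argument under the same attribute name.
        self.n_folds = n_folds
        self.smooth = smooth
        self.seed = seed

    def get_params(self, deep=True):
        # Collect constructor argument names (excluding `self`) and read
        # their current values back from the instance.
        names = [
            name
            for name in inspect.signature(type(self).__init__).parameters
            if name != "self"
        ]
        return {name: getattr(self, name) for name in names}


enc = TinyEncoder(n_folds=5)
# A copy with identical settings can be rebuilt from the returned dict.
clone = TinyEncoder(**enc.get_params())
```

Returning constructor arguments as a plain dict is what lets tools that clone or grid-search an estimator rebuild it without knowing its class in advance.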

Authors:
  - Jiwei Liu (https://github.com/daxiongshu)

Approvers:
  - Victor Lafargue (https://github.com/viclafargue)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4588
The 2.3.0 version of Treelite incorporates the following improvements:
* GTIL optimization using multiple CPU threads (dmlc/treelite#353, dmlc/treelite#355, dmlc/treelite#357, dmlc/treelite#358, dmlc/treelite#362, dmlc/treelite#367)
* dmlc/treelite#365
* dmlc/treelite#366
* dmlc/treelite#368

Requires rapidsai/integration#436

Authors:
  - Philip Hyunsu Cho (https://github.com/hcho3)

Approvers:
  - William Hicks (https://github.com/wphicks)
  - Corey J. Nolet (https://github.com/cjnolet)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#4590
Depends on rapidsai#4295 

This PR allows `libcuml++` to be built with individual algorithms, or individual families of algorithms, via the `CUML_ALGORITHMS` argument. It defaults to `ALL` and accepts multiple options, for example:

```
cmake .. -DCUML_ALGORITHMS="FIL;TREESHAP"
```

which will build a `libcuml++` only containing FIL and GPUTreeSHAP components. 

A PR updating the build documentation will follow.

Authors:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - William Hicks (https://github.com/wphicks)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4296
This PR allows `TargetEncoder` to encode the `variance` of the target as requested by rapidsai#4440
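
As an illustration of what encoding the variance of the target means, here is a minimal pure-Python sketch. The function name `variance_encode` is hypothetical, and the use of the population variance is an assumption; real target encoders (cuML's included) typically also use out-of-fold statistics to limit target leakage, which this sketch omits.

```python
from collections import defaultdict
from statistics import pvariance


def variance_encode(categories, targets):
    """Replace each category with the variance of its targets (sketch)."""
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    # Population variance per category; an assumption made for this sketch.
    stats = {category: pvariance(values) for category, values in groups.items()}
    return [stats[category] for category in categories]


encoded = variance_encode(
    ["a", "a", "b", "b", "b"],
    [1.0, 3.0, 2.0, 2.0, 5.0],
)
print(encoded)  # [1.0, 1.0, 2.0, 2.0, 2.0]
```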

Authors:
  - Jiwei Liu (https://github.com/daxiongshu)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4483
Closes rapidsai#4341.
The `classmethod` decorator seems to not be useful here and is blocking the serialization of SimpleImputer.

Authors:
  - Micka (https://github.com/lowener)

Approvers:
  - Victor Lafargue (https://github.com/viclafargue)
  - William Hicks (https://github.com/wphicks)

URL: rapidsai#4439
Fixes rapidsai#4525 as well as a hard crash in the C++ benchmarks caused by recent changes in RAFT.

Authors:
  - Rory Mitchell (https://github.com/RAMitchell)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4594
…#4601)

This is the continuation of PR rapidsai#4588 to resolve issue rapidsai#4574

Authors:
  - Jiwei Liu (https://github.com/daxiongshu)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4601
Closes rapidsai#1666.
The implementation of this variant is straightforward and matches sklearn.

Authors:
  - Micka (https://github.com/lowener)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4595
See rapidsai#3569. XFailing right now to unblock CI.

Authors:
  - Micka (https://github.com/lowener)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4621
Uses a brute-force approach, in contrast to sklearn's KD-tree and ball-tree implementations.

Todo:
- [x] Implement sample method
- [x] Sample weights
- [x] Evaluate which metrics are missing
- [x] Tests for sample
- [x] Docstrings
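
To make the brute-force idea concrete, here is a minimal sketch of a weighted Gaussian kernel density estimate that visits every training point for every query point. The function name, the Gaussian kernel choice, and the argument names are assumptions for illustration; this is not the cuML implementation.

```python
import math


def gaussian_kde_log_density(train, query, bandwidth=1.0, sample_weight=None):
    """Brute-force log-density of each query point (illustrative sketch)."""
    n_dims = len(train[0])
    weights = sample_weight if sample_weight is not None else [1.0] * len(train)
    total_weight = sum(weights)
    # Log of the Gaussian normalisation constant for an isotropic kernel.
    log_norm = -0.5 * n_dims * math.log(2.0 * math.pi) - n_dims * math.log(bandwidth)
    log_densities = []
    for q in query:
        acc = 0.0
        # O(n_train) work per query point: no tree, every pair is evaluated.
        for x, w in zip(train, weights):
            sq_dist = sum((a - b) ** 2 for a, b in zip(q, x))
            acc += w * math.exp(-0.5 * sq_dist / bandwidth ** 2)
        log_densities.append(math.log(acc / total_weight) + log_norm)
    return log_densities


log_p = gaussian_kde_log_density([[0.0], [2.0]], [[1.0]], bandwidth=1.0)
```

The pairwise loop is embarrassingly parallel, which is presumably why a brute-force formulation suits a GPU; a tree-based method trades that regularity for fewer kernel evaluations.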

Authors:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4545
RAFT PR 513 changed the meaning of probability for the Bernoulli and Scaled Bernoulli distributions. This PR makes the corresponding change in cuML.

Authors:
  - Vinay Deshpande (https://github.com/vinaydes)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4628
In the near future, the [rapidsai/ops-bot](https://github.com/rapidsai/ops-bot) GitHub application that we use for GitHub automation will be enabled on all repositories in the `rapidsai` GitHub organization. Since not all features of the application are applicable to all repositories, this PR adds a new file, `.github/ops-bot.yaml`, which can configure which features are enabled per repository.

Authors:
  - AJ Schmidt (https://github.com/ajschmidt8)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai#4630
Templatize FIL types to add float64 support.

This is based on the work by @levsnv, specifically rapidsai#4569. This supersedes rapidsai#4569.

Authors:
  - Andy Adinets (https://github.com/canonizer)
  - Levs Dolgovs (https://github.com/levsnv)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#4625
…i#4633)

Adds an explicit option, similar to those for FAISS and Treelite, to build a single `libcuml++` that includes all RAFT binary dependencies.

Authors:
  - Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4633
RAPIDS is upgrading to a minimum dask version of `2022.02.1`. This PR updates the pinnings accordingly.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4632
[gpuCI] Forward-merge branch-22.04 to branch-22.06 [skip gpuci]
[gpuCI] Forward-merge branch-22.04 to branch-22.06 [skip gpuci]
This nanoPR fixes a performance regression caused by improper stream assignment to the decision trees.


Before fix:

| sno | algo | input | cu_time | cpu_time | cuml_acc | cpu_acc | speedup | n_samples | n_features | max_depth | n_estimators | n_bins | n_streams | n_jobs | n_classes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | RandomForestClassifier | numpy | 32.635321855545044 | 0.0 | 0.99468 | 0.0 | 0.0 | 800000 | 64 | 8 | 500 | 128 | 4 | -1 | 2 |
| 1 | RandomForestClassifier | numpy | 40.36453413963318 | 0.0 | 0.994855 | 0.0 | 0.0 | 800000 | 64 | 10 | 500 | 128 | 4 | -1 | 2 |
| 2 | RandomForestClassifier | numpy | 61.35148477554321 | 0.0 | 0.99504 | 0.0 | 0.0 | 800000 | 64 | 16 | 500 | 128 | 4 | -1 | 2 |

After fix:

| sno | algo | input | cu_time | cpu_time | cuml_acc | cpu_acc | speedup | n_samples | n_features | max_depth | n_estimators | n_bins | n_streams | n_jobs | n_classes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | RandomForestClassifier | numpy | 28.637776374816895 | 0.0 | 0.99468 | 0.0 | 0.0 | 800000 | 64 | 8 | 500 | 128 | 4 | -1 | 2 |
| 1 | RandomForestClassifier | numpy | 34.11380743980408 | 0.0 | 0.994855 | 0.0 | 0.0 | 800000 | 64 | 10 | 500 | 128 | 4 | -1 | 2 |
| 2 | RandomForestClassifier | numpy | 47.153409481048584 | 0.0 | 0.99504 | 0.0 | 0.0 | 800000 | 64 | 16 | 500 | 128 | 4 | -1 | 2 |

Command run in `cuml/`
```
python python/cuml/run_benchmarks.py --num-rows 800000 --num-features 64 --skip-cpu --test-split 0.2 --cuml-param-sweep "n_bins=[128]" "n_streams=[4]" --cpu-param-sweep "n_jobs=[-1]" --param-sweep "max_depth=[8,10,16]" "n_estimators=[500]" --n-reps 1 --csv pool-2112-cls-800000.csv --dataset-param-sweep "n_classes=[2]" --dtype "fp32" --dataset classification -- RandomForestClassifier
```
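
For a quick read of the before/after numbers, the relative reduction in `cu_time` per `max_depth` can be computed as below. The values are copied from the two benchmark tables; the helper name `relative_reduction` is just for illustration.

```python
# cu_time values keyed by max_depth, copied from the benchmark tables.
before = {8: 32.635321855545044, 10: 40.36453413963318, 16: 61.35148477554321}
after = {8: 28.637776374816895, 10: 34.11380743980408, 16: 47.153409481048584}


def relative_reduction(old, new):
    """Fraction of the old time that was saved."""
    return (old - new) / old


reductions = {
    depth: relative_reduction(before[depth], after[depth]) for depth in before
}
for depth, reduction in reductions.items():
    print(f"max_depth={depth}: {reduction:.1%} lower cu_time")
```

This works out to roughly 12%, 15%, and 23% lower `cu_time` at depths 8, 10, and 16 respectively, so the gain grows with tree depth in this run (a single repetition, per `--n-reps 1`).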

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4644
[gpuCI] Forward-merge branch-22.04 to branch-22.06 [skip gpuci]
@csadorf csadorf force-pushed the fea-expand-hypothesis-testing-to-all-linear-models branch from 6849d71 to a09b558 Compare December 6, 2022 12:12
- The strategy chooses the default for n_informative more smartly, so that
it satisfies the inequality assumption between the number of classes, the
number of clusters per class, and the number of informative features.
- The strategy tries to provide a more informative error message in case
the assumption cannot be met with the given arguments.
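
A minimal sketch of how such a default could be chosen. It assumes the inequality in question is the one scikit-learn's `make_classification` enforces, `n_classes * n_clusters_per_class <= 2 ** n_informative`; the commit message does not spell this out, and `choose_n_informative` is a hypothetical helper, not the code in this PR.

```python
def choose_n_informative(n_features, n_classes, n_clusters_per_class):
    """Smallest n_informative satisfying the assumed inequality (sketch)."""
    n_clusters = n_classes * n_clusters_per_class
    # (k - 1).bit_length() equals ceil(log2(k)) for k >= 1, using exact
    # integer arithmetic; at least one informative feature is always used.
    required = max(1, (n_clusters - 1).bit_length())
    if required > n_features:
        raise ValueError(
            f"n_classes ({n_classes}) * n_clusters_per_class "
            f"({n_clusters_per_class}) = {n_clusters} needs at least "
            f"{required} informative features, but n_features is {n_features}."
        )
    return required


default = choose_n_informative(n_features=10, n_classes=3, n_clusters_per_class=2)
```

Failing early with the offending values in the message is one way to make the error more informative than a generic assertion from deep inside the data generator.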
@csadorf csadorf force-pushed the fea-expand-hypothesis-testing-to-all-linear-models branch from c3149ad to b8baffe Compare December 6, 2022 20:20
@csadorf csadorf changed the title Fea expand hypothesis testing to all linear models Expand hypothesis testing to all linear models Dec 6, 2022
@csadorf csadorf added 0 - Blocked Cannot progress due to external reasons and removed 2 - In Progress Currenty a work in progress labels Dec 15, 2022
@csadorf
Contributor Author

csadorf commented Dec 15, 2022

We should merge #5065 and address #4963 before moving forward with this.

Labels
0 - Blocked Cannot progress due to external reasons Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change