Fix categorical test in python #4326
rerun tests
Approved, provided that the comments are addressed.
There are still many comments, but they are mostly technical and about documentation.
rerun tests
I need to fix the lightgbm test failures
rerun tests
…on-categorical-test
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #4326 +/- ##
===============================================
Coverage ? 85.99%
===============================================
Files ? 231
Lines ? 18714
Branches ? 0
===============================================
Hits ? 16093
Misses ? 2621
Partials ? 0
@gpucibot merge
…provided to FIL model (#4314)

Fix potential CUDA context poison due to invalid global read when negative categories provided at inference: now equivalent to non-matching.

FIL now converts dummy nodes to numerical on import and never generates max_matching == -1 categorical features in test.
FIL will still generate empty categorical nodes in test (a non-empty bits vector which contains only zeros), export them as dummy numerical nodes and import again as dummy numerical nodes.
If a feature only contains dummy numerical nodes, it will be deemed a numerical feature (same as for non-dummy numerical nodes or a mix thereof). Therefore, categorical feature max_matching == -1 is still prevented.

CI failures
```
Test Result (2 failures / +2)
cuml.test.test_fil.test_lightgbm[5-2]
cuml.test.test_fil.test_lightgbm[5-5]
```
will be resolved by #4326

Authors:
- Levs Dolgovs (https://github.com/levsnv)

Approvers:
- Andy Adinets (https://github.com/canonizer)
- William Hicks (https://github.com/wphicks)

URL: #4314
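The "negative categories are equivalent to non-matching" behaviour described above can be illustrated with a small Python sketch. This is not FIL's code: `category_matches` is a hypothetical helper, and the byte-wise, least-significant-bit-first layout of `bits` is an assumption made only for this illustration.

```python
def category_matches(category, max_matching, bits):
    """Return True if `category` is set in the bit mask `bits`.

    Hypothetical helper, not part of FIL. Assumes one bit per category,
    packed into bytes with the least significant bit first.
    """
    # Negative categories, and categories beyond the largest one stored,
    # are treated as non-matching. The bit mask is never read for them,
    # which is what avoids an out-of-bounds read.
    if category < 0 or category > max_matching:
        return False
    return bool((bits[category // 8] >> (category % 8)) & 1)


# Categories 0, 2 and 9 are set in this two-byte mask.
mask = bytes([0b00000101, 0b00000010])
print(category_matches(2, 9, mask))
print(category_matches(-1, 9, mask))
```

The bounds check comes before the bit fetch on purpose: the lookup index is only computed for categories known to lie inside the mask.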
The current test has three issues in categorical data generation:
1. Train data and test data are numerically very different. In this case, only the first few categories are exercised during testing (low sensitivity).
2. During categorical conversion, columns are normalized row-wise instead of feature-wise. This doesn't lead to significantly different results, but it doesn't make sense.
3. Test data does not contain invalid categories.

This PR fully fixes 1 and 2 and partially fixes 3.
Since FIL currently does not handle all kinds of invalid categories gracefully, the test only covers the ones it can handle so far.

The test itself can be checked by applying the following change, which should make it fail:
```diff
--- a/cpp/src/fil/internal.cuh
+++ b/cpp/src/fil/internal.cuh
@@ -348,7 +348,7 @@ struct categorical_sets {
     // features with similar categorical feature count, we may consider
     // storing node ID within nodes with same feature ID and look up
     // {.max_matching, .first_node_offset} = ...[feature_id]
-    return category <= max_matching[node.fid()] && fetch_bit(bits + node.set(), category);
+    return category <= max_matching[node.fid()] ? fetch_bit(bits + node.set(), category) : 1;
   }
 
   static int sizeof_mask_from_max_matching(int max_matching) {
```
This will help test rapidsai#4314

Authors:
- Levs Dolgovs (https://github.com/levsnv)

Approvers:
- Andy Adinets (https://github.com/canonizer)
- William Hicks (https://github.com/wphicks)

URL: rapidsai#4326
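As an illustration of issue 2, here is a minimal NumPy sketch of feature-wise (per-column) normalization before binning values into category codes. It is not the code from this PR: `to_categorical_codes` and `n_categories` are hypothetical names, and equal-width binning is an assumption chosen for simplicity.

```python
import numpy as np


def to_categorical_codes(X, n_categories):
    """Map each column of X to integer codes in [0, n_categories - 1].

    Hypothetical helper for illustration only. Normalization is done
    per feature (axis=0), so every column spans the full range of codes.
    """
    X = np.asarray(X, dtype=np.float64)
    lo = X.min(axis=0, keepdims=True)
    hi = X.max(axis=0, keepdims=True)
    # Guard against constant columns to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (X - lo) / span  # each column now lies in [0, 1]
    codes = np.floor(scaled * n_categories).astype(np.int64)
    # The column maximum scales to exactly 1.0; clamp it into the last bin.
    return np.minimum(codes, n_categories - 1)


X = np.array([[0.0, 100.0], [5.0, 200.0], [10.0, 300.0]])
print(to_categorical_codes(X, 4))
```

Normalizing along `axis=1` instead would scale each row by its own minimum and maximum, so columns with very different magnitudes would not each cover the full set of categories.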