Fix categorical test in python #4326

levsnv · 2021-11-04T07:23:09Z

The current test has three issues in categorical data generation:

1. Train data and test data are numerically very different. As a result, only the first few categories are exercised during testing (low sensitivity).
2. During categorical conversion, columns are normalized row-wise instead of feature-wise. This doesn't lead to significantly different results, but it doesn't make sense.
3. Test data does not contain invalid categories.

This PR fully fixes 1 and 2 and partially fixes 3. Since FIL currently does not handle all kinds of invalid categories gracefully, the test only covers the ones it can handle so far.
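
The three points above can be illustrated with a minimal NumPy sketch. This is not the code from `test_fil.py`: the helper name `to_categorical`, the category count, and the choice of "slightly out of range" invalid values are assumptions made purely for illustration.

```python
import numpy as np


def to_categorical(X, n_categories, rng, invalid_frac=0.0):
    """Hypothetical helper: map float features to integer category codes.

    Each column (feature) is normalized by its own min/max, so every
    feature spans the full range of category codes.
    """
    X = np.asarray(X, dtype=np.float64)
    # Feature-wise normalization: statistics are taken along axis 0 (per column).
    col_min = X.min(axis=0, keepdims=True)
    col_max = X.max(axis=0, keepdims=True)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    unit = (X - col_min) / span  # values in [0, 1]
    codes = np.minimum((unit * n_categories).astype(np.int64), n_categories - 1)
    if invalid_frac > 0.0:
        # Assumption: "invalid" means slightly past the largest valid code.
        mask = rng.random(codes.shape) < invalid_frac
        codes = np.where(mask, n_categories + rng.integers(0, 3, codes.shape), codes)
    return codes


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
X_test = rng.normal(size=(50, 4))  # drawn from the same distribution as train
train_codes = to_categorical(X_train, n_categories=8, rng=rng)
test_codes = to_categorical(X_test, n_categories=8, rng=rng, invalid_frac=0.1)
```

Because train and test rows come from the same distribution and are normalized per feature, the test rows exercise the whole category range rather than only the first few codes, and a fraction of them carry codes the model never saw during training.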

The test itself can be checked by applying the following change:

```diff
--- a/cpp/src/fil/internal.cuh
+++ b/cpp/src/fil/internal.cuh
@@ -348,7 +348,7 @@ struct categorical_sets {
     // features with similar categorical feature count, we may consider
     // storing node ID within nodes with same feature ID and look up
     // {.max_matching, .first_node_offset} = ...[feature_id]
-    return category <= max_matching[node.fid()] && fetch_bit(bits + node.set(), category);
+    return category <= max_matching[node.fid()] ? fetch_bit(bits + node.set(), category) : 1;
   }
   static int sizeof_mask_from_max_matching(int max_matching)
   {
```
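
To see why only out-of-range categories can distinguish the two versions of this line, here is a small Python model of the predicate. It is a sketch only: `fetch_bit`, `matches_original`, and `matches_modified` are hypothetical Python stand-ins written for illustration, not FIL's API, and the bit layout (least-significant bit first within each byte) is an assumption.

```python
def fetch_bit(bits, category):
    """Return bit number `category` of a bytes-like bitset (LSB-first per byte)."""
    return (bits[category // 8] >> (category % 8)) & 1


def matches_original(bits, max_matching, category):
    # Mirrors the removed line: out-of-range categories never match.
    return category <= max_matching and bool(fetch_bit(bits, category))


def matches_modified(bits, max_matching, category):
    # Mirrors the added line: out-of-range categories always match.
    return bool(fetch_bit(bits, category)) if category <= max_matching else True


bits = bytes([0b00000101])  # categories 0 and 2 are in the set
max_matching = 7            # largest category representable in this bitset

for category in (0, 1, 2, 9):
    print(category,
          matches_original(bits, max_matching, category),
          matches_modified(bits, max_matching, category))
```

In this model the two predicates agree for every category from 0 to `max_matching` and differ only for category 9, so test data without invalid categories could not tell them apart.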

This will help test #4314


dantegd · 2021-11-04T14:45:15Z

rerun tests


canonizer

Approved, provided that the comments are addressed.

There are still many comments, but they are mostly technical and about documentation.


dantegd · 2021-11-10T17:47:27Z

rerun tests

levsnv · 2021-11-11T08:33:25Z

I need to fix the lightgbm test failures


levsnv · 2021-11-12T06:10:24Z

rerun tests


codecov-commenter · 2021-11-13T01:40:41Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.12@cfd536c).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.12    #4326   +/-   ##
===============================================
  Coverage                ?   85.99%           
===============================================
  Files                   ?      231           
  Lines                   ?    18714           
  Branches                ?        0           
===============================================
  Hits                    ?    16093           
  Misses                  ?     2621           
  Partials                ?        0

| Flag | Coverage Δ |
|---|---|
| dask | `46.96% <0.00%> (?)` |
| non-dask | `78.68% <0.00%> (?)` |

Last update cfd536c...39429db.

wphicks · 2021-11-15T14:04:49Z

@gpucibot merge

…provided to FIL model (#4314)

Fix potential CUDA context poison due to invalid global read when negative categories provided at inference: now equivalent to non-matching.

FIL now converts dummy nodes to numerical on import and never generates max_matching == -1 categorical features in test. FIL will still generate empty categorical nodes in test (a non-empty bits vector which contains only zeros), export them as dummy numerical nodes and import again as dummy numerical nodes. If a feature only contains dummy numerical nodes, it will be deemed a numerical feature (same as for non-dummy numerical nodes or a mix thereof). Therefore, categorical feature max_matching == -1 is still prevented.

CI failures
```
Test Result (2 failures / +2)
cuml.test.test_fil.test_lightgbm[5-2]
cuml.test.test_fil.test_lightgbm[5-5]
```
will be resolved by #4326

Authors:
- Levs Dolgovs (https://github.com/levsnv)

Approvers:
- Andy Adinets (https://github.com/canonizer)
- William Hicks (https://github.com/wphicks)

URL: #4314



make categorical test predict on categorical data

05e781e

levsnv requested a review from a team as a code owner November 4, 2021 07:23

levsnv requested a review from canonizer November 4, 2021 07:23

github-actions bot added the Cython / Python Cython or Python issue label Nov 4, 2021

levsnv added non-breaking Non-breaking change tests Unit testing for project labels Nov 4, 2021

levsnv added 2 commits November 4, 2021 00:27

added invalid categories (only small out of range ones) to test

51f8ff2

style

7803ed8

levsnv added Tech Debt Issues related to debt bug Something isn't working labels Nov 4, 2021

levsnv commented Nov 4, 2021


levsnv mentioned this pull request Nov 4, 2021

Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.12] #4314

Merged

levsnv added the 4 - Waiting on Reviewer Waiting for reviewer to review or respond label Nov 4, 2021

levsnv mentioned this pull request Nov 4, 2021

Fix potential CUDA context poison when negative (invalid) categories provided to FIL model [21.10] #4315

Merged

canonizer reviewed Nov 5, 2021


levsnv added 4 - Waiting on Author Waiting for author to respond to review and removed 4 - Waiting on Reviewer Waiting for reviewer to review or respond labels Nov 8, 2021

moved to local generator that's passed in a thread-safe manner

88b6543

levsnv requested a review from canonizer November 9, 2021 04:09

levsnv removed the 4 - Waiting on Author Waiting for author to respond to review label Nov 9, 2021

remove extra change

fa368e4

levsnv added the 4 - Waiting on Reviewer Waiting for reviewer to review or respond label Nov 9, 2021

canonizer approved these changes Nov 9, 2021


canonizer reviewed Nov 9, 2021


levsnv added 2 commits November 10, 2021 18:28

addressed review comments

8960f30

style

6aa848d

levsnv added 3 - Ready for Review Ready for review by team 4 - Waiting on Author Waiting for author to respond to review and removed 4 - Waiting on Reviewer Waiting for reviewer to review or respond 3 - Ready for Review Ready for review by team labels Nov 11, 2021

levsnv added 2 commits November 11, 2021 14:45

Merge branch 'branch-21.12' of github.com:rapidsai/cuml into fix-python-categorical-test

977098a

concatenate

98ea0cb

levsnv removed the 4 - Waiting on Author Waiting for author to respond to review label Nov 12, 2021

Merge branch 'branch-21.12' of github.com:rapidsai/cuml into fix-python-categorical-test

39429db

wphicks approved these changes Nov 15, 2021


rapids-bot bot merged commit f4c098c into rapidsai:branch-21.12 Nov 15, 2021
