Add categorical detection to be coverage based in addition to unique count based #473
Conversation
Codecov Report
@@ Coverage Diff @@
## master #473 +/- ##
==========================================
+ Coverage 86.99% 87.00% +0.01%
==========================================
Files 345 345
Lines 11624 11643 +19
Branches 386 604 +218
==========================================
+ Hits 10112 10130 +18
- Misses 1512 1513 +1
Continue to review full report at Codecov.
So is this a WIP or something? :-P
@leahmcguire @Jauntbox No longer WIPWIPWIPWIPWIPWIPWIPWIPWIP
Please expand the comment on your case a bit and then LGTM
val vecMethod: TextVectorizationMethod = textStats match {
  // If the cardinality threshold is not respected but coverage is, then pivot the feature
  case _ if textStats.valueCounts.size > maxCard && textStats.valueCounts.size > topKValue && coverage > 0 &&
Even with the comment I am having trouble parsing which conditions you are looking for. Can you add a bit more?
Can you just change it to coverage > $(coveragePct) and get rid of the additional coverage > 0 check?
Related issues
Currently SmartTextVectorizer and SmartTextMapVectorizer will count the number of unique entries in a text field (up to a threshold, currently 50) and treat the feature as categorical if it has < 50 unique entries.
You can still run into features that are effectively categorical, but may have a long tail of low-frequency entries. We would get better signal extraction if we treated these as categorical instead of hashing them.
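The current unique-count rule can be sketched as follows. This is an illustrative Python sketch, not the actual Scala implementation in SmartTextVectorizer; the function name and early-exit structure are assumptions:

```python
# Illustrative sketch of the current unique-count rule (hypothetical
# names; the real logic lives in SmartTextVectorizer, in Scala).
MAX_CARDINALITY = 50  # current threshold mentioned above

def is_categorical_by_count(values):
    """Treat a text field as categorical only if it has fewer than
    MAX_CARDINALITY unique entries."""
    unique = set()
    for v in values:
        unique.add(v)
        if len(unique) >= MAX_CARDINALITY:
            return False  # too many distinct values: hash instead
    return True

# A long-tailed field: 3 dominant values plus 60 one-off entries.
field = ["red"] * 40 + ["blue"] * 35 + ["green"] * 20 \
    + [f"rare_{i}" for i in range(60)]
print(is_categorical_by_count(field))  # False: the tail pushes it past 50
```

Note how the long tail of rare entries disqualifies a field that is dominated by three values, which is exactly the case this issue targets.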
Describe the proposed solution
Add an extra check that allows Text(Map) features to be treated as categorical. This only applies to features whose cardinality is higher than the threshold and that would therefore be hashed.
A better approach to detecting text features that are really categorical is to use a coverage criterion. For example, if the topK entries with minimum support cover at least 90% of the entries, then the feature is a good candidate to pivot by entry instead of hashing by token. The 90% value can be tuned by the user via a parameter.
An extra check needs to pass: if there are m < topK elements with the required minimum support, then the coverage is computed over those m elements.
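A minimal Python sketch of the proposed coverage criterion (the function name and parameter defaults are assumptions for illustration, not the actual SmartTextVectorizer code):

```python
# Illustrative sketch of the proposed coverage rule (hypothetical names
# and defaults): pivot the feature if the top-K entries that meet the
# minimum support cover at least coverage_pct of all observed entries.
from collections import Counter

def is_categorical_by_coverage(values, top_k=20, min_support=2,
                               coverage_pct=0.90):
    counts = Counter(values)
    total = sum(counts.values())
    # Take at most top_k entries, keeping only those with enough support;
    # if fewer than top_k qualify (m < topK), coverage is computed over
    # those m elements, matching the extra check described above.
    supported = [c for _, c in counts.most_common(top_k) if c >= min_support]
    return sum(supported) / total >= coverage_pct

# 3 dominant values cover 95 of 100 entries despite a small tail.
field = ["red"] * 40 + ["blue"] * 35 + ["green"] * 20 \
    + [f"rare_{i}" for i in range(5)]
print(is_categorical_by_coverage(field))  # True: 95% >= 90% coverage
```

Under this rule the same long-tailed field that fails the unique-count check can still be pivoted, because the dominant entries carry most of the signal.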
Describe alternatives you've considered
I've considered using an Algebird Count-Min Sketch to compute the current TextStats. However, I ran into multiple issues: TopNCMS only returns the "heavy hitters", but much more than that (e.g. the cardinality) is needed in order to use the coverage method. A branch still exists (mw/coverage), but it is in shambles.
Additional context
Some criticism regarding TextStats: it does not appear to be a semigroup, since its combine operation is not associative. Was that intended?
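One hypothetical way a capped value-count merge can break associativity is a short-circuit once the left operand reaches the cardinality cap. This is NOT the actual TextStats implementation, just a sketch of the kind of truncation that fails the semigroup law:

```python
# Hypothetical model of a capped merge; NOT the actual TextStats code.
# Assumption: once the left operand reaches the cardinality cap, the
# merge short-circuits and returns it unchanged.
CAP = 2  # tiny cap so the effect is visible

def merge(a, b):
    if len(a) >= CAP:
        return dict(a)  # left side is "full": right side is dropped
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

a, b, c = {"x": 1}, {"y": 1}, {"z": 1}
left = merge(merge(a, b), c)   # {"x": 1, "y": 1}: c is dropped
right = merge(a, merge(b, c))  # {"x": 1, "y": 1, "z": 1}
print(left == right)  # False: (a+b)+c != a+(b+c)
```

In a model like this, the fold order determines which keys survive the cap, so the combine operation cannot satisfy the associativity law a semigroup requires.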