Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add categorical detection to be coverage based in addition to unique count based #473

Merged
merged 12 commits into from
May 14, 2020

Conversation

michaelweilsalesforce
Copy link
Contributor

@michaelweilsalesforce michaelweilsalesforce commented Apr 21, 2020

Related issues
Currently SmartTextVectorizer and SmartTextMapVectorizer will count the number of unique entries in a text field (up to a threshold, currently 50) and treat the feature as categorical if it has < 50 unique entries.
You can still run into features that are effectively categorical, but may have a long tail of low-frequency entries. We would get better signal extraction if we treated these as categorical instead of hashing them.

Describe the proposed solution
Adding an extra check for Text(Map) features in order to become categoricals. This only applies to features that have a cardinality higher than the threshold and therefore would be hashed.

A better approach to detecting text features that are really categorical would be to use a coverage criteria. For example, the topK entries with minimum support cover at least 90% of the entries, then this would be a good feature to pivot by entry instead of hash by token. The value of 90% can be tuned by the user thanks to a param.

Extra checks need to be passed :

  • Cardinality must be greater than maxCard (already mentioned above).
  • Cardinality must also be greater than topK.
  • Finally, the computed coverage of the topK with minimum support must be > 0.

If there is m < TopK elements with the required minimum support, then we are looking at the coverage of these m elements.

Describe alternatives you've considered
I've considered using Algebird Count Min Sketch in order to compute the current TextStats.
However I ran into multiple issue :

  • Lack of transparency: TopNCMS only returns the "HeavyHitters" however you need much more than that(e.g. cardinality) in order to use the coverage method.
  • Serialization issues when writing to JSON

A branch still exists : mw/coverage, but it is in shambles.

Additional context
Some criticism regarding TextStats. It seems not to be a semi group as it is not associative. Was it intended?

@codecov
Copy link

codecov bot commented Apr 21, 2020

Codecov Report

Merging #473 into master will increase coverage by 0.01%.
The diff coverage is 95.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
+ Coverage   86.99%   87.00%   +0.01%     
==========================================
  Files         345      345              
  Lines       11624    11643      +19     
  Branches      386      604     +218     
==========================================
+ Hits        10112    10130      +18     
- Misses       1512     1513       +1     
Impacted Files Coverage Δ
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.17% <90.00%> (-0.42%) ⬇️
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d0c33e...ee1bdb2. Read the comment docs.

@leahmcguire
Copy link
Collaborator

So is this a WIP or something? :-P

@michaelweilsalesforce michaelweilsalesforce changed the title [WIP][WIP][WIP][WIP][WIP][WIP][WIP][WIP][WIP][WIP][WIP]Mw/increase card and coverage Add categorical detection to be coverage based in addition to unique count based Apr 22, 2020
@michaelweilsalesforce
Copy link
Contributor Author

@leahmcguire @Jauntbox No longer WIPWIPWIPWIPWIPWIPWIPWIPWIP

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please expand the comment on your case a bit and then LGTM

val vecMethod: TextVectorizationMethod = textStats match {
// If cardinality not respect, but coverage is, then pivot the feature
case _ if textStats.valueCounts.size > maxCard && textStats.valueCounts.size > topKValue && coverage > 0 &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with the comment I am having trouble parsing which conditions you are looking for - can you add a but more?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just change it to coverage > $(coveragePct) and get rid of the additional coverage > 0 check?

@michaelweilsalesforce michaelweilsalesforce merged commit 24cdbc4 into master May 14, 2020
@nicodv nicodv mentioned this pull request Jun 11, 2020
@tovbinm tovbinm deleted the mw/IncreaseCardAndCoverage branch June 12, 2020 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants