Add categorical detection to be coverage based in addition to unique count based #473
Conversation
Codecov Report
@@ Coverage Diff @@
## master #473 +/- ##
==========================================
+ Coverage 86.99% 87.00% +0.01%
==========================================
Files 345 345
Lines 11624 11643 +19
Branches 386 604 +218
==========================================
+ Hits 10112 10130 +18
- Misses 1512 1513 +1
Continue to review full report at Codecov.
So is this a WIP or something? :-P
@leahmcguire @Jauntbox No longer WIPWIPWIPWIPWIPWIPWIPWIPWIP
Please expand the comment on your case a bit and then LGTM
val vecMethod: TextVectorizationMethod = textStats match {
  // If the cardinality threshold is not respected but coverage is, then pivot the feature
  case _ if textStats.valueCounts.size > maxCard && textStats.valueCounts.size > topKValue && coverage > 0 &&
Even with the comment I am having trouble parsing which conditions you are looking for. Can you add a bit more?
Can you just change it to coverage > $(coveragePct) and get rid of the additional coverage > 0 check?
Related issues
Currently SmartTextVectorizer and SmartTextMapVectorizer will count the number of unique entries in a text field (up to a threshold, currently 50) and treat the feature as categorical if it has < 50 unique entries.
You can still run into features that are effectively categorical, but may have a long tail of low-frequency entries. We would get better signal extraction if we treated these as categorical instead of hashing them.
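The current unique-count rule can be sketched as follows. This is an illustrative Python sketch, not the actual Scala implementation in SmartTextVectorizer; the function name and early-exit structure are assumptions:

```python
# Illustrative sketch of the current unique-count rule (hypothetical
# names; the real logic lives in SmartTextVectorizer, in Scala).
MAX_CARDINALITY = 50  # current threshold mentioned above

def is_categorical_by_count(values):
    """Treat a text field as categorical only if it has fewer than
    MAX_CARDINALITY unique entries."""
    unique = set()
    for v in values:
        unique.add(v)
        if len(unique) >= MAX_CARDINALITY:
            return False  # too many distinct values: hash instead
    return True

# A long-tailed field: 3 dominant values plus 60 one-off entries.
field = ["red"] * 40 + ["blue"] * 35 + ["green"] * 20 \
    + [f"rare_{i}" for i in range(60)]
print(is_categorical_by_count(field))  # False: the tail pushes it past 50
```

Note how the long tail of rare entries disqualifies a field that is dominated by three values, which is exactly the case this issue targets.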
Describe the proposed solution
Add an extra check that allows Text(Map) features to be treated as categorical. This only applies to features whose cardinality is higher than the threshold and that would therefore be hashed.
A better approach to detecting text features that are really categorical is to use a coverage criterion. For example, if the topK entries with minimum support cover at least 90% of the entries, then the feature is a good candidate to pivot by entry instead of hashing by token. The 90% value can be tuned by the user via a parameter.
An extra check needs to pass: if there are m < topK elements with the required minimum support, then the coverage is computed over those m elements.
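A minimal Python sketch of the proposed coverage criterion (the function name and parameter defaults are assumptions for illustration, not the actual SmartTextVectorizer code):

```python
# Illustrative sketch of the proposed coverage rule (hypothetical names
# and defaults): pivot the feature if the top-K entries that meet the
# minimum support cover at least coverage_pct of all observed entries.
from collections import Counter

def is_categorical_by_coverage(values, top_k=20, min_support=2,
                               coverage_pct=0.90):
    counts = Counter(values)
    total = sum(counts.values())
    # Take at most top_k entries, keeping only those with enough support;
    # if fewer than top_k qualify (m < topK), coverage is computed over
    # those m elements, matching the extra check described above.
    supported = [c for _, c in counts.most_common(top_k) if c >= min_support]
    return sum(supported) / total >= coverage_pct

# 3 dominant values cover 95 of 100 entries despite a small tail.
field = ["red"] * 40 + ["blue"] * 35 + ["green"] * 20 \
    + [f"rare_{i}" for i in range(5)]
print(is_categorical_by_coverage(field))  # True: 95% >= 90% coverage
```

Under this rule the same long-tailed field that fails the unique-count check can still be pivoted, because the dominant entries carry most of the signal.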
Describe alternatives you've considered
I've considered using an Algebird Count-Min Sketch to compute the current TextStats. However, I ran into multiple issues: TopNCMS only returns the "heavy hitters", but much more than that (e.g. the cardinality) is needed in order to use the coverage method. A branch still exists (mw/coverage), but it is in shambles.
Additional context
Some criticism regarding TextStats: it does not appear to be a semigroup, since its combine operation is not associative. Was that intended?
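One hypothetical way a capped value-count merge can break associativity is a short-circuit once the left operand reaches the cardinality cap. This is NOT the actual TextStats implementation, just a sketch of the kind of truncation that fails the semigroup law:

```python
# Hypothetical model of a capped merge; NOT the actual TextStats code.
# Assumption: once the left operand reaches the cardinality cap, the
# merge short-circuits and returns it unchanged.
CAP = 2  # tiny cap so the effect is visible

def merge(a, b):
    if len(a) >= CAP:
        return dict(a)  # left side is "full": right side is dropped
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

a, b, c = {"x": 1}, {"y": 1}, {"z": 1}
left = merge(merge(a, b), c)   # {"x": 1, "y": 1}: c is dropped
right = merge(a, merge(b, c))  # {"x": 1, "y": 1, "z": 1}
print(left == right)  # False: (a+b)+c != a+(b+c)
```

In a model like this, the fold order determines which keys survive the cap, so the combine operation cannot satisfy the associativity law a semigroup requires.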