
[ENH] Extend default text vectorizer pre-processing to remove tokens of length 1. #2588

Closed
VibhuJawa opened this issue Jul 22, 2020 · 2 comments · Fixed by #2796
Labels
? - Needs Triage (Need team to review and classify) · feature request (New feature or request)

Comments

@VibhuJawa
Member

Is your feature request related to a problem? Please describe.

We should extend the default text vectorizer pre-processing to remove tokens of length 1, following scikit-learn.

Currently, we don't remove them, but now that rapidsai/cudf#5658 is in, we should be able to handle this too.
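
For illustration, a minimal sketch of the string-level filtering this could enable; I'm assuming the capability referenced here is exposed as cudf's Series.str.filter_tokens (the data is illustrative, not from the linked PR):

```python
import cudf

docs = cudf.Series(["a quick brown fox", "x y zebra"])

# Remove whitespace-delimited tokens shorter than 2 characters
# from each document string, mirroring scikit-learn's default.
cleaned = docs.str.filter_tokens(min_token_length=2)

# Short tokens such as 'a', 'x', and 'y' are dropped; exact
# whitespace handling around removed tokens may vary.
print(cleaned.to_pandas().tolist())
```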

Scikit-learn docs on the default token extractor:

The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

Divergence Example: https://gist.github.com/VibhuJawa/5f8e2666da9ecbb63d0398277fc93ac5
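
For reference, a short scikit-learn snippet demonstrating the default behavior described above (the example string is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# scikit-learn's default token_pattern is r"(?u)\b\w\w+\b",
# which only keeps tokens of 2 or more alphanumeric characters.
vec = CountVectorizer()
vec.fit(["a b c go going"])

# Single-character tokens 'a', 'b', and 'c' are dropped.
print(vec.get_feature_names())  # ['go', 'going']
```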

Token Level Preprocessing:

Another thing worth benchmarking will be switching all of the pre-processing to the token level, which rapidsai/cudf#5739 enables. I expect to see improvements, especially for datasets with divergent length characteristics.
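
As a rough sketch of what token-level filtering could look like with cudf's string methods (the wiring into the vectorizer itself is omitted; the data is illustrative):

```python
import cudf

docs = cudf.Series(["a quick brown fox", "x y zebra"])

# Tokenize on whitespace; cudf flattens every document's
# tokens into a single strings column.
tokens = docs.str.tokenize()

# Keep only tokens with 2 or more characters, matching
# scikit-learn's default token extractor.
filtered = tokens[tokens.str.len() >= 2]
print(filtered.to_pandas().tolist())
# ['quick', 'brown', 'fox', 'zebra']
```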

@VibhuJawa added the ? - Needs Triage and feature request labels on Jul 22, 2020
@VibhuJawa
Member Author

I plan to start a PR benchmarking and addressing this soon. 😊

@VibhuJawa changed the title from "[FEA] Extend default text vectorizer pre-processing to remove tokens of length 1." to "[ENH] Extend default text vectorizer pre-processing to remove tokens of length 1." on Jul 22, 2020
@VibhuJawa
Member Author

Just waiting for rapidsai/cudf#5975 to land.

Will update this and #2590 together once rapidsai/cudf#5975 lands.
