
[ENH] Extend default text vectorizer pre-processing to remove tokens of length 1. #2588

Closed
VibhuJawa opened this issue Jul 22, 2020 · 2 comments · Fixed by #2796
Labels
? - Needs Triage (Need team to review and classify) · feature request (New feature or request)

Comments

@VibhuJawa
Member

Is your feature request related to a problem? Please describe.

We should extend the default text vectorizer pre-processing to remove tokens of length 1, following scikit-learn.

Currently, we don't remove them, but now that rapidsai/cudf#5658 is in, we should be able to handle this too.
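
For illustration, a minimal sketch of the string-level filtering this could enable; I'm assuming the capability referenced here is exposed as cudf's Series.str.filter_tokens (the data is illustrative, not from the linked PR):

```python
import cudf

docs = cudf.Series(["a quick brown fox", "x y zebra"])

# Remove whitespace-delimited tokens shorter than 2 characters
# from each document string, mirroring scikit-learn's default.
cleaned = docs.str.filter_tokens(min_token_length=2)

# Short tokens such as 'a', 'x', and 'y' are dropped; exact
# whitespace handling around removed tokens may vary.
print(cleaned.to_pandas().tolist())
```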

Scikit-learn docs on the default token extractor:

The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

Divergence Example: https://gist.github.com/VibhuJawa/5f8e2666da9ecbb63d0398277fc93ac5
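
For reference, a short scikit-learn snippet demonstrating the default behavior described above (the example string is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# scikit-learn's default token_pattern is r"(?u)\b\w\w+\b",
# which only keeps tokens of 2 or more alphanumeric characters.
vec = CountVectorizer()
vec.fit(["a b c go going"])

# Single-character tokens 'a', 'b', and 'c' are dropped.
print(vec.get_feature_names())  # ['go', 'going']
```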

Token Level Preprocessing:

Another thing worth benchmarking will be switching all of the pre-processing to the token level, which rapidsai/cudf#5739 enables. I expect to see improvements, especially for datasets with divergent length characteristics.
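
As a rough sketch of what token-level filtering could look like with cudf's string methods (the wiring into the vectorizer itself is omitted; the data is illustrative):

```python
import cudf

docs = cudf.Series(["a quick brown fox", "x y zebra"])

# Tokenize on whitespace; cudf flattens every document's
# tokens into a single strings column.
tokens = docs.str.tokenize()

# Keep only tokens with 2 or more characters, matching
# scikit-learn's default token extractor.
filtered = tokens[tokens.str.len() >= 2]
print(filtered.to_pandas().tolist())
# ['quick', 'brown', 'fox', 'zebra']
```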

@VibhuJawa added the ? - Needs Triage and feature request labels on Jul 22, 2020
@VibhuJawa
Member Author

I plan to start a PR benchmarking and addressing this soon. 😊

@VibhuJawa changed the title from "[FEA] Extend default text vectorizer pre-processing to remove tokens of length 1." to "[ENH] Extend default text vectorizer pre-processing to remove tokens of length 1." on Jul 22, 2020
@VibhuJawa
Member Author

Just waiting for rapidsai/cudf#5975 to land.

Will update this and #2590 together once rapidsai/cudf#5975 lands.
