[ENH] Extend default text vectorizer pre-processing to remove tokens of length 1
Is your feature request related to a problem? Please describe.
We should extend the default text vectorizer pre-processing to remove tokens of length 1, following scikit-learn. Currently we don't remove them, but now that rapidsai/cudf#5658 is in, we should be able to handle this as well.
Scikit-learn docs on the default token extractor: the default token_pattern, r"(?u)\b\w\w+\b", selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
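For reference, a minimal sketch of the scikit-learn behaviour being matched (this is plain scikit-learn, not cuML code; `get_feature_names_out` assumes scikit-learn >= 1.0):

```python
# scikit-learn's default token_pattern r"(?u)\b\w\w+\b" only keeps tokens
# of 2+ alphanumeric characters, so length-1 tokens never enter the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a cat sat on a mat", "I saw 2 cats"]

vec = CountVectorizer()  # defaults: lowercase=True, token_pattern=r"(?u)\b\w\w+\b"
vec.fit(docs)
print(sorted(vec.get_feature_names_out()))
# ['cat', 'cats', 'mat', 'on', 'sat', 'saw'] -- 'a', 'I', and '2' are dropped
```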
Divergence Example: https://gist.github.com/VibhuJawa/5f8e2666da9ecbb63d0398277fc93ac5
Token Level Preprocessing:
Another good thing to benchmark would be switching all of the pre-processing to the token level, which rapidsai/cudf#5739 enables. I expect to see improvements, especially for datasets with divergent length characteristics.
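As a rough illustration (not the eventual cuML implementation), token-level filtering could look like the sketch below; it assumes cuDF's nvtext-backed `Series.str.filter_tokens` and its `min_token_length` parameter, whose exact names and defaults may vary across cudf versions:

```python
# Hedged sketch: drop tokens shorter than 2 characters in a single pass
# over a GPU strings column, instead of re-running character-level regexes.
import cudf

docs = cudf.Series(["a cat sat on a mat", "I saw 2 cats"])

filtered = docs.str.filter_tokens(min_token_length=2)  # assumed API
print(filtered.to_arrow().to_pylist())
# 'a', 'I', and '2' are removed; delimiters are kept, so extra whitespace
# may remain depending on the cudf version
```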
I plan to start a PR benchmarking and addressing this soon. 😊

VibhuJawa changed the title from "[FEA] Extend default text vectorizer pre-processing to remove tokens of length 1" to "[ENH] Extend default text vectorizer pre-processing to remove tokens of length 1" on Jul 22, 2020.