Slow tokenizer is used by default #105
First of all, thanks for a useful metric along with the code!

I've used it to evaluate a large set of internal documents, and it takes a while. Reviewing your code, I noticed that the slow (pure-Python) tokenizer is used by default (`use_fast=False`). It would be great either to switch the default to the fast tokenizers (written in Rust) or at least to parametrize this, so there is no need to hack the library code to make it work on a massive set of documents (tokenization is currently the bottleneck, especially on GPU-powered machines).

Thanks!
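For context on the speed gap being reported, here is a minimal timing sketch that calls the HuggingFace `transformers` tokenizers directly, outside of bert-score (the corpus and document count are made up; `roberta-large` is used because it is bert-score's default English model):

```python
import time

from transformers import AutoTokenizer

# Made-up stand-in for the "big set of internal documents" from the issue.
docs = ["An internal document that needs to be scored."] * 10_000

model_name = "roberta-large"  # bert-score's default model for English

for use_fast in (False, True):
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
    start = time.perf_counter()
    tok(docs, truncation=True)  # batch-tokenize the whole corpus
    print(f"use_fast={use_fast}: {time.perf_counter() - start:.1f}s")
```

On a corpus of this size the Rust-backed fast tokenizer is typically several times faster, which is why the pure-Python default can dominate end-to-end time even on GPU-powered machines.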
Comments

Hi @felixgwu, thanks for the prompt reply. So the issue is backward compatibility with previous results, or with rescaling, right? But in the case where, for example, a few different summaries are compared (e.g. produced by different models) without rescaling, it should be fine? If so, it would be useful to have bert-score parametrized with the tokenizer type, without the need to "hack" the library code (see the sketch after this comment).

Update: p.s.: there is an unrelated failing test
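To make the comparison scenario above concrete, here is a minimal sketch (the candidate and reference strings are invented; `rescale_with_baseline` is the library's existing flag for baseline rescaling, and it defaults to off):

```python
from bert_score import score

# Hypothetical outputs from two summarization models, with a shared reference.
refs = ["the reference summary of the document"]
model_a = ["summary of the document produced by model A"]
model_b = ["summary of the document produced by model B"]

# Without baseline rescaling, the raw scores of the two systems are still
# directly comparable to each other, as discussed above.
_, _, f1_a = score(model_a, refs, lang="en", rescale_with_baseline=False)
_, _, f1_b = score(model_b, refs, lang="en", rescale_with_baseline=False)
print(f"model A F1: {f1_a.mean().item():.4f}")
print(f"model B F1: {f1_b.mean().item():.4f}")
```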
Hi @nikitajz,

Hi @felixgwu, I've updated the mentioned files; hopefully I didn't miss anything. P.s.:

Great, thanks! I'll take a look soon.

Closing this since it is resolved in #106.
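For anyone landing on this issue later: after the fix, the fast tokenizer can be selected without patching the library. A sketch, assuming the option landed as `use_fast_tokenizer` (as in recent bert-score releases; verify against your installed version's signature):

```python
from bert_score import score

# Made-up inputs; the point here is the tokenizer flag.
P, R, F1 = score(
    ["a candidate summary"],
    ["the reference summary"],
    lang="en",
    use_fast_tokenizer=True,  # opt in to the Rust-backed fast tokenizer
)
```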