[Whisper Tokenizer] Encode timestamps #26054
Conversation
The documentation is not available anymore as the PR was closed or merged.
Since this only changes the tests, and the changes are due to the Hub files being updated, LGTM (I trust you that the changes on the Hub are necessary and desired!).
Thank you!
BTW, do we expect these changes to show up in users' code results? I think the answer is yes, but I'm wondering why that is OK?
Alright, I've refactored the tokenizer a bit to maintain the same results as what we had before! Essentially, we only output the timestamp tokens if the user passes decode_with_timestamps=True. Previously the timestamps were not in the vocab; now they are, so the default decode strips them to keep the old behaviour.
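A minimal sketch of the intended behaviour (assuming the openai/whisper-tiny checkpoint with the updated Hub files; the printed outputs are illustrative):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# the timestamp tokens now live in the vocab, so they survive an encode round-trip
token_ids = tokenizer("<|0.00|> Hey there! <|4.00|>", add_special_tokens=False).input_ids

# by default, decode strips the timestamp tokens, matching the previous behaviour
print(tokenizer.decode(token_ids))  # e.g. " Hey there!"

# timestamps are only kept when the user asks for them explicitly
print(tokenizer.decode(token_ids, decode_with_timestamps=True))  # e.g. "<|0.00|> Hey there! <|4.00|>"
```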
How does this look to you @ArthurZucker @ydshieh?
Sounds good to me. Arthur will know better and can provide better comments, if any, I believe.
Thanks! Let's try to move the logic outside _decode and it should be good.
* skip failing tests until #26054 is merged
* fixup
```diff
@@ -605,21 +638,19 @@ def decode(
         )
         # retrieve offsets
         if output_offsets:
-            offsets = None
+            offsets = self._compute_offsets(token_ids, time_precision=time_precision)
```
We want to use token_ids here, not filtered_ids, since we need the timestamp ids to be present so that we can compute the offsets. We later strip the timestamp ids from the chunk outputs in the _compute_offsets method.
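As a rough sketch of that call path (checkpoint assumed, output shape illustrative):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
token_ids = tokenizer("<|0.00|> Hey there! <|4.00|>", add_special_tokens=False).input_ids

# the un-filtered ids still contain the timestamp tokens, so _compute_offsets can
# segment the sequence on them and turn them into start / end times
out = tokenizer.decode(token_ids, output_offsets=True)
# roughly: {"text": " Hey there!",
#           "offsets": [{"text": " Hey there!", "timestamp": (0.0, 4.0)}]}
```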
The tests pass for me locally, but a small subset of the fast tokenizer tests time out on the CI: link. Any reason these tests should time out now that we have the expanded vocabulary with the new added tokens? It seems to me that it's the same stage that always gets stuck: transformers/src/transformers/tokenization_utils_fast.py, lines 281 to 282 (at 8f609ab).
Might be the 1500 tokens.
```python
if not decode_with_timestamps:
    # filter timestamp tokens if they are contained in the vocab
    timestamp_ids = self.convert_tokens_to_ids([("<|%.2f|>" % (i * time_precision)) for i in range(1500 + 1)])
```
This is probably super slow. We can / should cache it, wdyt?
Good shout! Resolved in a81a1c9
Does the PR look good to you now?
LGTM, I think the lru cache is elegant, hope performance wise it's also good!
CI seems to like it 🤗
If precision doesn't change this is never recomputed / automatically optimised by LRU cache (never used it!)
Yeah, works pretty well: after the first cache step, decoding is on par with what we had before. That's correct regarding only computing the cache once: we'll likely never actually have to re-compute it, since in practice everyone uses a fixed time-precision.
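Roughly, the idea looks like this (a sketch only; the function name and the default precision are assumptions, the real change lives in the commits above):

```python
from functools import lru_cache

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

@lru_cache(maxsize=None)
def timestamp_ids(time_precision: float = 0.02):  # 0.02 is an assumed, typical precision
    # the 1501 token -> id conversions run once per distinct precision;
    # every later decode with the same precision is served from the cache
    return tuple(
        tokenizer.convert_tokens_to_ids(
            ["<|%.2f|>" % (i * time_precision) for i in range(1500 + 1)]
        )
    )
```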
…gingface#26063)
* skip failing tests until huggingface#26054 is merged
* fixup

* [Whisper Tokenizer] Fix tests after adding timestamps
* fix s2t tokenizer tests
* fix vocab test
* backwards comp
* fix tests
* comment
* style
* fix last test
* fix fast
* make faster
* move logic to decode
* remove skip test
* fix decode with offsets
* fix special tokens
* empty commit to re-trigger ci
* use lru cache
What does this PR do?
As described in #24476, we have uploaded the Whisper timestamp tokens to the tokenizers on the Hub. This requires updating the Whisper tokenizer and the tokenizer tests to handle the newly added tokens.
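For example, a quick sanity check that the timestamp tokens are now part of the vocabulary (checkpoint assumed; any Whisper checkpoint with the updated Hub files should behave the same):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# with the updated Hub files these resolve to real token ids rather than the unk id
print(tokenizer.convert_tokens_to_ids(["<|0.00|>", "<|0.02|>", "<|30.00|>"]))
```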
cc @ydshieh @ArthurZucker