
[Whisper Tokenizer] Encode timestamps #26054

Merged: 16 commits merged into huggingface:main on Sep 14, 2023

Conversation

sanchit-gandhi (Contributor) commented Sep 8, 2023

What does this PR do?

As described in #24476, we have uploaded the Whisper timestamp tokens to the tokenizers on the Hub. This requires updating the Whisper tokenizer and the tokenizer tests to handle the newly added tokens.

cc @ydshieh @ArthurZucker
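
For reference, the newly added timestamp tokens follow the <|%.2f|> pattern in increments of the default 0.02 s time precision (the same pattern used in the tokenizer snippet quoted later in this thread). A quick sketch of how the token strings are generated:

# Illustration only: the 1501 timestamp token strings added to the Hub tokenizers,
# from <|0.00|> up to <|30.00|> in steps of the default 0.02 s time precision.
time_precision = 0.02
timestamp_tokens = ["<|%.2f|>" % (i * time_precision) for i in range(1500 + 1)]
print(timestamp_tokens[0], timestamp_tokens[1], timestamp_tokens[-1])  # <|0.00|> <|0.02|> <|30.00|>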

HuggingFaceDocBuilderDev commented Sep 8, 2023

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator) left a comment

As this only changes the tests, and the changes are due to the Hub files being updated, LGTM (I trust that the changes on the Hub are necessary and desired!).

Thank you!

ydshieh (Collaborator) commented Sep 8, 2023

BTW, do we expect these changes to show up in users' code results? I think the answer is yes, but I'm wondering why that is OK?

sanchit-gandhi (Contributor, Author) commented Sep 8, 2023

Alright, I've refactored the tokenizer a bit to maintain the same results as what we had before! Essentially, we only output the timestamp tokens if the user passes decode_with_timestamps=True. Otherwise, we filter them out of the token ids, maintaining the behaviour we had before, where the .decode method skipped them since they were OOV.

Previously, when the timestamps were not in the vocab:

  • decode_with_timestamps=False: timestamp tokens skipped from the .decode method since they're OOV
  • decode_with_timestamps=True: timestamp tokens manually added by the ._decode_with_timestamps method

Now, the timestamps are in the vocab:

  • decode_with_timestamps=False: timestamp tokens filtered out from within the .decode method (they're in-vocabulary now, so aren't automatically skipped)
  • decode_with_timestamps=True: timestamp tokens added automatically in the .decode method
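
To illustrate the new default behaviour, a minimal sketch with a hypothetical standalone helper (not the exact code in this PR):

# Hypothetical helper: drop the (now in-vocabulary) timestamp ids before decoding,
# so that plain .decode output matches the old behaviour where timestamps were OOV.
def filter_timestamp_ids(token_ids, tokenizer, time_precision=0.02):
    timestamp_tokens = ["<|%.2f|>" % (i * time_precision) for i in range(1500 + 1)]
    timestamp_ids = set(tokenizer.convert_tokens_to_ids(timestamp_tokens))
    return [token_id for token_id in token_ids if token_id not in timestamp_ids]

# tokenizer.decode(filter_timestamp_ids(ids, tokenizer))   -> text without timestamps
# tokenizer.decode(ids, decode_with_timestamps=True)       -> text with timestamp tokens kept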

How does this look to you @ArthurZucker @ydshieh?

ydshieh (Collaborator) commented Sep 8, 2023

Sounds good to me. Arthur knows this better and can provide better comments, if any, I believe.

ArthurZucker (Collaborator) left a comment

Thanks! Let's try to move the logic outside _decode and it should be good.

ArthurZucker added a commit to ArthurZucker/transformers that referenced this pull request Sep 9, 2023
ArthurZucker added a commit that referenced this pull request Sep 9, 2023
* skip failing tests until #26054 is merged

* fixup
@@ -605,21 +638,19 @@ def decode(
        )
        # retrieve offsets
        if output_offsets:
            offsets = None
            offsets = self._compute_offsets(token_ids, time_precision=time_precision)
sanchit-gandhi (Contributor, Author) commented Sep 9, 2023

We want to use token_ids here, not filtered_ids, since we need the timestamp ids to be present so that we can compute the offsets.

We later strip the timestamp ids from the chunk outputs in the _compute_offsets method.
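
For readers following along, a rough usage example of the decode / offsets behaviour described above (illustrative only; the printed values are approximate and openai/whisper-tiny is just an example checkpoint):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Now that the timestamp tokens are encodable, we can build a toy sequence containing them.
ids = tokenizer("<|0.00|> Hello world.<|1.02|>", add_special_tokens=False).input_ids

print(tokenizer.decode(ids))                               # timestamp tokens filtered out by default
print(tokenizer.decode(ids, decode_with_timestamps=True))  # timestamp tokens kept in the text
print(tokenizer.decode(ids, output_offsets=True))
# -> roughly {'text': ' Hello world.', 'offsets': [{'text': ' Hello world.', 'timestamp': (0.0, 1.02)}]}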

sanchit-gandhi (Contributor, Author) commented

The tests pass for me locally, but a small subset of the fast tokenizer tests time out on the CI: link

=> Is there any reason these tests should time out now that we have the expanded vocabulary with the newly added tokens? It seems to me that it's the same stage that always gets stuck:

    def _convert_token_to_id_with_added_voc(self, token: str) -> int:
        index = self._tokenizer.token_to_id(token)

ArthurZucker (Collaborator) left a comment

Might be the 1500 tokens


        if not decode_with_timestamps:
            # filter timestamp tokens if they are contained in the vocab
            timestamp_ids = self.convert_tokens_to_ids([("<|%.2f|>" % (i * time_precision)) for i in range(1500 + 1)])
Collaborator left a comment

This is probably super slow. We can / should cache it, wdyt?

Contributor (Author) commented

Good shout! Resolved in a81a1c9

Does the PR look good to you now?

sanchit-gandhi changed the title from "[Whisper Tokenizer] Fix tests after adding timestamps" to "[Whisper Tokenizer] Encode timestamps" on Sep 13, 2023
ArthurZucker (Collaborator) left a comment

LGTM, I think the lru cache is elegant, hope performance-wise it's also good!
CI seems to like it 🤗
If the precision doesn't change, this is never recomputed / automatically optimised by the LRU cache (never used it!)

sanchit-gandhi (Contributor, Author) commented

Yeah, it works pretty well: after the first cache step, tokenizer decoding is on par with what we had before.

That's correct regarding only computing the cache once: we'll likely never actually have to re-compute it, since in practice everyone will use a fixed time precision of 0.02. However, the code can handle multiple values of time_precision, which stays consistent with the .decode method, where we allow time_precision as an arg.
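
For readers following along, a minimal, stripped-down sketch of the caching idea (hypothetical wrapper class; not the exact code from a81a1c9):

from functools import lru_cache

class TimestampIdCache:
    # Hypothetical wrapper: computes the timestamp ids once per time_precision value
    # and then serves repeated decode calls from the cache.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    @lru_cache(maxsize=None)
    def timestamp_ids(self, time_precision=0.02):
        tokens = ["<|%.2f|>" % (i * time_precision) for i in range(1500 + 1)]
        return tuple(self.tokenizer.convert_tokens_to_ids(tokens))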

@sanchit-gandhi sanchit-gandhi merged commit ac957f6 into huggingface:main Sep 14, 2023
@sanchit-gandhi sanchit-gandhi deleted the whisper-fix-tests branch September 14, 2023 11:00
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* [Whisper Tokenizer] Fix tests after adding timestamps

* fix s2t tokenizer tests

* fix vocab test

* backwards comp

* fix tests

* comment

* style

* fix last test

* fix fast

* make faster

* move logic to decode

* remove skip test

* fix decode with offsets

* fix special tokens

* empty commit to re-trigger ci

* use lru cache
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023