
[Whisper Tokenizer] Encode timestamps #26054

Merged: 16 commits merged into huggingface:main on Sep 14, 2023

Conversation

sanchit-gandhi (Contributor) commented Sep 8, 2023

What does this PR do?

As described in #24476, we have uploaded the Whisper timestamp tokens to the tokenizers on the Hub. This requires updating the Whisper tokenizer and the tokenizer tests to handle the newly added tokens.

cc @ydshieh @ArthurZucker
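
For reference, the newly added timestamp tokens follow the <|%.2f|> pattern in increments of the default 0.02 s time precision (the same pattern used in the tokenizer snippet quoted later in this thread). A quick sketch of how the token strings are generated:

# Illustration only: the 1501 timestamp token strings added to the Hub tokenizers,
# from <|0.00|> up to <|30.00|> in steps of the default 0.02 s time precision.
time_precision = 0.02
timestamp_tokens = ["<|%.2f|>" % (i * time_precision) for i in range(1500 + 1)]
print(timestamp_tokens[0], timestamp_tokens[1], timestamp_tokens[-1])  # <|0.00|> <|0.02|> <|30.00|>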

HuggingFaceDocBuilderDev commented Sep 8, 2023

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator) left a comment

As this only changes the tests, and the changes are due to the Hub files being updated, LGTM (I trust that the changes on the Hub are necessary and desired!).

Thank you!

ydshieh (Collaborator) commented Sep 8, 2023

BTW, do we expect these changes to show up in users' code results? I think the answer is yes, but I'm wondering why that is OK?

sanchit-gandhi (Contributor, Author) commented Sep 8, 2023

Alright, I've refactored the tokenizer a bit to maintain the same results as what we had before! Essentially, we only output the timestamp tokens if the user passes decode_with_timestamps=True. Otherwise, we filter them out of the token ids, maintaining the behaviour we had before, where the .decode method skipped them since they were OOV.

Previously, when the timestamps were not in the vocab:

  • decode_with_timestamps=False: timestamp tokens skipped from the .decode method since they're OOV
  • decode_with_timestamps=True: timestamp tokens manually added by the ._decode_with_timestamps method

Now, the timestamps are in the vocab:

  • decode_with_timestamps=False: timestamp tokens filtered out from within the .decode method (they're in-vocabulary now, so aren't automatically skipped)
  • decode_with_timestamps=True: timestamp tokens added automatically in the .decode method
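
To illustrate the new default behaviour, a minimal sketch with a hypothetical standalone helper (not the exact code in this PR):

# Hypothetical helper: drop the (now in-vocabulary) timestamp ids before decoding,
# so that plain .decode output matches the old behaviour where timestamps were OOV.
def filter_timestamp_ids(token_ids, tokenizer, time_precision=0.02):
    timestamp_tokens = ["<|%.2f|>" % (i * time_precision) for i in range(1500 + 1)]
    timestamp_ids = set(tokenizer.convert_tokens_to_ids(timestamp_tokens))
    return [token_id for token_id in token_ids if token_id not in timestamp_ids]

# tokenizer.decode(filter_timestamp_ids(ids, tokenizer))   -> text without timestamps
# tokenizer.decode(ids, decode_with_timestamps=True)       -> text with timestamp tokens kept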

How does this look to you @ArthurZucker @ydshieh?

ydshieh (Collaborator) commented Sep 8, 2023

Sounds good to me. Arthur knows this better and can provide better comments, if any, I believe.

ArthurZucker (Collaborator) left a comment

Thanks! Let's try to move the logic outside _decode and it should be good.

ArthurZucker added a commit to ArthurZucker/transformers that referenced this pull request Sep 9, 2023
ArthurZucker added a commit that referenced this pull request Sep 9, 2023
* skip failing tests until #26054 is merged

* fixup
@@ -605,21 +638,19 @@ def decode(
        )
        # retrieve offsets
        if output_offsets:
            offsets = None
            offsets = self._compute_offsets(token_ids, time_precision=time_precision)
sanchit-gandhi (Contributor, Author) commented Sep 9, 2023

We want to use token_ids here, not filtered_ids, since we need the timestamp ids to be present so that we can compute the offsets.

We later strip the timestamp ids from the chunk outputs in the _compute_offsets method.
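
For readers following along, a rough usage example of the decode / offsets behaviour described above (illustrative only; the printed values are approximate and openai/whisper-tiny is just an example checkpoint):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Now that the timestamp tokens are encodable, we can build a toy sequence containing them.
ids = tokenizer("<|0.00|> Hello world.<|1.02|>", add_special_tokens=False).input_ids

print(tokenizer.decode(ids))                               # timestamp tokens filtered out by default
print(tokenizer.decode(ids, decode_with_timestamps=True))  # timestamp tokens kept in the text
print(tokenizer.decode(ids, output_offsets=True))
# -> roughly {'text': ' Hello world.', 'offsets': [{'text': ' Hello world.', 'timestamp': (0.0, 1.02)}]}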

sanchit-gandhi (Contributor, Author) commented

The tests pass for me locally, but a small subset of the fast tokenizer tests time out on the CI: link

=> Is there any reason these tests should time out now that we have the expanded vocabulary with the newly added tokens? It seems to me that it's the same stage that always gets stuck:

    def _convert_token_to_id_with_added_voc(self, token: str) -> int:
        index = self._tokenizer.token_to_id(token)

ArthurZucker (Collaborator) left a comment

Might be the 1500 tokens


        if not decode_with_timestamps:
            # filter timestamp tokens if they are contained in the vocab
            timestamp_ids = self.convert_tokens_to_ids([("<|%.2f|>" % (i * time_precision)) for i in range(1500 + 1)])
Collaborator left a comment

This is probably super slow. We can / should cache it, wdyt?

Contributor (Author) commented

Good shout! Resolved in a81a1c9

Does the PR look good to you now?

sanchit-gandhi changed the title from "[Whisper Tokenizer] Fix tests after adding timestamps" to "[Whisper Tokenizer] Encode timestamps" on Sep 13, 2023
ArthurZucker (Collaborator) left a comment

LGTM, I think the lru cache is elegant, hope performance-wise it's also good!
CI seems to like it 🤗
If the precision doesn't change, this is never recomputed / automatically optimised by the LRU cache (never used it!)

sanchit-gandhi (Contributor, Author) commented

Yeah, it works pretty well: after the first cache step, tokenizer decoding is on par with what we had before.

That's correct regarding only computing the cache once: we'll likely never actually have to re-compute it, since in practice everyone will use a fixed time precision of 0.02. However, the code can handle multiple values of time_precision, which stays consistent with the .decode method, where we allow time_precision as an arg.
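
For readers following along, a minimal, stripped-down sketch of the caching idea (hypothetical wrapper class; not the exact code from a81a1c9):

from functools import lru_cache

class TimestampIdCache:
    # Hypothetical wrapper: computes the timestamp ids once per time_precision value
    # and then serves repeated decode calls from the cache.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    @lru_cache(maxsize=None)
    def timestamp_ids(self, time_precision=0.02):
        tokens = ["<|%.2f|>" % (i * time_precision) for i in range(1500 + 1)]
        return tuple(self.tokenizer.convert_tokens_to_ids(tokens))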

@sanchit-gandhi sanchit-gandhi merged commit ac957f6 into huggingface:main Sep 14, 2023
@sanchit-gandhi sanchit-gandhi deleted the whisper-fix-tests branch September 14, 2023 11:00
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* [Whisper Tokenizer] Fix tests after adding timestamps

* fix s2t tokenizer tests

* fix vocab test

* backwards comp

* fix tests

* comment

* style

* fix last test

* fix fast

* make faster

* move logic to decode

* remove skip test

* fix decode with offsets

* fix special tokens

* empty commit to re-trigger ci

* use lru cache
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023