[Whisper] Fix word-level timestamps for audio < 30 seconds #25607

xenova · 2023-08-18T22:16:34Z

What does this PR do?

In OpenAI's original implementation of word-level timestamps, the cross-attentions are cropped before performing dynamic time warping (so the algorithm only runs on valid audio; this prevents it from getting stuck when backtracking). The current transformers implementation misses this step, so this PR adds it.
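To illustrate the idea, here is a minimal NumPy sketch of cropping a cross-attention matrix to the valid audio before running dynamic time warping. It is not the transformers or OpenAI implementation: `dtw_path` and `token_start_times` are hypothetical helpers, the attention matrix is synthetic, and the sketch assumes 10 ms mel frames with the encoder halving the time resolution (so 1500 positions of 20 ms each cover the 30-second window).

```python
import numpy as np


def dtw_path(cost):
    """Plain dynamic time warping over a (tokens x frames) cost matrix."""
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    trace = np.zeros((n_tokens + 1, n_frames + 1), dtype=int)
    for j in range(1, n_frames + 1):
        for i in range(1, n_tokens + 1):
            # 0: advance token and frame, 1: advance token only, 2: advance frame only
            prev = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            step = int(np.argmin(prev))
            acc[i, j] = cost[i - 1, j - 1] + prev[step]
            trace[i, j] = step

    # Backtrack from the bottom-right corner to the origin.
    i, j = n_tokens, n_frames
    tokens, frames = [], []
    while i > 0 and j > 0:
        tokens.append(i - 1)
        frames.append(j - 1)
        step = trace[i, j]
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return np.array(tokens[::-1]), np.array(frames[::-1])


def token_start_times(attention, num_frames=None, time_precision=0.02):
    """Return one start time (in seconds) per token.

    `num_frames` is the number of mel frames of real audio (hypothetical
    argument for this sketch); the encoder output has half that many positions.
    """
    if num_frames is not None:
        # The crop: drop attention over the zero-padded part of the window.
        attention = attention[:, : num_frames // 2]
    tokens, frames = dtw_path(-attention)
    # True wherever the path moves on to a new token.
    jumps = np.pad(np.diff(tokens), (1, 0), constant_values=1).astype(bool)
    return frames[jumps] * time_precision


# Synthetic example: 4 tokens, 3 s of audio padded to a 30 s window.
rng = np.random.default_rng(0)
n_tokens, num_frames = 4, 300
attention = rng.uniform(0.0, 0.05, size=(n_tokens, 1500))
valid = num_frames // 2
for k in range(n_tokens):
    lo, hi = k * valid // n_tokens, (k + 1) * valid // n_tokens
    attention[k, lo:hi] += 1.0  # each token attends to its own slice of valid audio

starts = token_start_times(attention, num_frames=num_frames)
print(starts)
```

With the crop applied, every frame index on the alignment path is below `num_frames // 2`, so no token can be assigned a start time beyond the end of the real audio.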

Testing code:

from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", "openai/whisper-base")
url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/japanese-audio.wav'
output = pipe(url, return_timestamps="word", chunk_length_s=30, generate_kwargs={'language': 'japanese'})
print(output)

Fixed:

{
'text': '森長の美味しい牛乳は濃い青いように牛乳ビーンを足らった絶対にのパック牛乳である',
'chunks': [
  {'text': '森', 'timestamp': (0.18, 0.64)},
  {'text': '長', 'timestamp': (0.64, 0.82)},
  {'text': 'の', 'timestamp': (0.82, 1.04)},
  {'text': '美味', 'timestamp': (1.04, 1.2)},
  {'text': 'しい', 'timestamp': (1.2, 1.46)},
  {'text': '牛', 'timestamp': (1.46, 1.68)},
  {'text': '乳', 'timestamp': (1.68, 1.92)},
  {'text': 'は', 'timestamp': (1.92, 2.14)},
  {'text': '濃', 'timestamp': (2.14, 2.32)},
  {'text': 'い', 'timestamp': (2.32, 2.44)},
  {'text': '青', 'timestamp': (2.44, 2.64)},
  {'text': 'い', 'timestamp': (2.64, 2.76)},
  {'text': 'ように', 'timestamp': (2.76, 2.92)},
  {'text': '牛', 'timestamp': (2.92, 3.16)},
  {'text': '乳', 'timestamp': (3.16, 3.36)},
  {'text': 'ビ', 'timestamp': (3.36, 3.58)},
  {'text': 'ーン', 'timestamp': (3.58, 3.66)},
  {'text': 'を', 'timestamp': (3.66, 3.82)},
  {'text': '足', 'timestamp': (3.82, 4.0)},
  {'text': 'ら', 'timestamp': (4.0, 4.12)},
  {'text': 'った', 'timestamp': (4.12, 4.3)},
  {'text': '絶', 'timestamp': (4.3, 4.52)},
  {'text': '対', 'timestamp': (4.52, 4.68)},
  {'text': 'に', 'timestamp': (4.68, 4.78)},
  {'text': 'の', 'timestamp': (4.78, 4.94)},
  {'text': 'パ', 'timestamp': (4.94, 5.1)},
  {'text': 'ック', 'timestamp': (5.1, 5.2)},
  {'text': '牛', 'timestamp': (5.2, 5.44)},
  {'text': '乳', 'timestamp': (5.44, 5.64)},
  {'text': 'で', 'timestamp': (5.64, 5.84)},
  {'text': 'ある', 'timestamp': (5.84, 6.04)}
]
}

Previous (broken):

{
'text': '森長の美味しい牛乳は濃い青いように牛乳ビーンを足らった絶対にのパック牛乳である',
'chunks': [
  {'text': '森', 'timestamp': (29.98, 29.98)},
  {'text': '長', 'timestamp': (29.98, 29.98)},
  {'text': 'の', 'timestamp': (29.98, 29.98)},
  {'text': '美味', 'timestamp': (29.98, 29.98)},
  {'text': 'しい', 'timestamp': (29.98, 29.98)},
  {'text': '牛', 'timestamp': (29.98, 29.98)},
  {'text': '乳', 'timestamp': (29.98, 29.98)},
  {'text': 'は', 'timestamp': (29.98, 29.98)},
  {'text': '濃', 'timestamp': (29.98, 29.98)},
  {'text': 'い', 'timestamp': (29.98, 29.98)},
  {'text': '青', 'timestamp': (29.98, 29.98)},
  {'text': 'い', 'timestamp': (29.98, 29.98)},
  {'text': 'ように', 'timestamp': (29.98, 29.98)},
  {'text': '牛', 'timestamp': (29.98, 29.98)},
  {'text': '乳', 'timestamp': (29.98, 29.98)},
  {'text': 'ビ', 'timestamp': (29.98, 29.98)},
  {'text': 'ーン', 'timestamp': (29.98, 29.98)},
  {'text': 'を', 'timestamp': (29.98, 29.98)},
  {'text': '足', 'timestamp': (29.98, 29.98)},
  {'text': 'ら', 'timestamp': (29.98, 29.98)},
  {'text': 'った', 'timestamp': (29.98, 29.98)},
  {'text': '絶', 'timestamp': (29.98, 29.98)},
  {'text': '対', 'timestamp': (29.98, 29.98)},
  {'text': 'に', 'timestamp': (29.98, 29.98)},
  {'text': 'の', 'timestamp': (29.98, 29.98)},
  {'text': 'パ', 'timestamp': (29.98, 29.98)},
  {'text': 'ック', 'timestamp': (29.98, 29.98)},
  {'text': '牛', 'timestamp': (29.98, 29.98)},
  {'text': '乳', 'timestamp': (29.98, 29.98)},
  {'text': 'で', 'timestamp': (29.98, 29.98)},
  {'text': 'ある', 'timestamp': (29.98, 29.98)}
]
}

Fixes #25605 (issue)

Before submitting

* This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
* Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sanchit-gandhi @ArthurZucker

HuggingFaceDocBuilderDev · 2023-08-18T22:40:58Z

The documentation is not available anymore as the PR was closed or merged.

xenova · 2023-08-19T00:04:21Z

Looks like there were some failing unit tests, but they were actually wrong 😅

Original unit test:

{
    "text": " Conquered returned to its place amidst the tents.",
    "chunks": [
        {"text": " Conquered", "timestamp": (29.78, 29.9)},
        {"text": " returned", "timestamp": (29.9, 29.9)},
        {"text": " to", "timestamp": (29.9, 29.9)},
        {"text": " its", "timestamp": (29.9, 29.9)},
        {"text": " place", "timestamp": (29.9, 29.9)},
        {"text": " amidst", "timestamp": (29.9, 29.9)},
        {"text": " the", "timestamp": (29.9, 29.9)},
        {"text": " tents.", "timestamp": (29.9, 29.9)}
    ]
}

New (fixed) unit test:

{
    "text": " Conquered returned to its place amidst the tents.",
    "chunks": [
        {"text": " Conquered", "timestamp": (0.5, 1.2)},
        {"text": " returned", "timestamp": (1.2, 1.64)},
        {"text": " to", "timestamp": (1.64, 1.84)},
        {"text": " its", "timestamp": (1.84, 2.02)},
        {"text": " place", "timestamp": (2.02, 2.28)},
        {"text": " amidst", "timestamp": (2.28, 2.78)},
        {"text": " the", "timestamp": (2.78, 2.96)},
        {"text": " tents.", "timestamp": (2.96, 3.48)},
    ],
},
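A quick way to spot this failure mode is to count chunks whose start and end timestamps are identical. The helper below is a hypothetical sketch, not part of the transformers test suite, applied to the two expected outputs shown above.

```python
def count_zero_length_chunks(chunks):
    """Count chunks whose start and end timestamps are identical."""
    return sum(1 for chunk in chunks if chunk["timestamp"][0] == chunk["timestamp"][1])


# Expected chunks from the original (wrong) unit test.
old_chunks = [
    {"text": " Conquered", "timestamp": (29.78, 29.9)},
    {"text": " returned", "timestamp": (29.9, 29.9)},
    {"text": " to", "timestamp": (29.9, 29.9)},
    {"text": " its", "timestamp": (29.9, 29.9)},
    {"text": " place", "timestamp": (29.9, 29.9)},
    {"text": " amidst", "timestamp": (29.9, 29.9)},
    {"text": " the", "timestamp": (29.9, 29.9)},
    {"text": " tents.", "timestamp": (29.9, 29.9)},
]

# Expected chunks from the new (fixed) unit test.
new_chunks = [
    {"text": " Conquered", "timestamp": (0.5, 1.2)},
    {"text": " returned", "timestamp": (1.2, 1.64)},
    {"text": " to", "timestamp": (1.64, 1.84)},
    {"text": " its", "timestamp": (1.84, 2.02)},
    {"text": " place", "timestamp": (2.02, 2.28)},
    {"text": " amidst", "timestamp": (2.28, 2.78)},
    {"text": " the", "timestamp": (2.78, 2.96)},
    {"text": " tents.", "timestamp": (2.96, 3.48)},
]

print(count_zero_length_chunks(old_chunks))  # 7
print(count_zero_length_chunks(new_chunks))  # 0
```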

ArthurZucker

Very nice catch! Will let @sanchit-gandhi have a look!

src/transformers/models/whisper/modeling_whisper.py

sanchit-gandhi

Looks really nice already @xenova! Thanks for the fix! Echoing @ArthurZucker's suggestion about using the generation config, but otherwise looks top 👌

src/transformers/models/whisper/modeling_whisper.py

src/transformers/pipelines/automatic_speech_recognition.py

tests/pipelines/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Arthur <[email protected]>

sanchit-gandhi

Lovely thanks @xenova! LGTM! Feel free to merge @ArthurZucker if you're happy with the PR

* Fix language detection
* Remove debug statement
* Fix punctuation regex for whisper decoding (Closes #223)
* Fix word-level timestamps for audio < 30 seconds
  Issue in python library: huggingface/transformers#25605
  PR for above: huggingface/transformers#25607
* Add multilingual transcription w/ word-level timestamps unit test
* Fix unit tests

sanchit-gandhi · 2023-09-08T09:29:40Z

Gently pinging @ArthurZucker for the final 👍 before merge - thank you again for the PR @xenova!

ArthurZucker · 2023-09-08T20:44:58Z

Sorry for the miss, reviewing now!

ArthurZucker

Sorry for the wait! Looks super good to me!
Left a very small nit 😉
Thanks for fixing

src/transformers/pipelines/automatic_speech_recognition.py

aramfaghfouri · 2023-09-12T13:31:49Z

Thank you for your contribution @xenova!
Is this PR approved and merged in the latest version? I just installed Transformers and I am still getting the old results:
{'text': ' Okay, you ready?', 'chunks': [{'text': ' Okay,', 'timestamp': (29.98, 29.98)}, {'text': ' you', 'timestamp': (29.98, 29.98)}, {'text': ' ready?', 'timestamp': (29.98, 29.98)}]}

Thanks!

xenova · 2023-09-12T13:33:01Z

Hi there - It's not yet merged, but will hopefully be soon!

aramfaghfouri · 2023-09-12T18:18:58Z

Do you happen to know when the next version will be out?

sanchit-gandhi

Quickly pushed the last requested changes! Will merge when the CI is green!

src/transformers/pipelines/automatic_speech_recognition.py

aramfaghfouri · 2023-09-15T14:06:39Z

Hi @xenova,

I just upgraded to the latest version of Transformers (4.33.1) and tried the following. It seems that the word timestamps are still incorrect.
What am I doing wrong?

Thanks!

###########

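from transformers import pipeline

# `device` and `x0` (the audio input) are defined elsewhere by the poster.
pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large",
  chunk_length_s=4,
  device=device,
)
vx0 = pipe(x0, batch_size=4, return_timestamps="word")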

vx0['chunks']:
{'text': ' Okay,', 'timestamp': (29.98, 29.98)},
{'text': ' you', 'timestamp': (29.98, 29.98)},
{'text': ' ready?', 'timestamp': (29.98, 29.98)},
{'text': ' So', 'timestamp': (32.65, 32.65)},
{'text': ' now', 'timestamp': (32.65, 32.65)},
{'text': ' it', 'timestamp': (35.31, 35.31)},
{'text': ' is', 'timestamp': (35.31, 35.31)},
{'text': " it's", 'timestamp': (35.31, 35.31)},
{'text': ' a', 'timestamp': (35.31, 35.31)},

xenova · 2023-09-15T14:08:09Z

This is because a full release hasn't come out yet. To fix it, you can install from source (see docs):

pip install --upgrade git+https://github.com/huggingface/transformers

…ce#25607)

* Fix word-level timestamps for audio < 30 seconds
* Fix code quality
* fix unit tests
* Fix unit tests
* Fix unit test
* temp: print out result
* temp: set max diff to None
* fix unit tests
* fix typo
* Fix typo
  Co-authored-by: Arthur <[email protected]>
* Use generation config for `num_frames`
* fix docs
* Move `num_frames` to kwargs
* compute stride/attn_mask once
* mark test as slow

---------

Co-authored-by: Arthur <[email protected]>
Co-authored-by: sanchit-gandhi <[email protected]>

Starlento · 2024-01-20T10:07:29Z

Hi guys, I just tried Whisper v3 and found that your updated code is gone from the current main branch.
It gives me 29.98... There have been multiple commits since this merge; can someone check what is going on?

Fix word-level timestamps for audio < 30 seconds

65e4748

xenova requested review from sanchit-gandhi and ArthurZucker August 18, 2023 22:17

xenova mentioned this pull request Aug 18, 2023

[Bug] Non-english word-level timestamps huggingface/transformers.js#252

Closed

xenova added 3 commits August 18, 2023 22:48

Fix code quality

48b4393

fix unit tests

545770c

Fix unit tests

4504455

xenova added 5 commits August 19, 2023 00:09

Fix unit test

9c6df2c

temp: print out result

414077b

temp: set max diff to None

18cea34

fix unit tests

a53d8f8

fix typo

259f3f2

ArthurZucker reviewed Aug 21, 2023

sanchit-gandhi reviewed Aug 21, 2023

xenova and others added 4 commits August 21, 2023 14:43

Fix typo

689dfb9

Co-authored-by: Arthur <[email protected]>

Use generation config for num_frames

03ec0eb

fix docs

2605178

Move num_frames to kwargs

0304231

xenova requested a review from sanchit-gandhi August 21, 2023 15:08

sanchit-gandhi approved these changes Aug 21, 2023

ArthurZucker approved these changes Sep 9, 2023

compute stride/attn_mask once

0694dbe

sanchit-gandhi reviewed Sep 14, 2023

mark test as slow

c7e7457

sanchit-gandhi merged commit 95fe0f5 into huggingface:main Sep 14, 2023
3 checks passed

BjoernRave mentioned this pull request Jan 30, 2024

Whisper model word-level timestamps broken huggingface/transformers.js#551

Open

5 tasks

kyle-v6x mentioned this pull request Mar 7, 2024

Whisper Word-level Timestamps broken on some inputs #29502

Closed

4 tasks

sanchit-gandhi mentioned this pull request Apr 12, 2024

[Whisper] Word-level timestamps broken for short-form audio #30224

Closed

4 tasks
