[Whisper] Fix word-level timestamps for audio < 30 seconds #25607

Merged (15 commits) on Sep 14, 2023

@xenova xenova commented Aug 18, 2023

What does this PR do?

OpenAI's original implementation of word-level timestamps crops the cross-attention weights before performing dynamic time warping, so that the algorithm only runs on valid audio; this prevents it from getting stuck when backtracking. The current transformers implementation misses this step, so this PR adds it.
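
To make the cropping step concrete, here is a minimal NumPy sketch. It is not the code from this PR or from OpenAI's repository: the helper names (`num_valid_frames`, `dtw_path`, `token_start_frames`) and the toy attention matrix are made up for illustration, and the plain DTW below stands in for whatever alignment routine the library actually uses. It assumes the usual Whisper frame duration of 0.02 s (30 s of audio maps to 1500 encoder frames).

```python
import numpy as np

TIME_PRECISION = 0.02  # assumed seconds per encoder frame (30 s -> 1500 frames)


def num_valid_frames(audio_seconds, total_frames):
    """Hypothetical helper: frames that cover real audio rather than padding."""
    return min(total_frames, int(round(audio_seconds / TIME_PRECISION)))


def dtw_path(cost):
    """Plain dynamic time warping over a (tokens x frames) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    trace = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # 0 = diagonal step, 1 = next token, 2 = next frame
            choices = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            k = int(np.argmin(choices))
            acc[i, j] = cost[i - 1, j - 1] + choices[k]
            trace[i, j] = k
    # Backtrack from the bottom-right corner to the origin.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = trace[i, j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def token_start_frames(path, num_tokens):
    """First frame aligned to each token along the (monotonic) DTW path."""
    starts = [None] * num_tokens
    for token, frame in path:
        if starts[token] is None:
            starts[token] = frame
    return starts


# Toy cross-attention: 3 tokens, 10 frames, but only 0.12 s (6 frames) is audio.
attn = np.zeros((3, 10))
attn[0, 0:2] = 1.0
attn[1, 2:4] = 1.0
attn[2, 4:6] = 1.0
audio_seconds = 0.12

# Crop away the padded frames before aligning.
cropped = attn[:, : num_valid_frames(audio_seconds, attn.shape[1])]
path = dtw_path(-cropped)  # high attention means low cost
starts = token_start_frames(path, cropped.shape[0])
start_times = [round(f * TIME_PRECISION, 2) for f in starts]
print(start_times)
```

Because the path is forced to end at the last column, aligning against the uncropped matrix would make the backtrack start in the padded region; cropping first makes it start at the end of the real audio.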

Testing code:

from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", "openai/whisper-base")
url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/japanese-audio.wav'
output = pipe(url, return_timestamps="word", chunk_length_s=30, generate_kwargs={'language': 'japanese'})
print(output)

Fixed:

{
'text': '森長の美味しい牛乳は濃い青いように牛乳ビーンを足らった絶対にのパック牛乳である',
'chunks': [
  {'text': '森', 'timestamp': (0.18, 0.64)},
  {'text': '長', 'timestamp': (0.64, 0.82)},
  {'text': 'の', 'timestamp': (0.82, 1.04)},
  {'text': '美味', 'timestamp': (1.04, 1.2)},
  {'text': 'しい', 'timestamp': (1.2, 1.46)},
  {'text': '牛', 'timestamp': (1.46, 1.68)},
  {'text': '乳', 'timestamp': (1.68, 1.92)},
  {'text': 'は', 'timestamp': (1.92, 2.14)},
  {'text': '濃', 'timestamp': (2.14, 2.32)},
  {'text': 'い', 'timestamp': (2.32, 2.44)},
  {'text': '青', 'timestamp': (2.44, 2.64)},
  {'text': 'い', 'timestamp': (2.64, 2.76)},
  {'text': 'ように', 'timestamp': (2.76, 2.92)},
  {'text': '牛', 'timestamp': (2.92, 3.16)},
  {'text': '乳', 'timestamp': (3.16, 3.36)},
  {'text': 'ビ', 'timestamp': (3.36, 3.58)},
  {'text': 'ーン', 'timestamp': (3.58, 3.66)},
  {'text': 'を', 'timestamp': (3.66, 3.82)},
  {'text': '足', 'timestamp': (3.82, 4.0)},
  {'text': 'ら', 'timestamp': (4.0, 4.12)},
  {'text': 'った', 'timestamp': (4.12, 4.3)},
  {'text': '絶', 'timestamp': (4.3, 4.52)},
  {'text': '対', 'timestamp': (4.52, 4.68)},
  {'text': 'に', 'timestamp': (4.68, 4.78)},
  {'text': 'の', 'timestamp': (4.78, 4.94)},
  {'text': 'パ', 'timestamp': (4.94, 5.1)},
  {'text': 'ック', 'timestamp': (5.1, 5.2)},
  {'text': '牛', 'timestamp': (5.2, 5.44)},
  {'text': '乳', 'timestamp': (5.44, 5.64)},
  {'text': 'で', 'timestamp': (5.64, 5.84)},
  {'text': 'ある', 'timestamp': (5.84, 6.04)}
]
}

Previous (broken):

{
'text': '森長の美味しい牛乳は濃い青いように牛乳ビーンを足らった絶対にのパック牛乳である',
'chunks': [
  {'text': '森', 'timestamp': (29.98, 29.98)},
  {'text': '長', 'timestamp': (29.98, 29.98)},
  {'text': 'の', 'timestamp': (29.98, 29.98)},
  {'text': '美味', 'timestamp': (29.98, 29.98)},
  {'text': 'しい', 'timestamp': (29.98, 29.98)},
  {'text': '牛', 'timestamp': (29.98, 29.98)},
  {'text': '乳', 'timestamp': (29.98, 29.98)},
  {'text': 'は', 'timestamp': (29.98, 29.98)},
  {'text': '濃', 'timestamp': (29.98, 29.98)},
  {'text': 'い', 'timestamp': (29.98, 29.98)},
  {'text': '青', 'timestamp': (29.98, 29.98)},
  {'text': 'い', 'timestamp': (29.98, 29.98)},
  {'text': 'ように', 'timestamp': (29.98, 29.98)},
  {'text': '牛', 'timestamp': (29.98, 29.98)},
  {'text': '乳', 'timestamp': (29.98, 29.98)},
  {'text': 'ビ', 'timestamp': (29.98, 29.98)},
  {'text': 'ーン', 'timestamp': (29.98, 29.98)},
  {'text': 'を', 'timestamp': (29.98, 29.98)},
  {'text': '足', 'timestamp': (29.98, 29.98)},
  {'text': 'ら', 'timestamp': (29.98, 29.98)},
  {'text': 'った', 'timestamp': (29.98, 29.98)},
  {'text': '絶', 'timestamp': (29.98, 29.98)},
  {'text': '対', 'timestamp': (29.98, 29.98)},
  {'text': 'に', 'timestamp': (29.98, 29.98)},
  {'text': 'の', 'timestamp': (29.98, 29.98)},
  {'text': 'パ', 'timestamp': (29.98, 29.98)},
  {'text': 'ック', 'timestamp': (29.98, 29.98)},
  {'text': '牛', 'timestamp': (29.98, 29.98)},
  {'text': '乳', 'timestamp': (29.98, 29.98)},
  {'text': 'で', 'timestamp': (29.98, 29.98)},
  {'text': 'ある', 'timestamp': (29.98, 29.98)}
]
}

Fixes #25605 (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sanchit-gandhi @ArthurZucker

xenova commented Aug 19, 2023

Looks like there were some failing unit tests, but they were actually wrong 😅

Original unit test:

{
    "text": " Conquered returned to its place amidst the tents.",
    "chunks": [
        {"text": " Conquered", "timestamp": (29.78, 29.9)},
        {"text": " returned", "timestamp": (29.9, 29.9)},
        {"text": " to", "timestamp": (29.9, 29.9)},
        {"text": " its", "timestamp": (29.9, 29.9)},
        {"text": " place", "timestamp": (29.9, 29.9)},
        {"text": " amidst", "timestamp": (29.9, 29.9)},
        {"text": " the", "timestamp": (29.9, 29.9)},
        {"text": " tents.", "timestamp": (29.9, 29.9)}
    ]
}

New (fixed) unit test:

{
    "text": " Conquered returned to its place amidst the tents.",
    "chunks": [
        {"text": " Conquered", "timestamp": (0.5, 1.2)},
        {"text": " returned", "timestamp": (1.2, 1.64)},
        {"text": " to", "timestamp": (1.64, 1.84)},
        {"text": " its", "timestamp": (1.84, 2.02)},
        {"text": " place", "timestamp": (2.02, 2.28)},
        {"text": " amidst", "timestamp": (2.28, 2.78)},
        {"text": " the", "timestamp": (2.78, 2.96)},
        {"text": " tents.", "timestamp": (2.96, 3.48)},
    ],
},

@ArthurZucker ArthurZucker left a comment

Very nice catch! Will let @sanchit-gandhi have a look!

Review comment on src/transformers/models/whisper/modeling_whisper.py (outdated, resolved)

@sanchit-gandhi sanchit-gandhi left a comment

Looks really nice already @xenova! Thanks for the fix! Echo'ing @ArthurZucker's suggestion about using the generation config, but otherwise looks top 👌

@xenova xenova requested a review from sanchit-gandhi August 21, 2023 15:08

@sanchit-gandhi sanchit-gandhi left a comment

Lovely thanks @xenova! LGTM! Feel free to merge @ArthurZucker if you're happy with the PR

xenova added a commit to huggingface/transformers.js that referenced this pull request Aug 22, 2023
* Fix language detection

* Remove debug statement

* Fix punctuation regex for whisper decoding (Closes #223)

* Fix word-level timestamps for audio < 30 seconds

Issue in python library: huggingface/transformers#25605
PR for above: huggingface/transformers#25607

* Add multilingual transcription w/ word-level timestamps unit test

* Fix unit tests
@sanchit-gandhi

Gently pinging @ArthurZucker for the final 👍 before merge - thank you again for the PR @xenova!

@ArthurZucker

Sorry for the miss, reviewing now!

@ArthurZucker ArthurZucker left a comment

Sorry for the wait! Looks super good to me!
Left a very small nit 😉
Thanks for fixing

Review comment on src/transformers/pipelines/automatic_speech_recognition.py (outdated, resolved)
@aramfaghfouri

Thank you for your contribution @xenova!
Is this PR approved and merged in the latest version? I just installed Transformers and I am still getting the old results:
{'text': ' Okay, you ready?', 'chunks': [{'text': ' Okay,', 'timestamp': (29.98, 29.98)}, {'text': ' you', 'timestamp': (29.98, 29.98)}, {'text': ' ready?', 'timestamp': (29.98, 29.98)}]}

Thanks!

xenova commented Sep 12, 2023

Hi there - It's not yet merged, but will hopefully be soon!

@aramfaghfouri

Do you happen to know when the next version will be out?

@sanchit-gandhi sanchit-gandhi left a comment

Quickly pushed the last requested changes! Will merge when the CI is green!

@sanchit-gandhi sanchit-gandhi merged commit 95fe0f5 into huggingface:main Sep 14, 2023
3 checks passed

aramfaghfouri commented Sep 15, 2023

Hi @xenova,

I just upgraded to the latest version of Transformers (4.33.1) and tried the following. It seems that the word timestamps are still incorrect.
What am I doing wrong?

Thanks!

###########

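from transformers import pipeline

# `device` and the audio input `x0` are defined elsewhere (not shown here).
pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large",
  chunk_length_s=4,
  device=device,
)
vx0 = pipe(x0, batch_size=4, return_timestamps="word")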

vx0['chunks']:
{'text': ' Okay,', 'timestamp': (29.98, 29.98)},
{'text': ' you', 'timestamp': (29.98, 29.98)},
{'text': ' ready?', 'timestamp': (29.98, 29.98)},
{'text': ' So', 'timestamp': (32.65, 32.65)},
{'text': ' now', 'timestamp': (32.65, 32.65)},
{'text': ' it', 'timestamp': (35.31, 35.31)},
{'text': ' is', 'timestamp': (35.31, 35.31)},
{'text': " it's", 'timestamp': (35.31, 35.31)},
{'text': ' a', 'timestamp': (35.31, 35.31)},

xenova commented Sep 15, 2023

This is because a full release hasn't come out yet. To fix it, you can install from source (see docs):

pip install --upgrade git+https://github.com/huggingface/transformers
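
After reinstalling, a quick way to tell whether the fix is active is to check whether the word timestamps are still collapsed. The helper below is only a sketch with a made-up name (`has_collapsed_timestamps`); it relies on nothing beyond the `chunks` structure shown in the outputs above, and flags the broken pattern where every word has a zero-length timestamp such as (29.98, 29.98).

```python
def has_collapsed_timestamps(chunks):
    """Return True if every chunk has a zero-length (start == end) timestamp."""
    return bool(chunks) and all(
        chunk["timestamp"][0] == chunk["timestamp"][1] for chunk in chunks
    )


# Chunks copied from the outputs quoted in this thread.
broken = [
    {"text": " Okay,", "timestamp": (29.98, 29.98)},
    {"text": " you", "timestamp": (29.98, 29.98)},
    {"text": " ready?", "timestamp": (29.98, 29.98)},
]
fixed = [
    {"text": " Conquered", "timestamp": (0.5, 1.2)},
    {"text": " returned", "timestamp": (1.2, 1.64)},
    {"text": " to", "timestamp": (1.64, 1.84)},
]

print(has_collapsed_timestamps(broken))  # True
print(has_collapsed_timestamps(fixed))   # False
```

With a real pipeline result, the call would be `has_collapsed_timestamps(output["chunks"])`.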

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
…ce#25607)

* Fix word-level timestamps for audio < 30 seconds

* Fix code quality

* fix unit tests

* Fix unit tests

* Fix unit test

* temp: print out result

* temp: set max diff to None

* fix unit tests

* fix typo

* Fix typo

Co-authored-by: Arthur <[email protected]>

* Use generation config for `num_frames`

* fix docs

* Move `num_frames` to kwargs

* compute stride/attn_mask once

* mark test as slow

---------

Co-authored-by: Arthur <[email protected]>
Co-authored-by: sanchit-gandhi <[email protected]>
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023
@Starlento

Hi guys, I just tried Whisper v3 and found that your updated code is gone from the current main branch.
It also gives me 29.98... There have been multiple commits since this merge; can someone check what is going on?
