
Allow dict input for audio classification pipeline #23445

Merged · 9 commits · Jun 23, 2023

Conversation

sanchit-gandhi
Contributor

What does this PR do?

Allow dictionary inputs for the audio classification pipeline

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 18, 2023

The documentation is not available anymore as the PR was closed or merged.

@sanchit-gandhi sanchit-gandhi force-pushed the audio-class-pipeline branch from 3eb2035 to ae50bbb Compare May 31, 2023 14:16
@sanchit-gandhi sanchit-gandhi requested review from Narsil and sgugger June 1, 2023 15:36
Collaborator

@sgugger sgugger left a comment

Thanks for your PR. Since you didn't put a description, I don't really understand what this tries to solve while making the API more complex.


_inputs = inputs.pop("raw", None)
if _inputs is None:
# Remove path which will not be used from `datasets`.
Collaborator

I have no idea what this means.

Comment on lines +121 to +124
- `dict` form can be used to pass raw audio sampled at an arbitrary `sampling_rate` and let this
pipeline do the resampling. The dict must be in either the format `{"sampling_rate": int,
"raw": np.array}` or `{"sampling_rate": int, "array": np.array}`, where the key `"raw"` or
`"array"` is used to denote the raw audio waveform.
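A minimal sketch of how a dict in either form could be unpacked (illustrative only; the helper name is hypothetical, and this is not the actual pipeline code):

```python
import numpy as np

def unpack_audio_dict(inputs: dict):
    """Extract the waveform and sampling rate from a datasets-style dict.

    Accepts either the "raw" or the "array" key for the waveform,
    mirroring the docstring above. Sketch only, not pipeline code.
    """
    inputs = dict(inputs)  # don't mutate the caller's dict
    inputs.pop("path", None)  # deprecated `datasets` key, unused here
    waveform = inputs.pop("raw", None)
    if waveform is None:
        # Fall back to the key name used by `datasets`.
        waveform = inputs.pop("array", None)
    if waveform is None:
        raise ValueError('dict input must contain a "raw" or "array" key')
    sampling_rate = inputs.pop("sampling_rate")
    return waveform, sampling_rate

waveform, sr = unpack_audio_dict(
    {"array": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000}
)
```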
Collaborator

Why accept both raw and array? Seems very brittle as an API. Also why add a new argument type instead of just accepting a new sampling_rate argument?

Contributor Author

@sanchit-gandhi sanchit-gandhi Jun 2, 2023

Copied one-for-one from:

inputs (`np.ndarray` or `bytes` or `str` or `dict`):

Originally, the ASR pipeline only accepted the `raw` key for the input waveform, but this was updated to accept both `raw` and `array` to bring the pipeline into alignment with datasets, where the 1-d audio arrays go under the dict key `array` (see comment below and motivations for this consistency in #20414 (comment))

Collaborator

Saw the raw key on the ASR pipeline was kept for backward compatibility. Do we really need to introduce it there?

Contributor Author

Indeed! I'm agnostic. I kept it in for consistency across the pipeline classes (e.g. if a user typically passes the `raw` key to the ASR pipeline, they would expect it to work for the audio classification pipeline), but I can simplify it to accept only the `array` key if we don't mind losing this.

Collaborator

Let's see what @Narsil thinks.

Contributor Author

Gently pinging @Narsil - would be nice to have this ready in transformers for the next release (unblocks huggingface/audio-transformers-course#25)

Contributor

@Narsil Narsil Jun 13, 2023

Hey, sorry for this.

Usually I'm kind of against accepting widely different types, but this case is different, since it's about our ecosystem and making datasets + pipeline work together more nicely.

Contributor Author

If we're all happy, I'll keep it as is for now, and we can explore a joint refactor of the ASR + audio classification pipelines in the future?

Contributor

Yup

inputs = _inputs
if in_sampling_rate != self.feature_extractor.sampling_rate:
import torch
from torchaudio import functional as F
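The actual code above resamples with torchaudio; as a rough stand-in, the effect of that resampling step can be sketched with plain numpy linear interpolation (illustrative only, not what the pipeline does, and lower quality than `torchaudio.functional.resample`):

```python
import numpy as np

def naive_resample(waveform: np.ndarray, in_sr: int, out_sr: int) -> np.ndarray:
    """Crude linear-interpolation resampling, a stand-in sketch for
    torchaudio.functional.resample used in the actual code."""
    if in_sr == out_sr:
        return waveform
    duration = len(waveform) / in_sr
    n_out = int(round(duration * out_sr))
    old_t = np.arange(len(waveform)) / in_sr  # original sample times (s)
    new_t = np.arange(n_out) / out_sr         # target sample times (s)
    return np.interp(new_t, old_t, waveform)

x = np.ones(8000, dtype=np.float32)  # 1 s of audio at 8 kHz
y = naive_resample(x, 8000, 16000)   # upsampled to 16 kHz
```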
Collaborator

This adds a soft dep on torchaudio which is not necessary otherwise, no? Might be worth detecting if it's available and throwing a helpful error message?

Contributor Author

Also copied from

from torchaudio import functional as F

Will update with an error message and propagate the changes here 👍
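The kind of guard discussed here could look something like the following sketch (the helper name is hypothetical; transformers has its own availability utilities for optional dependencies):

```python
import importlib.util

def require_module(module_name: str, feature: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing.

    Hypothetical helper illustrating the guard discussed above, so that a
    missing soft dependency like torchaudio fails with a clear message.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required for {feature} but is not installed. "
            f"Install it with `pip install {module_name}`."
        )

# Example: check before resampling dict inputs.
require_module("json", "a stdlib sanity check")  # present, so no error
```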

Contributor Author

Updated for the ASR pipeline in #23953 and this PR in 06751d4

@sanchit-gandhi
Contributor Author

sanchit-gandhi commented Jun 2, 2023

Apologies @sgugger! To clarify, the changes in this PR are one-for-one copied from the input arguments in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/automatic_speech_recognition.py

Essentially, the PR allows users to pass a dict of inputs to the pipeline. This aligns the pipeline with datasets, where the audio column returns a dict with `array` (the 1-d audio array) and `sampling_rate` (the sampling rate of the audio):

from datasets import load_dataset

librispeech = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
librispeech[0]["audio"]

Output:

{'path': '/Users/sanchitgandhi/.cache/huggingface/datasets/downloads/extracted/aad76e6f21870761d7a8b9b34436f6f8db846546c68cb2d9388598d7a164fa4b/dev_clean/1272/128104/1272-128104-0000.flac',
 'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00042725, 0.00057983,
        0.0010376 ]),
 'sampling_rate': 16000}

(The `path` column is deprecated and no longer required, but retained for backwards compatibility. This is what removing `path` refers to in the PR.)
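Dropping the unused key is plain dict handling, along these lines (the sample values here are made up for the sketch):

```python
# Illustrative datasets-style sample (values fabricated for the sketch)
sample = {
    "path": "/some/cache/dir/audio.flac",
    "array": [0.002, 0.003, 0.001],
    "sampling_rate": 16000,
}
# The pipeline discards the deprecated "path" key before feature extraction.
sample.pop("path", None)
```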

This PR enables the dict to be passed directly to the pipeline, in the same way that we do for the ASR pipeline and the transformers feature extractors:

pred_labels = pipe(librispeech[0]["audio"])

If there are any API decisions you feel require changing, I'd be happy to update these in the original code before propagating to this file.

Collaborator

@sgugger sgugger left a comment

Thanks for the explanations!

@Narsil
Contributor

Narsil commented Jun 7, 2023

I think what you're trying to do is already supported, but the sampling rate needs to be in the same dict as the array (both are needed to represent a single audio).

That being said, the errors raised when misusing this feature could probably be largely improved (to guide users towards the correct form).
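The guidance Narsil suggests could be sketched as explicit validation before unpacking (hypothetical function and messages, not the actual implementation):

```python
import numpy as np

def validate_audio_dict(inputs: dict) -> None:
    """Raise descriptive errors for malformed dict inputs.

    Hypothetical validation sketch for the error-message improvements
    discussed above; not the actual pipeline code.
    """
    if "sampling_rate" not in inputs:
        raise ValueError(
            'When passing a dict, it must contain a "sampling_rate" key '
            "with the sampling rate of the audio in Hz."
        )
    waveform = inputs.get("raw", inputs.get("array"))
    if waveform is None:
        raise ValueError(
            'When passing a dict, it must contain a "raw" or "array" key '
            "holding the audio waveform as a numpy array."
        )
    if not isinstance(waveform, np.ndarray):
        raise TypeError(f"Expected a numpy array waveform, got {type(waveform)}")

validate_audio_dict({"array": np.zeros(10), "sampling_rate": 16000})  # valid
```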

_inputs = inputs.pop("array", None)
in_sampling_rate = inputs.pop("sampling_rate")
inputs = _inputs
if in_sampling_rate != self.feature_extractor.sampling_rate:
Contributor Author

@sanchit-gandhi sanchit-gandhi Jun 12, 2023

Are you sure about that @Narsil? It is indeed the case that the ASR pipeline respects the sampling_rate argument, but the audio classification pipeline does not. Note that the resampling operation is new; there is currently no sampling-rate check or resampling performed. This PR adds it.

Contributor

Ohh thanks for the ping, not sure how I missed notifications several times here. You're indeed correct, audio-classification didn't have support.

Contributor

@Narsil Narsil left a comment

LGTM (just some fix on error message).

Sorry for the late review, I'm not sure how I missed those notifications.

Overall there might be room to refactor and abstract this for both pipelines so that we can easily reuse later, but it's good enough for now.

@sanchit-gandhi sanchit-gandhi merged commit 8767958 into huggingface:main Jun 23, 2023
@sanchit-gandhi sanchit-gandhi deleted the audio-class-pipeline branch June 23, 2023 12:51