ASR Pipeline is not super user-friendly #20414
One additional point! We can't pass generation kwargs through to transformers/src/transformers/pipelines/automatic_speech_recognition.py, lines 369 to 372 (commit 0ee7118). This means our stdout is bombarded with UserWarnings from that call.

It would be nice to be able to override the generation kwargs to prevent these messages and to have flexibility over max length, number of beams, temperature, length penalty, etc. cc @Vaibhavs10
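To make the request concrete, here is a rough sketch of the kind of call signature being asked for (the `generate_kwargs` argument is hypothetical here, not what the pipeline exposed at the time; the checkpoint name is just an illustrative choice):

```python
from transformers import pipeline

# Any seq2seq ASR checkpoint would do; "openai/whisper-tiny" is illustrative.
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Hypothetical: forward generation kwargs through the pipeline call so that
# max length, beams, temperature, length penalty, etc. are user-controlled
# (and the default max_length UserWarning is silenced).
out = pipe(
    "sample.flac",
    generate_kwargs={"max_new_tokens": 128, "num_beams": 5, "length_penalty": 1.0},
)
print(out["text"])
```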
Just went through the code in more detail and found that "array" is popped from the input dict (transformers/src/transformers/pipelines/automatic_speech_recognition.py, lines 278 to 280, commit 0ee7118).

Maybe we can add this to the docstring to highlight it!
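A small illustration of the gotcha (a sketch; `pipe` is an ASR pipeline and `sample` one row of an audio dataset, both assumed from context):

```python
# The pipeline pops "array" out of the dict it is given, so passing the
# dataset's audio dict directly mutates that dict in place. A shallow copy
# (or an explicit re-mapping to "raw"/"sampling_rate") avoids the surprise.
audio = dict(sample["audio"])  # {"path": ..., "array": ..., "sampling_rate": ...}
out = pipe({"raw": audio["array"], "sampling_rate": audio["sampling_rate"]})
print(out["text"])
```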
Multiple points:
- As far as I remember, we can also accept that input format.
- This is not going to happen, for reasons I'll explain in the following points.
- We can add it as a kwarg. We could also make the warning appear only once (cc @sgugger, since reducing noise seems desirable). And being able to send generation kwargs through: totally!
I highly recommend NOT loading the entire array of the dataset into memory when working on datasets. That means NOT passing around lists; objects are meant to be consumed one by one, in an iterable fashion. Using generators and streams is much more efficient (and the pipeline will actually do the batching too; passing lists to the pipeline will NOT batch things). More on batching in pipelines: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching

Here is the recommendation from the docs: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline (still need to upgrade that part to make it into the tutorial). Here is a gist with several examples: https://gist.github.com/Narsil/4f5b088f4dd23200d16dd2cc575fdc16

```
Method 1 (pipe)      0:00:00.294485
Method 2 (dataset)   0:00:00.308238
Method 3 (raw file)  0:00:00.635527
```

The 5% speedup is pretty consistent on this smallish data. Method 3 is slower, but because you don't need to decode the audio files within the dataset, it can save some disk space (at a compute cost).

I tried actually batching inputs, but it seems detrimental in this case (just add `batch_size` to the call to try it). See https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching for more info on why batching can hurt. I had to add a "warmup" to do fair comparisons.

Happy to discuss further how to make the integration easier. I should also mention `KeyDataset`:

```python
from transformers.pipelines.pt_utils import KeyDataset

...

for out in pipe(KeyDataset(dataset, "audio")):
    pass
```

It has the same performance as method 1 but plays better with the `datasets` ecosystem.
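For readers who don't want to open the gist, here is a rough approximation of the generator-driven pattern being recommended (the dataset and checkpoint names are placeholders, not the ones used in the benchmark):

```python
from datasets import load_dataset
from transformers import pipeline

# Placeholder names; any ASR checkpoint and audio dataset would do.
pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")


def data():
    # Yield samples one at a time instead of materialising every array in memory.
    for sample in dataset:
        yield sample["audio"]  # {"path": ..., "array": ..., "sampling_rate": ...}


for out in pipe(data()):
    print(out["text"])
```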
Thanks for the super in-depth explanation, @Narsil! Incredibly helpful and much appreciated 🤗

Maybe I'm missing the point a bit with why pipelines exist: are they geared more towards maximising performance for inference (or at least giving you the option to), rather than just being a nice wrapper around the feature extractor, model and tokenizer?

Sounds good regarding the points above. Thanks for explaining why.

Is there a tutorial that's been published, or a WIP? That'll be super handy!

Super comprehensive, thanks for these benchmarks! Interesting to see how the methods compare.
Thanks for flagging this! I had a follow-up question: are there docs / examples for using pipe when loading a dataset in streaming mode? Here we can't use KeyDataset (as we can't index a streamed dataset). Is the best option just to go for a generator?

```python
def data():
    for i, sample in enumerate(dataset):
        yield sample["audio"]


output = []
for out in pipe(data(), batch_size=2):
    output.append(out["text"])
```

With this generator method, we currently... (this is the actual example I'm working with: https://github.com/sanchit-gandhi/codesnippets/blob/main/benchmark_inference_whisper.ipynb)
Well, your initial issue was also pretty comprehensive, so thanks for creating it.

Pipelines started without any real guidelines on what they should or should not do. Let's say there are two kinds of performance:

We're only doing the first kind here (maybe a little of the second for GPU feeding, which needs to be as fast as possible because the CPU-GPU transfer becomes a bottleneck really quickly otherwise).

There's this tutorial, https://huggingface.co/docs/transformers/pipeline_tutorial, which I unfortunately find less comprehensive than https://huggingface.co/docs/transformers/main_classes/pipelines. I'm in the process of rewriting it, as it seems most people read only that. And you're not the first person not to be aware of those cool features, so I'd say it's a doc problem.

Can't tell you why there is a difference, but I can tell you I went to great lengths to optimize everything I could in the pipeline directly (only the first kind of optimization, and it's still written in Python, so far from perfect, but hey... :) )
Actually, if you pass along other keys in your data, they should be carried all the way through to the result with the ASR pipeline:

```python
def data():
    for item in streaming_data:
        yield {**item["audio"], "expected": item["text"]}


for out in pipe(data()):
    generated = out["text"]
    expected = out["expected"]
    # Do your WER thing.
```

Would that work? (I haven't tested this.) If it doesn't, you could do:

```python
GLOBAL_INDEX = {}


def data():
    for i, item in enumerate(streaming_data):
        GLOBAL_INDEX[i] = item["text"]
        yield item


for i, out in enumerate(pipe(data())):
    generated = out["text"]
    expected = GLOBAL_INDEX.pop(i)  # pop removes the entry, releasing memory
    # Do your WER thing.
```
Thank you again for the super comprehensive reply, really appreciate the time given to answering this thread!
Awesome! I think it's fantastic in this regard. Having some easy examples that show you how to run pipeline in different scenarios / tasks, like a little 'recipe' book, would be great to further this.

Did someone say Rust? 👀 Thanks for linking the tutorials; I learnt quite a lot from this thread + docs after knowing where to look. I guess you have two camps of people that will be using pipeline:

For me, it was making the link between my transformers approach and pipeline that made the penny drop. There's a bit of a different mindset you have to adopt vs. the usual `datasets` approach.

It did indeed work, thanks 🙌
I think we should definitely try to avoid displaying warnings by default when running the ASR pipeline.

Short term: could we just allow both use cases (see the sketch below)? What is the big drawback of this?

Mid/long term: this would then also render the ASR pipeline much easier IMO.
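As a rough sketch of the short-term idea (hypothetical and simplified, not the actual pipeline code), the input normalisation could simply accept either key:

```python
import numpy as np


def normalize_audio_input(inputs: dict) -> dict:
    # Accept both the pipeline-style "raw" key and the datasets-style "array" key.
    raw = inputs.get("raw", inputs.get("array"))
    if raw is None:
        raise ValueError('Expected the audio under a "raw" or "array" key.')
    return {"raw": np.asarray(raw), "sampling_rate": inputs["sampling_rate"]}
```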
This is already done; it's a doc issue. And specifically for Sanchit, datasets are using...
True, I have potential suggestions for it, which mainly amount to going full-on down the Processor/StoppingCriteria route. This is what was necessary to enable complex batching within bloom inference.
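For context, the `StoppingCriteria` mechanism being referred to looks roughly like this (a generic sketch, unrelated to the bloom-specific code):

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList


class StopAfterNewTokens(StoppingCriteria):
    """Stop generation once a given number of new tokens has been produced."""

    def __init__(self, start_length: int, max_new_tokens: int):
        self.start_length = start_length
        self.max_new_tokens = max_new_tokens

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return input_ids.shape[-1] - self.start_length >= self.max_new_tokens


# Used as: model.generate(..., stopping_criteria=StoppingCriteriaList([StopAfterNewTokens(0, 128)]))
```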
Maybe a bigger discussion, but could it make sense to move some more complicated tasks such as real-time speech recognition to something like https://github.com/huggingface/speechbox?
For cases like real-time ASR, more optimized methods, for example Rust modules, would be super cool.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Feature request
Firstly, thank you to @Narsil for developing the speech recognition pipeline - it's incredibly helpful for running the full speech-to-text mapping in one call, pre- and post-processing included.
There are a couple of things that currently mean the pipeline is not super compatible with 🤗 Datasets. I'll motivate them below with an example.
Motivation
Let's take the example of evaluating a (dummy) Wav2Vec2 checkpoint on the (dummy) LibriSpeech ASR dataset, and printing the first audio sample of the dataset to inspect its format.
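A minimal sketch of what that setup might look like (the dummy checkpoint and dataset names here are assumptions):

```python
from datasets import load_dataset
from transformers import pipeline

# Assumed names for the dummy checkpoint and dataset.
pipe = pipeline("automatic-speech-recognition", model="hf-internal-testing/tiny-random-wav2vec2")
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

print(dataset[0]["audio"])
# {"path": "...", "array": array([...], dtype=float32), "sampling_rate": 16000}
```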
So the audio samples are in the format `{"path": str, "array": np.array, "sampling_rate": int}`; the NumPy audio array values are stored under the key `"array"`. This format is ubiquitous across audio datasets in 🤗 Datasets: all audio datasets take this format.

However, pipeline expects the audio samples in the format `{"sampling_rate": int, "raw": np.array}` (see transformers/src/transformers/pipelines/automatic_speech_recognition.py, lines 209 to 211, commit 0ee7118).
This means we have to do some hacking around to get the audio samples into the right format for pipeline:
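For instance, something along these lines (the function name is just illustrative):

```python
def preprocess(batch):
    # Re-map the datasets-style audio dict onto the keys the pipeline expects.
    audio = batch["audio"]
    batch["raw"] = audio["array"]
    batch["sampling_rate"] = audio["sampling_rate"]
    return batch
```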
We then apply the function to our dataset using the `map` method. If pipeline's `__call__` method were matched to Datasets' audio features, we'd be able to use any audio dataset directly with pipeline, with no hacky feature renaming. This would be very nice for the user!
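Putting the current workaround and the desired behaviour side by side (a sketch, continuing from the `preprocess` function above):

```python
# Current workaround: rename the keys per sample (or across the dataset with `map`).
sample = preprocess(dict(dataset[0]))
out = pipe({"raw": sample["raw"], "sampling_rate": sample["sampling_rate"]})

# Desired: the datasets-style audio dict works as-is, no renaming.
out = pipe(dataset[0]["audio"])  # {"path": ..., "array": ..., "sampling_rate": ...}
print(out["text"])
```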
Furthermore, the outputs returned by pipeline are a list of dicts (`List[Dict]`; see transformers/src/transformers/pipelines/automatic_speech_recognition.py, line 477, commit 0ee7118). This means we have to unpack and index them before we can use them for any downstream task (such as WER calculations).
It would be nice if pipeline returned a `ModelOutput` class. That way, we could index the text column directly from the returned object. IMO this is more intuitive for the user than renaming their audio column and then iterating over the returned dict objects to get the predicted text.
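To make the contrast concrete (a sketch; the attribute-style access on a `ModelOutput`-like return is the proposal, not current behaviour):

```python
samples = [dataset[0]["audio"], dataset[1]["audio"]]

# Current: a list of dicts that has to be unpacked before e.g. computing WER.
outputs = pipe(samples)                      # [{"text": "..."}, {"text": "..."}]
predictions = [out["text"] for out in outputs]

# Proposed: a ModelOutput-style object, so the text column can be indexed directly.
outputs = pipe(samples)
predictions = outputs.text                   # hypothetical attribute access
```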
Your contribution
WDYT @Narsil @patrickvonplaten? Happy to add these changes to smooth out the user experience!