move evaluation scripts #842

Merged: 11 commits, Sep 19, 2024
51 changes: 51 additions & 0 deletions AudioQnA/benchmark/accuracy/README.md
@@ -0,0 +1,51 @@
# AudioQnA Accuracy Evaluation

AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for question answering (QnA) on audio scenes; it combines Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the pipeline for evaluating ASR accuracy.

## Dataset

We evaluate ASR accuracy on the test set of the LibriSpeech [dataset](https://huggingface.co/datasets/andreagasparini/librispeech_test_only), which contains 2,620 audio recordings and their reference transcripts.

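For quick reference, the split can be loaded with the `datasets` library using the same call as the evaluation scripts below:

```python
from datasets import load_dataset

# Load the LibriSpeech test-clean split (2,620 utterances) used for the WER evaluation.
librispeech_test_clean = load_dataset(
    "andreagasparini/librispeech_test_only", "clean", split="test", trust_remote_code=True
)
print(librispeech_test_clean[0]["text"])  # reference transcript of the first utterance
```
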
## Metrics

We measure the WER (Word Error Rate) of the ASR microservice.

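WER counts word-level substitutions (S), deletions (D), and insertions (I) against the reference transcript: WER = (S + D + I) / N, where N is the number of reference words. A minimal sketch of the computation with the `evaluate` package (which relies on `jiwer`, listed in requirements.txt) on a toy sentence pair:

```python
from evaluate import load

wer = load("wer")
# One substitution ("box" for "fox") out of four reference words -> WER = 0.25.
score = wer.compute(
    references=["the quick brown fox"],
    predictions=["the quick brown box"],
)
print(100 * score)  # 25.0, reported as a percentage like the results below
```
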
## Evaluation

### Launch ASR microservice

Launch the ASR microservice with the following commands. For more details, please refer to the [doc](https://github.com/opea-project/GenAIComps/tree/main/comps/asr).

```bash
git clone https://github.com/opea-project/GenAIComps
cd GenAIComps
docker build -t opea/whisper:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/asr/whisper/Dockerfile .
# change the model to evaluate by editing the --model_name_or_path argument below
docker run -p 7066:7066 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/whisper:latest --model_name_or_path "openai/whisper-tiny"
```

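Before running the full evaluation, you can sanity-check the service with a single request. The sketch below assumes a short local speech clip named `sample.wav` (a placeholder; any WAV file works) and uses the same request format as `online_eval.py`:

```python
import base64
import json

import requests

# Base64-encode a short local clip and post it to the ASR endpoint exposed on port 7066.
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    url="http://localhost:7066/v1/asr",
    data=json.dumps({"audio": audio_b64}),
    proxies={"http": None},
)
print(response.json()["asr_result"])  # transcription returned by the service
```
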
### Evaluate

Install dependencies:

```bash
pip install -r requirements.txt
```

Evaluate the ASR accuracy:

```bash
# evaluate the local (offline) model
# python local_eval.py
# evaluate the accuracy of the online ASR microservice
python online_eval.py
```

### Performance Result

Here are the tested results for reference:

| Model            | WER (%) |
| ---------------- | ------- |
| whisper-large-v2 | 2.87    |
| whisper-large    | 2.7     |
| whisper-medium   | 3.45    |
35 changes: 35 additions & 0 deletions AudioQnA/benchmark/accuracy/local_eval.py
@@ -0,0 +1,35 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import torch
from datasets import load_dataset
from evaluate import load
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

MODEL_NAME = "openai/whisper-large-v2"

librispeech_test_clean = load_dataset(
"andreagasparini/librispeech_test_only", "clean", split="test", trust_remote_code=True
)
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)


def map_to_pred(batch):
    # Transcribe one utterance with the local Whisper model and keep the
    # normalized reference/prediction pair for the WER computation.
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch["text"])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to(device))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch


result = librispeech_test_clean.map(map_to_pred)

wer = load("wer")
# Report the corpus-level WER as a percentage.
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
56 changes: 56 additions & 0 deletions AudioQnA/benchmark/accuracy/online_eval.py
@@ -0,0 +1,56 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import base64
import json

import requests
from datasets import load_dataset
from evaluate import load
from pydub import AudioSegment
from transformers import WhisperProcessor

MODEL_NAME = "openai/whisper-large-v2"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)

librispeech_test_clean = load_dataset(
"andreagasparini/librispeech_test_only", "clean", split="test", trust_remote_code=True
)


def map_to_pred(batch):
    # Keep the normalized reference transcript for the WER computation.
    batch["reference"] = processor.tokenizer._normalize(batch["text"])

    file_path = batch["file"]
    # Rebuild the audio path to match the local LibriSpeech/test-clean/<speaker>/<chapter>/ layout.
    pidx = file_path.rfind("/")
    sidx = file_path.rfind(".")

    file_path_prefix = file_path[: pidx + 1]
    file_path_suffix = file_path[sidx:]
    file_path_mid = file_path[pidx + 1 : sidx]
    splits = file_path_mid.split("-")
    file_path_mid = f"LibriSpeech/test-clean/{splits[0]}/{splits[1]}/{file_path_mid}"

    file_path = file_path_prefix + file_path_mid + file_path_suffix

    audio = AudioSegment.from_file(file_path)
    # Re-encode to WAV; pydub's export defaults to mp3 when no format is specified.
    audio.export("tmp.wav", format="wav")
    with open("tmp.wav", "rb") as f:
        test_audio_base64_str = base64.b64encode(f.read()).decode("utf-8")

    # Send the base64-encoded audio to the ASR microservice launched above.
    inputs = {"audio": test_audio_base64_str}
    endpoint = "http://localhost:7066/v1/asr"
    response = requests.post(url=endpoint, data=json.dumps(inputs), proxies={"http": None})

    result_str = response.json()["asr_result"]

    batch["prediction"] = processor.tokenizer._normalize(result_str)
    return batch


result = librispeech_test_clean.map(map_to_pred)

wer = load("wer")
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
8 changes: 8 additions & 0 deletions AudioQnA/benchmark/accuracy/requirements.txt
@@ -0,0 +1,8 @@
datasets
evaluate
jiwer
librosa
pydub
soundfile
torch
transformers