
Pyannote 3.1.0 still on CPU only? #1563

Closed
fablau opened this issue Nov 25, 2023 · 18 comments

@fablau

fablau commented Nov 25, 2023

I am sorry to open this issue again, but I am still seeing pyannote version 3.1.0 run on the CPU only.

I just installed the latest version with:

pip3 install pyannote.audio

And I can confirm I have the latest version installed with:

pip list

And yet, I see my program using just the CPU. I am testing it with an RTX A5000.

Here is my code:

import sys
from pyannote.audio import Pipeline
import torch

# command-line arguments: input WAV file, number of speakers, output file
fileOutWav = sys.argv[1]
spkrsNo = int(sys.argv[2])
fileDiary = sys.argv[3]

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="xxxxxxxxxxxxx")

pipeline.to(torch.device("cuda"))

# 4. apply pretrained pipeline
diarization = pipeline(fileOutWav, num_speakers=spkrsNo)

# 5. print the result

with open(fileDiary, mode='w') as file_object:
	for turn, _, speaker in diarization.itertracks(yield_label=True):
		#print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}")
		print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}", file=file_object)

Is there anything wrong with my code? Or any other steps I might have missed?

I am using the latest version of torch on Linux.


Thank you for your issue. You might want to check the FAQ if you haven't done so already.

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

@hbredin
Member

hbredin commented Nov 26, 2023

You are using the wrong pretrained pipeline.
Switch from pyannote/speaker-diarization to pyannote/speaker-diarization-3.1.
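For reference, the change amounts to a single line in the script posted above (same placeholder token as in the original report):

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="xxxxxxxxxxxxx")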

@fablau
Author

fablau commented Nov 27, 2023

Thank you, I tried that and got this error:

pipeline.to(torch.device("cuda"))
2023-11-27T06:25:21.370318815Z AttributeError: 'NoneType' object has no attribute 'to'

Do I need to remove that line? Is that no longer needed?

@hbredin
Member

hbredin commented Nov 27, 2023

Looks like you forgot to request access to this new pipeline on HuggingFace model hub.
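When the gated model's terms have not been accepted, Pipeline.from_pretrained comes back with None, which is exactly what the AttributeError above points to. A minimal guard, as a sketch (the error message wording is illustrative):

import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="xxxxxxxxxxxxx"
)
if pipeline is None:
    # from_pretrained returned nothing: access to the gated repo was likely not granted
    raise RuntimeError(
        "Pipeline failed to load; accept the terms at hf.co/pyannote/speaker-diarization-3.1"
    )
pipeline.to(torch.device("cuda"))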

@arnavmehta7

arnavmehta7 commented Nov 27, 2023

Hi @hbredin, I also tried the latest 3.1.0 version with the 3.1 model. However, it is also extremely slow for me:
5 minutes of audio takes around 5 minutes just to diarize.

@pourmand1376

I am having the same problem here. It is extremely slow.

@hbredin
Copy link
Member

hbredin commented Nov 27, 2023

Tagging this issue as cannot reproduce.
Please provide a minimal reproducible example on Google Colab.

@hbredin
Member

hbredin commented Nov 27, 2023

You can also upload your audio file here to get an idea of the expected processing speed on a T4 GPU.

@pourmand1376

pourmand1376 commented Nov 27, 2023

It seems that the problem was in my installation.

I used this as requirements.txt (found here):

gradio==3.38.0
--extra-index-url https://download.pytorch.org/whl/cu113
torch==2.0.1
pyannote-audio==3.1.0

And this for the Dockerfile.

FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
    git \
    git-lfs \
    wget \
    curl \
    # python build dependencies \
    build-essential \
    libssl-dev \
    zlib1g-dev \
    libbz2-dev \
    libreadline-dev \
    libsqlite3-dev \
    libncursesw5-dev \
    xz-utils \
    tk-dev \
    libxml2-dev \
    libxmlsec1-dev \
    libffi-dev \
    liblzma-dev \
    # gradio dependencies \
    ffmpeg \
    ca-certificates \
    # fairseq2 dependencies \
    libsndfile-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:${PATH}

WORKDIR ${HOME}

RUN git clone https://github.com/yyuu/pyenv.git .pyenv

ENV PATH=${HOME}/.pyenv/shims:${HOME}/.pyenv/bin:${PATH}

ARG PYTHON_VERSION=3.10
RUN pyenv install ${PYTHON_VERSION} && \
    pyenv global ${PYTHON_VERSION} && \
    pyenv rehash && \
    pip install --no-cache-dir -U pip setuptools wheel

COPY --chown=1000 ./requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /tmp/requirements.txt

COPY --chown=1000 . ${HOME}/app
ENV PYTHONPATH=${HOME}/app \
    PYTHONUNBUFFERED=1 \
    GRADIO_ALLOW_FLAGGING=never \
    GRADIO_NUM_PORTS=1 \
    GRADIO_SERVER_NAME=0.0.0.0 \
    GRADIO_THEME=huggingface \
    SYSTEM=spaces \
    GRADIO_SERVER_PORT=7860
EXPOSE 7860
WORKDIR ${HOME}/app
CMD ["python", "app.py"]

I do not know for sure whether it is using the GPU or not, but without this it took around 90 minutes to process a 110-minute file. Now it takes around 1~2 minutes.
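A quick way to check whether the installed torch build can actually see a GPU (a generic torch check, independent of pyannote):

import torch

print(torch.__version__)          # CUDA wheels usually carry a +cuXXX suffix, e.g. 2.0.1+cu117
print(torch.cuda.is_available())  # True only when a CUDA-enabled build finds a usable GPU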

@fablau
Author

fablau commented Nov 27, 2023

@pourmand1376 thank you for providing your Docker code. Could you also please provide the Python code you used for the diarization with pyannote?

@fablau
Author

fablau commented Nov 27, 2023

Looks like you forgot to request access to this new pipeline on HuggingFace model hub.

How do I do that?

@hbredin
Member

hbredin commented Nov 27, 2023

The same way you already did for the old pipeline: by visiting hf.co/pyannote/speaker-diarization-3.1 and agreeing to the terms.

@fablau
Author

fablau commented Nov 27, 2023

Thanks, I fixed the error I posted above by simply re-accepting the terms at the links below, so that my authorization token worked again:

https://hf.co/pyannote/segmentation-3.0
https://hf.co/pyannote/speaker-diarization-3.1

I am still investigating the missing GPU usage... I'll be back as soon as I find out more.

@fablau
Author

fablau commented Nov 27, 2023

Yes! It looks like the requirements @pourmand1376 posted above fixed the problem! Now I see the GPU being used ;)

My guess is that the key line is this one:

--extra-index-url https://download.pytorch.org/whl/cu113

because I tried the other lines individually and they didn't do the trick.
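One way to confirm that guess (an illustrative check, not part of the original comment): a CPU-only torch wheel reports no CUDA version at all.

import torch

# None on a CPU-only build; a version string such as "11.3" on a CUDA build
print(torch.version.cuda)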

@pourmand1376

pourmand1376 commented Nov 28, 2023

@pourmand1376 thank you for providing your Docker code. Could you also please provide the Python code you used for the diarization with pyannote?

Here it is (this is not a minimal example; it also splits the file and creates a zip file for the user):

import os
import shutil
import zipfile
from pathlib import Path

import gradio as gr
import torch
from dotenv import load_dotenv
from pydub import AudioSegment
from pyannote.audio import Pipeline

load_dotenv()

HF_API = os.getenv("HF_API")

print(f"HF API Length: {len(HF_API)}")
DESCRIPTION = """
# Speaker Diarization v3.1.0
"""


pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=HF_API
)
pipeline.to(torch.device("cuda"))


def zip_folder(folder_path):
    folder_name = os.path.basename(folder_path)
    zip_path = f"{folder_name}.zip"
    zip_file = zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED)
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            zip_file.write(os.path.join(root, file))
    zip_file.close()
    return zip_path


def rmrf(path):
    if os.path.isfile(path):
        os.remove(path)
    elif os.path.isdir(path):
        shutil.rmtree(path)


def predict(number_of_speakers, audio_source, input_audio_mic, input_audio_file):
    if audio_source == "microphone":
        input_data = input_audio_mic
    else:
        input_data = input_audio_file

    print(input_data)

    if number_of_speakers == 0:
        diarization = pipeline(input_data)
    else:
        diarization = pipeline(input_data, num_speakers=number_of_speakers)

    text_output = ""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"start={turn.start}s stop={turn.end}s speaker_{speaker}")
        text_output = (
            text_output
            + f"start={turn.start}s stop={turn.end}s speaker_{speaker}"
            + "\n"
        )

    song = AudioSegment.from_wav(input_data)
    rmrf("files")
    print(Path("files").absolute())  # .absolute must be called, otherwise this prints a bound method
    Path("files").mkdir(exist_ok=True, parents=True)
    for i, (turn, _, speaker) in enumerate(diarization.itertracks(yield_label=True)):
        try:
            clipped = song[turn.start * 1000 : turn.end * 1000]
            clipped.export(f"files/{i:03}.wav", format="wav", bitrate=16000)

        except Exception as e:
            print(e)

    output_path = zip_folder("files")
    return (text_output, output_path)


def update_audio_ui(audio_source: str) -> tuple[dict, dict]:
    mic = audio_source == "microphone"
    return (
        gr.update(visible=mic, value=None),  # input_audio_mic
        gr.update(visible=not mic, value=None),  # input_audio_file
    )


with gr.Blocks(css="style.css") as demo:
    gr.Markdown(DESCRIPTION)
    with gr.Group():
        with gr.Row():
            number_of_speakers = gr.Number(
                label="Number of Speakers",
                info="Keep it zero, if you want the model to automatically detect the number of speakers",
            )
        with gr.Row() as audio_box:
            audio_source = gr.Radio(
                choices=["file", "microphone"], value="file", interactive=True
            )
            input_audio_mic = gr.Audio(
                label="Input speech",
                type="filepath",
                source="microphone",
                visible=False,
            )
            input_audio_file = gr.Audio(
                label="Input speech",
                type="filepath",
                source="upload",
                visible=True,
            )
            final_audio = gr.Audio(label="Output", visible=False)
        audio_source.change(
            fn=update_audio_ui,
            inputs=audio_source,
            outputs=[input_audio_mic, input_audio_file],
            queue=False,
            api_name=False,
        )
        input_audio_mic.change(lambda x: x, input_audio_mic, final_audio)
        input_audio_file.change(lambda x: x, input_audio_file, final_audio)
        submit = gr.Button("Submit")
        text_output = gr.Textbox(
            label="Transcribed Text",
            value="",
            interactive=False,
            lines=10,
            scale=10,
            max_lines=10,
        )
        file_output = gr.File(label="output")

        submit.click(
            fn=predict,
            inputs=[
                number_of_speakers,
                audio_source,
                input_audio_mic,
                input_audio_file,
            ],
            outputs=[text_output, file_output],
            api_name="predict",
        )


demo.queue(max_size=50).launch()

@EarningsCall

(quoting @pourmand1376's comment above in full, including the same requirements.txt and Dockerfile)

This worked for me too.

Specifically, what I did was create a requirements.txt file with the contents:

gradio==3.38.0
--extra-index-url https://download.pytorch.org/whl/cu113
torch==2.0.1
pyannote-audio==3.1.0

Then I installed it with pip install -r requirements.txt.

Now, I can run some simple code:

In [1]: from pyannote.audio import Pipeline

In [2]: import torch

In [3]: pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
torchvision is not available - cannot save figures

In [4]: pipeline.to(torch.device("cuda"))
Out[4]: <pyannote.audio.pipelines.speaker_diarization.SpeakerDiarization at 0x7f2ce8f143d0>

In [5]: diarization = pipeline("/tmp/tmphgpfklya.wav")

And $ nvidia-smi -l 1 shows:

[screenshot: nvidia-smi output showing the GPU being utilized]

It took me quite a while to find this solution. Should it be added to the README? Why is this specific torch build required for the GPU to be properly utilized?


stale bot commented Jun 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 1, 2024
@stale stale bot closed this as completed Jul 1, 2024
@helLf1nGer

What is weirder on my side is that the 3.1 model sometimes runs on the GPU and sometimes on the CPU, while the 3.0 model always runs on the GPU. So I wrote a bit of code to choose between the models (roughly as in the sketch below). I always start with 3.1 because it does the segmentation faster, but if I see within 5 seconds that it is using the CPU instead of the GPU, I cancel that run and re-run it with 3.0. Who knows...
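A rough sketch of that kind of fallback logic (hypothetical: the helper name is mine, and it falls back on load failure or missing CUDA rather than on the manual cancel-and-rerun described above):

import torch
from pyannote.audio import Pipeline

def load_diarization_pipeline(token):
    # prefer 3.1 (faster segmentation); fall back to 3.0 if 3.1 cannot be loaded
    for model_id in ("pyannote/speaker-diarization-3.1",
                     "pyannote/speaker-diarization-3.0"):
        pipeline = Pipeline.from_pretrained(model_id, use_auth_token=token)
        if pipeline is None:
            continue
        if torch.cuda.is_available():
            pipeline.to(torch.device("cuda"))
        return pipeline
    raise RuntimeError("no diarization pipeline could be loaded")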
