Merge pull request PaddlePaddle#126 from luyao-cv/develop

[AppFlow] fix audio2img readme and add missing files
WAYKEN-TSE · Sep 8, 2023 · 7fe27b5
2 parents bfa38e1 + 85125bb
commit 7fe27b5
Showing 6 changed files with 146 additions and 51 deletions.
diff --git a/applications/Audio2Caption/README.md b/applications/Audio2Caption/README.md
@@ -1,16 +1,18 @@
-# Audio2Caption
+### 音频描述（Audio-to-Caption Generation）
 
-## 1. 应用简介
+
+
+#### 1. Application introduction
 
 Enter audio and prompt words for question and answer.
 
 *****
 - No training is needed.
-- Integration with the moedel of 🤗  [whisper](), [chatglm]().
+- Integration with the [whisper]() and [chatglm]() models.
 
 ----
 
-## 2. Demo
+#### 2. Demo
 *****
 example:
 
@@ -37,7 +39,11 @@ print(result)
 
 ```
 
-|  输入音频 | 输入prompt | 输出识别 | 输出结果 |
+<div align="center">
+
+|  Input Audio | Input Prompt | Output ASR | Output Text |
 | --- | --- | ---  | --- | 
 |[zh.wav](https://github.com/luyao-cv/file_download/blob/main/assets/zh.wav) | "描述这段话." |"我认为跑步最重要的就是给我带来了身体健康" |这段话表达了作者认为跑步最重要的好处之一是身体健康。作者认为,通过跑步,身体得到了良好的锻炼,身体健康得到了改善。作者还强调了跑步对身体健康的重要性,并认为这是最值得投资的运动之一。 |
 
+</div>
+
diff --git a/applications/Audio2Img/README.md b/applications/Audio2Img/README.md
@@ -1,31 +1,31 @@
-# Audio To Image
+### 音频生成图像（Audio-to-Image Generation）
 
-## 1. 应用简介
+#### 1. Application introduction
 
 *****
 
 Generate an image from audio (optionally with a text prompt or an image) using [ImageBind](https://facebookresearch.github.io/ImageBind/paper)'s unified latent space and stable-diffusion-2-1-unclip.
 
 - No training is needed.
-- Integration with 🤗  [ppdiffusers](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers).
+- Integration with [ppdiffusers](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers).
 
 ----
 
 **Support Tasks**
 
 - [Audio To Image](#audio-to-image)
-  - [1. 应用简介](#1-应用简介)
-  - [2. 运行](#2-运行)
-  - [3. 可视化](#3-可视化)
+  - [1. Application Introduction](#1-Application)
+  - [2. Run](#2-Run)
+  - [3. Visualization](#3-Visualization)
     - [Audio to Image](#audio-to-image-1)
-      - [3.1.1 命令](#311-命令)
-      - [3.1.2 效果](#312-效果)
+      - [3.1.1 Instruction](#311-Instruction)
+      - [3.1.2 Result](#312-Result)
     - [Audio+Text to Image](#audiotext-to-image)
-      - [3.2.1 命令](#321-命令)
-      - [3.2.2 效果](#322-效果)
+      - [3.2.1 Instruction](#321-Instruction)
+      - [3.2.2 Result](#322-Result)
     - [Audio+Image to Image](#audioimage-to-image)
-      - [3.3.1 命令](#331-命令)
-      - [3.3.2 效果](#332-效果)
+      - [3.3.1 Instruction](#331-Instruction)
+      - [3.3.2 Result](#332-Result)
 
 ----
 
@@ -35,7 +35,7 @@ Generate image from audio(w/ prompt or image) with [ImageBind](https://facebookr
 - [v0.0]: Support fusing audio, text (prompt), and image in the ImageBind latent space.
 
 
-## 2. 运行
+#### 2. Run
 *****
 
 Example: generate an image from audio, optionally combined with other modalities (e.g. image or text), using ImageBind and StableUnCLIPImg2ImgPipeline.
@@ -50,11 +50,11 @@ python audio2img_imagebind.py \
 ```
 
 ----
-## 3. 可视化
+#### 3. Visualization
 ----
 
-### Audio to Image
-#### 3.1.1 命令
+#### Audio to Image
+#### 3.1.1 Instruction
 
 ```bash
 cd applications/Audio2Img
@@ -64,14 +64,14 @@ python audio2img_imagebind.py \
 --stable_unclip_model_name_or_path The dir name of StableUnCLIPImg2ImgPipeline pretrained checkpoint. \
 --input_audio bird_audio.wav
 ```
-#### 3.1.2 效果
-|  输入音频 | 输出图像 |
+#### 3.1.2 Result
+|  Input Audio | Output Image |
 | --- | --- | 
 |[bird_audio.wav](https://github.com/luyao-cv/file_download/blob/main/assets/bird_audio.wav)| ![audio2img_output_bird](https://github.com/luyao-cv/file_download/blob/main/vis_audio2img/audio2img_output_bird.jpg)  |
 
 
-### Audio+Text to Image
-#### 3.2.1 命令
+#### Audio+Text to Image
+#### 3.2.1 Instruction
 ```bash
 cd applications/Audio2Img
 
@@ -81,14 +81,14 @@ python audio2img_imagebind.py \
 --input_audio bird_audio.wav  \
 --input_text 'A photo.'
 ```
-#### 3.2.2 效果
-|  输入音频 | 输入文本 | 输出图像 |
+#### 3.2.2 Result
+|  Input Audio | Input Text | Output Image |
 | --- | --- |  --- | 
 |[bird_audio.wav](https://github.com/luyao-cv/file_download/blob/main/assets/bird_audio.wav) | 'A photo.' | ![audio_text_to_img_output_bird_a_photo](https://github.com/luyao-cv/file_download/blob/main/vis_audio2img/audio_text_to_img_output_bird_a_photo.jpg)
 
 
-### Audio+Image to Image
-#### 3.3.1 命令
+#### Audio+Image to Image
+#### 3.3.1 Instruction
 ```bash
 cd applications/Audio2Img
 
@@ -99,9 +99,8 @@ python audio2img_imagebind.py \
 --input_image dog_image.jpg
 ```
 
-#### 3.3.2 效果
-|  输入音频 | 输入图像 | 输出图像 |
+#### 3.3.2 Result
+|  Input Audio | Input Image | Output Image |
 | --- | --- |  --- | 
 |[wave.wav](https://github.com/luyao-cv/file_download/blob/main/assets/wave.wav) | ![input_dog_image](https://github.com/luyao-cv/file_download/blob/main/assets/dog_image.jpg) | ![audio_img_to_img_output_wave_dog](https://github.com/luyao-cv/file_download/blob/main/vis_audio2img/audio_img_to_img_output_wave_dog.jpg)
 
-
diff --git a/applications/AudioChat/README.md b/applications/AudioChat/README.md
@@ -1,16 +1,16 @@
-# Audio Chat
+### 音频对话（Audio-to-Chat Generation）
 
-## 1. 应用简介
+#### 1. Application introduction
 
 Enter audio and prompt words for question and answer.
 
 *****
 - No training is needed.
-- Integration with the moedel of 🤗  [whisper](), [chatglm](). [fastspeech2]().
+- Integration with the [whisper](), [chatglm](), and [fastspeech2]() models.
 
 ----
 
-## 2. Demo
+#### 2. Demo
 *****
 example:
 
@@ -31,6 +31,6 @@ result = task(audio=audio_file, prompt=prompt, output=output_path)
 
 ```
 
-|  输入音频 | 输入prompt | 输出文本 | 输出结果 |
+|  Input Audio | Input Prompt | Output Text | Output Audio |
 | --- | --- | ---  | --- | 
-|[zh.wav](https://github.com/luyao-cv/file_download/blob/main/assets/zh.wav) | "描述这段话." |"这段话表达了作者认为跑步最重要的好处之一是身体健康。作者认为,通过跑步,身体得到了良好的锻炼,身体健康得到了改善。作者还强调了跑步对身体健康的重要性,并认为这是最值得投资的运动之一。" |[audiochat-result.wav](https://github.com/luyao-cv/file_download/blob/main/assets/zh.wav)|
+|[zh.wav](https://github.com/luyao-cv/file_download/blob/main/assets/zh.wav) | "描述这段话." |"这段话表达了作者认为跑步最重要的好处之一是身体健康。作者认为,通过跑步,身体得到了良好的锻炼,身体健康得到了改善。作者还强调了跑步对身体健康的重要性,并认为这是最值得投资的运动之一。" |[audiochat-result.wav](https://github.com/luyao-cv/file_download/blob/main/assets/audiochat-result.wav)|
diff --git a/applications/MusicGeneration/README.md b/applications/MusicGeneration/README.md
@@ -1,16 +1,16 @@
-# Music Generation
+### 音乐生成（Music Generation）
 
-## 1. 应用简介
+#### 1. Application introduction
 
 Enter audio and prompt words for question and answer.
 
 *****
 - No training is needed.
-- Integration with the moedel of 🤗  [minigpt4](), [minigpt4](), [chatglm]().
+- Integration with the [minigpt4](), [chatglm](), and [audioldm]() models.
 
 ----
 
-## 2. Demo
+#### 2. Demo
 *****
 example:
 
@@ -25,28 +25,41 @@ paddle.seed(1024)
 # Text to music
 task = Appflow(app="music_generation", models=["cvssp/audioldm"])
 prompt = "A classic cocktail lounge vibe with smooth jazz piano and a cool, relaxed atmosphere."
-negative_prompt = "low quality, average quality"
+negative_prompt = 'low quality, average quality, muffled quality, noise interference, poor and low-grade quality, inaudible quality, low-fidelity quality'  
+audio_length_in_s = 5
 num_inference_steps = 20
-audio_length_in_s = 10
 output_path = "tmp.wav"
 result = task(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=num_inference_steps, audio_length_in_s=audio_length_in_s, generator = paddle.Generator().manual_seed(120))['result']
 scipy.io.wavfile.write(output_path, rate=16000, data=result)
 
 # image to music
 task1 = Appflow(app="music_generation", models=["miniGPT4/MiniGPT4-7B"])
-negative_prompt = "low quality, average quality"
+negative_prompt = 'low quality, average quality, muffled quality, noise interference, poor and low-grade quality, inaudible quality, low-fidelity quality'  
+audio_length_in_s = 5
 num_inference_steps = 20
-audio_length_in_s = 10
 output_path = "tmp.wav"
 minigpt4_text = 'describe the image, '
-image_pil = Image.open("tmp.jpg").convert("RGB")
-result = task1(image=image_pil, minigpt4_text=minigpt4_text, )['result'].split('#')[0]
+image_pil = Image.open("dance.png").convert("RGB")
+result = task1(image=image_pil, minigpt4_text=minigpt4_text)['result'].split('#')[0]
 paddle.device.cuda.empty_cache()
-# miniGPT4 output: The image shows a pineapple cocktail sitting on a table in front of a person. The pineapple is cut in half and the drink is poured into the top half. The person is holding a straw in their hand and appears to be sipping the drink. There are also some other items on the table, such as a plate with food and a glass of water. The background is a marble table with a pattern on it.
-prompt = "Given the scene description in the following paragraph, please create a musical style sentence that fits the scene.Description:{}.".format(result)
+# miniGPT4 output: The image shows a crowded nightclub with people dancing on the dance floor. The lights on the dance floor are green and red, and there are several people on the dance floor. The stage is at the back of the room, and there are several people on stage. The walls of the nightclub are decorated with neon lights and there are several people sitting at tables in the background. The atmosphere is lively and energetic.
+
+prompt = "Given the scene description in the following paragraph, please create a musical style sentence that fits the scene.  Description:{}.".format(result)
 task2 = Appflow(app="music_generation", models=["THUDM/chatglm-6b", "cvssp/audioldm"])
 result = task2(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=num_inference_steps, audio_length_in_s=audio_length_in_s, generator = paddle.Generator().manual_seed(120))['result']
 scipy.io.wavfile.write(output_path, rate=16000, data=result)
-# chatglm ouptput: The music swells as the image shows the pineapple cocktail on the table, with the drink cut in half and the person sipping it with a straw. The background is a marble table with a pattern, and the other items on the table are a plate with food and a glass of water. The music fades until it disappears, leaving the scene in the person's hand the pineapple drink, with the music once again swelling in the background.
+# chatglm output: The music is playing, and the crowd is dancing like never before. The lights are bright and the atmosphere is electric, with people swaying to the rhythm of the music and the energy of the night. The dance floor is a sea of movement, with people moving to the music and feeling the rhythm of their feet. The stage is a place of magic, with people on it, performing their best. The neon lights of the nightclub are a testament to the energy and excitement of the night, with people's faces lit up as they perform. And as the music continues to play, the crowd continues to dance, never letting up, until the night is over.
 ```
 
+
+#### Text to music
+|  Input Prompt | Output Music |
+| --- | --- |
+|'A classic cocktail lounge vibe with smooth jazz piano and a cool, relaxed atmosphere.'| [jazz_output.wav](https://github.com/luyao-cv/file_download/blob/main/assets/jazz_output.wav)
+
+---
+
+#### Image to music
+|  Input Image | Output Caption | Output Text | Output Music |
+| --- | --- |  --- |  --- | 
+|![dance.png](https://github.com/luyao-cv/file_download/blob/main/vis_music_generation/dance.png) | 'The image shows a crowded nightclub with people dancing on the dance floor. The lights on the dance floor are green and red, and there are several people on the dance floor. The stage is at the back of the room, and there are several people on stage. The walls of the nightclub are decorated with neon lights and there are several people sitting at tables in the background. The atmosphere is lively and energetic.' | 'The music is playing, and the crowd is dancing like never before. The lights are bright and the atmosphere is electric, with people swaying to the rhythm of the music and the energy of the night. The dance floor is a sea of movement, with people moving to the music and feeling the rhythm of their feet. The stage is a place of magic, with people on it, performing their best. The neon lights of the nightclub are a testament to the energy and excitement of the night, with people's faces lit up as they perform. And as the music continues to play, the crowd continues to dance, never letting up, until the night is over.' | [dance_output.wav](https://github.com/luyao-cv/file_download/blob/main/assets/dance_output.wav)
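
The demo above saves the generated waveform with `scipy.io.wavfile.write(output_path, rate=16000, data=result)`. As a rough, dependency-free sketch of a comparable step, the snippet below stores a float waveform as 16-bit PCM using only the standard library (scipy keeps the array's own dtype instead). The sine tone merely stands in for the model output; `write_wav` and `demo_tone.wav` are illustrative names that do not exist in PaddleMIX, and the sketch assumes the waveform holds `audio_length_in_s * 16000` mono samples in the range [-1, 1].

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # same value as rate=16000 in the demo


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overflow
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Hypothetical stand-in for the model output: a 440 Hz tone lasting 5 seconds.
audio_length_in_s = 5
n_samples = audio_length_in_s * SAMPLE_RATE
tone = [0.3 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n_samples)]

output_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
write_wav(output_path, tone)

# Read the header back to confirm duration and sample rate.
with wave.open(output_path, "rb") as wav:
    n_frames = wav.getnframes()
    frame_rate = wav.getframerate()
```

Reading the header back is a cheap way to check that the file length matches the requested `audio_length_in_s` before listening to it.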
diff --git a/applications/README.md b/applications/README.md
@@ -54,6 +54,12 @@ result = task(prompt=prompt)['result']
 | [文本引导的图像变换（Image-to-Image Text-Guided Generation）](./image2image/README.md/#文本引导的图像变换image-to-image-text-guided-generation)              | `stable-diffusion-v1-5`    |    [fastdeploy](../ppdiffusers/deploy/README.md/#文本引导的图像变换image-to-image-text-guided-generation)    |
 | [文本图像双引导图像生成（Dual Text and Image Guided Generation）](./image2image/README.md/#文本图像双引导图像生成dual-text-and-image-guided-generation)          | `versatile-diffusion`    |    ❌      |
 | [文本条件的视频生成（Text-to-Video Generation）](./text2video/README.md/#文本条件的视频生成text-to-video-generation)      | `text-to-video-ms-1.7b`  |     ❌     |
+| [音频生成图像（Audio-to-Image Generation）](./Audio2Img/README.md/#audio-to-image)  | `imagebind stable-diffusion-2-1-unclip`  |          |
+| [音频描述（Audio-to-Caption Generation）](./Audio2Caption/README.md/#音频描述audio-to-caption-generation)  | `chatglm-6b whisper`  |          |
+| [音频对话（Audio-to-Chat Generation）](./AudioChat/README.md/#音频对话audio-to-chat-generation)  | `chatglm-6b whisper fastspeech2`  |          |
+| [音乐生成（Music Generation）](./MusicGeneration/README.md/#音乐生成music-generation)  | `chatglm-6b minigpt4 audioldm`  |          |
+
+
 
 More applications are under active development......
 

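The new table rows link to anchors such as `#音频描述audio-to-caption-generation`, which are derived from the renamed headings. The sketch below approximates that derivation so a link can be checked against its heading. It is a simplified assumption about GitHub's slug rules (lowercase, drop punctuation, turn spaces into hyphens) and ignores details such as suffixes for duplicate headings; `github_anchor` is a hypothetical helper, not part of any library.

```python
import re


def github_anchor(heading: str) -> str:
    """Approximate the anchor GitHub derives from a Markdown heading."""
    text = heading.strip().lower()
    # Keep word characters (including CJK), hyphens, and spaces; drop the rest,
    # which removes full-width parentheses and periods.
    text = re.sub(r"[^\w\- ]", "", text)
    return "#" + text.replace(" ", "-")


caption_anchor = github_anchor("音频描述（Audio-to-Caption Generation）")
intro_anchor = github_anchor("1. Application introduction")
print(caption_anchor)
print(intro_anchor)
```

Under these assumed rules, a heading renamed to `1. Application introduction` resolves to `#1-application-introduction`, so table-of-contents links need the full lowercase form.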
diff --git a/paddlemix/appflow/text2audio_generation.py b/paddlemix/appflow/text2audio_generation.py
@@ -0,0 +1,71 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from ppdiffusers import AudioLDMPipeline
+
+from .apptask import AppTask
+
+
+class AudioLDMPipelineTask(AppTask):
+    def __init__(self, task, model, **kwargs):
+        super().__init__(task=task, model=model, **kwargs)
+
+        # Static mode is disabled by default (dynamic graph)
+        self._static_mode = False
+        self._construct_model(model)
+
+    def _construct_model(self, model):
+        """
+        Construct the inference model for the predictor.
+        """
+
+        # Build the model from the pretrained checkpoint
+        model_instance = AudioLDMPipeline.from_pretrained(model)
+
+        self._model = model_instance
+
+    def _preprocess(self, inputs):
+        """Check that a prompt was provided and pass the inputs through."""
+        prompt = inputs.get("prompt", None) 
+        assert prompt is not None, "The prompt is None"
+
+        return inputs
+
+    def _run_model(self, inputs):
+        """
+        Run the task model from the outputs of the `_preprocess` function.
+        """
+        _num_inference_steps = inputs.get("num_inference_steps", 10)
+        _audio_length_in_s = inputs.get("audio_length_in_s", 5.0)
+        prompt = inputs["prompt"]
+        result = self._model(
+            prompt=prompt,
+            num_inference_steps=_num_inference_steps,
+            audio_length_in_s=_audio_length_in_s,
+        ).audios[0]
+
+        # Drop the keys consumed by this task before returning.
+        inputs.pop("prompt", None)
+        inputs.pop("num_inference_steps", None)
+        inputs.pop("audio_length_in_s", None)
+
+        inputs["result"] = result
+
+        return inputs
+
+    def _postprocess(self, inputs):
+        """
+        Return the inputs unchanged; `inputs["result"]` already holds the generated audio.
+        """
+        return inputs
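
The task class above follows a preprocess, run, postprocess pattern. The sketch below imitates that pattern with a stub model so the data flow can be followed without Paddle installed. `StubAudioModel` and `SketchAudioTask` are hypothetical names; the real `AppTask` base class is not shown in this diff, so the call order used in `__call__` is an assumption rather than the library's confirmed behaviour.

```python
class StubAudioModel:
    """Stand-in for a text-to-audio pipeline; returns silence."""

    def __call__(self, prompt, num_inference_steps, audio_length_in_s):
        sample_rate = 16000  # assumed output rate, as in the README demo
        return [0.0] * int(audio_length_in_s * sample_rate)


class SketchAudioTask:
    def __init__(self, model):
        self._model = model

    def _preprocess(self, inputs):
        # Fail early when the required prompt is missing.
        if inputs.get("prompt") is None:
            raise ValueError("The prompt is None")
        return inputs

    def _run_model(self, inputs):
        # Pop the consumed keys so they do not leak into the returned dict.
        steps = inputs.pop("num_inference_steps", 10)
        length = inputs.pop("audio_length_in_s", 5.0)
        prompt = inputs.pop("prompt")
        inputs["result"] = self._model(
            prompt=prompt,
            num_inference_steps=steps,
            audio_length_in_s=length,
        )
        return inputs

    def _postprocess(self, inputs):
        # Nothing to convert: the result is already a waveform.
        return inputs

    def __call__(self, **inputs):
        # Assumed order: preprocess -> run model -> postprocess.
        return self._postprocess(self._run_model(self._preprocess(inputs)))


task = SketchAudioTask(StubAudioModel())
output = task(prompt="smooth jazz piano", audio_length_in_s=2)
```

Popping each key as it is read keeps the defaults and the cleanup in one place, which avoids the two drifting apart.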