diff --git a/README.md b/README.md index 6397d14adaadb2..195ae1c03bed11 100644 --- a/README.md +++ b/README.md @@ -401,6 +401,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. diff --git a/README_es.md b/README_es.md index 899140210cf13d..193e2f5747b9c3 100644 --- a/README_es.md +++ b/README_es.md @@ -389,6 +389,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. 
**[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. diff --git a/README_hd.md b/README_hd.md index 826306cf67bf4a..aa5c5969f96ec2 100644 --- a/README_hd.md +++ b/README_hd.md @@ -361,6 +361,7 @@ conda install -c huggingface transformers 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research से) कागज के साथ [PhoBERT: वियतनामी के लिए पूर्व-प्रशिक्षित भाषा मॉडल](https://www .aclweb.org/anthology/2020.findings-emnlp.92/) डैट क्वोक गुयेन और अन्ह तुआन गुयेन द्वारा पोस्ट किया गया। 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP से) साथ वाला पेपर [प्रोग्राम अंडरस्टैंडिंग एंड जेनरेशन के लिए यूनिफाइड प्री-ट्रेनिंग](https://arxiv .org/abs/2103.06333) वसी उद्दीन अहमद, सैकत चक्रवर्ती, बैशाखी रे, काई-वेई चांग द्वारा। 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [ProphetNet: प्रेडिक्टिंग फ्यूचर एन-ग्राम फॉर सीक्वेंस-टू-सीक्वेंस प्री-ट्रेनिंग ](https://arxiv.org/abs/2001.04063) यू यान, वीज़ेन क्यूई, येयुन गोंग, दयाहेंग लियू, नान डुआन, जिउशेंग चेन, रुओफ़ेई झांग और मिंग झोउ द्वारा पोस्ट किया गया। 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA से) साथ वाला पेपर [डीप लर्निंग इंफ़ेक्शन के लिए इंटीजर क्वांटिज़ेशन: प्रिंसिपल्स एंड एम्पिरिकल इवैल्यूएशन](https:// arxiv.org/abs/2004.09602) हाओ वू, पैट्रिक जुड, जिआओजी झांग, मिखाइल इसेव और पॉलियस माइकेविसियस द्वारा। 1. 
**[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (फेसबुक से) साथ में कागज [रिट्रीवल-ऑगमेंटेड जेनरेशन फॉर नॉलेज-इंटेंसिव एनएलपी टास्क](https://arxiv .org/abs/2005.11401) पैट्रिक लुईस, एथन पेरेज़, अलेक्जेंड्रा पिक्टस, फैबियो पेट्रोनी, व्लादिमीर कारपुखिन, नमन गोयल, हेनरिक कुटलर, माइक लुईस, वेन-ताउ यिह, टिम रॉकटाशेल, सेबस्टियन रिडेल, डौवे कीला द्वारा। diff --git a/README_ja.md b/README_ja.md index b45cc68ea6b2b8..7f289adecdc364 100644 --- a/README_ja.md +++ b/README_ja.md @@ -423,6 +423,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research から) Dat Quoc Nguyen and Anh Tuan Nguyen から公開された研究論文: [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP から) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang から公開された研究論文: [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs から) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng から公開された研究論文: [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research から) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou から公開された研究論文: [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA から) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius から公開された研究論文: [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (Facebook から) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela から公開された研究論文: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) diff --git a/README_ko.md b/README_ko.md index a5c0b8cf1eee75..fe7b68ee65afb8 100644 --- a/README_ko.md +++ b/README_ko.md @@ -338,6 +338,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research 에서) Dat Quoc Nguyen and Anh Tuan Nguyen 의 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 논문과 함께 발표했습니다. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP 에서) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 의 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 논문과 함께 발표했습니다. 1. 
**[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs 에서) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 의 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 논문과 함께 발표했습니다. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research 에서) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 의 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 논문과 함께 발표했습니다. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA 에서) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 의 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 논문과 함께 발표했습니다. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (Facebook 에서) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela 의 [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) 논문과 함께 발표했습니다. diff --git a/README_zh-hans.md b/README_zh-hans.md index 9ae3bc24494f47..9a63bb249090bb 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -362,6 +362,7 @@ conda install -c huggingface transformers 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (来自 UCLA NLP) 伴随论文 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 由 Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 发布。 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (来自 Sea AI Labs) 伴随论文 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 由 Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 发布。 +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。 1. 
**[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (来自 Facebook) 伴随论文 [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) 由 Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela 发布。 diff --git a/README_zh-hant.md b/README_zh-hant.md index 53847bf6739ac4..f7b64aaaf075c5 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -374,6 +374,7 @@ conda install -c huggingface transformers 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 7586985b111221..8769450f8b8bed 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -519,6 +519,8 @@ title: Hubert - local: model_doc/mctct title: MCTCT + - local: model_doc/pop2piano + title: Pop2Piano - local: model_doc/sew title: SEW - local: model_doc/sew-d diff --git a/docs/source/en/index.mdx b/docs/source/en/index.mdx index ea1ab27e7970fa..1c032dffe64c7b 100644 --- a/docs/source/en/index.mdx +++ b/docs/source/en/index.mdx @@ -175,6 +175,7 @@ The documentation is organized into five sections: 1. 
**[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. @@ -366,6 +367,7 @@ Flax), PyTorch, and/or TensorFlow. | Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ | | PLBart | ✅ | ❌ | ✅ | ❌ | ❌ | | PoolFormer | ❌ | ❌ | ✅ | ❌ | ❌ | +| Pop2Piano | ❌ | ❌ | ✅ | ❌ | ❌ | | ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ | | QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ | | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | diff --git a/docs/source/en/model_doc/pop2piano.mdx b/docs/source/en/model_doc/pop2piano.mdx index 75d8280f2bcb68..61722c4475b956 100644 --- a/docs/source/en/model_doc/pop2piano.mdx +++ b/docs/source/en/model_doc/pop2piano.mdx @@ -32,11 +32,13 @@ a piano cover from pop audio without melody and chord extraction modules. We show that Pop2Piano trained with our dataset can generate plausible piano covers.* - Tips: - +1. Pop2Piano is an encoder-decoder model like T5. +2. Pop2Piano can be used to generate midi-audio files for a given audio sequence. This HuggingFace implementation allows you to save the midi output as well as the stereo-mix output of the audio sequence. +3. Choosing different composers in Pop2PianoForConditionalGeneration.generate can lead to a variety of different results. +4. Please note that the HuggingFace implementation of Pop2Piano (both Pop2PianoForConditionalGeneration and Pop2PianoFeatureExtractor) can only work with one raw_audio sequence at a time, so if you want to process multiple files, please feed them one by one. This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/sweetcocoa/pop2piano). @@ -48,7 +50,7 @@ The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
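A minimal end-to-end usage sketch, based on the tips above and on the `Pop2PianoFeatureExtractor.__call__` and `postprocess` signatures added in this PR; the `susnato/pop2piano_dev` checkpoint name is taken from the development files in this diff, while the exact keyword arguments accepted by `generate` are an assumption:

```python
import librosa

from transformers import Pop2PianoFeatureExtractor, Pop2PianoForConditionalGeneration

# The feature extractor and the model handle one raw audio sequence at a time.
raw_audio, sr = librosa.load("pop_song.wav", sr=44100)

feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("susnato/pop2piano_dev")
model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev")

# Rhythm-aligned log-mel features plus the beat information needed later for postprocessing:
# returns "input_features", "beatsteps" and "ext_beatstep".
inputs = feature_extractor(raw_audio=raw_audio, audio_sr=sr, return_tensors="pt")

# Generate relative MIDI tokens; passing `input_features` mirrors the feature extractor's
# `model_input_names` and is an assumption about the final `generate` signature.
relative_tokens = model.generate(input_features=inputs["input_features"])

# Turn the tokens back into a pretty_midi.PrettyMIDI object and write the .mid file to disk.
pm = feature_extractor.postprocess(
    relative_tokens=relative_tokens,
    beatsteps=inputs["beatsteps"],
    ext_beatstep=inputs["ext_beatstep"],
    raw_audio=raw_audio,
    sampling_rate=sr,
    save_path="./outputs",
    audio_file_name="pop_song",
    save_midi=True,
)
```

To steer the style, one of the `composer1`–`composer21` entries from `Pop2PianoConfig.composer_to_feature_token` can be selected, as mentioned in tip 3; how that token is passed to `generate` is not shown in this diff, so it is omitted here.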
## Pop2PianoFeatureExtractor -[[autodoc]] WhisperFeatureExtractor +[[autodoc]] Pop2PianoFeatureExtractor - __call__ ## Pop2PianoForConditionalGeneration diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index 5a6846d3b70e73..243128d98af934 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -405,6 +405,11 @@ "models.phobert": ["PhobertTokenizer"], "models.plbart": ["PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP", "PLBartConfig"], "models.poolformer": ["POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PoolFormerConfig"], + "models.pop2piano": [ + "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Pop2PianoConfig", + "Pop2PianoFeatureExtractor", + ], "models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"], "models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"], "models.rag": ["RagConfig", "RagRetriever", "RagTokenizer"], @@ -524,14 +529,6 @@ "WhisperFeatureExtractor", "WhisperProcessor", "WhisperTokenizer", - ], - "models.pop2piano": [ - "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", - "Pop2PianoConfig", - "Pop2PianoFeatureExtractor", - - - ], "models.x_clip": [ "XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", @@ -2127,6 +2124,13 @@ "PoolFormerPreTrainedModel", ] ) + _import_structure["models.pop2piano"].extend( + [ + "POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST", + "Pop2PianoForConditionalGeneration", + "Pop2PianoPreTrainedModel", + ] + ) _import_structure["models.prophetnet"].extend( [ "PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2613,13 +2617,6 @@ "WhisperPreTrainedModel", ] ) - _import_structure["models.pop2piano"].extend( - [ - "POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST", - "Pop2PianoForConditionalGeneration", - "Pop2PianoPreTrainedModel", - ] - ) _import_structure["models.x_clip"].extend( [ "XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -4031,6 +4028,11 @@ from .models.phobert import PhobertTokenizer from .models.plbart import PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP, PLBartConfig from .models.poolformer import POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, PoolFormerConfig + from .models.pop2piano import ( + POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP, + Pop2PianoConfig, + Pop2PianoFeatureExtractor, + ) from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig from .models.rag import RagConfig, RagRetriever, RagTokenizer @@ -4133,14 +4135,6 @@ WhisperFeatureExtractor, WhisperProcessor, WhisperTokenizer, - ) - from .models.pop2piano import ( - POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP, - Pop2PianoConfig, - Pop2PianoFeatureExtractor, - - - ) from .models.x_clip import ( XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -5472,6 +5466,11 @@ PoolFormerModel, PoolFormerPreTrainedModel, ) + from .models.pop2piano import ( + POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST, + Pop2PianoForConditionalGeneration, + Pop2PianoPreTrainedModel, + ) from .models.prophetnet import ( PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST, ProphetNetDecoder, @@ -5859,11 +5858,6 @@ WhisperModel, WhisperPreTrainedModel, ) - from .models.pop2piano import ( - POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST, - Pop2PianoForConditionalGeneration, - Pop2PianoPreTrainedModel, - ) from .models.x_clip import ( XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST, XCLIPModel, diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py index 94ba97db92ed86..5e36c41ae733d4 100644 --- a/src/transformers/models/__init__.py +++ 
b/src/transformers/models/__init__.py @@ -142,6 +142,7 @@ phobert, plbart, poolformer, + pop2piano, prophetnet, qdqbert, rag, @@ -197,7 +198,6 @@ wav2vec2_with_lm, wavlm, whisper, - pop2piano, x_clip, xglm, xlm, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index c5bf3d2b8a59c5..3ac19982468be4 100755 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -143,6 +143,7 @@ ("perceiver", "PerceiverConfig"), ("plbart", "PLBartConfig"), ("poolformer", "PoolFormerConfig"), + ("pop2piano", "Pop2PianoConfig"), ("prophetnet", "ProphetNetConfig"), ("qdqbert", "QDQBertConfig"), ("rag", "RagConfig"), @@ -195,7 +196,6 @@ ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"), ("wavlm", "WavLMConfig"), ("whisper", "WhisperConfig"), - ("pop2piano", "Pop2PianoConfig"), ("xclip", "XCLIPConfig"), ("xglm", "XGLMConfig"), ("xlm", "XLMConfig"), @@ -318,6 +318,7 @@ ("perceiver", "PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("plbart", "PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("poolformer", "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("pop2piano", "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("prophetnet", "PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("qdqbert", "QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -361,7 +362,6 @@ ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("whisper", "WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP"), - ("pop2piano", "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xclip", "XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xglm", "XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xlm", "XLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -509,6 +509,7 @@ ("phobert", "PhoBERT"), ("plbart", "PLBart"), ("poolformer", "PoolFormer"), + ("pop2piano", "Pop2Piano"), ("prophetnet", "ProphetNet"), ("qdqbert", "QDQBert"), ("rag", "RAG"), @@ -565,7 +566,6 @@ ("wav2vec2_phoneme", "Wav2Vec2Phoneme"), ("wavlm", "WavLM"), ("whisper", "Whisper"), - ("pop2piano", "Pop2Piano"), ("xclip", "X-CLIP"), ("xglm", "XGLM"), ("xlm", "XLM"), diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py index f8522e3d307b71..542103565698b4 100644 --- a/src/transformers/models/auto/feature_extraction_auto.py +++ b/src/transformers/models/auto/feature_extraction_auto.py @@ -71,6 +71,7 @@ ("owlvit", "OwlViTFeatureExtractor"), ("perceiver", "PerceiverFeatureExtractor"), ("poolformer", "PoolFormerFeatureExtractor"), + ("pop2piano", "Pop2PianoFeatureExtractor"), ("regnet", "ConvNextFeatureExtractor"), ("resnet", "ConvNextFeatureExtractor"), ("segformer", "SegformerFeatureExtractor"), @@ -95,7 +96,6 @@ ("wav2vec2-conformer", "Wav2Vec2FeatureExtractor"), ("wavlm", "Wav2Vec2FeatureExtractor"), ("whisper", "WhisperFeatureExtractor"), - ("pop2piano", "Pop2PianoFeatureExtractor"), ("xclip", "CLIPFeatureExtractor"), ("yolos", "YolosFeatureExtractor"), ] diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index 12d8d1dc775f30..4d717b7fd2196e 100755 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -311,6 +311,7 @@ ("openai-gpt", "OpenAIGPTLMHeadModel"), ("pegasus_x", "PegasusXForConditionalGeneration"), ("plbart", "PLBartForConditionalGeneration"), + ("pop2piano", "Pop2PianoForConditionalGeneration"), ("qdqbert", 
"QDQBertForMaskedLM"), ("reformer", "ReformerModelWithLMHead"), ("rembert", "RemBertForMaskedLM"), @@ -326,7 +327,6 @@ ("transfo-xl", "TransfoXLLMHeadModel"), ("wav2vec2", "Wav2Vec2ForMaskedLM"), ("whisper", "WhisperForConditionalGeneration"), - ("pop2piano", "Pop2PianoForConditionalGeneration"), ("xlm", "XLMWithLMHeadModel"), ("xlm-roberta", "XLMRobertaForMaskedLM"), ("xlm-roberta-xl", "XLMRobertaXLForMaskedLM"), @@ -612,11 +612,11 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict( [ + ("pop2piano", "Pop2PianoForConditionalGeneration"), ("speech-encoder-decoder", "SpeechEncoderDecoderModel"), ("speech_to_text", "Speech2TextForConditionalGeneration"), ("speecht5", "SpeechT5ForSpeechToText"), ("whisper", "WhisperForConditionalGeneration"), - ("pop2piano", "Pop2PianoForConditionalGeneration"), ] ) diff --git a/src/transformers/models/pop2piano/__init__.py b/src/transformers/models/pop2piano/__init__.py index 0f63bebca64211..f16c76a2dc8065 100644 --- a/src/transformers/models/pop2piano/__init__.py +++ b/src/transformers/models/pop2piano/__init__.py @@ -16,14 +16,15 @@ from ...utils import ( OptionalDependencyNotAvailable, _LazyModule, - is_torch_available, + is_essentia_available, is_librosa_available, + is_pretty_midi_available, is_scipy_available, is_soundfile_availble, - is_essentia_available, - is_pretty_midi_available, + is_torch_available, ) + # Config _import_structure = { "configuration_pop2piano": ["POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", "Pop2PianoConfig"], @@ -44,9 +45,13 @@ # Feature Extractor try: - if not (is_librosa_available() and is_essentia_available() and - is_scipy_available() and is_pretty_midi_available() and - is_soundfile_availble()): + if not ( + is_librosa_available() + and is_essentia_available() + and is_scipy_available() + and is_pretty_midi_available() + and is_soundfile_availble() + ): raise OptionalDependencyNotAvailable() except OptionalDependencyNotAvailable: pass @@ -73,9 +78,13 @@ # Feature Extractor try: - if not (is_librosa_available() and is_essentia_available() and - is_scipy_available() and is_pretty_midi_available() and - is_soundfile_availble()): + if not ( + is_librosa_available() + and is_essentia_available() + and is_scipy_available() + and is_pretty_midi_available() + and is_soundfile_availble() + ): raise OptionalDependencyNotAvailable() except OptionalDependencyNotAvailable: pass @@ -84,4 +93,5 @@ else: import sys + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/pop2piano/configuration_pop2piano.py b/src/transformers/models/pop2piano/configuration_pop2piano.py index 9abf7962c4157d..5307d9b0251ae6 100644 --- a/src/transformers/models/pop2piano/configuration_pop2piano.py +++ b/src/transformers/models/pop2piano/configuration_pop2piano.py @@ -12,57 +12,58 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" Pop2Piano model configuration """ +""" Pop2Piano model configuration""" -from collections import OrderedDict -from typing import TYPE_CHECKING, Any, Mapping, Optional, Union from ...configuration_utils import PretrainedConfig from ...utils import logging + logger = logging.get_logger(__name__) POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP = { - "susnato/pop2piano_dev": "https://huggingface.co/susnato/pop2piano_dev/blob/main/config.json" # For now + "susnato/pop2piano_dev": "https://huggingface.co/susnato/pop2piano_dev/blob/main/config.json" # For now } -COMPOSER_TO_FEATURE_TOKEN = {'composer1': 2052, - 'composer2': 2053, - 'composer3': 2054, - 'composer4': 2055, - 'composer5': 2056, - 'composer6': 2057, - 'composer7': 2058, - 'composer8': 2059, - 'composer9': 2060, - 'composer10': 2061, - 'composer11': 2062, - 'composer12': 2063, - 'composer13': 2064, - 'composer14': 2065, - 'composer15': 2066, - 'composer16': 2067, - 'composer17': 2068, - 'composer18': 2069, - 'composer19': 2070, - 'composer20': 2071, - 'composer21': 2072 +COMPOSER_TO_FEATURE_TOKEN = { + "composer1": 2052, + "composer2": 2053, + "composer3": 2054, + "composer4": 2055, + "composer5": 2056, + "composer6": 2057, + "composer7": 2058, + "composer8": 2059, + "composer9": 2060, + "composer10": 2061, + "composer11": 2062, + "composer12": 2063, + "composer13": 2064, + "composer14": 2065, + "composer15": 2066, + "composer16": 2067, + "composer17": 2068, + "composer18": 2069, + "composer19": 2070, + "composer20": 2071, + "composer21": 2072, } + class Pop2PianoConfig(PretrainedConfig): r""" - This is the configuration class to store the configuration of a [`Pop2PianoForConditionalGeneration`]. It is used to instantiate a - Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model architecture. Instantiating a configuration - with the defaults will yield a similar configuration to that of the Pop2Piano - [sweetcocoa/pop2piano](https://huggingface.co/sweetcocoa/pop2piano) architecture. + This is the configuration class to store the configuration of a [`Pop2PianoForConditionalGeneration`]. It is used + to instantiate a Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model + architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the + Pop2Piano [sweetcocoa/pop2piano](https://huggingface.co/sweetcocoa/pop2piano) architecture. Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from [`PretrainedConfig`] for more information. Arguments: vocab_size (`int`, *optional*, defaults to 2400): - Vocabulary size of the Pop2PianoForConditionalGeneration model. Defines the number of different tokens that can be represented by the - `inputs_ids` passed when calling [`Pop2PianoForConditionalGeneration`]. + Vocabulary size of the Pop2PianoForConditionalGeneration model. Defines the number of different tokens that + can be represented by the `inputs_ids` passed when calling [`Pop2PianoForConditionalGeneration`]. d_model (`int`, *optional*, defaults to 512): Size of the encoder layers and the pooler layer. 
d_kv (`int`, *optional*, defaults to 64): @@ -139,7 +140,6 @@ def __init__( dataset_n_bars=2, dataset_sampling_rate=22050, dataset_mel_is_conditioned=True, - n_fft=4096, hop_length=1024, f_min=10.0, @@ -165,10 +165,12 @@ def __init__( self.dense_act_fn = dense_act_fn self.is_gated_act = act_info[0] == "gated" self.composer_to_feature_token = COMPOSER_TO_FEATURE_TOKEN - self.dataset = {'target_length': dataset_target_length, - 'n_bars': dataset_n_bars, - 'sampling_rate': dataset_sampling_rate, - 'mel_is_conditioned': dataset_mel_is_conditioned} + + self.dataset_mel_is_conditioned = dataset_mel_is_conditioned + self.dataset_target_length = dataset_target_length + self.dataset_n_bars = dataset_n_bars + self.dataset_sampling_rate = dataset_sampling_rate + self.n_fft = n_fft self.hop_length = hop_length self.f_min = f_min diff --git a/src/transformers/models/pop2piano/feature_extraction_pop2piano.py b/src/transformers/models/pop2piano/feature_extraction_pop2piano.py index 7aa280570304c6..3f267310ac0650 100644 --- a/src/transformers/models/pop2piano/feature_extraction_pop2piano.py +++ b/src/transformers/models/pop2piano/feature_extraction_pop2piano.py @@ -12,30 +12,27 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" Feature extractor class for Pop2Piano """ - -import copy -from typing import Any, Dict, List, Optional, Union +""" Feature extractor class for Pop2Piano""" import os +import warnings +from typing import List, Optional, Union -import tensorflow -import torch -import scipy -import librosa -import pathlib import essentia -import warnings -import pretty_midi +import essentia.standard +import librosa import numpy as np +import pretty_midi +import scipy import soundfile as sf -import essentia.standard +import tensorflow +import torch from torch.nn.utils.rnn import pad_sequence -from .configuration_pop2piano import Pop2PianoConfig -from ...utils import TensorType, logging -from ...feature_extraction_utils import BatchFeature from ...feature_extraction_sequence_utils import SequenceFeatureExtractor +from ...feature_extraction_utils import BatchFeature +from ...utils import TensorType, logging + logger = logging.get_logger(__name__) @@ -50,6 +47,7 @@ EOS: int = 1 PAD: int = 0 + class Pop2PianoFeatureExtractor(SequenceFeatureExtractor): r""" Constructs a Pop2Piano feature extractor. @@ -57,45 +55,43 @@ class Pop2PianoFeatureExtractor(SequenceFeatureExtractor): This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. - This class loads audio, extracts rhythm and does preprocesses before being passed through `LogMelSpectrogram`. - This class also contains postprocessing methods to convert model outputs to midi audio and stereo-mix. Args: + This class loads audio, extracts rhythm and does preprocesses before being passed through `LogMelSpectrogram`. This: + class also contains postprocessing methods to convert model outputs to midi audio and stereo-mix. n_bars (`int`, *optional*, defaults to 2): Determines `n_steps` in method `preprocess_mel`. sampling_rate (`int`, *optional*, defaults to 22050): Sample rate of audio signal. use_mel (`bool`, *optional*, defaults to `True`): - Whether to preprocess for `LogMelSpectrogram` or not. - For the current implementation this must be `True`. 
+ Whether to preprocess for `LogMelSpectrogram` or not. For the current implementation this must be `True`. padding_value (`int`, *optional*, defaults to 0): Padding value used to pad the audio. Should correspond to silences. vocab_size_special (`int`, *optional*, defaults to 4): Number of special values. vocab_size_note (`int`, *optional*, defaults to 128): - This represents the number of Note Values. - Note values indicate a pitch event for one of the MIDI pitches. But only the 88 pitches corresponding to - piano keys are actually used. + This represents the number of Note Values. Note values indicate a pitch event for one of the MIDI pitches. + But only the 88 pitches corresponding to piano keys are actually used. vocab_size_velocity (`int`, *optional*, defaults to 2): Number of Velocity tokens. vocab_size_time (`int`, *optional*, defaults to 100): - This represents the number of Beat Shifts. - Beat Shift [100 values] Indicates the relative time shift within the segment quantized into 8th-note - beats(half-beats). + This represents the number of Beat Shifts. Beat Shift [100 values] Indicates the relative time shift within + the segment quantized into 8th-note beats(half-beats). """ model_input_names = ["input_features"] - def __init__(self, - n_bars:int = 2, - sampling_rate:int = 22050, - use_mel:int = True, - padding_value:int = 0, - vocab_size_special:int = 4, - vocab_size_note:int = 128, - vocab_size_velocity:int = 2, - vocab_size_time:int = 100, - feature_size=None, - **kwargs - ): + def __init__( + self, + n_bars: int = 2, + sampling_rate: int = 22050, + use_mel: int = True, + padding_value: int = 0, + vocab_size_special: int = 4, + vocab_size_note: int = 128, + vocab_size_velocity: int = 2, + vocab_size_time: int = 100, + feature_size=None, + **kwargs, + ): super().__init__( feature_size=feature_size, sampling_rate=sampling_rate, @@ -114,8 +110,8 @@ def __init__(self, def extract_rhythm(self, raw_audio): """ This algorithm(`RhythmExtractor2013`) extracts the beat positions and estimates their confidence as well as - tempo in bpm for an audio signal. - For more information please visit https://essentia.upf.edu/reference/std_RhythmExtractor2013.html . + tempo in bpm for an audio signal. For more information please visit + https://essentia.upf.edu/reference/std_RhythmExtractor2013.html . 
""" essentia_tracker = essentia.standard.RhythmExtractor2013(method="multifeature") bpm, beat_times, confidence, estimates, essentia_beat_intervals = essentia_tracker(raw_audio) @@ -130,9 +126,7 @@ def interpolate_beat_times(self, beat_times, steps_per_beat, extend=False): fill_value="extrapolate", ) if extend: - beat_steps_8th = beat_times_function( - np.linspace(0, beat_times.size, beat_times.size * steps_per_beat + 1) - ) + beat_steps_8th = beat_times_function(np.linspace(0, beat_times.size, beat_times.size * steps_per_beat + 1)) else: beat_steps_8th = beat_times_function( np.linspace(0, beat_times.size - 1, beat_times.size * steps_per_beat - 1) @@ -146,16 +140,18 @@ def extrapolate_beat_times(self, beat_times, n_extend=1): bounds_error=False, fill_value="extrapolate", ) - ext_beats = beat_times_function( - np.linspace(0, beat_times.size + n_extend - 1, beat_times.size + n_extend) - ) + ext_beats = beat_times_function(np.linspace(0, beat_times.size + n_extend - 1, beat_times.size + n_extend)) return ext_beats def preprocess_mel( - self, audio, beatstep, n_bars, padding_value, - ): - """ Preprocessing for `LogMelSpectrogram` """ + self, + audio, + beatstep, + n_bars, + padding_value, + ): + """Preprocessing for `LogMelSpectrogram`""" n_steps = n_bars * 4 n_target_step = len(beatstep) @@ -163,9 +159,8 @@ def preprocess_mel( def split_audio(audio): """ - Split audio corresponding beat intervals. - Each audio's lengths are different. - Because each corresponding beat interval times are different. + Split audio corresponding beat intervals. Each audio's lengths are different. Because each corresponding + beat interval times are different. """ batch = [] @@ -186,13 +181,13 @@ def split_audio(audio): return batch, ext_beatstep def single_preprocess( - self, - beatstep, - feature_tokens=None, - audio=None, - n_bars=None, + self, + beatstep, + feature_tokens=None, + audio=None, + n_bars=None, ): - """ Preprocessing method for a single sequence. """ + """Preprocessing method for a single sequence.""" if feature_tokens is None and audio is None: raise ValueError("Both `feature_tokens` and `audio` can't be None at the same time!") @@ -209,27 +204,29 @@ def single_preprocess( beatstep = beatstep - beatstep[0] if self.use_mel: - batch, ext_beatstep = self.preprocess_mel(audio, - beatstep, - n_bars=n_bars, - padding_value=self.padding_value, - ) + batch, ext_beatstep = self.preprocess_mel( + audio, + beatstep, + n_bars=n_bars, + padding_value=self.padding_value, + ) else: raise NotImplementedError("use_mel must be True") return batch, ext_beatstep - def __call__(self, - raw_audio:Union[np.ndarray, List[float], List[np.ndarray]], - audio_sr:int, - steps_per_beat:int=2, - return_tensors:Optional[Union[str, TensorType]]="pt", - **kwargs - ) -> BatchFeature: + def __call__( + self, + raw_audio: Union[np.ndarray, List[float], List[np.ndarray]], + audio_sr: int, + steps_per_beat: int = 2, + return_tensors: Optional[Union[str, TensorType]] = "pt", + **kwargs, + ) -> BatchFeature: """ - Main method to featurize and prepare for the model one sequence. - Please note that `Pop2PianoFeatureExtractor` only accepts one raw_audio at a time. Args: + Main method to featurize and prepare for the model one sequence. Please note that `Pop2PianoFeatureExtractor` + only accepts one raw_audio at a time. raw_audio (`np.ndarray`, `List`): Denotes the raw_audio. audio_sr (`int`): @@ -242,7 +239,9 @@ def __call__(self, - `'pt'`: Return PyTorch `torch.Tensor` objects. - `'np'`: Return Numpy `np.ndarray` objects. 
""" - warnings.warn("Pop2PianoFeatureExtractor only takes one raw_audio at a time, if you want to extract features from more than a single audio then you might need to call it multiple times.") + warnings.warn( + "Pop2PianoFeatureExtractor only takes one raw_audio at a time, if you want to extract features from more than a single audio then you might need to call it multiple times." + ) # If it's [np.ndarray] if isinstance(raw_audio, list) and isinstance(raw_audio[0], np.ndarray): @@ -254,7 +253,7 @@ def __call__(self, if self.sampling_rate != audio_sr and self.sampling_rate is not None: # Change `raw_audio_sr` to `self.sampling_rate` raw_audio = librosa.core.resample( - raw_audio, orig_sr=audio_sr, target_sr=self.sampling_rate, res_type='kaiser_best' + raw_audio, orig_sr=audio_sr, target_sr=self.sampling_rate, res_type="kaiser_best" ) audio_sr = self.sampling_rate start_sample = int(beatsteps[0] * audio_sr) @@ -270,10 +269,13 @@ def __call__(self, ) batch = batch.cpu().numpy() - output = BatchFeature({"input_features": batch, - "beatsteps": beatsteps, - "ext_beatstep": ext_beatstep, - }) + output = BatchFeature( + { + "input_features": batch, + "beatsteps": beatsteps, + "ext_beatstep": ext_beatstep, + } + ) if return_tensors is not None: output = output.convert_to_tensors(return_tensors) @@ -281,15 +283,17 @@ def __call__(self, return output def decode(self, token, time_idx_offset): - """ Decodes the tokens generated by the transformer """ + """Decodes the tokens generated by the transformer""" if token >= (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity): - type, value = TOKEN_TIME, ((token - (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity)) + time_idx_offset) + type, value = TOKEN_TIME, ( + (token - (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity)) + time_idx_offset + ) elif token >= (self.vocab_size_special + self.vocab_size_note): - type, value = TOKEN_VELOCITY, (token - (self.vocab_size_special + self.vocab_size_note)) + type, value = TOKEN_VELOCITY, (token - (self.vocab_size_special + self.vocab_size_note)) value = int(value) elif token >= self.vocab_size_special: - type, value = TOKEN_NOTE, (token - self.vocab_size_special) + type, value = TOKEN_NOTE, (token - self.vocab_size_special) value = int(value) else: type, value = TOKEN_SPECIAL, token @@ -298,14 +302,14 @@ def decode(self, token, time_idx_offset): return [type, value] def relative_batch_tokens_to_midi( - self, - tokens, - beatstep, - beat_offset_idx=None, - bars_per_batch=None, - cutoff_time_idx=None, + self, + tokens, + beatstep, + beat_offset_idx=None, + bars_per_batch=None, + cutoff_time_idx=None, ): - """ Converts tokens to midi """ + """Converts tokens to midi""" beat_offset_idx = 0 if beat_offset_idx is None else beat_offset_idx notes = None @@ -336,13 +340,15 @@ def relative_batch_tokens_to_midi( def relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): # decoding If the first token is an arranger - if tokens[0] >= (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity + self.vocab_size_time): + if tokens[0] >= ( + self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity + self.vocab_size_time + ): tokens = tokens[1:] words = [self.decode(token, time_idx_offset=0) for token in tokens] if hasattr(start_idx, "item"): - """ if numpy or torch tensor """ + """if numpy or torch tensor""" start_idx = start_idx.item() current_idx = start_idx @@ -374,9 +380,7 @@ def 
relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): pass else: offset_idx = current_idx - notes.append( - [onset_idx, offset_idx, pitch, DEFAULT_VELOCITY] - ) + notes.append([onset_idx, offset_idx, pitch, DEFAULT_VELOCITY]) note_onsets_ready[pitch] = None else: # note_on @@ -390,9 +394,7 @@ def relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): pass else: offset_idx = current_idx - notes.append( - [onset_idx, offset_idx, pitch, DEFAULT_VELOCITY] - ) + notes.append([onset_idx, offset_idx, pitch, DEFAULT_VELOCITY]) note_onsets_ready[pitch] = current_idx else: raise ValueError @@ -417,7 +419,7 @@ def relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): return notes def notes_to_midi(self, notes, beatstep, offset_sec=None): - """ Converts notes to midi """ + """Converts notes to midi""" new_pm = pretty_midi.PrettyMIDI(resolution=384, initial_tempo=120.0) new_inst = pretty_midi.Instrument(program=0) @@ -439,7 +441,7 @@ def notes_to_midi(self, notes, beatstep, offset_sec=None): return new_pm def get_stereo(self, pop_y, midi_y, pop_scale=0.99): - """ Generates stereo audio using `pop audio(`pop_y`)` and `generated midi audio(`midi_y`)` """ + """Generates stereo audio using `pop audio(`pop_y`)` and `generated midi audio(`midi_y`)`""" if len(pop_y) > len(midi_y): midi_y = np.pad(midi_y, (0, len(pop_y) - len(midi_y))) @@ -449,7 +451,7 @@ def get_stereo(self, pop_y, midi_y, pop_scale=0.99): return stereo def _to_np(self, tensor): - """ Converts tensorflow or pytorch tensor to np.ndarray. """ + """Converts tensorflow or pytorch tensor to np.ndarray.""" if isinstance(tensor, np.ndarray): return tensor elif isinstance(tensor, torch.Tensor): @@ -457,24 +459,25 @@ def _to_np(self, tensor): elif isinstance(tensor, tensorflow.Tensor): return tensor.numpy() - def postprocess(self, - relative_tokens:Union[TensorType], - beatsteps:Union[TensorType], - ext_beatstep:Union[TensorType], - raw_audio:Union[np.ndarray, List[float], List[np.ndarray]], - sampling_rate:int, - mix_sampling_rate=None, - save_path:str=None, - audio_file_name:str=None, - save_midi:bool=False, - save_mix:bool=False, - click_amp:float=0.2, - stereo_amp:float=0.5, - add_click:bool=False, - ): + def postprocess( + self, + relative_tokens: Union[TensorType], + beatsteps: Union[TensorType], + ext_beatstep: Union[TensorType], + raw_audio: Union[np.ndarray, List[float], List[np.ndarray]], + sampling_rate: int, + mix_sampling_rate=None, + save_path: str = None, + audio_file_name: str = None, + save_midi: bool = False, + save_mix: bool = False, + click_amp: float = 0.2, + stereo_amp: float = 0.5, + add_click: bool = False, + ): r""" - Postprocess step. It also saves the `"generated midi audio"`, `"stereo-mix"` Args: + Postprocess step. It also saves the `"generated midi audio"`, `"stereo-mix"` relative_tokens ([`~utils.TensorType`]): Output of `Pop2PianoConditionalGeneration` model. beatsteps ([`~utils.TensorType`]): @@ -512,8 +515,10 @@ def postprocess(self, raise ValueError("If you want to save any mix or midi file then you must define save_path.") if save_path and (not save_midi and not save_mix): - raise ValueError("You are setting save_path but not saving anything, use save_midi=True to " - "save the midi file and use save_mix to save the mix file or do both!") + raise ValueError( + "You are setting save_path but not saving anything, use save_midi=True to " + "save the midi file and use save_mix to save the mix file or do both!" 
+ ) mix_sampling_rate = sampling_rate if mix_sampling_rate is None else mix_sampling_rate @@ -524,11 +529,12 @@ def postprocess(self, else: raise ValueError(f"Is {save_path} a directory?") - pm, notes = self.relative_batch_tokens_to_midi(tokens=relative_tokens, - beatstep=ext_beatstep, - bars_per_batch=self.n_bars, - cutoff_time_idx=(self.n_bars + 1) * 4, - ) + pm, notes = self.relative_batch_tokens_to_midi( + tokens=relative_tokens, + beatstep=ext_beatstep, + bars_per_batch=self.n_bars, + cutoff_time_idx=(self.n_bars + 1) * 4, + ) for n in pm.instruments[0].notes: n.start += beatsteps[0] n.end += beatsteps[0] @@ -555,4 +561,4 @@ def postprocess(self, ) print(f"stereo-mix file saved at {mix_path}!") - return pm \ No newline at end of file + return pm diff --git a/src/transformers/models/pop2piano/modeling_pop2piano.py b/src/transformers/models/pop2piano/modeling_pop2piano.py index 6d578c5144c92a..8488fe06a7becf 100644 --- a/src/transformers/models/pop2piano/modeling_pop2piano.py +++ b/src/transformers/models/pop2piano/modeling_pop2piano.py @@ -17,32 +17,32 @@ import copy import math -import random import warnings -import torchaudio -from typing import Optional, Tuple, Union, List +from typing import Optional, Tuple, Union import numpy as np import torch -from ...feature_extraction_utils import BatchFeature -import torch.utils.checkpoint +import torchaudio from torch import nn from torch.nn import CrossEntropyLoss -from ...generation.utils import GreedySearchEncoderDecoderOutput - -from ...pytorch_utils import ALL_LAYERNORM_LAYERS, find_pruneable_heads_and_indices, prune_linear_layer +from torch.utils.checkpoint import checkpoint from ...activations import ACT2FN - +from ...feature_extraction_utils import BatchFeature from ...modeling_outputs import ( BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions, Seq2SeqLMOutput, - Seq2SeqModelOutput, - BackboneOutput, ) from ...modeling_utils import PreTrainedModel -from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings, is_torch_fx_proxy +from ...pytorch_utils import ALL_LAYERNORM_LAYERS, find_pruneable_heads_and_indices, prune_linear_layer +from ...utils import ( + add_start_docstrings, + add_start_docstrings_to_model_forward, + is_torch_fx_proxy, + logging, + replace_return_docstrings, +) from .configuration_pop2piano import Pop2PianoConfig @@ -52,7 +52,7 @@ _CHECKPOINT_FOR_DOC = "susnato/pop2piano_dev" POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST = [ - "susnato/pop2piano_dev", # For now + "susnato/pop2piano_dev", # For now # See all Pop2Piano models at https://huggingface.co/models?filter=pop2piano ] @@ -60,26 +60,22 @@ Pop2Piano_INPUTS_DOCSTRING = r""" Args: input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): - Indices of input sequence tokens in the vocabulary. T5 is a model with relative position embeddings so you - should be able to pad the inputs on both the right and the left. - Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and - [`PreTrainedTokenizer.__call__`] for detail. - [What are input IDs?](../glossary#input-ids) - To know more on how to prepare `input_ids` for pretraining take a look a [T5 Training](./t5#training). + Indices of input sequence tokens in the vocabulary. Pop2Piano is a model with relative position embeddings + so you should be able to pad the inputs on both the right and the left. Indices can be obtained using + [`AutoTokenizer`]. 
See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for detail. + [What are input IDs?](../glossary#input-ids) To know more on how to prepare `input_ids` for pretraining + take a look a [Pop2Pianp Training](./Pop2Piano#training). attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*): Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. [What are attention masks?](../glossary#attention-mask) decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): - Indices of decoder input sequence tokens in the vocabulary. - Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and - [`PreTrainedTokenizer.__call__`] for details. - [What are decoder input IDs?](../glossary#decoder-input-ids) - T5 uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If `past_key_values` - is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`). - To know more on how to prepare `decoder_input_ids` for pretraining take a look at [T5 - Training](./t5#training). + Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using + [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. + [What are decoder input IDs?](../glossary#decoder-input-ids) Pop2Piano uses the `pad_token_id` as the + starting token for `decoder_input_ids` generation. If `past_key_values` is used, optionally only the last + `decoder_input_ids` have to be input (see `past_key_values`). To know more on how to prepare decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also be used by default. @@ -115,9 +111,9 @@ Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `past_key_values`). This is useful if you want more control over how to convert - `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix. - If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value - of `inputs_embeds`. + `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix. If + `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value of + `inputs_embeds`. use_cache (`bool`, *optional*): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). @@ -131,6 +127,7 @@ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
""" + class Pop2PianoPreTrainedModel(PreTrainedModel): """ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained @@ -143,7 +140,7 @@ class Pop2PianoPreTrainedModel(PreTrainedModel): supports_gradient_checkpointing = False _no_split_modules = None _keep_in_fp32_modules = ["wo"] - + def _init_weights(self, module): """Initialize the weights""" factor = self.config.initializer_factor # Used for testing weights initialization @@ -198,10 +195,9 @@ def _shift_right(self, input_ids): decoder_start_token_id = self.config.decoder_start_token_id pad_token_id = self.config.pad_token_id - assert decoder_start_token_id is not None, ( - "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id." - " See T5 docs for more information" - ) + assert ( + decoder_start_token_id is not None + ), "self.model.config.decoder_start_token_id has to be defined. In Pop2Piano it is usually set to the pad_token_id." # shift inputs to the right if is_torch_fx_proxy(input_ids): @@ -219,8 +215,9 @@ def _shift_right(self, input_ids): return shifted_input_ids + class LogMelSpectrogram(nn.Module): - """ Generates MelSpectrogram then applies log base e. """ + """Generates MelSpectrogram then applies log base e.""" def __init__(self, sampling_rate, n_fft, hop_length, f_min, n_mels): super(LogMelSpectrogram, self).__init__() @@ -240,8 +237,9 @@ def forward(self, x): return X + class ConcatEmbeddingToMel(nn.Module): - """ Embedding Matrix for `composer` tokens. """ + """Embedding Matrix for `composer` tokens.""" def __init__(self, embedding_offset, n_vocab, n_dim) -> None: super(ConcatEmbeddingToMel, self).__init__() @@ -254,7 +252,8 @@ def forward(self, feature, index_value): inputs_embeds = torch.cat([composer_embedding, feature], dim=1) return inputs_embeds -# Copied from transformers.models.t5.T5LayerNorm with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5LayerNorm with T5->Pop2Piano class Pop2PianoLayerNorm(nn.Module): def __init__(self, hidden_size, eps=1e-6): """ @@ -296,7 +295,7 @@ def forward(self, hidden_states): ALL_LAYERNORM_LAYERS.append(Pop2PianoLayerNorm) -# Copied from transformers.models.t5.T5LayerSelfAttention with T5->Pop2Piano,t5->pop2piano +# Copied from transformers.models.t5.modeling_t5.T5LayerSelfAttention with T5->Pop2Piano,t5->pop2piano class Pop2PianoLayerSelfAttention(nn.Module): def __init__(self, config, has_relative_attention_bias=False): super().__init__() @@ -328,7 +327,8 @@ def forward( outputs = (hidden_states,) + attention_output[1:] # add attentions if we output them return outputs -# Copied from transformers.models.t5.T5LayerCrossAttention with T5->Pop2Piano,t5->pop2piano + +# Adapted from transformers.models.t5.modeling_t5.T5Attention with T5->Pop2Piano,t5->pop2piano class Pop2PianoAttention(nn.Module): def __init__(self, config: Pop2PianoConfig, has_relative_attention_bias=False): super().__init__() @@ -372,19 +372,18 @@ def prune_heads(self, heads): @staticmethod def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128): """ + Args: Adapted from Mesh Tensorflow: - https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593 + https: + //github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593 Translate relative position to a bucket number for 
relative attention. The relative position is defined as memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for small absolute relative_position and larger buckets for larger absolute relative_positions. All relative - positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket. + positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.: This should allow for more graceful generalization to longer sequences than the model has been trained on - Args: - relative_position: an int32 Tensor - bidirectional: a boolean - whether the attention is bidirectional - num_buckets: an integer - max_distance: an integer + relative_position: an int32 Tensor bidirectional: a boolean - whether the attention is bidirectional + num_buckets: an integer max_distance: an integer Returns: a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets) """ @@ -559,11 +558,12 @@ def project(hidden_states, proj_layer, key_value_states, past_key_value): outputs = outputs + (attn_weights,) return outputs -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Adapted from transformers.models.t5.modeling_t5.T5LayerFF with T5->Pop2Piano,t5->pop2piano class Pop2PianoLayerFF(nn.Module): def __init__(self, config: Pop2PianoConfig): super().__init__() - if config.is_gated_act: + if config.is_gated_act or config.feed_forward_proj.split("-")[0] == "gated": self.DenseReluDense = Pop2PianoDenseGatedActDense(config) else: self.DenseReluDense = Pop2PianoDenseActDense(config) @@ -577,7 +577,8 @@ def forward(self, hidden_states): hidden_states = hidden_states + self.dropout(forwarded_states) return hidden_states -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5DenseActDense with T5->Pop2Piano,t5->pop2piano class Pop2PianoDenseActDense(nn.Module): def __init__(self, config: Pop2PianoConfig): super().__init__() @@ -590,12 +591,17 @@ def forward(self, hidden_states): hidden_states = self.wi(hidden_states) hidden_states = self.act(hidden_states) hidden_states = self.dropout(hidden_states) - if hidden_states.dtype != self.wo.weight.dtype and self.wo.weight.dtype != torch.int8: + if ( + isinstance(self.wo.weight, torch.Tensor) + and hidden_states.dtype != self.wo.weight.dtype + and self.wo.weight.dtype != torch.int8 + ): hidden_states = hidden_states.to(self.wo.weight.dtype) hidden_states = self.wo(hidden_states) return hidden_states -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5DenseGatedActDense with T5->Pop2Piano class Pop2PianoDenseGatedActDense(nn.Module): def __init__(self, config: Pop2PianoConfig): super().__init__() @@ -614,13 +620,18 @@ def forward(self, hidden_states): # To make 8bit quantization work for google/flan-t5-xxl, self.wo is kept in float32. 
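The `isinstance`/dtype guard added above is easier to see in isolation. The following sketch (made-up layer sizes, not the library code) shows the intent: cast fp16 activations up to the dtype of a final projection that is kept in fp32, but never touch a weight that is not a plain floating-point tensor (e.g. an int8 quantized one).

```python
import torch
import torch.nn as nn

wo = nn.Linear(8, 4)  # final projection; its weights are fp32 by default
hidden_states = torch.randn(2, 8, dtype=torch.float16)  # activations may arrive in fp16

if (
    isinstance(wo.weight, torch.Tensor)  # guard against wrapped, non-tensor weight containers
    and hidden_states.dtype != wo.weight.dtype
    and wo.weight.dtype != torch.int8  # never upcast into an int8 weight's dtype
):
    hidden_states = hidden_states.to(wo.weight.dtype)

print(wo(hidden_states).dtype)  # torch.float32
```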
# See https://github.com/huggingface/transformers/issues/20287 # we also make sure the weights are not in `int8` in case users will force `_keep_in_fp32_modules` to be `None`` - if hidden_states.dtype != self.wo.weight.dtype and self.wo.weight.dtype != torch.int8: + if ( + isinstance(self.wo.weight, torch.Tensor) + and hidden_states.dtype != self.wo.weight.dtype + and self.wo.weight.dtype != torch.int8 + ): hidden_states = hidden_states.to(self.wo.weight.dtype) hidden_states = self.wo(hidden_states) return hidden_states -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5LayerCrossAttention with T5->Pop2Piano,t5->pop2piano class Pop2PianoLayerCrossAttention(nn.Module): def __init__(self, config): super().__init__() @@ -656,7 +667,8 @@ def forward( outputs = (layer_output,) + attention_output[1:] # add attentions if we output them return outputs -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5Block with T5->Pop2Piano,t5->pop2piano class Pop2PianoBlock(nn.Module): def __init__(self, config, has_relative_attention_bias=False): super().__init__() @@ -713,8 +725,12 @@ def forward( attention_outputs = self_attention_outputs[2:] # Keep self-attention outputs and relative position weights # clamp inf values to enable fp16 training - if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): - clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + if hidden_states.dtype == torch.float16: + clamp_value = torch.where( + torch.isinf(hidden_states).any(), + torch.finfo(hidden_states.dtype).max - 1000, + torch.finfo(hidden_states.dtype).max, + ) hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) do_cross_attention = self.is_decoder and encoder_hidden_states is not None @@ -740,8 +756,12 @@ def forward( hidden_states = cross_attention_outputs[0] # clamp inf values to enable fp16 training - if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): - clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + if hidden_states.dtype == torch.float16: + clamp_value = torch.where( + torch.isinf(hidden_states).any(), + torch.finfo(hidden_states.dtype).max - 1000, + torch.finfo(hidden_states.dtype).max, + ) hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) # Combine self attn and cross attn key value states @@ -755,8 +775,12 @@ def forward( hidden_states = self.layer[-1](hidden_states) # clamp inf values to enable fp16 training - if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): - clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + if hidden_states.dtype == torch.float16: + clamp_value = torch.where( + torch.isinf(hidden_states).any(), + torch.finfo(hidden_states.dtype).max - 1000, + torch.finfo(hidden_states.dtype).max, + ) hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) outputs = (hidden_states,) @@ -769,7 +793,7 @@ def forward( return outputs # hidden-states, present_key_value_states, (self-attention position bias), (self-attention weights), (cross-attention position bias), (cross-attention weights) -# Copied from transformers.models.t5.T5Stack with T5->Pop2Piano,t5->pop2piano +# Adapted from transformers.models.t5.modeling_t5.T5Stack with T5->Pop2Piano,t5->pop2piano class Pop2PianoStack(Pop2PianoPreTrainedModel): def __init__(self, config, embed_tokens=None): 
super().__init__(config) @@ -1015,15 +1039,14 @@ def custom_forward(*inputs): Pop2Piano_START_DOCSTRING = r""" - The Pop2PianoForConditionalGeneration model was proposed in [POP2PIANO : POP AUDIO-BASED PIANO COVER GENERATION](https://arxiv.org/pdf/2211.00895) by Jongho Choi, Kyogu - Lee. It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting. - This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the - library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads - etc.) - This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. - Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage - and behavior. Parameters: + The Pop2PianoForConditionalGeneration model was proposed in [POP2PIANO : POP AUDIO-BASED PIANO COVER + GENERATION](https://arxiv.org/pdf/2211.00895) by Jongho Choi, Kyogu Lee. It's an encoder decoder transformer + pre-trained in a text-to-text denoising generative setting. This model inherits from [`PreTrainedModel`]. Check the: + superclass documentation for the generic methods the library implements for all its model (such as downloading or + saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch + [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch + Module and refer to the PyTorch documentation for all matter related to general usage and behavior. config ([`Pop2PianoConfig`]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. @@ -1037,7 +1060,16 @@ def custom_forward(*inputs): num_heads)`. """ -# Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration with T5->Pop2Piano,t5->pop2piano +# Warning message for FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask +__HEAD_MASK_WARNING_MSG = """ +The input argument `head_mask` was split into two arguments `head_mask` and `decoder_head_mask`. Currently, +`decoder_head_mask` is set to copy `head_mask`, but this feature is deprecated and will be removed in future versions. +If you do not want to use any `decoder_head_mask` now, please set `decoder_head_mask = torch.ones(num_layers, +num_heads)`. 
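The deprecation message above asks callers to pass an explicit `decoder_head_mask` instead of relying on `head_mask` being copied. A small sketch of what that looks like at call time; the layer and head counts are placeholders that would come from the model config in practice.

```python
import torch

num_layers, num_heads = 6, 8  # in practice: config.num_decoder_layers, config.num_heads
head_mask = torch.ones(num_layers, num_heads)          # keep all encoder heads
decoder_head_mask = torch.ones(num_layers, num_heads)  # explicit, so the copy-and-warn path is never taken

# outputs = model(input_ids=input_ids, head_mask=head_mask, decoder_head_mask=decoder_head_mask)
```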
+""" + + +# Adapted from transformers.models.t5.modeling_t5.T5ForConditionalGeneration with T5->Pop2Piano,t5->pop2piano @add_start_docstrings("""Pop2Piano Model with a `language modeling` head on top.""", Pop2Piano_START_DOCSTRING) class Pop2PianoForConditionalGeneration(Pop2PianoPreTrainedModel): _keys_to_ignore_on_load_missing = [ @@ -1054,20 +1086,20 @@ def __init__(self, config: Pop2PianoConfig): self.config = config self.model_dim = config.d_model - self.spectrogram = LogMelSpectrogram(sampling_rate=config.dataset.get("sampling_rate"), - n_fft=config.n_fft, - hop_length=config.hop_length, - f_min=config.f_min, - n_mels=config.n_mels - ) - if config.dataset.get("mel_is_conditioned", True): + self.spectrogram = LogMelSpectrogram( + sampling_rate=config.dataset_sampling_rate, + n_fft=config.n_fft, + hop_length=config.hop_length, + f_min=config.f_min, + n_mels=config.n_mels, + ) + if config.dataset_mel_is_conditioned: n_dim = 512 composer_n_vocab = len(config.composer_to_feature_token) embedding_offset = min(config.composer_to_feature_token.values()) - self.mel_conditioner = ConcatEmbeddingToMel(embedding_offset=embedding_offset, - n_vocab=composer_n_vocab, - n_dim=n_dim - ) + self.mel_conditioner = ConcatEmbeddingToMel( + embedding_offset=embedding_offset, n_vocab=composer_n_vocab, n_dim=n_dim + ) self.shared = nn.Embedding(config.vocab_size, config.d_model) @@ -1116,23 +1148,23 @@ def get_decoder(self): @add_start_docstrings_to_model_forward(Pop2Piano_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC) def forward( - self, - input_ids: Optional[torch.LongTensor] = None, - attention_mask: Optional[torch.FloatTensor] = None, - decoder_input_ids: Optional[torch.LongTensor] = None, - decoder_attention_mask: Optional[torch.BoolTensor] = None, - head_mask: Optional[torch.FloatTensor] = None, - decoder_head_mask: Optional[torch.FloatTensor] = None, - cross_attn_head_mask: Optional[torch.Tensor] = None, - encoder_outputs: Optional[Tuple[Tuple[torch.Tensor]]] = None, - past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, - inputs_embeds: Optional[torch.FloatTensor] = None, - decoder_inputs_embeds: Optional[torch.FloatTensor] = None, - labels: Optional[torch.LongTensor] = None, - use_cache: Optional[bool] = None, - output_attentions: Optional[bool] = None, - output_hidden_states: Optional[bool] = None, - return_dict: Optional[bool] = None, + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.BoolTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + decoder_head_mask: Optional[torch.FloatTensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[Tuple[Tuple[torch.Tensor]]] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + decoder_inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, ) -> Union[Tuple[torch.FloatTensor], Seq2SeqLMOutput]: r""" labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): @@ -1140,25 +1172,7 @@ def forward( config.vocab_size - 1]`. 
All labels set to `-100` are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` Returns: - Examples: - ```python - >>> from transformers import AutoTokenizer, T5ForConditionalGeneration - >>> tokenizer = AutoTokenizer.from_pretrained("t5-small") - >>> model = T5ForConditionalGeneration.from_pretrained("t5-small") - >>> # training - >>> input_ids = tokenizer("The walks in park", return_tensors="pt").input_ids - >>> labels = tokenizer(" cute dog the ", return_tensors="pt").input_ids - >>> outputs = model(input_ids=input_ids, labels=labels) - >>> loss = outputs.loss - >>> logits = outputs.logits - >>> # inference - >>> input_ids = tokenizer( - ... "summarize: studies have shown that owning a dog is good for you", return_tensors="pt" - ... ).input_ids # Batch size 1 - >>> outputs = model.generate(input_ids) - >>> print(tokenizer.decode(outputs[0], skip_special_tokens=True)) - >>> # studies have shown that owning a dog is good for you. - ```""" + """ use_cache = use_cache if use_cache is not None else self.config.use_cache return_dict = return_dict if return_dict is not None else self.config.use_return_dict @@ -1234,7 +1248,7 @@ def forward( if self.config.tie_word_embeddings: # Rescale output before projecting on vocab # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586 - sequence_output = sequence_output * (self.model_dim ** -0.5) + sequence_output = sequence_output * (self.model_dim**-0.5) lm_logits = self.lm_head(sequence_output) @@ -1262,33 +1276,28 @@ def forward( @torch.no_grad() def generate( - self, - input_features:BatchFeature, - inputs_embeds=None, - composer="composer1", - n_bars:int = 2, - max_length:int=None, - inputs: Optional[torch.Tensor] = None, - generation_config=None, - logits_processor=None, - stopping_criteria=None, - prefix_allowed_tokens_fn=None, - synced_gpus=False, - return_timestamps=None, - task=None, - language=None, - is_multilingual=None, - **kwargs, + self, + input_features: BatchFeature, + inputs_embeds=None, + composer="composer1", + n_bars: int = 2, + max_length: int = None, + inputs: Optional[torch.Tensor] = None, + generation_config=None, + **kwargs, ): """ Generates sequences of token ids for models with a language modeling head. + + Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the model's default generation configuration. You can override any `generation_config` by passing the corresponding - parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. - For an overview of generation strategies and code examples, check out the [following - guide](./generation_strategies). + parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. For an overview of generation + strategies and code examples, check out the [following guide](./generation_strategies). + + Parameters: input_features (`BatchFeature`): `input_features` returned by `Pop2PianoFeatureExtractor.__call__` @@ -1314,33 +1323,6 @@ def generate( priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s default values, whose documentation should be checked to parameterize generation. 
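As the `generation_config` description above notes, per-call keyword arguments override the stored generation configuration. A short sketch of both styles; the `model` and `features` objects are assumed to exist already.

```python
from transformers import GenerationConfig

# Build an explicit configuration once and pass it in:
generation_config = GenerationConfig(max_length=256, num_beams=1)
# outputs = model.generate(input_features=features, generation_config=generation_config)

# Or override individual settings ad hoc; these take precedence over the stored defaults:
# outputs = model.generate(input_features=features, num_beams=4, do_sample=True)
```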
- logits_processor (`LogitsProcessorList`, *optional*): - Custom logits processors that complement the default logits processors built from arguments and - generation config. If a logit processor is passed that is already created with the arguments or a - generation config an error is thrown. This feature is intended for advanced users. - stopping_criteria (`StoppingCriteriaList`, *optional*): - Custom stopping criteria that complement the default stopping criteria built from arguments and a - generation config. If a stopping criteria is passed that is already created with the arguments or a - generation config an error is thrown. This feature is intended for advanced users. - prefix_allowed_tokens_fn (`Callable[[int, torch.Tensor], List[int]]`, *optional*): - If provided, this function constraints the beam search to allowed tokens only at each step. If not - provided no constraint is applied. This function takes 2 arguments: the batch ID `batch_id` and - `input_ids`. It has to return a list with the allowed tokens for the next generation step conditioned - on the batch ID `batch_id` and the previously generated tokens `inputs_ids`. This argument is useful - for constrained generation conditioned on the prefix, as described in [Autoregressive Entity - Retrieval](https://arxiv.org/abs/2010.00904). - synced_gpus (`bool`, *optional*, defaults to `False`): - Whether to continue running the while loop until max_length (needed for ZeRO stage 3) - return_timestamps (`bool`, *optional*): - Whether to return the timestamps with the text. This enables the `WhisperTimestampsLogitsProcessor`. - task (`bool`, *optional*): - Task to use for generation, either "translate" or "transcribe". The `model.config.forced_decoder_ids` - will be updated accordingly. - language (`bool`, *optional*): - Language token to use for generation, can be either in the form of `<|en|>`, `en` or `english`. You can - find all the possible language tokens in the `model.generation_config.lang_to_id` dictionary. - is_multilingual (`bool`, *optional*): - Whether or not the model is multilingual. kwargs: Ad hoc parametrization of `generate_config` and/or additional model-specific kwargs that will be forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder @@ -1360,87 +1342,56 @@ def generate( - [`~generation.SampleEncoderDecoderOutput`], - [`~generation.BeamSearchEncoderDecoderOutput`], - [`~generation.BeamSampleEncoderDecoderOutput`] - """ + Examples: + ```python + >>> import librosa + >>> from transformers import Pop2PianoFeatureExtractor, Pop2PianoForConditionalGeneration + + >>> raw_audio, sr = librosa.load("audio.mp3", sr=44100) + >>> model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev") + >>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("susnato/pop2piano_dev") + >>> model.eval() + + >>> feature_extractor_outputs = feature_extractor(raw_audio=raw_audio, audio_sr=sr, return_tensors="pt") + >>> model_outputs = model.generate(feature_extractor_outputs, composer="composer1") + + >>> prettymidi_output = feature_extractor.postprocess( + ... relative_tokens=model_outputs, + ... beatsteps=feature_extractor_outputs["beatsteps"], + ... ext_beatstep=feature_extractor_outputs["ext_beatstep"], + ... raw_audio=raw_audio, + ... sampling_rate=sr, + ... mix_sampling_rate=sr, + ... save_path="./Outputs/", + ... audio_file_name="output_filename", + ... save_midi=True, + ... save_mix=True, + ... 
) + ```""" if input_features is not None and inputs_embeds is not None: raise ValueError("Both input_features and inputs_embeds received. Please give only input_features") if generation_config is None: generation_config = self.generation_config - if return_timestamps is not None: - if not hasattr(generation_config, "no_timestamps_token_id"): - raise ValueError( - "You are trying to return timestamps, but the generation config is not properly set." - "Make sure to initialize the generation config with the correct attributes that are needed such as `no_timestamps_token_id`." - "For more details on how to generate the approtiate config, refer to https://github.com/huggingface/transformers/issues/21878#issuecomment-1451902363" - ) - - generation_config.return_timestamps = return_timestamps - else: - generation_config.return_timestamps = False - - if language is not None: - generation_config.language = language - if task is not None: - generation_config.task = task - - forced_decoder_ids = [] - if task is not None or language is not None: - if hasattr(generation_config, "language"): - if generation_config.language in generation_config.lang_to_id.keys(): - language_token = generation_config.language - elif generation_config.language in TO_LANGUAGE_CODE.keys(): - language_token = f"<|{TO_LANGUAGE_CODE[generation_config.language]}|>" - else: - raise ValueError( - f"Unsupported language: {self.language}. Language should be one of:" - f" {list(TO_LANGUAGE_CODE.keys()) if generation_config.language in TO_LANGUAGE_CODE.keys() else list(TO_LANGUAGE_CODE.values())}." - ) - forced_decoder_ids.append((1, generation_config.lang_to_id[language_token])) - else: - forced_decoder_ids.append((1, None)) # automatically detect the language - - if hasattr(generation_config, "task"): - if generation_config.task in TASK_IDS: - forced_decoder_ids.append((2, generation_config.task_to_id[generation_config.task])) - else: - raise ValueError( - f"The `{generation_config.task}`task is not supported. 
The task should be one of `{TASK_IDS}`" - ) - else: - forced_decoder_ids.append((2, generation_config.task_to_id["transcribe"])) # defaults to transcribe - if hasattr(generation_config, "no_timestamps_token_id") and not generation_config.return_timestamps: - idx = forced_decoder_ids[-1][0] + 1 if forced_decoder_ids else 1 - forced_decoder_ids.append((idx, generation_config.no_timestamps_token_id)) - - # Legacy code for backward compatibility - elif hasattr(self.config, "forced_decoder_ids") and self.config.forced_decoder_ids is not None: - forced_decoder_ids = self.config.forced_decoder_ids - elif ( - hasattr(self.generation_config, "forced_decoder_ids") - and self.generation_config.forced_decoder_ids is not None - ): - forced_decoder_ids = self.generation_config.forced_decoder_ids - - if generation_config.return_timestamps: - logits_processor = [WhisperTimeStampLogitsProcessor(generation_config)] - - if len(forced_decoder_ids) > 0: - generation_config.forced_decoder_ids = forced_decoder_ids - # select composer randomly if not already given composer_to_feature_token = self.config.composer_to_feature_token if composer is None: composer = np.random.choice(list(composer_to_feature_token.keys()), size=1)[0] elif composer not in composer_to_feature_token.keys(): - raise ValueError(f"Composer not found in list, Please choose from {list(composer_to_feature_token.keys())}") + raise ValueError( + f"Composer not found in list, Please choose from {list(composer_to_feature_token.keys())}" + ) - n_bars = self.config.dataset.get("n_bars", None) if n_bars is None else n_bars - max_length = self.config.dataset.get("target_length") * max(1, (n_bars // self.config.dataset.get("n_bars"))) \ - if max_length is None else max_length + n_bars = self.config.dataset_n_bars if n_bars is None else n_bars + max_length = ( + self.config.dataset_target_length * max(1, (n_bars // self.config.dataset_n_bars)) + if max_length is None + else max_length + ) inputs_embeds = self.spectrogram(input_features["input_features"]).transpose(-1, -2) - if self.config.dataset.get("mel_is_conditioned", None): + if self.config.dataset_mel_is_conditioned: composer_value = composer_to_feature_token[composer] composer_value = torch.tensor(composer_value, device=self.device) composer_value = composer_value.repeat(inputs_embeds.shape[0]) @@ -1449,26 +1400,22 @@ def generate( return super().generate( inputs, generation_config, - logits_processor, - stopping_criteria, - prefix_allowed_tokens_fn, - synced_gpus, inputs_embeds=inputs_embeds, max_length=max_length, **kwargs, ) def prepare_inputs_for_generation( - self, - input_ids, - past_key_values=None, - attention_mask=None, - head_mask=None, - decoder_head_mask=None, - cross_attn_head_mask=None, - use_cache=None, - encoder_outputs=None, - **kwargs, + self, + input_ids, + past_key_values=None, + attention_mask=None, + head_mask=None, + decoder_head_mask=None, + cross_attn_head_mask=None, + use_cache=None, + encoder_outputs=None, + **kwargs, ): # cut decoder_input_ids if past is used if past_key_values is not None: diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py index 80b6e9c72b30f6..d265c854bcbb63 100644 --- a/src/transformers/testing_utils.py +++ b/src/transformers/testing_utils.py @@ -55,6 +55,7 @@ is_cython_available, is_decord_available, is_detectron2_available, + is_essentia_available, is_faiss_available, is_flax_available, is_ftfy_available, @@ -62,12 +63,11 @@ is_jumanpp_available, is_keras_nlp_available, is_librosa_available, - is_essentia_available, - 
is_pretty_midi_available, is_natten_available, is_onnx_available, is_pandas_available, is_phonemizer_available, + is_pretty_midi_available, is_pyctcdecode_available, is_pytesseract_available, is_pytorch_quantization_available, @@ -706,18 +706,21 @@ def require_librosa(test_case): """ return unittest.skipUnless(is_librosa_available(), "test requires librosa")(test_case) + def require_essentia(test_case): """ Decorator marking a test that requires essentia """ return unittest.skipUnless(is_essentia_available(), "test requires essentia")(test_case) + def require_pretty_midi(test_case): """ Decorator marking a test that requires pretty_midi """ return unittest.skipUnless(is_pretty_midi_available(), "test requires pretty_midi")(test_case) + def cmd_exists(cmd): return shutil.which(cmd) is not None diff --git a/src/transformers/utils/__init__.py b/src/transformers/utils/__init__.py index 2a91bfa491d18b..fe3c1c65b62058 100644 --- a/src/transformers/utils/__init__.py +++ b/src/transformers/utils/__init__.py @@ -104,6 +104,7 @@ is_datasets_available, is_decord_available, is_detectron2_available, + is_essentia_available, is_faiss_available, is_flax_available, is_ftfy_available, @@ -113,13 +114,12 @@ is_kenlm_available, is_keras_nlp_available, is_librosa_available, - is_essentia_available, - is_pretty_midi_available, is_natten_available, is_ninja_available, is_onnx_available, is_pandas_available, is_phonemizer_available, + is_pretty_midi_available, is_protobuf_available, is_psutil_available, is_py3nvml_available, diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py index a80af49e278499..f800fa96575bb7 100644 --- a/src/transformers/utils/dummy_pt_objects.py +++ b/src/transformers/utils/dummy_pt_objects.py @@ -5111,6 +5111,23 @@ def __init__(self, *args, **kwargs): requires_backends(self, ["torch"]) +POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST = None + + +class Pop2PianoForConditionalGeneration(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + +class Pop2PianoPreTrainedModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST = None diff --git a/src/transformers/utils/import_utils.py b/src/transformers/utils/import_utils.py index 11e06ae94fee63..8d4063c64f568b 100644 --- a/src/transformers/utils/import_utils.py +++ b/src/transformers/utils/import_utils.py @@ -312,12 +312,15 @@ def is_pyctcdecode_available(): def is_librosa_available(): return _librosa_available + def is_essentia_available(): return _essentia_available + def is_pretty_midi_available(): return _pretty_midi_available + def is_torch_cuda_available(): if is_torch_available(): import torch diff --git a/tests/models/pop2piano/test_feature_extraction_pop2piano.py b/tests/models/pop2piano/test_feature_extraction_pop2piano.py index 0df3e8533b9cb2..cc996fdc5fa347 100644 --- a/tests/models/pop2piano/test_feature_extraction_pop2piano.py +++ b/tests/models/pop2piano/test_feature_extraction_pop2piano.py @@ -14,9 +14,7 @@ # limitations under the License. 
-import itertools import os -import random import tempfile import unittest @@ -24,23 +22,41 @@ from datasets import load_dataset from transformers import is_speech_available -from transformers.testing_utils import (check_json_file_has_correct_format, require_torch, - require_essentia, require_librosa, require_scipy, - require_pretty_midi, require_soundfile) -from transformers.utils.import_utils import (is_torch_available, is_essentia_available, - is_scipy_available, is_librosa_available, - is_soundfile_availble, ) +from transformers.testing_utils import ( + check_json_file_has_correct_format, + require_essentia, + require_librosa, + require_pretty_midi, + require_scipy, + require_soundfile, + require_torch, +) +from transformers.utils.import_utils import ( + is_essentia_available, + is_librosa_available, + is_scipy_available, + is_soundfile_availble, + is_torch_available, +) from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin -requirements = is_speech_available() and is_torch_available() and is_essentia_available() and is_scipy_available() and \ - is_librosa_available() and is_soundfile_availble() + +requirements = ( + is_speech_available() + and is_torch_available() + and is_essentia_available() + and is_scipy_available() + and is_librosa_available() + and is_soundfile_availble() +) if requirements: from transformers import Pop2PianoFeatureExtractor if is_torch_available(): import torch + @require_torch @require_essentia @require_librosa @@ -79,9 +95,10 @@ def prepare_feat_extract_dict(self): "vocab_size_special": self.vocab_size_special, "vocab_size_note": self.vocab_size_note, "vocab_size_velocity": self.vocab_size_velocity, - "vocab_size_time":self.vocab_size_time, + "vocab_size_time": self.vocab_size_time, } + @require_torch @require_essentia @require_librosa @@ -126,7 +143,11 @@ def test_feat_extract_to_json_file(self): def test_call(self): feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict()) - speech_input = np.zeros([1000000, ]) + speech_input = np.zeros( + [ + 1000000, + ] + ) input_features = feature_extractor(speech_input, audio_sr=16_000, return_tensors="np") self.assertTrue(input_features.input_features.ndim == 2) @@ -142,12 +163,38 @@ def _load_datasamples(self, num_samples): def test_integration(self): EXPECTED_INPUT_FEATURES = torch.tensor( - [-4.5434e-05, -1.8900e-04, -2.2150e-04, -2.1844e-04, -2.7647e-04, - -2.1334e-04, -1.5305e-04, -2.6124e-04, -2.6863e-04, -1.5969e-04, - -1.6224e-04, -1.2900e-04, -9.9139e-06, 1.5336e-05, 4.7507e-05, - 9.3454e-05, -2.3652e-05, -1.2942e-04, -1.0804e-04, -1.4267e-04, - -1.5102e-04, -6.7488e-05, -9.6527e-05, -9.6909e-05, 8.0032e-05, - 8.1948e-05, -7.3148e-05, 3.4405e-05, 1.5065e-04, -1.0989e-04] + [ + -4.5434e-05, + -1.8900e-04, + -2.2150e-04, + -2.1844e-04, + -2.7647e-04, + -2.1334e-04, + -1.5305e-04, + -2.6124e-04, + -2.6863e-04, + -1.5969e-04, + -1.6224e-04, + -1.2900e-04, + -9.9139e-06, + 1.5336e-05, + 4.7507e-05, + 9.3454e-05, + -2.3652e-05, + -1.2942e-04, + -1.0804e-04, + -1.4267e-04, + -1.5102e-04, + -6.7488e-05, + -9.6527e-05, + -9.6909e-05, + 8.0032e-05, + 8.1948e-05, + -7.3148e-05, + 3.4405e-05, + 1.5065e-04, + -1.0989e-04, + ] ) input_speech, sampling_rate = self._load_datasamples(1) @@ -197,4 +244,4 @@ def test_padding_from_list(self): @unittest.skip("Pop2PianoFeatureExtractor does not supports padding") def test_padding_from_array(self): - pass \ No newline at end of file + pass diff --git 
a/tests/models/pop2piano/test_modeling_pop2piano.py b/tests/models/pop2piano/test_modeling_pop2piano.py index 30777a7c144e61..4b3ffd3da8c2a2 100644 --- a/tests/models/pop2piano/test_modeling_pop2piano.py +++ b/tests/models/pop2piano/test_modeling_pop2piano.py @@ -15,23 +15,18 @@ """ Testing suite for the PyTorch Pop2Piano model. """ import copy -import inspect -import os import tempfile import unittest -import numpy as np - -import transformers from transformers import Pop2PianoConfig -from transformers.testing_utils import is_pt_flax_cross_test, require_torch, require_torchaudio, slow, torch_device -from transformers.utils import cached_property, is_flax_available, is_torch_available from transformers.feature_extraction_utils import BatchFeature +from transformers.testing_utils import require_torch, require_torchaudio, slow, torch_device +from transformers.utils import is_torch_available # from ...test_pipeline_mixin import PipelineTesterMixin from ...generation.test_utils import GenerationTesterMixin from ...test_configuration_common import ConfigTester -from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor, ids_tensor +from ...test_modeling_common import ModelTesterMixin, ids_tensor if is_torch_available(): @@ -39,10 +34,10 @@ from transformers import ( Pop2PianoForConditionalGeneration, - set_seed, ) from transformers.models.pop2piano.modeling_pop2piano import POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST + class Pop2PianoModelTester: def __init__( self, @@ -393,7 +388,9 @@ def create_and_check_model_fp16_forward( lm_labels, ): model = Pop2PianoForConditionalGeneration(config=config).to(torch_device).half().eval() - output = model(input_ids, decoder_input_ids=input_ids, attention_mask=attention_mask)["encoder_last_hidden_state"] + output = model(input_ids, decoder_input_ids=input_ids, attention_mask=attention_mask)[ + "encoder_last_hidden_state" + ] self.parent.assertFalse(torch.isnan(output).any().item()) def create_and_check_encoder_decoder_shared_weights( @@ -509,7 +506,7 @@ def prepare_config_and_inputs_for_common(self): @require_torch class Pop2PianoModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase): - all_model_classes = (Pop2PianoForConditionalGeneration, ) if is_torch_available() else () + all_model_classes = (Pop2PianoForConditionalGeneration,) if is_torch_available() else () all_generative_model_classes = () all_parallelizable_model_classes = () fx_compatible = False @@ -591,10 +588,6 @@ def test_decoder_model_past_with_large_inputs(self): config_and_inputs = self.model_tester.prepare_config_and_inputs() self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs) - def test_generate_with_past_key_values(self): - config_and_inputs = self.model_tester.prepare_config_and_inputs() - self.model_tester.create_and_check_generate_with_past_key_values(*config_and_inputs) - def test_encoder_decoder_shared_weights(self): config_and_inputs = self.model_tester.prepare_config_and_inputs() self.model_tester.create_and_check_encoder_decoder_shared_weights(*config_and_inputs) @@ -628,40 +621,6 @@ def test_export_to_onnx(self): input_names=["input_ids", "decoder_input_ids"], ) - def test_generate_with_head_masking(self): - attention_names = ["encoder_attentions", "decoder_attentions", "cross_attentions"] - config_and_inputs = self.model_tester.prepare_config_and_inputs() - config = config_and_inputs[0] - max_length = config_and_inputs[1].shape[-1] + 3 - model = Pop2PianoForConditionalGeneration(config).eval() - 
model.to(torch_device) - - head_masking = { - "head_mask": torch.zeros(config.num_layers, config.num_heads, device=torch_device), - "decoder_head_mask": torch.zeros(config.num_decoder_layers, config.num_heads, device=torch_device), - "cross_attn_head_mask": torch.zeros(config.num_decoder_layers, config.num_heads, device=torch_device), - } - - for attn_name, (name, mask) in zip(attention_names, head_masking.items()): - head_masks = {name: mask} - # Explicitly pass decoder_head_mask as it is required from Pop2Piano model when head_mask specified - if name == "head_mask": - head_masks["decoder_head_mask"] = torch.ones( - config.num_decoder_layers, config.num_heads, device=torch_device - ) - - out = model.generate( - config_and_inputs[1], - num_beams=1, - max_length=max_length, - output_attentions=True, - return_dict_in_generate=True, - **head_masks, - ) - # We check the state of decoder_attentions and cross_attentions just from the last step - attn_weights = out[attn_name] if attn_name == attention_names[0] else out[attn_name][-1] - self.assertEqual(sum([w.sum().item() for w in attn_weights]), 0.0) - @unittest.skip("Does not work on the tiny model as we keep hitting edge cases.") def test_disk_offload(self): pass @@ -674,6 +633,7 @@ def test_generate_with_head_masking(self): def test_generate_with_past_key_values(self): pass + @require_torch @require_torchaudio class Pop2PianoModelIntegrationTests(unittest.TestCase): @@ -687,11 +647,14 @@ def test_log_mel_spectrogram_integration(self): self.assertEqual(output.size(), torch.Size([10, 512, 98])) # check values - self.assertEqual(output[0, :3, :3].cpu().numpy().tolist(), - [[-13.815510749816895, -13.815510749816895, -13.815510749816895], - [-13.815510749816895, -13.815510749816895, -13.815510749816895], - [-13.815510749816895, -13.815510749816895, -13.815510749816895]] - ) + self.assertEqual( + output[0, :3, :3].cpu().numpy().tolist(), + [ + [-13.815510749816895, -13.815510749816895, -13.815510749816895], + [-13.815510749816895, -13.815510749816895, -13.815510749816895], + [-13.815510749816895, -13.815510749816895, -13.815510749816895], + ], + ) @slow def test_mel_conditioner_integration(self): @@ -708,23 +671,20 @@ def test_mel_conditioner_integration(self): self.assertEqual(outputs.size(), torch.Size([10, 101, 512])) # check values - self.assertEqual(outputs[0, :3, :3].detach().cpu().numpy().tolist(), - [[1.0475305318832397, 0.29052114486694336, -0.47778210043907166], - [1.0, 1.0, 1.0], - [1.0, 1.0, 1.0]] - ) + self.assertEqual( + outputs[0, :3, :3].detach().cpu().numpy().tolist(), + [[1.0475305318832397, 0.29052114486694336, -0.47778210043907166], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], + ) @slow def test_full_model_integration(self): model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev") model.eval() - input_features = BatchFeature({'input_features': torch.ones([100, 100000])}) + input_features = BatchFeature({"input_features": torch.ones([100, 100000])}) outputs = model.generate(input_features=input_features) # check for shapes self.assertEqual(outputs.size(0), 100) # check for values - self.assertEqual(outputs[0, :3].detach().cpu().numpy().tolist(), - [0, 134, 133] - ) \ No newline at end of file + self.assertEqual(outputs[0, :3].detach().cpu().numpy().tolist(), [0, 134, 133]) diff --git a/utils/check_repo.py b/utils/check_repo.py index 121993bc1e833c..4cbcd1a3ca942b 100644 --- a/utils/check_repo.py +++ b/utils/check_repo.py @@ -45,6 +45,7 @@ "RealmBertModel", "T5Stack", "MT5Stack", + "Pop2PianoStack", 
"SwitchTransformersStack", "TFDPRSpanPredictor", "MaskFormerSwinModel", diff --git a/utils/documentation_tests.txt b/utils/documentation_tests.txt index 8b622bf778dc2b..035ce24e7da2bd 100644 --- a/utils/documentation_tests.txt +++ b/utils/documentation_tests.txt @@ -147,6 +147,8 @@ src/transformers/models/plbart/configuration_plbart.py src/transformers/models/plbart/modeling_plbart.py src/transformers/models/poolformer/configuration_poolformer.py src/transformers/models/poolformer/modeling_poolformer.py +src/transformers/models/pop2piano/modeling_pop2piano.py +src/transformers/models/pop2piano/configuration_pop2piano.py src/transformers/models/realm/configuration_realm.py src/transformers/models/reformer/configuration_reformer.py src/transformers/models/reformer/modeling_reformer.py