diff --git a/README.md b/README.md index 6397d14adaadb2..195ae1c03bed11 100644 --- a/README.md +++ b/README.md @@ -401,6 +401,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. diff --git a/README_es.md b/README_es.md index 899140210cf13d..193e2f5747b9c3 100644 --- a/README_es.md +++ b/README_es.md @@ -389,6 +389,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. 
**[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. diff --git a/README_hd.md b/README_hd.md index 826306cf67bf4a..aa5c5969f96ec2 100644 --- a/README_hd.md +++ b/README_hd.md @@ -361,6 +361,7 @@ conda install -c huggingface transformers 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research से) कागज के साथ [PhoBERT: वियतनामी के लिए पूर्व-प्रशिक्षित भाषा मॉडल](https://www .aclweb.org/anthology/2020.findings-emnlp.92/) डैट क्वोक गुयेन और अन्ह तुआन गुयेन द्वारा पोस्ट किया गया। 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP से) साथ वाला पेपर [प्रोग्राम अंडरस्टैंडिंग एंड जेनरेशन के लिए यूनिफाइड प्री-ट्रेनिंग](https://arxiv .org/abs/2103.06333) वसी उद्दीन अहमद, सैकत चक्रवर्ती, बैशाखी रे, काई-वेई चांग द्वारा। 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [ProphetNet: प्रेडिक्टिंग फ्यूचर एन-ग्राम फॉर सीक्वेंस-टू-सीक्वेंस प्री-ट्रेनिंग ](https://arxiv.org/abs/2001.04063) यू यान, वीज़ेन क्यूई, येयुन गोंग, दयाहेंग लियू, नान डुआन, जिउशेंग चेन, रुओफ़ेई झांग और मिंग झोउ द्वारा पोस्ट किया गया। 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA से) साथ वाला पेपर [डीप लर्निंग इंफ़ेक्शन के लिए इंटीजर क्वांटिज़ेशन: प्रिंसिपल्स एंड एम्पिरिकल इवैल्यूएशन](https:// arxiv.org/abs/2004.09602) हाओ वू, पैट्रिक जुड, जिआओजी झांग, मिखाइल इसेव और पॉलियस माइकेविसियस द्वारा। 1. 
**[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (फेसबुक से) साथ में कागज [रिट्रीवल-ऑगमेंटेड जेनरेशन फॉर नॉलेज-इंटेंसिव एनएलपी टास्क](https://arxiv .org/abs/2005.11401) पैट्रिक लुईस, एथन पेरेज़, अलेक्जेंड्रा पिक्टस, फैबियो पेट्रोनी, व्लादिमीर कारपुखिन, नमन गोयल, हेनरिक कुटलर, माइक लुईस, वेन-ताउ यिह, टिम रॉकटाशेल, सेबस्टियन रिडेल, डौवे कीला द्वारा। diff --git a/README_ja.md b/README_ja.md index b45cc68ea6b2b8..7f289adecdc364 100644 --- a/README_ja.md +++ b/README_ja.md @@ -423,6 +423,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research から) Dat Quoc Nguyen and Anh Tuan Nguyen から公開された研究論文: [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP から) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang から公開された研究論文: [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs から) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng から公開された研究論文: [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research から) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou から公開された研究論文: [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA から) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius から公開された研究論文: [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (Facebook から) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela から公開された研究論文: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) diff --git a/README_ko.md b/README_ko.md index a5c0b8cf1eee75..fe7b68ee65afb8 100644 --- a/README_ko.md +++ b/README_ko.md @@ -338,6 +338,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research 에서) Dat Quoc Nguyen and Anh Tuan Nguyen 의 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 논문과 함께 발표했습니다. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP 에서) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 의 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 논문과 함께 발표했습니다. 1. 
**[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs 에서) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 의 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 논문과 함께 발표했습니다. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research 에서) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 의 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 논문과 함께 발표했습니다. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA 에서) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 의 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 논문과 함께 발표했습니다. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (Facebook 에서) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela 의 [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) 논문과 함께 발표했습니다. diff --git a/README_zh-hans.md b/README_zh-hans.md index 9ae3bc24494f47..9a63bb249090bb 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -362,6 +362,7 @@ conda install -c huggingface transformers 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (来自 UCLA NLP) 伴随论文 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 由 Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 发布。 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (来自 Sea AI Labs) 伴随论文 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 由 Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 发布。 +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。 1. 
**[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (来自 Facebook) 伴随论文 [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) 由 Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela 发布。 diff --git a/README_zh-hant.md b/README_zh-hant.md index 53847bf6739ac4..f7b64aaaf075c5 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -374,6 +374,7 @@ conda install -c huggingface transformers 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 7586985b111221..8769450f8b8bed 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -519,6 +519,8 @@ title: Hubert - local: model_doc/mctct title: MCTCT + - local: model_doc/pop2piano + title: Pop2Piano - local: model_doc/sew title: SEW - local: model_doc/sew-d diff --git a/docs/source/en/index.mdx b/docs/source/en/index.mdx index ea1ab27e7970fa..1c032dffe64c7b 100644 --- a/docs/source/en/index.mdx +++ b/docs/source/en/index.mdx @@ -175,6 +175,7 @@ The documentation is organized into five sections: 1. 
**[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. 1. **[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. 1. **[RAG](model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. @@ -366,6 +367,7 @@ Flax), PyTorch, and/or TensorFlow. | Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ | | PLBart | ✅ | ❌ | ✅ | ❌ | ❌ | | PoolFormer | ❌ | ❌ | ✅ | ❌ | ❌ | +| Pop2Piano | ❌ | ❌ | ✅ | ❌ | ❌ | | ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ | | QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ | | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | diff --git a/docs/source/en/model_doc/pop2piano.mdx b/docs/source/en/model_doc/pop2piano.mdx index 75d8280f2bcb68..61722c4475b956 100644 --- a/docs/source/en/model_doc/pop2piano.mdx +++ b/docs/source/en/model_doc/pop2piano.mdx @@ -32,11 +32,13 @@ a piano cover from pop audio without melody and chord extraction modules. We show that Pop2Piano trained with our dataset can generate plausible piano covers.* - Tips: - +1. Pop2Piano is an encoder-decoder model like T5. +2. Pop2Piano can be used to generate midi-audio files for a given audio sequence. This HuggingFace implementation allows you to save the midi output as well as the stereo-mix output of the audio sequence. +3. Choosing different composers in Pop2PianoForConditionalGeneration.generate can lead to a variety of different results. +4. Please note that the HuggingFace implementation of Pop2Piano (both Pop2PianoForConditionalGeneration and Pop2PianoFeatureExtractor) can only work with one raw_audio sequence at a time, so if you want to process multiple files, please feed them one by one. This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/sweetcocoa/pop2piano). @@ -48,7 +50,7 @@ The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
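A minimal end-to-end usage sketch, based on the tips above and on the `Pop2PianoFeatureExtractor.__call__` and `postprocess` signatures added in this PR; the `susnato/pop2piano_dev` checkpoint name is taken from the development files in this diff, while the exact keyword arguments accepted by `generate` are an assumption:

```python
import librosa

from transformers import Pop2PianoFeatureExtractor, Pop2PianoForConditionalGeneration

# The feature extractor and the model handle one raw audio sequence at a time.
raw_audio, sr = librosa.load("pop_song.wav", sr=44100)

feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("susnato/pop2piano_dev")
model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev")

# Rhythm-aligned log-mel features plus the beat information needed later for postprocessing:
# returns "input_features", "beatsteps" and "ext_beatstep".
inputs = feature_extractor(raw_audio=raw_audio, audio_sr=sr, return_tensors="pt")

# Generate relative MIDI tokens; passing `input_features` mirrors the feature extractor's
# `model_input_names` and is an assumption about the final `generate` signature.
relative_tokens = model.generate(input_features=inputs["input_features"])

# Turn the tokens back into a pretty_midi.PrettyMIDI object and write the .mid file to disk.
pm = feature_extractor.postprocess(
    relative_tokens=relative_tokens,
    beatsteps=inputs["beatsteps"],
    ext_beatstep=inputs["ext_beatstep"],
    raw_audio=raw_audio,
    sampling_rate=sr,
    save_path="./outputs",
    audio_file_name="pop_song",
    save_midi=True,
)
```

To steer the style, one of the `composer1`–`composer21` entries from `Pop2PianoConfig.composer_to_feature_token` can be selected, as mentioned in tip 3; how that token is passed to `generate` is not shown in this diff, so it is omitted here.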
## Pop2PianoFeatureExtractor -[[autodoc]] WhisperFeatureExtractor +[[autodoc]] Pop2PianoFeatureExtractor - __call__ ## Pop2PianoForConditionalGeneration diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index 5a6846d3b70e73..243128d98af934 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -405,6 +405,11 @@ "models.phobert": ["PhobertTokenizer"], "models.plbart": ["PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP", "PLBartConfig"], "models.poolformer": ["POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PoolFormerConfig"], + "models.pop2piano": [ + "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Pop2PianoConfig", + "Pop2PianoFeatureExtractor", + ], "models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"], "models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"], "models.rag": ["RagConfig", "RagRetriever", "RagTokenizer"], @@ -524,14 +529,6 @@ "WhisperFeatureExtractor", "WhisperProcessor", "WhisperTokenizer", - ], - "models.pop2piano": [ - "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", - "Pop2PianoConfig", - "Pop2PianoFeatureExtractor", - - - ], "models.x_clip": [ "XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", @@ -2127,6 +2124,13 @@ "PoolFormerPreTrainedModel", ] ) + _import_structure["models.pop2piano"].extend( + [ + "POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST", + "Pop2PianoForConditionalGeneration", + "Pop2PianoPreTrainedModel", + ] + ) _import_structure["models.prophetnet"].extend( [ "PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2613,13 +2617,6 @@ "WhisperPreTrainedModel", ] ) - _import_structure["models.pop2piano"].extend( - [ - "POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST", - "Pop2PianoForConditionalGeneration", - "Pop2PianoPreTrainedModel", - ] - ) _import_structure["models.x_clip"].extend( [ "XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -4031,6 +4028,11 @@ from .models.phobert import PhobertTokenizer from .models.plbart import PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP, PLBartConfig from .models.poolformer import POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, PoolFormerConfig + from .models.pop2piano import ( + POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP, + Pop2PianoConfig, + Pop2PianoFeatureExtractor, + ) from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig from .models.rag import RagConfig, RagRetriever, RagTokenizer @@ -4133,14 +4135,6 @@ WhisperFeatureExtractor, WhisperProcessor, WhisperTokenizer, - ) - from .models.pop2piano import ( - POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP, - Pop2PianoConfig, - Pop2PianoFeatureExtractor, - - - ) from .models.x_clip import ( XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -5472,6 +5466,11 @@ PoolFormerModel, PoolFormerPreTrainedModel, ) + from .models.pop2piano import ( + POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST, + Pop2PianoForConditionalGeneration, + Pop2PianoPreTrainedModel, + ) from .models.prophetnet import ( PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST, ProphetNetDecoder, @@ -5859,11 +5858,6 @@ WhisperModel, WhisperPreTrainedModel, ) - from .models.pop2piano import ( - POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST, - Pop2PianoForConditionalGeneration, - Pop2PianoPreTrainedModel, - ) from .models.x_clip import ( XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST, XCLIPModel, diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py index 94ba97db92ed86..5e36c41ae733d4 100644 --- a/src/transformers/models/__init__.py +++ 
b/src/transformers/models/__init__.py @@ -142,6 +142,7 @@ phobert, plbart, poolformer, + pop2piano, prophetnet, qdqbert, rag, @@ -197,7 +198,6 @@ wav2vec2_with_lm, wavlm, whisper, - pop2piano, x_clip, xglm, xlm, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index c5bf3d2b8a59c5..3ac19982468be4 100755 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -143,6 +143,7 @@ ("perceiver", "PerceiverConfig"), ("plbart", "PLBartConfig"), ("poolformer", "PoolFormerConfig"), + ("pop2piano", "Pop2PianoConfig"), ("prophetnet", "ProphetNetConfig"), ("qdqbert", "QDQBertConfig"), ("rag", "RagConfig"), @@ -195,7 +196,6 @@ ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"), ("wavlm", "WavLMConfig"), ("whisper", "WhisperConfig"), - ("pop2piano", "Pop2PianoConfig"), ("xclip", "XCLIPConfig"), ("xglm", "XGLMConfig"), ("xlm", "XLMConfig"), @@ -318,6 +318,7 @@ ("perceiver", "PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("plbart", "PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("poolformer", "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("pop2piano", "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("prophetnet", "PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("qdqbert", "QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -361,7 +362,6 @@ ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("whisper", "WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP"), - ("pop2piano", "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xclip", "XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xglm", "XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xlm", "XLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -509,6 +509,7 @@ ("phobert", "PhoBERT"), ("plbart", "PLBart"), ("poolformer", "PoolFormer"), + ("pop2piano", "Pop2Piano"), ("prophetnet", "ProphetNet"), ("qdqbert", "QDQBert"), ("rag", "RAG"), @@ -565,7 +566,6 @@ ("wav2vec2_phoneme", "Wav2Vec2Phoneme"), ("wavlm", "WavLM"), ("whisper", "Whisper"), - ("pop2piano", "Pop2Piano"), ("xclip", "X-CLIP"), ("xglm", "XGLM"), ("xlm", "XLM"), diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py index f8522e3d307b71..542103565698b4 100644 --- a/src/transformers/models/auto/feature_extraction_auto.py +++ b/src/transformers/models/auto/feature_extraction_auto.py @@ -71,6 +71,7 @@ ("owlvit", "OwlViTFeatureExtractor"), ("perceiver", "PerceiverFeatureExtractor"), ("poolformer", "PoolFormerFeatureExtractor"), + ("pop2piano", "Pop2PianoFeatureExtractor"), ("regnet", "ConvNextFeatureExtractor"), ("resnet", "ConvNextFeatureExtractor"), ("segformer", "SegformerFeatureExtractor"), @@ -95,7 +96,6 @@ ("wav2vec2-conformer", "Wav2Vec2FeatureExtractor"), ("wavlm", "Wav2Vec2FeatureExtractor"), ("whisper", "WhisperFeatureExtractor"), - ("pop2piano", "Pop2PianoFeatureExtractor"), ("xclip", "CLIPFeatureExtractor"), ("yolos", "YolosFeatureExtractor"), ] diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index 12d8d1dc775f30..4d717b7fd2196e 100755 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -311,6 +311,7 @@ ("openai-gpt", "OpenAIGPTLMHeadModel"), ("pegasus_x", "PegasusXForConditionalGeneration"), ("plbart", "PLBartForConditionalGeneration"), + ("pop2piano", "Pop2PianoForConditionalGeneration"), ("qdqbert", 
"QDQBertForMaskedLM"), ("reformer", "ReformerModelWithLMHead"), ("rembert", "RemBertForMaskedLM"), @@ -326,7 +327,6 @@ ("transfo-xl", "TransfoXLLMHeadModel"), ("wav2vec2", "Wav2Vec2ForMaskedLM"), ("whisper", "WhisperForConditionalGeneration"), - ("pop2piano", "Pop2PianoForConditionalGeneration"), ("xlm", "XLMWithLMHeadModel"), ("xlm-roberta", "XLMRobertaForMaskedLM"), ("xlm-roberta-xl", "XLMRobertaXLForMaskedLM"), @@ -612,11 +612,11 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict( [ + ("pop2piano", "Pop2PianoForConditionalGeneration"), ("speech-encoder-decoder", "SpeechEncoderDecoderModel"), ("speech_to_text", "Speech2TextForConditionalGeneration"), ("speecht5", "SpeechT5ForSpeechToText"), ("whisper", "WhisperForConditionalGeneration"), - ("pop2piano", "Pop2PianoForConditionalGeneration"), ] ) diff --git a/src/transformers/models/pop2piano/__init__.py b/src/transformers/models/pop2piano/__init__.py index 0f63bebca64211..f16c76a2dc8065 100644 --- a/src/transformers/models/pop2piano/__init__.py +++ b/src/transformers/models/pop2piano/__init__.py @@ -16,14 +16,15 @@ from ...utils import ( OptionalDependencyNotAvailable, _LazyModule, - is_torch_available, + is_essentia_available, is_librosa_available, + is_pretty_midi_available, is_scipy_available, is_soundfile_availble, - is_essentia_available, - is_pretty_midi_available, + is_torch_available, ) + # Config _import_structure = { "configuration_pop2piano": ["POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", "Pop2PianoConfig"], @@ -44,9 +45,13 @@ # Feature Extractor try: - if not (is_librosa_available() and is_essentia_available() and - is_scipy_available() and is_pretty_midi_available() and - is_soundfile_availble()): + if not ( + is_librosa_available() + and is_essentia_available() + and is_scipy_available() + and is_pretty_midi_available() + and is_soundfile_availble() + ): raise OptionalDependencyNotAvailable() except OptionalDependencyNotAvailable: pass @@ -73,9 +78,13 @@ # Feature Extractor try: - if not (is_librosa_available() and is_essentia_available() and - is_scipy_available() and is_pretty_midi_available() and - is_soundfile_availble()): + if not ( + is_librosa_available() + and is_essentia_available() + and is_scipy_available() + and is_pretty_midi_available() + and is_soundfile_availble() + ): raise OptionalDependencyNotAvailable() except OptionalDependencyNotAvailable: pass @@ -84,4 +93,5 @@ else: import sys + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/pop2piano/configuration_pop2piano.py b/src/transformers/models/pop2piano/configuration_pop2piano.py index 9abf7962c4157d..5307d9b0251ae6 100644 --- a/src/transformers/models/pop2piano/configuration_pop2piano.py +++ b/src/transformers/models/pop2piano/configuration_pop2piano.py @@ -12,57 +12,58 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" Pop2Piano model configuration """ +""" Pop2Piano model configuration""" -from collections import OrderedDict -from typing import TYPE_CHECKING, Any, Mapping, Optional, Union from ...configuration_utils import PretrainedConfig from ...utils import logging + logger = logging.get_logger(__name__) POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP = { - "susnato/pop2piano_dev": "https://huggingface.co/susnato/pop2piano_dev/blob/main/config.json" # For now + "susnato/pop2piano_dev": "https://huggingface.co/susnato/pop2piano_dev/blob/main/config.json" # For now } -COMPOSER_TO_FEATURE_TOKEN = {'composer1': 2052, - 'composer2': 2053, - 'composer3': 2054, - 'composer4': 2055, - 'composer5': 2056, - 'composer6': 2057, - 'composer7': 2058, - 'composer8': 2059, - 'composer9': 2060, - 'composer10': 2061, - 'composer11': 2062, - 'composer12': 2063, - 'composer13': 2064, - 'composer14': 2065, - 'composer15': 2066, - 'composer16': 2067, - 'composer17': 2068, - 'composer18': 2069, - 'composer19': 2070, - 'composer20': 2071, - 'composer21': 2072 +COMPOSER_TO_FEATURE_TOKEN = { + "composer1": 2052, + "composer2": 2053, + "composer3": 2054, + "composer4": 2055, + "composer5": 2056, + "composer6": 2057, + "composer7": 2058, + "composer8": 2059, + "composer9": 2060, + "composer10": 2061, + "composer11": 2062, + "composer12": 2063, + "composer13": 2064, + "composer14": 2065, + "composer15": 2066, + "composer16": 2067, + "composer17": 2068, + "composer18": 2069, + "composer19": 2070, + "composer20": 2071, + "composer21": 2072, } + class Pop2PianoConfig(PretrainedConfig): r""" - This is the configuration class to store the configuration of a [`Pop2PianoForConditionalGeneration`]. It is used to instantiate a - Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model architecture. Instantiating a configuration - with the defaults will yield a similar configuration to that of the Pop2Piano - [sweetcocoa/pop2piano](https://huggingface.co/sweetcocoa/pop2piano) architecture. + This is the configuration class to store the configuration of a [`Pop2PianoForConditionalGeneration`]. It is used + to instantiate a Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model + architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the + Pop2Piano [sweetcocoa/pop2piano](https://huggingface.co/sweetcocoa/pop2piano) architecture. Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from [`PretrainedConfig`] for more information. Arguments: vocab_size (`int`, *optional*, defaults to 2400): - Vocabulary size of the Pop2PianoForConditionalGeneration model. Defines the number of different tokens that can be represented by the - `inputs_ids` passed when calling [`Pop2PianoForConditionalGeneration`]. + Vocabulary size of the Pop2PianoForConditionalGeneration model. Defines the number of different tokens that + can be represented by the `inputs_ids` passed when calling [`Pop2PianoForConditionalGeneration`]. d_model (`int`, *optional*, defaults to 512): Size of the encoder layers and the pooler layer. 
d_kv (`int`, *optional*, defaults to 64): @@ -139,7 +140,6 @@ def __init__( dataset_n_bars=2, dataset_sampling_rate=22050, dataset_mel_is_conditioned=True, - n_fft=4096, hop_length=1024, f_min=10.0, @@ -165,10 +165,12 @@ def __init__( self.dense_act_fn = dense_act_fn self.is_gated_act = act_info[0] == "gated" self.composer_to_feature_token = COMPOSER_TO_FEATURE_TOKEN - self.dataset = {'target_length': dataset_target_length, - 'n_bars': dataset_n_bars, - 'sampling_rate': dataset_sampling_rate, - 'mel_is_conditioned': dataset_mel_is_conditioned} + + self.dataset_mel_is_conditioned = dataset_mel_is_conditioned + self.dataset_target_length = dataset_target_length + self.dataset_n_bars = dataset_n_bars + self.dataset_sampling_rate = dataset_sampling_rate + self.n_fft = n_fft self.hop_length = hop_length self.f_min = f_min diff --git a/src/transformers/models/pop2piano/feature_extraction_pop2piano.py b/src/transformers/models/pop2piano/feature_extraction_pop2piano.py index 7aa280570304c6..3f267310ac0650 100644 --- a/src/transformers/models/pop2piano/feature_extraction_pop2piano.py +++ b/src/transformers/models/pop2piano/feature_extraction_pop2piano.py @@ -12,30 +12,27 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" Feature extractor class for Pop2Piano """ - -import copy -from typing import Any, Dict, List, Optional, Union +""" Feature extractor class for Pop2Piano""" import os +import warnings +from typing import List, Optional, Union -import tensorflow -import torch -import scipy -import librosa -import pathlib import essentia -import warnings -import pretty_midi +import essentia.standard +import librosa import numpy as np +import pretty_midi +import scipy import soundfile as sf -import essentia.standard +import tensorflow +import torch from torch.nn.utils.rnn import pad_sequence -from .configuration_pop2piano import Pop2PianoConfig -from ...utils import TensorType, logging -from ...feature_extraction_utils import BatchFeature from ...feature_extraction_sequence_utils import SequenceFeatureExtractor +from ...feature_extraction_utils import BatchFeature +from ...utils import TensorType, logging + logger = logging.get_logger(__name__) @@ -50,6 +47,7 @@ EOS: int = 1 PAD: int = 0 + class Pop2PianoFeatureExtractor(SequenceFeatureExtractor): r""" Constructs a Pop2Piano feature extractor. @@ -57,45 +55,43 @@ class Pop2PianoFeatureExtractor(SequenceFeatureExtractor): This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. - This class loads audio, extracts rhythm and does preprocesses before being passed through `LogMelSpectrogram`. - This class also contains postprocessing methods to convert model outputs to midi audio and stereo-mix. Args: + This class loads audio, extracts rhythm and does preprocesses before being passed through `LogMelSpectrogram`. This: + class also contains postprocessing methods to convert model outputs to midi audio and stereo-mix. n_bars (`int`, *optional*, defaults to 2): Determines `n_steps` in method `preprocess_mel`. sampling_rate (`int`, *optional*, defaults to 22050): Sample rate of audio signal. use_mel (`bool`, *optional*, defaults to `True`): - Whether to preprocess for `LogMelSpectrogram` or not. - For the current implementation this must be `True`. 
+ Whether to preprocess for `LogMelSpectrogram` or not. For the current implementation this must be `True`. padding_value (`int`, *optional*, defaults to 0): Padding value used to pad the audio. Should correspond to silences. vocab_size_special (`int`, *optional*, defaults to 4): Number of special values. vocab_size_note (`int`, *optional*, defaults to 128): - This represents the number of Note Values. - Note values indicate a pitch event for one of the MIDI pitches. But only the 88 pitches corresponding to - piano keys are actually used. + This represents the number of Note Values. Note values indicate a pitch event for one of the MIDI pitches. + But only the 88 pitches corresponding to piano keys are actually used. vocab_size_velocity (`int`, *optional*, defaults to 2): Number of Velocity tokens. vocab_size_time (`int`, *optional*, defaults to 100): - This represents the number of Beat Shifts. - Beat Shift [100 values] Indicates the relative time shift within the segment quantized into 8th-note - beats(half-beats). + This represents the number of Beat Shifts. Beat Shift [100 values] Indicates the relative time shift within + the segment quantized into 8th-note beats(half-beats). """ model_input_names = ["input_features"] - def __init__(self, - n_bars:int = 2, - sampling_rate:int = 22050, - use_mel:int = True, - padding_value:int = 0, - vocab_size_special:int = 4, - vocab_size_note:int = 128, - vocab_size_velocity:int = 2, - vocab_size_time:int = 100, - feature_size=None, - **kwargs - ): + def __init__( + self, + n_bars: int = 2, + sampling_rate: int = 22050, + use_mel: int = True, + padding_value: int = 0, + vocab_size_special: int = 4, + vocab_size_note: int = 128, + vocab_size_velocity: int = 2, + vocab_size_time: int = 100, + feature_size=None, + **kwargs, + ): super().__init__( feature_size=feature_size, sampling_rate=sampling_rate, @@ -114,8 +110,8 @@ def __init__(self, def extract_rhythm(self, raw_audio): """ This algorithm(`RhythmExtractor2013`) extracts the beat positions and estimates their confidence as well as - tempo in bpm for an audio signal. - For more information please visit https://essentia.upf.edu/reference/std_RhythmExtractor2013.html . + tempo in bpm for an audio signal. For more information please visit + https://essentia.upf.edu/reference/std_RhythmExtractor2013.html . 
""" essentia_tracker = essentia.standard.RhythmExtractor2013(method="multifeature") bpm, beat_times, confidence, estimates, essentia_beat_intervals = essentia_tracker(raw_audio) @@ -130,9 +126,7 @@ def interpolate_beat_times(self, beat_times, steps_per_beat, extend=False): fill_value="extrapolate", ) if extend: - beat_steps_8th = beat_times_function( - np.linspace(0, beat_times.size, beat_times.size * steps_per_beat + 1) - ) + beat_steps_8th = beat_times_function(np.linspace(0, beat_times.size, beat_times.size * steps_per_beat + 1)) else: beat_steps_8th = beat_times_function( np.linspace(0, beat_times.size - 1, beat_times.size * steps_per_beat - 1) @@ -146,16 +140,18 @@ def extrapolate_beat_times(self, beat_times, n_extend=1): bounds_error=False, fill_value="extrapolate", ) - ext_beats = beat_times_function( - np.linspace(0, beat_times.size + n_extend - 1, beat_times.size + n_extend) - ) + ext_beats = beat_times_function(np.linspace(0, beat_times.size + n_extend - 1, beat_times.size + n_extend)) return ext_beats def preprocess_mel( - self, audio, beatstep, n_bars, padding_value, - ): - """ Preprocessing for `LogMelSpectrogram` """ + self, + audio, + beatstep, + n_bars, + padding_value, + ): + """Preprocessing for `LogMelSpectrogram`""" n_steps = n_bars * 4 n_target_step = len(beatstep) @@ -163,9 +159,8 @@ def preprocess_mel( def split_audio(audio): """ - Split audio corresponding beat intervals. - Each audio's lengths are different. - Because each corresponding beat interval times are different. + Split audio corresponding beat intervals. Each audio's lengths are different. Because each corresponding + beat interval times are different. """ batch = [] @@ -186,13 +181,13 @@ def split_audio(audio): return batch, ext_beatstep def single_preprocess( - self, - beatstep, - feature_tokens=None, - audio=None, - n_bars=None, + self, + beatstep, + feature_tokens=None, + audio=None, + n_bars=None, ): - """ Preprocessing method for a single sequence. """ + """Preprocessing method for a single sequence.""" if feature_tokens is None and audio is None: raise ValueError("Both `feature_tokens` and `audio` can't be None at the same time!") @@ -209,27 +204,29 @@ def single_preprocess( beatstep = beatstep - beatstep[0] if self.use_mel: - batch, ext_beatstep = self.preprocess_mel(audio, - beatstep, - n_bars=n_bars, - padding_value=self.padding_value, - ) + batch, ext_beatstep = self.preprocess_mel( + audio, + beatstep, + n_bars=n_bars, + padding_value=self.padding_value, + ) else: raise NotImplementedError("use_mel must be True") return batch, ext_beatstep - def __call__(self, - raw_audio:Union[np.ndarray, List[float], List[np.ndarray]], - audio_sr:int, - steps_per_beat:int=2, - return_tensors:Optional[Union[str, TensorType]]="pt", - **kwargs - ) -> BatchFeature: + def __call__( + self, + raw_audio: Union[np.ndarray, List[float], List[np.ndarray]], + audio_sr: int, + steps_per_beat: int = 2, + return_tensors: Optional[Union[str, TensorType]] = "pt", + **kwargs, + ) -> BatchFeature: """ - Main method to featurize and prepare for the model one sequence. - Please note that `Pop2PianoFeatureExtractor` only accepts one raw_audio at a time. Args: + Main method to featurize and prepare for the model one sequence. Please note that `Pop2PianoFeatureExtractor` + only accepts one raw_audio at a time. raw_audio (`np.ndarray`, `List`): Denotes the raw_audio. audio_sr (`int`): @@ -242,7 +239,9 @@ def __call__(self, - `'pt'`: Return PyTorch `torch.Tensor` objects. - `'np'`: Return Numpy `np.ndarray` objects. 
""" - warnings.warn("Pop2PianoFeatureExtractor only takes one raw_audio at a time, if you want to extract features from more than a single audio then you might need to call it multiple times.") + warnings.warn( + "Pop2PianoFeatureExtractor only takes one raw_audio at a time, if you want to extract features from more than a single audio then you might need to call it multiple times." + ) # If it's [np.ndarray] if isinstance(raw_audio, list) and isinstance(raw_audio[0], np.ndarray): @@ -254,7 +253,7 @@ def __call__(self, if self.sampling_rate != audio_sr and self.sampling_rate is not None: # Change `raw_audio_sr` to `self.sampling_rate` raw_audio = librosa.core.resample( - raw_audio, orig_sr=audio_sr, target_sr=self.sampling_rate, res_type='kaiser_best' + raw_audio, orig_sr=audio_sr, target_sr=self.sampling_rate, res_type="kaiser_best" ) audio_sr = self.sampling_rate start_sample = int(beatsteps[0] * audio_sr) @@ -270,10 +269,13 @@ def __call__(self, ) batch = batch.cpu().numpy() - output = BatchFeature({"input_features": batch, - "beatsteps": beatsteps, - "ext_beatstep": ext_beatstep, - }) + output = BatchFeature( + { + "input_features": batch, + "beatsteps": beatsteps, + "ext_beatstep": ext_beatstep, + } + ) if return_tensors is not None: output = output.convert_to_tensors(return_tensors) @@ -281,15 +283,17 @@ def __call__(self, return output def decode(self, token, time_idx_offset): - """ Decodes the tokens generated by the transformer """ + """Decodes the tokens generated by the transformer""" if token >= (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity): - type, value = TOKEN_TIME, ((token - (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity)) + time_idx_offset) + type, value = TOKEN_TIME, ( + (token - (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity)) + time_idx_offset + ) elif token >= (self.vocab_size_special + self.vocab_size_note): - type, value = TOKEN_VELOCITY, (token - (self.vocab_size_special + self.vocab_size_note)) + type, value = TOKEN_VELOCITY, (token - (self.vocab_size_special + self.vocab_size_note)) value = int(value) elif token >= self.vocab_size_special: - type, value = TOKEN_NOTE, (token - self.vocab_size_special) + type, value = TOKEN_NOTE, (token - self.vocab_size_special) value = int(value) else: type, value = TOKEN_SPECIAL, token @@ -298,14 +302,14 @@ def decode(self, token, time_idx_offset): return [type, value] def relative_batch_tokens_to_midi( - self, - tokens, - beatstep, - beat_offset_idx=None, - bars_per_batch=None, - cutoff_time_idx=None, + self, + tokens, + beatstep, + beat_offset_idx=None, + bars_per_batch=None, + cutoff_time_idx=None, ): - """ Converts tokens to midi """ + """Converts tokens to midi""" beat_offset_idx = 0 if beat_offset_idx is None else beat_offset_idx notes = None @@ -336,13 +340,15 @@ def relative_batch_tokens_to_midi( def relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): # decoding If the first token is an arranger - if tokens[0] >= (self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity + self.vocab_size_time): + if tokens[0] >= ( + self.vocab_size_special + self.vocab_size_note + self.vocab_size_velocity + self.vocab_size_time + ): tokens = tokens[1:] words = [self.decode(token, time_idx_offset=0) for token in tokens] if hasattr(start_idx, "item"): - """ if numpy or torch tensor """ + """if numpy or torch tensor""" start_idx = start_idx.item() current_idx = start_idx @@ -374,9 +380,7 @@ def 
relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): pass else: offset_idx = current_idx - notes.append( - [onset_idx, offset_idx, pitch, DEFAULT_VELOCITY] - ) + notes.append([onset_idx, offset_idx, pitch, DEFAULT_VELOCITY]) note_onsets_ready[pitch] = None else: # note_on @@ -390,9 +394,7 @@ def relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): pass else: offset_idx = current_idx - notes.append( - [onset_idx, offset_idx, pitch, DEFAULT_VELOCITY] - ) + notes.append([onset_idx, offset_idx, pitch, DEFAULT_VELOCITY]) note_onsets_ready[pitch] = current_idx else: raise ValueError @@ -417,7 +419,7 @@ def relative_tokens_to_notes(self, tokens, start_idx, cutoff_time_idx=None): return notes def notes_to_midi(self, notes, beatstep, offset_sec=None): - """ Converts notes to midi """ + """Converts notes to midi""" new_pm = pretty_midi.PrettyMIDI(resolution=384, initial_tempo=120.0) new_inst = pretty_midi.Instrument(program=0) @@ -439,7 +441,7 @@ def notes_to_midi(self, notes, beatstep, offset_sec=None): return new_pm def get_stereo(self, pop_y, midi_y, pop_scale=0.99): - """ Generates stereo audio using `pop audio(`pop_y`)` and `generated midi audio(`midi_y`)` """ + """Generates stereo audio using `pop audio(`pop_y`)` and `generated midi audio(`midi_y`)`""" if len(pop_y) > len(midi_y): midi_y = np.pad(midi_y, (0, len(pop_y) - len(midi_y))) @@ -449,7 +451,7 @@ def get_stereo(self, pop_y, midi_y, pop_scale=0.99): return stereo def _to_np(self, tensor): - """ Converts tensorflow or pytorch tensor to np.ndarray. """ + """Converts tensorflow or pytorch tensor to np.ndarray.""" if isinstance(tensor, np.ndarray): return tensor elif isinstance(tensor, torch.Tensor): @@ -457,24 +459,25 @@ def _to_np(self, tensor): elif isinstance(tensor, tensorflow.Tensor): return tensor.numpy() - def postprocess(self, - relative_tokens:Union[TensorType], - beatsteps:Union[TensorType], - ext_beatstep:Union[TensorType], - raw_audio:Union[np.ndarray, List[float], List[np.ndarray]], - sampling_rate:int, - mix_sampling_rate=None, - save_path:str=None, - audio_file_name:str=None, - save_midi:bool=False, - save_mix:bool=False, - click_amp:float=0.2, - stereo_amp:float=0.5, - add_click:bool=False, - ): + def postprocess( + self, + relative_tokens: Union[TensorType], + beatsteps: Union[TensorType], + ext_beatstep: Union[TensorType], + raw_audio: Union[np.ndarray, List[float], List[np.ndarray]], + sampling_rate: int, + mix_sampling_rate=None, + save_path: str = None, + audio_file_name: str = None, + save_midi: bool = False, + save_mix: bool = False, + click_amp: float = 0.2, + stereo_amp: float = 0.5, + add_click: bool = False, + ): r""" - Postprocess step. It also saves the `"generated midi audio"`, `"stereo-mix"` Args: + Postprocess step. It also saves the `"generated midi audio"`, `"stereo-mix"` relative_tokens ([`~utils.TensorType`]): Output of `Pop2PianoConditionalGeneration` model. beatsteps ([`~utils.TensorType`]): @@ -512,8 +515,10 @@ def postprocess(self, raise ValueError("If you want to save any mix or midi file then you must define save_path.") if save_path and (not save_midi and not save_mix): - raise ValueError("You are setting save_path but not saving anything, use save_midi=True to " - "save the midi file and use save_mix to save the mix file or do both!") + raise ValueError( + "You are setting save_path but not saving anything, use save_midi=True to " + "save the midi file and use save_mix to save the mix file or do both!" 
+ ) mix_sampling_rate = sampling_rate if mix_sampling_rate is None else mix_sampling_rate @@ -524,11 +529,12 @@ def postprocess(self, else: raise ValueError(f"Is {save_path} a directory?") - pm, notes = self.relative_batch_tokens_to_midi(tokens=relative_tokens, - beatstep=ext_beatstep, - bars_per_batch=self.n_bars, - cutoff_time_idx=(self.n_bars + 1) * 4, - ) + pm, notes = self.relative_batch_tokens_to_midi( + tokens=relative_tokens, + beatstep=ext_beatstep, + bars_per_batch=self.n_bars, + cutoff_time_idx=(self.n_bars + 1) * 4, + ) for n in pm.instruments[0].notes: n.start += beatsteps[0] n.end += beatsteps[0] @@ -555,4 +561,4 @@ def postprocess(self, ) print(f"stereo-mix file saved at {mix_path}!") - return pm \ No newline at end of file + return pm diff --git a/src/transformers/models/pop2piano/modeling_pop2piano.py b/src/transformers/models/pop2piano/modeling_pop2piano.py index 6d578c5144c92a..8488fe06a7becf 100644 --- a/src/transformers/models/pop2piano/modeling_pop2piano.py +++ b/src/transformers/models/pop2piano/modeling_pop2piano.py @@ -17,32 +17,32 @@ import copy import math -import random import warnings -import torchaudio -from typing import Optional, Tuple, Union, List +from typing import Optional, Tuple, Union import numpy as np import torch -from ...feature_extraction_utils import BatchFeature -import torch.utils.checkpoint +import torchaudio from torch import nn from torch.nn import CrossEntropyLoss -from ...generation.utils import GreedySearchEncoderDecoderOutput - -from ...pytorch_utils import ALL_LAYERNORM_LAYERS, find_pruneable_heads_and_indices, prune_linear_layer +from torch.utils.checkpoint import checkpoint from ...activations import ACT2FN - +from ...feature_extraction_utils import BatchFeature from ...modeling_outputs import ( BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions, Seq2SeqLMOutput, - Seq2SeqModelOutput, - BackboneOutput, ) from ...modeling_utils import PreTrainedModel -from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings, is_torch_fx_proxy +from ...pytorch_utils import ALL_LAYERNORM_LAYERS, find_pruneable_heads_and_indices, prune_linear_layer +from ...utils import ( + add_start_docstrings, + add_start_docstrings_to_model_forward, + is_torch_fx_proxy, + logging, + replace_return_docstrings, +) from .configuration_pop2piano import Pop2PianoConfig @@ -52,7 +52,7 @@ _CHECKPOINT_FOR_DOC = "susnato/pop2piano_dev" POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST = [ - "susnato/pop2piano_dev", # For now + "susnato/pop2piano_dev", # For now # See all Pop2Piano models at https://huggingface.co/models?filter=pop2piano ] @@ -60,26 +60,22 @@ Pop2Piano_INPUTS_DOCSTRING = r""" Args: input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): - Indices of input sequence tokens in the vocabulary. T5 is a model with relative position embeddings so you - should be able to pad the inputs on both the right and the left. - Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and - [`PreTrainedTokenizer.__call__`] for detail. - [What are input IDs?](../glossary#input-ids) - To know more on how to prepare `input_ids` for pretraining take a look a [T5 Training](./t5#training). + Indices of input sequence tokens in the vocabulary. Pop2Piano is a model with relative position embeddings + so you should be able to pad the inputs on both the right and the left. Indices can be obtained using + [`AutoTokenizer`]. 
See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for detail. + [What are input IDs?](../glossary#input-ids) To know more on how to prepare `input_ids` for pretraining + take a look a [Pop2Pianp Training](./Pop2Piano#training). attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*): Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. [What are attention masks?](../glossary#attention-mask) decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): - Indices of decoder input sequence tokens in the vocabulary. - Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and - [`PreTrainedTokenizer.__call__`] for details. - [What are decoder input IDs?](../glossary#decoder-input-ids) - T5 uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If `past_key_values` - is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`). - To know more on how to prepare `decoder_input_ids` for pretraining take a look at [T5 - Training](./t5#training). + Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using + [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. + [What are decoder input IDs?](../glossary#decoder-input-ids) Pop2Piano uses the `pad_token_id` as the + starting token for `decoder_input_ids` generation. If `past_key_values` is used, optionally only the last + `decoder_input_ids` have to be input (see `past_key_values`). To know more on how to prepare decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also be used by default. @@ -115,9 +111,9 @@ Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `past_key_values`). This is useful if you want more control over how to convert - `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix. - If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value - of `inputs_embeds`. + `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix. If + `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value of + `inputs_embeds`. use_cache (`bool`, *optional*): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). @@ -131,6 +127,7 @@ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
""" + class Pop2PianoPreTrainedModel(PreTrainedModel): """ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained @@ -143,7 +140,7 @@ class Pop2PianoPreTrainedModel(PreTrainedModel): supports_gradient_checkpointing = False _no_split_modules = None _keep_in_fp32_modules = ["wo"] - + def _init_weights(self, module): """Initialize the weights""" factor = self.config.initializer_factor # Used for testing weights initialization @@ -198,10 +195,9 @@ def _shift_right(self, input_ids): decoder_start_token_id = self.config.decoder_start_token_id pad_token_id = self.config.pad_token_id - assert decoder_start_token_id is not None, ( - "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id." - " See T5 docs for more information" - ) + assert ( + decoder_start_token_id is not None + ), "self.model.config.decoder_start_token_id has to be defined. In Pop2Piano it is usually set to the pad_token_id." # shift inputs to the right if is_torch_fx_proxy(input_ids): @@ -219,8 +215,9 @@ def _shift_right(self, input_ids): return shifted_input_ids + class LogMelSpectrogram(nn.Module): - """ Generates MelSpectrogram then applies log base e. """ + """Generates MelSpectrogram then applies log base e.""" def __init__(self, sampling_rate, n_fft, hop_length, f_min, n_mels): super(LogMelSpectrogram, self).__init__() @@ -240,8 +237,9 @@ def forward(self, x): return X + class ConcatEmbeddingToMel(nn.Module): - """ Embedding Matrix for `composer` tokens. """ + """Embedding Matrix for `composer` tokens.""" def __init__(self, embedding_offset, n_vocab, n_dim) -> None: super(ConcatEmbeddingToMel, self).__init__() @@ -254,7 +252,8 @@ def forward(self, feature, index_value): inputs_embeds = torch.cat([composer_embedding, feature], dim=1) return inputs_embeds -# Copied from transformers.models.t5.T5LayerNorm with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5LayerNorm with T5->Pop2Piano class Pop2PianoLayerNorm(nn.Module): def __init__(self, hidden_size, eps=1e-6): """ @@ -296,7 +295,7 @@ def forward(self, hidden_states): ALL_LAYERNORM_LAYERS.append(Pop2PianoLayerNorm) -# Copied from transformers.models.t5.T5LayerSelfAttention with T5->Pop2Piano,t5->pop2piano +# Copied from transformers.models.t5.modeling_t5.T5LayerSelfAttention with T5->Pop2Piano,t5->pop2piano class Pop2PianoLayerSelfAttention(nn.Module): def __init__(self, config, has_relative_attention_bias=False): super().__init__() @@ -328,7 +327,8 @@ def forward( outputs = (hidden_states,) + attention_output[1:] # add attentions if we output them return outputs -# Copied from transformers.models.t5.T5LayerCrossAttention with T5->Pop2Piano,t5->pop2piano + +# Adapted from transformers.models.t5.modeling_t5.T5Attention with T5->Pop2Piano,t5->pop2piano class Pop2PianoAttention(nn.Module): def __init__(self, config: Pop2PianoConfig, has_relative_attention_bias=False): super().__init__() @@ -372,19 +372,18 @@ def prune_heads(self, heads): @staticmethod def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128): """ + Args: Adapted from Mesh Tensorflow: - https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593 + https: + //github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593 Translate relative position to a bucket number for 
relative attention. The relative position is defined as memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for small absolute relative_position and larger buckets for larger absolute relative_positions. All relative - positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket. + positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.: This should allow for more graceful generalization to longer sequences than the model has been trained on - Args: - relative_position: an int32 Tensor - bidirectional: a boolean - whether the attention is bidirectional - num_buckets: an integer - max_distance: an integer + relative_position: an int32 Tensor bidirectional: a boolean - whether the attention is bidirectional + num_buckets: an integer max_distance: an integer Returns: a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets) """ @@ -559,11 +558,12 @@ def project(hidden_states, proj_layer, key_value_states, past_key_value): outputs = outputs + (attn_weights,) return outputs -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Adapted from transformers.models.t5.modeling_t5.T5LayerFF with T5->Pop2Piano,t5->pop2piano class Pop2PianoLayerFF(nn.Module): def __init__(self, config: Pop2PianoConfig): super().__init__() - if config.is_gated_act: + if config.is_gated_act or config.feed_forward_proj.split("-")[0] == "gated": self.DenseReluDense = Pop2PianoDenseGatedActDense(config) else: self.DenseReluDense = Pop2PianoDenseActDense(config) @@ -577,7 +577,8 @@ def forward(self, hidden_states): hidden_states = hidden_states + self.dropout(forwarded_states) return hidden_states -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5DenseActDense with T5->Pop2Piano,t5->pop2piano class Pop2PianoDenseActDense(nn.Module): def __init__(self, config: Pop2PianoConfig): super().__init__() @@ -590,12 +591,17 @@ def forward(self, hidden_states): hidden_states = self.wi(hidden_states) hidden_states = self.act(hidden_states) hidden_states = self.dropout(hidden_states) - if hidden_states.dtype != self.wo.weight.dtype and self.wo.weight.dtype != torch.int8: + if ( + isinstance(self.wo.weight, torch.Tensor) + and hidden_states.dtype != self.wo.weight.dtype + and self.wo.weight.dtype != torch.int8 + ): hidden_states = hidden_states.to(self.wo.weight.dtype) hidden_states = self.wo(hidden_states) return hidden_states -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5DenseGatedActDense with T5->Pop2Piano class Pop2PianoDenseGatedActDense(nn.Module): def __init__(self, config: Pop2PianoConfig): super().__init__() @@ -614,13 +620,18 @@ def forward(self, hidden_states): # To make 8bit quantization work for google/flan-t5-xxl, self.wo is kept in float32. 
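The `isinstance`/dtype guard added above is easier to see in isolation. The following sketch (made-up layer sizes, not the library code) shows the intent: cast fp16 activations up to the dtype of a final projection that is kept in fp32, but never touch a weight that is not a plain floating-point tensor (e.g. an int8 quantized one).

```python
import torch
import torch.nn as nn

wo = nn.Linear(8, 4)  # final projection; its weights are fp32 by default
hidden_states = torch.randn(2, 8, dtype=torch.float16)  # activations may arrive in fp16

if (
    isinstance(wo.weight, torch.Tensor)  # guard against wrapped, non-tensor weight containers
    and hidden_states.dtype != wo.weight.dtype
    and wo.weight.dtype != torch.int8  # never upcast into an int8 weight's dtype
):
    hidden_states = hidden_states.to(wo.weight.dtype)

print(wo(hidden_states).dtype)  # torch.float32
```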
# See https://github.com/huggingface/transformers/issues/20287 # we also make sure the weights are not in `int8` in case users will force `_keep_in_fp32_modules` to be `None`` - if hidden_states.dtype != self.wo.weight.dtype and self.wo.weight.dtype != torch.int8: + if ( + isinstance(self.wo.weight, torch.Tensor) + and hidden_states.dtype != self.wo.weight.dtype + and self.wo.weight.dtype != torch.int8 + ): hidden_states = hidden_states.to(self.wo.weight.dtype) hidden_states = self.wo(hidden_states) return hidden_states -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5LayerCrossAttention with T5->Pop2Piano,t5->pop2piano class Pop2PianoLayerCrossAttention(nn.Module): def __init__(self, config): super().__init__() @@ -656,7 +667,8 @@ def forward( outputs = (layer_output,) + attention_output[1:] # add attentions if we output them return outputs -# Copied from transformers.models.t5.T5Block with T5->Pop2Piano,t5->pop2piano + +# Copied from transformers.models.t5.modeling_t5.T5Block with T5->Pop2Piano,t5->pop2piano class Pop2PianoBlock(nn.Module): def __init__(self, config, has_relative_attention_bias=False): super().__init__() @@ -713,8 +725,12 @@ def forward( attention_outputs = self_attention_outputs[2:] # Keep self-attention outputs and relative position weights # clamp inf values to enable fp16 training - if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): - clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + if hidden_states.dtype == torch.float16: + clamp_value = torch.where( + torch.isinf(hidden_states).any(), + torch.finfo(hidden_states.dtype).max - 1000, + torch.finfo(hidden_states.dtype).max, + ) hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) do_cross_attention = self.is_decoder and encoder_hidden_states is not None @@ -740,8 +756,12 @@ def forward( hidden_states = cross_attention_outputs[0] # clamp inf values to enable fp16 training - if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): - clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + if hidden_states.dtype == torch.float16: + clamp_value = torch.where( + torch.isinf(hidden_states).any(), + torch.finfo(hidden_states.dtype).max - 1000, + torch.finfo(hidden_states.dtype).max, + ) hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) # Combine self attn and cross attn key value states @@ -755,8 +775,12 @@ def forward( hidden_states = self.layer[-1](hidden_states) # clamp inf values to enable fp16 training - if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): - clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + if hidden_states.dtype == torch.float16: + clamp_value = torch.where( + torch.isinf(hidden_states).any(), + torch.finfo(hidden_states.dtype).max - 1000, + torch.finfo(hidden_states.dtype).max, + ) hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) outputs = (hidden_states,) @@ -769,7 +793,7 @@ def forward( return outputs # hidden-states, present_key_value_states, (self-attention position bias), (self-attention weights), (cross-attention position bias), (cross-attention weights) -# Copied from transformers.models.t5.T5Stack with T5->Pop2Piano,t5->pop2piano +# Adapted from transformers.models.t5.modeling_t5.T5Stack with T5->Pop2Piano,t5->pop2piano class Pop2PianoStack(Pop2PianoPreTrainedModel): def __init__(self, config, embed_tokens=None): 
super().__init__(config) @@ -1015,15 +1039,14 @@ def custom_forward(*inputs): Pop2Piano_START_DOCSTRING = r""" - The Pop2PianoForConditionalGeneration model was proposed in [POP2PIANO : POP AUDIO-BASED PIANO COVER GENERATION](https://arxiv.org/pdf/2211.00895) by Jongho Choi, Kyogu - Lee. It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting. - This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the - library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads - etc.) - This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. - Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage - and behavior. Parameters: + The Pop2PianoForConditionalGeneration model was proposed in [POP2PIANO : POP AUDIO-BASED PIANO COVER + GENERATION](https://arxiv.org/pdf/2211.00895) by Jongho Choi, Kyogu Lee. It's an encoder decoder transformer + pre-trained in a text-to-text denoising generative setting. This model inherits from [`PreTrainedModel`]. Check the: + superclass documentation for the generic methods the library implements for all its model (such as downloading or + saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch + [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch + Module and refer to the PyTorch documentation for all matter related to general usage and behavior. config ([`Pop2PianoConfig`]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. @@ -1037,7 +1060,16 @@ def custom_forward(*inputs): num_heads)`. """ -# Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration with T5->Pop2Piano,t5->pop2piano +# Warning message for FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask +__HEAD_MASK_WARNING_MSG = """ +The input argument `head_mask` was split into two arguments `head_mask` and `decoder_head_mask`. Currently, +`decoder_head_mask` is set to copy `head_mask`, but this feature is deprecated and will be removed in future versions. +If you do not want to use any `decoder_head_mask` now, please set `decoder_head_mask = torch.ones(num_layers, +num_heads)`. 
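The deprecation message above asks callers to pass an explicit `decoder_head_mask` instead of relying on `head_mask` being copied. A small sketch of what that looks like at call time; the layer and head counts are placeholders that would come from the model config in practice.

```python
import torch

num_layers, num_heads = 6, 8  # in practice: config.num_decoder_layers, config.num_heads
head_mask = torch.ones(num_layers, num_heads)          # keep all encoder heads
decoder_head_mask = torch.ones(num_layers, num_heads)  # explicit, so the copy-and-warn path is never taken

# outputs = model(input_ids=input_ids, head_mask=head_mask, decoder_head_mask=decoder_head_mask)
```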
+""" + + +# Adapted from transformers.models.t5.modeling_t5.T5ForConditionalGeneration with T5->Pop2Piano,t5->pop2piano @add_start_docstrings("""Pop2Piano Model with a `language modeling` head on top.""", Pop2Piano_START_DOCSTRING) class Pop2PianoForConditionalGeneration(Pop2PianoPreTrainedModel): _keys_to_ignore_on_load_missing = [ @@ -1054,20 +1086,20 @@ def __init__(self, config: Pop2PianoConfig): self.config = config self.model_dim = config.d_model - self.spectrogram = LogMelSpectrogram(sampling_rate=config.dataset.get("sampling_rate"), - n_fft=config.n_fft, - hop_length=config.hop_length, - f_min=config.f_min, - n_mels=config.n_mels - ) - if config.dataset.get("mel_is_conditioned", True): + self.spectrogram = LogMelSpectrogram( + sampling_rate=config.dataset_sampling_rate, + n_fft=config.n_fft, + hop_length=config.hop_length, + f_min=config.f_min, + n_mels=config.n_mels, + ) + if config.dataset_mel_is_conditioned: n_dim = 512 composer_n_vocab = len(config.composer_to_feature_token) embedding_offset = min(config.composer_to_feature_token.values()) - self.mel_conditioner = ConcatEmbeddingToMel(embedding_offset=embedding_offset, - n_vocab=composer_n_vocab, - n_dim=n_dim - ) + self.mel_conditioner = ConcatEmbeddingToMel( + embedding_offset=embedding_offset, n_vocab=composer_n_vocab, n_dim=n_dim + ) self.shared = nn.Embedding(config.vocab_size, config.d_model) @@ -1116,23 +1148,23 @@ def get_decoder(self): @add_start_docstrings_to_model_forward(Pop2Piano_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC) def forward( - self, - input_ids: Optional[torch.LongTensor] = None, - attention_mask: Optional[torch.FloatTensor] = None, - decoder_input_ids: Optional[torch.LongTensor] = None, - decoder_attention_mask: Optional[torch.BoolTensor] = None, - head_mask: Optional[torch.FloatTensor] = None, - decoder_head_mask: Optional[torch.FloatTensor] = None, - cross_attn_head_mask: Optional[torch.Tensor] = None, - encoder_outputs: Optional[Tuple[Tuple[torch.Tensor]]] = None, - past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, - inputs_embeds: Optional[torch.FloatTensor] = None, - decoder_inputs_embeds: Optional[torch.FloatTensor] = None, - labels: Optional[torch.LongTensor] = None, - use_cache: Optional[bool] = None, - output_attentions: Optional[bool] = None, - output_hidden_states: Optional[bool] = None, - return_dict: Optional[bool] = None, + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.BoolTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + decoder_head_mask: Optional[torch.FloatTensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[Tuple[Tuple[torch.Tensor]]] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + decoder_inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, ) -> Union[Tuple[torch.FloatTensor], Seq2SeqLMOutput]: r""" labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): @@ -1140,25 +1172,7 @@ def forward( config.vocab_size - 1]`. 
All labels set to `-100` are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` Returns: - Examples: - ```python - >>> from transformers import AutoTokenizer, T5ForConditionalGeneration - >>> tokenizer = AutoTokenizer.from_pretrained("t5-small") - >>> model = T5ForConditionalGeneration.from_pretrained("t5-small") - >>> # training - >>> input_ids = tokenizer("The walks in park", return_tensors="pt").input_ids - >>> labels = tokenizer(" cute dog the ", return_tensors="pt").input_ids - >>> outputs = model(input_ids=input_ids, labels=labels) - >>> loss = outputs.loss - >>> logits = outputs.logits - >>> # inference - >>> input_ids = tokenizer( - ... "summarize: studies have shown that owning a dog is good for you", return_tensors="pt" - ... ).input_ids # Batch size 1 - >>> outputs = model.generate(input_ids) - >>> print(tokenizer.decode(outputs[0], skip_special_tokens=True)) - >>> # studies have shown that owning a dog is good for you. - ```""" + """ use_cache = use_cache if use_cache is not None else self.config.use_cache return_dict = return_dict if return_dict is not None else self.config.use_return_dict @@ -1234,7 +1248,7 @@ def forward( if self.config.tie_word_embeddings: # Rescale output before projecting on vocab # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586 - sequence_output = sequence_output * (self.model_dim ** -0.5) + sequence_output = sequence_output * (self.model_dim**-0.5) lm_logits = self.lm_head(sequence_output) @@ -1262,33 +1276,28 @@ def forward( @torch.no_grad() def generate( - self, - input_features:BatchFeature, - inputs_embeds=None, - composer="composer1", - n_bars:int = 2, - max_length:int=None, - inputs: Optional[torch.Tensor] = None, - generation_config=None, - logits_processor=None, - stopping_criteria=None, - prefix_allowed_tokens_fn=None, - synced_gpus=False, - return_timestamps=None, - task=None, - language=None, - is_multilingual=None, - **kwargs, + self, + input_features: BatchFeature, + inputs_embeds=None, + composer="composer1", + n_bars: int = 2, + max_length: int = None, + inputs: Optional[torch.Tensor] = None, + generation_config=None, + **kwargs, ): """ Generates sequences of token ids for models with a language modeling head. + + Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the model's default generation configuration. You can override any `generation_config` by passing the corresponding - parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. - For an overview of generation strategies and code examples, check out the [following - guide](./generation_strategies). + parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. For an overview of generation + strategies and code examples, check out the [following guide](./generation_strategies). + + Parameters: input_features (`BatchFeature`): `input_features` returned by `Pop2PianoFeatureExtractor.__call__` @@ -1314,33 +1323,6 @@ def generate( priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s default values, whose documentation should be checked to parameterize generation. 
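As the `generation_config` description above notes, per-call keyword arguments override the stored generation configuration. A short sketch of both styles; the `model` and `features` objects are assumed to exist already.

```python
from transformers import GenerationConfig

# Build an explicit configuration once and pass it in:
generation_config = GenerationConfig(max_length=256, num_beams=1)
# outputs = model.generate(input_features=features, generation_config=generation_config)

# Or override individual settings ad hoc; these take precedence over the stored defaults:
# outputs = model.generate(input_features=features, num_beams=4, do_sample=True)
```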
- logits_processor (`LogitsProcessorList`, *optional*): - Custom logits processors that complement the default logits processors built from arguments and - generation config. If a logit processor is passed that is already created with the arguments or a - generation config an error is thrown. This feature is intended for advanced users. - stopping_criteria (`StoppingCriteriaList`, *optional*): - Custom stopping criteria that complement the default stopping criteria built from arguments and a - generation config. If a stopping criteria is passed that is already created with the arguments or a - generation config an error is thrown. This feature is intended for advanced users. - prefix_allowed_tokens_fn (`Callable[[int, torch.Tensor], List[int]]`, *optional*): - If provided, this function constraints the beam search to allowed tokens only at each step. If not - provided no constraint is applied. This function takes 2 arguments: the batch ID `batch_id` and - `input_ids`. It has to return a list with the allowed tokens for the next generation step conditioned - on the batch ID `batch_id` and the previously generated tokens `inputs_ids`. This argument is useful - for constrained generation conditioned on the prefix, as described in [Autoregressive Entity - Retrieval](https://arxiv.org/abs/2010.00904). - synced_gpus (`bool`, *optional*, defaults to `False`): - Whether to continue running the while loop until max_length (needed for ZeRO stage 3) - return_timestamps (`bool`, *optional*): - Whether to return the timestamps with the text. This enables the `WhisperTimestampsLogitsProcessor`. - task (`bool`, *optional*): - Task to use for generation, either "translate" or "transcribe". The `model.config.forced_decoder_ids` - will be updated accordingly. - language (`bool`, *optional*): - Language token to use for generation, can be either in the form of `<|en|>`, `en` or `english`. You can - find all the possible language tokens in the `model.generation_config.lang_to_id` dictionary. - is_multilingual (`bool`, *optional*): - Whether or not the model is multilingual. kwargs: Ad hoc parametrization of `generate_config` and/or additional model-specific kwargs that will be forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder @@ -1360,87 +1342,56 @@ def generate( - [`~generation.SampleEncoderDecoderOutput`], - [`~generation.BeamSearchEncoderDecoderOutput`], - [`~generation.BeamSampleEncoderDecoderOutput`] - """ + Examples: + ```python + >>> import librosa + >>> from transformers import Pop2PianoFeatureExtractor, Pop2PianoForConditionalGeneration + + >>> raw_audio, sr = librosa.load("audio.mp3", sr=44100) + >>> model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev") + >>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("susnato/pop2piano_dev") + >>> model.eval() + + >>> feature_extractor_outputs = feature_extractor(raw_audio=raw_audio, audio_sr=sr, return_tensors="pt") + >>> model_outputs = model.generate(feature_extractor_outputs, composer="composer1") + + >>> prettymidi_output = feature_extractor.postprocess( + ... relative_tokens=model_outputs, + ... beatsteps=feature_extractor_outputs["beatsteps"], + ... ext_beatstep=feature_extractor_outputs["ext_beatstep"], + ... raw_audio=raw_audio, + ... sampling_rate=sr, + ... mix_sampling_rate=sr, + ... save_path="./Outputs/", + ... audio_file_name="output_filename", + ... save_midi=True, + ... save_mix=True, + ... 
) + ```""" if input_features is not None and inputs_embeds is not None: raise ValueError("Both input_features and inputs_embeds received. Please give only input_features") if generation_config is None: generation_config = self.generation_config - if return_timestamps is not None: - if not hasattr(generation_config, "no_timestamps_token_id"): - raise ValueError( - "You are trying to return timestamps, but the generation config is not properly set." - "Make sure to initialize the generation config with the correct attributes that are needed such as `no_timestamps_token_id`." - "For more details on how to generate the approtiate config, refer to https://github.com/huggingface/transformers/issues/21878#issuecomment-1451902363" - ) - - generation_config.return_timestamps = return_timestamps - else: - generation_config.return_timestamps = False - - if language is not None: - generation_config.language = language - if task is not None: - generation_config.task = task - - forced_decoder_ids = [] - if task is not None or language is not None: - if hasattr(generation_config, "language"): - if generation_config.language in generation_config.lang_to_id.keys(): - language_token = generation_config.language - elif generation_config.language in TO_LANGUAGE_CODE.keys(): - language_token = f"<|{TO_LANGUAGE_CODE[generation_config.language]}|>" - else: - raise ValueError( - f"Unsupported language: {self.language}. Language should be one of:" - f" {list(TO_LANGUAGE_CODE.keys()) if generation_config.language in TO_LANGUAGE_CODE.keys() else list(TO_LANGUAGE_CODE.values())}." - ) - forced_decoder_ids.append((1, generation_config.lang_to_id[language_token])) - else: - forced_decoder_ids.append((1, None)) # automatically detect the language - - if hasattr(generation_config, "task"): - if generation_config.task in TASK_IDS: - forced_decoder_ids.append((2, generation_config.task_to_id[generation_config.task])) - else: - raise ValueError( - f"The `{generation_config.task}`task is not supported. 
The task should be one of `{TASK_IDS}`" - ) - else: - forced_decoder_ids.append((2, generation_config.task_to_id["transcribe"])) # defaults to transcribe - if hasattr(generation_config, "no_timestamps_token_id") and not generation_config.return_timestamps: - idx = forced_decoder_ids[-1][0] + 1 if forced_decoder_ids else 1 - forced_decoder_ids.append((idx, generation_config.no_timestamps_token_id)) - - # Legacy code for backward compatibility - elif hasattr(self.config, "forced_decoder_ids") and self.config.forced_decoder_ids is not None: - forced_decoder_ids = self.config.forced_decoder_ids - elif ( - hasattr(self.generation_config, "forced_decoder_ids") - and self.generation_config.forced_decoder_ids is not None - ): - forced_decoder_ids = self.generation_config.forced_decoder_ids - - if generation_config.return_timestamps: - logits_processor = [WhisperTimeStampLogitsProcessor(generation_config)] - - if len(forced_decoder_ids) > 0: - generation_config.forced_decoder_ids = forced_decoder_ids - # select composer randomly if not already given composer_to_feature_token = self.config.composer_to_feature_token if composer is None: composer = np.random.choice(list(composer_to_feature_token.keys()), size=1)[0] elif composer not in composer_to_feature_token.keys(): - raise ValueError(f"Composer not found in list, Please choose from {list(composer_to_feature_token.keys())}") + raise ValueError( + f"Composer not found in list, Please choose from {list(composer_to_feature_token.keys())}" + ) - n_bars = self.config.dataset.get("n_bars", None) if n_bars is None else n_bars - max_length = self.config.dataset.get("target_length") * max(1, (n_bars // self.config.dataset.get("n_bars"))) \ - if max_length is None else max_length + n_bars = self.config.dataset_n_bars if n_bars is None else n_bars + max_length = ( + self.config.dataset_target_length * max(1, (n_bars // self.config.dataset_n_bars)) + if max_length is None + else max_length + ) inputs_embeds = self.spectrogram(input_features["input_features"]).transpose(-1, -2) - if self.config.dataset.get("mel_is_conditioned", None): + if self.config.dataset_mel_is_conditioned: composer_value = composer_to_feature_token[composer] composer_value = torch.tensor(composer_value, device=self.device) composer_value = composer_value.repeat(inputs_embeds.shape[0]) @@ -1449,26 +1400,22 @@ def generate( return super().generate( inputs, generation_config, - logits_processor, - stopping_criteria, - prefix_allowed_tokens_fn, - synced_gpus, inputs_embeds=inputs_embeds, max_length=max_length, **kwargs, ) def prepare_inputs_for_generation( - self, - input_ids, - past_key_values=None, - attention_mask=None, - head_mask=None, - decoder_head_mask=None, - cross_attn_head_mask=None, - use_cache=None, - encoder_outputs=None, - **kwargs, + self, + input_ids, + past_key_values=None, + attention_mask=None, + head_mask=None, + decoder_head_mask=None, + cross_attn_head_mask=None, + use_cache=None, + encoder_outputs=None, + **kwargs, ): # cut decoder_input_ids if past is used if past_key_values is not None: diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py index 80b6e9c72b30f6..d265c854bcbb63 100644 --- a/src/transformers/testing_utils.py +++ b/src/transformers/testing_utils.py @@ -55,6 +55,7 @@ is_cython_available, is_decord_available, is_detectron2_available, + is_essentia_available, is_faiss_available, is_flax_available, is_ftfy_available, @@ -62,12 +63,11 @@ is_jumanpp_available, is_keras_nlp_available, is_librosa_available, - is_essentia_available, - 
is_pretty_midi_available, is_natten_available, is_onnx_available, is_pandas_available, is_phonemizer_available, + is_pretty_midi_available, is_pyctcdecode_available, is_pytesseract_available, is_pytorch_quantization_available, @@ -706,18 +706,21 @@ def require_librosa(test_case): """ return unittest.skipUnless(is_librosa_available(), "test requires librosa")(test_case) + def require_essentia(test_case): """ Decorator marking a test that requires essentia """ return unittest.skipUnless(is_essentia_available(), "test requires essentia")(test_case) + def require_pretty_midi(test_case): """ Decorator marking a test that requires pretty_midi """ return unittest.skipUnless(is_pretty_midi_available(), "test requires pretty_midi")(test_case) + def cmd_exists(cmd): return shutil.which(cmd) is not None diff --git a/src/transformers/utils/__init__.py b/src/transformers/utils/__init__.py index 2a91bfa491d18b..fe3c1c65b62058 100644 --- a/src/transformers/utils/__init__.py +++ b/src/transformers/utils/__init__.py @@ -104,6 +104,7 @@ is_datasets_available, is_decord_available, is_detectron2_available, + is_essentia_available, is_faiss_available, is_flax_available, is_ftfy_available, @@ -113,13 +114,12 @@ is_kenlm_available, is_keras_nlp_available, is_librosa_available, - is_essentia_available, - is_pretty_midi_available, is_natten_available, is_ninja_available, is_onnx_available, is_pandas_available, is_phonemizer_available, + is_pretty_midi_available, is_protobuf_available, is_psutil_available, is_py3nvml_available, diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py index a80af49e278499..f800fa96575bb7 100644 --- a/src/transformers/utils/dummy_pt_objects.py +++ b/src/transformers/utils/dummy_pt_objects.py @@ -5111,6 +5111,23 @@ def __init__(self, *args, **kwargs): requires_backends(self, ["torch"]) +POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST = None + + +class Pop2PianoForConditionalGeneration(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + +class Pop2PianoPreTrainedModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST = None diff --git a/src/transformers/utils/import_utils.py b/src/transformers/utils/import_utils.py index 11e06ae94fee63..8d4063c64f568b 100644 --- a/src/transformers/utils/import_utils.py +++ b/src/transformers/utils/import_utils.py @@ -312,12 +312,15 @@ def is_pyctcdecode_available(): def is_librosa_available(): return _librosa_available + def is_essentia_available(): return _essentia_available + def is_pretty_midi_available(): return _pretty_midi_available + def is_torch_cuda_available(): if is_torch_available(): import torch diff --git a/tests/models/pop2piano/test_feature_extraction_pop2piano.py b/tests/models/pop2piano/test_feature_extraction_pop2piano.py index 0df3e8533b9cb2..cc996fdc5fa347 100644 --- a/tests/models/pop2piano/test_feature_extraction_pop2piano.py +++ b/tests/models/pop2piano/test_feature_extraction_pop2piano.py @@ -14,9 +14,7 @@ # limitations under the License. 
-import itertools import os -import random import tempfile import unittest @@ -24,23 +22,41 @@ from datasets import load_dataset from transformers import is_speech_available -from transformers.testing_utils import (check_json_file_has_correct_format, require_torch, - require_essentia, require_librosa, require_scipy, - require_pretty_midi, require_soundfile) -from transformers.utils.import_utils import (is_torch_available, is_essentia_available, - is_scipy_available, is_librosa_available, - is_soundfile_availble, ) +from transformers.testing_utils import ( + check_json_file_has_correct_format, + require_essentia, + require_librosa, + require_pretty_midi, + require_scipy, + require_soundfile, + require_torch, +) +from transformers.utils.import_utils import ( + is_essentia_available, + is_librosa_available, + is_scipy_available, + is_soundfile_availble, + is_torch_available, +) from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin -requirements = is_speech_available() and is_torch_available() and is_essentia_available() and is_scipy_available() and \ - is_librosa_available() and is_soundfile_availble() + +requirements = ( + is_speech_available() + and is_torch_available() + and is_essentia_available() + and is_scipy_available() + and is_librosa_available() + and is_soundfile_availble() +) if requirements: from transformers import Pop2PianoFeatureExtractor if is_torch_available(): import torch + @require_torch @require_essentia @require_librosa @@ -79,9 +95,10 @@ def prepare_feat_extract_dict(self): "vocab_size_special": self.vocab_size_special, "vocab_size_note": self.vocab_size_note, "vocab_size_velocity": self.vocab_size_velocity, - "vocab_size_time":self.vocab_size_time, + "vocab_size_time": self.vocab_size_time, } + @require_torch @require_essentia @require_librosa @@ -126,7 +143,11 @@ def test_feat_extract_to_json_file(self): def test_call(self): feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict()) - speech_input = np.zeros([1000000, ]) + speech_input = np.zeros( + [ + 1000000, + ] + ) input_features = feature_extractor(speech_input, audio_sr=16_000, return_tensors="np") self.assertTrue(input_features.input_features.ndim == 2) @@ -142,12 +163,38 @@ def _load_datasamples(self, num_samples): def test_integration(self): EXPECTED_INPUT_FEATURES = torch.tensor( - [-4.5434e-05, -1.8900e-04, -2.2150e-04, -2.1844e-04, -2.7647e-04, - -2.1334e-04, -1.5305e-04, -2.6124e-04, -2.6863e-04, -1.5969e-04, - -1.6224e-04, -1.2900e-04, -9.9139e-06, 1.5336e-05, 4.7507e-05, - 9.3454e-05, -2.3652e-05, -1.2942e-04, -1.0804e-04, -1.4267e-04, - -1.5102e-04, -6.7488e-05, -9.6527e-05, -9.6909e-05, 8.0032e-05, - 8.1948e-05, -7.3148e-05, 3.4405e-05, 1.5065e-04, -1.0989e-04] + [ + -4.5434e-05, + -1.8900e-04, + -2.2150e-04, + -2.1844e-04, + -2.7647e-04, + -2.1334e-04, + -1.5305e-04, + -2.6124e-04, + -2.6863e-04, + -1.5969e-04, + -1.6224e-04, + -1.2900e-04, + -9.9139e-06, + 1.5336e-05, + 4.7507e-05, + 9.3454e-05, + -2.3652e-05, + -1.2942e-04, + -1.0804e-04, + -1.4267e-04, + -1.5102e-04, + -6.7488e-05, + -9.6527e-05, + -9.6909e-05, + 8.0032e-05, + 8.1948e-05, + -7.3148e-05, + 3.4405e-05, + 1.5065e-04, + -1.0989e-04, + ] ) input_speech, sampling_rate = self._load_datasamples(1) @@ -197,4 +244,4 @@ def test_padding_from_list(self): @unittest.skip("Pop2PianoFeatureExtractor does not supports padding") def test_padding_from_array(self): - pass \ No newline at end of file + pass diff --git 
a/tests/models/pop2piano/test_modeling_pop2piano.py b/tests/models/pop2piano/test_modeling_pop2piano.py index 30777a7c144e61..4b3ffd3da8c2a2 100644 --- a/tests/models/pop2piano/test_modeling_pop2piano.py +++ b/tests/models/pop2piano/test_modeling_pop2piano.py @@ -15,23 +15,18 @@ """ Testing suite for the PyTorch Pop2Piano model. """ import copy -import inspect -import os import tempfile import unittest -import numpy as np - -import transformers from transformers import Pop2PianoConfig -from transformers.testing_utils import is_pt_flax_cross_test, require_torch, require_torchaudio, slow, torch_device -from transformers.utils import cached_property, is_flax_available, is_torch_available from transformers.feature_extraction_utils import BatchFeature +from transformers.testing_utils import require_torch, require_torchaudio, slow, torch_device +from transformers.utils import is_torch_available # from ...test_pipeline_mixin import PipelineTesterMixin from ...generation.test_utils import GenerationTesterMixin from ...test_configuration_common import ConfigTester -from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor, ids_tensor +from ...test_modeling_common import ModelTesterMixin, ids_tensor if is_torch_available(): @@ -39,10 +34,10 @@ from transformers import ( Pop2PianoForConditionalGeneration, - set_seed, ) from transformers.models.pop2piano.modeling_pop2piano import POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST + class Pop2PianoModelTester: def __init__( self, @@ -393,7 +388,9 @@ def create_and_check_model_fp16_forward( lm_labels, ): model = Pop2PianoForConditionalGeneration(config=config).to(torch_device).half().eval() - output = model(input_ids, decoder_input_ids=input_ids, attention_mask=attention_mask)["encoder_last_hidden_state"] + output = model(input_ids, decoder_input_ids=input_ids, attention_mask=attention_mask)[ + "encoder_last_hidden_state" + ] self.parent.assertFalse(torch.isnan(output).any().item()) def create_and_check_encoder_decoder_shared_weights( @@ -509,7 +506,7 @@ def prepare_config_and_inputs_for_common(self): @require_torch class Pop2PianoModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase): - all_model_classes = (Pop2PianoForConditionalGeneration, ) if is_torch_available() else () + all_model_classes = (Pop2PianoForConditionalGeneration,) if is_torch_available() else () all_generative_model_classes = () all_parallelizable_model_classes = () fx_compatible = False @@ -591,10 +588,6 @@ def test_decoder_model_past_with_large_inputs(self): config_and_inputs = self.model_tester.prepare_config_and_inputs() self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs) - def test_generate_with_past_key_values(self): - config_and_inputs = self.model_tester.prepare_config_and_inputs() - self.model_tester.create_and_check_generate_with_past_key_values(*config_and_inputs) - def test_encoder_decoder_shared_weights(self): config_and_inputs = self.model_tester.prepare_config_and_inputs() self.model_tester.create_and_check_encoder_decoder_shared_weights(*config_and_inputs) @@ -628,40 +621,6 @@ def test_export_to_onnx(self): input_names=["input_ids", "decoder_input_ids"], ) - def test_generate_with_head_masking(self): - attention_names = ["encoder_attentions", "decoder_attentions", "cross_attentions"] - config_and_inputs = self.model_tester.prepare_config_and_inputs() - config = config_and_inputs[0] - max_length = config_and_inputs[1].shape[-1] + 3 - model = Pop2PianoForConditionalGeneration(config).eval() - 
model.to(torch_device) - - head_masking = { - "head_mask": torch.zeros(config.num_layers, config.num_heads, device=torch_device), - "decoder_head_mask": torch.zeros(config.num_decoder_layers, config.num_heads, device=torch_device), - "cross_attn_head_mask": torch.zeros(config.num_decoder_layers, config.num_heads, device=torch_device), - } - - for attn_name, (name, mask) in zip(attention_names, head_masking.items()): - head_masks = {name: mask} - # Explicitly pass decoder_head_mask as it is required from Pop2Piano model when head_mask specified - if name == "head_mask": - head_masks["decoder_head_mask"] = torch.ones( - config.num_decoder_layers, config.num_heads, device=torch_device - ) - - out = model.generate( - config_and_inputs[1], - num_beams=1, - max_length=max_length, - output_attentions=True, - return_dict_in_generate=True, - **head_masks, - ) - # We check the state of decoder_attentions and cross_attentions just from the last step - attn_weights = out[attn_name] if attn_name == attention_names[0] else out[attn_name][-1] - self.assertEqual(sum([w.sum().item() for w in attn_weights]), 0.0) - @unittest.skip("Does not work on the tiny model as we keep hitting edge cases.") def test_disk_offload(self): pass @@ -674,6 +633,7 @@ def test_generate_with_head_masking(self): def test_generate_with_past_key_values(self): pass + @require_torch @require_torchaudio class Pop2PianoModelIntegrationTests(unittest.TestCase): @@ -687,11 +647,14 @@ def test_log_mel_spectrogram_integration(self): self.assertEqual(output.size(), torch.Size([10, 512, 98])) # check values - self.assertEqual(output[0, :3, :3].cpu().numpy().tolist(), - [[-13.815510749816895, -13.815510749816895, -13.815510749816895], - [-13.815510749816895, -13.815510749816895, -13.815510749816895], - [-13.815510749816895, -13.815510749816895, -13.815510749816895]] - ) + self.assertEqual( + output[0, :3, :3].cpu().numpy().tolist(), + [ + [-13.815510749816895, -13.815510749816895, -13.815510749816895], + [-13.815510749816895, -13.815510749816895, -13.815510749816895], + [-13.815510749816895, -13.815510749816895, -13.815510749816895], + ], + ) @slow def test_mel_conditioner_integration(self): @@ -708,23 +671,20 @@ def test_mel_conditioner_integration(self): self.assertEqual(outputs.size(), torch.Size([10, 101, 512])) # check values - self.assertEqual(outputs[0, :3, :3].detach().cpu().numpy().tolist(), - [[1.0475305318832397, 0.29052114486694336, -0.47778210043907166], - [1.0, 1.0, 1.0], - [1.0, 1.0, 1.0]] - ) + self.assertEqual( + outputs[0, :3, :3].detach().cpu().numpy().tolist(), + [[1.0475305318832397, 0.29052114486694336, -0.47778210043907166], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], + ) @slow def test_full_model_integration(self): model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev") model.eval() - input_features = BatchFeature({'input_features': torch.ones([100, 100000])}) + input_features = BatchFeature({"input_features": torch.ones([100, 100000])}) outputs = model.generate(input_features=input_features) # check for shapes self.assertEqual(outputs.size(0), 100) # check for values - self.assertEqual(outputs[0, :3].detach().cpu().numpy().tolist(), - [0, 134, 133] - ) \ No newline at end of file + self.assertEqual(outputs[0, :3].detach().cpu().numpy().tolist(), [0, 134, 133]) diff --git a/utils/check_repo.py b/utils/check_repo.py index 121993bc1e833c..4cbcd1a3ca942b 100644 --- a/utils/check_repo.py +++ b/utils/check_repo.py @@ -45,6 +45,7 @@ "RealmBertModel", "T5Stack", "MT5Stack", + "Pop2PianoStack", 
"SwitchTransformersStack", "TFDPRSpanPredictor", "MaskFormerSwinModel", diff --git a/utils/documentation_tests.txt b/utils/documentation_tests.txt index 8b622bf778dc2b..035ce24e7da2bd 100644 --- a/utils/documentation_tests.txt +++ b/utils/documentation_tests.txt @@ -147,6 +147,8 @@ src/transformers/models/plbart/configuration_plbart.py src/transformers/models/plbart/modeling_plbart.py src/transformers/models/poolformer/configuration_poolformer.py src/transformers/models/poolformer/modeling_poolformer.py +src/transformers/models/pop2piano/modeling_pop2piano.py +src/transformers/models/pop2piano/configuration_pop2piano.py src/transformers/models/realm/configuration_realm.py src/transformers/models/reformer/configuration_reformer.py src/transformers/models/reformer/modeling_reformer.py