Merge pull request PaddlePaddle#653 from dingsiyu/ernie-doc
add ernie-doc to ernie develop
nbcc authored May 19, 2021
2 parents 69e3d22 + 5fe5bb6 commit 4b1b4ee
Showing 43 changed files with 183,760 additions and 0 deletions.
61 changes: 61 additions & 0 deletions ernie-doc/.gitignore
@@ -0,0 +1,61 @@
# Virtualenv
/.venv/
/venv/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
/bin/
/build/
/develop-eggs/
/dist/
/eggs/
/lib/
/lib64/
/output/
/parts/
/sdist/
/var/
/*.egg-info/
/.installed.cfg
/*.egg
/.eggs

# AUTHORS and ChangeLog will be generated while packaging
/AUTHORS
/ChangeLog

# BCloud / BuildSubmitter
/build_submitter.*
/logger_client_log

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
.tox/
.coverage
.cache
.pytest_cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Sphinx documentation
/docs/_build/

# user-defined
ernie_doc/output
ernie_doc/log
ernie_doc/data/imdb/*.txt
ernie_doc/data/imdb.debug
ernie_doc/tmpout
ernie_doc/py37
Binary file added ernie-doc/.meta/framework.pdf
Binary file not shown.
Binary file added ernie-doc/.meta/framework.png
216 changes: 216 additions & 0 deletions ernie-doc/README.md
@@ -0,0 +1,216 @@
English | [简体中文](./README_zh.md)

## _ERNIE-Doc_: A Retrospective Long-Document Modeling Transformer

- [Framework](#framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Fine-tuning Tasks](#Fine-tuning-Tasks)
* [Language Modeling](#Language-Modeling)
* [Long-Text Classification](#Long-Text-Classification)
* [Question Answering](#Question-Answering)
* [Information Extraction](#Information-Extraction)
* [Semantic Matching](#Semantic-Matching)
- [Usage](#Usage)
  * [Install PaddlePaddle](#Install-PaddlePaddle)
* [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)

For a technical description of the algorithm, please see our paper:
>[_**ERNIE-Doc: A Retrospective Long-Document Modeling Transformer**_](https://arxiv.org/abs/2012.15688)
>
>Siyu Ding\*, Junyuan Shang\*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint December 2020
>
>Accepted by **ACL-2021**
![ERNIE-Doc](https://img.shields.io/badge/Pretraining-Long%20Document%20Modeling-green) ![paper](https://img.shields.io/badge/Paper-ACL2021-yellow)

---
**ERNIE-Doc is a document-level language pretraining model**. Two well-designed techniques, the **retrospective feed mechanism** and the **enhanced recurrence mechanism**, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. ERNIE-Doc improved the state-of-the-art perplexity on WikiText-103 to 16.8. Moreover, it outperformed competitive pretraining models by a large margin on most language understanding tasks, such as text classification, question answering, information extraction and semantic matching.

## Framework

We propose three novel methods to enhance the long-document modeling ability of Transformers:

- **Retrospective Feed Mechanism**: Inspired by the human reading behavior of skimming a document first and then looking back over it attentively, we design a retrospective feed mechanism in which the segments of a document are fed twice as input. As a result, each segment in the retrospective phase can explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.
- **Enhanced Recurrence Mechanism**: a drop-in replacement for the recurrence of Recurrence Transformers (such as Transformer-XL) that changes the shifting-one-layer-downwards recurrence to same-layer recurrence (see the toy sketch below). In this manner, the maximum effective context length can be expanded, and past higher-level representations can be exploited to enrich future lower-level representations.
- **Segment-Reordering Objective**: a document-aware pretraining task that predicts the correct order of the permuted segments of a document, modeling the relationship among segments directly. This allows ERNIE-Doc to build full document representations for prediction.



![framework](.meta/framework.png)
Illustrations of ERNIE-Doc and Recurrence Transformers, where models with three layers take as input a long document which is sliced into four segments.
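
The recurrence change and the two-pass feed are easiest to see in code. Below is a minimal, framework-agnostic NumPy sketch, not the repository's implementation: the toy layer, shapes, and weights are purely illustrative, and it assumes only the same-layer vs. shifted-layer memory caching and the skim/retrospective double pass described above.

```python
import numpy as np

N_LAYERS, D = 3, 8                     # toy model: 3 layers, hidden size 8
rng = np.random.default_rng(0)
W = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

def toy_layer(i, x, mem):
    # Stand-in for one transformer layer attending over [memory; current segment].
    ctx = np.concatenate([mem, x], axis=0) if mem is not None else x
    return np.tanh(ctx.mean(axis=0, keepdims=True) + x @ W[i])

def run_segment(x, mems, enhanced):
    """Forward one segment through all layers; return the output and new memories."""
    new_mems, h = [], x
    for i in range(N_LAYERS):
        out = toy_layer(i, h, mems[i] if mems else None)
        # Transformer-XL caches the *input* of layer i (the previous layer's output),
        # shifting the recurrence one layer downwards; ERNIE-Doc's enhanced
        # recurrence caches layer i's *own output* (same-layer recurrence).
        new_mems.append(out if enhanced else h)
        h = out
    return h, new_mems

# A long document sliced into four segments (random features as placeholders).
segments = [rng.standard_normal((4, D)) for _ in range(4)]

# Retrospective feed: skim the whole document once, then feed it again so each
# segment in the retrospective pass sees memories built from the entire document.
mems = None
for phase in ("skim", "retrospective"):
    for seg in segments:
        out, mems = run_segment(seg, mems, enhanced=True)
    print(phase, "pass done, output shape:", out.shape)
```

In the actual model, fusing the skimming-phase information involves more than carrying memories forward; the loop above only illustrates that every segment is fed twice and how the segment-level memories are updated.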

## Pre-trained Models

We release the checkpoints for the **ERNIE-Doc _base_en/zh_** and **ERNIE-Doc _large_en_** models.

- [**ERNIE-Doc _base_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _base_zh_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _large_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
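
If useful, a checkpoint can be fetched and unpacked with a few lines of Python. This is a minimal sketch: the archive layout is not documented here, so inspect the extracted files before use.

```python
# Download and unpack a released ERNIE-Doc checkpoint (URL from the list above).
import tarfile
import urllib.request

URL = "https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz"
archive = URL.rsplit("/", 1)[-1]

urllib.request.urlretrieve(URL, archive)       # fetch the tarball
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(".")                        # unpack into the current directory
    for member in tar.getmembers()[:10]:       # peek at the first few entries
        print(member.name)
```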


## Fine-tuning Tasks

We compare the performance of [ERNIE-Doc](https://arxiv.org/abs/2012.15688) with the existing SOTA pre-training models (such as [Longformer](https://arxiv.org/abs/2004.05150), [BigBird](https://arxiv.org/abs/2007.14062), [ETC](https://arxiv.org/abs/2004.08483) and [ERNIE2.0](https://arxiv.org/abs/1907.12412)) for language modeling (**_WikiText-103_**) and document-level natural language understanding tasks, including long-text classification (**_IMDB_**, **_HYP_**, **_THUCNews_**, **_IFLYTEK_**), question answering (**_TriviaQA_**, **_HotpotQA_**, **_DRCD_**, **_CMRC2018_**, **_DuReader_**, **_C3_**), information extraction (**_OpenKPE_**) and semantic matching (**_CAIL2019-SCM_**).

### Language Modeling

- [WikiText-103](https://arxiv.org/abs/1609.07843)

| Model | Param. | PPL |
|--------------------------|:--------:|:------:|
| _Results of base models_ | | |
| LSTM | - | 48.7 |
| LSTM+Neural cache | - | 40.8 |
| GCNN-14 | - | 37.2 |
| QRNN | 151M | 33.0 |
| Transformer-XL Base | 151M | 24.0 |
| SegaTransformer-XL Base | 151M | 22.5 |
| **ERNIE-Doc** Base | 151M | **21.0** |
| _Results of large models_ | | |
| Adaptive Input | 247M | 18.7 |
| Transformer-XL Large | 247M | 18.3 |
| Compressive Transformer | 247M | 17.1 |
| SegaTransformer-XL Large | 247M | 17.1 |
| **ERNIE-Doc** Large | 247M | **16.8** |

### Long-Text Classification

- [IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/index.html)

| Models | Acc. | F1 |
|-----------------|:----:|:----:|
| RoBERTa | 95.3 | 95.0 |
| Longformer | 95.7 | - |
| BigBird | - | 95.2 |
| **ERNIE-Doc** Base | **96.1** | **96.1** |
| XLNet-Large | 96.8 | - |
| **ERNIE-Doc** Large | **97.1** | **97.1** |

- [Hyperpartisan News Detection](https://pan.webis.de/semeval19/semeval19-web/)

| Models | F1 |
|-----------------|:----:|
| RoBERTa | 87.8 |
| Longformer | 94.8 |
| BigBird | 92.2 |
| **ERNIE-Doc** Base | **96.3** |
| **ERNIE-Doc** Large | **96.6** |

- [THUCNews (THU)](http://thuctc.thunlp.org/), [IFLYTEK (IFK)](https://arxiv.org/abs/2004.05986)

| Models | THU Dev (Acc.) | THU Test (Acc.) | IFK Dev (Acc.) |
|-----------------|:--------------:|:---------------:|:--------------:|
| BERT | 97.7 | 97.3 | 60.3 |
| BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
| RoBERTa-wwm-ext | - | - | 60.3 |
| ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
| ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
| **ERNIE-Doc** | **98.3** | **97.7** | **62.4** |

### Question Answering

- [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) on dev-set

| Models | F1 |
|-----------------|:----:|
| RoBERTa | 74.3 |
| Longformer | 75.2 |
| BigBird | 79.5 |
| **ERNIE-Doc** Base | **80.1** |
| Longformer Large | 77.8 |
| BigBird Large | - |
| **ERNIE-Doc** Large | **82.5** |

- [HotpotQA](https://hotpotqa.github.io/) on dev-set

| Models | Span-F1 | Supp.-F1 | Joint-F1 |
|-----------------|:----:|:----:|:----:|
| RoBERTa | 73.5 | 83.4 | 63.5 |
| Longformer | 74.3 | 84.4 | 64.4 |
| BigBird | 75.5 | **87.1** | 67.8 |
| **ERNIE-Doc** Base | **79.4** | 86.3 | **70.5** |
| Longformer Large | 81.0 | 85.8 | 71.4 |
| BigBird Large | 81.3 | **89.4** | - |
| **ERNIE-Doc** Large | **82.2** | 87.6 | **73.7** |

- [DRCD](https://arxiv.org/abs/1806.00920), [CMRC2018](https://arxiv.org/abs/1810.07366), [DuReader](https://arxiv.org/abs/1711.05073), [C3](https://arxiv.org/abs/1904.09679)

| Models | DRCD dev (EM/F1) | DRCD test (EM/F1) | CMRC2018 dev (EM/F1) | DuReader dev (EM/F1) | C3 dev (Acc.) | C3 test (Acc.) |
|-----------------|:-------------:|:-------------:|:-------------:|:-------------:|:--------:|:--------:|
| BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
| BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
| RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
| MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
| XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
| ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
| ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
| **ERNIE-Doc** | **90.5/95.2** | **90.5/95.1** | **76.1/91.6** | **65.8/77.9** | **76.5** | **76.5** |

### Information Extraction

- [Open Domain Web Keyphrase Extraction](https://www.aclweb.org/anthology/D19-1521/)

| Models | F1@1 | F1@3 | F1@5 |
|-----------|:----:|:----:|:----:|
| BLING-KPE | 26.7 | 29.2 | 20.9 |
| JointKPE | 39.1 | 39.8 | 33.8 |
| ETC | - | 40.2 | - |
| ERNIE-Doc | **40.2** | **40.5** | **34.4** |

### Semantic Matching

- [CAIL2019-SCM](https://arxiv.org/abs/1911.08962)

| Models | Dev (Acc.) | Test (Acc.) |
|-----------|:-------------:|:-------------:|
| BERT | 61.9 | 67.3 |
| ERNIE 2.0 | 64.9 | 67.9 |
| ERNIE-Doc | **65.6** | **68.8** |


## Usage

### Install PaddlePaddle

This codebase has been tested with PaddlePaddle (version >= 1.8) under Python 3. The other dependencies of ERNIE-Doc are listed in `requirements.txt`; you can install them with
```shell
pip install -r requirements.txt
```
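
To confirm the installation is usable, a quick version check (a minimal sanity check, not specific to ERNIE-Doc):

```python
# Verify the installed PaddlePaddle version before fine-tuning.
import paddle

print(paddle.__version__)   # should report 1.8.x or newer
```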

### Fine-tuning
We release the fine-tuning code for English and Chinese classification tasks and Chinese question answering tasks. For example, you can fine-tune the **ERNIE-Doc** base model on the IMDB, IFLYTEK, and DuReader datasets by running
```shell
sh script/run_imdb.sh
sh script/run_iflytek.sh
sh script/run_dureader.sh
```
[Preprocessing code for IMDB dataset](./ernie_doc/data/imdb/README.md)


The training log and evaluation results are written to `log/job.log.0`.

**Notice**: The actual total batch size equals `configured batch size * number of GPUs used` (e.g., a configured batch size of 8 on 4 GPUs gives an effective batch size of 32).


## Citation

You can cite the paper as follows:

```
@article{ding2020ernie,
title={ERNIE-DOC: The Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```


