Merge pull request PaddlePaddle#653 from dingsiyu/ernie-doc
add ernie-doc to ernie develop
nbcc authored May 19, 2021
2 parents 69e3d22 + 5fe5bb6 commit 4b1b4ee
Showing 43 changed files with 183,760 additions and 0 deletions.
61 changes: 61 additions & 0 deletions ernie-doc/.gitignore
@@ -0,0 +1,61 @@
# Virtualenv
/.venv/
/venv/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
/bin/
/build/
/develop-eggs/
/dist/
/eggs/
/lib/
/lib64/
/output/
/parts/
/sdist/
/var/
/*.egg-info/
/.installed.cfg
/*.egg
/.eggs

# AUTHORS and ChangeLog will be generated while packaging
/AUTHORS
/ChangeLog

# BCloud / BuildSubmitter
/build_submitter.*
/logger_client_log

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
.tox/
.coverage
.cache
.pytest_cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Sphinx documentation
/docs/_build/

# user-defined
ernie_doc/output
ernie_doc/log
ernie_doc/data/imdb/*.txt
ernie_doc/data/imdb.debug
ernie_doc/tmpout
ernie_doc/py37
Binary file added ernie-doc/.meta/framework.pdf
Binary file not shown.
Binary file added ernie-doc/.meta/framework.png
216 changes: 216 additions & 0 deletions ernie-doc/README.md
@@ -0,0 +1,216 @@
English | [简体中文](./README_zh.md)

## _ERNIE-Doc_: A Retrospective Long-Document Modeling Transformer

- [Framework](#framework)
- [Pre-trained Models](#Pre-trained-Models)
- [Fine-tuning Tasks](#Fine-tuning-Tasks)
* [Language Modeling](#Language-Modeling)
* [Long-Text Classification](#Long-Text-Classification)
* [Question Answering](#Question-Answering)
* [Information Extraction](#Information-Extraction)
* [Semantic Matching](#Semantic-Matching)
- [Usage](#Usage)
  * [Install PaddlePaddle](#Install-PaddlePaddle)
* [Fine-tuning](#Fine-tuning)
- [Citation](#Citation)

For a technical description of the algorithm, please see our paper:
>[_**ERNIE-Doc: A Retrospective Long-Document Modeling Transformer**_](https://arxiv.org/abs/2012.15688)
>
>Siyu Ding\*, Junyuan Shang\*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint December 2020
>
>Accepted by **ACL-2021**
![ERNIE-Doc](https://img.shields.io/badge/Pretraining-Long%20Document%20Modeling-green) ![paper](https://img.shields.io/badge/Paper-ACL2021-yellow)

---
**ERNIE-Doc is a document-level language pretraining model**. Two well-designed techniques, the **retrospective feed mechanism** and the **enhanced recurrence mechanism**, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. ERNIE-Doc improved the state-of-the-art perplexity on WikiText-103 to 16.8. Moreover, it outperformed competitive pretraining models by a large margin on most language understanding tasks, such as text classification, question answering, information extraction and semantic matching.

## Framework

We propose three novel methods to enhance the long-document modeling ability of Transformers:

- **Retrospective Feed Mechanism**: Inspired by the human reading behavior of skimming a document first and then looking back over it attentively, we design a retrospective feed mechanism in which the segments of a document are fed twice as input. As a result, each segment in the retrospective phase can explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.
- **Enhanced Recurrence Mechanism**: a drop-in replacement for the recurrence of Recurrence Transformers (such as Transformer-XL) that changes the shifting-one-layer-downwards recurrence to same-layer recurrence (see the toy sketch below). In this manner, the maximum effective context length can be expanded, and past higher-level representations can be exploited to enrich future lower-level representations.
- **Segment-Reordering Objective**: a document-aware pretraining task that predicts the correct order of the permuted segments of a document, modeling the relationship among segments directly. This allows ERNIE-Doc to build full document representations for prediction.



![framework](.meta/framework.png)
Illustrations of ERNIE-Doc and Recurrence Transformers, where models with three layers take as input a long document which is sliced into four segments.
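
The recurrence change and the two-pass feed are easiest to see in code. Below is a minimal, framework-agnostic NumPy sketch, not the repository's implementation: the toy layer, shapes, and weights are purely illustrative, and it assumes only the same-layer vs. shifted-layer memory caching and the skim/retrospective double pass described above.

```python
import numpy as np

N_LAYERS, D = 3, 8                     # toy model: 3 layers, hidden size 8
rng = np.random.default_rng(0)
W = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

def toy_layer(i, x, mem):
    # Stand-in for one transformer layer attending over [memory; current segment].
    ctx = np.concatenate([mem, x], axis=0) if mem is not None else x
    return np.tanh(ctx.mean(axis=0, keepdims=True) + x @ W[i])

def run_segment(x, mems, enhanced):
    """Forward one segment through all layers; return the output and new memories."""
    new_mems, h = [], x
    for i in range(N_LAYERS):
        out = toy_layer(i, h, mems[i] if mems else None)
        # Transformer-XL caches the *input* of layer i (the previous layer's output),
        # shifting the recurrence one layer downwards; ERNIE-Doc's enhanced
        # recurrence caches layer i's *own output* (same-layer recurrence).
        new_mems.append(out if enhanced else h)
        h = out
    return h, new_mems

# A long document sliced into four segments (random features as placeholders).
segments = [rng.standard_normal((4, D)) for _ in range(4)]

# Retrospective feed: skim the whole document once, then feed it again so each
# segment in the retrospective pass sees memories built from the entire document.
mems = None
for phase in ("skim", "retrospective"):
    for seg in segments:
        out, mems = run_segment(seg, mems, enhanced=True)
    print(phase, "pass done, output shape:", out.shape)
```

In the actual model, fusing the skimming-phase information involves more than carrying memories forward; the loop above only illustrates that every segment is fed twice and how the segment-level memories are updated.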

## Pre-trained Models

We release the checkpoints for the **ERNIE-Doc _base_en/zh_** and **ERNIE-Doc _large_en_** models.

- [**ERNIE-Doc _base_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _base_zh_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-zh.tar.gz) (_12-layer, 768-hidden, 12-heads_)
- [**ERNIE-Doc _large_en_**](https://ernie-github.cdn.bcebos.com/model-ernie-doc-large-en.tar.gz) (_24-layer, 1024-hidden, 16-heads_)
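
If useful, a checkpoint can be fetched and unpacked with a few lines of Python. This is a minimal sketch: the archive layout is not documented here, so inspect the extracted files before use.

```python
# Download and unpack a released ERNIE-Doc checkpoint (URL from the list above).
import tarfile
import urllib.request

URL = "https://ernie-github.cdn.bcebos.com/model-ernie-doc-base-en.tar.gz"
archive = URL.rsplit("/", 1)[-1]

urllib.request.urlretrieve(URL, archive)       # fetch the tarball
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(".")                        # unpack into the current directory
    for member in tar.getmembers()[:10]:       # peek at the first few entries
        print(member.name)
```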


## Fine-tuning Tasks

We compare the performance of [ERNIE-Doc](https://arxiv.org/abs/2012.15688) with the existing SOTA pre-training models (such as [Longformer](https://arxiv.org/abs/2004.05150), [BigBird](https://arxiv.org/abs/2007.14062), [ETC](https://arxiv.org/abs/2004.08483) and [ERNIE2.0](https://arxiv.org/abs/1907.12412)) for language modeling (**_WikiText-103_**) and document-level natural language understanding tasks, including long-text classification (**_IMDB_**, **_HYP_**, **_THUCNews_**, **_IFLYTEK_**), question answering (**_TriviaQA_**, **_HotpotQA_**, **_DRCD_**, **_CMRC2018_**, **_DuReader_**, **_C3_**), information extraction (**_OpenKPE_**) and semantic matching (**_CAIL2019-SCM_**).

### Language Modeling

- [WikiText-103](https://arxiv.org/abs/1609.07843)

| Model | Param. | PPL |
|--------------------------|:--------:|:------:|
| _Results of base models_ | | |
| LSTM | - | 48.7 |
| LSTM+Neural cache | - | 40.8 |
| GCNN-14 | - | 37.2 |
| QRNN | 151M | 33.0 |
| Transformer-XL Base | 151M | 24.0 |
| SegaTransformer-XL Base | 151M | 22.5 |
| **ERNIE-Doc** Base | 151M | **21.0** |
| _Results of large models_ | | |
| Adaptive Input | 247M | 18.7 |
| Transformer-XL Large | 247M | 18.3 |
| Compressive Transformer | 247M | 17.1 |
| SegaTransformer-XL Large | 247M | 17.1 |
| **ERNIE-Doc** Large | 247M | **16.8** |

### Long-Text Classification

- [IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/index.html)

| Models | Acc. | F1 |
|-----------------|:----:|:----:|
| RoBERTa | 95.3 | 95.0 |
| Longformer | 95.7 | - |
| BigBird | - | 95.2 |
| **ERNIE-Doc** Base | **96.1** | **96.1** |
| XLNet-Large | 96.8 | - |
| **ERNIE-Doc** Large | **97.1** | **97.1** |

- [Hyperpartisan News Detection](https://pan.webis.de/semeval19/semeval19-web/)

| Models | F1 |
|-----------------|:----:|
| RoBERTa | 87.8 |
| Longformer | 94.8 |
| BigBird | 92.2 |
| **ERNIE-Doc** Base | **96.3** |
| **ERNIE-Doc** Large | **96.6** |

- [THUCNews (THU)](http://thuctc.thunlp.org/), [IFLYTEK (IFK)](https://arxiv.org/abs/2004.05986)

| Models | THU Dev (Acc.) | THU Test (Acc.) | IFK Dev (Acc.) |
|-----------------|:--------------:|:---------------:|:--------------:|
| BERT | 97.7 | 97.3 | 60.3 |
| BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
| RoBERTa-wwm-ext | - | - | 60.3 |
| ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
| ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
| **ERNIE-Doc** | **98.3** | **97.7** | **62.4** |

### Question Answering

- [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) on dev-set

| Models | F1 |
|-----------------|:----:|
| RoBERTa | 74.3 |
| Longformer | 75.2 |
| BigBird | 79.5 |
| **ERNIE-Doc** Base | **80.1** |
| Longformer Large | 77.8 |
| BigBird Large | - |
| **ERNIE-Doc** Large | **82.5** |

- [HotpotQA](https://hotpotqa.github.io/) on dev-set

| Models | Span-F1 | Supp.-F1 | Joint-F1 |
|-----------------|:----:|:----:|:----:|
| RoBERTa | 73.5 | 83.4 | 63.5 |
| Longformer | 74.3 | 84.4 | 64.4 |
| BigBird | 75.5 | **87.1** | 67.8 |
| **ERNIE-Doc** Base | **79.4** | 86.3 | **70.5** |
| Longformer Large | 81.0 | 85.8 | 71.4 |
| BigBird Large | 81.3 | **89.4** | - |
| **ERNIE-Doc** Large | **82.2** | 87.6 | **73.7** |

- [DRCD](https://arxiv.org/abs/1806.00920), [CMRC2018](https://arxiv.org/abs/1810.07366), [DuReader](https://arxiv.org/abs/1711.05073), [C3](https://arxiv.org/abs/1904.09679)

| Models | DRCD dev (EM/F1) | DRCD test (EM/F1) | CMRC2018 dev (EM/F1) | DuReader dev (EM/F1) | C3 dev (Acc.) | C3 test (Acc.) |
|-----------------|:-------------:|:-------------:|:-------------:|:-------------:|:--------:|:--------:|
| BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
| BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
| RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
| MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
| XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
| ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
| ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
| **ERNIE-Doc** | **90.5/95.2** | **90.5/95.1** | **76.1/91.6** | **65.8/77.9** | **76.5** | **76.5** |

### Information Extraction

- [Open Domain Web Keyphrase Extraction](https://www.aclweb.org/anthology/D19-1521/)

| Models | F1@1 | F1@3 | F1@5 |
|-----------|:----:|:----:|:----:|
| BLING-KPE | 26.7 | 29.2 | 20.9 |
| JointKPE | 39.1 | 39.8 | 33.8 |
| ETC | - | 40.2 | - |
| ERNIE-Doc | **40.2** | **40.5** | **34.4** |

### Semantic Matching

- [CAIL2019-SCM](https://arxiv.org/abs/1911.08962)

| Models | Dev (Acc.) | Test (Acc.) |
|-----------|:-------------:|:-------------:|
| BERT | 61.9 | 67.3 |
| ERNIE 2.0 | 64.9 | 67.9 |
| ERNIE-Doc | **65.6** | **68.8** |


## Usage

### Install PaddlePaddle

This codebase has been tested with PaddlePaddle (version >= 1.8) under Python 3. The other dependencies of ERNIE-Doc are listed in `requirements.txt`; you can install them with
```shell
pip install -r requirements.txt
```
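
To confirm the installation is usable, a quick version check (a minimal sanity check, not specific to ERNIE-Doc):

```python
# Verify the installed PaddlePaddle version before fine-tuning.
import paddle

print(paddle.__version__)   # should report 1.8.x or newer
```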

### Fine-tuning
We release the fine-tuning code for English and Chinese classification tasks and Chinese question answering tasks. For example, you can fine-tune the **ERNIE-Doc** base model on the IMDB, IFLYTEK, and DuReader datasets by running
```shell
sh script/run_imdb.sh
sh script/run_iflytek.sh
sh script/run_dureader.sh
```
[Preprocessing code for IMDB dataset](./ernie_doc/data/imdb/README.md)


The training log and evaluation results are written to `log/job.log.0`.

**Notice**: The actual total batch size equals `configured batch size * number of GPUs used` (e.g., a configured batch size of 8 on 4 GPUs gives an effective batch size of 32).


## Citation

You can cite the paper as follows:

```
@article{ding2020ernie,
title={ERNIE-DOC: The Retrospective Long-Document Modeling Transformer},
author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15688},
year={2020}
}
```


