Note: The contents of this repository have been migrated to cl-tohoku/bert-japanese.
BERT models trained on Japanese texts.
Pretrained models can be downloaded from Releases.
At present, BERT-base models are available. We are planning to release BERT-large models in the future.
- All the models are trained on Japanese Wikipedia.
- We trained models with two different tokenization algorithms:
  - mecab-ipadic-bpe-32k: texts are first tokenized with the MeCab morphological parser and then split into subwords by WordPiece. The vocabulary size is 32000.
  - mecab-ipadic-char-4k: texts are first tokenized with MeCab and then split into characters (the information of MeCab tokenization is preserved). The vocabulary size is 4000.
- All the models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
- We also distribute models trained with Whole Word Masking enabled; all of the tokens corresponding to a word (tokenized by MeCab) are masked at once.
- Along with the models, we provide tokenizers that are compatible with the ones defined in Hugging Face's Transformers. Refer to masked_lm_example.ipynb.
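As a rough illustration of how the models and tokenizers fit together with Transformers, here is a minimal masked-LM sketch. It assumes that the distributed tokenizers.py exposes a MeCab-based WordPiece tokenizer (named MecabBertTokenizer here for illustration) and that the released TensorFlow checkpoint has been converted into a Transformers/PyTorch model directory; see masked_lm_example.ipynb for the actual, complete example.

```python
# A minimal masked-LM sketch, assuming tokenizers.py provides a MeCab-based
# WordPiece tokenizer (the class name MecabBertTokenizer is illustrative) and
# the released checkpoint has already been converted for PyTorch.
import torch
from transformers import BertForMaskedLM

# The distributed tokenizers.py, not the Hugging Face "tokenizers" package.
from tokenizers import MecabBertTokenizer

tokenizer = MecabBertTokenizer(vocab_file='/path/to/base/dir/vocab.txt')
model = BertForMaskedLM.from_pretrained('/path/to/converted/model/dir')
model.eval()

tokens = tokenizer.tokenize('青葉山で実験をしました。')
masked_index = 3
tokens[masked_index] = '[MASK]'
ids = tokenizer.convert_tokens_to_ids(['[CLS]'] + tokens + ['[SEP]'])

with torch.no_grad():
    outputs = model(torch.tensor([ids]))

# +1 accounts for the prepended [CLS] token.
predicted_id = outputs[0][0, masked_index + 1].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))
```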
For just using the models with tokenizers.py:
- Transformers (>= 2.1.1)
- mecab-python3 with MeCab installed
If you wish to pretrain a model:
- TensorFlow (== 1.14)
- SentencePiece
- logzero
All the distributed models are pretrained on Japanese Wikipedia. To generate the corpus, WikiExtractor is used to extract plain texts from a Wikipedia dump file.
$ python WikiExtractor.py --output /path/to/corpus/dir --bytes 512M --compress --json --links --namespaces 0 --no_templates --min_text_length 16 --processes 20 jawiki-20190901-pages-articles-multistream.xml.bz2
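With --json and --compress, WikiExtractor writes bz2-compressed files containing one JSON object per line, each with a "text" field; these are the wiki_NN.bz2 files consumed in the next step. A small sketch for inspecting the output (the path is just an example):

```python
# Peek at WikiExtractor's output: one JSON object per line, bz2-compressed.
import bz2
import json

with bz2.open('/path/to/corpus/dir/AA/wiki_00.bz2', 'rt', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        print(article['title'])
        print(article['text'][:200])
        break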
Some preprocessing is applied to the extracted texts, such as splitting them into sentences and removing noisy markup. Here we used mecab-ipadic-NEologd to handle proper nouns correctly (i.e., not to treat the 。 in named entities such as モーニング娘。 and ゲスの極み乙女。 as a sentence boundary).
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 9 python make_corpus.py --input_file /path/to/corpus/dir/AA/wiki_{}.bz2 --output_file /path/to/corpus/dir/corpus.txt.{} --mecab_dict_path /path/to/neologd/dict/dir/
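The sentence-splitting idea can be sketched roughly as follows (the real logic lives in make_corpus.py, which also handles markup cleanup). With mecab-ipadic-NEologd, a 。 that belongs to a proper noun such as モーニング娘。 stays inside a single MeCab token, so sentences are only broken at 。 tokens that MeCab emits on their own.

```python
# A simplified sketch of NEologd-aware sentence splitting (not the exact
# implementation in make_corpus.py).
import MeCab

tagger = MeCab.Tagger('-d /path/to/neologd/dict/dir/')
tagger.parse('')  # workaround for a surface-string issue in older mecab-python3


def split_sentences(text):
    sentences, current = [], ''
    node = tagger.parseToNode(text)
    while node:
        current += node.surface
        # A standalone 。 token is a sentence boundary; a 。 inside a longer
        # NEologd token (e.g. モーニング娘。) is not.
        if node.surface == '。':
            sentences.append(current)
            current = ''
        node = node.next
    if current:
        sentences.append(current)
    return sentences


print(split_sentences('ゲスの極み乙女。の新曲が出た。モーニング娘。も人気だ。'))
```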
As in the original BERT, we used byte-pair encoding (BPE) to obtain subwords. We used the BPE implementation in SentencePiece.
# For mecab-ipadic-bpe-32k models
$ python build_vocab.py --input_file "/path/to/corpus/dir/corpus.txt.*" --output_file "/path/to/base/dir/vocab.txt" --subword_type bpe --vocab_size 32000
# For mecab-ipadic-char-4k models
$ python build_vocab.py --input_file "/path/to/corpus/dir/corpus.txt.*" --output_file "/path/to/base/dir/vocab.txt" --subword_type char --vocab_size 4000
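For reference, training a BPE model with SentencePiece looks roughly like the following. This is only a sketch of what build_vocab.py's BPE mode does; the parameters shown are illustrative rather than the script's exact settings.

```python
# A rough sketch of BPE training with SentencePiece (illustrative parameters).
import glob
import sentencepiece as spm

input_files = ','.join(sorted(glob.glob('/path/to/corpus/dir/corpus.txt.*')))
spm.SentencePieceTrainer.Train(
    '--input={} --model_prefix=bpe --model_type=bpe --vocab_size=32000'.format(input_files)
)
# The resulting bpe.vocab would then be turned into a BERT-style vocab.txt,
# adding special tokens such as [PAD], [UNK], [CLS], [SEP], [MASK].
```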
With the vocabulary and text files above, we create dataset files for pretraining. Note that this process is very memory-intensive and takes many hours.
# For mecab-ipadic-bpe-32k w/ whole word masking
# Note: each process will consume about 32GB RAM
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python create_pretraining_data.py --input_file /path/to/corpus/dir/corpus.txt.{} --output_file /path/to/base/dir/pretraining-data.tf_record.{} --do_whole_word_mask True --vocab_file /path/to/base/dir/vocab.txt --subword_type bpe --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15
# For mecab-ipadic-bpe-32k w/o whole word masking
# Note: each process will consume about 32GB RAM
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python create_pretraining_data.py --input_file /path/to/corpus/dir/corpus.txt.{} --output_file /path/to/base/dir/pretraining-data.tf_record.{} --vocab_file /path/to/base/dir/vocab.txt --subword_type bpe --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15
# For mecab-ipadic-char-4k w/ whole word masking
# Note: each process will consume about 45GB RAM
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python create_pretraining_data.py --input_file /path/to/corpus/dir/corpus.txt.{} --output_file /path/to/base/dir/pretraining-data.tf_record.{} --do_whole_word_mask True --vocab_file /path/to/base/dir/vocab.txt --subword_type char --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15
# For mecab-ipadic-char-4k w/o whole word masking
# Note: each process will consume about 45GB RAM
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python create_pretraining_data.py --input_file /path/to/corpus/dir/corpus.txt.{} --output_file /path/to/base/dir/pretraining-data.tf_record.{} --vocab_file /path/to/base/dir/vocab.txt --subword_type char --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15
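To make the --do_whole_word_mask option concrete, here is a toy illustration of the idea: subwords belonging to the same word are masked together rather than independently. For the BPE vocabulary, continuation subwords carry the '##' prefix, so grouping on that prefix approximately recovers the MeCab word boundaries. This is a simplification of what create_pretraining_data.py actually does.

```python
# A toy illustration of whole word masking: '##'-prefixed subwords are grouped
# with the preceding token and masked together.
import random


def whole_word_mask(tokens, mask_prob=0.15):
    # Group subword indices into words.
    groups = []
    for i, token in enumerate(tokens):
        if token.startswith('##') and groups:
            groups[-1].append(i)
        else:
            groups.append([i])

    masked = list(tokens)
    for group in groups:
        if random.random() < mask_prob:
            for i in group:
                masked[i] = '[MASK]'
    return masked


print(whole_word_mask(['青葉', '##山', 'で', '実', '##験', 'を', 'し', 'た'], mask_prob=0.5))
```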
We used Cloud TPUs to run pre-training.
For BERT-base models, we used v3-8 TPUs.
# For mecab-ipadic-bpe-32k BERT-base models
$ python3 run_pretraining.py \
--input_file="/path/to/pretraining-data.tf_record.*" \
--output_dir="/path/to/output_dir" \
--bert_config_file=bert_base_32k_config.json \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--do_train=True \
--train_batch_size=256 \
--num_train_steps=1000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=10 \
--use_tpu=True \
--tpu_name=<tpu name> \
--num_tpu_cores=8
# For mecab-ipadic-char-4k BERT-base models
$ python3 run_pretraining.py \
--input_file="/path/to/pretraining-data.tf_record.*" \
--output_dir="/path/to/output_dir" \
--bert_config_file=bert_base_4k_config.json \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--do_train=True \
--train_batch_size=256 \
--num_train_steps=1000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=10 \
--use_tpu=True \
--tpu_name=<tpu name> \
--num_tpu_cores=8
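If you want to use the resulting TensorFlow checkpoint with the Transformers-based tokenizers mentioned above, it first needs to be converted to a PyTorch model directory. A hedged sketch using Transformers' load_tf_weights_in_bert helper (paths and the checkpoint step number are placeholders):

```python
# Convert the final TF checkpoint into a PyTorch model directory usable with
# Transformers (paths and step number are placeholders).
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file('bert_base_32k_config.json')
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, '/path/to/output_dir/model.ckpt-1000000')

torch.save(model.state_dict(), '/path/to/converted/model/dir/pytorch_model.bin')
config.to_json_file('/path/to/converted/model/dir/config.json')
```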
For training the models, we used Cloud TPUs provided by the TensorFlow Research Cloud program.