
Polyglot Prompting: Multilingual Multitask Prompt Training

Overview | Installation | PolyPrompt Datasets | Resources | Running | Bib

This repository contains the code and datasets for our paper Polyglot Prompting: Multilingual Multitask Prompt Training.

Overview

This paper explores a potential architectural improvement for multilingual learning and asks: `Can different tasks from different languages be modeled in a monolithic framework, i.e., without any task/language-specific module?`

We approach this goal by developing Polyglot Prompting, a learning framework that uses multilingual prompt engineering to learn a unified semantic space across different languages and tasks. We perform a comprehensive evaluation on 6 tasks (topic classification, sentiment classification, named entity recognition, question answering, natural language inference, and summarization), covering 24 datasets and 49 languages.


Quick Installation

  • Python==3.7
  • torch==1.9.0
  • transformers==4.15.0

Run the following command to install the dependencies:

pip3 install -r requirements.txt

PolyPrompt Datasets

How to use the PolyPrompt Datasets? We have released all the datasets prompted with the best-performing settings, and we provide two ways to obtain them: load them directly from DataLab, or build them from the original datasets with our preprocessing code.

1. Load the PolyPrompt Datasets from DataLab.

(1) Install DataLab with the following command:

pip install --upgrade pip
pip install datalabs
python -m nltk.downloader omw-1.4 # to support more feature calculation

More detailed instructions on installing DataLab can be found here.

(2) After installing DataLab, the following code can be used to download/load datasets equipped with cross-language prompts.

# pip install datalabs
from datalabs import load_dataset
dataset = load_dataset("poly_prompt", "xquad.es")

# Get more information about the dataset.
print('dataset: ', dataset)
print(dataset['train'][0])
print(dataset['train']._info)
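If you want to feed a loaded example into mT5 yourself, the following is a minimal sketch of the tokenization step, matching the source_max_len=512 and target_max_len=64 used in training below. The keys "source" and "target" are placeholders, not guaranteed field names: inspect the output of print(dataset['train'][0]) above to find the actual keys for each dataset.

# A minimal sketch (not part of this repo): tokenize one prompted example for mT5.
# The keys "source" and "target" are placeholders; check print(dataset['train'][0])
# for the real field names of each dataset.
from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
example = dataset["train"][0]

inputs = tokenizer(example["source"], max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer(example["target"], max_length=64, truncation=True, return_tensors="pt")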

2. Build the PolyPrompt Datasets with our provided preprocessing code.

  • data_preprocess.py is the data preprocessing code for the 7 target datasets (e.g., XNLI, TyDiQA) and the 15 non-target datasets (e.g., MCTest). You can use it together with the prompt templates to build the PolyPrompt Datasets; a minimal sketch of the idea follows.
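For intuition, applying a prompt template boils down to filling slots in a pattern string and verbalizing the label as text, since mT5 is a text-to-text model. The sketch below illustrates this on an NLI-style example; the template wording and field names are illustrative only, not the exact ones shipped in ./templates/CL.

# Illustrative sketch of prompt filling. The template text and field names below
# are made up for illustration; the real templates live in ./templates/CL and
# ./templates/IL and are applied by data_preprocess.py.
template = "premise: {premise} hypothesis: {hypothesis} Does the premise entail the hypothesis?"

example = {
    "premise": "A man is playing a guitar.",
    "hypothesis": "A person is making music.",
}

source = template.format(**example)
target = "entailment"  # the label verbalized as text, since mT5 generates strings
print(source)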

Resources

Polyglot Prompt Templates

  • ./templates/CL contains the cross-language prompt templates explored in this work.
  • ./templates/IL contains the in-language prompt templates explored in this work.

Preprocessed PolyPrompt Datasets

How to Run?

Preprocess or download the datasets with cross-language prompt templates, and place them in ./datas. Then run train_mt5.py with the following command:

export model_dir='./models'
export taskname="tydiqa,pawsx,xnli,mldoc,marc,mlqa,xquad"
export model_name="mt5base_polyprompt_crossLanguage"
export output_dir=${model_dir}/${model_name}
export model_path='google/mt5-base'
export datadir='./datas/'
export prompt_dir='./datas/templates/CL/'
export train_filename='train3k_promptCL_7datas_18w.pt'
export train_file_path=${datadir}/${train_filename}
export num_train_epochs=18
export save_steps=4000
export do_train=True # set do_train=False if you don't need to fine-tune the model.
export do_eval=True
export do_test=True
export PER_DEVICE_TRAIN_BATCH_SIZE=18
export PER_DEVICE_EVAL_BATCH_SIZE=2
export gradient_accumulation_steps=5
export eval_batch_size=100

CUDA_VISIBLE_DEVICES=2,3 python ./train_mt5.py \
    --output_dir=$output_dir \
    --taskname=${taskname} \
    --model_name_or_path=$model_path \
    --train_file_path=$train_file_path \
    --overwrite_output_dir=True \
    --per_device_train_batch_size=$PER_DEVICE_TRAIN_BATCH_SIZE \
    --per_device_eval_batch_size=$PER_DEVICE_EVAL_BATCH_SIZE \
    --source_max_len=512 \
    --target_max_len=64 \
    --eval_batch_size=$eval_batch_size \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --learning_rate=1e-4 \
    --num_train_epochs=$num_train_epochs \
    --save_steps=$save_steps \
    --do_train=$do_train \
    --do_eval=$do_eval \
    --data_dir=$datadir \
    --prompt_dir=$prompt_dir \
    --model_dir=$model_dir

The above commands can be found in run_train.sh.
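With these settings, one optimizer update corresponds to 18 examples per device × 2 visible GPUs × 5 gradient-accumulation steps = 180 examples (assuming the script uses standard data parallelism across CUDA_VISIBLE_DEVICES). If you train on a different number of GPUs, adjust PER_DEVICE_TRAIN_BATCH_SIZE or gradient_accumulation_steps to keep the effective batch size comparable.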

If you just want to evaluate your own PolyPrompt model, you can run the following command:

./run_pred.sh

Bib

@inproceedings{fu2022polyglot,
  title={Polyglot Prompt: Multilingual Multitask Prompt Training},
  author={Fu, Jinlan and Ng, See-Kiong and Liu, Pengfei},
  booktitle={Proceedings of EMNLP},
  year={2022}
}
