To set up multilingual ROUGE scoring, follow the README.md under the multilingual_rouge_scoring directory.
We use the Google Search Engine to find additional input articles for each event from WCEP, with the following steps:
First, we use KeyBERT to extract keywords by running:
$ python dataset_collection/keywords_extraction_keyBERT.py \
--file_name "cantonese_crawl.jsonl" \
--data_dir "./Multi-Doc-Sum/Mtl_data" \
--output_dir "./Multi-Doc-Sum/keywords_extraction_keyBERT"
The meanings of the arguments are as follows:
- data_dir: directory of the original data crawled from WCEP.
- file_name: a specific file of a certain language under the dataset directory.
- output_dir: output directory for the extracted keywords.
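For reference, the script builds on the KeyBERT library; the sketch below shows the core extraction call, where the embedding model and parameters are illustrative assumptions rather than the script's exact settings:

from keybert import KeyBERT

# The multilingual embedding model is an assumed choice, not the script's setting.
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
article = "..."  # one article text from cantonese_crawl.jsonl

keywords = kw_model.extract_keywords(
    article,
    keyphrase_ngram_range=(1, 2),  # consider unigrams and bigrams
    top_n=5,                       # keep the five highest-scoring phrases
)
print(keywords)  # list of (phrase, similarity score) tuples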
Then we run the following command to search Google:
$ python dataset_collection/google_search.py \
--my_api_key $GOOGLE_SEARCH_API_KEY \
--my_cse_id $CUSTOM_SEARCH_ENGINE_ID \
--file_name "cantonese_crawl.jsonl" \
--data_dir "./Multi-Doc-Sum/Mtl_data" \
--keywords_dir "./Multi-Doc-Sum/keywords_extraction_keyBERT" \
--data_aug_dir "./Multi-Doc-Sum/Mtl_data_aug"
The meanings of the arguments are as follows:
- my_api_key: your Google Search API key.
- my_cse_id: your Custom Search Engine ID.
- data_dir: directory of the original data crawled from WCEP.
- file_name: a specific file of a certain language under the dataset directory.
- keywords_dir: directory of the keywords extracted in the previous step.
- data_aug_dir: output directory for the search-augmented data.
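Under the hood, google_search.py talks to the Custom Search JSON API; a minimal sketch with the google-api-python-client package, where the query string is a placeholder (the script builds queries from the extracted keywords):

from googleapiclient.discovery import build

def google_search(query, api_key, cse_id, num=10):
    # Custom Search JSON API: "customsearch" v1, keyed by API key + engine ID
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=query, cx=cse_id, num=num).execute()
    return res.get("items", [])  # each item carries "title", "link", "snippet"

for item in google_search("extracted keyword phrase", "YOUR_API_KEY", "YOUR_CSE_ID"):
    print(item["link"])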
A clean dataset is generated by filtering the source documents using the ORACLE method. The first step is to calculate the ORACLE score of each source document against the summary:
$ python dataset_collection/filter_source_documents/filter_oracle_get_score.py \
--data-dir "./Multi-Doc-Sum/Mtl_data/doc_extraction" \
--output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/scored" \
--input-file-name "cantonese_extracted.jsonl" \
--output-file-name "cantonese_scored.jsonl"
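Conceptually, the ORACLE score measures how much of the reference summary a source document covers. The sketch below uses plain unigram recall as a stand-in for the repo's multilingual ROUGE scorer; the jsonl field names are assumptions:

import json

def overlap_score(document, summary):
    # unigram recall of the summary against the document
    doc_tokens = set(document.split())
    sum_tokens = summary.split()
    hits = sum(1 for tok in sum_tokens if tok in doc_tokens)
    return hits / max(len(sum_tokens), 1)

with open("cantonese_extracted.jsonl", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)  # "summary" and "documents" are assumed field names
        scores = [overlap_score(doc, event["summary"]) for doc in event["documents"]]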
The second step is to filter out the source documents whose ORACLE score falls below a threshold:
$ python dataset_collection/filter_source_documents/filter_oracle.py \
--threshold 7 \
--data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/scored" \
--output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/filtered" \
--input-file-name "cantonese_scored.jsonl" \
--output-file-name "cantonese_filtered.jsonl"
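The filtering itself is a simple pass over the scored file; a sketch of the idea (field names are assumptions, and the score scale must match whatever filter_oracle_get_score.py writes):

import json

THRESHOLD = 7  # matches --threshold above

with open("cantonese_scored.jsonl", encoding="utf-8") as fin, \
     open("cantonese_filtered.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        event = json.loads(line)
        # keep only source documents whose ORACLE score clears the threshold
        event["documents"] = [d for d in event["documents"]
                              if d["oracle_score"] >= THRESHOLD]
        fout.write(json.dumps(event, ensure_ascii=False) + "\n")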
Both the noisy and the clean datasets are randomly split into training, validation, and test sets (80%, 10%, and 10%, respectively):
$ python dataset_collection/split.py \
--lang cantonese \
--data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/orig" \
--output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/orig" \
--input-file-name "cantonese_scored.jsonl"
$ python dataset_collection/split.py \
--lang cantonese \
--data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/filtered" \
--output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/filtered" \
--input-file-name "cantonese_filtered.jsonl"
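split.py's random 80/10/10 split boils down to the following (the fixed seed is an assumption added for reproducibility):

import json
import random

with open("cantonese_filtered.jsonl", encoding="utf-8") as f:
    events = [json.loads(line) for line in f]

random.seed(42)  # assumed fixed seed so splits are reproducible
random.shuffle(events)
n = len(events)
train = events[:int(0.8 * n)]             # 80% training
val = events[int(0.8 * n):int(0.9 * n)]   # 10% validation
test = events[int(0.9 * n):]              # 10% test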
Heuristic Baselines are calculated by:
$ python baselines/heuristic/get_heuristic_baseline_result.py \
--input-file-name "cantonese_test.jsonl" \
--data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/filtered/" \
--lang cantonese \
--output-dir "./baseline_results/clean_dataset"
TextRank Baselines are calculated analogously, using the corresponding script under the baselines directory.
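For reference, TextRank builds a sentence-similarity graph and ranks sentences with PageRank, keeping the top-ranked ones as the extractive summary; the sketch below illustrates the idea and is not the repo's implementation:

import networkx as nx

def textrank_summary(sentences, num_sentences=2):
    # edge weight = normalized unigram overlap between two sentences
    def sim(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / (len(wa) + len(wb) + 1e-9)

    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = sim(sentences[i], sentences[j])
            if w > 0:
                graph.add_edge(i, j, weight=w)
    ranks = nx.pagerank(graph, weight="weight")  # PageRank over the similarity graph
    top = sorted(ranks, key=ranks.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]   # restore original order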
- Prepare the dataset to train mt5 models:
$ python baselines/mt5/prepare_dataset.py \
--input-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/filtered" \
--output-dir "data/first_sentences" \
--language "cantonese"
- Here is an example trainer_cantonese.sh to train a single-language mt5 model.
- Here is an example train_multilingual.sh to train a multilingual mt5 model. Our trained multilingual model can be downloaded from Multilingual-mt5.
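Both shell scripts wrap Hugging Face transformers fine-tuning; the sketch below shows the shape of that training loop, with model size, hyperparameters, and data handling all being illustrative assumptions (the real settings live in the .sh scripts):

import torch
from transformers import (AutoTokenizer, MT5ForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # assumed model size
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

class SummDataset(torch.utils.data.Dataset):
    # pairs of source-document lines and target-summary lines
    def __init__(self, sources, targets):
        self.enc = tokenizer(sources, truncation=True, max_length=512,
                             padding="max_length")
        self.labels = tokenizer(text_target=targets, truncation=True,
                                max_length=64, padding="max_length")["input_ids"]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        labels = torch.tensor(self.labels[i])
        labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
        item["labels"] = labels
        return item

train_ds = SummDataset(["source document text ..."], ["summary text ..."])
args = Seq2SeqTrainingArguments(output_dir="output/", num_train_epochs=1,
                                per_device_train_batch_size=2, learning_rate=5e-5)
Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds).train()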
- Here is an example evaluation.py that runs the evaluation metrics BERTScore and T5Score. To run T5Score, a T5Score model should be downloaded from T5Score-summ to the directory ./model/T5Score/.
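For BERTScore, the bert-score package exposes a one-call API; the language code below is an assumption and should match the language under evaluation:

from bert_score import score

candidates = ["generated summary ..."]  # model outputs
references = ["reference summary ..."]  # gold summaries
P, R, F1 = score(candidates, references, lang="zh")  # assumed language code
print(F1.mean().item())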
- We also provide a docker image with all dependencies pre-installed to make it easier to run the above scripts. Here is how to run the training pipeline inside a docker container:
docker pull zs12/multidoc_multilingual:v0.3.1
# train a single-language mt5 model
./dockerfiles/docker_train_mt5.sh prepared_dataset/individual/EN/ output/
# train multilingual mt5 model
./dockerfiles/docker_train_mt5.sh prepared_dataset/multilingual/ output/ multi
# Run inference/prediction using a trained multilingual mt5 model
# data_dir/ should contain files named test.source and test.target
./dockerfiles/docker_predict_with_generate.sh model_dir/ data_dir/ output_dir/
# Run a prediction server that accepts requests from localhost:4123
./dockerfiles/docker_prediction_server.sh model_dir/
# upload a file to summarize (one doc/passage per line)
curl --data-binary @test.source localhost:4123
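The same request can be issued from Python, assuming (as the curl call implies) that the server returns the summaries in the plain-text response body:

import requests

with open("test.source", "rb") as f:  # one doc/passage per line
    resp = requests.post("http://localhost:4123", data=f.read())
print(resp.text)                      # summaries returned by the server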
- To run the docker container via ClearML:
# train a single-language mt5 model
./clearml_scripts/clearml_train_mt5.sh prepared_dataset/individual/EN/ output/
# train multilingual mt5 model
./clearml_scripts/clearml_train_mt5.sh prepared_dataset/multilingual/ output/ multi