Wikipedia Translated Clusters

This is the dataset introduced and described in the EMNLP 2022 Findings paper Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters.

It is a collection of 5K introductions to popular English Wikipedia articles, with their parallel versions in 10 other languages, machine translated to English. It also includes a synthetically corrupted dataset in which one sentence of each English intro is modified, and the task is to use the multilingual documents to identify the outlier with natural language inference (NLI).

[Figure: Cluster NLI task illustration]

Data format

The all_clusters file contains introductions to popular English Wikipedia articles, together with their parallel versions in 10 other languages, machine translated to English. The introductions are split into lists of sentences. Format:

`Article title in English`: {
    'en': English intro to the Wiki article, split into a list of sentences.
    'de': German intro, translated to English and split into a list of sentences.
    'fr': French …
    'it': Italian …
    'pl': Polish …
    'ru': Russian …
    'pt': Portuguese …
    'es': Spanish …
    'zh': Chinese …
    'sv': Swedish …
    'uk': Ukrainian …
}
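
For illustration, a minimal sketch of reading the clusters with Python's json module (the filename all_clusters.json is an assumption; adjust it to your copy of the file):

```python
import json

# Load the cluster file (filename assumed; adjust to your copy).
with open("all_clusters.json", "r", encoding="utf-8") as f:
    clusters = json.load(f)

# Each key is an English article title; each value maps a language
# code to that article's intro as a list of English sentences.
title, cluster = next(iter(clusters.items()))
print(title)
print(cluster["en"][0])  # first sentence of the English intro
print(cluster["de"][0])  # first sentence of the German intro, translated to English
```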

We include translation files per language (popular_pages_translations_{lang}.json) with the following format:

`Article title in English`: {
    'title': Article title in English,
    'mtitle': Article title in the source language,
    'intro': Intro of the article in the source language,
    'translated_intro': Translation of the article intro to English
}
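
A similar sketch for one of the per-language translation files (the filename pattern is taken from the description above; the choice of German is arbitrary):

```python
import json

lang = "de"
# Filename pattern as described above.
with open(f"popular_pages_translations_{lang}.json", "r", encoding="utf-8") as f:
    translations = json.load(f)

# Each key is an English article title; inspect the first entry.
entry = next(iter(translations.values()))
print(entry["mtitle"])            # title in the source language
print(entry["translated_intro"])  # intro translated to English
```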

Synthetic corruptions dataset

We create synthetic corruptions to the English Wikipedia introductions by replacing one sentence with an alternative version (based on edits from the VitaminC dataset). The goal in this setting is to automatically identify which sentence was modified by using the information from the other articles in the cluster (i.e., Wikipedia versions in other languages, translated to English). See the paper for more details.

The corrupted articles file (wiki_cluster_simulated_corruptions.json) contains a dictionary with 144 titles. Each title maps to a list of corruption instances (825 instances in total). Each instance has the following format:

{
    'wikis': Dictionary mapping `language` to the Wiki intro in that language, translated to English and split into sentences. The English version has one sentence that was modified.
    'gold_ind': [integer] The index of the modified sentence in the English intro.
    'original': [str] The sentence from the intro that was replaced.
    'new': [str] The sentence that replaced the original one in this instance (equivalent to the sentence at wikis['en'][gold_ind]).
    'matched_sent': [str] The sentence paired with 'new' in the VitaminC dataset, matched to the 'original' sentence by similarity.
}
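
As a rough end-to-end sketch, the loop below iterates over the corruption instances and scores a placeholder random baseline against gold_ind; a real system would replace the stand-in scorer with the NLI-based approach from the paper:

```python
import json
import random

with open("wiki_cluster_simulated_corruptions.json", "r", encoding="utf-8") as f:
    corruptions = json.load(f)

correct = total = 0
for title, instances in corruptions.items():
    for inst in instances:
        en_sents = inst["wikis"]["en"]
        # Stand-in scorer: pick a random sentence index. A real system
        # would score each English sentence against the other languages
        # in inst["wikis"] with an NLI model (see the paper).
        pred = random.randrange(len(en_sents))
        correct += int(pred == inst["gold_ind"])
        total += 1

print(f"accuracy: {correct / total:.3f} over {total} instances")
```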

How to cite

@inproceedings{Schuster2022Stretching,
  title={Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters},
  author={Tal Schuster and Sihao Chen and Senaka Buthpitiya and Alex Fabrikant and Donald Metzler},
  url = {https://arxiv.org/abs/2204.07447},
  booktitle = {Findings of the Association for Computational Linguistics: {EMNLP}},
  year={2022}
}

Disclaimer

This is not an official Google product.
