This study is driven by the universal right to accessible information, see the full description of our project https://idemproject.eu/en, which aims at Text Simplification (TS) while leveraging strategies from both computational and translation studies. Particularly, intralingual translation, such as Diastratic Translation, focuses on shifting from Standard English to Easy-to-Read (E2R) English, making information accessible to audiences with reading difficulties, including people with disabilities and low literacy levels. The study's contributions include (1) an extended taxonomy of text simplification strategies that integrates insights from translation studies, (2) a corpus of complex and simplified texts sourced from public services in Scotland, (3) experiments using transformer-based models to predict simplification strategies, and (4) the use of Explainable AI (XAI) techniques, such as Integrated Gradients, to interpret model predictions.
This repository contains a novel dataset, which is driven by taxonomy for sentence-level TS tasks. The dataset is located in the dataset/ folder. Unlike previous resources like WikiLarge and ASSET, which emphasize word-level or predefined operations, this study focuses on why corrections are needed by providing annotations for lexical, syntactic, and semantic changes. The dataset itself stems from diverse public services in Scotland:
Source | # Texts | Complex | Simple | ||||
---|---|---|---|---|---|---|---|
# Words | # Sentences | IQR | # Words | # Sentences | IQR | ||
Health | 21 | 183,677 | 7,258 | (15.0–31.0) | 30,253 | 1,519 | (10.0–21.0) |
Public Info | 4 | 12,217 | 527 | (12.0–30.5) | 3,378 | 217 | (9.0–18.0) |
Politics | 9 | 113,412 | 4,824 | (15.0–29.0) | 12,474 | 832 | (9.0–17.0) |
Data Selection | – | 4,166 | 155 | (12–27) | 3,259 | 161 | (9–20) |
an extended taxonomy of text simplification strategies that integrates insights from translation studies
Macro-Strategy | Strategies |
---|---|
Transcription | No simplification needed. |
Synonymy | - Pragmatic: Acronyms spelled out; Proper names to common names; Contextual synonyms made explicit. - Semantic: Hypernyms; Hyponyms; Stereotypes. - Grammatical: Negative to positive sentences; Passive to active sentences; Pronouns to referents; Tenses simplified. |
Explanation | Words given for known; Expressions given for known; Tropes explained; Schemes explained; Deixis clarified; Hidden grammar made explicit; Hidden concepts made explicit. |
Syntactic Changes | Word → Group; Word → Clause; Word → Sentence; Group → Word; Group → Clause; Group → Sentence; Clause → Word; Clause → Group; Clause → Sentence; Sentence → Word; Sentence → Group; Sentence → Clause. |
Transposition | Nouns for things, animals, or people; Verbs for actions; Adjectives for nouns; Adverbs for verbs. |
Modulation | Text-level linearity; Sentence-level linearity: Chronological order of clauses; Logical order of complements. |
Anaphora | Repetition replaces synonyms. |
Omission | Useless elements: Nouns; Verbs; Complements; Sentences. Rhetorical constructs; Diamesic elements. |
Illocutionary Change | Implicit meaning made explicit. |
Compression | Grammatical constructs simplified; Rhetorical constructs simplified. |
The annotated texts and necessary resources for training and evaluation are stored in the directory texts/
.
To train the model, run the following command:
python train.py -m <PLM>/
-l <checkpoint_dir>/
-i <train_file>/
--weights_file <class_weights_file> [hyperparameters]
-m
,--model_name
: The name of the pre-trained language model to use (e.g., bert-base-multilingual-cased, roberta-base).-l
,--local
: Directory to save model checkpoints and logs.-i
,--train_file
: Path to the training dataset (e.g.,texts/train.csv
).--weights_file
: Path to the file containing class weights (e.g.,texts/class_weights.txt
).
Available Pre-Trained Language Models (PLMs) We recommend experimenting with widely-used transformer models like BERT, RoBERTa, and mBERT
-e
,--epochs
: Number of training epochs (default:4
).--batch_size
: Batch size for training (default:8
).--learning_rate
: Learning rate for the optimizer (default:1e-5
).--weight_decay
: Weight decay (L2 regularization) for the optimizer (default:0.01
).--eval_steps
: Number of steps between evaluations (default:1000
).--fp16
: Use mixed-precision training for faster execution and reduced memory usage (flag; no value needed).--max_grad_norm
: Maximum gradient norm for clipping (default:1.0
).--seed
: Random seed for reproducibility (default:42
).--gradient_accumulation_steps
: Number of steps to accumulate gradients before updating model weights (default:1
).--save_total_limit
: Maximum number of checkpoints to keep (default:2
).--logging_dir
: Directory to store logs (default:./results/logs
).--evaluation_strategy
: Evaluation strategy (epoch
,steps
, orno
; default:epoch
).--save_strategy
: Strategy for saving checkpoints (epoch
,steps
, orno
; default:epoch
).
python train.py \
-m bert-base-multilingual-cased \
-l ./results \
-i texts/train.csv \
--weights_file texts/class_weights.txt \
--epochs 10 \
--batch_size 16 \
--learning_rate 5e-5 \
--eval_steps 500 \
--fp16 \
--max_grad_norm 1.0
To view the full list of parameters and their descriptions:
python train.py -h
To evaluate the trained model, run the following command:
python test.py -m <checkpoint_dir> -t <test_file>
-m
,--model_dir
: Path to the saved model directory (e.g.,./results
).-t
,--test_file
: Path to the test dataset (e.g.,texts/test.csv
).
python test.py \
-m ./results \
-t texts/test.csv
Before training and evaluation, you can split your dataset into training and testing subsets using split_data.py
:
python split_data.py \
-i texts/annotated.csv \
--output_train texts/train.csv \
--output_test texts/test.csv \
--output_weights texts/class_weights.txt \
--test_size 0.2
-i
,--inputfile
: Path to the annotated dataset.--output_train
: Path to save the training dataset.--output_test
: Path to save the test dataset.--output_weights
: Path to save the class weights file.--test_size
: Proportion of data to use for testing (default:0.2
).
After training the model, you can use the inference.py
script to classify new texts using the saved model and label classes.
Run the inference.py
script as follows:
python inference.py \
-m results/ \
-i "The patient was admitted to the hospital with symptoms of severe flu." \
"The weather today is sunny and warm." \
-l label_classes.npy
-m
or--model_path
: Path to the directory containing the saved model files (e.g.,results/
).-i
or--input_texts
: One or more input texts to classify. Provide each text in quotes, separated by spaces.-l
or--label_classes_path
: Path to the label classes file (e.g.,label_classes.npy
), required to decode predictions.
- Ensure the
class_weights.txt
is generated usingsplit_data.py
. This ensures the model effectively handles class imbalances. - The
label_classes.npy
file maps the numeric predictions from the model to their respective class names. - You can provide multiple input texts by separating them with spaces and enclosing each text in quotes.
The dataset and the script are fully described in our paper:
Reading Between the Lines: A Dataset and a Study on Why Some Texts Are Tougher Than Others,
Nouran Khallaf, Carlo Eugeni, and Serge Sharoff
Presented at Writing Aids at the Crossroads of AI, Cognitive Science, and NLP (WR-AI-CogS), COLING 2025, Abu Dhabi.
arXiv:2501.01796
@inproceedings{khallaf2025readinglinesdatasetstudy,
title={Reading Between the Lines: A dataset and a study on why some texts are tougher than others},
author={Nouran Khallaf and Carlo Eugeni and Serge Sharoff},
booktitle={Writing Aids at the Crossroads of AI, Cognitive Science and NLP WR-AI-CogS, at COLING'2025},
address={Abu Dhabi},
year={2025},
eprint={2501.01796},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.01796}
}