Nouran-Khallaf/why-tough
Reading Between the Lines: A dataset and a study on why some texts are tougher than others

This study is driven by the universal right to accessible information; see the full description of our project at https://idemproject.eu/en. The project addresses Text Simplification (TS) by leveraging strategies from both computational and translation studies. In particular, intralingual translation, such as Diastratic Translation, focuses on shifting from Standard English to Easy-to-Read (E2R) English, making information accessible to audiences with reading difficulties, including people with disabilities and people with low literacy. The study's contributions are (1) an extended taxonomy of text simplification strategies that integrates insights from translation studies, (2) a corpus of complex and simplified texts sourced from public services in Scotland, (3) experiments with transformer-based models to predict simplification strategies, and (4) the use of Explainable AI (XAI) techniques, such as Integrated Gradients, to interpret model predictions.

1-Corpus of complex and simplified texts

This repository contains a novel dataset for sentence-level TS tasks, annotated according to the taxonomy below. The dataset is located in the dataset/ folder. Unlike previous resources such as WikiLarge and ASSET, which emphasize word-level or predefined operations, this study focuses on why simplifications are needed by providing annotations for lexical, syntactic, and semantic changes. The texts stem from diverse public services in Scotland:

| Source | # Texts | Complex # Words | Complex # Sentences | Complex IQR | Simple # Words | Simple # Sentences | Simple IQR |
|---|---|---|---|---|---|---|---|
| Health | 21 | 183,677 | 7,258 | 15.0–31.0 | 30,253 | 1,519 | 10.0–21.0 |
| Public Info | 4 | 12,217 | 527 | 12.0–30.5 | 3,378 | 217 | 9.0–18.0 |
| Politics | 9 | 113,412 | 4,824 | 15.0–29.0 | 12,474 | 832 | 9.0–17.0 |
| Data Selection | | 4,166 | 155 | 12–27 | 3,259 | 161 | 9–20 |
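
As a quick way to inspect the corpus, the sketch below loads one of the files in dataset/ with pandas. The file name and columns used here are placeholders (the actual names depend on how the corpus is packaged), so adjust them after listing the folder contents.

```python
import pandas as pd

# Placeholder file name; replace it with an actual file from the dataset/ folder.
df = pd.read_csv("dataset/health.csv")

print(df.shape)              # number of rows and columns
print(df.columns.tolist())   # inspect the real column names
print(df.head())             # preview a few complex/simple pairs
```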

2-Text Simplification Macro-Strategies

An extended taxonomy of text simplification strategies that integrates insights from translation studies:

| Macro-Strategy | Strategies |
|---|---|
| Transcription | No simplification needed. |
| Synonymy | Pragmatic: Acronyms spelled out; Proper names to common names; Contextual synonyms made explicit. Semantic: Hypernyms; Hyponyms; Stereotypes. Grammatical: Negative to positive sentences; Passive to active sentences; Pronouns to referents; Tenses simplified. |
| Explanation | Words given for known; Expressions given for known; Tropes explained; Schemes explained; Deixis clarified; Hidden grammar made explicit; Hidden concepts made explicit. |
| Syntactic Changes | Word → Group; Word → Clause; Word → Sentence; Group → Word; Group → Clause; Group → Sentence; Clause → Word; Clause → Group; Clause → Sentence; Sentence → Word; Sentence → Group; Sentence → Clause. |
| Transposition | Nouns for things, animals, or people; Verbs for actions; Adjectives for nouns; Adverbs for verbs. |
| Modulation | Text-level linearity; Sentence-level linearity: Chronological order of clauses; Logical order of complements. |
| Anaphora | Repetition replaces synonyms. |
| Omission | Useless elements: Nouns; Verbs; Complements; Sentences. Rhetorical constructs; Diamesic elements. |
| Illocutionary Change | Implicit meaning made explicit. |
| Compression | Grammatical constructs simplified; Rhetorical constructs simplified. |

3-Using the Simplification Strategies Classification Model

The annotated texts and necessary resources for training and evaluation are stored in the directory texts/.

Training the Model

To train the model, run the following command:

python train.py -m <PLM> \
                -l <checkpoint_dir> \
                -i <train_file> \
                --weights_file <class_weights_file> [hyperparameters]

Required Parameters:

  • -m, --model_name: The name of the pre-trained language model to use (e.g., bert-base-multilingual-cased, roberta-base).
  • -l, --local: Directory to save model checkpoints and logs.
  • -i, --train_file: Path to the training dataset (e.g., texts/train.csv).
  • --weights_file: Path to the file containing class weights (e.g., texts/class_weights.txt).

Available Pre-Trained Language Models (PLMs): we recommend experimenting with widely used transformer models such as BERT, RoBERTa, and mBERT.

Optional Hyperparameters:

  • -e, --epochs: Number of training epochs (default: 4).
  • --batch_size: Batch size for training (default: 8).
  • --learning_rate: Learning rate for the optimizer (default: 1e-5).
  • --weight_decay: Weight decay (L2 regularization) for the optimizer (default: 0.01).
  • --eval_steps: Number of steps between evaluations (default: 1000).
  • --fp16: Use mixed-precision training for faster execution and reduced memory usage (flag; no value needed).
  • --max_grad_norm: Maximum gradient norm for clipping (default: 1.0).
  • --seed: Random seed for reproducibility (default: 42).
  • --gradient_accumulation_steps: Number of steps to accumulate gradients before updating model weights (default: 1).
  • --save_total_limit: Maximum number of checkpoints to keep (default: 2).
  • --logging_dir: Directory to store logs (default: ./results/logs).
  • --evaluation_strategy: Evaluation strategy (epoch, steps, or no; default: epoch).
  • --save_strategy: Strategy for saving checkpoints (epoch, steps, or no; default: epoch).

Example Training Command

python train.py \
    -m bert-base-multilingual-cased \
    -l ./results \
    -i texts/train.csv \
    --weights_file texts/class_weights.txt \
    --epochs 10 \
    --batch_size 16 \
    --learning_rate 5e-5 \
    --eval_steps 500 \
    --fp16 \
    --max_grad_norm 1.0

To view the full list of parameters and their descriptions:

python train.py -h

Evaluating the Model

To evaluate the trained model, run the following command:

python test.py -m <checkpoint_dir> -t <test_file>

Required Parameters:

  • -m, --model_dir: Path to the saved model directory (e.g., ./results).
  • -t, --test_file: Path to the test dataset (e.g., texts/test.csv).

Example Evaluation Command

python test.py \
    -m ./results \
    -t texts/test.csv

Splitting the Data

Before training and evaluation, you can split your dataset into training and testing subsets using split_data.py:

python split_data.py \
    -i texts/annotated.csv \
    --output_train texts/train.csv \
    --output_test texts/test.csv \
    --output_weights texts/class_weights.txt \
    --test_size 0.2

Parameters for split_data.py:

  • -i, --inputfile: Path to the annotated dataset.
  • --output_train: Path to save the training dataset.
  • --output_test: Path to save the test dataset.
  • --output_weights: Path to save the class weights file.
  • --test_size: Proportion of data to use for testing (default: 0.2).
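
split_data.py writes the class weights file that train.py consumes. If you need to regenerate the weights or the label mapping yourself, one common approach is inverse class frequency. The sketch below assumes a label column named strategy in texts/annotated.csv; both the column name and the exact weighting scheme are assumptions and may differ from what split_data.py actually does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight

df = pd.read_csv("texts/annotated.csv")
labels = df["strategy"]  # hypothetical label column name

# Encode string labels as integer ids and keep the mapping for decoding predictions.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
np.save("label_classes.npy", encoder.classes_)

# "Balanced" weights are inversely proportional to class frequency.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
np.savetxt("texts/class_weights.txt", weights)
```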

Performing Inference with the Trained Model

After training the model, you can use the inference.py script to classify new texts using the saved model and label classes.

Usage

Run the inference.py script as follows:

python inference.py \
    -m results/ \
    -i "The patient was admitted to the hospital with symptoms of severe flu." \
       "The weather today is sunny and warm." \
    -l label_classes.npy

Arguments

  • -m or --model_path: Path to the directory containing the saved model files (e.g., results/).
  • -i or --input_texts: One or more input texts to classify. Provide each text in quotes, separated by spaces.
  • -l or --label_classes_path: Path to the label classes file (e.g., label_classes.npy), required to decode predictions.

Notes

  • Ensure that class_weights.txt is generated with split_data.py so that the model handles class imbalance effectively.
  • The label_classes.npy file maps the numeric predictions from the model to their respective class names.
  • You can provide multiple input texts by separating them with spaces and enclosing each text in quotes.
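
If you prefer calling the trained model from Python rather than through inference.py, the sketch below uses the Hugging Face transformers API directly. It assumes the checkpoint in results/ was saved with save_pretrained (including the tokenizer) and that label_classes.npy holds the class names in label-id order; adapt the paths to your setup.

```python
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "results/"  # assumed checkpoint directory from training
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

label_classes = np.load("label_classes.npy", allow_pickle=True)

texts = [
    "The patient was admitted to the hospital with symptoms of severe flu.",
    "The weather today is sunny and warm.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1).tolist()
print([label_classes[i] for i in pred_ids])  # predicted simplification strategies
```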

Citation

The dataset and the scripts are fully described in our paper: Reading Between the Lines: A Dataset and a Study on Why Some Texts Are Tougher Than Others, Nouran Khallaf, Carlo Eugeni, and Serge Sharoff.
Presented at Writing Aids at the Crossroads of AI, Cognitive Science, and NLP (WR-AI-CogS), COLING 2025, Abu Dhabi.
arXiv:2501.01796

@inproceedings{khallaf2025readinglinesdatasetstudy,
  title={Reading Between the Lines: A dataset and a study on why some texts are tougher than others},
  author={Nouran Khallaf and Carlo Eugeni and Serge Sharoff},
  booktitle={Writing Aids at the Crossroads of AI, Cognitive Science and NLP WR-AI-CogS, at COLING'2025},
  address={Abu Dhabi},
  year={2025},
  eprint={2501.01796},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.01796}
}
