Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups
This repository contains the fine-tuned models used in "Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups", accepted at the EMNLP 2024 main conference. The public fine-tuned Llama 2-based models can be found on HuggingFace.
Note
Will be added soon.
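Once the checkpoints are published, they should be loadable with the HuggingFace transformers library in the usual way. The sketch below is illustrative only: the repository id and the prompt format are placeholders, not the actual model names or prompts used in the paper.

```python
# Minimal sketch for loading one of the fine-tuned Llama 2 checkpoints from HuggingFace.
# NOTE: the repository id below is a placeholder; replace it with the actual model id
# once the links are published. The prompt is also only illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-2-7b-cwi-en"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Is the word 'ubiquitous' complex in: 'Smartphones are ubiquitous today.'?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```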
These are the extracts from the CWI 2018 and LCP 2021 datasets used for fine-tuning ChatGPT (gpt-3.5-turbo), together with the fine-tuning settings.
Base model | Dataset | Training Data | Validation Data | Trained tokens | Epochs | Batch size | LR multiplier |
---|---|---|---|---|---|---|---|
gpt-3.5-turbo-1106 | CWI Shared 2018 EN | train set | validation set | 163,749 | 3 | 1 | 2 |
gpt-3.5-turbo-1106 | CWI Shared 2018 ES | train set | validation set | 224,784 | 3 | 1 | 2 |
gpt-3.5-turbo-1106 | CWI Shared 2018 DE | train set | validation set | 218,364 | 3 | 1 | 2 |
gpt-3.5-turbo-1106 | CompLex LCP 2021 | train set | validation set | 185,613 | 3 | 1 | 2 |
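For reference, the hyperparameters in the table map onto the OpenAI fine-tuning API roughly as in the sketch below. This assumes the train and validation extracts have already been converted to chat-format JSONL and uploaded with purpose `fine-tune`; the file ids shown are placeholders.

```python
# Sketch: launching a gpt-3.5-turbo-1106 fine-tuning job with the hyperparameters
# from the table above (3 epochs, batch size 1, LR multiplier 2).
# The file ids are placeholders for the uploaded train/validation extracts.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo-1106",
    training_file="file-XXXXXXXX",    # uploaded train-set extract (JSONL)
    validation_file="file-YYYYYYYY",  # uploaded validation-set extract (JSONL)
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 2,
    },
)
print(job.id)
```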
Llama 2-based models are available under the Llama 2 Community License.
You can cite our work as follows:
@misc{smădu2024investigatinglargelanguagemodels,
title={Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups},
author={Răzvan-Alexandru Smădu and David-Gabriel Ion and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
year={2024},
eprint={2411.01706},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.01706},
}