Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups
This repository contains the fine-tuned models used in "Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups", accepted at the EMNLP 2024 main conference. The public fine-tuned Llama 2-based models can be found on HuggingFace.
Note
Will be added soon.
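Once the checkpoints are published, they should be loadable with the HuggingFace transformers library in the usual way. The sketch below is illustrative only: the repository id and the prompt format are placeholders, not the actual model names or prompts used in the paper.

```python
# Minimal sketch for loading one of the fine-tuned Llama 2 checkpoints from HuggingFace.
# NOTE: the repository id below is a placeholder; replace it with the actual model id
# once the links are published. The prompt is also only illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-2-7b-cwi-en"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Is the word 'ubiquitous' complex in: 'Smartphones are ubiquitous today.'?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```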
These are the extracts from the CWI 2018 and LCP 2021 datasets used for fine-tuning ChatGPT (gpt-3.5-turbo), together with the fine-tuning settings.
Base model | Dataset | Training Data | Validation Data | Trained tokens | Epochs | Batch size | LR multiplier |
---|---|---|---|---|---|---|---|
gpt-3.5-turbo-1106 | CWI Shared 2018 EN | train set | validation set | 163,749 | 3 | 1 | 2 |
gpt-3.5-turbo-1106 | CWI Shared 2018 ES | train set | validation set | 224,784 | 3 | 1 | 2 |
gpt-3.5-turbo-1106 | CWI Shared 2018 DE | train set | validation set | 218,364 | 3 | 1 | 2 |
gpt-3.5-turbo-1106 | CompLex LCP 2021 | train set | validation set | 185,613 | 3 | 1 | 2 |
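For reference, the hyperparameters in the table map onto the OpenAI fine-tuning API roughly as in the sketch below. This assumes the train and validation extracts have already been converted to chat-format JSONL and uploaded with purpose `fine-tune`; the file ids shown are placeholders.

```python
# Sketch: launching a gpt-3.5-turbo-1106 fine-tuning job with the hyperparameters
# from the table above (3 epochs, batch size 1, LR multiplier 2).
# The file ids are placeholders for the uploaded train/validation extracts.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo-1106",
    training_file="file-XXXXXXXX",    # uploaded train-set extract (JSONL)
    validation_file="file-YYYYYYYY",  # uploaded validation-set extract (JSONL)
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 2,
    },
)
print(job.id)
```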
Llama 2-based models are available under the Llama 2 Community License.
You can cite our work as follows:
@misc{smădu2024investigatinglargelanguagemodels,
title={Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups},
author={Răzvan-Alexandru Smădu and David-Gabriel Ion and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
year={2024},
eprint={2411.01706},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.01706},
}