This repository contains the second version of the word embeddings generated from Spanish corpora. In this version, we preprocessed the corpora, obtaining better embeddings.
https://doi.org/10.5281/zenodo.3626806
The example below shows the structure for the Wikipedia subset. All other subsets have the same structure:
    wikipedia/
        cbow/
            cased/
                Wikipedia_cbow_cased.bin: fastText embedding in binary file.
                Wikipedia_cbow_cased.vec: fastText embedding in text file.
            uncased/
                Wikipedia_cbow_uncased.bin: fastText embedding in binary file.
                Wikipedia_cbow_uncased.vec: fastText embedding in text file.
        skipgram/
            cased/
                Wikipedia_skipgram_cased.bin: fastText embedding in binary file.
                Wikipedia_skipgram_cased.vec: fastText embedding in text file.
            uncased/
                Wikipedia_skipgram_uncased.bin: fastText embedding in binary file.
                Wikipedia_skipgram_uncased.vec: fastText embedding in text file.
For usability and disk-space reasons, the files in the directories are compressed into a tar.gz file. Uncompress only those embeddings you want to use.
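As a reference for loading the uncompressed files, here is a minimal Python sketch; the archive and file paths are illustrative, and the `fasttext` and `gensim` packages are assumed to be installed:

```python
import tarfile

import fasttext
from gensim.models import KeyedVectors

# Extract one downloaded archive (illustrative file name).
with tarfile.open("wikipedia.tar.gz", "r:gz") as archive:
    archive.extractall(path=".")

# The .bin file is a full fastText model and can build vectors for
# out-of-vocabulary words from character n-grams.
model = fasttext.load_model("wikipedia/cbow/cased/Wikipedia_cbow_cased.bin")
print(model.get_word_vector("hospital")[:5])

# The .vec file is plain text in word2vec format (vocabulary words only).
vectors = KeyedVectors.load_word2vec_format(
    "wikipedia/cbow/cased/Wikipedia_cbow_cased.vec", binary=False
)
print(vectors.most_similar("hospital", topn=5))
```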
- SciELO Full-Text in Spanish: We retrieved all full-text articles available at SciELO.org (up to December 2018) and processed them into sentences. The SciELO.org node contains all Spanish articles, thus including both Latin American and European Spanish.
- Sentences: 3,267,556
- Tokens: 100,116,298
- Wikipedia Health: We retrieved all articles from the following Wikipedia categories: Pharmacology, Pharmacy, Medicine, and Biology. Data were retrieved during December 2018.
- Sentences: 4,030,833
- Tokens: 82,006,270
- SciELO + Wikipedia Health: We concatenated the previous two corpora.
We applied the following preprocessing steps (a Python sketch of these rules follows the list):

- Removal of non-Spanish sentences using the Lingua library. Lingua does not detect Spanish with high accuracy, so we worked in reverse: we removed English, German, French, Dutch, and Danish sentences from the retrieved Spanish corpora.
- Sentence splitting and tokenization using the FreeLing toolkit. FreeLing (Padró and Stanilovsky, 2012) is a C++ library providing language analysis functionalities (morphological analysis, named entity detection, PoS tagging, parsing, word sense disambiguation, semantic role labelling, and so forth) for a variety of languages.
- Final preprocessing using the Indra-Indexer toolkit:
  - Removing punctuation.
  - Replacing numbers with the placeholder `<NUMBER>`.
  - Setting the minimum acceptable token size to 1.
  - Setting the maximum acceptable token size to 100.
  - Lowercasing the tokens (we generated both cased and uncased versions).
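For illustration only, a rough Python sketch of these rules. It assumes the `lingua-language-detector` package (the Python port of Lingua) for the reverse language filtering, and the `normalize` function is our own stand-in for the Indra-Indexer rules, not the toolkit's actual API; the real pipeline tokenizes with FreeLing rather than splitting on whitespace:

```python
import re

# pip install lingua-language-detector (assumed Python port of Lingua)
from lingua import Language, LanguageDetectorBuilder

# Reverse filtering: Spanish detection itself was not accurate enough,
# so sentences detected as one of these five languages are dropped.
REMOVE = {Language.ENGLISH, Language.GERMAN, Language.FRENCH,
          Language.DUTCH, Language.DANISH}
detector = LanguageDetectorBuilder.from_languages(Language.SPANISH, *REMOVE).build()

def keep_sentence(sentence: str) -> bool:
    """Keep a sentence unless it is detected as a non-Spanish language."""
    return detector.detect_language_of(sentence) not in REMOVE

def normalize(sentence: str, lowercase: bool) -> list[str]:
    """Illustrative stand-in for the Indra-Indexer preprocessing rules."""
    tokens = []
    for token in sentence.split():
        if re.fullmatch(r"[\d.,]+", token):      # numbers -> placeholder
            token = "<NUMBER>"
        else:
            token = re.sub(r"[^\w]", "", token)  # remove punctuation
        if 1 <= len(token) <= 100:               # token size bounds
            tokens.append(token.lower() if lowercase else token)
    return tokens
```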
Corpora statistics after preprocessing:
| Corpora | # Sentences | # Unique Tokens (Cased) | # Unique Tokens (Uncased) |
|---|---|---|---|
| SciELO | 9,116,784 | 385,399 | 335,790 |
| Wikipedia Health | 4,623,438 | 255,057 | 222,954 |
| SciELO + Wikipedia Health | 13,740,222 | 503,647 | 435,652 |
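For reference, counts like those in the table can be reproduced with a short script over a preprocessed corpus file (one tokenized sentence per line; the file name is illustrative):

```python
# Count sentences and unique tokens (cased and uncased) in a corpus file.
cased, uncased = set(), set()
n_sentences = 0
with open("scielo_preprocessed.txt", encoding="utf-8") as corpus:
    for line in corpus:
        n_sentences += 1
        for token in line.split():
            cased.add(token)            # unique tokens (cased)
            uncased.add(token.lower())  # unique tokens (uncased)
print(n_sentences, len(cased), len(uncased))
```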
We used fastText to train the word embeddings. We kept the standard training options (a minimal training sketch follows the list):

- Minimum number of word occurrences: 5
- Phrase representation: no (i.e., length of word n-gram = 1)
- Minimum length of character n-gram: 3
- Maximum length of character n-gram: 6
- Size of word vectors: 300
- Epochs: 20
- Size of the context window: 5
- Word-representation modes: CBOW and Skip-gram
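A minimal sketch of this configuration using the official fastText Python bindings; the corpus path and output name are illustrative:

```python
import fasttext  # pip install fasttext

# Mirror of the options listed above; the corpus file is assumed to
# contain one preprocessed sentence per line.
model = fasttext.train_unsupervised(
    "scielo_wiki_uncased.txt",
    model="skipgram",  # or "cbow"
    minCount=5,        # minimum number of word occurrences
    wordNgrams=1,      # no phrase representation
    minn=3,            # minimum character n-gram length
    maxn=6,            # maximum character n-gram length
    dim=300,           # size of word vectors
    epoch=20,
    ws=5,              # size of the context window
)
model.save_model("SciELO+Wiki_skipgram_uncased.bin")
```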
We generated word embeddings for both corpus versions (cased and uncased):

- Wiki_cased
- Wiki_uncased
- Scielo_cased
- Scielo_uncased
- Scielo+Wiki_cased
- Scielo+Wiki_uncased
The evaluation of the original embeddings (version 1.0) was carried out both extrinsically (with a Named Entity Recognition framework) and intrinsically, with the three datasets already available for that task: UMNSRS-sim, UMNSRS-rel, and MayoSRS. In the NER scenario, we concluded that the best model was the one generated using Skip-gram, with 300 dimensions, trained on SciELO and Wikipedia. We projected the words using Principal Component Analysis.
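For illustration, the projection step can be reproduced with gensim and scikit-learn; the file name and word list below are placeholders:

```python
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

# Load a text-format embedding (placeholder file name).
vectors = KeyedVectors.load_word2vec_format(
    "SciELO+Wiki_skipgram_cased.vec", binary=False
)

# Project a few terms (placeholders) from 300 dimensions down to 2.
words = ["paciente", "hospital", "diagnóstico", "tratamiento"]
points = PCA(n_components=2).fit_transform([vectors[w] for w in words])
for word, (x, y) in zip(words, points):
    print(f"{word}: ({x:.3f}, {y:.3f})")
```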
Further details about evaluation and the steps performed can be found in our paper: Medical Word Embeddings for Spanish: Development and Evaluation.
The PCA plots for our embeddings (version 1.0) and for a general-domain embedding are also available in this repository.
The Spanish translations of the UMNSRS and MayoSRS datasets, in TSV format, are also available in this GitHub repository.
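As an example of how these TSV files can be used for intrinsic evaluation, here is a sketch that computes the Spearman correlation between the human similarity ratings and the embeddings' cosine similarities; the embedding and dataset file names, and the assumed TSV layout of one (term1, term2, score) triple per line, are placeholders:

```python
import csv

from gensim.models import KeyedVectors
from scipy.stats import spearmanr

# Placeholder embedding file; .vec files are word2vec text format.
vectors = KeyedVectors.load_word2vec_format("Scielo_skipgram_uncased.vec")

human, predicted = [], []
with open("UMNSRS_sim_es.tsv", encoding="utf-8") as tsv:
    for term1, term2, score in csv.reader(tsv, delimiter="\t"):
        if term1 in vectors and term2 in vectors:  # skip out-of-vocabulary terms
            human.append(float(score))
            predicted.append(vectors.similarity(term1, term2))

print(spearmanr(human, predicted))
```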
To evaluate the new version of the embeddings (v2.0), we replicated the extrinsic evaluation from the original paper. The new embeddings outperform the previous ones by up to 1.68 points, while also fixing some other minor issues.
| Version | Method | Corpora | Validation (Cased) | Test (Cased) | Validation (Uncased) | Test (Uncased) |
|---|---|---|---|---|---|---|
| v1.0 | Skip-gram | Wiki | 88.55 | 87.78 | - | - |
| v1.0 | Skip-gram | SciELO | 89.47 | 87.31 | - | - |
| v1.0 | Skip-gram | SciELO+Wiki | 89.42 | 88.17 | - | - |
| v2.0 | CBOW | Wiki | 86.55 | 85.46 | 85.12 | 85.74 |
| v2.0 | CBOW | SciELO | 88.11 | 87.75 | 89.99 | 87.24 |
| v2.0 | CBOW | SciELO+Wiki | 87.95 | 87.78 | 86.56 | 88.10 |
| v2.0 | Skip-gram | Wiki | 88.82 | 87.16 | 88.77 | 87.21 |
| v2.0 | Skip-gram | SciELO | 89.66 | 88.77 | 89.07 | 89.17 |
| v2.0 | Skip-gram | SciELO+Wiki | 88.76 | 88.64 | 89.71 | 89.74 |
If you get a Unicode/decode error, load the embedding files with an explicit encoding argument, as shown below:

    # Read the .vec text file explicitly as UTF-8 to avoid decode errors.
    f = open(filename, "r", encoding="utf-8")
Please cite our paper if you use these embeddings in your experiments or projects:
    @inproceedings{soares2019medical,
      title={Medical word embeddings for Spanish: Development and evaluation},
      author={Soares, Felipe and Villegas, Marta and Gonzalez-Agirre, Aitor and Krallinger, Martin and Armengol-Estap{\'e}, Jordi},
      booktitle={Proceedings of the 2nd Clinical Natural Language Processing Workshop},
      pages={124--133},
      year={2019}
    }
Siamak Barzegar ([email protected])
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2020 Secretaría de Estado de Digitalización e Inteligencia Artificial (SEDIA)
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.

In no event shall the owner of the models (SEDIA – State Secretariat for Digitalization and Artificial Intelligence) nor the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.