The Negative Phenotype-Disease Relations (NPDR) dataset describes a subset of negative disease-phenotype relations from a gold-standard knowledge base made available by the Human Phenotype Ontology (HPO). The NPDR dataset was constructed by analysing 177 medical documents and consists of 347 relations manually annotated at the document level, of which 222 are inferred from the HPO gold-standard knowledge base and 125 are newly annotated. The dataset is available here.
To automatically annotate the entities mentioned in the NPDR dataset and extract their negative relations, an automatic extraction system was developed. If you intend to annotate entities using the lexica generated from the NPDR dataset and extract negative relations from biomedical documents, follow the guidelines below.
- Python >= 3.8
- Pre-processing:
- Term Recognition:
- MER (Minimal Named-Entity Recognizer) (Phenotype, Disease and Gene Entities)
- Relation Extraction:
- Human Phenotype Ontology Gold Standard Negative Relations (Knowledge Base)
cd bin/
git clone https://github.com/lasigeBioTM/MER
cd ../corpora/
git clone https://github.com/pdfminer/pdfminer.six
git clone https://github.com/lasigeBioTM/MER
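MER is run through its get_entities.sh script and, according to MER's own documentation, prints one match per line as tab-separated fields. The sketch below shows one way such output could be parsed in Python. The field layout (start offset, end offset, matched term, optional identifier) and the names `Match` and `parse_mer_output` are assumptions made for illustration; they are not taken from this repository's scripts.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    """One entity match (hypothetical container, not part of MER)."""
    start: int
    end: int
    term: str
    identifier: Optional[str]


def parse_mer_output(output: str) -> List[Match]:
    """Parse tab-separated MER-style lines: start, end, term, optional identifier."""
    matches = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            # Skip blank or malformed lines instead of failing.
            continue
        identifier = fields[3] if len(fields) > 3 else None
        matches.append(Match(int(fields[0]), int(fields[1]), fields[2], identifier))
    return matches
```

The identifier column is treated as optional because it is only expected when the lexicon has an accompanying links file, such as hp_links.tsv.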
There are two approaches that can be used to gather the biomedical documents:
- By automatically retrieving PubMed articles using the Entrez Programming Utilities (E-utilities) program (Corpus A). A list of PMIDs from the NPDR dataset is provided by the pmids.txt file.
- By converting PDF articles into machine-readable text format using the PDFMiner text converter tool (Corpus B).
If you intend to automatically retrieve the biomedical documents, run:
python3 src/pubmed.py
- Creates:
- corpora/corpus_A/articles
- corpora/corpus_A/abstracts
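For readers unfamiliar with E-utilities, the following is a minimal sketch of how an abstract could be requested for each PMID through the public EFetch endpoint. It is not the code in src/pubmed.py, which may work differently. The helper names and the assumption that pmids.txt holds one PMID per line are illustrative only.

```python
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_pmids(text: str) -> List[str]:
    """Return non-empty lines, assuming one PMID per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_efetch_url(pmid: str) -> str:
    """Build an EFetch URL that requests the plain-text abstract for one PMID."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_abstract(pmid: str) -> str:
    """Download the abstract text for one PMID (requires network access)."""
    with urlopen(build_efetch_url(pmid)) as response:
        return response.read().decode("utf-8")
```

When fetching many PMIDs in a loop, pause between requests to stay within NCBI's usage limits.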
If you intend to convert PDF documents, place the documents in the PDF_files directory and run:
python3 src/pdf2text.py
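As a rough illustration of the conversion step, the sketch below uses the `extract_text` function from pdfminer.six to turn every PDF in a directory into a text file. It is not the code in src/pdf2text.py. The function names and the choice to name each output after the PDF's file stem are assumptions.

```python
from pathlib import Path


def text_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map a PDF path to a .txt path with the same stem (assumed naming scheme)."""
    return out_dir / (pdf_path.stem + ".txt")


def convert_pdfs(pdf_dir: Path, out_dir: Path) -> int:
    """Convert every *.pdf in pdf_dir to text and return the number converted."""
    # Imported here so the helper above works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        text = extract_text(str(pdf_path))
        text_path_for(pdf_path, out_dir).write_text(text, encoding="utf-8")
        count += 1
    return count
```

A call such as `convert_pdfs(Path("PDF_files"), Path("corpora/corpus_B/articles"))` would follow the directory names used in this document, although the exact location of both directories is an assumption.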
- If using Corpus A, run:
python3 src/annotations_corpus_A.py
- Creates:
- corpora/corpus_A/phenotypes/
- corpora/corpus_A/phenotype_synonyms/
- corpora/corpus_A/abstract_genes/
- corpora/corpus_A/article_genes/
- corpora/corpus_A/abstract_diseases/
- corpora/corpus_A/article_diseases/
- corpora/corpus_A/abstract_disease_abbreviations/
- corpora/corpus_A/article_disease_abbreviations/
- corpora/corpus_A/abstract_disease_synonyms/
- corpora/corpus_A/article_disease_synonyms/
- corpora/corpus_A/final_annotations/
- corpora/corpus_A/negation_in_articles/
- corpora/corpus_A/relations_corpus_A.tsv
- If using Corpus B, run:
python3 src/annotations_corpus_B.py
- Creates:
- corpora/corpus_B/phenotypes/
- corpora/corpus_B/phenotype_synonyms/
- corpora/corpus_B/genes/
- corpora/corpus_B/diseases/
- corpora/corpus_B/disease_abbreviations/
- corpora/corpus_B/disease_synonyms/
- corpora/corpus_B/final_annotations_corpus_B/
- corpora/corpus_B/relations_corpus_B.tsv
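After an annotation run it can be useful to confirm that every expected output exists. The sketch below checks the Corpus A outputs listed above. The helper `missing_outputs` is hypothetical and not part of the repository; the same check works for Corpus B with its own list of paths.

```python
from pathlib import Path
from typing import List

# Outputs that annotations_corpus_A.py is documented to create.
CORPUS_A_OUTPUTS = [
    "corpora/corpus_A/phenotypes",
    "corpora/corpus_A/phenotype_synonyms",
    "corpora/corpus_A/abstract_genes",
    "corpora/corpus_A/article_genes",
    "corpora/corpus_A/abstract_diseases",
    "corpora/corpus_A/article_diseases",
    "corpora/corpus_A/abstract_disease_abbreviations",
    "corpora/corpus_A/article_disease_abbreviations",
    "corpora/corpus_A/abstract_disease_synonyms",
    "corpora/corpus_A/article_disease_synonyms",
    "corpora/corpus_A/final_annotations",
    "corpora/corpus_A/negation_in_articles",
    "corpora/corpus_A/relations_corpus_A.tsv",
]


def missing_outputs(root: Path, expected: List[str]) -> List[str]:
    """Return the expected paths that do not exist under root."""
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_outputs(Path("."), CORPUS_A_OUTPUTS)` from the repository root returns an empty list when all outputs are present.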
-
- MER/
- data/
- diseases.txt
- diseases_links.tsv
- hpsynonym.txt
- hpsynonym_links.tsv
- hp.owl
- hp_links.tsv
- data/
- geniass/
- MER/
-
- MER/
- data/
- diseaseabbreviations.txt
- diseaseabbreviations_links.tsv
- diseasesynonyms.txt
- diseasesynonyms_links.tsv
- genes.txt
- genes_links.tsv
- data/
- corpus_A/
- abstract_disease_abbreviations/
- abstract_disease_synonyms/
- abstract_diseases/
- abstract_genes/
- abstracts/
- article_disease_abbreviations/
- article_disease_synonyms/
- article_diseases/
- article_genes/
- articles/
- final_annotations/
- negation_in_articles/
- phenotype_synonyms/
- phenotypes/
- corpus_B/
- articles/
- disease_abbreviations/
- disease_synonyms/
- diseases/
- final_annotations/
- genes/
- negation_in_articles/
- phenotype_synonyms/
- phenotypes/
- MER/
-
- get_entities.sh
- phenotype_annotation_negated.txt
- pmids.txt
-
- annotations_corpus_A.py
- annotations_corpus_B.py
- pdf2text.py
- pubmed.py