NPDR: A Dataset of Negative Human Phenotype-Disease Relations

The Negative Phenotype-Disease Relations (NPDR) dataset describes a subset of negative disease-phenotype relations from a gold-standard knowledge base made available by the Human Phenotype Ontology (HPO). The dataset was built by analysing 177 medical documents and consists of 347 relations manually annotated at the document level: 222 are inferred from the HPO gold-standard knowledge base, and 125 are newly annotated relations. The dataset is available here.

An automatic extraction system was developed to annotate the entities mentioned in the NPDR dataset and extract their negative relations. To annotate entities using the lexica generated from the NPDR dataset and extract negative relations from biomedical documents, follow the guidelines below.

Dependencies

Python >= 3.8
Pre-processing:
- PDFMiner
- Genia Sentence Splitter
Term Recognition:
- MER (Minimal Named-Entity Recognizer) (Phenotype, Disease and Gene Entities)
Relation Extraction:
- Human Phenotype Ontology Gold Standard Negative Relations (Knowledge Base)

Getting Started

cd bin/
git clone https://github.com/lasigeBioTM/MER 

cd ../corpora/
git clone https://github.com/pdfminer/pdfminer.six
git clone https://github.com/lasigeBioTM/MER

Preparing the Biomedical Documents

There are two approaches to gathering the biomedical documents:

- Automatically retrieve PubMed articles using the Entrez Programming Utilities (E-utilities) (Corpus A). The pmids.txt file provides the list of PMIDs from the NPDR dataset.
- Convert PDF articles into machine-readable text using the PDFMiner text converter tool (Corpus B).

To retrieve the biomedical documents automatically, run:

 python3 src/pubmed.py

Creates:
- corpora/corpus_A/articles
- corpora/corpus_A/abstracts
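
The retrieval step can be sketched roughly as follows. This is a minimal illustration and not the code in src/pubmed.py: it assumes pmids.txt holds one PMID per line, and the helper names `read_pmids`, `build_efetch_url` and `fetch_abstracts` are hypothetical. It requests plain-text abstracts from the public E-utilities `efetch` endpoint.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_pmids(lines):
    """Return stripped, non-empty PMIDs (assumption: one PMID per line)."""
    return [line.strip() for line in lines if line.strip()]


def build_efetch_url(pmids):
    """Build an efetch URL asking for plain-text PubMed abstracts."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_abstracts(pmids):
    """Download the abstracts for the given PMIDs as one text string."""
    with urlopen(build_efetch_url(pmids)) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_abstracts(read_pmids(open("data/pmids.txt")))` would return the abstracts as a single string. For long PMID lists it is probably safer to send the identifiers in small batches and pause between requests.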

To convert PDF documents, place them in the PDF_files directory and run:

 python3 src/pdf2text.py
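
As an illustration of the conversion step, the sketch below uses the `extract_text` function from pdfminer.six. It is not the code in src/pdf2text.py: the helper names `text_path_for` and `convert_pdfs` are hypothetical, and writing one .txt file per PDF into corpora/corpus_B/articles/ is an assumption based on the directory layout.

```python
from pathlib import Path


def text_path_for(pdf_path, out_dir):
    """Map a PDF path to a .txt path with the same base name in out_dir."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def convert_pdfs(pdf_dir, out_dir):
    """Convert every *.pdf in pdf_dir to UTF-8 text files in out_dir."""
    # Imported here so the path helper works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        target = text_path_for(pdf_path, out_dir)
        target.write_text(extract_text(str(pdf_path)), encoding="utf-8")
        written.append(target)
    return written
```

A call such as `convert_pdfs("PDF_files", "corpora/corpus_B/articles")` would then return the list of text files it wrote.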

Annotating Genes, Diseases, Human Phenotypes and Relations

If using Corpus A, run:

 python3 src/annotations_corpus_A.py

Creates:
- corpora/corpus_A/phenotypes/
- corpora/corpus_A/phenotype_synonyms/
- corpora/corpus_A/abstract_genes/
- corpora/corpus_A/article_genes/
- corpora/corpus_A/abstract_diseases/
- corpora/corpus_A/article_diseases/
- corpora/corpus_A/abstract_disease_abbreviations/
- corpora/corpus_A/article_disease_abbreviations/
- corpora/corpus_A/abstract_disease_synonyms/
- corpora/corpus_A/article_disease_synonyms/
- corpora/corpus_A/final_annotations/
- corpora/corpus_A/negation_in_articles/
- corpora/corpus_A/relations_corpus_A.tsv

If using Corpus B, run:

 python3 src/annotations_corpus_B.py

Creates:
- corpora/corpus_B/phenotypes/
- corpora/corpus_B/phenotype_synonyms/
- corpora/corpus_B/genes/
- corpora/corpus_B/diseases/
- corpora/corpus_B/disease_abbreviations/
- corpora/corpus_B/disease_synonyms/
- corpora/corpus_B/final_annotations_corpus_B/
- corpora/corpus_B/relations_corpus_B.tsv
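
The resulting relations files can be inspected with a few lines of Python. The sketch below makes no assumption about the column layout, which is not described here: it only splits each line on tabs and skips blank lines. The function name `read_relations` is hypothetical.

```python
import csv


def read_relations(lines):
    """Return each non-empty tab-separated row as a list of strings."""
    # QUOTE_NONE keeps quotation marks in biomedical text untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]
```

For example, opening corpora/corpus_B/relations_corpus_B.tsv with `open(path, newline="", encoding="utf-8")` and passing the file object to `read_relations` gives one list per extracted relation.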

Configuration

bin/
- MER/
  - data/
    - diseases.txt
    - diseases_links.tsv
    - hpsynonym.txt
    - hpsynonym_links.tsv
    - hp.owl
    - hp_links.tsv
- geniass/
corpora/
- MER/
  - data/
    - diseaseabbreviations.txt
    - diseaseabbreviations_links.tsv
    - diseasesynonyms.txt
    - diseasesynonyms_links.tsv
    - genes.txt
    - genes_links.tsv
- corpus_A/
  - abstract_disease_abbreviations/
  - abstract_disease_synonyms/
  - abstract_diseases/
  - abstract_genes/
  - abstracts/
  - article_disease_abbreviations/
  - article_disease_synonyms/
  - article_diseases/
  - article_genes/
  - articles/
  - final_annotations/
  - negation_in_articles/
  - phenotype_synonyms/
  - phenotypes/
- corpus_B/
  - articles/
  - disease_abbreviations/
  - disease_synonyms/
  - diseases/
  - final_annotations/
  - genes/
  - negation_in_articles/
  - phenotype_synonyms/
  - phenotypes/
data/
- get_entities.sh
- phenotype_annotation_negated.txt
- pmids.txt
src/
- annotations_corpus_A.py
- annotations_corpus_B.py
- pdf2text.py
- pubmed.py
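
Before running the scripts, it can help to confirm that the layout above is in place. The following is a small sketch under that assumption: the function name `missing_paths` is hypothetical, and `EXPECTED_PATHS` lists only a subset of the entries shown above.

```python
from pathlib import Path

# A subset of the layout listed above; extend as needed.
EXPECTED_PATHS = [
    "bin/MER/data",
    "bin/geniass",
    "corpora/MER/data",
    "data/get_entities.sh",
    "data/phenotype_annotation_negated.txt",
    "data/pmids.txt",
    "src/pubmed.py",
    "src/pdf2text.py",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [path for path in expected if not (root / path).exists()]
```

Calling `missing_paths(".")` from the repository root returns an empty list when every listed file and directory is present.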
