Skip to content

Latest commit

 

History

History
145 lines (122 loc) · 5.33 KB

README.md

File metadata and controls

145 lines (122 loc) · 5.33 KB

NPDR: A Dataset of Negative Human Phenotype-Disease Relations

The Negative Phenotype-Disease Relations (NPDR) dataset describes a subset of negative disease-phenotype relations from a gold-standard knowledge base made available by the Human Phenotype Ontology. The NPDR dataset was constructed by analysing 177 medical documents, and consists of 347 manually annotated at the document-level relations, from which 222 are inferred from the HPO gold-standard knowledge base, and 125 are new annotated relations. The dataset is available here.

In order to automatically annotate the entities mentioned in the NPDR dataset and extract their negative relations, an automatic extraction system was developed. If you intend to annotate entities using the lexica generated from the NPDR dataset and extract negative relations from biomedical documents, you can follow the below guidelines.

Dependencies

Getting Started

cd bin/
git clone https://github.com/lasigeBioTM/MER 

cd ../corpora/
git clone https://github.com/pdfminer/pdfminer.six
git clone https://github.com/lasigeBioTM/MER 

Preparing the Biomedical Documents

There are two approaches that can be used to gather the biomedical documents:

  1. By automatically retrieving PubMed articles using the Entrez Programming Utilities (E-utilities) program (Corpus A). A list of PMIDs from the NPDR dataset is provided by the pmids.txt file.
  2. By converting PDF articles into machine-readable text format using the PDFMiner text converter tool (Corpus B).

If you intend to automatically retrieve the biomedical documents run:

 python3 src/pubmed.py
  • Creates:
    • corpora/corpus_A/articles
    • corpora/corpus_A/abstracts

If you intend to convert PDF documents, place the documents in the PDF_files directory and run:

 python3 src/pdf2text.py

Annotating Genes, Diseases, Human Phenotypes and Relations

  1. If using Corpus A run:
 python3 src/annotations_corpus_A.py
  • Creates:
    • corpora/corpus_A/phenotypes/
    • corpora/corpus_A/phenotype_synonyms/
    • corpora/corpus_A/abstract_genes/
    • corpora/corpus_A/article_genes/
    • corpora/corpus_A/abstract_diseases/
    • corpora/corpus_A/article_diseases/
    • corpora/corpus_A/abstract_disease_abbreviations/
    • corpora/corpus_A/article_disease_abbreviations/
    • corpora/corpus_A/abstract_disease_synonyms/
    • corpora/corpus_A/article_disease_synonyms/
    • corpora/corpus_A/final_annotations/
    • corpora/corpus_A/negation_in_articles/
    • corpora/corpus_A/relations_corpus_A.tsv
  1. If using Corpus B run:
 python3 src/annotations_corpus_B.py
  • Creates:
    • corpora/corpus_B/phenotypes/
    • corpora/corpus_B/phenotype_synonyms/
    • corpora/corpus_B/genes/
    • corpora/corpus_B/diseases/
    • corpora/corpus_B/disease_abbreviations/
    • corpora/corpus_B/disease_synonyms/
    • corpora/corpus_B/final_annotations_corpus_B/
    • corpora/corpus_B/relations_corpus_B.tsv

Configuration

  • bin/

    • MER/
      • data/
        • diseases.txt
        • diseases_links.tsv
        • hpsynonym.txt
        • hpsynonym_links.tsv
        • hp.owl
        • hp_links.tsv
    • geniass/
  • corpora/

    • MER/
      • data/
        • diseaseabbreviations.txt
        • diseaseabbreviations_links.tsv
        • diseasesynonyms.txt
        • diseasesynonyms_links.tsv
        • genes.txt
        • genes_links.tsv
    • corpus_A/
      • abstract_disease_abbreviations/
      • abstract_disease_synonyms/
      • abstract_diseases/
      • abstract_genes/
      • abstracts/
      • article_disease_abbreviations/
      • article_disease_synonyms/
      • article_diseases
      • article_genes
      • articles/
      • final_annotations/
      • negation_in_articles/
      • phenotype_synonyms/
      • phenotypes/
    • corpus_B/
      • articles/
      • disease_abbreviations/
      • disease_synonyms/
      • diseases/
      • final_annotations/
      • genes/
      • negation_in_articles/
      • phenotype_synonyms/
      • phenotypes/
  • data/

    • get_entities.sh
    • phenotype_annotation_negated.txt
    • pmids.txt
  • src/

    • annotations_corpus_A.py
    • annotations_corpus_B.py
    • pdf2text.py
    • pubmed.py