representation_generation

This repository contains Python scripts to generate text representations using different techniques. It provides options to create TF-IDF representations, BioSentVec representations, and BioWordVec representations for input data. It takes input data from UniProt and PubMed sources and generates vector representations for each entry.

The dataset is temporarily limited to 20 entries to make testing easier. PCA for the TF-IDF vectors is also disabled for now, because the number of entries is smaller than the number of PCA components.
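The PCA restriction comes from PCA itself: it cannot produce more components than min(number of entries, number of features). The sketch below only illustrates that constraint. The helper name `feasible_pca_sizes` and the feature count of 5000 are assumptions for illustration, not part of this repository; the four sizes are the ones listed under "Definition of Output".

```python
def feasible_pca_sizes(n_samples, n_features, requested=(256, 512, 1024, 2048)):
    """Return the requested PCA sizes that the data can support.

    PCA yields at most min(n_samples, n_features) components,
    so any larger request has to be skipped.
    """
    limit = min(n_samples, n_features)
    return [k for k in requested if k <= limit]


# With the temporary 20-entry limit, none of the four sizes is feasible.
print(feasible_pca_sizes(n_samples=20, n_features=5000))   # []
# With a larger, hypothetical dataset some sizes become available.
print(feasible_pca_sizes(n_samples=600, n_features=5000))  # [256, 512]
```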

Dependencies

  1. Python 3.7.3
  2. pandas 1.1.4
  3. scikit-learn (imported as sklearn)
  4. os (Python standard library)
  5. fasttext
  6. string (Python standard library)
  7. nltk
  8. sent2vec

Data

UniProt and PubMed files must be in text format and named after their UniProt IDs. Download the files from the URLs below and unzip them into the data folder.

https://drive.google.com/file/d/1jZJiL6R9c4hsxh_k5pCBsX6LG1zzbITX/view?usp=drive_link

https://drive.google.com/file/d/1BwU2DXCXdtHGxtY1TlQxTuNbc7xVBzDp/view?usp=drive_link
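Because each file is named after a UniProt ID, the ID can be recovered from the file name when the texts are loaded. The following is a minimal sketch, not the repository's loader: the function name `load_texts`, the `.txt` extension, and UTF-8 encoding are all assumptions.

```python
from pathlib import Path


def load_texts(folder):
    """Map each ID (file name without extension) to the text of that file.

    Assumes one plain-text file per entry, named like <UniProt ID>.txt.
    """
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        texts[path.stem] = path.read_text(encoding="utf-8")
    return texts
```

Calling `load_texts` once on the UniProt folder and once on the PubMed folder would give two dictionaries that share the same keys wherever both sources cover an entry.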

Models

The biosentvec and biowordvec models must be downloaded to the "models" folder from the URLs below. Alternatively, set the model_download parameter to "y" to download the models automatically when biosentvec or biowordvec representations are selected.

https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioSentVec_PubMed_MIMICIII-bigram_d700.bin

https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin
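For a manual download, a small script can fetch only the model files that are not already present. This is a sketch under assumptions, not the script's built-in download logic: the function names are hypothetical, and the target file names are simply taken from the last part of each URL.

```python
from pathlib import Path
from urllib.request import urlretrieve

MODEL_URLS = [
    "https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioSentVec_PubMed_MIMICIII-bigram_d700.bin",
    "https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin",
]


def missing_models(models_dir="models", urls=MODEL_URLS):
    """Return (url, target_path) pairs for model files not yet on disk."""
    models_dir = Path(models_dir)
    pairs = [(url, models_dir / url.rsplit("/", 1)[-1]) for url in urls]
    return [(url, path) for url, path in pairs if not path.exists()]


def download_models(models_dir="models", urls=MODEL_URLS):
    """Download only the missing model files; the files are large."""
    Path(models_dir).mkdir(parents=True, exist_ok=True)
    for url, path in missing_models(models_dir, urls):
        urlretrieve(url, str(path))
```

Checking for existing files first avoids repeating a long download when only one of the two models is missing.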

Options

The script accepts the options below. Select one or more representation types (TF-IDF, biosentvec, biowordvec), or pass -a or --all to create all of them.

-tfidf or --tfidf: Creates TFIDF representations.

-bsv or --biosentvec: Creates biosentvec representations.

-bwv or --biowordvec: Creates biowordvec representations.

-upfp or --uniprotfilespath: Specifies the path for the uniprot files. This option is required.

-pmfp or --pubmedfilespath: Specifies the path for the pubmed files. This option is required.

-a or --all: Creates all types of representations (TFIDF, biosentvec, and biowordvec).

-mdw or --model_download: Set to "y" to download the biosentvec and biowordvec models automatically.
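The options above could be declared with argparse roughly as follows. This is an illustrative sketch, not the code in createtextrep.py: the function names, the store_true actions, the "y"/"n" choices, and the default of "n" are assumptions; only the option strings come from the list above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options."""
    parser = argparse.ArgumentParser(description="Generate text representations")
    parser.add_argument("-tfidf", "--tfidf", action="store_true",
                        help="create TF-IDF representations")
    parser.add_argument("-bsv", "--biosentvec", action="store_true",
                        help="create biosentvec representations")
    parser.add_argument("-bwv", "--biowordvec", action="store_true",
                        help="create biowordvec representations")
    parser.add_argument("-a", "--all", action="store_true",
                        help="create all three representation types")
    parser.add_argument("-upfp", "--uniprotfilespath", required=True,
                        help="path to the UniProt text files")
    parser.add_argument("-pmfp", "--pubmedfilespath", required=True,
                        help="path to the PubMed text files")
    parser.add_argument("-mdw", "--model_download", choices=["y", "n"], default="n",
                        help="download the models automatically if set to y")
    return parser


def selected_representations(args):
    """Return the representation types requested on the command line."""
    names = ("tfidf", "biosentvec", "biowordvec")
    if args.all:
        return list(names)
    return [name for name in names if getattr(args, name)]


args = build_parser().parse_args(["-bsv", "-upfp", "data/uniprot", "-pmfp", "data/pubmed"])
print(selected_representations(args))  # ['biosentvec']
```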

How to Run

If you have installed Hoper, you do not need to perform the steps below.

If you have not installed Hoper, follow these steps to run text representation generation.

Step by step operation:

  1. Clone the repository
  2. Install the dependencies (given above)
  3. Download the biosentvec and biowordvec models to the models folder
  4. Download and unzip the UniProt and PubMed files to the data folder
  5. Run the script

Examples:

  1. To create TF-IDF representations:
python createtextrep.py --tfidf -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y
  2. To create biosentvec representations:
python createtextrep.py -bsv -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y
  3. To create biowordvec representations:
python createtextrep.py -bwv -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y
  4. To create all three representations:
python createtextrep.py -a -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y

Definition of Output

The script will load the text files and perform the selected actions based on the provided options. The output will be generated in the following manner:

  • If the -tfidf option is selected, a csv file including TF-IDF vectors and four csv files (PCA 256, PCA 512, PCA 1024 and PCA 2048) including vectors generated by PCA will be created and saved. While the temporary 20-entry limit is in place, PCA is disabled and the PCA files are not produced.

  • If the -bsv option is selected, a csv file including biosentvec vectors will be created and saved.

  • If the -bwv option is selected, a csv file including biowordvec vectors will be created and saved.
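To make the idea of "a csv file including vectors" concrete, the sketch below writes one row per entry: the ID followed by the vector components. The actual column layout and file names used by the script are not described here, so this layout and the function name `save_vectors` are assumptions for illustration only.

```python
import csv


def save_vectors(path, vectors):
    """Write {entry_id: [floats]} as rows of: id, v0, v1, ..."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for entry_id, vector in vectors.items():
            writer.writerow([entry_id, *vector])
```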

License

Copyright (C)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.