This repository contains Python scripts to generate text representations using different techniques. It provides options to create TF-IDF representations, BioSentVec representations, and BioWordVec representations for input data. It takes input data from UniProt and PubMed sources and generates vector representations for each entry.
The dataset is temporarily limited to 20 entries to make testing easier. PCA analysis of the TF-IDF vectors is also disabled, because this limit is smaller than the number of PCA components.
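To illustrate why the 20-entry limit rules out PCA, here is a minimal sketch, not the repository's code. It relies on the fact that scikit-learn's PCA rejects an n_components value larger than min(n_samples, n_features). The function name usable_pca_sizes and the example feature count are hypothetical; the four sizes are the PCA outputs this README lists.

```python
# PCA sizes named in this README's output description.
PCA_SIZES = (256, 512, 1024, 2048)


def usable_pca_sizes(n_samples, n_features, sizes=PCA_SIZES):
    """Return the PCA sizes that fit a matrix of shape (n_samples, n_features).

    Hypothetical helper: a PCA size is only usable when it does not exceed
    min(n_samples, n_features).
    """
    limit = min(n_samples, n_features)
    return [size for size in sizes if size <= limit]


if __name__ == "__main__":
    # With the temporary 20-entry limit, no listed size can be computed.
    print(usable_pca_sizes(n_samples=20, n_features=5000))
```

With 20 entries the list is empty, which matches PCA being disabled while the limit is in place.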
- Python 3.7.3
- pandas 1.1.4
- scikit-learn (imported as sklearn)
- os (Python standard library)
- fasttext
- string (Python standard library)
- nltk
- sent2vec
UniProt and PubMed files must be in text format and named with the UniProt IDs. Download the files from the URLs given below and unzip them into the data folder.
https://drive.google.com/file/d/1jZJiL6R9c4hsxh_k5pCBsX6LG1zzbITX/view?usp=drive_link
https://drive.google.com/file/d/1BwU2DXCXdtHGxtY1TlQxTuNbc7xVBzDp/view?usp=drive_link
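As a rough sketch of how files named by UniProt ID could be read into memory, the example below maps each file name (without its extension) to the file's text. The function name load_text_files, the UTF-8 encoding, and the assumption that the file stem is the UniProt ID are all illustrative and may not match what createtextrep.py does.

```python
from pathlib import Path


def load_text_files(folder):
    """Map each file's stem (assumed to be a UniProt ID) to its text content.

    Hypothetical helper: reads every regular file directly inside `folder`.
    """
    texts = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # errors="replace" keeps loading even if a file has stray bytes.
            texts[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return texts
```

Calling load_text_files on the unzipped UniProt folder and again on the PubMed folder would give two dictionaries keyed by the same IDs, which makes it easy to pair the two texts for each entry.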
The biosentvec and biowordvec models must be downloaded to the "models" folder from the URLs below. Alternatively, set the model_download parameter to "y" to download the models automatically when biosentvec or biowordvec representations are selected.
https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioSentVec_PubMed_MIMICIII-bigram_d700.bin
https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin
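The sketch below shows one way to check whether the two model files are already in the models folder and to fetch any that are missing. It only illustrates the idea behind the model_download option; the helper names missing_models and download_missing are hypothetical, and the script's own download logic may differ. The model files are large, so a download can take a long time.

```python
import urllib.request
from pathlib import Path

# File names and URLs taken from the links in this README.
MODEL_URLS = {
    "BioSentVec_PubMed_MIMICIII-bigram_d700.bin":
        "https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/"
        "BioSentVec_PubMed_MIMICIII-bigram_d700.bin",
    "BioWordVec_PubMed_MIMICIII_d200.bin":
        "https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/"
        "BioWordVec_PubMed_MIMICIII_d200.bin",
}


def missing_models(models_dir="models"):
    """Return the names of model files not yet present in models_dir."""
    return [name for name in MODEL_URLS
            if not (Path(models_dir) / name).is_file()]


def download_missing(models_dir="models"):
    """Download every missing model file (hypothetical helper, large files)."""
    Path(models_dir).mkdir(parents=True, exist_ok=True)
    for name in missing_models(models_dir):
        target = Path(models_dir) / name
        urllib.request.urlretrieve(MODEL_URLS[name], str(target))
```

Running missing_models() first shows what would be downloaded without starting any transfer.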
The script lets users choose which types of text representation to create (TFIDF, biosentvec, and biowordvec). Specifying the -a or --all option creates all representation types.
-tfidf or --tfidf: Creates TFIDF representations.
-bsv or --biosentvec: Creates biosentvec representations.
-bwv or --biowordvec: Creates biowordvec representations.
-upfp or --uniprotfilespath: Specifies the path for the uniprot files. This option is required.
-pmfp or --pubmedfilespath: Specifies the path for the pubmed files. This option is required.
-a or --all: Creates all types of representations (TFIDF, biosentvec, and biowordvec).
-mdw or --model_download: Downloads the biosentvec and biowordvec models automatically.
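For readers who want to see how options like these fit together, the following is a minimal argparse sketch that mirrors the option names listed above. It is an assumption about how such a command line could be declared, not a copy of createtextrep.py. The function names build_parser and selected_representations, and the default "n" for model_download, are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options documented in this README."""
    parser = argparse.ArgumentParser(description="Create text representations")
    parser.add_argument("-tfidf", "--tfidf", action="store_true",
                        help="create TFIDF representations")
    parser.add_argument("-bsv", "--biosentvec", action="store_true",
                        help="create biosentvec representations")
    parser.add_argument("-bwv", "--biowordvec", action="store_true",
                        help="create biowordvec representations")
    parser.add_argument("-upfp", "--uniprotfilespath", required=True,
                        help="path for the uniprot files")
    parser.add_argument("-pmfp", "--pubmedfilespath", required=True,
                        help="path for the pubmed files")
    parser.add_argument("-a", "--all", action="store_true",
                        help="create all types of representations")
    # Assumed default: models are not downloaded unless "y" is passed.
    parser.add_argument("-mdw", "--model_download", default="n",
                        help='"y" to download the models automatically')
    return parser


def selected_representations(args):
    """Return the representation types requested by the parsed options."""
    names = ["tfidf", "biosentvec", "biowordvec"]
    if args.all:
        return names
    return [name for name in names if getattr(args, name)]
```

For example, parsing `-a -upfp data/uniprot -pmfp data/pubmed -mdw y` with this sketch selects all three representation types.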
Users who have already installed Hoper do not need to perform the following operations.
If you have not installed Hoper, follow the steps below to run text representation generation.
Step by step operation:
- Clone repository
- Install dependencies (given above)
- Download the biosentvec and biowordvec models to the models folder
- Download and unzip the uniprot and pubmed files to the data folder
- Run the script
Examples:
- To create TF-IDF representations:
python createtextrep.py --tfidf -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y
- To create biosentvec representations:
python createtextrep.py -bsv -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y
- To create biowordvec representations:
python createtextrep.py -bwv -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y
- To create all three representations:
python createtextrep.py -a -upfp /path/to/uniprot/files -pmfp /path/to/pubmed/files -mdw y
The script will load the text files and perform the selected actions based on the provided options. The output will be generated in the following manner:
- If the -tfidf option is selected, a csv file including TF-IDF vectors and four csv files (PCA 256, PCA 512, PCA 1024 and PCA 2048) including vectors generated by PCA analysis will be created and saved.
- If the -bsv option is selected, a csv file including biosentvec vectors will be created and saved.
- If the -bwv option is selected, a csv file including biowordvec vectors will be created and saved.
Copyright (C)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.