This repository aims to reproduce the results from this paper. The project uses TensorFlow, scikit-learn, NumPy, pandas and NLTK. The model achieved an F1 score of 0.83 on the CV dataset. The dataset used is UniProtKB/Swiss-Prot. Families with fewer than 200 examples and sequences longer than 1000 were removed during preprocessing. GloVe was used to create the embeddings.
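The preprocessing filter described above can be sketched in a few lines. This is a minimal illustration, not the code in this repository: the function name `filter_records`, the `(family, sequence)` record layout, and the order of the two filters (length first, then family size) are assumptions.

```python
from collections import Counter

MIN_FAMILY_SIZE = 200       # families with fewer examples are dropped
MAX_SEQUENCE_LENGTH = 1000  # sequences longer than this are dropped


def filter_records(records, min_family_size=MIN_FAMILY_SIZE,
                   max_length=MAX_SEQUENCE_LENGTH):
    """Filter an iterable of (family, sequence) pairs.

    Assumption: over-long sequences are removed first, and family sizes
    are counted on what remains. The repository may do this differently.
    """
    short_enough = [(fam, seq) for fam, seq in records if len(seq) <= max_length]
    family_counts = Counter(fam for fam, _ in short_enough)
    return [(fam, seq) for fam, seq in short_enough
            if family_counts[fam] >= min_family_size]


# Toy data with small thresholds so the effect is visible.
toy = [
    ("kinase", "MKT"),
    ("kinase", "MKV"),
    ("kinase", "M" * 12),  # too long for max_length=10
    ("rare", "MA"),        # family has only one example
]
kept = filter_records(toy, min_family_size=2, max_length=10)
print(kept)  # [('kinase', 'MKT'), ('kinase', 'MKV')]
```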
Downloading the dataset: for details, check out the README file in the data directory.
- TensorFlow
- scikit-learn
- NumPy
- pandas
- NLTK
All the libraries can be installed using pip3. A shell script to install all dependencies will be made available in this repository.
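Until that script exists, a quick check like the following can report which of the listed libraries are still missing. This is a sketch: the helper `missing_packages` is hypothetical, and the mapping assumes the usual import names and pip package names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import name -> pip package name (assumed standard names).
REQUIRED_MODULES = {
    "tensorflow": "tensorflow",
    "sklearn": "scikit-learn",
    "numpy": "numpy",
    "pandas": "pandas",
    "nltk": "nltk",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names of modules that cannot be found for import."""
    return [pip_name for module, pip_name in modules.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Install with: pip3 install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```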
Do the following to run the model:
- chmod +x run.sh
- ./run.sh
If something fails, check the script run.sh. The steps inside the script are as follows:
- Download the dataset into the data folder and rename it to uniprot-all.tab.
- Go to the utils folder and run script1.py.
- Go to the data folder, clone GloVe and build it with the "make" command.
- Run the GloVe model with appropriate parameters (check run.sh, lines 17, 19, 21 and 23).
- Go to the utils folder and run script2.py.
- Run model.py.
This runs the model on the dataset.
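For debugging a single step, the sequence above can be written as a small Python driver that defaults to a dry run. This is only a sketch and not a replacement for run.sh. It assumes the `python3` interpreter name, that model.py sits in the repository root, and that GloVe is cloned from the Stanford NLP GitHub repository into data/GloVe. The dataset download and the GloVe training commands are deliberately left out, since their details live in the data README and in run.sh.

```python
import subprocess
from pathlib import Path

# (working directory relative to the repository root, command) pairs.
# The directory layout and interpreter name are assumptions.
STEPS = [
    ("utils", ["python3", "script1.py"]),
    ("data", ["git", "clone", "https://github.com/stanfordnlp/GloVe.git"]),
    ("data/GloVe", ["make"]),
    # GloVe training commands and parameters: see run.sh (not reproduced here).
    ("utils", ["python3", "script2.py"]),
    (".", ["python3", "model.py"]),
]


def run_steps(repo_root=".", dry_run=True):
    """Print each step and execute it only when dry_run is False."""
    planned = []
    for workdir, cmd in STEPS:
        cwd = Path(repo_root) / workdir
        planned.append((workdir, " ".join(cmd)))
        print(f"[{cwd}] {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, cwd=cwd, check=True)
    return planned


planned = run_steps(dry_run=True)
```

Calling `run_steps(dry_run=False)` would execute the commands in order and stop at the first failure, which makes it easier to see which step is at fault.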
On a Tesla K80, each epoch took approximately 4 seconds with a batch size of 128.
- If interested, check out the repo for Protein Secondary Structure Prediction.