Protein-protein interactions (PPIs) play an important role in many aspects of cell biology (Zhou & He, 2008). These interactions form complex networks that can be represented as a graph, where each node is a protein and each edge is a type of relationship between two proteins. Manually curating these networks by reading the literature, and keeping them up to date with the latest findings, would take more than a human lifespan (Baumgartner, Cohen, Fox, Acquaah-Mensah, & Hunter, 2007).
For instance, in the sentence “Full-length cPLA2 was phosphorylated stoichiometrically by p42 mitogen-activated protein (MAP) kinase in vitro”:
- The protein name recognition phase recognises “cPLA2” and “p42 mitogen-activated protein (MAP) kinase” as protein names. Some entity recognition tasks also involve recognising entity roles, such as “cPLA2” as the theme (the target protein) and “p42 mitogen-activated protein (MAP) kinase” as the agent (the source of the interaction).
- The protein-protein interaction extraction task identifies “phosphorylate” as the relationship between “cPLA2” and “p42 mitogen-activated protein (MAP) kinase”.
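The extracted interaction can then be added as an edge in the PPI graph described above. A minimal sketch using networkx (the library choice and the attribute names are illustrative, not part of this repository):

```python
import networkx as nx

# Directed PPI graph: nodes are proteins, edges are extracted interactions.
ppi = nx.DiGraph()

# Edge from the agent (source of the interaction) to the theme (target),
# labelled with the interaction type and the supporting sentence.
ppi.add_edge(
    "p42 mitogen-activated protein (MAP) kinase",  # agent
    "cPLA2",                                       # theme
    interaction="phosphorylation",
    evidence="Full-length cPLA2 was phosphorylated stoichiometrically by "
             "p42 mitogen-activated protein (MAP) kinase in vitro",
)

print(ppi.edges(data=True))
```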
- Download the dataset from the IMEx FTP site ftp.ebi.ac.uk
basedata=/home/ubuntu/data
sudo docker run -v ${basedata}:/data lanax/kegg-pathway-extractor:latest scripts/dowloadintactinteractions.sh /data "<filepattern e.g. human*.xml>" "<optional s3 destination>"
- Sample: download IntAct files matching the pattern human0* and index them in Elasticsearch
region=$1
esdomain=$2
accesskey=$3
accesssecret=$4
s3path=$5
basedata=/home/ubuntu/data
file_pattern=human0*
script=scripts/run_pipeline_download_esindex.sh
sudo docker run -v ${basedata}:/data --env elasticsearch_domain_name=$esdomain --env AWS_ACCESS_KEY_ID=$accesskey --env AWS_REGION=$region --env AWS_SECRET_ACCESS_KEY=$accesssecret lanax/kegg-pathway-extractor:latest $script /data $file_pattern $s3path
- Download the dataset from the IMEx FTP site ftp.ebi.ac.uk
cd ./source
bash scripts/dowloadintactinteractions.sh "<localdir>" "<filepattern e.g. human*.xml>" "<optional s3 destination>"
- Create a raw but JSON-formatted dataset locally from the source XML files
export PYTHONPATH=./source
python source/pipeline/main_pipeline_dataprep.py "<inputdir containing imex xml files>" "<outputdir>"
- Create PubTator-formatted abstracts so that GNormPlus can recognise entities
export PYTHONPATH=./source
python source/dataformatters/main_formatter.py "<datafilecreatedfrompreviousstep>" "<outputfile>"
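For reference, PubTator input places each document's title and abstract on pipe-delimited lines keyed by the document ID. A minimal sketch of writing one abstract in that format (the input fields and file name are illustrative, not the actual API of main_formatter.py):

```python
# Illustrative only: writes a single abstract in PubTator format,
# i.e. "<docid>|t|<title>" and "<docid>|a|<abstract>" followed by a blank line.
doc = {
    "pubmedId": "10092131",
    "articleTitle": "A sample title about cPLA2 phosphorylation",
    "articleAbstract": "Full-length cPLA2 was phosphorylated stoichiometrically by "
                       "p42 mitogen-activated protein (MAP) kinase in vitro ...",
}

with open("pubtator_input.txt", "w") as f:
    f.write("{id}|t|{title}\n".format(id=doc["pubmedId"], title=doc["articleTitle"]))
    f.write("{id}|a|{abstract}\n\n".format(id=doc["pubmedId"], abstract=doc["articleAbstract"]))
```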
- Extract entities using GNormPlus
docker pull lanax/gnormplus
docker run -it -v /data/:/gnormdata lanax/gnormplus

# within docker
# Step 1: edit setup.txt to restrict the focus to the human species only
# Step 2: run the process
java -Xmx10G -Xms10G -jar /GNormPlusJava/GNormPlus.jar /gnormdata/input /gnormdata/output setup.txt > /gnormdata/gnormplus.log 2>&1 &
Sample setup.txt
#===Annotation
#Attribution setting:
#FocusSpecies = Taxonomy ID
# All: All species
# 9606: Human
# 4932: yeast
# 7227: Fly
# 10090: Mouse
# 10116: Rat
# 7955: Zebrafish
# 3702: Arabidopsis thaliana
#open: True
#close: False

[Focus Species]
FocusSpecies = 9606

[Dictionary & Model]
DictionaryFolder = Dictionary
GNRModel = Dictionary/GNR.Model
SCModel = Dictionary/SimConcept.Model
GeneIDMatch = True
Normalization2Protein = False
DeleteTmp = True
- Download the NCBI-to-UniProt ID mapping file
From https://www.uniprot.org/downloads, download the ID mapping file from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/. This contains the ID mapping between UniProt and NCBI. We need this because GNormPlus uses NCBI gene IDs and we need the protein names. The .dat file contains three tab-delimited columns:
- UniProtKB-AC
- ID_type
- ID
e.g.

P43405      GeneID      6850
P43405      DNASU       6850
P43405      GenomeRNAi  6850
A0A024R244  GeneID      6850
A0A024R244  GenomeRNAi  6850
A0A024R273  GeneID      6850
A0A024R273  GenomeRNAi  6850
A8K4G2      DNASU       6850
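Because GNormPlus emits NCBI gene IDs, this file can be used to look up the corresponding UniProtKB accessions. A minimal parsing sketch (the function name is illustrative, not part of this repository):

```python
from collections import defaultdict

def load_geneid_to_uniprot(idmapping_path):
    """Build an NCBI GeneID -> set of UniProtKB accessions lookup
    from the tab-delimited idmapping.dat file."""
    mapping = defaultdict(set)
    with open(idmapping_path) as f:
        for line in f:
            uniprot_ac, id_type, value = line.rstrip("\n").split("\t")
            if id_type == "GeneID":
                mapping[value].add(uniprot_ac)
    return mapping

# Using the sample rows above:
# load_geneid_to_uniprot("idmapping.dat")["6850"]
# -> {"P43405", "A0A024R244", "A0A024R273"}
```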
- Download the pretrained word2vec models (either PubMed+PMC or PubMed+PMC+Wikipedia) and convert them to text format
# Download word2vec trained only on PubMed and PMC
wget -O /data/PubMed-and-PMC-w2v.bin http://evexdb.org/pmresources/vec-space-models/PubMed-and-PMC-w2v.bin
python ./source/dataformatters/main_wordToVecBinToText.py /data/PubMed-and-PMC-w2v.bin /data/PubMed-and-PMC-w2v.bin.txt

# Download word2vec trained on PubMed, PMC and Wikipedia
wget -O /data/wikipedia-pubmed-and-PMC-w2v.bin http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin
python ./source/dataformatters/main_wordToVecBinToText.py /data/wikipedia-pubmed-and-PMC-w2v.bin /data/wikipedia-pubmed-and-PMC-w2v.bin.txt
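As an alternative to the repository script, the binary-to-text conversion can also be done directly with gensim. A minimal sketch (assuming gensim is installed; not necessarily identical to what main_wordToVecBinToText.py does):

```python
from gensim.models import KeyedVectors

# Load the binary word2vec file and re-save it in the plain-text format
# expected by the --embeddingfile argument of the training scripts.
vectors = KeyedVectors.load_word2vec_format("/data/PubMed-and-PMC-w2v.bin", binary=True)
vectors.save_word2vec_format("/data/PubMed-and-PMC-w2v.bin.txt", binary=False)
```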
- Run the data exploration notebook DataExploration.ipynb to clean the data, generate negative samples and normalise the abstracts
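A common way to generate negative samples is to pair proteins that co-occur in an abstract but are not annotated as interacting. A minimal sketch of that idea (the field names and helper are illustrative, not the notebook's actual code):

```python
from itertools import combinations

def generate_candidate_pairs(doc):
    """Yield labelled protein pairs for one abstract.

    doc is assumed to have:
      - "proteins": protein names recognised in the abstract
      - "interactions": set of frozensets of protein names annotated as interacting
    Pairs absent from the annotated interactions become negative samples.
    """
    for p1, p2 in combinations(sorted(set(doc["proteins"])), 2):
        is_positive = frozenset((p1, p2)) in doc["interactions"]
        yield {"participant1": p1, "participant2": p2, "isValid": is_positive}

doc = {
    "proteins": ["cPLA2", "p42 MAP kinase", "p38"],
    "interactions": {frozenset(("cPLA2", "p42 MAP kinase"))},
}
for pair in generate_candidate_pairs(doc):
    print(pair)
```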
- Run the training job
export PYTHONPATH=./source
python source/algorithms/main_train.py --dataset PpiDatasetFactory --trainfile sample_train.json --traindir tests/data/ --valfile sample_train.json --valdir tests/data --embeddingfile sample_PubMed-and-PMC-w2v.bin.txt --embeddingdir ./tests/test_algorithms --embeddim 200 --outdir outdir --modeldir outdir
- Consolidated train + predict
#!/usr/bin/env bash
set -e
set -x

# Init
base_dir=/data
s3_dest=s3://yourbucket/results
fmtdt=`date +"%y%m%d_%H%M"`
base_name=model_${fmtdt}
outdir=${base_dir}/${base_name}
echo ${outdir}
mkdir ${outdir}
export PYTHONPATH="./source"

# Log the git head so the run can be traced back to the source version
git log -1 > ${outdir}/run.log

# Train
python ./source/algorithms/main_train.py --dataset PpiDatasetFactory --trainfile train_unique_pub_v3_lessnegatve.json --traindir /data --valfile val_unique_pub_v3_lessnegatve.json --valdir /data --embeddingfile wikipedia-pubmed-and-PMC-w2v.bin.txt --embeddingdir /data --embeddim 200 --outdir ${outdir} --epochs 50 --log-level INFO >> ${outdir}/run.log 2>&1

# Predict on the test set
python ./source/algorithms/main_predict.py PpiDatasetFactory /data/test_unique_pub_v3_lessnegatve.json ${outdir} ${outdir} >> ${outdir}/run.log 2>&1
mv ${outdir}/predicted.json ${outdir}/predicted_test_unique_pub_v3_lessnegatve.json

# Predict on the validation set
python ./source/algorithms/main_predict.py PpiDatasetFactory /data/val_unique_pub_v3_lessnegatve.json ${outdir} ${outdir} >> ${outdir}/run.log 2>&1
mv ${outdir}/predicted.json ${outdir}/predicted_val_unique_pub_v3_lessnegatve.json

# Copy results to s3
aws s3 sync ${outdir} ${s3_dest}/${base_name} >> ${outdir}/synclog 2>&1
- Download abstracts from the PubMed FTP site and convert them to JSON. For more details, see https://github.com/elangovana/pubmed-downloader/tree/master
- Convert the JSON to PubTator format to prepare the dataset for GNormPlus
export PYTHONPATH=./source
python source/dataformatters/pubmed_abstracts_to_pubtator_format.py "<inputdir_jsonfiles_result_of_step_1>" "<destination_dir_pubtator_format>"
- Run GNormPlus to recognise entities in the PubTator-formatted files from the previous step, without protein name normalisation. See https://github.com/elangovana/docker-gnormplus
- Generate JSON from the GNormPlus annotations in PubTator format.
export PYTHONPATH=./source
python ./source/datatransformer/pubtator_annotations_inference_transformer.py tests/test_datatransformer/data_sample_annotation /tmp tmpmap.dat
- Run inference
python ./algorithms/main_predict.py PPIDataset /data/val_unique_pub_v6_less_negative.json /tmp/model_artefacts /tmp --positives-filter-threshold .95
- Download the AIMed corpus from ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/interactions.tar.gz
- Convert the raw dataset into XML using the instructions at http://mars.cs.utu.fi/PPICorpora/
convert_aimed.py -i aimed_interactions_input_dir -o aimed.xml
- Next, convert the XML to dataframe JSON using one of the two options below
- Option A: Converts the XML to dataframe JSON and pre-processes it so that protein names that are not relevant to the candidate pair are masked as "PROTEIN" (a sketch of this masking follows Option B below)
export PYTHONPATH=./source
python source/datatransformer/AimedXmlToDataFramePreprocessed.py tests/test_datatransformer/data/sample_aimed_pyyasaol_converted.xml /tmp/df.json
- Option B: If you just want to convert the AIMed XML to JSON without replacing non-participating protein names with "PROTEIN", use this script instead
export PYTHONPATH=./source
python source/datatransformer/AimedXmlToDataFrame.py tests/test_datatransformer/data/sample_aimed_pyyasaol_converted.xml /tmp/df.json
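For intuition, the masking performed by Option A amounts to replacing protein mentions that are not one of the two candidate participants with a generic placeholder. A minimal sketch of that idea (illustrative only, not the transformer's actual implementation):

```python
def mask_other_proteins(sentence, all_proteins, participant1, participant2):
    """Replace protein mentions other than the two candidate participants
    with the generic token "PROTEIN"."""
    participants = {participant1, participant2}
    for name in sorted(all_proteins, key=len, reverse=True):
        if name not in participants:
            sentence = sentence.replace(name, "PROTEIN")
    return sentence

print(mask_other_proteins(
    "cPLA2 was phosphorylated by p42 MAP kinase but not by p38",
    all_proteins=["cPLA2", "p42 MAP kinase", "p38"],
    participant1="cPLA2",
    participant2="p42 MAP kinase",
))
# -> "cPLA2 was phosphorylated by p42 MAP kinase but not by PROTEIN"
```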
- Run 10-fold training with unique doc id
python ./source/algorithms/main_train_k_fold.py --trainfile aimedsample.json --traindir tests/data --embeddingfile tests/test_algorithms/sample_PubMed-and-PMC-w2v.bin.txt --outdir /tmp --modeldir /tmp --embeddim 200 --epochs 2 --dataset PpiAimedDatasetPreprocessedFactory --labelfieldname isValid --docidfieldname docid
- Run 10-fold training ignoring unique doc id
python ./source/algorithms/main_train_k_fold.py --trainfile aimedsample.json --traindir tests/data --embeddingfile tests/test_algorithms/sample_PubMed-and-PMC-w2v.bin.txt --outdir /tmp --modeldir /tmp --embeddim 200 --epochs 2 --dataset PpiAimedDatasetPreprocessedFactory --labelfieldname isValid
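The intent of the two modes appears to be whether folds are split at the document level (so records from the same abstract never span both the train and validation folds) or purely at random. A minimal sketch of the two splitting strategies using scikit-learn (illustrative only, not the repository's implementation):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

pairs = np.arange(20)                 # 20 candidate protein pairs
docids = np.repeat(np.arange(10), 2)  # 2 pairs per abstract, 10 abstracts

# "with unique docid": all pairs from one abstract stay in the same fold,
# so no abstract contributes to both the train and validation splits
group_folds = list(GroupKFold(n_splits=10).split(pairs, groups=docids))

# "ignore unique doc id": pairs are split at random, so two pairs from the
# same abstract can land on different sides of the split
random_folds = list(KFold(n_splits=10, shuffle=True, random_state=42).split(pairs))
```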
- To use the pretrained embeddings from Chiu et al., "How to Train Good Word Embeddings for Biomedical NLP", download them from https://github.com/cambridgeltl/BioNLP-2016
- To use the embeddings from Collobert (https://ronan.collobert.com/pub/matos/2014_hellinger_eacl.pdf), download them from http://www.lebret.ch/words/embeddings/. First convert the vocab/words format into a single file as shown here; the resulting file can then be used as a normal embedding file.
python ./source/algorithms/collobert_embedding_formatter.py --vocabfile vocab.txt --embedfile words.txt --outputfile words_vocab_collabert.txt
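Conceptually, the conversion pairs each word in the vocabulary file with the corresponding row of the embedding file and writes a single word2vec-style text file. A rough sketch of that idea, assuming vocab.txt has one word per line and words.txt has one space-separated vector per line in the same order (this is not the actual code of collobert_embedding_formatter.py):

```python
def merge_vocab_and_vectors(vocab_path, embed_path, output_path):
    """Combine a vocabulary file and an embedding-matrix file into a single
    word2vec-style text embedding file."""
    with open(vocab_path) as vf:
        words = [line.strip() for line in vf if line.strip()]
    with open(embed_path) as ef:
        vectors = [line.strip() for line in ef if line.strip()]

    assert len(words) == len(vectors), "vocab and embedding files must align"
    dim = len(vectors[0].split())

    with open(output_path, "w") as out:
        # word2vec text format starts with a "<vocab size> <dimensions>" header
        out.write("{} {}\n".format(len(words), dim))
        for word, vector in zip(words, vectors):
            out.write("{} {}\n".format(word, vector))

# merge_vocab_and_vectors("vocab.txt", "words.txt", "words_vocab_collabert.txt")
```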
- Download pretrained BioBERT from https://github.com/naver/biobert-pretrained/releases, specifically https://github.com/naver/biobert-pretrained/releases/download/v1.1-pubmed/biobert_v1.1_pubmed.tar.gz
- Convert the TensorFlow model to a PyTorch model
PYTHONPATH=./source python ./source/algorithms/BiobertTfConverter.py --modeldir "<modeldir>" --outputdir "<outputdir>"
- Train
export PYTHONPATH=./source
python ./source/algorithms/main_train_bert.py --dataset PpiAimedDatasetFactory --trainfile Aimedsample.json --traindir tests/data/ --valfile Aimedsample.json --valdir tests/data --pretrained_biobert_dir "<biobertdir>"