CROssBAR: Comprehensive Resource of Biomedical Relations with Deep Learning Applications and Knowledge Graph Representations
The CROssBAR project addresses limitations in the diversity and connectivity of data across biological data resources, which hamper their real-world application to biomedical problems. Within CROssBAR, we developed a comprehensive computational resource by linking various biomedical resources, generating relation predictions using machine/deep learning, and building information-rich knowledge graphs that incorporate both available and predicted biomedical relationships. The aim is to help biomedical researchers better understand disease mechanisms and discover or develop new drugs. Detailed information can be found in the CROssBAR paper.
Sub-projects under CROssBAR:
1) Biomedical data integration: The CROssBAR database is constructed by collecting relational data from various biomedical data resources (UniProt, IntAct, InterPro, Reactome, Ensembl, DrugBank, ChEMBL, PubChem, KEGG, OMIM, Orphanet, Gene Ontology, Experimental Factor Ontology (EFO) and Human Phenotype Ontology (HPO)) and persisting specific data attributes, selected by logic rules, in MongoDB collections. The open-access CROssBAR-DB can be queried via our public RESTful API, which provides a multi-faceted view of the stored data.
2) Deep learning-based relation prediction: The main purpose here was to enrich the integrated biomedical data by identifying unknown interactions between drugs/drug candidate compounds and target proteins. We re-trained our previously developed systems using carefully filtered and up-to-date data from the CROssBAR database, and ran our models on large-scale compound and protein spaces to obtain comprehensive bio-interaction predictions, including drug predictions for COVID-19.
3) Biomedical knowledge graphs: Different biological components (drugs/compounds, genes/proteins, pathways/mechanisms, phenotypes/diseases) are represented as nodes, and their known (reported) and computationally predicted relationships are annotated as edges. At each step of the process, overrepresentation-based enrichment analyses are applied to construct a graph that is highly relevant to the query term(s). These intensively processed heterogeneous biological networks are expected to aid biomedical research, especially in inferring mechanisms of diseases in relation to biomolecules, systems and candidate drugs.
4) CROssBAR web-service: Here we developed a service to make the CROssBAR data available to the public in an easily interpretable, interactive way via an online graphical user interface. Knowledge graphs are presented visually in web browsers as Cytoscape networks. Users can search CROssBAR components by simply typing the names or IDs of the query terms, individually or in combination, to obtain relevant sub-graphs constructed on the fly.
5) COVID-19 use case and other data exploration examples: CROssBAR COVID-19 knowledge graphs are constructed with the aim of collecting the related data from various biomedical resources, applying filtering operations and presenting the result in a coherent and standardized form to the research community. Along with up-to-date information reported in source databases, our COVID-19 KGs also incorporate several new drugs (identified either by enrichment analysis or by our deep-learning models) that can contribute to studies on developing novel medications against SARS-CoV-2. We also conducted in vitro cell-based wet-lab experiments (i.e., gene expression analysis) to compare their results with the computationally inferred information.
We constructed the CROssBAR database to integrate vast amounts of biological information from various well-known resources. Data pipelines were developed to handle the heavy lifting of data from different databases such as UniProt, IntAct, DrugBank, ChEMBL, PubChem, Reactome, KEGG, OMIM, Orphanet and EFO, persisting specific data attributes selected by logic rules.
The CROssBAR database of attributes is hosted in self-sufficient, easy-to-access MongoDB collections, and it is available both to end users and to the CROssBAR web service through an API at: CROssBAR-API.
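As an illustration of what "persisting specific data attributes with logic rules" can look like, here is a minimal sketch. It is written in Python for brevity (the project itself lists Java 8), and the rule set, field names and `to_document` helper are all hypothetical; the actual CROssBAR rules are described in the project paper and the CROssBAR_DB_API folder.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical rule set: an attribute is kept only if its predicate accepts the value.
RULES: Dict[str, Callable[[Any], bool]] = {
    "accession": lambda v: isinstance(v, str) and v != "",
    "gene_name": lambda v: isinstance(v, str) and v != "",
    "reviewed": lambda v: isinstance(v, bool),
}


def to_document(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the attributes that pass their rule, or None if no usable identifier remains."""
    doc = {
        key: record[key]
        for key, accepts in RULES.items()
        if key in record and accepts(record[key])
    }
    # A record without an identifier cannot be linked to other resources, so drop it.
    return doc if "accession" in doc else None


# Toy source record; "comment" is not covered by any rule and is therefore not persisted.
raw = {"accession": "P00533", "gene_name": "EGFR", "reviewed": True, "comment": "free text"}
document = to_document(raw)
```

In a real pipeline the resulting dictionary would be written to a MongoDB collection; that step is omitted here so the sketch runs without a database.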
Technologies used:
- Java 8,
- MongoDB v3.4.9,
- Groovy and Spock framework for tests,
- Maven dependency management
For more information about the CROssBAR Database & API, please refer to our project paper or visit the CROssBAR_DB_API folder.
DEEPScreen:
DEEPScreen is a high-performance drug–target interaction (DTI) predictor that uses convolutional neural networks and 2-D structural representations of compounds to predict the activity of those compounds against intended target proteins. The DEEPScreen system is composed of 704 target-protein-specific prediction models, each independently trained using experimental bioactivity measurements against many drug candidate small molecules, and optimized according to the binding properties of its target protein.
DEEPScreen can be used in drug discovery and repurposing for in silico screening of the chemogenomic space, to provide novel DTIs that can be pursued experimentally. The source code, trained "ready-to-use" prediction models, all datasets and the results of this study are available at the DEEPScreen GitHub repository. More information is available in the DEEPScreen journal paper.
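The "one independent model per target protein" organisation described above can be sketched as a registry that maps each target to its own predictor. This is only a structural illustration: the threshold "models", the target names and the flat pixel lists below are hypothetical stand-ins for trained convolutional networks and 2-D compound images, and a binary active/inactive output is assumed.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained per-target classifier: it takes a compound
# image (here just a flat list of pixel intensities) and returns active/inactive.
Model = Callable[[List[float]], bool]


def make_threshold_model(threshold: float) -> Model:
    """Toy model: call a compound 'active' if its mean pixel intensity exceeds a threshold."""
    def predict(image: List[float]) -> bool:
        return sum(image) / len(image) > threshold
    return predict


# One independent model per target protein (two toy entries instead of 704).
models: Dict[str, Model] = {
    "TARGET_A": make_threshold_model(0.5),
    "TARGET_B": make_threshold_model(0.2),
}


def screen(compounds: Dict[str, List[float]]) -> Dict[str, List[str]]:
    """Run every compound through every target model and collect the predicted actives."""
    hits: Dict[str, List[str]] = {target: [] for target in models}
    for target, model in models.items():
        for name, image in compounds.items():
            if model(image):
                hits[target].append(name)
    return hits


compounds = {"cmpd1": [0.9, 0.7], "cmpd2": [0.3, 0.3], "cmpd3": [0.0, 0.1]}
hits = screen(compounds)
```

Keeping the models independent means a new target can be added, or an existing one retrained, without touching the others.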
MDeePred:
MDeePred is a deep-learning method that produces compound-target binding affinity predictions for use in computational drug discovery and repositioning. The method adopts the chemogenomic approach, in which both compound and target protein features are employed at the input level to model their interaction; this enables the prediction of inhibitors for under-studied or completely non-targeted proteins. In MDeePred, multiple types of protein features, such as sequence, structural, evolutionary and physicochemical properties, are incorporated within multi-channel 2-D vectors, which are then fed to state-of-the-art pairwise-input hybrid deep neural networks to predict real-valued compound-target protein interactions. The source code and datasets of MDeePred are available at the MDeePred GitHub repository.
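To make the idea of "multi-channel 2-D vectors" concrete, the sketch below stacks one L x L matrix per feature type for a protein of length L. It assumes that each channel scores pairs of residues; the two scoring functions and the `encode_protein` helper are hypothetical toys, not the feature set used by MDeePred.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def pairwise_matrix(sequence: str, score: Callable[[str, str], float]) -> Matrix:
    """Build an L x L matrix whose (i, j) entry scores residues i and j of the sequence."""
    return [[score(a, b) for b in sequence] for a in sequence]


# Hypothetical per-channel scoring functions. Real channels would be derived from
# sequence, structural, evolutionary and physicochemical properties.
HYDROPHOBIC = set("AVILMFWC")
channels: Dict[str, Callable[[str, str], float]] = {
    "identity": lambda a, b: 1.0 if a == b else 0.0,
    "both_hydrophobic": lambda a, b: 1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0,
}


def encode_protein(sequence: str) -> List[Matrix]:
    """Stack one L x L matrix per feature type into a (channels, L, L) nested list."""
    return [pairwise_matrix(sequence, fn) for fn in channels.values()]


# A four-residue toy sequence gives a 2 x 4 x 4 structure.
tensor = encode_protein("MAVK")
```

A structure of this shape can be consumed by a convolutional branch in the same way as a multi-channel image, with the compound features entering through a separate branch of the pairwise-input network.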
In CROssBAR knowledge graphs, different biological components, such as
- genes/proteins,
- pathways,
- diseases,
- phenotypes, and
- drugs/compounds
are represented as nodes, and the known and predicted pairwise relationships are annotated and displayed as labeled edges. The knowledge graphs are constructed on the fly, each time the CROssBAR database is queried by the user. To convert the full output of user queries, which are initially extremely large biological networks, into biologically meaningful and interpretable representations without losing primary relationships, we applied intensive node enrichment operations. The knowledge graphs are displayed to the user as heterogeneous biological networks and their purpose is to aid biomedical research, especially in the fields of drug discovery and repositioning, by providing a concise piece of relevant biological information to the user in real time.
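Overrepresentation-based enrichment is commonly computed with a one-sided hypergeometric test, and the sketch below shows that standard form. Whether CROssBAR uses exactly this test, this significance threshold, or a multiple-testing correction is not stated here, so the function names, the `alpha` cut-off and the toy counts are assumptions.

```python
from math import comb
from typing import Dict, List, Tuple


def enrichment_p_value(population: int, annotated: int, sample: int, overlap: int) -> float:
    """P(X >= overlap) for X ~ Hypergeometric(population, annotated, sample).

    population: total number of candidate entities (e.g. all proteins)
    annotated:  how many of them carry the term (e.g. belong to a pathway)
    sample:     size of the query-derived set
    overlap:    how many members of the sample carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def enriched_terms(
    term_counts: Dict[str, Tuple[int, int]], population: int, sample: int, alpha: float = 0.05
) -> List[str]:
    """Keep the terms whose overlap with the sample is unlikely to arise by chance."""
    return sorted(
        term
        for term, (annotated, overlap) in term_counts.items()
        if enrichment_p_value(population, annotated, sample, overlap) < alpha
    )


# Toy numbers: 100 proteins in total and a query-derived set of 10.
# Each term maps to (number annotated in the population, number found in the sample).
counts = {"pathway_X": (10, 5), "pathway_Y": (50, 5)}
kept = enriched_terms(counts, population=100, sample=10)
```

Here `pathway_X` is retained because five of its ten members appear in a sample where about one would be expected, while `pathway_Y` matches its chance expectation and is filtered out. Pruning nodes in this way is what keeps the displayed graph small without discarding the relationships most specific to the query.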
For the COVID-19 knowledge graph use case, please refer to the corresponding section entitled "COVID-19 Knowledge Graphs" below and visit the COVID-19 KGs use case folder. For the manually constructed prototype hepatocellular carcinoma (HCC) disease network, please visit the CROssBAR HCC network folder.
To make the CROssBAR knowledge graphs (KGs) available to the public in an easily interpretable way, we developed a web service and an easy-to-use web interface. Here, KGs are presented in a web browser as Cytoscape networks. The web service is available at: crossbar.kansil.org. Users can search for the following entities individually or in combination:
- gene/protein entries,
- biological process/pathway terms,
- disease terms,
- phenotype terms (HPO), and
- drug and drug candidate compound entries.
In response to a query started by the user, the input containing the search term(s) and the components that have a biological relationship with this input (e.g., a signalling pathway of which the searched protein is a member, a disease known to occur as a result of a mutation in the searched protein, or target proteins known to interact with the searched drug molecule) are extracted from the CROssBAR database via the API. To arrange the components/terms on graphs, we developed CROssBAR-layout, in which biological components of a specific type are placed on circular points within fixed radii.
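One way to read this layout rule is that each node type gets its own circle of fixed radius, with the nodes of that type spaced evenly around it. The sketch below implements that reading only; the radii, the node types chosen and the `circular_layout` function are assumptions, not the actual CROssBAR-layout code.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical radius per node type; the values used by CROssBAR-layout are not given here.
RADII: Dict[str, float] = {"protein": 100.0, "pathway": 200.0, "disease": 300.0}


def circular_layout(nodes_by_type: Dict[str, List[str]]) -> Dict[str, Tuple[float, float]]:
    """Place the nodes of each type evenly on a circle of that type's radius."""
    positions: Dict[str, Tuple[float, float]] = {}
    for node_type, nodes in nodes_by_type.items():
        radius = RADII[node_type]
        for i, node in enumerate(nodes):
            # Spread the nodes of this type over the full circle, starting at angle 0.
            angle = 2 * math.pi * i / len(nodes)
            positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


layout = circular_layout({
    "protein": ["EGFR", "TP53"],
    "pathway": ["MAPK signaling"],
    "disease": ["HCC", "NSCLC", "glioma"],
})
```

The resulting coordinates could then be handed to a graph renderer as fixed node positions, so that nodes of the same type always appear on the same ring regardless of the query.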
For CROssBAR Web-Service data exploration examples, please visit the CROssBAR_Web-Service folder.
CROssBAR COVID-19 knowledge graphs (KGs) are constructed with the aim of collecting the related data from various biomedical resources, applying filtering operations and presenting the result in a coherent and standardized form to the research community. We periodically update our COVID-19 KGs with the new evidence accumulating in our resources. On top of the data reported in source databases, our COVID-19 KGs also incorporate several new drugs (identified either by enrichment analysis or by our deep-learning models) that can contribute to studies on developing novel medications against SARS-CoV-2 (literature-based investigation of the predictions: Supplementary Information, section 2). We also conducted simple in vitro cell-based wet-lab experiments (i.e., gene expression analysis) to compare their results with the computationally inferred information.
Large-scale COVID-19 Knowledge Graph:
Simplified COVID-19 Knowledge Graph:
The large-scale KG has 1289 nodes and 6743 edges, and the simplified KG has 435 nodes and 1061 edges. Both graphs reveal the most overrepresented biological processes during a SARS-CoV-2 infection, as well as potential treatment options, with COVID-19-related pre-clinical/clinical results and our novel in silico predictions (for both virus and host proteins), considering long-term drug discovery or short-term drug repositioning applications.
For more information about the COVID-19 knowledge graphs, please refer to our project paper or visit the CROssBAR COVID-19 KG folder. For information about the comparative in vitro cell-based analysis, together with the datasets, please visit the CROssBAR wet-lab analysis folder.
If you use CROssBAR, please consider citing:
Doğan, T., Atas, H., Joshi, V., Atakan, A., Rifaioglu, A.S., Nalbat, E., Nightingale, A., Saidi, R., Volynkin, V., Zellner, H., Cetin-Atalay, R., Martin, M. J. & Atalay, V. (2021). CROssBAR: Comprehensive Resource of Biomedical Relations with Knowledge Graph Representations. Nucleic Acids Research, 49(16), e96.
CROssBAR (c) by CanSyL
CROssBAR is licensed under a Creative Commons Attribution 4.0 International License.
You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by/4.0/.