This repo will host open knowledge graphs from VaidhyaMegha.
VaidhyaMegha has built an open knowledge graph on clinical trials.
- This repository contains the source code along with instructions to generate and use this knowledge graph.
- More information, including references, is available in the article and also here
VaidhyaMegha is building an open knowledge graph on technical decision making.
- This repository contains the source code along with instructions to use this periodically curated knowledge graph.
- More information, including references, is available in the article and also here
Pre-requisite steps
- Create a folder named 'lib'. Download the algs4.jar file from here and place it in the 'lib' folder.
- Download the hypergraphql jar file from here and place it in the 'lib' folder.
- Download the 'vocabulary_1.0.0.ttl' file from here and place it in the 'data/open_knowledge_graph_on_clinical_trials' folder.
- Download mesh2022.nt.gz from here and unzip it. Place the mesh2022.nt file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
- Download PheGenI from here and place the file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
- Download detailed_CoOccurs_2021.txt.gz from here and unzip it. Place the detailed_CoOccurs_2021.txt file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
- Generate detailed_CoOccurs_2021_selected_fields.txt and detailed_CoOccurs_2021_selected_fields_sorted.txt using the following commands. Both files are written to the 'data/open_knowledge_graph_on_clinical_trials' folder.
cut -d '|' -f1,9,15 data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021.txt > data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields.txt
sort -u data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields.txt > data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields_sorted.txt
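For readers without `cut` and `sort` at hand, the two commands above can be approximated in Python. This is only a sketch and not part of the repository: the helper names `select_fields` and `sort_unique` are hypothetical, the input is assumed to be pipe-delimited as the `cut -d '|'` call implies, and the ordering matches `sort -u` only under the C locale (plain code-point order).

```python
from typing import Iterable, List, Sequence


def select_fields(lines: Iterable[str],
                  fields: Sequence[int] = (1, 9, 15),
                  delimiter: str = "|") -> List[str]:
    """Keep the given 1-based fields of each line, like `cut -d '|' -f1,9,15`."""
    selected = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        # Like cut, silently skip field numbers the line does not have.
        kept = [parts[i - 1] for i in fields if i <= len(parts)]
        selected.append(delimiter.join(kept))
    return selected


def sort_unique(lines: Iterable[str]) -> List[str]:
    """Drop duplicates and sort by code point, like `LC_ALL=C sort -u`."""
    return sorted(set(lines))


if __name__ == "__main__":
    # Made-up row with 15 pipe-delimited fields, repeated to show deduplication.
    row = "|".join(["100"] + ["x"] * 7 + ["D0002"] + ["x"] * 5 + ["D0009"])
    for out in sort_unique(select_fields([row, row])):
        print(out)
```

For the real file, pass an open file handle as `lines` and write the results line by line, since the co-occurrence file is large.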
To compile and package
mvn clean package assembly:single -DskipTests
To build RDF
java -jar -Xms4096M -Xmx8192M target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar
To query using SPARQL
java -jar -Xms4096M -Xmx8144M target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar -m cli -q src/main/sparql/1_count_of_records.rq
...
Results:
--------
5523173^^
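Because N-Triples stores one triple per line, the count returned by this query can be cross-checked without a SPARQL engine (the v0.7 notes further down do the same with `wc -l`). A minimal Python sketch follows; `count_triples` is a hypothetical helper, not part of this repository, and it assumes every statement sits on a single line.

```python
import io
from typing import TextIO


def count_triples(stream: TextIO) -> int:
    """Count N-Triples statements: one per non-blank, non-comment line."""
    count = 0
    for line in stream:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


if __name__ == "__main__":
    # Made-up sample data. For the real graph, open
    # data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt
    sample = io.StringIO(
        '<http://example.org/t1> <http://example.org/TrialId> "T-0001" .\n'
        "# a comment line\n"
        "\n"
        '<http://example.org/t2> <http://example.org/TrialId> "T-0002" .\n'
    )
    print(count_triples(sample))  # prints 2
```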
To query using GraphQL (via HyperGraphQL)
java -cp "target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar:lib/*" -m server
- From Postman with ntriples response
- From Postman with json response
- In a separate terminal execute GraphQL query using curl (alternatively use Postman)
$ curl --location --request POST 'http://localhost:8080/graphql' \
  --header 'Accept: application/ntriples' \
  --header 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8,kn;q=0.7' \
  --header 'Content-Type: application/json' \
  --data-raw '{"query":"{\n trial_GET(limit: 30, offset: 1) {\n label\n }\n \n}","variables":{}}'
<> <> <> .
<> <> "EUCTR2007-006072-11-SE"^^<> .
<> <> <> .
<> <> "NCT02954757"^^<> .
<> <> <> .
<> <> "EUCTR2014-005525-13-FI"^^<> .
<> <> <> .
<> <> "NCT02721914"^^<> .
...
<> <> <> .
<> <> <> .
<> <> <> .
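The same GraphQL request can be issued from Python using only the standard library. This is a sketch under assumptions, not code from the repository: `build_graphql_request` and `run_query` are hypothetical names, and the URL, headers and query are copied from the curl call in this section.

```python
import json
import urllib.request

# Same query as in the curl call, minus the stray whitespace-only line.
QUERY = "{\n trial_GET(limit: 30, offset: 1) {\n label\n }\n}"


def build_graphql_request(query: str,
                          url: str = "http://localhost:8080/graphql",
                          accept: str = "application/ntriples") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a GraphQL query."""
    body = json.dumps({"query": query, "variables": {}}).encode("utf-8")
    headers = {"Accept": accept, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def run_query(query: str) -> str:
    """Send the query to a locally running server and return the raw response text."""
    with urllib.request.urlopen(build_graphql_request(query)) as response:
        return response.read().decode("utf-8")
```

With the server from the previous step running, `print(run_query(QUERY))` sends the request. Building and sending are kept separate so the request can be inspected without a server.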
Summary : Using any trial id from across the globe, find the associated diseases/interventions, research articles and genes. Also discover relationships between various medical topics through co-occurrences in articles. Query the graph using SPARQL from the CLI, or using GraphQL from any API client tool, e.g. Postman or curl.
Feature list :
- Using the GraphQL API, the knowledge graph can be queried from any API client tool, e.g. curl or Postman.
- Graph includes trials from across the globe. Data is sourced from WHO's ICTRP and
- Links from trial to MeSH vocabulary are added for conditions and interventions employed in the trial.
- Links from trial to PubMed articles are added. PubMed's experts curate this metadata information for each article.
- Added MRCOC to the graph for the selected articles linked to clinical trials.
- Added PheGenI links i.e. links from phenotype to genotype as links between MeSH DUI and GeneID.
- Added SPARQL query execution feature. Added CLI mode. Added a count SPARQL query for demo.
- 5 co-existing bipartite graphs (trial --> condition, trial --> intervention, trial --> articles, article --> MeSH DUIs, gene id --> MeSH DUIs) together comprise this knowledge graph.
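To illustrate how the five bipartite graphs combine, the sketch below walks from a trial to candidate genes via shared MeSH DUIs. It is a toy model, not the repository's code: the edge kind names, the identifiers and every link in `EDGES` are made up for illustration, and `genes_for_trial` is a hypothetical helper.

```python
from collections import defaultdict

# (source, edge kind, target). One edge kind per bipartite graph.
# All identifiers and links below are invented sample data.
EDGES = [
    ("trial:T1", "condition", "mesh:D_A"),
    ("trial:T1", "intervention", "mesh:D_B"),
    ("trial:T1", "article", "article:P1"),
    ("article:P1", "article_mesh", "mesh:D_C"),
    ("gene:G1", "gene_mesh", "mesh:D_A"),
    ("gene:G2", "gene_mesh", "mesh:D_C"),
    ("gene:G3", "gene_mesh", "mesh:D_Z"),
]


def build_index(edges):
    """Index edges in both directions, keyed by (node, edge kind)."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, kind, target in edges:
        forward[(source, kind)].add(target)
        backward[(target, kind)].add(source)
    return forward, backward


def genes_for_trial(edges, trial):
    """Collect genes that share a MeSH DUI with the trial or its articles."""
    forward, backward = build_index(edges)
    # MeSH DUIs reached directly (conditions, interventions) ...
    mesh = set(forward[(trial, "condition")]) | set(forward[(trial, "intervention")])
    # ... and through the articles linked to the trial.
    for article in forward[(trial, "article")]:
        mesh |= forward[(article, "article_mesh")]
    genes = set()
    for dui in mesh:
        genes |= backward[(dui, "gene_mesh")]
    return sorted(genes)


if __name__ == "__main__":
    print(genes_for_trial(EDGES, "trial:T1"))  # prints ['gene:G1', 'gene:G2']
```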
Changes in this release : Server mode of execution is added.
- v0.9
- Enable server mode of execution using HyperGraphQL
java -cp "target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar:lib/*" -m server
- v0.8
- Enable GraphQL interface to the knowledge graph using HyperGraphQL
java -Dorg.slf4j.simpleLogger.defaultLogLevel=debug -jar lib/hypergraphql-3.0.1-exe.jar --config src/main/resources/hql-config.json
- v0.7
- Enable SPARQL queries
$ cat src/main/sparql/1_count_of_records.rq
SELECT (count(*) as ?count) where { ?s ?p ?o}
$ sparql --data=data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt --query=src/main/sparql/1_count_of_records.rq
-----------
| count   |
===========
| 4766048 |
-----------
$ wc -l data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt
4766048 data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt
- v0.6.1
- Externalize the Entrez API invocation threshold probability
- Patch for below issue
$ sparql --data=data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt --query=src/main/sparql/example.rq
04:33:04 ERROR riot :: [line: 1085476, col: 71] Bad character in IRI (Tab character): <[tab]...>
Failed to load data
$ grep "SLCTR/2020/014" data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt
< > <TrialId> "SLCTR/2020/014\t" .
- v0.6
- Added PheGenI links i.e. links from phenotype to genotype as links between MeSH DUI and GeneID.
<> <Gene> <> .
<> <GeneID> "10014" .
<> <Gene> <> .
<> <GeneID> "6923" .
<> <Gene> <> .
<> <GeneID> "3198" .
- v0.5
- Adding MRCOC to the graph for the selected articles linked to clinical trials.
<> <MeSH_DUI> <> .
<> <MeSH_DUI> <> .
<> <MeSH_DUI> <> .
- v0.4
- List of trial ids to be incrementally bounced against the Entrez API to generate the necessary incremental mappings between trials and PubMed articles
$ grep "Pubmed_Article" data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt
<> <Pubmed_Article> "25153486" .
<> <Pubmed_Article> "34064657" .
- v0.3
- Adding links between trials and interventions in addition to trials and conditions.
- Conditions and interventions are fetched from the database (instead of files). Corresponding edges between trials and conditions, and between trials and interventions, are added to the RDF. For example :
<> <Condition> <> .
<> <Intervention> <> .
- All global trials (756,169) are added to the RDF. For example :
<> <TrialId> "NCT00172328" .
<> <TrialId> "CTRI/2021/05/033487" .
- Starting with a fresh model for the final RDF. MeSH ids that are not linked to any trial are not considered. This reduces the graph size considerably.
- Trial records are fetched from ICTRP's weekly + periodic full export and AACT's daily + monthly full snapshot.
- Trials are written down to a file (will be used later) : vaidhyamegha_clinical_trials.csv
$ wc -l vaidhyamegha_clinical_trials.csv
755272 vaidhyamegha_clinical_trials.csv
- Download the RDF from here.
- v0.2
- Clinical trials are linked to the RDF nodes corresponding to the MeSH terms for conditions. For example :
- Download the enhanced RDF from here.
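The v0.6.1 issue in the notes above came from a trailing tab in a trial id. One way to guard against such values before they are written as N-Triples is sketched below in Python. This is not the repository's actual patch: `clean_trial_id` and `ntriples_literal` are hypothetical helpers, and the sketch assumes leading and trailing whitespace in a trial id is never meaningful.

```python
def clean_trial_id(raw: str) -> str:
    """Strip leading/trailing whitespace (including tabs) from a trial id."""
    return raw.strip()


def ntriples_literal(value: str) -> str:
    """Render a string as a quoted N-Triples literal with standard escapes."""
    escaped = (
        value.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return f'"{escaped}"'


if __name__ == "__main__":
    # The offending value from the v0.6.1 notes, with its trailing tab.
    print(ntriples_literal(clean_trial_id("SLCTR/2020/014\t")))  # prints "SLCTR/2020/014"
```

Cleaning the id before it is used to build an IRI matters most, since a raw tab is not allowed inside an IRI, which is what the parser error reports.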
More information, including references, is available in the article and also here
VaidhyaMegha's prior work on
- clinical trial registries data linking.
- symptoms to diseases linking.
- phenotype to genotype linking.
- trials to research articles linking.
The last 3 are covered in the "examples" folder here. They were previously covered in separate public repos here.
- Complete article
- Full list of trial ids to be used in combination with id_information table to generate a final list of unique trials using WQUPC algorithm
- Add secondary trial ids to graph (this may increase graph size considerably). However, it could be of utility.
- Build SPARQL + GraphQL version of API to allow direct querying of the graph. Provide some reasonable examples that are harder in SQL.
- SNOMED CT, ICD-10.
- Host Knowledge graph on Neo4j's cloud service, AuraDB.
- Use Neo4j's GraphQL API from Postman to demonstrate sample queries on clinical trials.
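The WQUPC item in the list above refers to weighted quick-union with path compression, the union-find variant shipped in algs4.jar. The Python sketch below shows how it could group equivalent trial ids into unique trials. It is an illustration only: it assumes the id_information table can be read as pairs of ids that denote the same trial, `group_trial_ids` is a hypothetical helper, and the ids are placeholders.

```python
class WeightedQuickUnionPC:
    """Union-find with union by size and path compression."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, p: int) -> int:
        root = p
        while root != self.parent[root]:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while p != root:
            next_p = self.parent[p]
            self.parent[p] = root
            p = next_p
        return root

    def union(self, p: int, q: int) -> None:
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        # Weighting: attach the smaller tree under the larger one.
        if self.size[root_p] < self.size[root_q]:
            root_p, root_q = root_q, root_p
        self.parent[root_q] = root_p
        self.size[root_p] += self.size[root_q]


def group_trial_ids(pairs):
    """Group ids connected by any chain of (id, equivalent id) pairs."""
    ids = sorted({trial_id for pair in pairs for trial_id in pair})
    index = {trial_id: n for n, trial_id in enumerate(ids)}
    uf = WeightedQuickUnionPC(len(ids))
    for a, b in pairs:
        uf.union(index[a], index[b])
    groups = {}
    for trial_id in ids:
        groups.setdefault(uf.find(index[trial_id]), []).append(trial_id)
    return sorted(groups.values())


if __name__ == "__main__":
    # Placeholder ids: A1, B1 and C1 are one trial registered three times.
    pairs = [("A1", "B1"), ("B1", "C1"), ("A2", "B2"), ("A3", "A3")]
    print(group_trial_ids(pairs))
```

Each resulting group corresponds to one unique trial, so the number of groups is the deduplicated trial count.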