Define specific queries on specific CDEs modelled by Care-SM #28

Open
2 tasks
mroos opened this issue Jan 15, 2024 · 13 comments
@mroos
Collaborator

mroos commented Jan 15, 2024

Define SPARQL queries that answer the information needs described in the use case flash card and the mindmap linked from #57. The queries should rely on the information models defined in the Virtual Platform Specification (VIPS) [1], with extensions only where necessary.

List of models from VIPS used (add as necessary):

  1. EJP RD meta data model – findability of rare disease resources
  2. Clinical And Registry Entries (CARE) Semantic Model – core data standard describing common data elements essential for RD research

List of models not in VIPS used (add as necessary):

  1. FAIR Genomes metadata schema – semantic metadata schema to power reuse of NGS data

Queries to implement:

  • Query 1: …
  • Query 2: …

Footnotes

  1. See VIPS 2.0, page 17

@mroos mroos converted this from a draft issue Jan 15, 2024
@mroos
Collaborator Author

mroos commented Jan 15, 2024

The aim of this issue is to define a specific case that we can carry out with a minimal number of data elements, including a minimal number of CDEs, so as to demonstrate the use of Care-SM in queries.

Special request

  • Can we do a rare disease case and an oncology case in parallel? This would help adoption at the local institutes. Marco will bring this up with Karolis for the LCCO project.

@mroos mroos moved this from Backlog to Ready in L3-FAIR Data Train issues Jan 29, 2024
@wna-se
Contributor

wna-se commented Mar 5, 2024

It would be very useful to have a realistic and compelling query that includes the CARE-SM data elements related to Clinical measurements and Genetic assessment, together with a description of the sizes and characteristics of the datasets it would be run on and the expected results.

Perhaps something based on B1MG D4.1 - Secure cross-border data access roadmap - 1v0:
3 example use cases as defined by WG8 and given to WP4 as example use cases

@andrawaag
Collaborator

During the biohackathon on May 3rd, @wna-se and @andrawaag made this issue a place to collect query examples to be transformed into SPARQL.

@wna-se
Contributor

wna-se commented Jun 3, 2024

@markwilkinson Added you here as discussed during today’s meeting. Please link to / add any queries that you can share here and / or to the ejp-rd-vp/DistributedAnalysis repository.

@wna-se
Contributor

wna-se commented Jun 3, 2024

@NuriaQueralt For those working on Phenopackets and/or phenotypic data that could be present in case reports, the following datasets consisting of published case reports translated to Phenopackets may be useful.

Also, the JSON Schema validator from phenopackets / phenopacket-tools could perhaps be a useful resource to inspire the RDF/SHACL-mapped version. Notably, the folder with the JSON Schema gives an example of how they have mirrored the structure of the authoritative protobuf definitions. The validation rules could probably be translated into SHACL, and the choice of URIs used to reference the definitions could also be useful.

Is there an issue specifically for the Phenopacket work? Perhaps also relevant to @rosazwart ?
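
To make the idea of translating JSON Schema validation rules into SHACL a bit more concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the schema fragment is made up (it is not copied from phenopacket-tools), the `http://example.org/pheno#` namespace is a placeholder, and only `type`, `enum` and `required` are handled. The `sh:` terms themselves are standard SHACL.

```python
import json

# Hypothetical JSON Schema fragment, NOT taken from phenopacket-tools.
SCHEMA = json.loads("""
{
  "title": "Individual",
  "type": "object",
  "required": ["id"],
  "properties": {
    "id": {"type": "string"},
    "sex": {"type": "string", "enum": ["FEMALE", "MALE"]}
  }
}
""")

# Placeholder namespace, assumed for illustration only.
EX = "http://example.org/pheno#"

# JSON Schema primitive types mapped to XSD datatypes.
XSD_TYPES = {
    "string": "xsd:string",
    "integer": "xsd:integer",
    "boolean": "xsd:boolean",
}


def schema_to_shacl(schema, ns=EX):
    """Translate a flat JSON Schema object into a SHACL node shape (Turtle)."""
    title = schema["title"]
    required = set(schema.get("required", []))
    statements = [
        f"ex:{title}Shape a sh:NodeShape",
        f"    sh:targetClass ex:{title}",
    ]
    for name, spec in schema.get("properties", {}).items():
        constraints = [f"sh:path ex:{name}"]
        if spec.get("type") in XSD_TYPES:
            constraints.append(f"sh:datatype {XSD_TYPES[spec['type']]}")
        if "enum" in spec:
            values = " ".join(json.dumps(v) for v in spec["enum"])
            constraints.append(f"sh:in ( {values} )")
        if name in required:
            # "required" in JSON Schema becomes a minimum cardinality of 1.
            constraints.append("sh:minCount 1")
        # A non-array JSON property carries at most one value.
        constraints.append("sh:maxCount 1")
        statements.append("    sh:property [ " + " ; ".join(constraints) + " ]")
    prefixes = [
        "@prefix sh: <http://www.w3.org/ns/shacl#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        f"@prefix ex: <{ns}> .",
        "",
    ]
    return "\n".join(prefixes) + "\n" + " ;\n".join(statements) + " ."


SHAPE_TTL = schema_to_shacl(SCHEMA)
print(SHAPE_TTL)
```

Nested objects, arrays and `$ref` are deliberately left out; a real mapping of the Phenopacket schema would need to decide how those translate into nested shapes.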

@wna-se
Contributor

wna-se commented Jun 3, 2024

@andrawaag : @mroos said that you would be a great person to take the lead on this task. As we are preparing the synthetic data we have for the VP, it would be great to have some examples of genomic-related queries that we could prioritise mapping to, ideally a few queries based on @mroos’ mindmap (see the reference in the description of #57) and using the CARE-SM and FAIR Genomes semantic schemas.

Edit: @ericprud : I’m also tagging you here as discussed during today’s meeting. Some examples would be very helpful.

@mroos
Collaborator Author

mroos commented Jun 10, 2024

Update 10/6
@hbcesar, Annika, @andrawaag working on example queries.
@pabloalarconm asked to provide example data for the queries (CSV + conversion method).

  • @ericprud asks @pabloalarconm to share the resulting RDF on GitHub (or in a repo of choice) for others in this group to use.

@andrawaag : we may need an intermediate step first (chicken-and-egg). We need to find a way to get to the RDF, whereas Wolmar (and colleagues) need help converting from data that works for Beacon.
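
As a starting point for the "CSV + conversion method" mentioned above, a conversion could be sketched as below. This is only an illustration under stated assumptions: the CSV columns, the `http://example.org/` base and the two `vocab/` predicates are hypothetical, and the flat triples it emits are not the CARE-SM pattern. A real conversion would emit the CARE-SM structure instead.

```python
import csv
import io

# Hypothetical CSV; the column names are illustrative only.
CSV_TEXT = """patient_id,hpo_code,birth_year
P001,HP:0001250,1990
P002,HP:0004322,1985
"""

BASE = "http://example.org/"             # placeholder base IRI (assumption)
OBO = "http://purl.obolibrary.org/obo/"  # OBO PURL base used for HPO terms
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def row_to_ntriples(row):
    """Turn one CSV row into a list of N-Triples lines."""
    subject = f"<{BASE}patient/{row['patient_id']}>"
    # HPO CURIEs such as HP:0001250 become OBO PURLs ending in HP_0001250.
    phenotype = f"<{OBO}{row['hpo_code'].replace(':', '_')}>"
    year = row["birth_year"]
    return [
        f"{subject} <{BASE}vocab/hasPhenotype> {phenotype} .",
        f'{subject} <{BASE}vocab/birthYear> "{year}"^^<{XSD_INT}> .',
    ]


def csv_to_ntriples(text):
    """Convert the whole CSV text into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        triples.extend(row_to_ntriples(row))
    return triples


TRIPLES = csv_to_ntriples(CSV_TEXT)
print("\n".join(TRIPLES))
```

The output is plain N-Triples, so it can be committed to a repo and loaded into any triple store for trying out queries.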

@mroos mroos moved this from Ready to In progress in L3-FAIR Data Train issues Jun 10, 2024
andrawaag added a commit to ejp-rd-vp/DistributedAnalysisDemonstrator that referenced this issue Jun 10, 2024
@wna-se
Contributor

wna-se commented Jun 14, 2024

@NuriaQueralt For those working on Phenopackets and/or phenotypic data that could be present in case reports, the following datasets consisting of published case reports translated to Phenopackets may be useful.

Also, the JSON Schema validator from phenopackets / phenopacket-tools could perhaps be a useful resource to inspire the RDF/SHACL-mapped version. Notably, the folder with the JSON Schema gives an example of how they have mirrored the structure of the authoritative protobuf definitions. The validation rules could probably be translated into SHACL, and the choice of URIs used to reference the definitions could also be useful.

Is there an issue specifically for the Phenopacket work? Perhaps also relevant to @rosazwart ?

@andrawaag Above are two references to collections of Phenopackets that represent published case reports and could be useful source material for producing a realistic graph to query using @NuriaQueralt's and @rosazwart's mapping. The synthetic data that we have been working on in Sweden is a subset of files derived from the Rare Disease Synthetic Dataset, available in full from the European Genome-Phenome Archive (EGA) through accession number EGAD00001008392. See the example phenopacket, the PDF describing the data, and the full subset as well as derived files in NBISweden/ejprd-data/.

@wna-se
Contributor

wna-se commented Jun 17, 2024

The CARE-SM/beaconAPI4CARESM also contains some SPARQL templates that can be used to serve a Beacon endpoint.
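
For anyone unfamiliar with the template approach, the general idea of filling a SPARQL template from a request parameter can be sketched as follows. This is not the beaconAPI4CARESM code: the template, the `ex:hasPhenotype` predicate and the function name are hypothetical, and the real templates follow the CARE-SM model.

```python
from string import Template

# Hypothetical count-style template; the predicate is a placeholder and
# does not come from beaconAPI4CARESM or CARE-SM.
COUNT_TEMPLATE = Template("""\
PREFIX ex: <http://example.org/vocab/>
SELECT (COUNT(DISTINCT ?patient) AS ?n)
WHERE {
  ?patient ex:hasPhenotype <$phenotype_iri> .
}
""")

# Characters that the SPARQL grammar does not allow inside <...> IRIs.
FORBIDDEN = set('<>"{}|^`\\')


def build_count_query(phenotype_iri):
    """Fill the template after checking the IRI cannot break out of <...>."""
    if any(ch in FORBIDDEN or ord(ch) <= 0x20 for ch in phenotype_iri):
        raise ValueError(f"not a valid IRI for a SPARQL query: {phenotype_iri!r}")
    return COUNT_TEMPLATE.substitute(phenotype_iri=phenotype_iri)


QUERY = build_count_query("http://purl.obolibrary.org/obo/HP_0001250")
print(QUERY)
```

The IRI check matters because the value comes from the caller: without it, a crafted parameter could inject extra graph patterns into the query.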

@pabloalarconm
Member

Hi @wna-se @mroos

Some of these tasks tag me in this conversation, but it's not clear what you need. As the current main maintainer of CARE-SM, what exactly do you need from me for your use case? (You probably discussed this in a meeting I was not involved in.)

  • ShEx files for schema validation are already included here

  • SPARQL queries have always been here. There are two examples, but let me know if you need more cases added. The SPARQL queries from beaconAPI4CARESM are just fragments and hard to reuse at a first attempt, but let me know if you need help with that (I can join a meeting to discuss their implementation outside this API).

  • I will add exemplar RDF data to the CARE-SM implementation repo. Do you need it for every specific data element, or a single example representation?

Bests,
Pablo

@wna-se
Contributor

wna-se commented Jul 1, 2024

Pasted from e-mail by @NuriaQueralt on 20 June:

Dear all,

I have finished the phenopackets RDF model, in ShEx. You can have a look on GitHub, in branch “v2”. I modelled ONLY the elements required for the GDI use case. Rosa, I also modelled the Variant-related elements for the LUMC data, so you can start adapting your RDFization pipeline.

Good news! We have a bunch of phenopackets that follow the current schema version here: https://monarch-initiative.github.io/phenopacket-store/ I suggest using this set for our PoC. I may refine the model by adding some RDF examples using these data, so I may make some changes to the model.

My apologies, I cannot make it to today's meeting due to a clash in my agenda.

With kind regards,
Núria

@mroos
Collaborator Author

mroos commented Jul 22, 2024

@ericprud
Collaborator

ericprud commented Aug 2, 2024

  • I will add exemplar RDF data to the CARE-SM implementation repo. Do you need it for every specific data element, or a single example representation?

Ideally, we'd have a couple of nice examples that demonstrate the breadth of the expressions. This will serve as documentation and inspiration for schema and queries. Such examples could be cobbled together from multiple instances of the current JSON data.

Having all the data would also be handy as it would help us verify schema and queries and provide a corpus for tests. Would also be nice for demos.

Projects
Status: In progress

7 participants