Synthetic genome data and 1+ Million Genomes Framework services (NBIS - ELIXIR Sweden) #29

Open
mroos opened this issue Jan 29, 2024 · 14 comments
Labels
use case step This is a step of a use case

@mroos
Collaborator

mroos commented Jan 29, 2024

The use case described in #57 will be used to identify requirements on the interfaces between the VP1 and the connected resources, on the data objects that are communicated, and the services that process the data. Some focus will be on leveraging solutions offered by B1MG/GDI compatible resources that enable secure and federated queries for genomic/phenotypic information. EJP-RD partners and Swedish actors generating/stewarding rare disease data with genomic components will be consulted to maximise the utility of the use case in validating technical solutions and in supporting onboarding of new EJP-RD resources.

Relevance to EJP-RD and the wider health data community:

  1. Demonstrator with examples of scenarios where a federated approach to queries/analysis would be ideal—including synthetic data that can be used to test and validate technical solutions (this is what we aim to do with our work in EJP-RD)
  2. Showcase on how to prepare resources (such as clinical datastores, biobanks, FEGA, GDI) to be able to connect to and provide services of value to users through federated platforms (where the Virtual Platform of the EJP-RD is an example)

References:

Implementation:

  • Mapping B1MG Rare Disease use cases and demonstrator to scenarios using EJP-RD’s Virtual Platform
  • Mapping the EGA metadata model to the EJP-RD resource metadata schema and CARE-SM
  • Mapping metadata and content-level information (data elements, files and relations) that doesn’t fit into the EJP-RD models to concepts that can be represented in RDF, for example using FAIR Genomes
  • Identifying and mapping common services from platforms such as Scout, GPAP and GDI Starter Kit to the FAIR Data Services for EJP-RD supported by concepts emerging from the FAIR Data Train
  • Configuring a local test bed that implements a selection of the services identified above but minimally:
    1. SPARQL for provisioning CARE-SM and genomic data
    2. Beacon v2 with genomicVariations endpoint
    3. htsget for genomic variation data (and possibly aligned genome data)
    4. GA4GH TESK implementation that supports containerised compute
  • Executable demonstrator scenarios with corresponding synthetic and mock data services
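As a rough illustration of what a test query against the Beacon v2 service in such a test bed might look like, the sketch below only builds the request URL and makes no network call. The base URL is a placeholder, and the `/g_variants` path and parameter names follow the Beacon v2 default model as an assumption; a deployed instance may expose the genomicVariations entry type differently, so its own endpoint map should be checked.

```python
from urllib.parse import urlencode

# Hypothetical base URL for a Beacon v2 instance on the local test bed.
BEACON_BASE_URL = "https://beacon.example.org/api"


def build_variant_query_url(reference_name, start, reference_bases,
                            alternate_bases, assembly_id="GRCh38",
                            base_url=BEACON_BASE_URL):
    """Return a GET URL asking a Beacon v2 about one sequence variant."""
    params = {
        "assemblyId": assembly_id,
        "referenceName": reference_name,
        "start": start,  # Beacon coordinates are 0-based
        "referenceBases": reference_bases,
        "alternateBases": alternate_bases,
    }
    # Assumed path for the genomicVariations entry type.
    return f"{base_url}/g_variants?{urlencode(params)}"


# Arbitrary illustrative coordinates, not taken from the synthetic dataset.
url = build_variant_query_url("17", 43045702, "G", "A")
print(url)
```

Keeping URL construction separate from the HTTP call makes it easy to reuse the same function against a mock service and a real one.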

Footnotes

  1. Intro to 1+MG, B1MG, and GDI, https://framework.onemilliongenomes.eu/about-the-framework

@mroos mroos converted this from a draft issue Jan 29, 2024
@dwijnbergen

This comment was marked as resolved.

@wna-se
Contributor

wna-se commented Feb 5, 2024

Status update 2024-02-05:
We are currently working on creating a suitable subset of a synthetic dataset hosted in the European Genome-Phenome Archive (EGA) EGA:EGAD00001008392 and corresponding GDI starter-kit service endpoints.

On the technical side, we have set up a local development instance of the FIAB that we will use to test our mapping from the genomic use case document before the data is transferred to the testbed instance.

Issues: We need some guidance on what we need to prepare to add data to the testbed.

@mroos

This comment was marked as resolved.

@wna-se
Contributor

wna-se commented Mar 4, 2024

SWAT4HCLS updates (programme):

  • Mon: Tutorials - Contributing to the content and training new members of the team in the EJP-RD FDP configuration
  • Tue: Co-located meeting - Clinical genetics LUMC collaborators to prepare hackathon.
  • Thu: Hackathon - EJP-RD scenario for federating SPARQL through the FDP
  • Fri: Co-located meeting - LUMC + ELIXIR Sweden EJP-RD.

Interaction pattern

Milestones

Reflections:

  • For the genomics case and resources like the EGA and GDI, I think that part of the challenge might be a lack of widely established/used models that encompass all of the concepts represented in the common formats/tools for WGS/VCF files.
  • Where there are suitable concepts in CARE-SM, we will of course use them and set up the transformations (CSV-YARRRML or otherwise). A simple solution for making sensitive data available for SPARQL queries within a secured environment could be to rely on the GDI approach for federated analyses more generally and wrap the queries in a request to the GA4GH Task Execution Service endpoint.
  • We might be able to find a way to create a secure SPARQL endpoint that can be exposed on the FDP and internally translate the requests to use the secure infrastructure for executing the queries where sensitive data can be accessed.
  • Could you perhaps write a few example SPARQL queries that we should be able to run across our mock-FDPs?
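The idea of wrapping a SPARQL query in a request to a GA4GH Task Execution Service could look roughly like the sketch below. It only builds the task description as a Python dict using field names from the TES task schema; the container image, the `run-query` command, the internal endpoint and the output URL are all hypothetical placeholders, and submitting the task to a real TES instance is left out.

```python
import json

# Hypothetical container image bundling a SPARQL client; not a published image.
SPARQL_RUNNER_IMAGE = "example.org/sparql-runner:latest"


def build_tes_task(sparql_query, endpoint_url, output_url):
    """Wrap a SPARQL query in a GA4GH TES task description (a plain dict)."""
    return {
        "name": "federated-sparql-query",
        "description": "Run a SPARQL query inside the secured environment",
        "inputs": [
            # Small inputs can be passed inline through the "content" field.
            {"path": "/work/query.rq", "content": sparql_query},
        ],
        "outputs": [
            {"path": "/work/results.json", "url": output_url, "type": "FILE"},
        ],
        "executors": [
            {
                "image": SPARQL_RUNNER_IMAGE,
                # "run-query" is a hypothetical entry point inside the image.
                "command": ["run-query", "--endpoint", endpoint_url,
                            "--query", "/work/query.rq",
                            "--out", "/work/results.json"],
                "workdir": "/work",
            }
        ],
    }


QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"
task = build_tes_task(QUERY,
                      "http://triplestore.internal/sparql",    # placeholder
                      "s3://example-bucket/results.json")      # placeholder
print(json.dumps(task, indent=2))
```

In this arrangement only the query text and the result file cross the boundary of the secured environment; the triple store itself is never exposed.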

@wna-se
Contributor

wna-se commented Mar 11, 2024

CARE-SM for the genomics use case:
It seems that the CARE-SM model implementations have changed quite a lot since the CDE version we used last year. With the new Laboratory measurement module and the Laboratory Procedure type, we would like to relate to one of its subclasses, such as Whole Exome Sequencing or Disease Panel Gene Sequencing, as we mentioned below. The sio:has-target should probably relate to something under DNA Sequence in the Anatomic Structure, System, or Substance tree, with value_datatype set to IRI. Would it be valid to add an optional column called something like model_subclass (as a replacement of processURI) to allow assigning more specific types of Laboratory Procedure?
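To make the question about an optional column concrete, here is a minimal sketch of how a transformation step might treat it. Only `value_datatype` and the proposed `model_subclass` come from the discussion above; the other column names, the patient identifiers and all IRIs are placeholders and are not taken from the CARE-SM implementation.

```python
import csv
import io

# Placeholder IRIs, not real ontology terms.
GENERIC_PROCEDURE = "https://example.org/terms/LaboratoryProcedure"
WES = "https://example.org/terms/WholeExomeSequencing"

# Hypothetical input: the second row leaves the optional column empty.
CSV_TEXT = """model,pid,target,value_datatype,model_subclass
Laboratory,patient-1,https://example.org/terms/DNASequence,IRI,https://example.org/terms/WholeExomeSequencing
Laboratory,patient-2,https://example.org/terms/DNASequence,IRI,
"""


def procedure_type(row):
    """Use the more specific subclass when given, else the generic type."""
    return row.get("model_subclass") or GENERIC_PROCEDURE


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
types = [procedure_type(row) for row in rows]
print(types)
```

Because the column is optional, existing CSV files without it would keep producing the generic Laboratory Procedure type.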

The IRI could for example be related to a DCAT Dataset and DataService offering access through the GA4GH htsget protocol, https://www.ga4gh.org/news_item/htsget-ga4ghs-streaming-api-is-a-bridge-to-the-future-for-modern-genomic-data-processing/
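For orientation, a client following such an IRI to an htsget service would first request a "ticket" and then download the data blocks the ticket lists. The sketch below builds a variants request URL and reads the block URLs from a ticket; the base URL, the dataset identifier and the mock ticket are invented for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical htsget service; the identifier below is also made up.
HTSGET_BASE_URL = "https://htsget.example.org"


def build_variants_url(record_id, reference_name=None, start=None, end=None,
                       base_url=HTSGET_BASE_URL):
    """Return an htsget ticket URL for (a region of) a variant file."""
    params = {"format": "VCF"}
    if reference_name is not None:
        params["referenceName"] = reference_name
    if start is not None:
        params["start"] = start  # 0-based, inclusive
    if end is not None:
        params["end"] = end      # 0-based, exclusive
    return f"{base_url}/variants/{record_id}?{urlencode(params)}"


def ticket_urls(ticket):
    """Extract the data block URLs from an htsget ticket response."""
    return [block["url"] for block in ticket["htsget"]["urls"]]


ticket_url = build_variants_url("synthetic-case-1", "1", 10000, 20000)

# Mock ticket in the shape of an htsget response, not real service output.
mock_ticket = {"htsget": {"format": "VCF", "urls": [
    {"url": "https://htsget.example.org/data/synthetic-case-1/header"},
    {"url": "https://htsget.example.org/data/synthetic-case-1/block-1"},
]}}
print(ticket_url)
print(ticket_urls(mock_ticket))
```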

When it comes to the modules under Genomic assessment, I think that it would make sense to have an input IRI referencing the outputs of omics-related lab measurements (e.g. WES, Panels etc) or some computational variant analysis process (perhaps a new assessment type?).

Synthetic data:
Reached out to Sergi Beltran (CNAG) regarding the Rare Disease Synthetic Dataset (EGAD00001008392) and the Rare Disease Use Case from B1MG D4.1 Secure data access roadmap, to find contacts who have been involved in creating the dataset or use case, or examples of analyses/tools that have relied on the dataset or demonstrated the use case and that could be translated to a federated example over the VP.

Future direction:

  • Follow up on Leon’s example, e.g. create a mock SPARQL query that selects imaging outputs associated with a patient age and retrieves the patient age and the corresponding IRIs to images
  • Create a mock query (ab)using the existing Lab measurement and Genomic variant implementations to select outputs associated with a diagnosis and retrieve the related IRIs to genomic sequences
  • Add an issue to the CARE-SM implementations repository explaining the query above to assess if an extension of the existing models or creating a new model would be the ideal solution
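A mock query of the second kind could be drafted along the lines below. The `ex:` namespace and its predicates are invented placeholders and deliberately do not reproduce the CARE-SM graph shape, which would need to be substituted once the modelling question is settled; the diagnosis IRI is likewise a dummy value.

```python
# Hypothetical namespace and predicates; not taken from CARE-SM.
MOCK_PREFIX = "https://example.org/mock-fdp/"


def build_sequence_query(diagnosis_iri):
    """Return a SPARQL query selecting sequence IRIs for one diagnosis."""
    return f"""
PREFIX ex: <{MOCK_PREFIX}>
SELECT ?patient ?sequence
WHERE {{
  ?patient   ex:hasDiagnosis  <{diagnosis_iri}> ;
             ex:hasLabProcess ?procedure .
  ?procedure ex:hasOutput     ?sequence .
}}
""".strip()


# Dummy diagnosis IRI used only to show the generated query text.
query = build_sequence_query("https://example.org/diagnosis/D1")
print(query)
```

The function inserts the IRI into the query text as-is, so in anything beyond a mock the input would need validating before use.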

@wna-se
Contributor

wna-se commented Mar 25, 2024

Update:

Future direction:

@wna-se
Contributor

wna-se commented Apr 15, 2024

Update:

Future direction:

@wna-se
Contributor

wna-se commented Apr 22, 2024

Update:

@wna-se wna-se changed the title from “Expose Swedish FDPs containing synthetic genome data on the test bed” to “Synthetic genome data & 1+ Million Genomes Framework implementation (NBIS - ELIXIR Sweden)” Apr 22, 2024
@wna-se wna-se changed the title from “Synthetic genome data & 1+ Million Genomes Framework implementation (NBIS - ELIXIR Sweden)” to “Synthetic genome data and 1+ Million Genomes Framework services (NBIS - ELIXIR Sweden)” Apr 22, 2024
@wna-se
Contributor

wna-se commented Apr 29, 2024

Update:

Future direction:

  • Add GDI Beacon endpoint to local testbed

@NuriaQueralt

This comment was marked as resolved.

@wna-se
Contributor

wna-se commented Apr 30, 2024

@NuriaQueralt, I’ve made a copy of (a subset of) the files in a private GitHub repository and invited you as a collaborator. Once you have accepted the invitation you can find one of the files here and a narrative description is available here.

Anyone can register an account on EGA and request access to the full dataset if you want a local copy at LUMC. I’ve reached out to the RD Connect Platform team to look into under what conditions the data can be redistributed more broadly.

@NuriaQueralt

This comment was marked as resolved.

@wna-se wna-se mentioned this issue May 20, 2024
@wna-se
Contributor

wna-se commented Jul 1, 2024

Update:

  • Local testbed configuration (NBISweden/ejprd) is being deployed to a public cloud service and will be available this week. Some LS-AAI issues remain to be resolved; otherwise we will fall back to a mock service
  • Use case demonstrator scenarios / questions to answer using the VP have been developed and will be added to the testbed

@mroos
Collaborator Author

mroos commented Jul 22, 2024

FYI: Alberto shared VP testbed configuration in the Teams chat.

Projects
Status: In progress