This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

dataset used inconsistently across API #595

Open
diekhans opened this issue Apr 1, 2016 · 8 comments

@diekhans
Contributor

diekhans commented Apr 1, 2016

dataset is not used consistently by the data model. Since we have this concept for organizing data, it is proposed that every data object belong to some dataset, and that this not be optional. Data discovery and access become more complex if we don't have a firmly defined hierarchy.

  • ReadGroup has union { null, string } datasetId = null; is there any use case where ReadGroups in different ReadGroupSets will be in different Datasets? Suggest removing datasetId.
  • ReadGroupSet has union { null, string } datasetId = null; suggest making this a required field, as it is with VariantSet.
  • ReferenceSet does not have datasetId. Suggest removing this one special case and require ReferenceSet to be part of a dataset.
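
The ReadGroup and ReadGroupSet part of the proposal above can be sketched as follows. This is only an illustration, assuming Python dataclasses stand in for the schema records and that each ReadGroup has exactly one parent ReadGroupSet. The snake_case field names, `read_group_set_id`, and the `dataset_of` helper are hypothetical and not part of the API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReadGroupSet:
    id: str
    dataset_id: str  # required, as with VariantSet (no null default)


@dataclass(frozen=True)
class ReadGroup:
    id: str
    read_group_set_id: str  # hypothetical: one parent, no dataset_id of its own


def dataset_of(read_group: ReadGroup, sets_by_id: Dict[str, ReadGroupSet]) -> str:
    """Resolve a ReadGroup's dataset through its single parent ReadGroupSet."""
    return sets_by_id[read_group.read_group_set_id].dataset_id


sets_by_id = {"rgs1": ReadGroupSet(id="rgs1", dataset_id="ds1")}
rg = ReadGroup(id="rg1", read_group_set_id="rgs1")
print(dataset_of(rg, sets_by_id))
```

With a single required parent, the dataset of a ReadGroup is always derivable, so a separate datasetId on ReadGroup would be redundant.
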
@calbach
Contributor

calbach commented Apr 2, 2016

ReadGroup has union { null, string } datasetId = null; is there any use case where ReadGroups in different ReadGroupSets will be in different Datasets? Suggest removing datasetId.

I would agree, with the caveat that we need to be explicit that ReadGroup belongs to a single ReadGroupSet. IIRC this is a controversial topic. Otherwise it's a gray area as we can have a case where a ReadGroup belongs to multiple Datasets via multiple ReadGroupSet parents. And where does the ReadGroup belong if it has 0 parent ReadGroupSets?

ReadGroupSet has union { null, string } datasetId = null; suggest making this a required field, as it is with VariantSet.

+1. This should be uncontroversial.

ReferenceSet does not have datasetId. Suggest removing this one special case and require ReferenceSet to be part of a dataset.

+1 with a caveat; this is a bit more of a semantic change (details in my comment here). If we're doing this, we should consider putting a datasetId field on Reference, as it has the same many:many relationship as ReadGroup does.

@diekhans
Contributor Author

diekhans commented Apr 2, 2016

Thanks @calbach

CH Albach writes:

I would agree, with the caveat that we need to be explicit that ReadGroup
belongs to a single ReadGroupSet. IIRC this is a controversial topic. Otherwise
it's a gray area as we can have a case where a ReadGroup belongs to multiple
Datasets via multiple ReadGroupSet parents. And where does the ReadGroup belong
if it has 0 parent ReadGroupSets?

What is the use case for readgroups being in multiple
readgroupsets? I have never seen data organized that way. Also, it seems
that a readgroup should always have a parent readgroupset.

ReferenceSet does not have datasetId. Suggest removing this one special
case and require ReferenceSet to be part of a dataset.

+1 with a caveat; this is a bit more of a semantic change (details in my comment
here). If we're doing this, we should consider putting a datasetId field on
Reference, as it has the same many:many relationship as ReadGroup does.

A use case for sharing References in multiple ReferenceSets
is putting out an assembly where some chromosomes are unchanged,
such as GRCh37lite.

@richarddurbin
Contributor

Certainly the original intention was that a reference could belong to multiple ReferenceSets. Otherwise there will be lots of duplication.

Likewise, I always thought that References and ReferenceSets would stand outside Datasets. Again to avoid duplication because many datasets will use the same reference. Though I confess I can see the case for them to have an "owner" or responsible authority/source. And we perhaps should not rule out someone wanting to use a reference subject to access control, which I think is at the dataset level.

@diekhans
Contributor Author

diekhans commented Apr 2, 2016

Thank you @richarddurbin

Cross-dataset analysis should be as easy as analysis within the
same dataset, or we have a design flaw. While Google thinks of a
dataset as a unit of access and billing control, for open data a dataset
seems more akin to an NCBI BioProject.

We have a couple of terabytes of assembly hubs with more than
60,000 organisms. Having a dataset bag to put this in is very
helpful.

One design question I am asking is: should a ReferenceSet be
able to be composed of References from different datasets?

This relates to:
Cancer virus and human decoy search use cases not supported #429

Does one construct a new, combined referenceset for
multi-referenceset mapping or does one have a way of specifying
multiple referencesets?

I am not arguing for one approach over another, only for having
a rigorous specification on the supported use cases.

@calbach
Contributor

calbach commented Apr 5, 2016

What is the use case for readgroups being in multiple
readgroupsets? I have never seen data organized that way. Also, it seems
that a readgroup should always have a parent readgroupset.

I don't know of any; I would have to dig through old GitHub issues/email. I don't support having many:many here.

A use case for sharing References in multiple ReferenceSets
is putting out an assembly where some chromosomes are unchanged,
such as GRCh37lite.

I wasn't arguing against many:many here, just adding that Reference should likely get a datasetId field along with this change, as ownership will otherwise be ambiguous, e.g. a Reference belonging to 0 ReferenceSets, or a Reference belonging to 2 ReferenceSets with different datasetIds.
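
A small sketch of the ambiguity described here, assuming a many:many link between References and ReferenceSets and a datasetId only on ReferenceSet. The `owning_datasets` helper, the dictionaries, and all ids are hypothetical and only illustrate the two problem cases.

```python
from typing import Dict, List, Set


def owning_datasets(
    reference_id: str,
    members: Dict[str, List[str]],
    dataset_of_set: Dict[str, str],
) -> Set[str]:
    """Datasets reachable from a Reference via its parent ReferenceSets.

    members maps a ReferenceSet id to the ids of the References it contains;
    dataset_of_set maps a ReferenceSet id to its datasetId.
    """
    return {
        dataset_of_set[set_id]
        for set_id, reference_ids in members.items()
        if reference_id in reference_ids
    }


members = {"setA": ["chr1", "chr2"], "setB": ["chr1"]}
dataset_of_set = {"setA": "ds1", "setB": "ds2"}

# chr1 is shared by two sets in different datasets: ownership is ambiguous.
print(sorted(owning_datasets("chr1", members, dataset_of_set)))
# chrM belongs to no set: there is no owner at all.
print(sorted(owning_datasets("chrM", members, dataset_of_set)))
```

Anything other than exactly one resulting dataset means ownership cannot be derived from the parents, which is the argument for an explicit datasetId on Reference.
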

Does one construct a new, combined referenceset for
multi-referenceset mapping or does one have a way of specifying
multiple referencesets?

Not sure what the correct answer is, but given today's API the answer is clearly to create a new ReferenceSet, as there is no way of specifying a mapping to multiple ReferenceSets.
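
For illustration, building a combined ReferenceSet could look roughly like the sketch below. It assumes a ReferenceSet is just an id plus an ordered list of Reference ids; the `combine` helper and all ids are hypothetical, and nothing here is an existing API call.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ReferenceSet:
    id: str
    reference_ids: List[str]


def combine(new_id: str, sets: Iterable[ReferenceSet]) -> ReferenceSet:
    """Build one ReferenceSet holding every Reference once, in first-seen order."""
    seen = set()
    merged: List[str] = []
    for reference_set in sets:
        for reference_id in reference_set.reference_ids:
            if reference_id not in seen:
                seen.add(reference_id)
                merged.append(reference_id)
    return ReferenceSet(id=new_id, reference_ids=merged)


human = ReferenceSet("human", ["chr1", "chr2"])
virus = ReferenceSet("virus", ["EBV"])
combined = combine("human_plus_virus", [human, virus])
print(combined.reference_ids)
```

Because the combined set shares Reference ids with its sources rather than copying sequences, this only works cleanly if References may belong to more than one ReferenceSet.
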

@dglazer
Member

dglazer commented Apr 5, 2016

For history, #32 has the very long
original discussion of readgroups and readgroupsets. I too would be happy
with requiring each RG to have exactly one RGSet.

@diekhans
Contributor Author

diekhans commented Apr 5, 2016

Thanks @dglazer; that piece of history is enlightening. Some of the discussion seems to conflate the concepts of grouping for data production vs. data analysis. A fairly rigid model is good for data production, which is my understanding of what readgroupset (all readgroups for a sequencing experiment on a sample) and readgroup are.

For data analysis, one wants to be able to group things in fairly arbitrary ways. This is better done with different types of linking (maybe as simple as lists of readgroupsets).
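
As a rough sketch of the "lists of readgroupsets" idea, assuming analysis groupings live outside the production hierarchy: the `AnalysisGroup` record and the ids below are hypothetical and do not exist in the schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnalysisGroup:
    """An arbitrary, analysis-time grouping that only links to ReadGroupSets."""
    name: str
    read_group_set_ids: List[str] = field(default_factory=list)


# The same ReadGroupSet can appear in any number of analysis groups
# without changing which dataset or sample it was produced under.
tumor_normal = AnalysisGroup("tumor_normal_pair", ["rgs_tumor", "rgs_normal"])
cohort = AnalysisGroup("cohort", ["rgs_tumor", "rgs_normal", "rgs_other"])

shared = set(tumor_normal.read_group_set_ids) & set(cohort.read_group_set_ids)
print(sorted(shared))
```
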
