dataset used inconsistently across API #595

diekhans · 2016-04-01T22:06:16Z

Dataset is not used consistently by the data model. Since we have this concept for organizing data, it is proposed that every data object belong to some dataset and that this not be optional. Data discovery and access become more complex if we don't have a firmly defined hierarchy.

ReadGroup has union { null, string } datasetId = null; Is there any use case where ReadGroups in different ReadGroupSets will be in different Datasets? Suggest removing datasetId.
ReadGroupSet has union { null, string } datasetId = null; Suggest making this a required field, as it is with VariantSet.
ReferenceSet does not have datasetId. Suggest removing this one special case and require ReferenceSet to be part of a dataset.
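
To make the three suggestions concrete, here is a minimal Python sketch of the proposed shape. It is not the actual schema: the class layout, the snake_case field names, and the `dataset_of` helper are all hypothetical, and the `dataset_id` on `ReferenceSet` is the proposed addition rather than an existing field.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReadGroup:
    # Proposed: no datasetId here; the dataset is implied by the parent set.
    id: str
    read_group_set_id: str


@dataclass
class ReadGroupSet:
    id: str
    dataset_id: str  # proposed: required, no null default (as with VariantSet)
    read_groups: List[ReadGroup] = field(default_factory=list)


@dataclass
class ReferenceSet:
    id: str
    dataset_id: str  # proposed addition; hypothetical field


def dataset_of(read_group: ReadGroup, sets_by_id: Dict[str, ReadGroupSet]) -> str:
    """Resolve a ReadGroup's dataset through its single parent ReadGroupSet."""
    return sets_by_id[read_group.read_group_set_id].dataset_id
```

Under these assumptions, constructing a `ReadGroupSet` without a `dataset_id` fails at construction time, which is the "not optional" behavior being proposed.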

calbach · 2016-04-02T00:01:41Z

ReadGroup has union { null, string } datasetId = null; Is there any use case where ReadGroups in different ReadGroupSets will be in different Datasets? Suggest removing datasetId.

I would agree, with the caveat that we need to be explicit that ReadGroup belongs to a single ReadGroupSet. IIRC this is a controversial topic. Otherwise it's a gray area as we can have a case where a ReadGroup belongs to multiple Datasets via multiple ReadGroupSet parents. And where does the ReadGroup belong if it has 0 parent ReadGroupSets?

ReadGroupSet has union { null, string } datasetId = null; Suggest making this a required field, as it is with VariantSet.

+1. This should be uncontroversial.

ReferenceSet does not have datasetId. Suggest removing this one special case and require ReferenceSet to be part of a dataset.

+1 with a caveat: this is a bit more of a semantic change (details on my comment here). If we're doing this, we should consider putting a datasetId field on Reference, as it has the same many:many relationship as ReadGroup does.

diekhans · 2016-04-02T04:24:21Z

Thanks @calbach

CH Albach writes:

I would agree, with the caveat that we need to be explicit that ReadGroup
belongs to a single ReadGroupSet. IIRC this is a controversial topic. Otherwise
it's a gray area as we can have a case where a ReadGroup belongs to multiple
Datasets via multiple ReadGroupSet parents. And where does the ReadGroup belong
if it has 0 parent ReadGroupSets?

What is the use case for readgroups being in multiple
readgroupsets? I have never seen data organized that way. Also, it
seems that a readgroup should always have a parent readgroupset.

ReferenceSet does not have datasetId. Suggest removing this one special
case and require ReferenceSet to be part of a dataset.
+1 with a caveat: this is a bit more of a semantic change (details on my comment
here). If we're doing this, we should consider putting a datasetId field on
Reference, as it has the same many:many relationship as ReadGroup does.

A use case for sharing References in multiple ReferenceSets
is putting out an assembly where some chromosomes are unchanged,
such as GRCh37lite.

richarddurbin · 2016-04-02T08:48:07Z

Certainly the original intention was that a reference could belong to multiple ReferenceSets. Otherwise there will be lots of duplication.

Likewise, I always thought that References and ReferenceSets would stand outside Datasets. Again to avoid duplication because many datasets will use the same reference. Though I confess I can see the case for them to have an "owner" or responsible authority/source. And we perhaps should not rule out someone wanting to use a reference subject to access control, which I think is at the dataset level.

diekhans · 2016-04-02T17:04:22Z

Thank you @richarddurbin

Cross-dataset analysis should be as easy as analysis within the
same dataset, or we have a design flaw. While Google thinks of a
dataset as a unit of access and billing control, for open data a
dataset seems more akin to an NCBI BioProject.

We have a couple of terabytes of assembly hubs with more than
60,000 organisms. Having a dataset bag to put this in is very
helpful.

One design question I am asking is: should a ReferenceSet be
able to be composed of References from different datasets?

This relates to:
Cancer virus and human decoy search use cases not supported #429

Does one construct a new, combined referenceset for
multi-referenceset mapping or does one have a way of specifying
multiple referencesets?

I am not arguing for one approach over another, only for having
a rigorous specification on the supported use cases.

calbach · 2016-04-05T03:56:12Z

What is the use case for readgroups being in multiple
readgroupsets? I have never seen data organized that way. Also, it
seems that a readgroup should always have a parent readgroupset.

I don't know of any, would have to dig through old GitHub issues/email. I don't support having many:many here.

A use case for sharing References in multiple ReferenceSets
is putting out an assembly where some chromosomes are unchanged,
such as GRCh37lite.

I wasn't arguing against many:many here, just adding that it should likely have a datasetId field along with this change, as ownership will otherwise be ambiguous, e.g. Reference belonging to 0 ReferenceSets or a Reference belonging to 2 ReferenceSets with different datasetIds.
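
The ambiguity can be shown with a small Python sketch. The dict layout (`datasetId`, `referenceIds`) and both function names are assumptions made for illustration, not the schema's actual types; `datasetId` on a ReferenceSet is the field proposed in this issue.

```python
from typing import Dict, List, Set


def owning_datasets(reference_id: str, reference_sets: List[Dict]) -> Set[str]:
    """Collect the datasetIds of every ReferenceSet that contains the Reference."""
    return {
        rs["datasetId"]
        for rs in reference_sets
        if reference_id in rs["referenceIds"]
    }


def ownership_is_ambiguous(reference_id: str, reference_sets: List[Dict]) -> bool:
    # Ambiguous when no parent set exists (0 owners) or the parents
    # disagree about the dataset (2 or more distinct owners).
    return len(owning_datasets(reference_id, reference_sets)) != 1


# Hypothetical data: "chr1" is shared by sets in two different datasets.
reference_sets = [
    {"id": "setA", "datasetId": "ds1", "referenceIds": ["chr1", "chr2"]},
    {"id": "setB", "datasetId": "ds2", "referenceIds": ["chr1"]},
]
```

With this data, `chr1` has two candidate owners and an unlisted Reference has none, so both are ambiguous. An explicit datasetId on Reference would settle both cases without consulting the parent sets.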

Does one construct a new, combined referenceset for
multi-referenceset mapping or does one have a way of specifying
multiple referencesets?

Not sure what the correct answer is, but given today's API the answer is clearly to create a new ReferenceSet, as there is no way of specifying a mapping to multiple.
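
As a rough illustration of the "create a new ReferenceSet" route, the sketch below merges the References of several sets into one new record. The dict layout, the ids, and `combine_reference_sets` are hypothetical and do not correspond to any existing API call.

```python
from typing import Dict, List


def combine_reference_sets(new_id: str, reference_sets: List[Dict]) -> Dict:
    """Build a new ReferenceSet record holding the References of all inputs."""
    seen = set()
    reference_ids = []
    for rs in reference_sets:
        for ref_id in rs["referenceIds"]:
            if ref_id not in seen:  # a shared Reference is listed only once
                seen.add(ref_id)
                reference_ids.append(ref_id)
    return {"id": new_id, "referenceIds": reference_ids}


# Hypothetical inputs: a human set plus a decoy/virus set that share one Reference.
human = {"id": "human", "referenceIds": ["chr1", "chr2"]}
decoy = {"id": "decoy", "referenceIds": ["decoy1", "chr2"]}
combined = combine_reference_sets("human_plus_decoy", [human, decoy])
```

Because the References themselves are shared by id rather than copied, this only works if a Reference may belong to more than one ReferenceSet, which ties back to the many:many question above.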

dglazer · 2016-04-05T11:56:14Z

For history, #32 has the very long
original discussion of readgroups and readgroupsets. I too would be happy
with requiring each RG to have exactly one RGSet.

diekhans · 2016-04-05T20:58:09Z

Thanks @dglazer; that piece of history is enlightening. Some of the discussion seems to conflate the concepts of grouping for data production vs. data analysis. A fairly rigid model is good for data production, which is my understanding of what readgroupset (all readgroups for a sequencing experiment on a sample) and readgroup are.

For data analysis, one wants to be able to group things in fairly arbitrary ways. This is better done with different types of linking (maybe as simple as lists of readgroupsets).
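
A minimal Python sketch of the "lists of readgroupsets" idea follows, assuming a hypothetical `AnalysisGroup` record that links to readgroupsets by id and leaves the production hierarchy untouched. None of these names come from the schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ReadGroupSet:
    # Production-side record: rigidly tied to exactly one dataset.
    id: str
    dataset_id: str


@dataclass
class AnalysisGroup:
    # Analysis-side record: an arbitrary list of links, no ownership implied.
    name: str
    read_group_set_ids: List[str] = field(default_factory=list)


def datasets_spanned(group: AnalysisGroup, sets_by_id: Dict[str, ReadGroupSet]) -> List[str]:
    """List the datasets an analysis grouping reaches into, in sorted order."""
    return sorted({sets_by_id[i].dataset_id for i in group.read_group_set_ids})


sets_by_id = {
    "rgs1": ReadGroupSet("rgs1", "ds1"),
    "rgs2": ReadGroupSet("rgs2", "ds2"),
}
cohort = AnalysisGroup("my_cohort", ["rgs1", "rgs2"])
```

Here `cohort` spans two datasets without any readgroupset changing its dataset, so cross-dataset grouping stays separate from the production model.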

diekhans added the API Consistency label Apr 1, 2016

david4096 mentioned this issue Jun 20, 2016

BioMetadata Protocol #636

Merged
