Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Change referenceName to referenceId #616

Closed
wants to merge 1 commit into from
Closed

Conversation

david4096
Copy link
Member

@david4096 david4096 commented Apr 27, 2016

When performing searches on the Variants or Features endpoints the referenceName field allows the client to define which contig to search on. This is problematic because reference names are not required to be unique, and it is not immediately clear how to form the "referenceName" field. Is it "1" or "chr1"? This is further complicated by the fact that references can belong to multiple Reference Sets.

By using the referenceId in its place we ensure that each request is against a specific reference and we get better guarantees about the relationships in the data. The Reads API properly makes this distinction:

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/readmethods.avdl#L26

@diekhans
Copy link
Contributor

+1, this is a big inconsistency in the API.

This is related to: #593

David Steinberg [email protected] writes:

When performing searches on the variants or feature's endpoints the
referenceName field allows the client to define which contig to search on. This
is problematic because reference names are not required to be unique, and it is
not immediately clear how to form the "referenceName" field. Is it "1" or
"chr1"? This is further complicated by the fact that references can belong to
multiple Reference Sets.

By using the referenceId in its place we ensure that each request is against a
specific reference and we get better guarantees about the relationships in the
data. The Reads API properly makes this distinction:

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/
readmethods.avdl#L26

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

You can view, comment on, or merge this pull request online at:

#616

Commit Summary

• Change referenceName to referenceId

File Changes

• M src/main/resources/avro/alleleAnnotationmethods.avdl (15)
• M src/main/resources/avro/sequenceAnnotationmethods.avdl (2)
• M src/main/resources/avro/sequenceAnnotations.avdl (3)
• M src/main/resources/avro/variantmethods.avdl (2)
• M src/main/resources/avro/variants.avdl (2)

Patch Links:

https://github.com/ga4gh/schemas/pull/616.patch
https://github.com/ga4gh/schemas/pull/616.diff


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub*

@richarddurbin
Copy link
Contributor

Within a referenceSet the referenceNames need to be unique, of course - this was explicit in an early version of the schema.

The same reference may be in different referenceSets, and may (unfortunately) have different names in those different sets. This is just the truth - it is called
chr1 by some people and 1 by others. The human genetics user community will never convert to using accession numbers or other lengthy identifiers in place
of chromosome names.

If you want to use ids then you must be able to query by name to find the id, so that when I look for 1:214217-412291 I can find the referenceId
for the reference for chromosome 1. Using the name as the identifier within Position and other structures that are unambiguously within a referenceSet
avoids this additional lookup, and to my mind makes things more transparent and clearer, but if you want it the other way then OK.

I am concerned that the core design, over which there was a lot of careful thought by people with a lot of experience of representing genetic data, is being
or has been mangled by layers of these sorts of decisions that are being made piecemeal without joint oversight and buy-in.

Richard

On 28 Apr 2016, at 00:50, Mark Diekhans [email protected] wrote:

+1, this is a big inconsistency in the API.

This is related to: #593 #593

David Steinberg <[email protected] mailto:[email protected]> writes:

When performing searches on the variants or feature's endpoints the
referenceName field allows the client to define which contig to search on. This
is problematic because reference names are not required to be unique, and it is
not immediately clear how to form the "referenceName" field. Is it "1" or
"chr1"? This is further complicated by the fact that references can belong to
multiple Reference Sets.

By using the referenceId in its place we ensure that each request is against a
specific reference and we get better guarantees about the relationships in the
data. The Reads API properly makes this distinction:

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/ https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/
readmethods.avdl#L26

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

You can view, comment on, or merge this pull request online at:

#616

Commit Summary

• Change referenceName to referenceId

File Changes

• M src/main/resources/avro/alleleAnnotationmethods.avdl (15)
• M src/main/resources/avro/sequenceAnnotationmethods.avdl (2)
• M src/main/resources/avro/sequenceAnnotations.avdl (3)
• M src/main/resources/avro/variantmethods.avdl (2)
• M src/main/resources/avro/variants.avdl (2)

Patch Links:

https://github.com/ga4gh/schemas/pull/616.patch
https://github.com/ga4gh/schemas/pull/616.diff


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub*


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub #616 (comment)

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@diekhans
Copy link
Contributor

changing that to a -1, sorry when I first looked at it; I though I added reference id; now I see it replaces reference name with id in searches. This does not solve the problem of multiple names, because you
still have to look up the id.

We need to address the inconsistencies with the API's use of names vs ids as a whole.

@diekhans
Copy link
Contributor

There are two issue here, somewhat related:

  • The multiple commonly used names for chromosomes. We can't
    fix chr1 vs 1 and need to support a list of names. Using id
    doesn't address this issue.
  • The design of the API appears to be the all object linkage
    is done by id, not name. So when one object references
    another, it is by id. When one finds an object initially by
    name, then goes from object to object via id. Id is really
    an server-local pointer. I found this very weird, but I
    maybe WEB APIs are done this way. I am guessing at this
    design paradigm from talking to people, it's not documented,
    so I could be wrong.

The API is inconsistent, sometimes objects are linked by name,
sometimes by id. The API does not require chromosome name to be
unique. This is nuts, but that is what the GA4GH API explicitly
states.

We need to do a major architectural review of the API. It is
inconsistent and the documentation is still inadequate.

Richard Durbin [email protected] writes:

Within a referenceSet the referenceNames need to be unique, of course - this
was explicit in an early version of the schema.

The same reference may be in different referenceSets, and may (unfortunately)
have different names in those different sets. This is just the truth - it is
called
chr1 by some people and 1 by others. The human genetics user community will
never convert to using accession numbers or other lengthy identifiers in place
of chromosome names.

If you want to use ids then you must be able to query by name to find the id,
so that when I look for 1:214217-412291 I can find the referenceId
for the reference for chromosome 1. Using the name as the identifier within
Position and other structures that are unambiguously within a referenceSet
avoids this additional lookup, and to my mind makes things more transparent and
clearer, but if you want it the other way then OK.

I am concerned that the core design, over which there was a lot of careful
thought by people with a lot of experience of representing genetic data, is
being
or has been mangled by layers of these sorts of decisions that are being made
piecemeal without joint oversight and buy-in.

Richard

On 28 Apr 2016, at 00:50, Mark Diekhans [email protected] wrote:

+1, this is a big inconsistency in the API.

This is related to: #593 <https://
github.com//issues/593>

David Steinberg <[email protected] mailto:[email protected]>
writes:

When performing searches on the variants or feature's endpoints the
referenceName field allows the client to define which contig to search on.
This
is problematic because reference names are not required to be unique, and
it is
not immediately clear how to form the "referenceName" field. Is it "1" or
"chr1"? This is further complicated by the fact that references can belong
to
multiple Reference Sets.

By using the referenceId in its place we ensure that each request is
against a
specific reference and we get better guarantees about the relationships in
the
data. The Reads API properly makes this distinction:

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/
https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/
readmethods.avdl#L26

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

You can view, comment on, or merge this pull request online at:

#616

Commit Summary

• Change referenceName to referenceId

File Changes

• M src/main/resources/avro/alleleAnnotationmethods.avdl (15)
• M src/main/resources/avro/sequenceAnnotationmethods.avdl (2)
• M src/main/resources/avro/sequenceAnnotations.avdl (3)
• M src/main/resources/avro/variantmethods.avdl (2)
• M src/main/resources/avro/variants.avdl (2)

Patch Links:

https://github.com/ga4gh/schemas/pull/616.patch
https://github.com/ga4gh/schemas/pull/616.diff


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub*


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub <https://github.com/ga4gh/
schemas/pull/616#issuecomment-215266594>

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub*

@mbaudis
Copy link
Member

mbaudis commented May 2, 2016

@diekhans So maybe someone could provide a list of occurrences of both, linking by "id" and linking by "name"? But anyway, should be "id" (and is documented like that in the metadata docs ...).

(I personally find the "search by name/alias, link by id" not so weird ...)

@richarddurbin
Copy link
Contributor

From my perspective “id" should be treated like an opaque handle, specific to the particular server you are talking to,
to facilitate repeated interaction, so like a pointer.

Whereas “name” is something with external meaning, so can be used as an entry point for search, for display back to
users, and for export of results, for example into BAM or VCF but also potentially linking across different GA4GH servers,
which would have their own different ‘id's for the same thing. I note that MatchMaker Exchange can only work if people
use externally meaningful names rather than server-specific ids.

There is a case that since the ids are really private to the server, one should minimise their use in the interface.
After all they don’t mean anything to the client. Of course this is only possible when the names are guaranteed to
behave as unique identifiers, but that is required to be the case for References (though whether we can really
enforce that in the current model is not clear - we could in the original model by construction).

Once we get into this topic of names being good identifiers we get concerned with scope. Think about variable names in
software. When we write and debug a program, we want to use and see meaningful variable names, not memory
locations as in assembly code, and the same name may end up as a different location in different compiled and executed
systems (compare to different ids in different GA4GH servers). We control name clashes with scope. We haven’t thought
much about scope in GA4GH, perhaps because we haven’t thought much about the write interface to the server, nor about
compatibility (consistency?) across severs, though no doubt there will be many servers.

Richard

On 2 May 2016, at 17:51, Michael Baudis [email protected] wrote:

@diekhans https://github.com/diekhans So maybe someone could provide a list of occurrences of both, linking by "id" and linking by "name"? But anyway, should be "id" (and is documented like that in the metadata docs https://github.com/ga4gh/schemas/blob/biosample/doc/source/api/metadata.rst#common-attribute-names-and-formats ...).

(I personally find the "search by name/alias, link by id" not so weird ...)


You are receiving this because you commented.
Reply to this email directly or view it on GitHub #616 (comment)

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@mbaudis
Copy link
Member

mbaudis commented May 2, 2016

@richarddurbin Yes, agreement here. Implementation-wise, I am all for a general solution where:

  • id is whatever format is used locally, guaranteeing uniqueness, stable links/pointers
  • name is locally and "context based" unique (NCBI36.1; GSM198179) - but this is through documentation only, with enforcement being left to the implementer
  • aliases (¿aternativeNames? ...) provides alternative names that have been used for the same data representations (hg18 ...)

In principle, for the genome editions one could enforce/request a specific value space, to which queries etc. could be converted. But this would have to be maintained through GA4GH, and wouldn't be feasible with regard to non-human genome space. And it wouldn't work as a general data structure, for different types of records.

So, IMO id + name + aliases. And id for referencing, name and aliases for search/matching.

@lh3
Copy link
Member

lh3 commented May 2, 2016

In my view

  • id is like a foreign key in SQL, which can be auto-incremental and is not long-term stable.
  • We talked about symbol before. It still deserves its place. A symbol usually does not contain spaces or special characters. We may require it to be unique within a scope. For example, HGNC defines gene symbols; compilers may have symbol table – in C, variable and function names are all symbols, but keywords (e.g. for/break/continue) are not.
  • A name is just a string, which is often full-text indexed for search. HGNC takes this way as well.

There is a case that since the ids are really private to the server, one should minimise their use in the interface.

In the SQL world, ids are often joined away. Users may completely ignore id values. GA4GH does not have a concept of "join". We have to expose ids in almost every API. This complicates the meaning of id.

We haven’t thought much about scope in GA4GH

There was a discussion before. I believe some symbols should be unique within a certain scope, but others think this adds unnecessary constraints.

@diekhans
Copy link
Contributor

diekhans commented May 3, 2016

The analogy to foreign key is perfect and it's intention is to
be used in that way. One can jump from object to object very
easily with id, but it's two steps if you come in with a `name'.
Very analogous to join.

IMHO it would have been clearer to have symbol' rather than name' and then `description'. The id vs name has proved ver

I believe this PR should be to add referenceId to referenceName
rather than replace it.

Heng Li [email protected] writes:

In my view

• id is like a foreign key in SQL, which can be auto-incremental and is not
long-term stable.
• We talked about symbol before. It still deserves its place. A symbol
usually does not contain spaces or special characters. We may require it to
be unique within a scope. For example, HGNC defines gene symbols; compilers
may have symbol table – in C, variable and function names are all symbols,
but keywords (e.g. for/break/continue) are not.
• A name is just a string, which is often full-text indexed for search. HGNC
takes this way as well.

There is a case that since the ids are really private to the server, one
should minimise their use in the interface.

In the SQL world, ids are often joined away. Users may completely ignore id
values. GA4GH does not have a concept of "join". We have to expose ids in
almost every API. This complicates the meaning of id.

We haven’t thought much about scope in GA4GH

There was a discussion before. I believe some symbols should be unique within a
certain scope, but others think this adds unnecessary constraints.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub*

@david4096 david4096 force-pushed the refId branch 2 times, most recently from 53e9580 to 1e17aa8 Compare May 19, 2016 00:09
Add back in reference name to methods
Add referenceName to reads
@david4096
Copy link
Member Author

I've updated this PR to include both referenceName and referenceId in the search methods. I also added referenceName to SearchReadsRequest, as it is a lone case of accepting referenceId, and not referenceName.

@david4096
Copy link
Member Author

After reviewing this, I believe the best pattern is to remove the need for server specific IDs in favor of using reference name everywhere.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants