+1, this is a big inconsistency in the API. This is related to: #593 David Steinberg [email protected] writes:
Within a referenceSet the referenceNames need to be unique, of course - this was explicit in an early version of the schema. The same reference may be in different referenceSets, and may (unfortunately) have different names in those different sets. This is just the truth - it is called If you want to use ids then you must be able to query by name to find the id, so that when I look for 1:214217-412291 I can find the referenceId I am concerned that the core design, over which there was a lot of careful thought by people with a lot of experience of representing genetic data, is being Richard
The Wellcome Trust Sanger Institute is operated by Genome Research |
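A minimal sketch of the name-to-id lookup described in the comment above, assuming a reference set is simply a list of (id, name) records. The `Reference` class, the helper functions, and the id values are hypothetical and are not part of the GA4GH schemas; the sketch only illustrates that a region such as 1:214217-412291 can be resolved to a referenceId when names are unique within a set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    id: str    # opaque, server-specific handle (hypothetical values)
    name: str  # externally meaningful name, unique within one reference set


def build_name_index(references):
    """Map name -> id for one reference set, rejecting duplicate names."""
    index = {}
    for ref in references:
        if ref.name in index:
            raise ValueError(f"duplicate reference name in set: {ref.name!r}")
        index[ref.name] = ref.id
    return index


def parse_region(region):
    """Split a 'name:start-end' string into (name, start, end)."""
    name, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return name, int(start), int(end)


def resolve_region(region, name_index):
    """Turn a name-based region into an id-based one for a single set."""
    name, start, end = parse_region(region)
    return name_index[name], start, end


# Hypothetical reference set with two references.
index = build_name_index([Reference("ref-a1", "1"), Reference("ref-a2", "2")])
resolved = resolve_region("1:214217-412291", index)
```

The uniqueness check in `build_name_index` is what makes the lookup well defined: without it, a name could map to more than one id within the same set.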
Changing that to a -1, sorry. When I first looked at it, I thought it added reference id; now I see it replaces reference name with id in searches. This does not solve the problem of multiple names, because you We need to address the inconsistencies with the API's use of names vs ids as a whole.
There are two issues here, somewhat related:
The API is inconsistent, sometimes objects are linked by name, We need to do a major architectural review of the API. It is Richard Durbin [email protected] writes:
@diekhans So maybe someone could provide a list of occurrences of both, linking by "id" and linking by "name"? But anyway, should be "id" (and is documented like that in the metadata docs ...). (I personally find the "search by name/alias, link by id" not so weird ...)
From my perspective "id" should be treated like an opaque handle, specific to the particular server you are talking to, Whereas "name" is something with external meaning, so can be used as an entry point for search, for display back to There is a case that since the ids are really private to the server, one should minimise their use in the interface. Once we get into this topic of names being good identifiers we get concerned with scope. Think about variable names in Richard
@richarddurbin Yes, agreement here. Implementation-wise, I am all for a general solution where:
In principle, for the genome editions one could enforce/request a specific value space, to which queries etc. could be converted. But this would have to be maintained through GA4GH, and wouldn't be feasible with regard to non-human genome space. And it wouldn't work as a general data structure, for different types of records. So, IMO id + name + aliases. And
In my view
In the SQL world, ids are often joined away. Users may completely ignore id values. GA4GH does not have a concept of "join". We have to expose ids in almost every API. This complicates the meaning of id.
There was a discussion before. I believe some symbols should be unique within a certain scope, but others think this adds unnecessary constraints.
The analogy to foreign key is perfect and its intention is to IMHO it would have been clearer to have I believe this PR should be to add referenceId to referenceName Heng Li [email protected] writes:
53e9580 to 1e17aa8
Add back in reference name to methods
Add referenceName to reads
I've updated this PR to include both referenceName and referenceId in the search methods. I also added
After reviewing this, I believe the best pattern is to remove the need for server specific IDs in favor of using reference name everywhere.
When performing searches on the Variants or Features endpoints, the referenceName field allows the client to define which contig to search on. This is problematic because reference names are not required to be unique, and it is not immediately clear how to form the "referenceName" field. Is it "1" or "chr1"? This is further complicated by the fact that references can belong to multiple Reference Sets.

By using the referenceId in its place we ensure that each request is against a specific reference and we get better guarantees about the relationships in the data. The Reads API properly makes this distinction: https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/readmethods.avdl#L26
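To make the ambiguity concrete, here is a small sketch, assuming a server that holds references from several reference sets. The `StoredReference` record, the ids, and the set names are all hypothetical and do not come from the GA4GH schemas; the point is only that a bare name can match several references while an id matches exactly one.

```python
from typing import List, NamedTuple


class StoredReference(NamedTuple):
    id: str                # server-assigned identifier (hypothetical)
    name: str              # name as given within its reference set
    reference_set_id: str  # the set this reference belongs to


# The same chromosome may appear as "1" or "chr1" depending on the set.
REFERENCES = [
    StoredReference("r1", "1", "setA"),
    StoredReference("r2", "chr1", "setB"),
    StoredReference("r3", "1", "setC"),
]


def find_by_name(name: str) -> List[StoredReference]:
    """Name-only lookup: may return zero, one, or many references."""
    return [r for r in REFERENCES if r.name == name]


def find_by_id(ref_id: str) -> StoredReference:
    """Id lookup: resolves to exactly one reference or fails."""
    matches = [r for r in REFERENCES if r.id == ref_id]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one reference for id {ref_id!r}")
    return matches[0]


ambiguous = find_by_name("1")   # matches references in setA and setC
specific = find_by_id("r2")     # a single reference, named "chr1"
```

A name-based search becomes unambiguous again once it is scoped to one reference set, which is the case the earlier comments in this thread argue for.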