Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Reads Info tag is poor representation of arrays and ints #445

Closed
mp15 opened this issue Oct 22, 2015 · 6 comments · Fixed by #700
Closed

Reads Info tag is poor representation of arrays and ints #445

mp15 opened this issue Oct 22, 2015 · 6 comments · Fixed by #700

Comments

@mp15
Copy link

mp15 commented Oct 22, 2015

The info tag is string only at the moment, unfortunately this makes for a rather poor representation of arrays and integers. Given that this tag appears to be intended to represent BAM/SAM/CRAM info tags it might be wise to make this a union or similar?

https://github.com/ga4gh/schemas/blob/d2b3380992d920150c6d40f884c48cda0d50f321/src/main/resources/avro/reads.avdl#L280

@mbaudis
Copy link
Member

mbaudis commented Oct 22, 2015

Not really a string.
In my reading map<array<string>> info = {}; represents a map where the values are arrays with string array elements.

I'm not sure about empty or not as option; IMHO this should read map<array<union{ null, string }>> info to allow empty arrays; but this may be covered through the = {} declaration of a default empty map? Who could elaborate, please? This declaration is all around, and I would like to be sure of its validity for empty constructs (since info should be optional).

@calbach
Copy link
Contributor

calbach commented Oct 22, 2015

The current definition is intended to be a map of string -> []string (I guess the string key type is implied in Avro?), e.g. the following JSON:

info: {
  "foo": ["1", "2"],
  "bar": []
}

The string-only approach is definitely a bit inflexible. Possibly this could be improved along with or after the proposed move to proto.

@mp15
Copy link
Author

mp15 commented Oct 22, 2015

My problem with jamming it into a string is that the SAM equivalent carries type info with it. Without that info you can go SAM -> GA4GH but you can't then go back the other way without knowing what type the tag is supposed to be. For unregistered or lowercase tags this may not be possible.

If we make this a union of int, float, string and array I can look at the JSON and see "" I can imply it's a string, 6 suggests an integer, 6.0 a float and [] an array. Not sure how that would translate into Proto though.

@mbaudis
Copy link
Member

mbaudis commented Oct 23, 2015

@calbach Thanks for the empty array confirmation. And yes, map keys are strings per Avro spec.

@mp15
Copy link
Author

mp15 commented Nov 2, 2015

So would anyone accept the following:

  map< union{array< union{string, int, float}>, string, int, float, null }> info = {};

This doesn't really cover all the cases in http://samtools.github.io/hts-specs/SAMv1.pdf section 1.5 but it covers a lot more than we currently do.

@diekhans
Copy link
Contributor

diekhans commented Nov 2, 2015

This suggestions is in a good direction, I think.

Some kind of rich representation of attributes is important. We
should avoid requiring strings to be parsed within an attribute
value. This is a source of ambiguity and fragility in two dimensional
file formats. Having values as arrays is very useful avoid parsing strings.

Sequence annotations https://github.com/ga4gh/schemas/blob/rna/src/main/resources/avro/sequenceAnnotations.avdl
has:

record Attributes {
map<array<union {string, ExternalIdentifier, OntologyTerm}>> vals = {};
}

I am unsure how useful it is to have scalar values as well as
arrays, since a scalar can be a length one array.

It would be good to converge on single attribute structure for
all of ga4gh.

Martin Pollard [email protected] writes:

So would anyone accept the following:

map< union{array< union{string, int, float}>, string, int, float, null }> info = {};

This doesn't really cover all the cases in http://samtools.github.io/hts-specs/
SAMv1.pdf section 1.5 but it covers a lot more than we currently do.


Reply to this email directly or view it on GitHub.*

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants