Generate MIAPA checklist-compliant nexml #46

Open
25 tasks
cboettig opened this issue Nov 30, 2013 · 1 comment
cboettig commented Nov 30, 2013

RNeXML should optionally be able to include all the basic metadata listed on the MIAPA checklist, ideally guiding users who are unfamiliar with the process and providing reasonable automated suggestions when possible (e.g. suggesting external identifiers based on OTU labels, #24). A function might also be provided to check (and perhaps summarize/return) MIAPA compliance.

I've reproduced the checklist below with notes added on how we're doing in RNeXML.

For each item, I've either made a note on if/how we handle it in NeXML, or a question when I'm unsure how to handle it. For instance, I can sometimes find a corresponding block in the example files in the miapa repo, but they are in OWL and the translation to NeXML's meta/RDFa isn't clear to me. An example nexml file that satisfies all these requirements would be super helpful to me.

Topology

  • The topology itself, possibly as an identifier of a database record (such as a TreeBASE record).
    Included in the nexml tree node.

  • Is this a gene tree or species tree? Do we use the treebase namespace to define this, or is there a better alternative?

    <meta content="Species Tree" datatype="xsd:string" id="meta24059" property="tb:kind.tree" xsi:type="nex:LiteralMeta"/>
    <meta content="21" datatype="xsd:integer" id="meta24062" property="tb:ntax.tree" xsi:type="nex:LiteralMeta"/>
    <meta content="Unrated" datatype="xsd:string" id="meta24061" property="tb:quality.tree" xsi:type="nex:LiteralMeta"/>
  • Is it a tree or a network?
    nexml defines this by using <tree> or <network>.

  • Is the topology rooted or not?
    In nexml, this is defined by an attribute root="true" on a member node. Should we consider declaring this in metadata too?

  • The type of consensus, if this is a consensus topology (one that summarizes the topology inference in some way, rather than being directly provided by the inference method)

    Do we use the treebase namespace for this as well? e.g.

    <meta content="Consensus" datatype="xsd:string" id="meta24060" property="tb:type.tree" xsi:type="nex:LiteralMeta"/>
  • The topology should be "well described", as applicable to the inference method being used. For example, a likelihood for maximum likelihood analysis. For Bayesian analyses this should also include the burn-in period excluded, and the convergence of the chain(s). This may also include more than one topology, for example a sample from the posterior probability distribution for Bayesian analyses, or equally scoring topologies for a maximum parsimony analysis.
    Examples?
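
As a rough illustration of how the topology items above could be read back out of a NeXML file, here is a minimal sketch. It is written in Python purely for illustration (RNeXML itself is an R package), and the function name `summarize_topologies`, the example document, and the `tb` namespace URI in it are assumptions, not part of RNeXML or the MIAPA draft. It relies only on the conventions noted above: `<tree>` vs `<network>`, `root="true"` on a node, and the `tb:kind.tree` / `tb:type.tree` meta properties.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"  # NeXML namespace

# Hypothetical minimal document, for illustration only.
EXAMPLE = """<nex:nexml xmlns="http://www.nexml.org/2009"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:tb="http://purl.org/phylo/treebase/2.0/terms#" version="0.9">
  <trees id="trees1" otus="otus1">
    <tree id="tree1" xsi:type="nex:FloatTree">
      <meta content="Species Tree" datatype="xsd:string" id="m1"
            property="tb:kind.tree" xsi:type="nex:LiteralMeta"/>
      <meta content="Consensus" datatype="xsd:string" id="m2"
            property="tb:type.tree" xsi:type="nex:LiteralMeta"/>
      <node id="n1" root="true"/>
      <node id="n2" otu="t1"/>
      <node id="n3" otu="t2"/>
      <edge id="e1" source="n1" target="n2" length="0.1"/>
      <edge id="e2" source="n1" target="n3" length="0.2"/>
    </tree>
  </trees>
</nex:nexml>"""


def summarize_topologies(nexml_text):
    """Return one summary dict per <tree> or <network> element."""
    root = ET.fromstring(nexml_text)
    summaries = []
    for kind in ("tree", "network"):
        for elem in root.iter(f"{{{NEX}}}{kind}"):
            # Collect literal meta annotations attached directly to the topology.
            metas = {
                m.get("property"): m.get("content")
                for m in elem.findall(f"{{{NEX}}}meta")
                if m.get("property")
            }
            # Rooted if any member node carries root="true".
            rooted = any(
                n.get("root") == "true" for n in elem.findall(f"{{{NEX}}}node")
            )
            summaries.append({
                "id": elem.get("id"),
                "element": kind,
                "rooted": rooted,
                "tree_kind": metas.get("tb:kind.tree"),
                "consensus_type": metas.get("tb:type.tree"),
            })
    return summaries


summary = summarize_topologies(EXAMPLE)
```

A `None` in `tree_kind` or `consensus_type` would flag a checklist item that has no annotation yet, which is roughly what a compliance check would need to report.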

OTUs:

All terminal nodes should be appropriately labelled and referenced in one of the following ways. Internal nodes need not be.

  • A meaningful external identifier (a combination of database or resource and identifier/accession within that database).
    We generate these with taxize; see "add TSNs from species names using taxize" (#24).
  • For specimens, museum, collection (if applicable), and specimen identifier. Alternatively, if a specimen is not in a museum collection, use the laboratory, laboratory collection, and accession within that collection.
  • Precise (GPS) georeferences for specimens are highly desirable (but not always available).
  • Branch lengths: Some measure of branch length is required unless it is not applicable to the analysis method. Further semantics of the measure should be implied by the tree inference method.
    The length attribute in nexml is sufficient.
  • Branch support: Some value of branch support should be provided, for example posterior probability or bootstrap value, unless it is not applicable to the analysis method.
    A meta annotation on the edge element. Example?
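
The branch length and branch support items above lend themselves to a simple automated check. The following is a minimal sketch in Python (for illustration only; RNeXML is R), with hypothetical function names and a hypothetical `ex:support` property, since no property name for branch support has been settled here. It only lists edges lacking a `length` attribute and edges that carry any `meta` child at all.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"  # NeXML namespace

# Hypothetical tree fragment; "ex:support" is a made-up property name.
EXAMPLE_TREE = """<tree xmlns="http://www.nexml.org/2009" id="tree1">
  <node id="n1" root="true"/>
  <node id="n2"/>
  <node id="n3"/>
  <edge id="e1" source="n1" target="n2" length="0.1">
    <meta id="m1" property="ex:support" content="0.95"/>
  </edge>
  <edge id="e2" source="n1" target="n3"/>
</tree>"""


def edges_missing_length(tree_xml):
    """Ids of edges that have no length attribute."""
    tree = ET.fromstring(tree_xml)
    return [
        e.get("id")
        for e in tree.iter(f"{{{NEX}}}edge")
        if e.get("length") is None
    ]


def edges_with_meta(tree_xml):
    """Ids of edges carrying at least one meta annotation (candidate support values)."""
    tree = ET.fromstring(tree_xml)
    return [
        e.get("id")
        for e in tree.iter(f"{{{NEX}}}edge")
        if e.find(f"{{{NEX}}}meta") is not None
    ]


missing = edges_missing_length(EXAMPLE_TREE)
annotated = edges_with_meta(EXAMPLE_TREE)
```

Whether an edge meta actually encodes branch support would still depend on agreeing on a property term, which this sketch deliberately leaves open.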

Character matrix:

I note that this description refers entirely to the character matrix as the data from which the tree was derived. It appears that the MIAPA standard doesn't cover comparative trait data. Further, it may not always be desirable to include a copy of the character matrix in the data file; perhaps a pointer to where that alignment can be found in a separate file might suffice?

  • aligned data matrix that is the basis for the tree (by having been the input for the tree inference method)

MIAPA shows an example of how to state that the tree wasDerivedFrom the alignment; I'm not sure what the corresponding RDFa in the nexml would look like.

    <owl:NamedIndividual rdf:about="&Peters2011hymenoptera;tree0000001">
        <rdf:type rdf:resource="&obo;CDAO_0000012"/>
        <rdf:type rdf:resource="&obo;CDAO_0000073"/>
        <prov:wasGeneratedBy rdf:resource="&annot;InferenceOfPetersTree"/>
        <prov:wasDerivedFrom rdf:resource="&annot;PetersAlignment"/>
    </owl:NamedIndividual>
  • Data type must be provided, for example DNA, RNA, protein, morphology, etc.
    For molecular matrices, the accession numbers (and respective database(s) if different from Genbank) of the sequences used for each row must be provided.
  • a mapping that relates each row identifier to a tip of the topology
    The otu attribute is present on each row.
  • a mapping that relates each accession number or specimen identifier to a row label
    The inverse of the above map.
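
The row-to-tip mapping and its inverse can be recovered from the `otu` attribute on each `row`. The sketch below is in Python purely for illustration (RNeXML is R); the function name `row_otu_maps` and the example matrix are hypothetical. Note that it only inverts the row/OTU relation; mapping accession numbers or specimen identifiers to rows would additionally need those identifiers to be annotated somewhere, which is not assumed here.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"  # NeXML namespace

# Hypothetical characters block, for illustration only.
EXAMPLE_MATRIX = """<characters xmlns="http://www.nexml.org/2009" id="chars1" otus="otus1">
  <matrix>
    <row id="row1" otu="t1"><seq>ACGT</seq></row>
    <row id="row2" otu="t2"><seq>ACGA</seq></row>
    <row id="row3" otu="t1"><seq>ACGG</seq></row>
  </matrix>
</characters>"""


def row_otu_maps(characters_xml):
    """Return (row_to_otu, otu_to_rows) built from the otu attribute on rows."""
    chars = ET.fromstring(characters_xml)
    row_to_otu = {}
    otu_to_rows = {}
    for row in chars.iter(f"{{{NEX}}}row"):
        otu = row.get("otu")
        if otu is None:
            continue  # a row without an otu reference cannot be mapped to a tip
        row_to_otu[row.get("id")] = otu
        # Several rows may point at the same OTU, so the inverse maps to a list.
        otu_to_rows.setdefault(otu, []).append(row.get("id"))
    return row_to_otu, otu_to_rows


row_to_otu, otu_to_rows = row_otu_maps(EXAMPLE_MATRIX)
```

Because more than one row can reference the same OTU, the inverse is kept as a list per OTU rather than a one-to-one map.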

Alignment method

  • name of software used, version of program

MIAPA defines that the alignment wasGeneratedBy some software.

    <owl:NamedIndividual rdf:about="&annot;PetersMUSCLEAlignmentActivity">
        <rdf:type rdf:resource="&edamontology;operation_2928"/>
        <rdf:type rdf:resource="&obo;MIAPA_0000003"/>
        <prov:wasAssociatedWith rdf:resource="&annot;Muscle"/>
        <prov:used rdf:resource="&obo;MIAPA_0000013"/>
    </owl:NamedIndividual>
  • parameters used (or default if default values were used).
  • whether alignment was manually corrected or edited
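
One possible way to carry provenance links such as wasGeneratedBy or wasDerivedFrom into NeXML is a `meta` element of type `nex:ResourceMeta`, which uses `rel` and `href` in place of the `property` and `content` of `nex:LiteralMeta`. The sketch below (Python for illustration only; RNeXML is R) builds such an element. The helper name `resource_meta`, the id, and the `href` target are hypothetical, and whether this is the translation the MIAPA authors intend is an open question; the `prov` prefix would also need to be declared on the document root.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Ask ElementTree to serialize with the conventional prefixes.
ET.register_namespace("nex", NEX)
ET.register_namespace("xsi", XSI)


def resource_meta(meta_id, rel, href):
    """Build a NeXML meta element that links to another resource via rel/href."""
    meta = ET.Element(f"{{{NEX}}}meta")
    meta.set(f"{{{XSI}}}type", "nex:ResourceMeta")
    meta.set("id", meta_id)
    meta.set("rel", rel)    # CURIE naming the relation, e.g. a PROV term
    meta.set("href", href)  # target resource; hypothetical here
    return meta


# Hypothetical example: link an alignment to the activity that produced it.
example_meta = resource_meta(
    "meta1", "prov:wasGeneratedBy", "#PetersMUSCLEAlignmentActivity"
)
serialized = ET.tostring(example_meta, encoding="unicode")
```

The resulting element would be attached as a child of the annotated element (for instance the `characters` block for the alignment, or the `tree` for wasDerivedFrom).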

Character trait data

This is not part of the draft MIAPA standard, but merely my own brainstormed list of suggestions, based on the metadata required for an EML description of character traits.

  • character trait name (Or trait label/definition pair)
  • possible states a discrete trait can have
  • units (for continuous traits)
  • methodological description of how the trait was measured

Tree inference method

  • name of software used, version of program
    <owl:NamedIndividual rdf:about="&annot;RaXML_7.2.8">
        <rdf:type rdf:resource="&obo;MIAPA_0000016"/>
        <rdfs:label>RAxML_7.2.8</rdfs:label>
        <swo2:SWO_0000740 rdf:resource="&annot;UseMaximumLikelihood"/>
        <swo:SWO_0004000 rdf:resource="&obo;MIAPA_0000017"/>
    </owl:NamedIndividual>
  • parameters used, including model of evolution, and optimality criterion
    <owl:NamedIndividual rdf:about="&annot;UseMaximumLikelihood">
        <rdf:type rdf:resource="&obo;MIAPA_0000015"/>
        <rdfs:label>Maximum Likelihood algorithm</rdfs:label>
        <dc:description>The inference algorithm uses maximum likelihood as an optimality criterion. </dc:description>
    </owl:NamedIndividual>
  • character weights, if characters (normally morphological ones) were weighted.