opt-in integration with `taxize` to generate species coverage over all taxonomic ranks #198

cboettig · 2017-02-28T20:38:45Z

Metadata-based search queries in platforms like MetaCat are most useful if all EML files include the full taxonomic rank classification (kingdom, phylum, etc.), even though the full classification isn't required for taxonomic coverage. This could be computed automatically from the given species name (or other rank) by querying with the R package taxize.

amoeba · 2017-06-07T18:45:36Z

Just had a conversation with some data managers and I found you already had an issue for this.

IMO best practice for taxonomic coverage is to serialize only the most specific taxonomic rank (e.g., species) you can and avoid "materializing" up the taxonomic ranks (e.g., family and higher). This is because taxonomic position may change over time. Is that a niche opinion?

In Metacat, we attach a Solr index which is the ideal place to materialize the "full" taxonomy for better search. I'm not sure if we do that but it's a more suitable place.

However, the EML documentation itself recommends materializing the full tree:

> Information about the range of taxa addressed in the data set or collection. It is recommended that one provide information starting from the taxonomic rank of kingdom, to a level which reflects the data set or collection being documented. The levels of Kingdom, Division/Phylum, Class, Order, Family, Genus, and Species should be included as ranks as appropriate. Because the taxonomic ranks are hierarchical, the Taxonomic Classification field is self-referencing to allow for an arbitrary depth of rank, down to species.

All that said, I can imagine people will want this so I think this is 👍 !

taxize could be used for error-checking and materializing.

From an API design perspective, does adding a logical `expand` argument to control the behavior seem like a good way to add this?
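
As a rough illustration of the opt-in idea, here is a minimal sketch, written in Python rather than R so that it is self-contained and needs no network access. The function name `taxonomic_coverage`, the `CLASSIFICATION` table, and the return shape are all assumptions made for this example; the table stands in for a remote lookup such as a taxize query and is not the package's actual API.

```python
# Toy stand-in for a remote taxonomy lookup (hypothetical; hard-coded for illustration).
CLASSIFICATION = {
    "Homo sapiens": [
        ("kingdom", "Animalia"),
        ("phylum", "Chordata"),
        ("class", "Mammalia"),
        ("order", "Primates"),
        ("family", "Hominidae"),
        ("genus", "Homo"),
        ("species", "Homo sapiens"),
    ],
}


def taxonomic_coverage(names, expand=False):
    """Return one list of (rank, value) pairs per supplied name.

    With expand=False (the default) only the supplied name is kept.
    With expand=True the full hierarchy is materialized when the lookup
    knows the name; unknown names fall back to the unexpanded form.
    """
    coverage = []
    for name in names:
        if expand and name in CLASSIFICATION:
            coverage.append(list(CLASSIFICATION[name]))
        else:
            # Assumes the supplied name is a species-level name.
            coverage.append([("species", name)])
    return coverage
```

Keeping `expand=False` as the default leaves existing behavior unchanged, so anyone who prefers to serialize only the most specific rank is unaffected.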

cboettig · 2017-06-07T19:29:25Z

cc'ing @sckott since he has more experience in some of the taxonomic issues.

Right, my understanding is that using a taxonomic identifier would be ideal, since it ties the taxonomic name to a particular reference authority (at a particular point in time(?)), reflecting the fact that authorities differ and taxonomic names (even at the species level) are subject to continual, um, evolution. E.g. an ideal system should be able to handle the case in which, at the time of my study, the organism in question is species/group A, which is later broken into new groups A and B, with B potentially being assigned to a different parent/higher-level group as well. A proper system would use different identifiers for species A before and after the split, and a different identifier for B, allowing researchers to at least identify that a record created before the split could be referring to an organism that was a member of either the new species/group A or species B.
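
The split scenario above can be sketched as follows. This is a hypothetical Python illustration only: the `TaxonRef` class, the `EXAMPLE-DB` authority, and every identifier value are invented for the example and do not correspond to any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonRef:
    """A name pinned to an authority and an authority-specific identifier."""
    name: str
    authority: str   # placeholder; a real record would name e.g. ITIS or COL
    identifier: str  # invented values, not real IDs


# Hypothetical split: the old concept of A is later divided into new A and B.
OLD_A = TaxonRef("Species A", "EXAMPLE-DB", "id-A-v1")
NEW_A = TaxonRef("Species A", "EXAMPLE-DB", "id-A-v2")
NEW_B = TaxonRef("Species B", "EXAMPLE-DB", "id-B-v1")

# Maps a superseded concept to the concepts that replaced it.
SUCCESSORS = {OLD_A: [NEW_A, NEW_B]}


def possible_current_taxa(ref):
    """Return the current concepts a record tagged with `ref` could refer to."""
    return SUCCESSORS.get(ref, [ref])
```

A record tagged with `OLD_A` resolves to both `NEW_A` and `NEW_B`, which is exactly the ambiguity a bare name string would hide.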

Like you say, the choice here also depends on what assumptions we can make about the consumer of the metadata, e.g. if the MetaCat UI is expanding species names to the full taxonomy, then that's preferable (e.g. automated, consistent, and easier to update) to doing it at the user / EML level. However, that doesn't seem to be the case -- I don't think I can get all the data about "Birds" by setting Class: Aves in the KNB/DataONE web UI, so I've tended to assume that's because most of the EML files didn't provide any higher-level grouping. (Maybe I'm wrong about that or maybe the Solr-based queries get around this?)
Anyway, I think from a data-discovery standpoint, being able to search for data matching higher-level taxonomy is an obvious utility, despite the loose and changing nature of rank-based classification.

amoeba · 2017-06-07T19:36:39Z

> (Maybe I'm wrong about that or maybe the Solr-based queries get around this?)

I think you're right -- we aren't doing this. I've just PR'd a quick work-up of auto expansion. I've done only a minimal amount of testing. #216

sckott · 2017-06-07T19:53:29Z

> my understanding is that using a taxonomic identifier would be ideal, since it references the taxonomic name to a particular reference authority (at a particular point in time(?)), reflecting the fact that authorities differ and taxonomic names (even including the species level)

Right, I'd think including identifiers (e.g., 123456) and what database they are from (e.g., COL or ITIS) are pretty important pieces of metadata.

wrt whether to include the complete hierarchy or not - yes, agree that higher taxonomy can change

@amoeba what do you mean by "error checking" in

> taxize could be used for error-checking and materializing.

amoeba · 2017-06-07T19:57:58Z

Just typos. It looks like taxize or ITIS can already help the user find what they meant if they've made a typo. I could see us having a check_taxonomy function which might query ITIS with what the user has entered and, if there was a typo, offer suggestions (as taxize does) or fix it.
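
A minimal sketch of such a check, in Python and with no network access: the `KNOWN_NAMES` list is a hypothetical stand-in for an authority's name list (the proposal above would query ITIS instead), and the return shape and `cutoff` default are assumptions for illustration. Fuzzy matching uses the standard library's `difflib.get_close_matches`, which is not how taxize or ITIS match names.

```python
import difflib

# Hypothetical stand-in for the names an authority would return.
KNOWN_NAMES = ["Homo sapiens", "Macrocystis pyrifera", "Oncorhynchus mykiss"]


def check_taxonomy(name, known=KNOWN_NAMES, cutoff=0.8):
    """Classify a user-entered name as exact, suggested, or unmatched."""
    if name in known:
        return {"name": name, "status": "exact", "suggestions": []}
    # Rank candidates by string similarity; keep at most three above the cutoff.
    suggestions = difflib.get_close_matches(name, known, n=3, cutoff=cutoff)
    status = "suggested" if suggestions else "unmatched"
    return {"name": name, "status": status, "suggestions": suggestions}
```

Returning suggestions rather than silently rewriting the name leaves the final choice with the user, which matters because different data sources can return different matches.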

sckott · 2017-06-07T20:04:34Z

Right. Note, though, that each data source has a different backend setup, so results can vary :(

cboettig · 2017-06-07T20:10:07Z

@amoeba So technically one could also capture the full provenance of a query used to find a "closest match" (e.g. to check for a typo in a species name), which could in principle be useful in diagnosing when and why a particular taxonomic identifier was chosen. Just for fun, see the discussion we had about this on the NeXML side: ropensci/RNeXML#24 (comment).

In practice, some simpler check to make sure the species names don't have typos (e.g. that each either matches the authority-provided name directly or perhaps a closestMatch) would no doubt be helpful for users in catching common typos in writing out Latin names...

cboettig · 2018-11-27T20:17:56Z

this is now an opt-in for set_taxonomicCoverage

cboettig modified the milestone: v1.3: Expanded support for serialization use cases Feb 28, 2017

cboettig closed this as completed Nov 27, 2018
