opt-in integration with taxize to generate species coverage over all taxonomic ranks #198

Closed
cboettig opened this issue Feb 28, 2017 · 8 comments

Comments

@cboettig
Member

Metadata-based search queries in platforms like MetaCat are most useful if all EML files include full taxonomic rank classification (kingdom, phylum etc), even though the full classification isn't required for taxonomic coverage. This could be automatically computed from the species name (or other rank) given using queries from the R package taxize.

@amoeba
Collaborator

amoeba commented Jun 7, 2017

Just had a conversation with some data managers and I found you already had an issue for this.

IMO best practice for taxonomic coverage is to serialize only the most specific taxonomic rank (e.g., species) you can and avoid "materializing" up the taxonomic ranks (e.g., family and higher). This is because taxonomic position may change over time. Is that a niche opinion?

In Metacat, we attach a Solr index which is the ideal place to materialize the "full" taxonomy for better search. I'm not sure if we do that but it's a more suitable place.

However, the EML documentation itself recommends materializing the full tree:

> Information about the range of taxa addressed in the data set or collection. It is recommended that one provide information starting from the taxonomic rank of kingdom, to a level which reflects the data set or collection being documented. The levels of Kingdom, Division/Phylum, Class, Order, Family, Genus, and Species should be included as ranks as appropriate. Because the taxonomic ranks are hierarchical, the Taxonomic Classification field is self-referencing to allow for an arbitrary depth of rank, down to species.

All that said, I can imagine people will want this so I think this is 👍 !

taxize could be used for error-checking and materializing.

From an API design perspective, does adding a logical expand argument to control the behavior seem like a good way to add this?
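A minimal sketch of what such an opt-in flag could look like, written in Python purely for illustration. The function name `taxonomic_coverage`, the `lookup` parameter, and the toy classification table are all assumptions made for this example; a real implementation would call taxize against an authority such as ITIS rather than a hard-coded table. The sketch also assumes every input name is a species-rank name.

```python
from typing import Callable, Dict, List, Tuple

Rank = Tuple[str, str]  # (rank name, taxon name)

# Toy stand-in for a remote taxonomic lookup (hypothetical data source).
TOY_CLASSIFICATION: Dict[str, List[Rank]] = {
    "Homo sapiens": [
        ("kingdom", "Animalia"),
        ("phylum", "Chordata"),
        ("class", "Mammalia"),
        ("order", "Primates"),
        ("family", "Hominidae"),
        ("genus", "Homo"),
        ("species", "Homo sapiens"),
    ],
}


def toy_lookup(name: str) -> List[Rank]:
    """Return the full classification for a name, or [] if unknown."""
    return TOY_CLASSIFICATION.get(name, [])


def taxonomic_coverage(
    sci_names: List[str],
    expand: bool = False,
    lookup: Callable[[str], List[Rank]] = toy_lookup,
) -> List[List[Rank]]:
    """Build one rank list per name; only expand when the caller opts in."""
    coverage = []
    for name in sci_names:
        ranks = lookup(name) if expand else []
        if not ranks:
            # Default (or lookup failure): record only the name as given.
            ranks = [("species", name)]
        coverage.append(ranks)
    return coverage
```

With `expand=False` (the default) only the most specific rank is serialized, which matches the "don't materialize" preference above; `expand=True` fills in the higher ranks for users who want them.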

@cboettig
Member Author

cboettig commented Jun 7, 2017

cc'ing @sckott since he has more experience in some of the taxonomic issues.

Right, my understanding is that using a taxonomic identifier would be ideal, since it ties the taxonomic name to a particular reference authority (at a particular point in time(?)), reflecting the fact that authorities differ and taxonomic names (even including the species level) are subject to continual, um, evolution. E.g., in an ideal system, we should be able to handle the case in which, at the time of my study, the organism in question is species/group A, which is at a later date broken into new groups A and B, with B potentially being assigned to a different parent/higher-level group as well. A proper system would use different identifiers for species A before and after the split, and a different identifier for B, allowing researchers to at least identify that a record created before the split could be referring to an organism that was a member of either the new species/group A or species B.
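The split scenario can be sketched as a small data structure, shown here in Python for illustration only. Everything in it is hypothetical: the `TaxonConcept` class, the `ex:` identifiers, the `ExampleDB` authority, and the `SUCCESSORS` map are invented for this example and do not correspond to any real authority's records or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaxonConcept:
    identifier: str  # hypothetical identifier issued by the authority
    authority: str   # hypothetical authority name
    name: str        # label for the concept


# Species A before the split, and the two concepts it was split into.
CONCEPTS: Dict[str, TaxonConcept] = {
    "ex:A-old": TaxonConcept("ex:A-old", "ExampleDB", "A (before split)"),
    "ex:A-new": TaxonConcept("ex:A-new", "ExampleDB", "A (after split)"),
    "ex:B": TaxonConcept("ex:B", "ExampleDB", "B"),
}

# Old concept identifier -> identifiers of the concepts that replaced it.
SUCCESSORS: Dict[str, List[str]] = {"ex:A-old": ["ex:A-new", "ex:B"]}


def possible_current_concepts(identifier: str) -> List[TaxonConcept]:
    """Follow recorded splits to every concept a record might now refer to."""
    children = SUCCESSORS.get(identifier)
    if not children:
        # No recorded split: the identifier still stands for itself.
        return [CONCEPTS[identifier]]
    result: List[TaxonConcept] = []
    for child in children:
        result.extend(possible_current_concepts(child))
    return result
```

A record tagged with the pre-split identifier resolves to both successor concepts, which is the "could be either new A or B" ambiguity described above; a record tagged after the split resolves to exactly one.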

Like you say, the choice here also depends on what assumptions we can make about the consumer of the metadata, e.g. if MetaCat UI is expanding species names to the full taxonomy, then that's preferable (e.g. automated, consistent, and easier to update) to doing it at the user / EML level. However, that doesn't seem to be the case to me -- I don't think I can get all the data about "Birds" by setting Class: Aves in KNB/DataONE web UI, so I've tended to assume that's because most of the EML files didn't provide any higher-level grouping. (Maybe I'm wrong about that or maybe the Solr-based queries get around this?)
Anyway, I think from a data-discovery standpoint, being able to search for data matching higher-level taxonomy is an obvious utility, despite the loose and changing nature of rank-based classification.

@amoeba
Collaborator

amoeba commented Jun 7, 2017

> (Maybe I'm wrong about that or maybe the Solr-based queries get around this?)

I think you're right -- we aren't doing this. I've just PR'd a quick work-up of auto expansion. I've done only a minimal amount of testing. #216

@sckott

sckott commented Jun 7, 2017

> my understanding is that using a taxonomic identifier would be ideal, since it references the taxonomic name to a particular reference authority (at a particular point in time(?)), reflecting the fact that authorities differ and taxonomic names (even including the species level)

Right, i'd think including identifiers (e.g., 123456) and what database they are from (e.g., COL or ITIS) are pretty impt. pieces of metadata.

wrt whether to include complete hierarchy or not - yes, agree that higher taxonomy can change -

@amoeba what do you mean by "error checking" in

> taxize could be used for error-checking and materializing.

@amoeba
Collaborator

amoeba commented Jun 7, 2017

Just typos. It looks like taxize or ITIS can already help the user find what they meant if they've made a typo. I could see us having a check_taxonomy function which might query ITIS with what the user has entered and, if there was a typo, offer suggestions (as taxize does) or fix it.
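A rough sketch of that kind of typo check, in Python for illustration. It matches against a small local list using the standard library's `difflib.get_close_matches` instead of querying ITIS, so the `KNOWN_NAMES` list, the `cutoff` value, and the shape of the returned dictionary are all assumptions of this example, not how taxize or ITIS behave.

```python
import difflib
from typing import Dict, List, Union

# Stand-in for an authority's name list (a real check would query a service).
KNOWN_NAMES: List[str] = ["Homo sapiens", "Canis lupus", "Quercus alba"]


def check_taxonomy(
    name: str,
    known_names: List[str] = KNOWN_NAMES,
    cutoff: float = 0.8,
) -> Dict[str, Union[str, List[str]]]:
    """Report whether a name matches exactly, nearly, or not at all."""
    if name in known_names:
        return {"input": name, "status": "exact", "suggestions": []}
    # Fuzzy match: keep up to three candidates scoring at least `cutoff`.
    suggestions = difflib.get_close_matches(name, known_names, n=3, cutoff=cutoff)
    status = "suggested" if suggestions else "unmatched"
    return {"input": name, "status": status, "suggestions": suggestions}
```

The function only offers suggestions and leaves the decision to the user; silently replacing the entered name would be a separate design choice. Keeping the original input in the result also leaves a small record of what was actually typed.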

@sckott

sckott commented Jun 7, 2017

Right, - note though that each data source has a different backend setup, so results can vary :(

@cboettig
Member Author

cboettig commented Jun 7, 2017

@amoeba So technically one could also capture the full provenance of a query used to find a "closest match" (e.g. check for a typo in a species name), which could in principle be useful in diagnosing when and why a particular taxonomic identifier was chosen. Just for fun, see the discussion we had about this on the NeXML side: ropensci/RNeXML#24 (comment).

In practice, some simpler check to make sure the species names don't have typos (e.g. the name either matches the authority-provided name directly or perhaps via closestMatch) would no doubt be helpful for users in catching common typos when writing out Latin names...

@cboettig
Member Author

this is now an opt-in for set_taxonomicCoverage


3 participants