opt-in integration with taxize to generate species coverage over all taxonomic ranks #198
Comments
Just had a conversation with some data managers and I found you already had an issue for this. IMO best practice for taxonomic coverage is to serialize only the most specific taxonomic rank (e.g., species) you can and avoid "materializing" up the taxonomic ranks (e.g., family and higher). This is because taxonomic position may change over time. Is that a niche opinion? In Metacat, we attach a Solr index which is the ideal place to materialize the "full" taxonomy for better search. I'm not sure if we do that but it's a more suitable place. However, the EML documentation itself recommends materializing the full tree:
All that said, I can imagine people will want this so I think this is 👍 !
From an API design perspective, does adding an |
cc'ing @sckott since he has more experience in some of the taxonomic issues. Right, my understanding is that using a taxonomic identifier would be ideal, since it references the taxonomic name to a particular reference authority (at a particular point in time(?)), reflecting the fact that authorities differ and taxonomic names (even including the species level) are subject to continual, um, evolution. e.g. in an ideal system, we should be able to handle the case in which at the time of my study, the organism in question is species/group A, which is at a later date broken into new groups A and B, with B potentially being assigned into a different parent/higher-level group as well. A proper system would use different identifiers for species A before and after the split, and a different identifier for B, allowing researchers to at least identify that a record created before the split could be referring to an organism that was either a member of new species/group A or of species B. Like you say, the choice here also depends on what assumptions we can make about the consumer of the metadata, e.g. if MetaCat UI is expanding species names to the full taxonomy, then that's preferable (e.g. automated, consistent, and easier to update) to doing it at the user / EML level. However, it doesn't seem the case to me -- I don't think I can get all the data about "Birds" by setting |
I think you're right -- we aren't doing this. I've just PR'd a quick work-up of auto expansion. I've done only a minimal amount of testing. #216 |
Right, I'd think including identifiers (e.g., 123456) and what database they are from (e.g., COL or ITIS) are pretty important pieces of metadata. wrt whether to include complete hierarchy or not - yes, agree that higher taxonomy can change - @amoeba what do you mean by "error checking" in
|
Just typos. It looks like taxize or ITIS can already help the user find what they meant if they've made |
Right, - note though that each data source has a different backend setup, so results can vary :( |
@amoeba So technically one could also capture the full provenance of a query used to find a "closest match" (e.g. check for a typo in a species name), which could in principle be useful in diagnosing when and why a particular taxonomic identifier was chosen. Just for fun, see the discussion we had about this on the NeXML side: ropensci/RNeXML#24 (comment). In practice, some simpler check to make sure the species names didn't have typos (e.g. the name either matches the authority-provided name directly or perhaps a closestMatch) would no doubt be helpful for users in catching common typos in writing out Latin names... |
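For illustration only, the "simpler check" described above could be sketched roughly as follows. This is Python rather than R so the example stays self-contained and offline; the list of accepted names is a hard-coded stand-in for whatever an authority would return, and `check_name` is a hypothetical helper, not a function from taxize, ITIS, or EML.

```python
import difflib

# Hypothetical stand-in for names provided by a taxonomic authority.
ACCEPTED_NAMES = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]


def check_name(name, accepted=ACCEPTED_NAMES, cutoff=0.8):
    """Classify a user-supplied name as exact, closest, or unknown.

    Returns a (status, suggestion) tuple so the caller can record how
    the final name was chosen (a very small piece of provenance).
    """
    if name in accepted:
        return ("exact", name)
    # Fuzzy comparison against the accepted names; keep the best candidate.
    matches = difflib.get_close_matches(name, accepted, n=1, cutoff=cutoff)
    if matches:
        return ("closest", matches[0])
    return ("unknown", None)
```

A caller could warn on `"closest"` and refuse on `"unknown"`. A real authority lookup may rank candidates differently from `difflib`, so results would vary by data source.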
this is now an opt-in for |
Metadata-based search queries in platforms like MetaCat are most useful if all EML files include full taxonomic rank classification (kingdom, phylum etc), even though the full classification isn't required for taxonomic coverage. This could be automatically computed from the species name (or other rank) given using queries from the R package taxize.
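As a rough sketch of what "opt-in" expansion could mean, the snippet below builds a list of ranks for one species and adds the higher ranks only when asked. It is written in Python with a hard-coded lineage table so it runs offline; the table stands in for a query to a taxonomic authority, and `taxonomic_coverage` and its `expand` flag are hypothetical names, not the interface proposed in this issue.

```python
# Hypothetical local table standing in for a lookup against an authority.
LINEAGES = {
    "Homo sapiens": [
        ("kingdom", "Animalia"),
        ("phylum", "Chordata"),
        ("class", "Mammalia"),
        ("order", "Primates"),
        ("family", "Hominidae"),
        ("genus", "Homo"),
    ],
}


def taxonomic_coverage(species, expand=False, lineages=LINEAGES):
    """Return a list of (rank, name) pairs for one species.

    With expand=False (the default) only the species rank is recorded.
    With expand=True the higher ranks are prepended when the species is
    found in the lookup table; unknown species are left unexpanded.
    """
    ranks = [("species", species)]
    if expand:
        ranks = list(lineages.get(species, [])) + ranks
    return ranks
```

Keeping expansion off by default leaves the serialized record at the most specific rank, which matters because higher taxonomy can change over time.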