improvements to contexts #134

Open
chinchliff opened this issue Mar 24, 2016 · 7 comments
@chinchliff
Member

Some feature requests from @uyedaj:

  1. Provide more contexts
  2. Support "synonyms" for contexts in the web services. For example, make the "Land plants" context also accessible under the official taxon name "Embryophyta", and make context names case-insensitive (if they are currently case sensitive).
@jar398
Member

jar398 commented Mar 24, 2016

In theory we only need one context for each nomenclatural code. From a UI
point of view we seem to already have too many. I'm sure @uyedaj is right,
but I would like to know why we need them; that would also tell me which
ones to add. Examples would help.

Maybe it would be more useful to allow an arbitrary taxon to be used as a
context? I don't know if that's possible in taxomachine, though.

@mtholder
Member

I agree with the idea that every higher taxon in OTT should be usable as a context.
For the sake of efficiency, the implementation of that might require:

  1. merging the results over a few pre-calculated contexts (each of which is too small for the request),
  2. filtering the results from a pre-calculated context (which is too large), or
  3. merging then filtering.

But those would all be implementation details, and not visible to (or confusing to) the client.
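
A minimal sketch of the merge and filter options, assuming a toy taxonomy and a hypothetical in-memory index keyed by context root. The names `PARENT`, `INDEX`, `filter_context`, and `merge_contexts` are illustrative only and are not taken from taxomachine:

```python
from typing import Dict, Iterable, Optional, Set

# Toy taxonomy (child -> parent). Names stand in for OTT IDs.
PARENT: Dict[str, str] = {
    "Mollusca": "Metazoa", "Chordata": "Metazoa",
    "Cephalopoda": "Mollusca", "Gastropoda": "Mollusca",
    "Octopus": "Cephalopoda", "Loligo": "Cephalopoda",
    "Helix": "Gastropoda", "Felis": "Chordata",
}

# Hypothetical pre-calculated contexts: root -> every taxon in its subtree.
INDEX: Dict[str, Set[str]] = {
    "Metazoa": set(PARENT) | {"Metazoa"},
    "Cephalopoda": {"Cephalopoda", "Octopus", "Loligo"},
    "Gastropoda": {"Gastropoda", "Helix"},
}

def is_within(taxon: Optional[str], ancestor: str) -> bool:
    """Walk up the parent links until we hit `ancestor` or run out of parents."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = PARENT.get(taxon)
    return False

def filter_context(big_root: str, target: str) -> Set[str]:
    """Option 2: start from a context that is too large and keep only the target's subtree."""
    return {t for t in INDEX[big_root] if is_within(t, target)}

def merge_contexts(roots: Iterable[str]) -> Set[str]:
    """Option 1: union several contexts that are each too small."""
    merged: Set[str] = set()
    for root in roots:
        merged |= INDEX[root]
    return merged

# Both routes reach the same candidate set for "Mollusca" in this toy data
# (the merge route has to add the target taxon itself).
via_filter = filter_context("Metazoa", "Mollusca")
via_merge = merge_contexts(["Cephalopoda", "Gastropoda"]) | {"Mollusca"}
```

In this sketch the merge route is only exact because the two smaller contexts happen to cover every child of the target; otherwise a final filtering pass over a larger context (option 3) would still be needed.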

@mtholder
Member

While I understand the benefit of tolerating synonyms, it actually seems much cleaner to me to require 2 calls:

  1. get an OTT ID for your context (which would support synonyms)
  2. then use that OTT ID to specify the context.
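
A rough sketch of that two-call flow, using local lookup tables rather than the real web services. The function names and all numeric IDs below are placeholders invented for illustration; they are not real OTT IDs or actual API endpoints:

```python
from typing import Dict, Optional

# Call 1 data: synonyms, compared case-insensitively, map to one context ID.
# The IDs are placeholders, NOT real OTT IDs.
CONTEXT_IDS: Dict[str, int] = {
    "land plants": 1, "embryophyta": 1,
    "sharks": 2, "selachii": 2, "selachimorpha": 2,
}

# Call 2 data: names indexed per context ID (placeholder taxon IDs).
NAMES_BY_CONTEXT: Dict[int, Dict[str, int]] = {
    1: {"malus domestica": 101},
    2: {"carcharodon carcharias": 201},
}

def get_context_id(name: str) -> int:
    """Call 1: resolve any synonym of a context to its ID."""
    key = name.strip().lower()
    if key not in CONTEXT_IDS:
        raise LookupError(f"unknown context name: {name!r}")
    return CONTEXT_IDS[key]

def match_name(name: str, context_id: int) -> Optional[int]:
    """Call 2: exact match of a taxon name, restricted to one context."""
    return NAMES_BY_CONTEXT.get(context_id, {}).get(name.strip().lower())

ctx = get_context_id("EMBRYOPHYTA")        # same ID as "Land plants"
hit = match_name("Malus domestica", ctx)   # found inside the plant context
miss = match_name("Malus domestica", get_context_id("sharks"))  # not found
```

The point of the split is that synonym handling and case folding live entirely in the first call, so the second call only ever sees an unambiguous ID.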

@chinchliff
Member Author

Well, besides name disambiguation, the other advantage of contexts is that they limit the search space, so they are faster and provide better fuzzy matches. For example, "Felis domestica" (an invalid name for housecat) is a close fuzzy match to "Malus domestica" (apple). I note these are already separated by existing contexts, but it at least illustrates the advantage of using a more limited scope for fuzzy matching (and I will reiterate: the speed improvements for fuzzy matching could be significant).
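
To make the "Felis domestica" case concrete, here is a small sketch that uses Python's standard `difflib` as a stand-in for the real fuzzy matcher (which it is not). The name lists and the `fuzzy_match` helper are toy assumptions:

```python
import difflib
from typing import Dict, List, Optional

# Toy name lists per context; a real context would hold many more names.
CONTEXTS: Dict[str, List[str]] = {
    "Mammals": ["felis catus", "panthera leo"],
    "Land plants": ["malus domestica", "quercus alba"],
}
ALL_NAMES: List[str] = [n for names in CONTEXTS.values() for n in names]

def fuzzy_match(query: str, context: Optional[str] = None) -> List[str]:
    """Return close matches, best first, optionally restricted to one context."""
    pool = ALL_NAMES if context is None else CONTEXTS[context]
    return difflib.get_close_matches(query.lower(), pool, n=3, cutoff=0.5)

unrestricted = fuzzy_match("Felis domestica")
in_mammals = fuzzy_match("Felis domestica", context="Mammals")
```

With this toy data the unrestricted search ranks the apple first, because "malus domestica" shares more characters with the query than "felis catus" does, while the search restricted to mammals can only return cat names and also scans a smaller pool.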

As for using an arbitrary taxon as a context: this is theoretically possible now, but it would require quadratic space and runtime to build and store the indexes, because each one includes entries for all the descendants of the specified taxon. That seems prohibitive. Mark's ideas seem promising.

In the meantime, adding a handful more contexts at shallow levels in the taxonomy could be helpful. It would require almost no effort and only a moderate amount of disk space, but I'm not sure how many to add or which taxa to use. Maybe @uyedaj could provide some thoughts.

@jar398
Member

jar398 commented Mar 24, 2016 via email

@uyedaj

uyedaj commented Mar 24, 2016

I don't have specific examples... I guess recently I was working with a cephalopod and an elasmobranch phylogeny. I was hoping that you could turn any higher taxon into a context, and then the user could just query whatever name they wanted (e.g. sharks, selachii, selachimorpha), get the ottid, and then use it as a context for querying tnrs.

Failing that, the standard textbook list of named animal clades would be useful. Some of these are already available, but others are not. e.g.:

Porifera, Ctenophora, Rotifera, Onychophora, Echinodermata, Brachiopoda, Bilateria, Lophotrochozoa, Ecdysozoa, Protostomes, Deuterostomes

Within larger groups, it would be useful to have things like:
Gastropoda, Bivalvia, Cephalopoda, Crustacea, Chondrichthyes, Actinopterygii, Sarcopterygii, Coleoptera, Hymenoptera, etc.

This is by no means exhaustive. As Cody said, my main issue is not disambiguation but speed. Even querying OpenTree for ottids when the names are exact matches is slower than I would like it to be for large trees.

@chinchliff
Member Author

> It's not quadratic, it's n log n.

It depends on the shape of the tree, right? If the tree were fully
imbalanced it would be n^2. If the tree were balanced it would be n log n.
But that does suggest that adding a lot of contexts might not actually be
that bad. Especially if they were limited to a minimum level of
inclusivity... E.g. it might not really make sense to add contexts
for small taxa—they won't be much faster than slightly larger contexts and
since most taxa are relatively small this could save a lot of space. But
I'm not sure how imbalanced the taxonomy actually is.

Mark's suggestion certainly seems more space efficient (and time efficient
for the initial indexing), but if it doesn't actually cost too much to just
create lots of redundant indexes, that seems simpler and could potentially
result in faster queries.
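
A back-of-the-envelope sketch of that space argument, assuming one index per taxon whose size is the number of taxa in its subtree. The tree builders and the `min_size` cutoff are illustrative assumptions, not measurements of the actual taxonomy:

```python
from typing import Dict, List

Tree = Dict[int, List[int]]  # node id -> child ids (leaves are absent as keys)

def balanced(depth: int) -> Tree:
    """Complete binary tree, heap-numbered from root 1."""
    return {i: [2 * i, 2 * i + 1] for i in range(1, 2 ** depth)}

def caterpillar(n_leaves: int) -> Tree:
    """Fully imbalanced tree rooted at 0: each internal node has one leaf
    and one further internal node, except the last, which has two leaves."""
    children: Tree = {}
    next_id = n_leaves - 1  # ids 0 .. n_leaves-2 are the internal nodes
    for i in range(n_leaves - 1):
        leaf, next_id = next_id, next_id + 1
        if i < n_leaves - 2:
            other = i + 1
        else:
            other, next_id = next_id, next_id + 1
        children[i] = [leaf, other]
    return children

def total_index_entries(children: Tree, root: int, min_size: int = 1) -> int:
    """Sum of subtree sizes over every taxon with at least `min_size` taxa."""
    sizes: Dict[int, int] = {}
    total = 0
    stack = [(root, False)]  # iterative post-order, safe for deep trees
    while stack:
        node, done = stack.pop()
        kids = children.get(node, [])
        if done:
            sizes[node] = 1 + sum(sizes[k] for k in kids)
            if sizes[node] >= min_size:
                total += sizes[node]
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return total

# Two 15-taxon trees with 8 tips each.
balanced_cost = total_index_entries(balanced(3), root=1)
imbalanced_cost = total_index_entries(caterpillar(8), root=0)
```

Running the same function on the real taxonomy, with a few different `min_size` values, would be one way to find out how imbalanced it is in practice and how much space a minimum-inclusivity cutoff would save.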

@jar398 jar398 self-assigned this Mar 25, 2016

4 participants