improvements to contexts #134

chinchliff · 2016-03-24T17:15:47Z

Some feature requests from @uyedaj:

Provide more contexts
Support "synonyms" for contexts in the web services: e.g. make the "Land plants" context also accessible by the official taxon name "Embryophyta". Context names should also not be case sensitive (if they currently are).

jar398 · 2016-03-24T17:33:08Z

In theory we only need one context for each nomenclatural code. From a UI
point of view we seem to already have too many. I'm sure @uyedaj
is right but I would like to know why we need
them - that would also tell me which ones to add. Examples would help.

Maybe it would be more useful to allow an arbitrary taxon to be used as a
context? I don't know if that's possible in taxomachine, though.

mtholder · 2016-03-24T17:44:17Z

I agree with the idea that every higher taxon in OTT should be usable as a context.
For the sake of efficiency, the implementation of that might require:

merging the results over a few pre-calculated contexts (each of which is too small for the request),
filtering the results from a pre-calculated context (which is too large), or
merging then filtering

But those would all be implementation details, and not visible to (or confusing to) the client.
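
A minimal sketch of the merge and filter options above, using a toy in-memory taxonomy. The names `PARENT`, `PRECOMPUTED`, `names_by_filtering`, and `names_by_merging` are hypothetical and do not come from taxomachine; the taxonomy is heavily abbreviated for illustration.

```python
# Toy taxonomy as a child -> parent map (hypothetical, heavily abbreviated).
PARENT = {
    "Metazoa": None,
    "Chordata": "Metazoa",
    "Mollusca": "Metazoa",
    "Cephalopoda": "Mollusca",
    "Gastropoda": "Mollusca",
    "Felis catus": "Chordata",
    "Octopus vulgaris": "Cephalopoda",
    "Sepia officinalis": "Cephalopoda",
    "Helix pomatia": "Gastropoda",
}

# Assumption: only these taxa have a pre-calculated context index.
PRECOMPUTED = {
    "Metazoa": {"Felis catus", "Octopus vulgaris", "Sepia officinalis", "Helix pomatia"},
    "Cephalopoda": {"Octopus vulgaris", "Sepia officinalis"},
    "Gastropoda": {"Helix pomatia"},
}


def is_descendant(taxon, ancestor):
    """Walk up the parent links; True if `ancestor` is on the path."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = PARENT[taxon]
    return False


def names_by_filtering(target, big_context):
    """Filter a pre-calculated context that is too large for the request."""
    return sorted(n for n in PRECOMPUTED[big_context] if is_descendant(n, target))


def names_by_merging(small_contexts):
    """Merge pre-calculated contexts that are each too small for the request."""
    merged = set()
    for context in small_contexts:
        merged.update(PRECOMPUTED[context])
    return sorted(merged)


# "Mollusca" has no index of its own, but both routes give the same names.
print(names_by_filtering("Mollusca", "Metazoa"))
print(names_by_merging(["Cephalopoda", "Gastropoda"]))
```

Merging alone is only complete when the small contexts together cover every descendant of the requested taxon; otherwise the third option (merge, then filter) would be needed.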

mtholder · 2016-03-24T17:46:02Z

While I understand the benefit of tolerating synonyms, it actually seems much cleaner to me to require 2 calls:

get an OTT ID for your context (which would support synonyms)
then use that OTT ID to specify the context.
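
The two-call flow could look roughly like this. This is a sketch with toy in-memory data, not the actual web service: `resolve_context_id` and `match_names` are hypothetical function names, and the numeric IDs are placeholders rather than real OTT IDs.

```python
# Placeholder IDs for illustration only; these are NOT real OTT IDs.
CONTEXT_SYNONYMS = {
    "land plants": 1001,
    "embryophyta": 1001,
    "sharks": 2002,
    "selachii": 2002,
    "selachimorpha": 2002,
}

# Toy per-context name indexes, keyed by the placeholder IDs above.
NAMES_IN_CONTEXT = {
    1001: {"Malus domestica", "Quercus robur"},
    2002: {"Carcharodon carcharias"},
}


def resolve_context_id(name):
    """Call 1: map a context name or synonym to an ID, ignoring case."""
    return CONTEXT_SYNONYMS.get(name.strip().lower())


def match_names(names, context_id):
    """Call 2: exact-match each name within the context given by ID."""
    index = NAMES_IN_CONTEXT[context_id]
    return {name: name in index for name in names}


context_id = resolve_context_id("Embryophyta")
print(match_names(["Malus domestica", "Felis catus"], context_id))
```

Keeping synonym handling in the first call means the second call only ever sees an unambiguous ID.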

chinchliff · 2016-03-24T17:49:08Z

Well, besides name disambiguation, the other advantage of contexts is that they limit the search space and are thus faster and provide better fuzzy matches. For example, "Felis domestica" (an invalid name for housecat) is a close fuzzy match to "Malus domestica" (apple). I note these are already separated by existing contexts but at least it illustrates the advantage of using more limited scope for fuzzy matching (and I will reiterate: the speed improvements for fuzzy matching could be significant).
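
To illustrate the fuzzy-matching point, here is a small sketch using Python's standard `difflib`. The similarity measure is an assumption chosen for illustration and is not necessarily what the TNRS uses; the name lists and the helper `best_fuzzy_match` are hypothetical.

```python
from difflib import SequenceMatcher

# Toy name lists grouped by context (illustrative only).
NAMES_BY_CONTEXT = {
    "Animals": ["Felis catus", "Canis lupus"],
    "Land plants": ["Malus domestica", "Quercus robur"],
}


def similarity(a, b):
    """Case-insensitive similarity ratio between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_fuzzy_match(query, candidates):
    """Return the candidate most similar to the query."""
    return max(candidates, key=lambda name: similarity(query, name))


query = "Felis domestica"
all_names = [n for names in NAMES_BY_CONTEXT.values() for n in names]

# Searching everything: the shared epithet pulls the match toward the apple.
print(best_fuzzy_match(query, all_names))                    # Malus domestica
# Searching only the animal context: fewer candidates, better match.
print(best_fuzzy_match(query, NAMES_BY_CONTEXT["Animals"]))  # Felis catus
```

The restricted search also compares the query against fewer names, which is where the speed advantage would come from.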

As far as using any arbitrary taxon as a context, this is theoretically possible currently, but it would require quadratic space and runtime to store and build the indexes, because each one includes entries for all the descendants of the specified taxon. That seems prohibitive. Mark's ideas seem promising.

In the meantime, adding a handful more contexts at shallow levels in the taxonomy could be helpful and would require almost no effort and only a moderate amount of disk space, but I'm not sure how many nor which taxa to use. Maybe @uyedaj could provide some thoughts.

jar398 · 2016-03-24T17:53:14Z

It's not quadratic, it's n log n. But I agree it's probably too big given the current prices for AWS instances. Awaiting examples and/or criteria. They're not hard to add.

uyedaj · 2016-03-24T21:09:09Z

I don't have specific examples... I guess recently I was working with a cephalopod and an elasmobranch phylogeny. I was hoping that you could turn any higher taxon into a context, and then the user could just query whatever name they wanted (e.g. sharks, selachii, selachimorpha), get the ottid, and then use it as a context for querying tnrs.

Failing that, the standard textbook list of named animal clades would be useful. Some of these are already available, but others are not. e.g.:

Porifera, Ctenophora, Rotifera, Onychophora, Echinodermata, Brachiopoda, Bilateria, Lophotrochozoa, Ecdysozoa, Protostomes, Deuterostomes

Within larger groups, it would be useful to have things like:
Gastropoda, Bivalvia, Cephalopoda, Crustacea, Chondrichthyes, Actinopterygii, Sarcopterygii, Coleoptera, Hymenoptera, etc.

This is by no means exhaustive. As Cody said, my main issue is not disambiguation but speed. Even querying OpenTree for ottids when the names are exact matches is slower than I would like it to be for large trees.

chinchliff · 2016-03-25T02:14:58Z

> It's not quadratic, it's n log n.

It depends on the shape of the tree, right? If the tree were fully
imbalanced it would be n^2. If the tree were balanced it would be n log n.
But that does suggest that adding a lot of contexts might not actually be
that bad. Especially if they were limited to a minimum level of
inclusivity... E.g. it might not really make sense to add contexts
for small taxa—they won't be much faster than slightly larger contexts and
since most taxa are relatively small this could save a lot of space. But
I'm not sure how imbalanced the taxonomy actually is.
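
The dependence on tree shape can be checked with a small sketch. It assumes one index per taxon whose size equals the number of nodes in that taxon's subtree (counting the taxon itself); the function names and the `min_size` cutoff parameter are hypothetical, and the trees are synthetic rather than the real taxonomy.

```python
def total_index_entries(children, root, min_size=1):
    """Sum subtree sizes over all taxa whose subtree has >= min_size nodes."""
    sizes = {}
    total = 0
    stack = [(root, False)]  # iterative post-order traversal
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not visited:
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
        else:
            sizes[node] = 1 + sum(sizes[kid] for kid in kids)
            if sizes[node] >= min_size:
                total += sizes[node]
    return total


def balanced_tree(depth):
    """Complete binary tree: node i has children 2i+1 and 2i+2."""
    n = 2 ** (depth + 1) - 1
    return {i: [2 * i + 1, 2 * i + 2] for i in range(n) if 2 * i + 2 < n}


def caterpillar_tree(n_internal):
    """Fully imbalanced tree: each internal node has a leaf and one subtree."""
    return {2 * j: [2 * j + 1, 2 * j + 2] for j in range(n_internal)}


# Both trees have 1023 nodes; only the shape differs.
print(total_index_entries(balanced_tree(9), 0))       # 9217, roughly n log n
print(total_index_entries(caterpillar_tree(511), 0))  # 262655, roughly n^2 / 4
# Skipping contexts for small taxa trims the total further.
print(total_index_entries(balanced_tree(9), 0, min_size=15))
```

The real taxonomy lies somewhere between these two extremes, so running the same sum over the actual tree would show how much space a context per taxon would need.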

Mark's suggestion certainly seems more space efficient (and time efficient
for the initial indexing), but if it doesn't actually cost too much to just
create lots of redundant indexes, that seems simpler and could potentially
result in faster queries.

jar398 self-assigned this Mar 25, 2016
