Terms Lookup by Query/Filter (aka. Join Filter) #3278
Conversation
Just updated the PR with significant improvements to the lookup and filter on numeric fields. Performing the lookup on a numeric field is now ~2x faster than the same lookup on a string field.
I love this functionality, I already have a use case for it! 👍
Updated the PR to be current with the latest master changes and fixed a bug when executing across multiple nodes. All commits have been squashed. Has anyone had a chance to review this PR?
Updated to work with the latest changes in master.
Sorry for getting involved so late... This feature is really cool! (the equivalent of a subquery). The main concern I have with this feature is that the amount of values being sent over the wire between nodes is unbounded. Each time the terms filter with a query is parsed, a query is executed on all shards if the routing option isn't used. If a query matches millions of values (which I think is a common scenario), then all these values need to be sent over the wire. This transfer of values will occur for each search request with a terms filter, multiplied by the number of shards the search request targets. I'm also wondering about the lookup cache here: if one document changes in the index that the terms lookup query is executed on, then the cache entry for that query needs to be trashed. Thinking a bit more about the lookup cache, it doesn't make much sense when a terms lookup query is used, since it is really hard to find out what has changed from the perspective of the terms lookup query. I'm also wondering: if the terms lookup query were only executed on shards that are locally available on the node executing the terms filter, would this feature still be useful? We could, for example, make the routing option required.
Hey @martijnvg,
I have the same concern about the amount of data being sent, but I guess that could be documented as a known limitation. I imagine people will find it acceptable given the benefits a query like this provides. The request per shard (since it happens during parsing) is a problem. I actually didn't realize that the json was parsed per shard. I would really like to gather the terms once and then send them to the shards. Maybe by creating a new phase for this? Not sure. A while back we talked about field collapsing and you mentioned that to do it correctly some things would need to be moved around internally (the phases, if I remember correctly). Would that change make implementing this feature any easier/better?
Yea, I did this to try and piggyback on the terms lookup functionality/caching. I don't think caching the lookup makes much sense unless we can do it per segment, but then we still have the cost of sending over the transport. My original idea was to have this as a new JoinQuery and JoinFilter that only caches the resulting terms filter. I would probably move to that vs. being part of the lookup if we were to move forward on this.
I don't think so. You might as well use the parent/child functionality if that were the case. At any rate, the move from trove to hppc (088e05b) caused some issues with my current patch due to hppc ObjectOpenHashSet not implementing the standard Collection interface. Before I spend time getting everything fixed up, let's figure out the best approach to get this moving forward. Thanks for taking the time to look at this!
Perhaps there could be an option that controls the number of terms being fetched per shard? About the request per shard, we can do something smart and have a helper service that just bundles the shard-level field value retrieval into one request. I don't think there is a need to introduce an additional distributed phase here (it is possible via an additional phase, but not necessary, and it would make this feature much bigger than it should be). About hppc, in what case do you need to bridge to the JCF world?
Yea, I think having an option like that makes sense. That helper service sounds great, is there anything like that being used somewhere in the code that I can have a look at? I prefer that over a new distributed phase for sure! I had relied on JCF by abstracting TermsLookup to have a getTerms method that returns a collection of terms. The existing field-based implementation returns a list of strings; my query implementation returned a THashSet of either BytesRef or Number depending on the field type being looked up. Hppc doesn't implement the Collection interface, so I can't just swap Trove for Hppc easily. This wouldn't be a problem if I didn't try to make this part of the terms lookup. What do you think about pulling this back out into a separate JoinQuery/JoinFilter?
I don't know of a good example right now, but it should be something simple. It should just keep the query result around for the duration of the query phase (maybe as a ThreadLocal?) so that unnecessary requests are avoided, and drop the results after the query phase has completed. I think it is best to just completely move over to Hppc. TermsLookup#getTerms should return an Iterator instead of a Collection. Both implementations would return a simple wrapper that delegates to the actual implementation. For the QueryTermsLookup you can just make TermsByQueryAction work with ObjectOpenHashSet and wrap the result in an anonymous Iterator impl. For FieldTermsLookup you can keep using XContentMapValues#extractRawValues as is and just return the list's iterator. Looking at XContentMapValues#extractRawValues usages, I think it could be moved to use Hppc natively instead of JCF, but that is unrelated to this change.
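A minimal sketch of the kind of wrapper being suggested here, assuming hppc's ObjectOpenHashSet (the class and method names below are illustrative, not code from this PR):

import java.util.Iterator;

import com.carrotsearch.hppc.ObjectOpenHashSet;
import com.carrotsearch.hppc.cursors.ObjectCursor;

// Hypothetical wrapper: exposes an hppc set, which does not implement
// java.util.Collection, through a plain java.util.Iterator so that a
// TermsLookup#getTerms style method can return Iterator for both lookups.
final class HppcTermsIterator<T> implements Iterator<T> {

    private final Iterator<ObjectCursor<T>> cursors;

    HppcTermsIterator(ObjectOpenHashSet<T> terms) {
        // hppc containers iterate over cursor objects rather than raw values
        this.cursors = terms.iterator();
    }

    @Override
    public boolean hasNext() {
        return cursors.hasNext();
    }

    @Override
    public T next() {
        return cursors.next().value;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException("lookup terms are read-only");
    }
}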
Sounds good, let me see what I can do. I am going to work on the Hppc fix first as that should be pretty easy. Thanks.
@mattweber great that you can work on this! Let's move this forward and get this in.
@martijnvg Got this updated for hppc but I am stuck trying to figure out how to execute only a single lookup request rather than one per shard.
Yes, a ThreadLocal won't help then... Maybe for now keep the TermsByQueryAction as is.
PR updated with latest changes from master and the switch to hppc.
@s1monw The tests I am writing for this keep failing. Just wondering if this is because I am running it on a Mac with the new randomized testing? It doesn't appear to be anything related to my actual test.
it tells you that you are not releasing the searcher you acquired:
it even tells you where this happened: |
I do release it...
It looks like something with the MockDirectory. I have had this test complete successfully, but most of the time it fails. If I delete all the TEST cluster data manually and run it, it tends to pass.
The mock dir failure is a side-effect of this not being released properly! This seems spooky:
Seems like you don't release it if there is no field mapper? Maybe you don't create the mapping properly and the mapping is not available on the shard? Do you use dynamic mapping?
I use dynamic mapping, but this is only executing over a single shard so it should have the mapping once documents are indexed. Let me make sure the context gets released even if there is a fieldMapper exception. What gets me is that the test passes sometimes...
Still getting random failures... I can see that on the tests that fail there is this:
On tests that pass, that is not there. |
I pulled that PR in and did some modifications to get a clearer error message. I don't see any pending searchers with that one anymore, and it fails all the time with that seed. Check out https://gist.github.com/s1monw/6998524, which should give you a better idea of why you are seeing the NullPointerExceptions all the time. With this cmd it fails consistently for me:
Thanks Simon, I think I found the problem in the serialization code. It only happens when using a different transport. Randomized testing did its job!
Ohh I see you fixed the serialization as well in that gist. Thanks! |
Pushed up the latest changes. The bug in the tests was due to missing serialization of one of the request fields. While fixing this, I removed some of the unnecessary multi-shard and multi-node tests since that is all handled by the randomized tests. I updated my tests to pick a random number of shards, documents to index, and range query. BTW, I force pushed the latest update to this PR so you should do a new checkout.
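A minimal sketch of why this class of bug only shows up when the request actually crosses the wire, assuming the Streamable interface Elasticsearch used at the time (the request class and fields below are hypothetical, not the PR's code):

import java.io.IOException;

import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Streamable;

// Hypothetical request: writeTo and readFrom must stay symmetric. A field that
// is written but never read (or read out of order) may work when the request
// never leaves the node and only break once it is actually serialized, which
// matches the "only happens when using a different transport" symptom above.
class ExampleLookupRequest implements Streamable {

    private String lookupField;
    private long maxTerms;

    @Override
    public void readFrom(StreamInput in) throws IOException {
        lookupField = in.readString();
        maxTerms = in.readVLong();   // omitting this line desynchronizes every later read
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(lookupField);
        out.writeVLong(maxTerms);
    }
}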
import java.util.Iterator;
import java.util.List;

public abstract class ResponseTerms implements Streamable {
I have been thinking about this abstraction and maybe we don't need it? We only care about the term value and we don't do anything with it. I think it is only there because of field data, right? Maybe we can use only BytesRef, and when there is a join on a number-based field, convert to BytesRef pre-join and convert it back to a number-based representation post-join? This could reduce the amount of code here and in FieldDataTermsFilter.
That is actually how I originally implemented it, but I found performance considerably worse for numeric fields. This is because, to get the BytesRef from the field data, it does a new BytesRef(Long.toString(val)). These string conversions killed performance to the point where joins on numeric fields were 2x as slow as on string fields. Now that I stick with the primitive types, joins on numeric fields are ~2x faster than on string fields. I figured the more complex code was worth the performance increase.
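A tiny sketch of the two per-value checks being compared (method and class names are hypothetical, not the PR's code), assuming Lucene's BytesRef and hppc's LongOpenHashSet:

import java.util.Set;

import org.apache.lucene.util.BytesRef;

import com.carrotsearch.hppc.LongOpenHashSet;

// Illustrative comparison of the two lookup styles for a numeric join field.
class NumericTermsCheckSketch {

    // BytesRef path: every numeric field value is converted to a String and then
    // a BytesRef before the set lookup, allocating on every document.
    static boolean matchesViaBytesRef(long fieldValue, Set<BytesRef> terms) {
        return terms.contains(new BytesRef(Long.toString(fieldValue)));
    }

    // Primitive path: a plain hash-set membership test on the long value itself.
    static boolean matchesViaPrimitive(long fieldValue, LongOpenHashSet terms) {
        return terms.contains(fieldValue);
    }
}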
Makes sense, I was just wondering if this could be done with less code :)
Is there any update on this PR? (timeline, roadmap, etc.)
+1
+1 |
+1 @mattweber did you ever get any traction with moving this feature into a plugin? I'd be very interested in trying it out in order to avoid my current solution for joins, which is passing large lists of identifiers in terms filters.
It seems SIREn has implemented joins as a plugin based on this request.
+1
Thanks for the heads-up regarding Siren @yehosef. I took it for a quick spin and it seems to work well! At the very least it cuts out a round trip between the cluster and the client for single join scenarios.
@mattweber now that a lot of the query refactoring work has been done, is there still interest in pursuing this pull request? (it may need to be rewritten with all the changes in master lately)
Please. I'm sure a lot of people would love to see this added. Siren works well, but it would be very nice to have this out of the box.
Yes, I would love to see it out of the box as well.
Can one of the admins verify this patch?
Closing this as we didn't receive any feedback from the author.
noooooo
|
Either way, I want to thank @mattweber for this PR and for trying. I am sure this type of functionality will eventually be part of Elasticsearch.
+1 |
@gcampbell-epiq not until somebody comes up with a way to do this that is horizontally scalable.
@clintongormley - I'm not sure if this is a viable option or totally related, but here is a way I had been considering adding joins to ES that is horizontally scalable (I think..). It's a different approach than this one - you can decide if it's helpful/relevant/crazy.
You have the ability to insert with a query (sort of - the reindex API) and update with a query. If you had the ability to do an UPSERT with a query, you could relatively easily do large joins.
Let's say I have an index ("sales") with products users have bought and I want to find people that bought product x and product y (not in the same purchase). You can do an upsert by query of the mem_ids that bought product x, with the mem_id as the doc_id in a new index (call it "tmp_join_sales_123"). The doc json would look like {product_x:true}. You would then do the same thing for product y - the data would be {product_y:true}. The data would end up in one of three forms: {product_x:true}, {product_y:true}, or {product_x:true,product_y:true}. The join query would then just require both flags on the tmp index (see the sketch below).
It is a little more indirect because you are actually creating a temp index for the join, but I think that's what saves it for large data sets. It seems to me the general approach would be very powerful. The main functionality that is missing is UPSERT by query.
This obviously doesn't need the two indices to be the same, just that they have a common element - which is the basis of any join. If the join conditions are complex you could serialize or hash them. If you need extra data in the results of the query other than the mem_id, you could add it to the data both queries write to the tmp index.
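As a rough sketch of that final query (index and field names are taken from the example above; the commented-out client call is just one way to run it):

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Illustrative only: once both "upsert by query" passes have written their flags
// into tmp_join_sales_123, the join reduces to an ordinary bool query there.
class TmpIndexJoinSketch {

    static BoolQueryBuilder boughtBothProducts() {
        return QueryBuilders.boolQuery()
                .must(QueryBuilders.termQuery("product_x", true))
                .must(QueryBuilders.termQuery("product_y", true));
        // e.g. client.prepareSearch("tmp_join_sales_123").setQuery(boughtBothProducts())
    }
}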
@clintongormley - great to hear. It seems to me that it shouldn't be too complicated to implement, but I'm not sure. There's a different product I'm playing with now (under NDA - can't say what) that can do broad joins in Elasticsearch, but I think having this capability built in would probably perform better for big data sets.
This PR adds support for generating a terms filter based on the field values
of documents matching a specified lookup query/filter. The value of the
configurable "path" field is collected from the field data cache for each
document matching the lookup query/filter and is then used to filter the main
query. This can also be called a join filter.
This PR abstracts the TermsLookup functionality in order to support multiple
lookup methods. The existing functionality is moved into FieldTermsLookup and
the new query based lookup is in QueryTermsLookup. All existing caching
functionality works with the new query based lookup for increased performance.
During testing I found that one of the performance bottlenecks was
generating the Lucene TermsFilter on large sets of terms (probably since
it sorts the terms). I have created a FieldDataTermsFilter that uses the
field data cache to look up the value of the field being filtered and compare it to
the set of gathered terms. This significantly increased performance at the
cost of higher memory usage. Currently a TermsFilter is used when the number
of filtering terms is less than 1024 and the FieldDataTermsFilter is used
for everything else. This should eventually be configurable or we need to
perform some tests to find the optimal value.
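A minimal sketch of that selection rule (the names below are illustrative stand-ins; the actual Lucene TermsFilter and the PR's FieldDataTermsFilter are not reproduced here):

import java.util.Set;

import org.apache.lucene.util.BytesRef;

// Illustrative decision logic mirroring the description above.
class LookupFilterChoiceSketch {

    static final int MAX_TERMS_FOR_SORTED_FILTER = 1024; // hard-coded threshold mentioned above

    enum Choice { LUCENE_TERMS_FILTER, FIELD_DATA_TERMS_FILTER }

    static Choice choose(Set<BytesRef> gatheredTerms) {
        // TermsFilter sorts its terms, which gets expensive for large term sets,
        // so beyond the threshold the field-data based comparison is used instead.
        return gatheredTerms.size() < MAX_TERMS_FOR_SORTED_FILTER
                ? Choice.LUCENE_TERMS_FILTER
                : Choice.FIELD_DATA_TERMS_FILTER;
    }
}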
Examples:
Replicate a has_child query by joining on the child's "pid" field to the
parent's "id" field for each child that has the tag "something".
Lookup companies that offer products or services mentioning elasticsearch.
Notice that products and services are kept in their own indices.