Field Collapsing/Combining #256
Comments
Count this comment as a vote to have this feature added. |
I could make good use of this feature. Go for it! |
+1 vote for that |
Yes, it's a really cool feature. |
In Solr, grouping is not supported for distributed search. If it were implemented, it could be a big plus for ElasticSearch. |
The only workaround is to "group" the results on the client side, is that correct? |
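For readers wondering what the client-side workaround might look like, here is a minimal sketch. It assumes the hits have already been fetched and sorted (for example by relevance) and that each hit is a plain dict; the function name `collapse_hits` and the field names are made up for illustration and are not part of any ElasticSearch client.

```python
from typing import Any, Dict, Iterable, List


def collapse_hits(hits: Iterable[Dict[str, Any]], field: str) -> List[Dict[str, Any]]:
    """Keep only the first hit seen for each distinct value of `field`.

    Assumes `hits` is already in the desired sort order, so the first hit
    per key is the "winner" for that group. Values must be hashable.
    """
    seen = set()
    collapsed = []
    for hit in hits:
        key = hit.get(field)
        if key in seen:
            continue  # a better-ranked hit with this key was already kept
        seen.add(key)
        collapsed.append(hit)
    return collapsed


# Hypothetical hits, already sorted by descending score.
hits = [
    {"id": 1, "report_type": "annual", "score": 3.2},
    {"id": 2, "report_type": "annual", "score": 2.9},
    {"id": 3, "report_type": "quarterly", "score": 2.5},
    {"id": 4, "report_type": "annual", "score": 1.1},
]
collapsed = collapse_hits(hits, "report_type")
```

Note that this only dedupes the hits the client actually fetched, so a page of 10 requested hits can shrink to fewer than 10 after collapsing.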
+1 This sounds really useful |
This is probably a broader topic of collapsing (dropping dupes based on sort order although many times one field isn't enough to decide a good dedupe), or full rollups where you retain the individual documents within an aggregate replacement document ("5 books by this author"). There are fun issues with each, such as do you try to satisfy the requested window results? How does paging work when things are missing? Does the total document count get adjusted (but is still wrong as you don't know what other pages hold)? ... |
For me this should work like "select distinct" in SQL, so I expect duplicates to be removed everywhere, including the total document count, pagination and window result. |
At that point, it's a full group-by, and in SQL you are getting aggregate values back in functions, and sometimes undefined if you ask for non-aggregate fields ... in the search engine how are the other fields besides the rollup key being treated? Is it a grouping into a master aggregate document listing all the children, or at least the fact that there are children, such as what Endeca does? Or is it a deduping where the first one at highest relevancy wins even if many of the other fields differ outside of the key (you need compound keys then, as deduping on a single field isn't enough to make that desirable)? |
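To make the "rollup" reading concrete, here is a small sketch of grouping that keeps the children inside an aggregate record and pages over groups rather than documents. It assumes every matching hit is available in one place, which is exactly the part that is hard in a distributed search; `rollup`, `page_of_groups` and the sample data are hypothetical names for illustration only.

```python
def rollup(hits, field):
    """Group hits by `field`, keeping every child document in its group.

    Groups appear in the order their first (best-ranked) hit was seen.
    """
    groups = {}
    for hit in hits:
        groups.setdefault(hit[field], []).append(hit)
    return [
        {"key": key, "count": len(docs), "top": docs[0], "children": docs}
        for key, docs in groups.items()
    ]


def page_of_groups(groups, start, size):
    """Page over groups, so the total counts groups, not raw documents."""
    return {"total": len(groups), "groups": groups[start:start + size]}


# Hypothetical hits, already in sort order.
books = [
    {"title": "A", "author": "Le Guin"},
    {"title": "B", "author": "Le Guin"},
    {"title": "C", "author": "Borges"},
    {"title": "D", "author": "Le Guin"},
    {"title": "E", "author": "Calvino"},
]
groups = rollup(books, "author")
first_page = page_of_groups(groups, 0, 2)
```

Here the total is 3 (distinct authors) rather than 5 (documents), which matches the "select distinct" expectation, while each group still carries its children for a "3 books by this author" style display.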
Hey, we're now applying this and adding facets to it with a two-phase approach. We first get the list of doc ids, then pass them in as a term list and facet on that query. Was curious if there was any more efficient method of doing this? Thanks, |
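A rough sketch of the two-phase approach described above, building only the request body and not sending it anywhere. Several things are assumptions: the commenter did not show their query, so the `ids` query is one possible way to pass the id list; the `terms` aggregation is used here in place of the facets API that was current when the comment was written; and `collapsed_ids`, `facet_request` and the field names are hypothetical.

```python
def collapsed_ids(hits, field):
    """Phase 1: collapse sorted hits on `field` and return the surviving ids."""
    seen, ids = set(), []
    for hit in hits:
        if hit[field] not in seen:
            seen.add(hit[field])
            ids.append(hit["_id"])
    return ids


def facet_request(doc_ids, facet_field):
    """Phase 2: build a search body that counts terms over only those ids."""
    return {
        "size": 0,  # only the counts are needed, not the hits themselves
        "query": {"ids": {"values": list(doc_ids)}},
        "aggs": {"by_" + facet_field: {"terms": {"field": facet_field}}},
    }


# Hypothetical phase-1 hits, already sorted by relevance.
hits = [
    {"_id": "1", "author": "Le Guin", "publisher": "Ace"},
    {"_id": "2", "author": "Le Guin", "publisher": "Harper"},
    {"_id": "3", "author": "Borges", "publisher": "Ace"},
]
ids = collapsed_ids(hits, "author")
body = facet_request(ids, "publisher")
```

The cost the commenter is asking about is visible here: the id list travels back to the cluster in the second request, so the body grows with the number of collapsed hits.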
+1 vote for this issue too. |
subscribe |
+1 |
plz don't make us switch to SOLR just for this feature |
Note that Solr does not implement it for a distributed search (as far as I know) and the implementation is problematic (my view). |
Are you referring to the "field collapse patch" floating around in their Jira? I haven't checked if that made it into a recent release, so I don't know how up to date my info is. I just noticed that queries using the "field collapse patch" are an order of magnitude slower than queries without it. |
Note that there is now (finally!) a new grouping module in Lucene -- see https://issues.apache.org/jira/browse/LUCENE-1421 It's been back-ported to 3.x, under lucene/contrib/grouping. So in theory exposing this in ElasticSearch should be straightforward? (And, if it's not, I'd really like to know about that so we can fix it!). There is some performance hit but not as bad as I had expected. See the 3 TermGroupXXX charts here: http://people.apache.org/~mikemccand/lucenebench -- it's ~ 2.3x-2.5X slower than the straight TermQuery, when grouping by a field with 100, 10K, 1M unique values (though, the sort and groupSort are relevance; maybe when sorting by other fields this is slower). This should also be the worst-case slowdown since TermQuery is such an "easy" query; queries which are "hard" and don't produce many results should see less net impact from the grouping overhead, I expect. |
Cool!, saw that a few days ago, will definitely have a look. |
Hi, with the release of Lucene 3.2, one of its features is: |
+1 |
+1 |
++1 |
+1 |
+1 |
+1 |
+1 |
I'm also working on making it easy(ier) to distribute grouping, by adding static merge methods to TopDocs/TopGroups. Ie, each shard can run the 1st pass collector, send top groups back to front end, front end merges the top groups (SearchGroup.merge) and issues request to all shards to run 2nd pass collector, gets results back, merges with TopGroups.merge. This is all under https://issues.apache.org/jira/browse/LUCENE-3191 |
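The two-pass flow described above can be sketched in plain Python to show the shape of the data moving between shards and the front end. This is only an illustration under stated assumptions: the four functions are stand-ins written for this example, not Lucene's `SearchGroup.merge` or `TopGroups.merge`, and groups are ranked by their best document score with no tie handling.

```python
import heapq
from collections import defaultdict


def first_pass(shard_docs, top_n):
    """Per shard: best score for each group, trimmed to the top_n groups."""
    best = {}
    for doc in shard_docs:
        group = doc["group"]
        if group not in best or doc["score"] > best[group]:
            best[group] = doc["score"]
    return heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])


def merge_search_groups(per_shard_groups, top_n):
    """Front end: merge each shard's top groups into the global top_n keys."""
    best = {}
    for shard_groups in per_shard_groups:
        for group, score in shard_groups:
            if group not in best or score > best[group]:
                best[group] = score
    ranked = heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])
    return [group for group, _ in ranked]


def second_pass(shard_docs, groups, docs_per_group):
    """Per shard: top documents for each of the globally chosen groups."""
    wanted = set(groups)
    by_group = defaultdict(list)
    for doc in shard_docs:
        if doc["group"] in wanted:
            by_group[doc["group"]].append(doc)
    return {
        group: heapq.nlargest(docs_per_group, docs, key=lambda d: d["score"])
        for group, docs in by_group.items()
    }


def merge_top_groups(per_shard_results, groups, docs_per_group):
    """Front end: merge per-shard documents for each group, keeping group order."""
    merged = []
    for group in groups:
        docs = [d for shard in per_shard_results for d in shard.get(group, [])]
        top = heapq.nlargest(docs_per_group, docs, key=lambda d: d["score"])
        merged.append({"group": group, "docs": top})
    return merged


# Two hypothetical shards.
shard_a = [
    {"id": "a1", "group": "x", "score": 0.9},
    {"id": "a2", "group": "y", "score": 0.4},
    {"id": "a3", "group": "x", "score": 0.3},
]
shard_b = [
    {"id": "b1", "group": "y", "score": 0.8},
    {"id": "b2", "group": "z", "score": 0.7},
    {"id": "b3", "group": "x", "score": 0.5},
]
shards = [shard_a, shard_b]

top_groups = merge_search_groups([first_pass(s, 2) for s in shards], 2)
result = merge_top_groups([second_pass(s, top_groups, 2) for s in shards], top_groups, 2)
```

The second round trip is needed because a shard cannot know during the first pass which groups will win globally: in the sample data, shard B ranks group `x` third locally, yet `x` is the overall top group and shard B's document `b3` ends up in its merged result.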
+1 |
+1 any news on whether https://issues.apache.org/jira/browse/LUCENE-1421 as mentioned by mikemccand will work in elasticsearch? |
+1 |
This would be incredibly useful for the application I am writing for my company. I am, however, amazed at how capable Elasticsearch is already that I feel it would be rude not to say thank-you before adding my YES to this request for this feature to be added. |
+1 |
+1 this is a tie breaker for us right now when evaluating ES vs Solr |
+1 |
+1 |
See #6124, which looks like it will handle all field-collapsing requirements, in a distributed manner. |
While neat, is it possible to perform aggregations against all collapsed documents? For example, collapse a set of books on the author field, then aggregate terms in the publisher field, to find the most common publishers by number of distinct authors? |
@thejohnfreeman I imagine #6124 is just the first step, but considering this is a bucket aggregator, what you describe should be possible. Keep an eye on the PR. |
Let me +1 this issue for the last time :) The top_hits aggregation will handle the field collapse requirements and #6124 is the first step. @thejohnfreeman Right now top_hits can only be used as a leaf aggregation. Can your example also be implemented via two nested terms aggregations (first on the author field and then on publisher) and a top_hits aggregation as leaf? |
What about paging? As far as I can tell, there is no way to page agg results. |
@artemredkin Pagination isn't supported yet, but it shouldn't be too difficult to add that. |
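As an illustration of the suggestion above, a search body with two nested `terms` aggregations and a `top_hits` leaf might look like the following, written as a Python dict so it can be serialized to JSON. The aggregation names (`authors`, `publishers`, `top_book`) and the field names `author` and `publisher` are assumptions taken from the example in the question, and whether this shape fully answers that question is left open in the thread.

```python
import json

# Buckets by author, then by publisher within each author, and returns the
# single best hit per (author, publisher) bucket via the top_hits leaf.
request_body = {
    "size": 0,  # skip the ordinary hit list; results come from the buckets
    "aggs": {
        "authors": {
            "terms": {"field": "author"},
            "aggs": {
                "publishers": {
                    "terms": {"field": "publisher"},
                    "aggs": {
                        "top_book": {"top_hits": {"size": 1}},
                    },
                },
            },
        },
    },
}

request_json = json.dumps(request_body, indent=2)
```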
+1 :) |
Cool! |
should I add an issue for pagination? |
Hi @artemredkin we already have issue #6299 for it ;) |
Got it, thanks! |
Is there a master-snapshot version available through maven? I can start on my development till 1.3.0 gets officially released. Also, what would be a likely release date of 1.3.0? |
You can build the 1.3.0 branch. It contains the aggregations feature. |
@vvaradhan 1.3.0-SNAPSHOT is available on Sonatype repo: https://oss.sonatype.org/#nexus-search;gav~org.elasticsearch~elasticsearch~1.3.0-SNAPSHOT~~ HTH |
Released in http://www.elasticsearch.org/downloads/1-3-0/ - #6124 is referenced in release notes. |
No traffic on this in almost a year. Should it be presumed that this issue is closed by #6124 ? |
Correct. |
Ability to collapse on a field. For example, I want the most relevant result from all different report types. Or similarly, the most recent result of each report type. Or maybe, I want to de-dup on headline.
So, the sort order would dictate which one from the group is returned. Similar to what is discussed here:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/
From my understanding, it seems that in order for field collapsing to be efficient, the result set must be relatively small.
This is also referred to as "Combine" on some other search products.
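Given how the thread was resolved (the `top_hits` aggregation from #6124), the "most recent result of each report type" case from this description could be expressed roughly as below. This is a sketch, not the maintainers' example: the helper `latest_per_group_body` and the field names `report_type` and `published_at` are hypothetical.

```python
def latest_per_group_body(group_field, date_field, per_group=1):
    """Build a search body returning the newest `per_group` hits per group.

    The sort inside top_hits decides which document represents each group.
    """
    return {
        "size": 0,
        "aggs": {
            "by_group": {
                "terms": {"field": group_field},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": per_group,
                            "sort": [{date_field: {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = latest_per_group_body("report_type", "published_at")
```

Dropping the `sort` entry would fall back to relevance ordering, which corresponds to the "most relevant result from all different report types" case.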