[[common-terms]] === Divide and Conquer
The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms.((("stopwords", "low and high frequency terms"))) Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible.
The match
query accepts ((("cutoff_frequency parameter")))((("match query", "cutoff_frequency parameter")))a cutoff_frequency
parameter, which allows it to
divide the terms in the query string into a low-frequency and high-frequency
group.((("term frequency", "cutoff_frequency parameter in match query"))) The low-frequency group (more-important terms) form the bulk of the
query, while the high-frequency group (less-important terms) is used only for
scoring, not for matching. By treating these two groups differently, we can
gain a real boost of speed on previously slow queries.
.Domain-Specific Stopwords
One of the benefits of cutoff_frequency
is that you get domain-specific
stopwords for free.((("domain specific stopwords")))((("stopwords", "domain specific"))) For instance, a website about movies may use the words
movie, color, black, and white so often that they could be
considered almost meaningless. With the stop
token filter, these domain-specific terms would have to be added to the stopwords list manually. However,
because the cutoff_frequency
looks at the actual frequency of terms in the
index, these words would be classified as high frequency automatically.
Take this query as an example:
<1> Any term that occurs in more than 1% of documents is considered to be high
frequency. The cutoff_frequency
can be specified as a fraction (0.01
)
or as an absolute number (5
).
This query uses the cutoff_frequency
to first divide the query terms into a
low-frequency group (quick
, dead
) and a high-frequency group (and
,
the
). Then, the query is rewritten to produce the following bool
query:
{ "bool": { "must": { <1> "bool": { "should": [ { "term": { "text": "quick" }}, { "term": { "text": "dead" }} ] } }, "should": { <2> "bool": { "should": [ { "term": { "text": "and" }}, { "term": { "text": "the" }} ] } } } }
<1> At least one low-frequency/high-importance term must match. <2> High-frequency/low-importance terms are entirely optional.
The must
clause means that at least one of the low-frequency terms—quick
or dead
—must be present for a document to be considered a
match. All other documents are excluded. The should
clause then looks for
the high-frequency terms and
and the
, but only in the documents collected
by the must
clause. The sole job of the should
clause is to score a
document like Quick _and the_ dead'' higher than
The quick but
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.
Setting the operator parameter to and
would make all low-frequency terms
required, and score documents that contain all high-frequency terms higher.
However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be
required, you should use a bool
query instead. As we saw in
<>, this is already an efficient query.
==================================================
==== Controlling Precision
The minimum_should_match
parameter can be combined with cutoff_frequency
but it applies to only the low-frequency terms.((("stopwords", "low and high frequency terms", "controlling precision")))((("minimum_should_match parameter", "controlling precision"))) This query:
{ "match": { "text": { "query": "Quick and the dead", "cutoff_frequency": 0.01, "minimum_should_match": "75%" } }
would be rewritten as follows:
{ "bool": { "must": { "bool": { "should": [ { "term": { "text": "quick" }}, { "term": { "text": "dead" }} ], "minimum_should_match": 1 <1> } }, "should": { <2> "bool": { "should": [ { "term": { "text": "and" }}, { "term": { "text": "the" }} ] } } } }
<1> Because there are only two terms, the original 75% is rounded down
to 1
, that is: one out of two low-terms must match.
<2> The high-frequency terms are still optional and used only for scoring.
==== Only High-Frequency Terms
An or
query for high-frequency((("stopwords", "low and high frequency terms", "only high frequency terms"))) terms only—``To be, or not to be''—is
the worst case for performance. It is pointless to score all the
documents that contain only one of these terms in order to return just the top
10 matches. We are really interested only in documents in which the terms all occur
together, so in the case where there are no low-frequency terms, the query is
rewritten to make all high-frequency terms required:
{ "bool": { "must": [ { "term": { "text": "to" }}, { "term": { "text": "be" }}, { "term": { "text": "or" }}, { "term": { "text": "not" }}, { "term": { "text": "to" }}, { "term": { "text": "be" }} ] } }
==== More Control with Common Terms
While the high/low frequency functionality in the match
query is useful,
sometimes you want more control((("stopwords", "low and high frequency terms", "more control over common terms"))) over how the high- and low-frequency groups
should be handled. The match
query exposes a subset of the
functionality available in the common
terms query.((("common terms query")))
For instance, we could make all low-frequency terms required, and score only documents that have 75% of all high-frequency terms with a query like this:
{ "common": { "text": { "query": "Quick and the dead", "cutoff_frequency": 0.01, "low_freq_operator": "and", "minimum_should_match": { "high_freq": "75%" } } } }
See the http://bit.ly/1wdS2Qo[`common` terms query] reference page for more options.