From 308d87343efa54a949184777a1195d219f1e368d Mon Sep 17 00:00:00 2001 From: looly Date: Sat, 18 Oct 2014 19:46:31 +0800 Subject: [PATCH] Finished 5.0 --- 050_Search/00_Intro.asciidoc | 57 ------- 050_Search/00_Intro.md | 60 +++----- 050_Search/05_Empty_search.asciidoc | 114 -------------- 050_Search/10_Multi_index_multi_type.asciidoc | 61 -------- 050_Search/15_Pagination.asciidoc | 53 ------- 050_Search/20_Query_string.asciidoc | 144 ------------------ 6 files changed, 18 insertions(+), 471 deletions(-) delete mode 100755 050_Search/00_Intro.asciidoc delete mode 100755 050_Search/05_Empty_search.asciidoc delete mode 100755 050_Search/10_Multi_index_multi_type.asciidoc delete mode 100755 050_Search/15_Pagination.asciidoc delete mode 100755 050_Search/20_Query_string.asciidoc diff --git a/050_Search/00_Intro.asciidoc b/050_Search/00_Intro.asciidoc deleted file mode 100755 index ad3793d..0000000 --- a/050_Search/00_Intro.asciidoc +++ /dev/null @@ -1,57 +0,0 @@ -[[search]] -== Searching – the basic tools - -So far, we have learned how to use Elasticsearch as a simple NoSQL-style -distributed document store -- we can throw JSON documents at Elasticsearch and -retrieve each one by ID. But the real power of Elasticsearch lies in its -ability to make sense out of chaos -- to turn Big Data into Big Information. - -This is the reason that we use structured JSON documents, rather than -amorphous blobs of data. Elasticsearch doesn't only _store_ the document, it -also _indexes_ the content of the document in order to make it searchable. - -*Every field in a document is indexed and can be queried*. And it's not just -that. During a single query, Elasticsearch can use *all* of these indices, to -return results at breath-taking speed. That's something that you could never -consider doing with a traditional database. 
- -A _search_ can be: - -* a structured query on concrete fields like `gender` or `age`, sorted by - a field like `join_date`, similar to the type of query that you could construct - in SQL - -* a full text query, which finds all documents matching the search keywords, - and returns them sorted by _relevance_ - -* or a combination of the two - -While many searches will just work out of the box, to use Elasticsearch to -its full potential you need to understand three subjects: - -[horizontal] - -_Mapping_:: how the data in each field is interpreted -_Analysis_:: how full text is processed to make it searchable -_Query DSL_:: the flexible, powerful query language used by Elasticsearch - -Each of the above is a big subject in its own right and we explain them in -detail in <>. The chapters in this section will introduce the -basic concepts of all three -- just enough to help you to get an overall -understanding of how search works. - -We will start by explaining the `search` API in its simplest form. - -.Test data - -**** - -The documents that we will use for test purposes in this chapter can be found -in this gist: https://gist.github.com/clintongormley/8579281 - -You can copy the commands and paste them into your shell in order to follow -along with this chapter. - -Alternatively, link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open in Sense]. - -**** diff --git a/050_Search/00_Intro.md b/050_Search/00_Intro.md index 1782a7b..78f7058 100755 --- a/050_Search/00_Intro.md +++ b/050_Search/00_Intro.md @@ -2,59 +2,35 @@ == Searching – the basic tools ## Search: the basic tools -So far, we have learned how to use Elasticsearch as a simple NoSQL-style -distributed document store -- we can throw JSON documents at Elasticsearch and -retrieve each one by ID. But the real power of Elasticsearch lies in its -ability to make sense out of chaos -- to turn Big Data into Big Information. 
+So far, we have learned how to use Elasticsearch as a simple NoSQL-style distributed document store: we can throw JSON documents at Elasticsearch and retrieve each one by ID. But the real power of Elasticsearch lies in its ability to make sense of chaotic data, turning Big Data into Big Information. -So far, we have learned how to use Elasticsearch as a simple NoSQL-style distributed document store: we can throw JSON documents at Elasticsearch and retrieve each one by ID. +This is why we use structured JSON documents rather than unstructured binary data. Elasticsearch doesn't only **store** the document, it also **indexes** the content of the document in order to make it searchable. -This is the reason that we use structured JSON documents, rather than -amorphous blobs of data. Elasticsearch doesn't only _store_ the document, it -also _indexes_ the content of the document in order to make it searchable. - -*Every field in a document is indexed and can be queried*. And it's not just -that. During a single query, Elasticsearch can use *all* of these indices, to -return results at breath-taking speed. That's something that you could never -consider doing with a traditional database. +**Every field in a document is indexed and can be queried**. And that's not all. During a single query, Elasticsearch can use **all** of these indices to return results at great speed. That's something you could never consider doing with a traditional database. A _search_ can be: +A **search** can be: -* a structured query on concrete fields like `gender` or `age`, sorted by - a field like `join_date`, similar to the type of query that you could construct - in SQL - -* a full text query, which finds all documents matching the search keywords, - and returns them sorted by _relevance_ - -* or a combination of the two - -While many searches will just work out of the box, to use Elasticsearch to -its full potential you need to understand three subjects: - -[horizontal] +* a structured query on concrete fields like `gender` or `age`, sorted by a field like `join_date`, similar to the type of query that you could construct in SQL +* a full-text query, which finds all documents matching the search keywords and returns them sorted by **relevance** +* or a combination of the two -_Mapping_:: how the data in each field is interpreted -_Analysis_:: how full text is processed to make it searchable -_Query DSL_:: the flexible, powerful query language used by Elasticsearch +While many searches work out of the box, to use Elasticsearch to its full potential you need to understand three concepts: -Each of the above is a big subject in its own right and we explain them in -detail in <>. 
The chapters in this section will introduce the -basic concepts of all three -- just enough to help you to get an overall -understanding of how search works. -We will start by explaining the `search` API in its simplest form. +| Concept | Explanation | +| ------------------------------- | ----------------------------------------- | +| **Mapping** | How the data in each field is interpreted | +| **Analysis** | How full text is processed to make it searchable | +| **Query DSL** | The flexible, powerful query language used by Elasticsearch | -.Test data -**** +Each of the above is a big subject in its own right, and we explain them in detail in "Search in Depth". The chapters in this section introduce the basic concepts of all three, just enough to give you an overall understanding of how search works. -The documents that we will use for test purposes in this chapter can be found -in this gist: https://gist.github.com/clintongormley/8579281 +We will start by explaining the `search` API in its simplest form. -You can copy the commands and paste them into your shell in order to follow -along with this chapter. +> ### Test data -Alternatively, link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open in Sense]. +> The documents that we use for test purposes in this chapter can be found in this gist: [https://gist.github.com/clintongormley/8579281](https://gist.github.com/clintongormley/8579281) -**** +> You can copy the commands and paste them into your shell in order to follow along with this chapter. diff --git a/050_Search/05_Empty_search.asciidoc b/050_Search/05_Empty_search.asciidoc deleted file mode 100755 index a71ebf3..0000000 --- a/050_Search/05_Empty_search.asciidoc +++ /dev/null @@ -1,114 +0,0 @@ -[[empty-search]] -=== The empty search - -The most basic form of the search API is the _empty search_ which doesn't -specify any query, but simply returns all documents in all indices in the -cluster: - -[source,js] --------------------------------------------------- -GET /_search --------------------------------------------------- -// SENSE: 050_Search/05_Empty_search.json - -The response (edited for brevity) looks something like this: - -[source,js] --------------------------------------------------- -{ - "hits" : { - "total" : 14, - "hits" : [ - { - "_index": "us", - "_type": "tweet", - "_id": "7", - "_score": 1, - "_source": { - "date": "2014-09-17", - "name": "John Smith", 
"tweet": "The Query DSL is really powerful and flexible", - "user_id": 2 - } - }, - ... 9 RESULTS REMOVED ... - ], - "max_score" : 1 - }, - "took" : 4, - "_shards" : { - "failed" : 0, - "successful" : 10, - "total" : 10 - }, - "timed_out" : false -} --------------------------------------------------- - - -==== `hits` - -The most important section of the response is `hits`, which contains the -`total` number of documents that matched our query, and a `hits` array -containing the first 10 of those matching documents -- the results. - -Each result in the `hits` array contains the `_index`, `_type` and `_id` of -the document, plus the `_source` field. This means that the whole document is -immediately available to us directly from the search results. This is unlike -other search engines which return just the document ID, requiring you to fetch -the document itself in a separate step. - -Each element also has a `_score`. This is the _relevance score_, which is a -measure of how well the document matches the query. By default, results are -returned with the most relevant documents first; that is, in descending order -of `_score`. In this case, we didn't specify any query so all documents are -equally relevant, hence the neutral `_score` of `1` for all results. - -The `max_score` value is the highest `_score` of any document that matches our -query. - -==== `took` - -The `took` value tells us how many milliseconds the entire search request took -to execute. - -==== `shards` - -The `_shards` element tells us the `total` number of shards that were involved -in the query and, of them, how many were `successful` and how many `failed`. -We wouldn't normally expect shards to fail, but it can happen. If we were to -suffer a major disaster in which we lost both the primary and the replica copy -of the same shard, there would be no copies of that shard available to respond -to search requests. 
In this case, Elasticsearch would report the shard as -`failed`, but continue to return results from the remaining shards. - -==== `timeout` - -The `timed_out` value tells us whether the query timed out or not. By -default, search requests do not timeout. If low response times are more -important to you than complete results, you can specify a `timeout` as `10` -or `"10ms"` (10 milliseconds), or `"1s"` (1 second): - -[source,js] --------------------------------------------------- -GET /_search?timeout=10ms --------------------------------------------------- - - -Elasticsearch will return any results that it has managed to gather from -shards which responded before the request timed out. - -.Timeout is not a circuit breaker -[WARNING] -================================================ - -It should be noted that this `timeout` does not halt the execution of the -query, it merely tells the coordinating node to return the results collected -_so far_ and to close the connection. In the background, other shards may -still be processing the query even though results have been sent. - -Use the timeout because it is important to your SLA, not because you want -to abort the execution of long running queries. - -================================================ - diff --git a/050_Search/10_Multi_index_multi_type.asciidoc b/050_Search/10_Multi_index_multi_type.asciidoc deleted file mode 100755 index 65d52a3..0000000 --- a/050_Search/10_Multi_index_multi_type.asciidoc +++ /dev/null @@ -1,61 +0,0 @@ -[[multi-index-multi-type]] -=== Multi-index, multi-type - -Did you notice that the results from the <> above -contained documents of different types -- `user` and `tweet` -- from two -different indices -- `us` and `gb`? - -By not limiting our search to a particular index or type, we have searched -across *all* documents in the cluster. 
Elasticsearch forwarded the search -request in parallel to a primary or replica of every shard in the cluster, -gathered the results to select the overall top ten, and returned them to us. - -Usually, however, you will want to search within one or more specific indices, -and probably one or more specific types. We can do this by specifying the -index and type in the URL, as follows: - -[horizontal] -`/_search`:: - - search all types in all indices - -`/gb/_search`:: - - search all types in the `gb` index - -`/gb,us/_search`:: - - search all types in the `gb` and `us` indices - -`/g*,u*/_search`:: - - search all types in any indices beginning with `g` or beginning with `u` - -`/gb/user/_search`:: - - search type `user` in the `gb` index - -`/gb,us/user,tweet/_search`:: - - search types `user` and `tweet` in the `gb` and `us` indices - -`/_all/user,tweet/_search`:: - - search types `user` and `tweet` in all indices - - -When you search within a single index, Elasticsearch forwards the search -request to a primary or replica of every shard in that index, then gathers the -results from each shard. Searching within multiple indices works in exactly -the same way -- there are just more shards involved. - -[IMPORTANT] -================================================ - -Searching one index which has 5 primary shards is *exactly equivalent* to -searching 5 indices which have one primary shard each. - -================================================ - -Later, you will see how this simple fact makes it easy to scale flexibly -as your requirements change. diff --git a/050_Search/15_Pagination.asciidoc b/050_Search/15_Pagination.asciidoc deleted file mode 100755 index 4f9e11e..0000000 --- a/050_Search/15_Pagination.asciidoc +++ /dev/null @@ -1,53 +0,0 @@ -[[pagination]] -=== Pagination - -Our <> told us that there are 14 documents in the -cluster which match our (empty) query. But there were only 10 documents in -the `hits` array. How can we see the other documents? 
- -In the same way as SQL uses the `LIMIT` keyword to return a single ``page'' of -results, Elasticsearch accepts the `from` and `size` parameters: - -[horizontal] -`size`:: How many results should be returned, defaults to `10` -`from`:: How many initial results should be skipped, defaults to `0` - -If you wanted to show 5 results per page, then pages 1 to 3 -could be requested as: - -[source,js] --------------------------------------------------- -GET /_search?size=5 -GET /_search?size=5&from=5 -GET /_search?size=5&from=10 --------------------------------------------------- -// SENSE: 050_Search/15_Pagination.json - - -Beware of paging too deep or requesting too many results at once. Results are -sorted before being returned. But remember that a search request usually spans -multiple shards. Each shard generates its own sorted results, which then need -to be sorted centrally to ensure that the overall order is correct. - -.Deep paging in distributed systems -**** - -To understand why deep paging is problematic, let's imagine that we are -searching within a single index with 5 primary shards. When we request the -first page of results (results 1 to 10), each shard produces its own top 10 -results and returns them to the _requesting node_, which then sorts all 50 -results in order to select the overall top 10. - -Now imagine that we ask for page 1,000 -- results 10,001 to 10,010. Everything -works in the same way except that each shard has to produce its top 10,010 -results. The requesting node then sorts through all 50,050 results and -discards 50,040 of them! - -You can see that, in a distributed system, the cost of sorting results -grows exponentially the deeper we page. There is a very good reason -why web search engines don't return more than 1,000 results for any query. - -**** - -TIP: In <> we will explain how you *can* retrieve large numbers of -documents efficiently. 
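Reviewer note on the removed pagination text: the `from`/`size` arithmetic and the deep-paging numbers it quotes can be checked with a short sketch. The helper names `page_params` and `deep_paging_cost` are hypothetical and are not part of any Elasticsearch client. The cost model is only the simplified one described in the text, where each shard returns its top `from + size` hits to the requesting node.

```python
def page_params(page: int, size: int = 10) -> dict:
    """Translate a 1-based page number into `from`/`size` search parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


def deep_paging_cost(from_: int, size: int = 10, shards: int = 5) -> tuple:
    """Return (results gathered by the requesting node, results discarded).

    Simplified model from the text: every shard produces its own top
    `from_ + size` results, and only `size` of the gathered results are kept.
    """
    per_shard = from_ + size
    gathered = per_shard * shards
    return gathered, gathered - size


# Pages 1 to 3 at 5 results per page, as in the removed example requests.
print([page_params(p, size=5) for p in (1, 2, 3)])

# First page (results 1 to 10) and results 10,001 to 10,010 on 5 shards.
print(deep_paging_cost(0))
print(deep_paging_cost(10_000))
```

Under this model the number of gathered results is `shards * (from + size)`, so it grows in proportion to how deep the page is.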
diff --git a/050_Search/20_Query_string.asciidoc b/050_Search/20_Query_string.asciidoc deleted file mode 100755 index a763b23..0000000 --- a/050_Search/20_Query_string.asciidoc +++ /dev/null @@ -1,144 +0,0 @@ -[[search-lite]] -=== Search _Lite_ - -There are two forms of the `search` API: a ``lite'' _query string_ version -that expects all its parameters to be passed in the query string, and the full -_request body_ version that expects a JSON request body and uses a -rich search language called the query DSL. - -The query string search is useful for running _ad hoc_ queries from the -command line. For instance this query finds all documents of type `tweet` that -contain the word `"elasticsearch"` in the `tweet` field: - -[source,js] --------------------------------------------------- -GET /_all/tweet/_search?q=tweet:elasticsearch --------------------------------------------------- -// SENSE: 050_Search/20_Query_string.json - -The next query looks for `"john"` in the `name` field and `"mary"` in the -`tweet` field. The actual query is just: - - +name:john +tweet:mary - -but the _percent encoding_ needed for query string parameters makes it appear -more cryptic than it really is: - -[source,js] --------------------------------------------------- -GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary --------------------------------------------------- -// SENSE: 050_Search/20_Query_string.json - - -The `"+"` prefix indicates conditions which _must_ be satisfied for our query to -match. Similarly a `"-"` prefix would indicate conditions that _must not_ -match. All conditions without a `+` or `-` are optional -- the more that match, -the more relevant the document. 
- -[[all-field-intro]] -==== The `_all` field - -This simple search returns all documents which contain the word `"mary"`: - -[source,js] --------------------------------------------------- -GET /_search?q=mary --------------------------------------------------- -// SENSE: 050_Search/20_All_field.json - - -In the previous examples, we searched for words in the `tweet` or -`name` fields. However, the results from this query mention `"mary"` in -three different fields: - -* a user whose name is "Mary" -* six tweets by "Mary" -* one tweet directed at "@mary" - -How has Elasticsearch managed to find results in three different fields? - -When you index a document, Elasticsearch takes the string values of all of -its fields and concatenates them into one big string which it indexes as -the special `_all` field. For example, when we index this document: - -[source,js] --------------------------------------------------- -{ - "tweet": "However did I manage before Elasticsearch?", - "date": "2014-09-14", - "name": "Mary Jones", - "user_id": 1 -} --------------------------------------------------- - - -it's as if we had added an extra field called `_all` with the value: - -[source,js] --------------------------------------------------- -"However did I manage before Elasticsearch? 2014-09-14 Mary Jones 1" --------------------------------------------------- - - -The query string search uses the `_all` field unless another -field name has been specified. - -TIP: The `_all` field is a useful feature while you are getting started with -a new application. Later, you will find that you have more control over -your search results if you query specific fields instead of the `_all` -field. When the `_all` field is no longer useful to you, you can -disable it, as explained in <>. 
- -[[query-string-query]] -==== More complicated queries - -The next query searches for tweets: - -* where the `name` field contains `"mary"` or `"john"` -* and where the `date` is greater than `2014-09-10` -* and which contain either of the words `"aggregations"` or `"geo"` in the - `_all` field - -[source,js] --------------------------------------------------- -+name:(mary john) +date:>2014-09-10 +(aggregations geo) --------------------------------------------------- -// SENSE: 050_Search/20_All_field.json - -which, as a properly encoded query string looks like the slightly less -readable: - -[source,js] --------------------------------------------------- -?q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo) --------------------------------------------------- - -As you can see from the above examples, this _lite_ query string search is -surprisingly powerful. Its query syntax, which is explained in detail in the -{ref}/query-dsl-query-string-query.html#query-string-syntax[Query String Syntax] -reference docs, allows us to express quite complex queries succinctly. This -makes it great for throwaway queries from the command line or during -development. - -However, you can also see that its terseness can make it cryptic and -difficult to debug. And it's fragile -- a slight syntax error in the query -string, such as a misplaced `-`, `:`, `/` or `"` and it will return an error -instead of results. - -Lastly, the query string search allows any user to run potentially slow heavy -queries on any field in your index, possibly exposing private information or -even bringing your cluster to its knees! - -[TIP] -================================================== -For these reasons, we don't recommend exposing query string search directly to -your users, unless they are power users who can be trusted with your data and -with your cluster. 
-================================================== - -Instead, in production we usually rely on the full-featured _request body_ -search API, which does all of the above, plus a lot more. Before we get there -though, we first need to take a look at how our data is indexed in -Elasticsearch. -
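Reviewer note on the removed query-string text: the percent-encoded forms it shows can be reproduced with Python's standard `urllib.parse.quote_plus`. This is a minimal sketch; the wrapper name `encode_query` is hypothetical, and keeping parentheses unescaped through the `safe` argument is an assumption made only to match the second encoded string quoted in the text.

```python
from urllib.parse import quote_plus


def encode_query(q: str, safe: str = "") -> str:
    """Build the `q=` parameter for a query-string search.

    `+` becomes %2B, `:` becomes %3A, `>` becomes %3E and spaces become `+`.
    Characters listed in `safe` are left unescaped.
    """
    return "q=" + quote_plus(q, safe=safe)


# "+name:john +tweet:mary" from the text
print(encode_query("+name:john +tweet:mary"))

# The more complicated query, with parentheses left readable
print(encode_query("+name:(mary john) +date:>2014-09-10 +(aggregations geo)", safe="()"))
```

Without `safe="()"`, the parentheses would be encoded as `%28` and `%29`, which is equally valid in a URL but does not match the string shown in the text.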