#### Most Fields

Full-text search is a battle between recall (returning all the documents that are relevant) and precision (not returning irrelevant documents). The goal is to present the user with the most relevant documents on the first page of results.

To improve recall, we cast the net wide: we include not only documents that match the user's search terms exactly, but also documents that we believe to be pertinent to the query. If a user searches for "quick brown fox", a document that contains `fast foxes` may well be a reasonable result to return.

If the only pertinent document that we have is the one containing `fast foxes`, it will appear at the top of the results list. But if we have 100 documents that contain `quick brown fox`, then the `fast foxes` document becomes less relevant, and we want it to appear further down the list. After including many potential matches, we need to ensure that the best ones rise to the top.

A common technique for fine-tuning full-text relevance is to index the same text in multiple ways, each of which provides a different relevance signal. The main field contains terms in their broadest-matching form, so that it matches as many documents as possible. For instance, we could do the following:

* Use a stemmer to index `jumps`, `jumping`, and `jumped` as their root form: `jump`. Then, when the user searches for `jumped`, we can still match documents containing `jumping`.
* Include synonyms such as `jump`, `leap`, and `hop`.
* Remove diacritics, or accents: for example, `ésta`, `está`, and `esta` would all be indexed as `esta`.

However, if we have two documents, one of which contains `jumped` and the other `jumping`, a user who searched for `jumped` would expect the first document to rank higher, because it contains exactly what was typed in.

We can achieve this by indexing the same text in other fields to provide more precise matching. One field may contain the unstemmed version, another the original word with diacritics, and a third might use shingles to provide information about [word proximity](https://www.elastic.co/guide/en/elasticsearch/guide/current/proximity-matching.html). These other fields act as signals that increase the relevance score of each matching document. The more fields that match, the higher the relevance.

A document is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it gets extra points that push it up the results list.

We discuss synonyms, word proximity, partial matching, and other potential signals later in the book. Here we use the simple example of stemmed and unstemmed fields to illustrate the technique.

#### Multifield Mapping

The first thing to do is to set up our field to be indexed twice: once in a stemmed form and once in an unstemmed form. To do this, we use multifields, which we introduced in String Sorting and Multifields:

```Javascript
DELETE /my_index

PUT /my_index
{
    "settings": { "number_of_shards": 1 }, <1>
    "mappings": {
        "my_type": {
            "properties": {
                "title": { <2>
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "std":   { <3>
                            "type":     "string",
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}
```
// SENSE: 110_Multi_Field_Search/30_Most_fields.json

<1> See [Relevance Is Broken](/100_Full_Text_Search/35_Relevance_is_broken.md).
<2> The `title` field is stemmed by the `english` analyzer.
<3> The `title.std` field uses the `standard` analyzer and so is not stemmed.

Next we index some documents:

```Javascript
PUT /my_index/my_type/1
{ "title": "My rabbit jumps" }

PUT /my_index/my_type/2
{ "title": "Jumping jack rabbits" }
```
// SENSE: 110_Multi_Field_Search/30_Most_fields.json

Here is a simple `match` query on the `title` field for `jumping rabbits`:

```Javascript
GET /my_index/_search
{
   "query": {
        "match": {
            "title": "jumping rabbits"
        }
    }
}
```
// SENSE: 110_Multi_Field_Search/30_Most_fields.json

Thanks to the `english` analyzer, this becomes a query for the two stemmed terms `jump` and `rabbit`. The `title` field of both documents contains both of those terms, so both documents receive the same score:

```Javascript
{
  "hits": [
     {
        "_id": "1",
        "_score": 0.42039964,
        "_source": {
           "title": "My rabbit jumps"
        }
     },
     {
        "_id": "2",
        "_score": 0.42039964,
        "_source": {
           "title": "Jumping jack rabbits"
        }
     }
  ]
}
```

If we were to query just the `title.std` field, only document 2 would match. However, if we query both fields and combine their scores by using the `bool` query, then both documents match (thanks to the `title` field) and document 2 scores higher (thanks to the `title.std` field):

```Javascript
GET /my_index/_search
{
   "query": {
        "multi_match": {
            "query":  "jumping rabbits",
            "type":   "most_fields", <1>
            "fields": [ "title", "title.std" ]
        }
    }
}
```
// SENSE: 110_Multi_Field_Search/30_Most_fields.json

<1> We want to combine the scores from all matching fields, so we use the `most_fields` type. This causes the `multi_match` query to wrap the two field clauses in a `bool` query instead of a `dis_max` query.

```Javascript
{
  "hits": [
     {
        "_id": "2",
        "_score": 0.8226396, <1>
        "_source": {
           "title": "Jumping jack rabbits"
        }
     },
     {
        "_id": "1",
        "_score": 0.10741998, <1>
        "_source": {
           "title": "My rabbit jumps"
        }
     }
  ]
}
```

<1> Document 2 now scores much higher than document 1.

We use the broad-matching `title` field to include as many documents as possible, which increases recall, and we use the `title.std` field as a signal to push the most relevant results to the top, which improves precision on the first page.

The contribution of each field to the final score can be controlled by specifying custom `boost` values. For instance, we could boost the `title` field to make it the most important field, which also reduces the effect of any other signal fields:

```Javascript
GET /my_index/_search
{
   "query": {
        "multi_match": {
            "query":       "jumping rabbits",
            "type":        "most_fields",
            "fields":      [ "title^10", "title.std" ] <1>
        }
    }
}
```
// SENSE: 110_Multi_Field_Search/30_Most_fields.json

<1> The `boost` value of `10` on the `title` field makes that field relatively much more important than the `title.std` field.
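
The text says that `most_fields` makes `multi_match` wrap one clause per field in a `bool` query. As a rough sketch of what that means, the `most_fields` query on `title` and `title.std` could be written by hand as below. It reuses the index and field names from the example and assumes the rewrite is a plain `bool` with one `match` clause per field in `should`; the exact scores it returns may differ from those shown.

```Javascript
GET /my_index/_search
{
   "query": {
        "bool": {
            "should": [
                { "match": { "title":     "jumping rabbits" }},
                { "match": { "title.std": "jumping rabbits" }}
            ]
        }
    }
}
```

Because every `should` clause that matches adds to the score, document 2 (which matches both `title` and `title.std`) outranks document 1 (which matches only `title`). A `dis_max` query, by contrast, would keep only the score of the best-matching field.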