#### Field-Centric Queries

All three of the problems mentioned above stem from `most_fields` being _field-centric_ rather than _term-centric_: it looks for the most matching _fields_, when what we are really interested in is the most matching _terms_.

> Note: `best_fields` is also field-centric, so it suffers from similar problems.

First we'll look at why these problems exist, and then at how to solve them.

##### Problem 1: Matching the Same Word in Multiple Fields

Consider how the `most_fields` query is executed: Elasticsearch generates a separate `match` query for each field and then wraps these queries in an outer `bool` query.

We can see this by passing the query to the `validate-query` API:

```Javascript
GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":  "Poland Street W1V",
      "type":   "most_fields",
      "fields": [ "street", "city", "country", "postcode" ]
    }
  }
}
```

It yields the following explanation:

    (street:poland   street:street   street:w1v)
    (city:poland     city:street     city:w1v)
    (country:poland  country:street  country:w1v)
    (postcode:poland postcode:street postcode:w1v)

You can see that a document matching just the word `poland` in two fields could score higher than a document matching both `poland` and `street` in one field.

##### Problem 2: Trimming the Long Tail

In [Controlling Precision](../100_Full_Text_Search/15_Combining_queries.md), we discussed using the `and` operator or the `minimum_should_match` parameter to trim the long tail of barely relevant documents. Perhaps we could try this:

```Javascript
{
  "query": {
    "multi_match": {
      "query":    "Poland Street W1V",
      "type":     "most_fields",
      "operator": "and",
      "fields":   [ "street", "city", "country", "postcode" ]
    }
  }
}
```

Setting `operator` to `and` requires all terms to be present.

However, with `best_fields` or `most_fields`, these parameters are passed down to the generated `match` queries. The explanation for this query (again obtained from the `validate-query` API) is:

    (+street:poland   +street:street   +street:w1v)
    (+city:poland     +city:street     +city:w1v)
    (+country:poland  +country:street  +country:w1v)
    (+postcode:poland +postcode:street +postcode:w1v)

In other words, with the `and` operator all words must appear _in the same field_, which is clearly wrong! It is unlikely that any document would match this query.

##### Problem 3: Term Frequencies

In [What is Relevance](https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html), we explained that the default similarity algorithm used to calculate the relevance score for each term is TF/IDF:

* Term frequency: The more often a term appears in a field in a single document, the more relevant the document.
* Inverse document frequency: The more often a term appears in a field across all documents in the index, the less relevant the term.

When searching against multiple fields, TF/IDF can produce some surprising results.
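Because inverse document frequency is computed per field, the same term can be common in one field and rare in another. The following is a minimal sketch of that effect, not Elasticsearch's actual scoring code. The toy documents, the `docFreq` and `idf` helpers, and the classic Lucene-style formula `1 + ln(numDocs / (docFreq + 1))` are all assumptions made for illustration.

```javascript
// Toy index (hypothetical data): each document has separate name fields.
const docs = [
  { first_name: "peter", last_name: "smith" },
  { first_name: "peter", last_name: "jones" },
  { first_name: "mary",  last_name: "smith" },
  { first_name: "john",  last_name: "smith" },
  { first_name: "smith", last_name: "williams" },
];

// Number of documents whose `field` contains `term`.
function docFreq(field, term) {
  return docs.filter((doc) => doc[field].split(/\s+/).includes(term)).length;
}

// Assumed classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)).
// Real scoring depends on the Elasticsearch version and similarity settings.
function idf(field, term) {
  return 1 + Math.log(docs.length / (docFreq(field, term) + 1));
}

// Print the IDF of each (field, term) pair for comparison.
const pairs = [
  ["first_name", "peter"],
  ["last_name", "smith"],
  ["first_name", "smith"],
];
for (const [field, term] of pairs) {
  console.log(`${field}:${term}`, idf(field, term).toFixed(3));
}
```

In this toy index, `smith` gets a higher IDF in `first_name` than in `last_name`, simply because fewer documents contain it there.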
Consider the example of searching for "Peter Smith" using the `first_name` and `last_name` fields. Peter is a common first name and Smith is a common last name, so both have low IDFs. But what if the index contains another person named Smith Williams? Smith is very rare as a first name, so its IDF will be high!

A simple query like the following may well rank Smith Williams above Peter Smith (that is, the Smith Williams document scores higher), even though Peter Smith is clearly the better match:

```Javascript
{
  "query": {
    "multi_match": {
      "query":  "Peter Smith",
      "type":   "most_fields",
      "fields": [ "*_name" ]
    }
  }
}
```

The high IDF of `smith` in the `first_name` field can overwhelm the two low IDFs of `peter` in the `first_name` field and `smith` in the `last_name` field.

##### Solution

These problems exist only because we are dealing with multiple fields. If we combined all of these fields into a single field, the problems would vanish. We could achieve this by adding a `full_name` field to the `person` document:

```Javascript
{
  "first_name": "Peter",
  "last_name":  "Smith",
  "full_name":  "Peter Smith"
}
```

When querying just the `full_name` field:

* Documents with more matching words would beat documents with the same word repeated.
* The `minimum_should_match` and `operator` parameters would work as expected.
* The inverse document frequencies of first and last names would be combined, so it would no longer matter whether Smith is a first name or a last name.

Although this approach works, we don't want to store redundant data. Instead, Elasticsearch offers two solutions, one at index time and one at search time, which the next section discusses.
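To make the effect of a combined field on the `and` operator concrete, here is a minimal sketch that contrasts field-centric matching with matching against one concatenated field. It only imitates the matching logic described above and is not how Elasticsearch is implemented. The sample address, the whitespace `analyze` function, and the helpers `matchesFieldCentric` and `matchesCombined` are hypothetical.

```javascript
// Hypothetical address document, for illustration only.
const doc = {
  street: "Poland Street",
  city: "London",
  country: "United Kingdom",
  postcode: "W1V 3DG",
};

// Naive analyzer (assumption): lowercase and split on whitespace.
function analyze(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Field-centric `and`: at least one single field must contain every term.
function matchesFieldCentric(doc, fields, query) {
  const terms = analyze(query);
  return fields.some((field) => {
    const tokens = analyze(doc[field]);
    return terms.every((term) => tokens.includes(term));
  });
}

// Combined field: every term must appear somewhere in the concatenated text.
function matchesCombined(doc, fields, query) {
  const tokens = analyze(fields.map((field) => doc[field]).join(" "));
  return analyze(query).every((term) => tokens.includes(term));
}

const fields = ["street", "city", "country", "postcode"];
console.log(matchesFieldCentric(doc, fields, "Poland Street W1V")); // false
console.log(matchesCombined(doc, fields, "Poland Street W1V"));     // true
```

No single field holds all three terms, so the field-centric check fails, while the concatenated text contains every term and matches.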