Elasticsearch 如何實現查詢/聚合不區分大小寫？ · php開發筆記

## **1、實戰問題** 最近社區里有多個關于區分大小寫的問題：問題1：ES查詢和聚合怎么設置不區分大小寫呢？問題2：ES7.6 如何實現模糊查詢不區分大小寫？主要是如何進行分詞和mapping的一些設置來實現這個效果，自己也嘗試過對setting 和 mapping字段進行設置，都是報錯比較著急，類似的問題，既然有很多同學問到，那么咱們就有必要梳理出完整的思路和方案。這或許是銘毅天下公眾號的使命所在。這個問題不復雜，所以本文會言簡意賅，直擊要害！ ## **2、問題拆解** ### **2.1 拆解一：如果默認分詞方式，能區分大小寫的嗎？** 是的，默認分詞器是Standard 標準分詞器，是不區分大小寫的。官方文檔原理部分：如下的兩張圖很直觀的說明了：標準分詞器的 Token filters 核心組成是：Lower Case Token Filter。 ![](https://img.kancloud.cn/55/70/55702b48705f21451103df2e01cc6354_390x256.png) ![](https://img.kancloud.cn/c0/9a/c09ac57388d00021dc3892ff2d351b74_763x389.png) 什么意思呢？大寫的英文字符會轉換成小寫。 ### **2.2 拆解二：實踐 Demo 驗證** ``` ~~~javascript DELETE test_003 PUT test_003 { "mappings": { "properties": { "title":{ "type":"text", "analyzer": "standard" }, "keyword":{ "type":"keyword" } } } } POST test_003/_bulk {"index":{"_id":1}} { "city": "New York"} {"index":{"_id":2}} { "city": "new York"} {"index":{"_id":3}} { "city": "New york"} {"index":{"_id":4}} { "city": "NEW YORK"} {"index":{"_id":5}} { "city": "Seattle"} POST test_003/_analyze { "text": "New york", "analyzer": "standard" } POST test_003/_search { "query": { "match_phrase":{ "city":"new york" } } } ~~~ ``` match\_phrase 檢索返回結果非常明確：\_id = 1,2,3,4 的數據都被召回。這里初步結論是：standard 標準默認分詞器可以實現區分大小寫。但是，我們再看一下聚合呢？ ``` ~~~javascript GET test_003/_search { "size": 0, "aggs": { "cities": { "terms": { "field": "city.keyword" } } } } ~~~ ``` 返回結果如下： ``` ~~~javascript "aggregations" : { "cities" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "NEW YORK", "doc_count" : 1 }, { "key" : "New York", "doc_count" : 1 }, { "key" : "New york", "doc_count" : 1 }, { "key" : "Seattle", "doc_count" : 1 }, { "key" : "new York", "doc_count" : 1 } ] } } ~~~ ``` 這里最核心的是： * Mapping 設置是：multi-fields。 * 聚合走的是 keyword 類型了，不涉及分詞器：standard 了。既然提到了 keyword，我們進一步看： ``` ~~~javascript POST test_003/_search { "query": { "term":{ "city.keyword":"new york" } } } ~~~ ``` 執行精準匹配后，召回結果為空。怎么解釋呢？keyword 類型屬于精準匹配，也就是說：單純的keyword 類型沒法實現大小寫區分。進一步小結：我們上面的組合multi-field 方式，并沒有解決檢索和聚合區分大小寫的問題？ multi-field 都搞不定，那還有招嗎？別急，我們慢慢來...... 這時候得思考：需要在 Mapping 階段做文章了。核心原理：把所有都轉為小寫，寫入時候設置 Mapping——設置filter過濾：小寫過濾。這個是一個我們過往文章沒有提及的知識點 normalizer，希望你把它看完并掌握。 ## **3、解決方案** 先給出實現，后面講原理。 ``` ~~~javascript PUT caseinsensitive { "settings": { "analysis": { "normalizer": { "lowercase_normalizer": { "type": "custom", "char_filter": [], "filter": [ "lowercase" ] } } } }, "mappings": { "properties": { "city": { "type": "keyword", "normalizer": "lowercase_normalizer" } } } } POST caseinsensitive/_bulk {"index":{"_id":1}} { "city": "New York"} {"index":{"_id":2}} { "city": "new York"} {"index":{"_id":3}} { "city": "New york"} {"index":{"_id":4}} { "city": "NEW YORK"} {"index":{"_id":5}} { "city": "Seattle"} GET caseinsensitive/_search { "query": { "bool": { "filter": { "term": { "city": "NEW YORK" } } } } } ~~~ ``` 此時的檢索返回結果是：\_id = 1,2,3,4 文檔都被召回。注意，我們使用了 terms 檢索。 ``` ~~~javascript GET caseinsensitive/_search { "size": 0, "aggs": { "cities": { "terms": { "field": "city" } } } } ~~~ ``` 返回結果是： ``` ~~~javascript "aggregations" : { "cities" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "new york", "doc_count" : 4 }, { "key" : "seattle", "doc_count" : 1 } ] } } ~~~ ``` 以上 new york 4種不同大小寫的全都聚合到了一起，這是我們期望的結果。 ## **4、解決方案的原理解讀** 核心的核心是我們使用了：normalizer。這個概念咱們之前分詞的文章都沒有提及，這里要普及一下。官方解讀如下： * The normalizer property of keyword fields is similar to analyzer except that it guarantees that the analysis chain produces a single token. * The normalizer is applied prior to indexing the keyword, as well as at search-time when the keyword field is searched via a query parser such as the match query or via a term-level query such as the term query. 核心點如下： * 第一：normalizer是 keyword的一個屬性，類似 analyzer分詞器的功能，不同的地方在于：可以對 keyword生成的單一 Term再做進一步的處理。 * 第二：normalizer 在 keyword 類型數據索引化之前被使用，同時在 match 或者 term 類型檢索階段也能被使用。剛才提及的進一步處理，反映到我們的解決方案上：就是可以做小寫 lowercase 轉換。由于寫入階段和檢索階段：normalizer 都生效，所以就實現了我們想要的不區分大小寫的結果。 ## **5、小結** 如果官方文檔熟悉，我們的示例，實際就是官方文檔：normalizer 的舉例。中間的 filter 我們設置了小寫，當然也可以有其他的設置，需要結合業務場景靈活使用。歡迎大家留言說一下類似問題的其他不同實現方案。和你一起，死磕 Elasticsearch！