Specifying an Analyzer · Elasticsearch 7.x

When Elasticsearch detects a new string field in your documents, it automatically maps it as a full-text string field and analyzes it with the <mark>standard analyzer</mark>. That is not always what you want. You may want a different analyzer, one suited to the language of your data. Sometimes you want a string field to be just a string field: no analysis, with the exact value you pass in indexed directly, such as a user ID, an internal status field, or a tag. To achieve this, we must specify the mapping of these fields manually.

[TOC]

# 1. IK Analyzer

The default ES analyzer does not recognize Chinese words; it simply splits the text into one token per character.

```json
GET /_analyze
{
  "text": "測試單詞"
}

# Response:
{
  "tokens" : [
    {
      "token" : "測",        # token: the term actually stored in the index
      "start_offset" : 0,    # start_offset and end_offset give the character range in the original string
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0         # position: the order in which the term appears in the original text
    },
    {
      "token" : "試",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "單",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "詞",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}
```

This result clearly does not meet our needs, so we need to download the Chinese analyzer that matches our ES version. The steps are as follows:

**1. Download the IK analyzer**

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.8.0

![](https://img.kancloud.cn/84/fa/84fa9f7d04c9bb8bd8cd34488c2c1b1a_1291x268.png)

**2. Unzip it into its own subdirectory under `%ES_HOME%/plugins/`**

![](https://img.kancloud.cn/1a/d6/1ad69744db8da28f7f3bba6210bbbe39_1458x167.png)

**3. Restart ES**

**4. Specify the IK analyzer**

* `ik_max_word`: splits the text at the finest granularity.
* `ik_smart`: splits the text at the coarsest granularity.

```json
GET /_analyze
{
  "text": "測試單詞",
  "analyzer": "ik_max_word"
}

# Response:
{
  "tokens" : [
    {
      "token" : "測試",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "單詞",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
```

The IK dictionary can also be extended with custom words. The query below returns only one token per character, because 弗雷爾卓德 is not in the built-in dictionary. What we need is for the analyzer to recognize 弗雷爾卓德 as a single word.

```json
GET /_analyze
{
  "text": "弗雷爾卓德",
  "analyzer": "ik_max_word"
}

# Response:
{
  "tokens" : [
    {
      "token" : "弗",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "雷",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "爾",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "卓",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "德",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 4
    }
  ]
}
```

To make the analyzer treat 弗雷爾卓德 as a word, do the following.

**1. Create a `%ES_HOME%/plugins/<IK analyzer directory>/config/*.dic` file**

Create a file named `custom.dic` (the file name is up to you) and write the strings that should be treated as Chinese words into it, one word per line.

![](https://img.kancloud.cn/f0/81/f0815c0c9337db266cb17b0681b4681c_1254x273.png)

```
弗雷爾卓德
測試單詞
```

**2. Register `custom.dic` in `%ES_HOME%/plugins/<IK analyzer directory>/config/IKAnalyzer.cfg.xml`**

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Custom extension dictionary -->
    <entry key="ext_dict">custom.dic</entry>
    <!-- Custom stop-word dictionary (none configured) -->
    <entry key="ext_stopwords"></entry>
</properties>
```

**3. Restart ES**

**4. Test**

```json
GET /_analyze
{
  "text": "弗雷爾卓德",
  "analyzer": "ik_max_word"
}

# Response:
{
  "tokens" : [
    {
      "token" : "弗雷爾卓德",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

GET /_analyze
{
  "text": "測試單詞",
  "analyzer": "ik_max_word"
}

# Response:
{
  "tokens" : [
    {
      "token" : "測試單詞",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "測試",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "單詞",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
```
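
The introduction notes that string fields need an explicit mapping when the standard analyzer is not what you want, but every request above passes the analyzer to `_analyze` directly. The following is a minimal sketch of such a mapping, assuming the IK plugin is installed as described above. The index name `my_index` and the field names `title` and `user_id` are hypothetical and do not come from the original text. Pairing `ik_max_word` at index time with `ik_smart` at search time is one common choice, not the only one.

```json
# Create an index with an explicit mapping (my_index, title, user_id are hypothetical names)
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "user_id": {
        "type": "keyword"
      }
    }
  }
}

# Analyze text with whatever analyzer the mapping assigns to the title field
GET /my_index/_analyze
{
  "field": "title",
  "text": "測試單詞"
}
```

Here `title` is a full-text field analyzed by IK, while `user_id` is a `keyword` field, so its value is indexed exactly as passed in, without analysis. Passing `field` instead of `analyzer` to `_analyze` is a quick way to confirm that the mapping took effect.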