Elasticsearch Analyzers: A Comprehensive Comparison and Usage Guide · PHP Development Notes

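Introduction: Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. Elasticsearch is written in Java, was released as open source under the Apache License, and is a popular enterprise search engine. It is designed for cloud environments and offers near-real-time search that is stable, reliable, fast, and easy to install and use.

Elasticsearch ships with many built-in analyzers. The sections below compare the built-in analyzers with the commonly used Chinese analyzers.

Built-in analyzers:

### 1. standard analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

How to use: http://www.yiibai.com/lucene/lucene_standardanalyzer.html

For English it behaves like StopAnalyzer. Chinese is supported by splitting the text into single characters. It lowercases tokens and removes stop words and punctuation. (This describes Lucene's StandardAnalyzer; in Elasticsearch the `standard` analyzer has stop-word removal disabled by default.)

~~~
/** StandardAnalyzer. getTokens is a helper from the original example and is not shown here. */
public void standardAnalyzer(String msg) {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}
~~~

### 2. simple analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html

How to use: http://www.yiibai.com/lucene/lucene_simpleanalyzer.html

More capable than WhitespaceAnalyzer. It first splits the text on non-letter characters and then lowercases every token. Numeric characters are discarded.

~~~
/** SimpleAnalyzer */
public void simpleAnalyzer(String msg) {
    SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}
~~~

### 3. whitespace analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html

How to use: http://www.yiibai.com/lucene/lucene_whitespaceanalyzer.html

It only splits on whitespace. It does not lowercase, does not support Chinese, and applies no other normalization to the tokens it produces.

~~~
/** WhitespaceAnalyzer */
public void whitespaceAnalyzer(String msg) {
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}
~~~

### 4. stop analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html

How to use: http://www.yiibai.com/lucene/lucene_stopanalyzer.html

StopAnalyzer goes one step beyond SimpleAnalyzer: on top of what SimpleAnalyzer does, it removes common English words (such as "the" and "a"). The stop-word list can be customized to your needs. Chinese is not supported.

~~~
/** StopAnalyzer */
public void stopAnalyzer(String msg) {
    StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}
~~~

### 5. keyword analyzer

KeywordAnalyzer treats the entire input as a single token, which is convenient for indexing and searching special kinds of text. It works well for building index terms from values such as postal codes and addresses.

### 6. pattern analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html

An analyzer of type pattern uses a regular expression to split text into "terms" (what remains after the token filters have run). It accepts the following settings:

| Setting | Description |
| --- | --- |
| lowercase | Whether terms are lowercased. Defaults to true. |
| pattern | The regular expression pattern. Defaults to `\W+`. |
| flags | The regular expression flags. |
| stopwords | A list of stop words used to initialize the stop filter. Defaults to an empty list. |

### 7. language analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

A collection of analyzers for text in specific languages (arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai). Unfortunately none of them is dedicated to Chinese (`cjk` only produces character bigrams), so they are not considered further.

### 8. snowball analyzer

An analyzer of type snowball is built from the standard tokenizer plus four filters: standard filter, lowercase filter, stop filter, and snowball filter. Using the snowball analyzer is generally not recommended in Lucene.

### 9. custom analyzer

A user-defined analyzer. It combines one tokenizer with zero or more token filters and zero or more char filters. The name of a custom analyzer must not start with "_".

The following settings can be set for a custom analyzer type:

| Setting | Description |
| --- | --- |
| tokenizer | The built-in or registered tokenizer to use. |
| filter | The built-in or registered token filters to use. |
| char_filter | The built-in or registered character filters to use. |
| position_increment_gap | The position gap inserted between the values of a multi-valued field, so that phrase queries do not match across values. Defaults to 100. |

A template for a custom analyzer:

~~~
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_increment_gap: 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
~~~

### 10. fingerprint analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html

* * *

Chinese analyzers:

### 1. ik-analyzer

https://github.com/wks/ik-analyzer

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. It uses its own "forward iterative finest-granularity segmentation algorithm" and supports two segmentation modes, fine-grained and maximum word length. Its stated throughput is 830,000 characters per second (1600 KB/s). It uses a multi-subprocessor analysis mode and handles English letters, digits, and Chinese words, with compatibility for Korean and Japanese characters. Optimized dictionary storage keeps the memory footprint small, and user dictionary extensions are supported. IKQueryParser, a query parser optimized for Lucene full-text search (strongly recommended by its author), introduces simple search expressions and uses an ambiguity-analysis algorithm to optimize how query keywords are combined, which can greatly improve Lucene's hit rate.

Maven usage:

~~~
<dependency>
    <groupId>org.wltea.ik-analyzer</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>3.2.8</version>
</dependency>
~~~

Until IK Analyzer is published to the Maven Central Repository, you have to install it manually, either into your local repository or by uploading it to your own Maven repository server. To install it into the local Maven repository, run the following command, which compiles, packages, and installs it: `mvn install -Dmaven.test.skip=true`

To add Chinese segmentation to Elasticsearch, install the IK analysis plugin from https://github.com/medcl/elasticsearch-analysis-ik and enter the elasticsearch-analysis-ik-master directory.

For more on installation, see these blog posts:

1. Adding Chinese word segmentation to elastic: http://blog.csdn.net/dingzfang/article/details/42776693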
2. How to install Chinese analyzers (IK + pinyin) in Elasticsearch: http://www.cnblogs.com/xing901022/p/5910139.html
3. Configuring and using the IK Chinese analyzer in Elasticsearch: http://blog.csdn.net/jam00/article/details/52983056

IK comes with two analyzers:

- ik_max_word: splits the text at the finest granularity and produces as many words as possible.
- ik_smart: splits the text at the coarsest granularity; characters already assigned to one word are not reused in another.

The difference:

~~~
# ik_max_word
curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '聯想是全球最大的筆記本廠商'
# response
{
  "tokens" : [
    { "token" : "聯想", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "是", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "全球", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 },
    { "token" : "最大", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 },
    { "token" : "的", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "筆記本", "start_offset" : 8, "end_offset" : 11, "type" : "CN_WORD", "position" : 5 },
    { "token" : "筆記", "start_offset" : 8, "end_offset" : 10, "type" : "CN_WORD", "position" : 6 },
    { "token" : "本廠", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 7 },
    { "token" : "廠商", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 8 }
  ]
}

# ik_smart
curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '聯想是全球最大的筆記本廠商'
# response
{
  "tokens" : [
    { "token" : "聯想", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "是", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "全球", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 },
    { "token" : "最大", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 },
    { "token" : "的", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "筆記本", "start_offset" : 8, "end_offset" : 11, "type" : "CN_WORD", "position" : 5 },
    { "token" : "廠商", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 6 }
  ]
}
~~~

Next, create an index that uses IK. The index is named iktest; its analysis settings define an analyzer named ik whose tokenizer is ik_max_word. It has a type named article with a subject field that is analyzed with ik_max_word. (The `string` field type belongs to Elasticsearch 2.x and earlier; 5.x and later use `text`.)

~~~
curl -XPUT 'http://localhost:9200/iktest?pretty' -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "ik" : { "tokenizer" : "ik_max_word" }
      }
    }
  },
  "mappings" : {
    "article" : {
      "dynamic" : true,
      "properties" : {
        "subject" : { "type" : "string", "analyzer" : "ik_max_word" }
      }
    }
  }
}'
~~~

Bulk-index a few documents. The _id metadata is set explicitly to make the results easier to read, and the subject values are a handful of news headlines picked at random.

~~~
curl -XPOST http://localhost:9200/iktest/article/_bulk?pretty -d '
{ "index" : { "_id" : "1" } }
{"subject" : "＂閨蜜＂崔順實被韓檢方傳喚韓總統府促徹查真相" }
{ "index" : { "_id" : "2" } }
{"subject" : "韓舉行＂護國訓練＂青瓦臺:決不許國家安全出問題" }
{ "index" : { "_id" : "3" } }
{"subject" : "媒體稱FBI已經取得搜查令檢視希拉里電郵" }
{ "index" : { "_id" : "4" } }
{"subject" : "村上春樹獲安徒生獎演講中談及歐洲排外問題" }
{ "index" : { "_id" : "5" } }
{"subject" : "希拉里團隊炮轟FBI 參院民主黨領袖批其“違法”" }
'
~~~

Search for "希拉里和韓國":

~~~
curl -XPOST http://localhost:9200/iktest/article/_search?pretty -d'
{
  "query" : { "match" : { "subject" : "希拉里和韓國" }},
  "highlight" : {
    "pre_tags" : [""],
    "post_tags" : [""],
    "fields" : { "subject" : {} }
  }
}
'
# response
{
  "took" : 113,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 4,
    "max_score" : 0.034062363,
    "hits" : [
      {
        "_index" : "iktest", "_type" : "article", "_id" : "2", "_score" : 0.034062363,
        "_source" : { "subject" : "韓舉行＂護國訓練＂青瓦臺:決不許國家安全出問題" },
        "highlight" : { "subject" : [ "韓舉行＂護國訓練＂青瓦臺:決不許國家安全出問題" ] }
      },
      {
        "_index" : "iktest", "_type" : "article", "_id" : "3", "_score" : 0.0076681254,
        "_source" : { "subject" : "媒體稱FBI已經取得搜查令檢視希拉里電郵" },
        "highlight" : { "subject" : [ "媒體稱FBI已經取得搜查令檢視希拉里電郵" ] }
      },
      {
        "_index" : "iktest", "_type" : "article", "_id" : "5", "_score" : 0.006709609,
        "_source" : { "subject" : "希拉里團隊炮轟FBI 參院民主黨領袖批其“違法”" },
        "highlight" : { "subject" : [ "希拉里團隊炮轟FBI 參院民主黨領袖批其“違法”" ] }
      },
      {
        "_index" : "iktest", "_type" : "article", "_id" : "1", "_score" : 0.0021509775,
        "_source" : { "subject" : "＂閨蜜＂崔順實被韓檢方傳喚韓總統府促徹查真相" },
        "highlight" : { "subject" : [ "＂閨蜜＂崔順實被韓檢方傳喚 韓總統府促徹查真相" ] }
      }
    ]
  }
}
~~~

The highlight property is used here so that the result can be rendered directly in HTML; with suitable pre_tags and post_tags, the matched characters or words stand out in red. To run a filtering search instead, simply change match to term.

#### Hot-word update configuration

Internet vocabulary changes every day. How can newly coined hot words (or other specific terms) be picked up by our search in near real time?

First, test with ik:

~~~
curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '
成龍原名陳港生
'
# response
{
  "tokens" : [
    { "token" : "成龍", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
    { "token" : "原名", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 1 },
    { "token" : "陳", "start_offset" : 5, "end_offset" : 6, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "港", "start_offset" : 6, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 },
    { "token" : "生", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 }
  ]
}
~~~

The main IK dictionary does not contain the word "陳港生", so it is split into single characters. To fix this, edit the IK configuration file at <ES directory>/plugins/ik/config/ik/IKAnalyzer.cfg.xml as follows:

~~~
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry>
</properties>
~~~

A remote extension dictionary is used here because another program can update it and Elasticsearch does not need to be restarted, which is convenient. The custom mydict.dic dictionary is also easy to use: it holds one word per line, and you simply add your own.

Because the dictionary is remote, it has to be a reachable URL. It can be a page or a plain .txt document, as long as the output is encoded as UTF-8.

The content of hotWords.php:

~~~
<?php
$s = <<<'EOF'
陳港生
元樓
藍瘦
EOF;
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"');
echo $s;
~~~

IK reads two response headers, Last-Modified and ETag. A change in either one triggers an update, and IK polls the URL once a minute.

Restart Elasticsearch and check the startup log: the three words have been loaded.

Run the analyze request above again, and the response shows that the IK analyzer now matches "陳港生" as a word. In the same way, proper names specific to our company (for example 永輝, 永輝超市, 永輝云創, 云創 ....
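) can also be added to the dictionary by hand.

### 2. jieba (結巴) Chinese word segmentation

Features:

1. Three segmentation modes. Precise mode tries to cut the sentence as accurately as possible and suits text analysis. Full mode scans out every string in the sentence that can form a word; it is very fast but cannot resolve ambiguity. Search-engine mode starts from the precise-mode result and splits long words again to improve recall, which suits search-engine indexing.
2. Supports traditional Chinese.
3. Supports custom dictionaries.

### 3. THULAC

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. It performs Chinese word segmentation and part-of-speech tagging. Its main characteristics:

- Capable. It is trained on what its authors describe as the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million characters), which gives the model strong tagging ability.
- Accurate. On the standard Chinese Treebank (CTB5) dataset, its segmentation F1 reaches 97.3% and its POS-tagging F1 reaches 92.9%, on par with the best methods on that dataset.
- Fairly fast. Joint segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second. Segmentation alone reaches 1.3 MB/s.

The Chinese segmentation tool thulac4j has been released. It makes the following changes:

1. Normalizes the segmentation dictionary and removes some useless words.
2. Rewrites the construction algorithm for the DAT (double-array trie), shrinking the generated DAT by about 8% and saving memory.
3. Optimizes the segmentation algorithm to increase throughput.

~~~
<dependency>
    <groupId>io.github.yizhiru</groupId>
    <artifactId>thulac4j</artifactId>
    <version>${thulac4j.version}</version>
</dependency>
~~~

http://www.cnblogs.com/en-heng/p/6526598.html

thulac4j supports two modes:

- SegOnly: segmentation only, without POS tags.
- SegPos: segmentation with POS tags.

~~~
// SegOnly mode
String sentence = "滔滔的流水，向著波士頓灣無聲逝去";
SegOnly seg = new SegOnly("models/seg_only.bin");
System.out.println(seg.segment(sentence));
// [滔滔, 的, 流水, ，, 向著, 波士頓灣, 無聲, 逝去]

// SegPos mode
SegPos pos = new SegPos("models/seg_pos.bin");
System.out.println(pos.segment(sentence));
// [滔滔/a, 的/u, 流水/n, ，/w, 向著/p, 波士頓灣/ns, 無聲/v, 逝去/v]
~~~

### 4. NLPIR

NLPIR from the Institute of Computing Technology, Chinese Academy of Sciences: http://ictclas.nlpir.org/nlpir/ (analyzes Chinese text online)

Download: https://github.com/NLPIR-team/NLPIR

A short Java tutorial for the CAS segmentation system (NLPIR): http://www.cnblogs.com/wukongjiuwo/p/4092480.html

### 5. ansj segmenter

https://github.com/NLPchina/ansj_seg

This is a Java implementation of Chinese word segmentation based on n-Gram + CRF + HMM. It segments roughly 2 million characters per second (measured on a MacBook Air) with accuracy above 96%. It currently implements Chinese word segmentation, Chinese name recognition, user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging. It can be applied to natural language processing tasks and suits projects that demand high segmentation quality.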
Maven dependency:

~~~
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.1</version>
</dependency>
~~~

Demo:

~~~
String str = "歡迎使用ansj_seg,(ansj中文分詞)在這里如果你遇到什么問題都可以聯系我.我一定盡我所能.幫助大家.ansj_seg更快,更準,更自由!" ;
System.out.println(ToAnalysis.parse(str));
// Output:
// 歡迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分詞/n,),在/p,這里/r,如果/c,你/r,遇到/v,什么/r,問題/n,都/d,可以/v,聯系/v,我/r,./m,我/r,一定/d,盡我所能/l,./m,幫助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,準/a,,,更/d,自由/a,!
~~~

### 6. LTP from Harbin Institute of Technology

https://github.com/HIT-SCIR/ltp

LTP defines an XML-based representation for language processing results. On top of it, LTP provides a complete, bottom-up set of rich and efficient Chinese language processing modules (covering six core technologies across lexical, syntactic, and semantic analysis), an application programming interface based on dynamic link libraries (DLL), and visualization tools. It can also be used as a web service. For usage, see http://ltp.readthedocs.io/zh_CN/latest/

### 7. Paoding (庖丁解牛)

Download: http://pan.baidu.com/s/1eQ88SZS

Usage takes the following steps.

Configure the dic files: edit the paoding-dic-home.properties file inside paoding-analysis.jar, uncomment the line "#paoding.dic.home=dic", and set it to the local path where your dic files are stored, for example /home/hadoop/work/paoding-analysis-2.0.4-beta/dic

Add the jars to the project: import paoding-analysis.jar, commons-logging.jar, lucene-analyzers-2.2.0.jar, and lucene-core-2.2.0.jar. The Chinese segmentation provided by Paoding can then be used in code, for example:

~~~
Analyzer analyzer = new PaodingAnalyzer(); // create the analyzer
String text = "庖丁系統是個完全基于lucene的中文分詞系統，它就是重新建了一個analyzer，叫做PaodingAnalyzer，這個analyer的核心任務就是生成一個可以切詞TokenStream。"; // text to segment
// The first argument is the field name; the original example passes the text itself.
TokenStream tokenStream = analyzer.tokenStream(text, new StringReader(text)); // stream of tokens
try {
    Token t;
    while ((t = tokenStream.next()) != null) {
        System.out.println(t); // print each token
    }
} catch (IOException e) {
    e.printStackTrace();
}
~~~

### 8. Sogou online segmentation

Sogou's online segmenter uses a character-tagging approach to segmentation, based mainly on a linear-chain CRF model. Its POS tagging module is based mainly on a structured linear model. It can be used online at http://www.sogou.com/labs/webservice/

### 9. word segmenter

Repository: https://github.com/ysc/word
word is a distributed Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English, digits, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person names, place names, and organization names. Its behavior can be changed through a custom configuration file. It supports user-defined dictionaries, automatic detection of dictionary changes, and large-scale distributed environments. You can choose among segmentation algorithms, use the refine feature to control the segmentation result, and use word-frequency statistics, POS tagging, synonym tagging, antonym tagging, and pinyin tagging. It provides 10 segmentation algorithms and 10 text-similarity algorithms, and it integrates with Lucene, Solr, ElasticSearch, and Luke. Note: word 1.3 requires JDK 1.8.

Maven dependency:

~~~
<dependencies>
    <dependency>
        <groupId>org.apdplat</groupId>
        <artifactId>word</artifactId>
        <version>1.3</version>
    </dependency>
</dependencies>
~~~

ElasticSearch plugin:

~~~
1. Open a command line and change to the elasticsearch bin directory:
   cd elasticsearch-2.1.1/bin

2. Run the plugin script to install the word plugin:
   ./plugin install http://apdplat.org/word/archive/v1.4.zip

   If the installation reports
     ERROR: failed to download
   or
     Failed to install word, reason: failed to download
   or
     ERROR: incorrect hash (SHA1)
   run the command again, and try a few more times if it still fails.

   For the elasticsearch 1.x series, use this command instead:
   ./plugin -u http://apdplat.org/word/archive/v1.3.1.zip -i word

3. Edit elasticsearch-2.1.1/config/elasticsearch.yml and add:
   index.analysis.analyzer.default.type : "word"
   index.analysis.tokenizer.default.type : "word"

4. Start ElasticSearch and test the result by opening this URL in Chrome:
   http://localhost:9200/_analyze?analyzer=word&text=楊尚川是APDPlat應用級產品開發平臺的作者

5. Custom configuration:
   edit elasticsearch-2.1.1/plugins/word/word.local.conf

6. Choose the segmentation algorithm:
   edit elasticsearch-2.1.1/config/elasticsearch.yml and add:
   index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"
   index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"

   Possible values of segAlgorithm:
   Forward maximum matching: MaximumMatching
   Reverse maximum matching: ReverseMaximumMatching
   Forward minimum matching: MinimumMatching
   Reverse minimum matching: ReverseMinimumMatching
   Bidirectional maximum matching: BidirectionalMaximumMatching
   Bidirectional minimum matching: BidirectionalMinimumMatching
   Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching
   Full segmentation: FullSegmentation
   Minimal word count: MinimalWordCount
   Maximum n-gram score: MaxNgramScore

   If none is specified, BidirectionalMaximumMatching is used by default.
~~~

### 10. jcseg segmenter

https://code.google.com/archive/p/jcseg/

### 11. Stanford segmenter

An open-source segmentation tool from Stanford University that now supports Chinese.

First, download Stanford Word Segmenter version 3.5.2 from [1], take the data folder from it, and place it in src/main/resources of your Maven project.

Then add the Maven dependencies:

~~~
<properties>
    <java.version>1.8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <corenlp.version>3.6.0</corenlp.version>
</properties>
<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models</classifier>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models-chinese</classifier>
    </dependency>
</dependencies>
~~~

Test:

~~~
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class CoreNLPSegment {

    private static CoreNLPSegment instance;
    private CRFClassifier<CoreLabel> classifier;

    private CoreNLPSegment() {
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        classifier = new CRFClassifier<>(props);
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        classifier.flags.setProperties(props);
    }

    // Lazily created singleton (not thread-safe, as in the original example).
    public static CoreNLPSegment getInstance() {
        if (instance == null) {
            instance = new CoreNLPSegment();
        }
        return instance;
    }

    public String[] doSegment(String data) {
        // toArray(new String[0]) yields a String[]; casting the Object[] returned
        // by the no-argument toArray() would throw ClassCastException at runtime.
        return classifier.segmentString(data).toArray(new String[0]);
    }

    public static void main(String[] args) {
        String sentence = "他和我在學校里常打桌球。";
        String[] ret = CoreNLPSegment.getInstance().doSegment(sentence);
        for (String str : ret) {
            System.out.println(str);
        }
    }
}
~~~

### 12. Smartcn

Smartcn is an open-source Chinese segmentation system released under the Apache 2.0 license. It is written in Java and is a modified version of the ICTCLAS segmentation system from the Institute of Computing Technology, Chinese Academy of Sciences. I noticed long ago that Lucene had gained a Chinese segmentation contribution; at the time I only skimmed the names of the .class files, which were enough to show that it was yet another ICTCLAS derivative.

http://lucene.apache.org/core/5_1_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html

### 13. pinyin analyzer

The pinyin analyzer lets users type pinyin and still find the related keywords. In a shopping site search, for example, typing yonghui matches 永輝, which makes for a very good experience.

The pinyin analyzer is installed the same way as IK. Download: https://github.com/medcl/elasticsearch-analysis-pinyin

For the parameters, see the README on GitHub. In version 1.8 this analyzer provides two rules:

- pinyin: converts Chinese characters to pinyin in the ordinary way.
- pinyin_first_letter: extracts the first letter of each character's pinyin.

Usage:

1. Create an index with a custom pinyin analyzer

~~~
curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}'
~~~

2. Test the analyzer by analyzing a Chinese name, such as 劉德華

~~~
http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer
~~~

~~~
{
  "tokens" : [
    { "token" : "liu", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "de", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "hua", "start_offset" : 2, "end_offset" : 3, "type" : "word", "position" : 2 },
    { "token" : "劉德華", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 3 },
    { "token" : "ldh", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 4 }
  ]
}
~~~

3. Create the mapping

~~~
curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "no",
                        "term_vector": "with_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'
~~~

4. Index a document

~~~
curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"劉德華"}'
~~~

5. Search

~~~
http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua
~~~

6. Use the pinyin token filter

~~~
curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}'
~~~
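With `keep_first_letter` enabled and `keep_full_pinyin` disabled, the filter configured above reduces each whitespace-separated token to the initials of its pinyin syllables. The idea can be sketched in plain Java as follows. This is only an illustration and not the plugin's implementation: the `PINYIN` table is a hypothetical, hand-written lookup that covers just the sample characters, and the handling of non-Chinese characters is an assumption modelled loosely on `keep_none_chinese_in_first_letter`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PinyinFirstLetterSketch {

    // Hypothetical lookup table covering only the sample characters;
    // the real plugin ships a complete pinyin dictionary.
    private static final Map<Character, String> PINYIN = Map.ofEntries(
            Map.entry('劉', "liu"), Map.entry('德', "de"), Map.entry('華', "hua"),
            Map.entry('張', "zhang"), Map.entry('學', "xue"), Map.entry('友', "you"),
            Map.entry('郭', "guo"), Map.entry('富', "fu"), Map.entry('城', "cheng"),
            Map.entry('黎', "li"), Map.entry('明', "ming"), Map.entry('四', "si"),
            Map.entry('大', "da"), Map.entry('天', "tian"), Map.entry('王', "wang"));

    // Mirrors limit_first_letter_length from the filter settings.
    private static final int LIMIT_FIRST_LETTER_LENGTH = 16;

    /** Splits on whitespace, then reduces each token to its pinyin initials. */
    static List<String> firstLetters(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            StringBuilder initials = new StringBuilder();
            for (char c : word.toCharArray()) {
                String pinyin = PINYIN.get(c);
                if (pinyin != null) {
                    // Known Chinese character: keep the first letter of its pinyin.
                    initials.append(pinyin.charAt(0));
                } else if (c < 128 && Character.isLetterOrDigit(c)) {
                    // Assumption: ASCII letters and digits are kept, lowercased.
                    initials.append(Character.toLowerCase(c));
                }
                // Anything else (unknown characters, punctuation) is dropped.
            }
            if (initials.length() > LIMIT_FIRST_LETTER_LENGTH) {
                initials.setLength(LIMIT_FIRST_LETTER_LENGTH);
            }
            if (initials.length() > 0) {
                tokens.add(initials.toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ldh, zxy, gfc, lm, sdtw]
        System.out.println(firstLetters("劉德華 張學友 郭富城 黎明 四大天王"));
    }
}
```

Splitting on whitespace first stands in for the `whitespace` tokenizer, so each name becomes exactly one token of initials.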
Token test: 劉德華 張學友 郭富城 黎明 四大天王

~~~
curl -XGET http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer
~~~

~~~
{
  "tokens" : [
    { "token" : "ldh", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "zxy", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 1 },
    { "token" : "gfc", "start_offset" : 8, "end_offset" : 11, "type" : "word", "position" : 2 },
    { "token" : "lm", "start_offset" : 12, "end_offset" : 14, "type" : "word", "position" : 3 },
    { "token" : "sdtw", "start_offset" : 15, "end_offset" : 19, "type" : "word", "position" : 4 }
  ]
}
~~~

7. Use in phrase queries

(1)

~~~
PUT /medcl/
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_first_letter" : false,
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true
                }
            }
        }
    }
}

GET /medcl/folks/_search
{
  "query": {"match_phrase": { "name.pinyin": "劉德華" }}
}
~~~

(2)

~~~
PUT /medcl/
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_first_letter" : false,
                    "keep_separate_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true
                }
            }
        }
    }
}

POST /medcl/folks/andy
{"name":"劉德華"}

GET /medcl/folks/_search
{
  "query": {"match_phrase": { "name.pinyin": "劉德h" }}
}

GET /medcl/folks/_search
{
  "query": {"match_phrase": { "name.pinyin": "劉dh" }}
}

GET /medcl/folks/_search
{
  "query": {"match_phrase": { "name.pinyin": "dh" }}
}
~~~

### 14. Mmseg analyzer

Mmseg also supports Elasticsearch.

Download: https://github.com/medcl/elasticsearch-analysis-mmseg/releases (pick the release that matches your Elasticsearch version)

How to use:

1. Create an index:

~~~
curl -XPUT http://localhost:9200/index
~~~

2. Create the mapping:

~~~
curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
    "properties": {
        "content": {
            "type": "text",
            "term_vector": "with_positions_offsets",
            "analyzer": "mmseg_maxword",
            "search_analyzer": "mmseg_maxword"
        }
    }
}'
~~~

3. Index some documents:

~~~
curl -XPOST http://localhost:9200/index/fulltext/1 -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'
curl -XPOST http://localhost:9200/index/fulltext/2 -d'
{"content":"公安部：各地校車將享最高路權"}
'
curl -XPOST http://localhost:9200/index/fulltext/3 -d'
{"content":"中韓漁警沖突調查：韓警平均每天扣1艘中國漁船"}
'
curl -XPOST http://localhost:9200/index/fulltext/4 -d'
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊嫌犯已自首"}
'
~~~

4. Query with highlighting:

~~~
curl -XPOST http://localhost:9200/index/fulltext/_search -d'
{
    "query" : { "term" : { "content" : "中國" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'
~~~
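To make the role of `pre_tags` and `post_tags` concrete, here is a minimal Java sketch that wraps every occurrence of a term in a pair of tags. It is a simplification under stated assumptions: it uses plain substring matching and a single tag pair, whereas Elasticsearch highlights the terms produced by the field's analyzer and may return fragments rather than the whole field. `HighlightSketch` and `highlight` are hypothetical names for illustration only.

```java
final class HighlightSketch {

    /** Wraps every occurrence of term in text with preTag and postTag. */
    static String highlight(String text, String term, String preTag, String postTag) {
        if (term.isEmpty()) {
            return text; // nothing to highlight
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(term, from)) >= 0) {
            // Copy the text before the match, then the tagged match itself.
            out.append(text, from, hit).append(preTag).append(term).append(postTag);
            from = hit + term.length();
        }
        // Copy whatever follows the last match.
        return out.append(text.substring(from)).toString();
    }

    public static void main(String[] args) {
        // prints <tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊嫌犯已自首
        System.out.println(highlight(
                "中國駐洛杉磯領事館遭亞裔男子槍擊嫌犯已自首", "中國", "<tag1>", "</tag1>"));
    }
}
```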
[![復制代碼](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0); "復制代碼") 5、結果： [![復制代碼](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0); "復制代碼") ~~~ { "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 2, "hits": [ { "_index": "index", "_type": "fulltext", "_id": "4", "_score": 2, "_source": { "content": "中國駐洛杉磯領事館遭亞裔男子槍擊嫌犯已自首" }, "highlight": { "content": [ "<tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊嫌犯已自首 " ] } }, { "_index": "index", "_type": "fulltext", "_id": "3", "_score": 2, "_source": { "content": "中韓漁警沖突調查：韓警平均每天扣1艘中國漁船" }, "highlight": { "content": [ "均每天扣1艘<tag1>中國</tag1>漁船 " ] } } ] } } ~~~ [![復制代碼](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0); "復制代碼") 參考博客：為elastic添加中文分詞:?http://blog.csdn.net/dingzfang/article/details/42776693 15、bosonnlp （玻森數據中文分析器）下載地址：https://github.com/bosondata/elasticsearch-analysis-bosonnlp 如何使用：運行 ElasticSearch 之前需要在 config 文件夾中修改 elasticsearch.yml 來定義使用玻森中文分析器，并填寫玻森 API\_TOKEN 以及玻森分詞 API 的地址，即在該文件結尾處添加： [![復制代碼](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0); "復制代碼") ~~~ index: analysis: analyzer: bosonnlp: type: bosonnlp API_URL: http://api.bosonnlp.com/tag/analysis # You MUST give the API_TOKEN value, otherwise it doesn't work API_TOKEN: *PUT YOUR API TOKEN HERE* # Please uncomment if you want to specify ANY ONE of the following # areguments, otherwise the DEFAULT value will be used, i.e., # space_mode is 0, # oov_level is 3, # t2s is 0, # special_char_conv is 0. 
# More detials can be found in bosonnlp docs: # http://docs.bosonnlp.com/tag.html # # # space_mode: put your value here(range from 0-3) # oov_level: put your value here(range from 0-4) # t2s: put your value here(range from 0-1) # special_char_conv: put your value here(range from 0-1) ~~~ [![復制代碼](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0); "復制代碼") 需要注意的是必須在 API\_URL 填寫給定的分詞地址以及在API\_TOKEN：PUT YOUR API TOKEN HERE中填寫給定的玻森數據API\_TOKEN，否則無法使用玻森中文分析器。該 API\_TOKEN 是注冊玻森數據賬號所獲得。如果配置文件中已經有配置過其他的 analyzer，請直接在 analyzer 下如上添加 bosonnlp analyzer。如果有多個 node 并且都需要 BosonNLP 的分詞插件，則每個 node 下的 yaml 文件都需要如上安裝和設置。另外，玻森中文分詞還提供了4個參數（space\_mode，oov\_level，t2s，special\_char\_conv）可滿足不同的分詞需求。如果取默認值，則無需任何修改；否則，可取消對應參數的注釋并賦值。測試：建立 index ~~~ curl -XPUT 'localhost:9200/test' ~~~ 測試分析器是否配置成功 ~~~ curl -XGET 'localhost:9200/test/_analyze?analyzer=bosonnlp&pretty' -d '這是玻森數據分詞的測試' ~~~ 結果 [![復制代碼](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0); "復制代碼") ~~~ { "tokens" : [ { "token" : "這", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 }, { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 }, { "token" : "玻森", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 2 }, { "token" : "數據", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 3 }, { "token" : "分詞", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 4 }, { "token" : "的", "start_offset" : 8, "end_offset" : 9, "type" : "word", "position" : 5 }, { "token" : "測試", "start_offset" : 9, "end_offset" : 11, "type" : "word", "position" : 6 } ] } ~~~ [![復制代碼](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0); "復制代碼") 配置 Token Filter 現有的 BosonNLP 分析器沒有內置 token filter，如果有過濾 Token 的需求，可以利用 BosonNLP Tokenizer 和 ES 提供的 token filter 搭建定制分析器。步驟配置定制的 analyzer 有以下三個步驟：添加 BosonNLP tokenizer 在 elasticsearch.yml 文件中 analysis 下添加 tokenizer，并在 tokenizer 中添加 BosonNLP tokenizer 的配置： 
~~~
index:
  analysis:
    analyzer:
      ...
    tokenizer:
      bosonnlp:
        type: bosonnlp
        API_URL: http://api.bosonnlp.com/tag/analysis
        # You MUST give the API_TOKEN value, otherwise it doesn't work
        API_TOKEN: *PUT YOUR API TOKEN HERE*
        # Please uncomment if you want to specify ANY ONE of the following
        # arguments, otherwise the DEFAULT value will be used, i.e.,
        # space_mode is 0,
        # oov_level is 3,
        # t2s is 0,
        # special_char_conv is 0.
        # More details can be found in bosonnlp docs:
        # http://docs.bosonnlp.com/tag.html
        #
        #
        # space_mode: put your value here(range from 0-3)
        # oov_level: put your value here(range from 0-4)
        # t2s: put your value here(range from 0-1)
        # special_char_conv: put your value here(range from 0-1)
~~~

Add a token filter. In elasticsearch.yml, add `filter` under `analysis`, and inside it add the configuration of the filters you need (the example below uses the lowercase filter):

~~~
index:
  analysis:
    analyzer:
      ...
    tokenizer:
      ...
    filter:
      lowercase:
        type: lowercase
~~~

Add the custom analyzer. In elasticsearch.yml, add `analyzer` under `analysis`, and inside it add the configuration of the custom analyzer (in the example below the custom analyzer is named filter\_bosonnlp):

~~~
index:
  analysis:
    analyzer:
      ...
      filter_bosonnlp:
        type: custom
        tokenizer: bosonnlp
        filter: [lowercase]
~~~

* * *

Custom analyzers

Elasticsearch ships with a number of ready-made analyzers, but its real strength in analysis is that you can build custom analyzers by combining character filters, tokenizers, and token filters in a configuration suited to your particular data.

Character filters: A character filter tidies up a string before it is tokenized. For example, if the text is HTML, it contains tags such as `<p>` or `<div>` that we do not want to index. The html\_strip character filter removes all HTML tags and converts HTML entities such as `&Aacute;` into the corresponding Unicode character Á. An analyzer may have zero or more character filters.

Tokenizers: An analyzer must have exactly one tokenizer. The tokenizer breaks the string into individual terms or tokens. The standard tokenizer, used by the standard analyzer, splits a string into terms on word boundaries and removes most punctuation, but other tokenizers with different behavior exist.

Token filters: After tokenization, the resulting token stream passes through the specified token filters in the order in which they are listed. Token filters can modify, add, or remove tokens. The lowercase and stop token filters have already been mentioned, but Elasticsearch offers many more. Stemming token filters reduce words to their stems. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge\_ngram token filters produce tokens suitable for partial matching or autocomplete.

Creating a custom analyzer

Character filters, tokenizers, and token filters are each configured in their own section under `analysis`:

~~~
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ... custom tokenizers ... },
            "filter":      { ... custom token filters ... },
            "analyzer":    { ... custom analyzers ... }
        }
    }
}
~~~

As an example, the analyzer built here does the following:

1. Strips HTML with the html\_strip character filter.

2. Replaces & with "and" using a custom mapping character filter:

~~~
"char_filter": {
    "&_to_and": {
        "type":     "mapping",
        "mappings": [ "&=> and "]
    }
}
~~~

3. Tokenizes with the standard tokenizer.

4. Lowercases terms with the lowercase token filter.

5. Removes the words in a custom stopword list with a custom stop token filter:

~~~
"filter": {
    "my_stopwords": {
        "type":      "stop",
        "stopwords": [ "the", "a" ]
    }
}
~~~

The analyzer definition combines the predefined tokenizer and filters with the custom filters configured above:

~~~
"analyzer": {
    "my_analyzer": {
        "type":        "custom",
        "char_filter": [ "html_strip", "&_to_and" ],
        "tokenizer":   "standard",
        "filter":      [ "lowercase", "my_stopwords" ]
    }
}
~~~

Put together, the complete create-index request looks like this:

~~~
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":     "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":      "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip", "&_to_and" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "my_stopwords" ]
            }}
}}}
~~~

Once the index has been created, use the analyze API to test the new analyzer:

~~~
GET /my_index/_analyze?analyzer=my_analyzer
The quick & brown fox
~~~

The abridged result below shows that the analyzer is working correctly:

~~~
{
  "tokens" : [
      { "token" : "quick",    "position" : 2 },
      { "token" : "and",      "position" : 3 },
      { "token" : "brown",    "position" : 4 },
      { "token" : "fox",      "position" : 5 }
    ]
}
~~~

The analyzer is of little use unless we tell Elasticsearch where to use it. We can apply it to a string field with a mapping like the following:
~~~
PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":      "string",
            "analyzer":  "my_analyzer"
        }
    }
}
~~~

The `string` field type belongs to older Elasticsearch releases; since 5.0, full-text fields use the `text` type instead.

Finally, thanks to the original author!
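To make the order of the stages in my\_analyzer concrete, the following is a minimal pure-Python sketch of the same pipeline: character filters first, then the tokenizer, then the token filters. It is only a simulation under stated assumptions and not how Elasticsearch implements analysis: the function names (`html_strip`, `apply_mappings`, `tokenize`, `analyze`) are hypothetical, the regex tokenizer is a crude stand-in for the standard tokenizer, and positions are counted from 1 only to line up with the abridged result shown earlier.

```python
import html
import re

# Assumption: these two constants mirror the "my_stopwords" filter and the
# "&_to_and" char filter from the index settings in the article.
STOPWORDS = {"the", "a"}
MAPPINGS = {"&": " and "}


def html_strip(text):
    """Rough stand-in for html_strip: drop tags, then decode HTML entities."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags)


def apply_mappings(text):
    """Rough stand-in for a mapping char filter: plain string replacement."""
    for source, target in MAPPINGS.items():
        text = text.replace(source, target)
    return text


def tokenize(text):
    """Crude stand-in for the standard tokenizer: runs of word characters.

    Returns (token, position) pairs, with positions starting at 1.
    """
    matches = re.finditer(r"\w+", text)
    return [(m.group(0), pos) for pos, m in enumerate(matches, start=1)]


def analyze(text):
    """Run char filters, then the tokenizer, then the token filters in order."""
    text = apply_mappings(html_strip(text))                           # char filters
    tokens = tokenize(text)                                           # tokenizer
    tokens = [(tok.lower(), pos) for tok, pos in tokens]              # lowercase
    return [(tok, pos) for tok, pos in tokens if tok not in STOPWORDS]  # stop


if __name__ == "__main__":
    for token, position in analyze("The quick & brown fox"):
        print(token, position)
```

Removing a stopword leaves a gap in the positions rather than renumbering the remaining tokens, which is why "quick" comes out at position 2 in this sketch, as it does in the abridged result.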