Compound Word Token Filter · Elasticsearch 5.4 Chinese Documentation

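# Compound Word Token Filter

Original: [https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-compound-word-tokenfilter.html](https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-compound-word-tokenfilter.html)

Translation: [http://www.apache.wiki/pages/viewpage.action?pageId=10027907](http://www.apache.wiki/pages/viewpage.action?pageId=10027907)

Contributors: [李亞運](/display/~liyayun), [ApacheCN](/display/~apachecn), [Apache中文網](/display/~apachechina)

## Overview

The `hyphenation_decompounder` and `dictionary_decompounder` token filters split the compound words found in many Germanic languages into their component words. Both filters require a dictionary of word parts, which can be supplied in either of the following ways:

| Setting | Description |
| --- | --- |
| `word_list` | An array of words, specified inline in the token filter configuration. |
| `word_list_path` | The path (absolute or relative to the `config` directory) to a UTF-8 encoded file containing one word per line. |

## Hyphenation decompounder

The `hyphenation_decompounder` uses hyphenation grammars to find potential subwords, which are then checked against the word dictionary. The quality of the output tokens depends directly on the quality of the grammar file you use. It works well for languages such as German.

XML-based hyphenation grammar files can be found in the [Objects For Formatting Objects](http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns) (OFFO) Sourceforge project. Currently only FOP v1.2 compatible hyphenation files are supported. You can download [offo-hyphenation_v1.2.zip](https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download) directly and look in the `offo-hyphenation/hyph/` directory. For more details, see the Apache FOP project.

## Dictionary decompounder

The `dictionary_decompounder` uses a brute-force approach, relying only on the word dictionary, to find subwords in a compound word. It is much slower than the hyphenation decompounder, but it can serve as a first step for checking the quality of your dictionary.

## Compound token filter parameters

The following parameters can be used to configure a compound word token filter:

| Parameter | Description |
| --- | --- |
| `type` | Either `dictionary_decompounder` or `hyphenation_decompounder`. |
| `word_list` | An array containing the list of words to use for the word dictionary. |
| `word_list_path` | The path (absolute or relative to the `config` directory) to the word dictionary. |
| `hyphenation_patterns_path` | The path (absolute or relative to the `config` directory) to a FOP XML hyphenation pattern file. Required for `hyphenation_decompounder`. |
| `min_word_size` | Minimum word size. Defaults to 5. |
| `min_subword_size` | Minimum subword size. Defaults to 2. |
| `max_subword_size` | Maximum subword size. Defaults to 15. |
| `only_longest_match` | Whether to include only the longest matching subword. Defaults to `false`. |

Here is an example:

```
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : standard
                filter : [myTokenFilter1, myTokenFilter2]
        filter :
            # Dictionary-based filter with an inline word list
            myTokenFilter1 :
                type : dictionary_decompounder
                word_list: [one, two, three]
            # Hyphenation-based filter; needs both a word list and a pattern file
            myTokenFilter2 :
                type : hyphenation_decompounder
                word_list_path: path/to/words.txt
                hyphenation_patterns_path: path/to/fop.xml
                max_subword_size : 22
```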
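
To make the brute-force idea behind the dictionary decompounder more concrete, here is a minimal Python sketch of how the size parameters and `only_longest_match` could interact. It is an illustration only, not the Elasticsearch or Lucene implementation. The function name `dictionary_decompound` is hypothetical, and lowercasing both the token and the dictionary is an assumption made for this sketch.

```python
def dictionary_decompound(token, word_list,
                          min_word_size=5,
                          min_subword_size=2,
                          max_subword_size=15,
                          only_longest_match=False):
    """Hypothetical sketch: return the original token followed by the
    dictionary subwords found in it by trying every start position."""
    # Assumption of this sketch: matching is case-insensitive.
    dictionary = {word.lower() for word in word_list}
    result = [token]  # the original token is always kept

    # Tokens shorter than min_word_size are not decompounded at all.
    if len(token) < min_word_size:
        return result

    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            candidate = lowered[start:start + size]
            if candidate in dictionary:
                if only_longest_match:
                    # Sizes increase, so the last hit at this start is the longest.
                    longest = candidate
                else:
                    result.append(candidate)
        if only_longest_match and longest is not None:
            result.append(longest)
    return result


if __name__ == "__main__":
    words = ["donau", "dampf", "schiff"]
    print(dictionary_decompound("Donaudampfschiff", words))
    # ['Donaudampfschiff', 'donau', 'dampf', 'schiff']
```

Because every start position is tested against every allowed subword length, the work grows with both the token length and `max_subword_size`, which is consistent with the dictionary decompounder being the slower of the two filters.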