Chinese Analyzers · TUNA-daily

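[TOC]

## IK Chinese Analyzer

1. What is an analyzer

> An analyzer splits text into terms and normalizes them (which improves recall).
>
> Given a sentence, it breaks the sentence into individual words and normalizes each word (tense conversion, singular/plural conversion).
>
> Recall: how many relevant results a search is able to return. Normalization increases it.
>
> * What an analyzer does:
>   * character filter: preprocesses the text before tokenization. The most common cases are stripping HTML tags (`<span>hello</span>` --> hello) and mapping & --> and (I&you --> I and you)
>   * tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
>   * token filter: lowercase, stop word, synonym; dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little
>
> The analyzer matters because the text goes through all of these steps, and only the final result is used to build the inverted index.

2. Built-in analyzers

~~~
Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (a language-specific analyzer, e.g. english): set, shape, semi, transpar, call, set_tran, 5
~~~

* Installation

1. Create the plugin directory with mkdir /usr/share/elasticsearch/plugins/ik, then unzip the IK plugin into /usr/share/elasticsearch/plugins/ik.

1. Query string analysis

> The query string must be analyzed with the same analyzer that was used when the index was built (the search text and the indexed text go through the same analysis).
>
> The query string treats exact value fields and full text fields differently.

~~~
date: exact value
_all: full text   # used by queries that do not name a field
~~~

> Suppose we have a document with a field whose value is: hello you and me, and an inverted index is built from it.
>
> We want to search the index that holds this document, and the search text is hello me. That search text is the query string.
>
> By default, ES analyzes the query string with the same analyzer that was used to build the inverted index for the target field (tokenization plus normalization). Only then does the search work correctly.
>
> If dogs --> dog was applied when the inverted index was built but the search still uses dogs, nothing will match. The dogs in the search must also become dog before it can match.
>
> Key point:
>
> Fields of different types behave differently: some are full text, others are exact value.

~~~
post_date, date: exact value                   # exact value
_all: full text, tokenized and normalized      # full-text index
~~~

2. 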
The leftover question from the mapping introduction, explained

`GET /_search?q=2017`

`This searches the _all field: all fields of a document are concatenated into one large string, which is then tokenized.`

~~~
2017-01-02 my second article this is my second article in this website 11400

        doc1    doc2    doc3
2017     *       *       *
01       *
02               *
03                       *
~~~

> Searching _all for 2017 naturally matches all 3 documents.

`GET /_search?q=2017-01-01`

~~~
_all, 2017-01-01: the query string is tokenized with the same analyzer that was used to build the inverted index

2017
01
01
~~~

`GET /_search?q=post_date:2017-01-01 `

> A date field is indexed as an exact value. The query string and the index use the same analysis for the search.

~~~
              doc1    doc2    doc3
2017-01-01     *
2017-01-02             *
2017-01-03                     *

post_date:2017-01-01 --> 2017-01-01 --> matches one document, doc1
~~~

GET /_search?q=post_date:2017 is not explained here, because it relies on an optimization made from ES 5.2 onward.

3. Testing an analyzer

~~~
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
~~~

### 1. Testing analyzer output

> * IK provides two analyzers: ik_smart and ik_max_word
>   * ik_max_word: splits the text at the finest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,中華人民,中華,華人,人民共和國,人民,人,民,共和國,共和,和,國國,國歌”, exhausting every possible combination.
>   * ik_smart: splits the text at the coarsest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,國歌”.

* * * * *

#### 1.1 Analyzer tests

* Testing ik_smart

~~~
GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國國歌"
}
~~~

This yields two tokens, `中華人民共和國` and `國歌`, as follows:

~~~
{
  "token": "中華人民共和國",
  "start_offset": 0,
  "end_offset": 7,
  "type": "CN_WORD",
  "position": 0
},
{
  "token": "國歌",
  "start_offset": 7,
  "end_offset": 9,
  "type": "CN_WORD",
  "position": 1
}
~~~

* Testing ik_max_word

~~~
GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國國歌"
}
~~~

This yields (tokens shown concatenated) `中華人民共和國中華人民中華華人人民共和國人民共和國國國歌`. The response begins:

~~~
{
  "token": "中華人民共和國",
  "start_offset": 0,
  "end_offset": 7,
  "type": "CN_WORD",
  "position": 0
},
{
  "token": "中華人民",
  "start_offset": 0,
  "end_offset": 4,
  "type": "CN_WORD",
  "position": 1
},
{
  "token": "中華",
  "start_offset": 0,
  "end_offset": 2,
  "type": "CN_WORD",
  "position": 2
},
...
~~~

* Conclusion: both analyzers first split the text into large chunks. ik_max_word then keeps analyzing inside each chunk, and so on, so its output is more fine-grained.

* * * * *

### 1.2 Hot-updating the dictionary from MySQL

~~~
GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "王者榮耀是最好玩的游戲"
}
~~~

> The result (tokens shown concatenated) is `王者榮耀是最好好玩的游戲 `, but how do we get `王者榮耀` as a single token? The dictionary needs to be hot-updated with currently popular words.

#### 1.2.1 Modifying the IK source

1. 
Define a custom thread class, HotDictReloadThread, whose job is to keep reloading the dictionary

~~~
public class HotDictReloadThread implements Runnable {

    private static final Logger logger = ESLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    @Override
    public void run() {
        logger.info("==========reload hot dic from mysql.......");
        while (true) {
            // Keep reloading the dictionary. The pause between rounds happens
            // inside the MySQL load methods (jdbc.reload.interval).
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}
~~~

2. Modify the initial method of the Dictionary class so that it starts the thread that keeps reloading the dictionary

~~~
public static synchronized Dictionary initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {
                singleton = new Dictionary(cfg);
                singleton.loadMainDict();
                singleton.loadSurnameDict();
                singleton.loadQuantifierDict();
                singleton.loadSuffixDict();
                singleton.loadPrepDict();
                singleton.loadStopWordDict();

                // Our custom thread class: keeps reloading the dictionary
                new Thread(new HotDictReloadThread()).start();

                if (cfg.isEnableRemoteDict()) {
                    // Start the monitoring tasks
                    for (String location : singleton.getRemoteExtDictionarys()) {
                        // 10 is the initial delay (adjustable) and 60 is the interval, both in seconds
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                    for (String location : singleton.getRemoteExtStopWordDictionarys()) {
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                }

                return singleton;
            }
        }
    }
    return singleton;
}
~~~

3. 
Add a custom loadMySQLExtDict method that loads popular words from MySQL

~~~
private static Properties prop = new Properties();

static {
    try {
        // Register the MySQL JDBC driver
        Class.forName("com.mysql.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("error", e);
    }
}

private void loadMySQLExtDict() {
    try {
        // Re-read mysql.properties on every round so that changes are picked up
        Path file = PathUtils.get(getDictRoot(), "mysql.properties");
        try (InputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }

        logger.info("============JDBC reload properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========] query hot dict from mysql," + prop.getProperty(String.valueOf(key)));
        }

        // try-with-resources closes the result set, statement and connection after each round
        try (Connection connection = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(prop.getProperty("jdbc.reload.sql"))) {

            while (resultSet.next()) {
                String theWord = resultSet.getString("word");
                logger.info("[==========] hot word from mysql: " + theWord);
                // Add the word to the main dictionary
                _MainDict.fillSegment(theWord.trim().toCharArray());
            }
        }

        // Wait for the configured interval (milliseconds) before the next reload
        Thread.sleep(Integer.valueOf(prop.getProperty("jdbc.reload.interval")));
    } catch (Exception e) {
        logger.error("failed to load hot words from mysql", e);
    }
}
~~~

4. 
Add a custom loadMySQLStopwordDict method that loads stop words

~~~
private void loadMySQLStopwordDict() {
    try {
        // Re-read mysql.properties on every round so that changes are picked up
        Path file = PathUtils.get(getDictRoot(), "mysql.properties");
        try (InputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }

        logger.info("============JDBC reload properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========] query hot dict from mysql," + prop.getProperty(String.valueOf(key)));
        }

        // try-with-resources closes the result set, statement and connection after each round
        try (Connection connection = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"))) {

            while (resultSet.next()) {
                String theWord = resultSet.getString("word");
                logger.info("[==========] stop word from mysql: " + theWord);
                // Add the word to the stop word dictionary
                _StopWords.fillSegment(theWord.trim().toCharArray());
            }
        }

        // Wait for the configured interval (milliseconds) before the next reload
        Thread.sleep(Integer.valueOf(prop.getProperty("jdbc.reload.interval")));
    } catch (Exception e) {
        logger.error("failed to load stop words from mysql", e);
    }
}
~~~

5. In the loadMainDict method of the Dictionary class, call loadMySQLExtDict to load the popular words

~~~
private void loadMainDict() {
    // Create the main dictionary instance
    _MainDict = new DictSegment((char) 0);

    // Read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);

    InputStream is = null;
    try {
        is = new FileInputStream(file.toFile());
    } catch (FileNotFoundException e) {
        logger.error(e.getMessage(), e);
    }

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512);
        String theWord = null;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                _MainDict.fillSegment(theWord.trim().toCharArray());
            }
        } while (theWord != null);
    } catch (IOException e) {
        logger.error("ik-analyzer", e);
    } finally {
        try {
            if (is != null) {
                is.close();
                is = null;
            }
        } catch (IOException e) {
            logger.error("ik-analyzer", e);
        }
    }

    // Load the extension dictionaries
    this.loadExtDict();
    // Load the remote custom dictionaries
    this.loadRemoteExtDict();
    // Load the hot words from MySQL
    this.loadMySQLExtDict();
}
~~~

6. 
In the loadStopWordDict method of the Dictionary class, call loadMySQLStopwordDict

~~~
private void loadStopWordDict() {
    // Create the stop word dictionary instance
    _StopWords = new DictSegment((char) 0);

    // Read the stop word dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_STOP);

    InputStream is = null;
    try {
        is = new FileInputStream(file.toFile());
    } catch (FileNotFoundException e) {
        logger.error(e.getMessage(), e);
    }

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512);
        String theWord = null;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                _StopWords.fillSegment(theWord.trim().toCharArray());
            }
        } while (theWord != null);
    } catch (IOException e) {
        logger.error("ik-analyzer", e);
    } finally {
        try {
            if (is != null) {
                is.close();
                is = null;
            }
        } catch (IOException e) {
            logger.error("ik-analyzer", e);
        }
    }

    // Load the stop words from MySQL
    this.loadMySQLStopwordDict();
}
~~~

7. Add the MySQL configuration file mysql.properties

~~~
jdbc.url=jdbc:mysql://localhost:3306/es?serverTimezone=GMT
jdbc.user=root
jdbc.password=tuna
jdbc.reload.sql=select word from hot_words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
jdbc.reload.interval=30000
~~~

* Build the modified source into a jar and overwrite the original one

![](https://box.kancloud.cn/deb577e6ed1dca638e4f12e54f2eb2f0_1656x50.png)

* Add the MySQL jar

![](https://box.kancloud.cn/4d8cfbc11bb9b2dc7f429e3992ecbf9e_1639x198.png)

* Restart elasticsearch

The popular words stored in MySQL:

![](https://box.kancloud.cn/1dcccc6af3dfacc5e17d8d9c39c3255b_444x177.png)

Result

~~~
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "王者榮耀很好玩"
}
~~~

This returns

~~~
{
  "tokens": [
    {
      "token": "王者榮耀",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "王者",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "榮耀",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "很好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "好玩",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
~~~

`王者榮耀` is now returned as a single token, so the popular-word update has taken effect.

### 1.3 Modifying the index settings

~~~
PUT 
http://192.168.159.159:9200/index1
{
  "settings": {
    "refresh_interval": "5s",
    "number_of_shards": 1,      // one primary shard
    "number_of_replicas": 0     // no replicas; they can be added later
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }   // disable the _all field, because we only search the title field
    },
    "resource": {
      "dynamic": false,              // disable dynamic mapping updates
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed",
          "fields": {
            "cn": {
              "type": "string",
              "analyzer": "ik"
            },
            "en": {
              "type": "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
~~~

~~~
GET index/_search
{
  "query": {
    "match": {
      "content": "中國漁船"
    }
  }
}
~~~

~~~
"hits": {
  "total": 2,
  "max_score": 0.6099695,
  "hits": [
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "4",
      "_score": 0.6099695,
      "_source": {
        "content": "中國駐洛杉磯領事館遭亞裔男子槍擊嫌犯已自首"
      }
    },
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "3",
      "_score": 0.54359555,
      "_source": {
        "content": "中韓漁警沖突調查：韓警平均每天扣1艘中國漁船"
      }
~~~

Set the analyzer for the field

~~~
POST index/fulltext/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}
~~~

### 1.4 Document statistics on Chinese tokens

* Because the content field has type text, it cannot be used in aggregations by default, so set "fielddata": true,

~~~
PUT /news/_mapping/new
{
  "properties": {
    "content": {
      "type": "text",
      "fielddata": true,
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}
~~~

* Query

#### 1.4.1 terms (grouping)

~~~
GET /news/_search
{
  "query": {
    "match": {
      "content": "中國國家領導人"
    }
  },
  "aggs": {
    "top": {
      "terms": {
        "size": "10",
        "field": "content"
      }
    }
  }
}
~~~

This returns

~~~
"aggregations": {
  "top": {
    "doc_count_error_upper_bound": 1,
    "sum_other_doc_count": 67,
    "buckets": [
      {
        "key": "中國",
        "doc_count": 5
      },
      {
        "key": "在",
        "doc_count": 3
      },
      {
        "key": "人",
        "doc_count": 2
      },
      {
        "key": "沖突",
        "doc_count": 2
      },
~~~

中國 appears in five of the matching documents, and 在 appears in three.
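
Each bucket's doc_count is a number of documents, not a number of occurrences: a token that appears several times in one document still adds only 1 to its bucket. The sketch below reproduces that counting in plain Java to make the numbers easier to read. It is only an illustration under stated assumptions, not how Elasticsearch implements the aggregation: the documents are assumed to be already tokenized, ASCII placeholders stand in for the tokens the analyzer would emit, and the names `TermsDocCountSketch`, `docCounts` and `topBuckets` are hypothetical. Ordering buckets by count and then by key is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Elasticsearch or IK.
final class TermsDocCountSketch {

    // For each distinct token, count the documents that contain it at least once.
    static Map<String, Integer> docCounts(List<List<String>> tokenizedDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : tokenizedDocs) {
            // De-duplicate inside one document so repeated tokens count once.
            for (String token : new HashSet<>(doc)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return at most `size` buckets, highest document count first.
    // Ties are broken by key in ascending order (an assumption of this sketch).
    static List<Map.Entry<String, Integer>> topBuckets(Map<String, Integer> counts, int size) {
        List<Map.Entry<String, Integer>> buckets = new ArrayList<>(counts.entrySet());
        buckets.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(buckets.subList(0, Math.min(size, buckets.size())));
    }

    public static void main(String[] args) {
        // Placeholder tokens; in the tutorial these would come from ik_max_word.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("china", "in", "china"),
                Arrays.asList("china", "conflict"),
                Arrays.asList("in", "conflict", "china"));

        for (Map.Entry<String, Integer> bucket : topBuckets(docCounts(docs), 2)) {
            System.out.println(bucket.getKey() + " doc_count=" + bucket.getValue());
        }
    }
}
```

In the sample data, `china` occurs twice in the first document, yet its count is 3 because it is present in three documents. The same reading applies to the response above.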