[TOC]

## IK Chinese Analyzer

1. What is an analyzer?

> An analyzer splits text into terms and applies normalization, which improves recall.
> Given a sentence, it breaks the sentence into individual words and normalizes each one (tense conversion, singular/plural conversion, and so on).
> recall: the number of relevant results a search is able to return.
> * An analyzer works in three stages:
> character filter: pre-processes the text before tokenization, most commonly stripping HTML tags (`<span>hello<span>` --> `hello`) or expanding entities (`I&you` --> `I and you`)
> tokenizer: splits the text into tokens: `hello you and me` --> `hello, you, and, me`
> token filter: lowercase, stop words, synonyms: `dogs` --> `dog`, `liked` --> `like`, `Tom` --> `tom`, `a/the/an` --> dropped, `mother` --> `mom`, `small` --> `little`
> The analyzer matters: a piece of text goes through all of these steps, and only the final result is used to build the inverted index.

2. The built-in analyzers

~~~
Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific, e.g. english): set, shape, semi, transpar, call, set_tran, 5
~~~

* Installation

1. mkdir /usr/share/elasticsearch/plugins/ik
2. Unzip the IK plugin into the /usr/share/elasticsearch/plugins/ik directory

1. How the query string is analyzed

> The query string must be analyzed with the same analyzer that was used when the index was built.
> The query string treats exact values and full text differently:

~~~
date: exact value
_all: full text   # searched when the query does not name a field
~~~

> Say a document has a field whose value is `hello you and me`, and we build an inverted index from it.
> We then search that document's index with the text `hell me`; that search text is the query string.
> By default, ES analyzes the query string with the same analyzer the target field used when building its inverted index, applying the same tokenization and normalization. Only then can the search match correctly.
> If the index mapped `dogs` --> `dog` but your query string still says `dogs`, nothing would match; at search time `dogs` must also become `dog` for the search to succeed.

> Key point: fields differ in type; some hold full text and some hold exact values.

~~~
post_date, date: exact value              # exact value
_all: full text, analyzed and normalized  # full-text search
~~~

2. Resolving the leftover question from the mapping example

`GET /_search?q=2017`
`This searches the _all field: all of a document's fields are concatenated into one big string and analyzed together.`

~~~
2017-01-02 my second article this is my second article in this website 11400

       doc1   doc2   doc3
2017    *      *      *
01      *
02             *
03                    *
~~~

> Searching _all for 2017 naturally finds all 3 documents.

`GET /_search?q=2017-01-01`

~~~
_all, 2017-01-01: the query string is analyzed with the same analyzer used to build the inverted index
2017
01
01
~~~

`GET /_search?q=post_date:2017-01-01`

> A date is indexed as an exact value, and the query string is analyzed the same way the index was built:

~~~
             doc1   doc2   doc3
2017-01-01    *
2017-01-02           *
2017-01-03                  *

post_date:2017-01-01 matches 2017-01-01 exactly: doc1, a single document
~~~

GET /_search?q=post_date:2017 is not covered here, because it relies on an optimization added in ES 5.2.

3. Testing an analyzer

~~~
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
~~~
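The three stages described above map directly onto a custom analyzer definition. The sketch below is not part of the original walkthrough; the index name `my_index` and analyzer name `my_analyzer` are placeholders, and it simply chains a built-in character filter, tokenizer, and token filters:

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],    // character filter: strip HTML tags
          "tokenizer": "standard",          // tokenizer: split into words
          "filter": ["lowercase", "stop"]   // token filters: lowercase, drop a/the/an...
        }
      }
    }
  }
}
~~~

You can then inspect its output with `GET /my_index/_analyze`, passing `"analyzer": "my_analyzer"` and a sample text.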
### 1. Testing the IK analyzers

> * IK ships two analyzers: ik_smart and ik_max_word
> ik_max_word: splits the text at the finest granularity. For example, "中華人民共和國國歌" becomes "中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌", exhausting every possible combination.
> ik_smart: splits at the coarsest granularity. For example, "中華人民共和國國歌" becomes "中華人民共和國, 國歌".

* * * * *

#### 1.1 Tokenization tests

* ik_smart test

~~~
GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國國歌"
}
~~~

This yields the two terms `中華人民共和國` and `國歌`:

~~~
{
  "token": "中華人民共和國",
  "start_offset": 0,
  "end_offset": 7,
  "type": "CN_WORD",
  "position": 0
},
{
  "token": "國歌",
  "start_offset": 7,
  "end_offset": 9,
  "type": "CN_WORD",
  "position": 1
}
~~~

* ik_max_word test

~~~
GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國國歌"
}
~~~

This yields `中華人民共和國 中華人民 中華 華人 人民共和國 人民 共和國 國 國歌`:

~~~
{
  "token": "中華人民共和國",
  "start_offset": 0,
  "end_offset": 7,
  "type": "CN_WORD",
  "position": 0
},
{
  "token": "中華人民",
  "start_offset": 0,
  "end_offset": 4,
  "type": "CN_WORD",
  "position": 1
},
{
  "token": "中華",
  "start_offset": 0,
  "end_offset": 2,
  "type": "CN_WORD",
  "position": 2
},
...
~~~

* Conclusion: both analyzers first split out the largest words; ik_max_word then keeps analyzing inside those larger words, and so on down. ik_max_word produces the more fine-grained result.

* * * * *
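Because of this difference, a commonly recommended combination is to index with ik_max_word (so every sub-word is searchable) and to search with ik_smart (so queries stay precise). A minimal sketch, assuming a hypothetical index `articles` and 7.x-style typeless mappings; adapt to your ES version:

~~~
PUT /articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",      // index time: finest-grained terms
        "search_analyzer": "ik_smart"   // query time: coarsest-grained terms
      }
    }
  }
}
~~~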
### 1.2 Hot-updating the dictionary from MySQL

Test the tokenization:

~~~
GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "王者榮耀是最好玩的游戲"
}
~~~

> This yields `王者 榮耀 是 最好 好玩 的 游戲`, but we want `王者榮耀` to be a single term. How do we do that? We need hot updates of trending words, here backed by MySQL.

#### 1.2.1 Modifying the IK source

1. Define a thread class HotDictReloadThread whose job is to reload the dictionaries continuously:

~~~
public class HotDictReloadThread implements Runnable {

    private static final Logger logger = ESLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    @Override
    public void run() {
        logger.info("==========reload hot dict from mysql.......");
        while (true) {
            // reload round after round; the pause between rounds comes from
            // the Thread.sleep() inside loadMySQLExtDict()
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}
~~~

2. Modify the initial method of the Dictionary class to start that thread:

~~~
public static synchronized Dictionary initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {
                singleton = new Dictionary(cfg);
                singleton.loadMainDict();
                singleton.loadSurnameDict();
                singleton.loadQuantifierDict();
                singleton.loadSuffixDict();
                singleton.loadPrepDict();
                singleton.loadStopWordDict();

                // our custom thread: keeps reloading the dictionaries
                new Thread(new HotDictReloadThread()).start();

                if (cfg.isEnableRemoteDict()) {
                    // start monitor threads for the remote dictionaries
                    for (String location : singleton.getRemoteExtDictionarys()) {
                        // 10 s initial delay (configurable), then every 60 s
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                    for (String location : singleton.getRemoteExtStopWordDictionarys()) {
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                }

                return singleton;
            }
        }
    }
    return singleton;
}
~~~

3. Define a loadMySQLExtDict method that loads the trending words from MySQL:

~~~
private static Properties prop = new Properties();

static {
    try {
        Class.forName("com.mysql.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("error", e);
    }
}

private void loadMySQLExtDict() {
    Connection connection = null;
    Statement statement = null;
    ResultSet resultSet = null;
    try {
        Path file = PathUtils.get(getDictRoot(), "mysql.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("============JDBC reload properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========] query hot dict from mysql," + prop.getProperty(String.valueOf(key)));
        }

        connection = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        statement = connection.createStatement();
        resultSet = statement.executeQuery(prop.getProperty("jdbc.reload.sql"));

        while (resultSet.next()) {
            String theWord = resultSet.getString("word");
            logger.info("[==========] hot word from mysql: " + theWord);
            // add the word to the in-memory main dictionary
            _MainDict.fillSegment(theWord.trim().toCharArray());
        }

        // wait before the next reload round
        Thread.sleep(Integer.valueOf(prop.getProperty("jdbc.reload.interval")));
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // close JDBC resources so the periodic reloads do not leak connections
        try { if (resultSet != null) resultSet.close(); } catch (Exception ignored) {}
        try { if (statement != null) statement.close(); } catch (Exception ignored) {}
        try { if (connection != null) connection.close(); } catch (Exception ignored) {}
    }
}
~~~

4. Define a loadMySQLStopwordDict method that loads the stop words:

~~~
private void loadMySQLStopwordDict() {
    Connection connection = null;
    Statement statement = null;
    ResultSet resultSet = null;
    try {
        Path file = PathUtils.get(getDictRoot(), "mysql.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("============JDBC reload properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========] query stop words from mysql," + prop.getProperty(String.valueOf(key)));
        }

        connection = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        statement = connection.createStatement();
        resultSet = statement.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"));

        while (resultSet.next()) {
            String theWord = resultSet.getString("word");
            logger.info("[==========] stop word from mysql: " + theWord);
            // add the word to the in-memory stop-word dictionary
            _StopWords.fillSegment(theWord.trim().toCharArray());
        }

        // wait before the next reload round
        Thread.sleep(Integer.valueOf(prop.getProperty("jdbc.reload.interval")));
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // close JDBC resources so the periodic reloads do not leak connections
        try { if (resultSet != null) resultSet.close(); } catch (Exception ignored) {}
        try { if (statement != null) statement.close(); } catch (Exception ignored) {}
        try { if (connection != null) connection.close(); } catch (Exception ignored) {}
    }
}
~~~

5. In the loadMainDict method of the Dictionary class, call loadMySQLExtDict to load the trending words:

~~~
private void loadMainDict() {
    // create the main dictionary instance
    _MainDict = new DictSegment((char) 0);

    // read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
    InputStream is = null;
    try {
        is = new FileInputStream(file.toFile());
    } catch (FileNotFoundException e) {
        logger.error(e.getMessage(), e);
    }

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512);
        String theWord = null;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                _MainDict.fillSegment(theWord.trim().toCharArray());
            }
        } while (theWord != null);
    } catch (IOException e) {
        logger.error("ik-analyzer", e);
    } finally {
        try {
            if (is != null) {
                is.close();
                is = null;
            }
        } catch (IOException e) {
            logger.error("ik-analyzer", e);
        }
    }

    // load the extension dictionary
    this.loadExtDict();
    // load the remote custom dictionaries
    this.loadRemoteExtDict();
    // load the hot words from MySQL
    this.loadMySQLExtDict();
}
~~~

6. In the loadStopWordDict method of the Dictionary class, call loadMySQLStopwordDict:

~~~
private void loadStopWordDict() {
    // create the stop-word dictionary instance
    _StopWords = new DictSegment((char) 0);

    // read the stop-word dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_STOP);
    InputStream is = null;
    try {
        is = new FileInputStream(file.toFile());
    } catch (FileNotFoundException e) {
        logger.error(e.getMessage(), e);
    }

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512);
        String theWord = null;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                _StopWords.fillSegment(theWord.trim().toCharArray());
            }
        } while (theWord != null);
    } catch (IOException e) {
        logger.error("ik-analyzer", e);
    } finally {
        try {
            if (is != null) {
                is.close();
                is = null;
            }
        } catch (IOException e) {
            logger.error("ik-analyzer", e);
        }
    }

    // load the stop words from MySQL
    this.loadMySQLStopwordDict();
}
~~~

7. Add the MySQL configuration file mysql.properties:

~~~
jdbc.url=jdbc:mysql://localhost:3306/es?serverTimezone=GMT
jdbc.user=root
jdbc.password=tuna
jdbc.reload.sql=select word from hot_words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
jdbc.reload.interval=30000
~~~

* Package the modified plugin into a jar and replace the original one

![](https://box.kancloud.cn/deb577e6ed1dca638e4f12e54f2eb2f0_1656x50.png)

* Add the MySQL JDBC driver jar

![](https://box.kancloud.cn/4d8cfbc11bb9b2dc7f429e3992ecbf9e_1639x198.png)

* Restart elasticsearch

The trending words in MySQL:

![](https://box.kancloud.cn/1dcccc6af3dfacc5e17d8d9c39c3255b_444x177.png)

Result:

~~~
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "王者榮耀很好玩"
}
~~~

yields

~~~
{
  "tokens": [
    {
      "token": "王者榮耀",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "王者",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "榮耀",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "很好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "好玩",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
~~~

`王者榮耀` now tokenizes as a single term: the trending-word update is complete.
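One caveat worth noting (not covered in the original write-up): a dictionary reload only affects text analyzed from that point on, so documents indexed before the new word was loaded still carry their old terms. Re-indexing them in place, for example with `_update_by_query`, refreshes their analysis; the index name `news` here is just a placeholder:

~~~
POST /news/_update_by_query?conflicts=proceed
~~~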
### 1.3 Modifying the index configuration

~~~
PUT http://192.168.159.159:9200/index1
{
  "settings": {
    "refresh_interval": "5s",
    "number_of_shards": 1,    // one primary shard
    "number_of_replicas": 0   // no replicas for now; they can be added later
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }  // disable the _all field, since we only search the title field
    },
    "resource": {
      "dynamic": false,  // disable dynamic mapping updates
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed",
          "fields": {
            "cn": { "type": "string", "analyzer": "ik" },
            "en": { "type": "string", "analyzer": "english" }
          }
        }
      }
    }
  }
}
~~~

~~~
GET index/_search
{
  "query": {
    "match": {
      "content": "中國漁船"
    }
  }
}
~~~

~~~
"hits": {
  "total": 2,
  "max_score": 0.6099695,
  "hits": [
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "4",
      "_score": 0.6099695,
      "_source": {
        "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
      }
    },
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "3",
      "_score": 0.54359555,
      "_source": {
        "content": "中韓漁警沖突調查:韓警平均每天扣1艘中國漁船"
      }
    }
  ]
}
~~~

Set a field's analyzers:

~~~
POST index/fulltext/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}
~~~
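With the title multi-fields defined above, one query can search the Chinese- and English-analyzed variants together. A minimal sketch (the query text is only an illustration):

~~~
GET index1/_search
{
  "query": {
    "multi_match": {
      "query": "中國 fishing boats",
      "fields": ["title.cn", "title.en"]   // both analyzed sub-fields of title
    }
  }
}
~~~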
"doc_count": 5 }, { "key": "在", "doc_count": 3 }, { "key": "人", "doc_count": 2 }, { "key": "沖突", "doc_count": 2 }, ~~~ 中國出現在五篇文檔中,在出現在三篇文檔中