Elasticsearch Basics (1) · TUNA-daily

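[TOC]

## 1. What is Elasticsearch

> * Elasticsearch is a real-time distributed search and analytics engine.
> * It can scale out to hundreds of servers and handle petabytes of structured or unstructured data.

* * * * *

## 2. Use cases

> * Wikipedia uses Elasticsearch for full-text search with highlighted keywords, as well as search-as-you-type and did-you-mean suggestions.
> * The Guardian uses Elasticsearch to process visitor logs, so that editors get real-time feedback on how the public responds to different articles.
> * StackOverflow combines full-text search with geolocation and related information to surface more-like-this related questions.
> * GitHub uses Elasticsearch to search more than 130 billion lines of code.
> * Goldman Sachs uses it to index 5 TB of data every day, and many other investment banks use it to analyze movements in the stock market.

* * * * *

## 3. Terminology

1. Cluster health

> green : all primary shards and replica shards are available
> yellow : all primary shards are available, but not all replica shards are, meaning some replicas have not been assigned to another node
> red : not all primary shards are available

2. Shards

> * Shards are either primary shards or replica shards
> Primary shards: the number of primary shards cannot be changed once the index has been created
> Replica shards: a replica is just a copy of a primary shard. It protects against data loss caused by hardware failure and can also serve read requests, such as searches or fetching a document from another shard.
> To scale out horizontally once there are more machines than shards in total, increase the number of replica shards to improve performance

3. Document metadata

> _index : where the document is stored
> _type : the document type, representing the class of the object
> _id : the unique identifier of the document

4. Updating documents

> Documents in Elasticsearch are immutable. To change one, it has to be reindexed so that the new version replaces the old document, and each replacement increments _version

5. Query results

> hits:
>

## 4. Full-text search vs. exact matching

1. exact value

> For the exact value 2017-01-01, a search only matches if the whole value 2017-01-01 is entered
> Searching for just 01 finds nothing

2. full text

Full-text search can match in the following ways:

~~~
(1) abbreviation vs. full name: cn vs. china
(2) word forms: like liked likes
(3) case: Tom vs tom
(4) synonyms: like vs love
~~~

When 2017-01-01 is treated as full text (tokens 2017 01 01), searching for 2017 or 01 finds it.

~~~
china: searching for cn also finds china    # abbreviation matching
likes: searching for like also finds likes  # word-form matching
Tom:   searching for tom also finds Tom     # case-insensitive matching
like:  searching for love also finds like   # synonym matching
~~~

> In other words, full-text search does not only match one complete value. The value is split into words (tokenization) and each word can be matched, and matches can also be made through abbreviations, tenses, case and synonyms

## 5. Inverted index

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.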
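
As a rough illustration of what an inverted index looks like for the two documents above, here is a minimal Python sketch. It is not how Elasticsearch is implemented: the `tokenize` and `build_index` helpers are hypothetical names, and the tokenizer is assumed to split on non-letters only, with no lowercasing, stemming or synonym handling.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    """Assumed tokenizer: keep runs of ASCII letters, change nothing else."""
    return re.findall(r"[A-Za-z]+", text)

def build_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

index = build_index(docs)
print(sorted(index.get("liked", set())))  # ['doc1', 'doc2']
print(sorted(index.get("small", set())))  # ['doc1']
print(sorted(index.get("dog", set())))    # [] because only "dogs" was indexed
```

Looking up a word is a plain dictionary access, which is why a query word has to appear in the index in exactly the indexed form to match.
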

Tokenizing the two documents gives a first, naive inverted index:

~~~
word     doc1   doc2
I        *      *
really   *
liked    *      *
my       *      *
small    *
dogs     *      *
and      *
think    *
mom      *      *
also     *
them     *
He              *
never           *
any             *
so              *
hope            *
that            *
will            *
not             *
expect          *
me              *
to              *
him             *
~~~

This is the simplest way an inverted index can be built. With this index, a search for `mother like little dog` cannot return anything: the query is split into the terms mother, like, little and dog, and none of them appears in the index.

Is that the search result we want? Definitely not. To us, mother and mom are synonyms. like and liked mean the same thing; one is present tense and the other is past tense. little and small are synonyms. dog and dogs differ only in number, singular versus plural.

> normalization:
> When the inverted index is built, every token produced by tokenization is also processed (tense, singular/plural, synonyms, case), so that later searches are more likely to find the related documents

~~~
mother --> mom
liked  --> like
small  --> little
dogs   --> dog
~~~

Rebuild the inverted index with normalization applied, search again for `mother like little dog`, and the documents are found:

~~~
word     doc1   doc2
I        *      *
really   *
like     *      *      liked --> like
my       *      *
little   *             small --> little
dog      *      *      dogs --> dog
and      *
think    *
mom      *      *
also     *
them     *
He              *
never           *
any             *
so              *
hope            *
that            *
will            *
not             *
expect          *
me              *
to              *
him             *
~~~

~~~
mother like little dog: tokenize, then normalize
mother --> mom
like   --> like
little --> little
dog    --> dog
~~~

Both doc1 and doc2 are returned.

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

## 6. _mapping

### 6.1 Core data types

1. Built-in types

~~~
string                       # string type (text / keyword in 5.x)
byte, short, integer, long   # integer types
float, double                # floating-point types
boolean                      # boolean type
date                         # date type
~~~

2. dynamic mapping

~~~
true or false  --> boolean
123            --> long
123.45         --> double (float since 5.0)
2017-01-01     --> date
"hello world"  --> string/text
~~~

3. View the mapping `GET /index/_mapping/type`

4.
   Create a _mapping

A mapping can only be defined by hand when the index is created, or extended with mappings for new fields. The mapping of an existing field cannot be changed (no update of a field mapping).

~~~
PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "author_id":    { "type": "long" },
        "title":        { "type": "text", "analyzer": "english" },
        "content":      { "type": "text" },
        "post_date":    { "type": "date" },
        "publisher_id": { "type": "keyword" }
      }
    }
  }
}
~~~

Or add a mapping for a new field to an existing type:

~~~
PUT /website/_mapping/article
{
  "properties": {
    "new_field": {
      "type": "keyword"    # not analyzed, exact match
    }
  }
}
~~~

Before 5.x, a field that should not be analyzed was declared as `"type": "string", "index": "not_analyzed"`; in 5.x the same thing is expressed with `"type": "keyword"`.

### type=keyword

* In ES 5.x, dynamic mapping of a string value creates two fields by default. One is the field itself with type=text, for example articleID, which is analyzed. The other is field.keyword, for example articleID.keyword, which is not analyzed and only indexes values of at most 256 characters (`ignore_above`); longer values are skipped for that sub-field.

For example, bulk-insert data without creating the index first, so that the fields are mapped automatically:

~~~
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
~~~

View the mapping:

~~~
GET forum/_mapping/article

{
  "forum": {
    "mappings": {
      "article": {
        "properties": {
          "articleID": {
            "type": "text",               # articleID is analyzed
            "fields": {
              "keyword": {
                "type": "keyword",        # articleID.keyword is not analyzed
                "ignore_above": 256
              }
            }
          },
          "hidden":   { "type": "boolean" },
          "postDate": { "type": "date" },
          "userID":   { "type": "long" }
        }
      }
    }
  }
}
~~~

In a mapping, type=keyword means the field is not analyzed.
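
To make the difference between an analyzed `text` field and a `keyword` field more concrete, here is a small Python sketch built around the `articleID` value from the bulk example above. It only imitates the behaviour: `analyze_text`, `analyze_keyword` and `exact_match` are hypothetical helpers, and the tokenizer (lowercase, then split on anything that is not a letter or digit) is an assumption that only approximates what a real analyzer does.

```python
import re

def analyze_text(value):
    """Rough stand-in for an analyzed text field: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]

def analyze_keyword(value, ignore_above=256):
    """Stand-in for a keyword field: the whole value is one term; longer values are skipped."""
    return [value] if len(value) <= ignore_above else []

def exact_match(indexed_terms, query):
    """Exact lookup: the query string is compared to the indexed terms as-is."""
    return query in indexed_terms

article_id = "XHDK-A-1293-#fJ3"
text_terms = analyze_text(article_id)
keyword_terms = analyze_keyword(article_id)

print(text_terms)                             # ['xhdk', 'a', '1293', 'fj3']
print(keyword_terms)                          # ['XHDK-A-1293-#fJ3']
print(exact_match(text_terms, article_id))    # False: the full id is not a term
print(exact_match(keyword_terms, article_id)) # True
print(analyze_keyword("x" * 300))             # [] because it exceeds ignore_above
```

Under these assumptions, an exact lookup of the full id only succeeds against the keyword form, which matches the idea that `articleID.keyword` is the field to use for exact matching.
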