[[using-stopwords]]
=== Using Stopwords
The removal of stopwords is ((("stopwords", "removal of")))handled by the
http://bit.ly/1INX4tN[`stop` token filter] which can be used
when ((("stop token filter")))creating a `custom` analyzer (see <<stop-token-filter>>).
However, some out-of-the-box analyzers((("analyzers", "stop filter pre-integrated")))((("pattern analyzer", "stopwords and")))((("standard analyzer", "stop filter")))((("language analyzers", "stop filter pre-integrated"))) come with the `stop` filter pre-integrated:
http://bit.ly/1xtdoJV[Language analyzers]::
Each language analyzer defaults to using the appropriate stopwords list
for that language. For instance, the `english` analyzer uses the
`_english_` stopwords list, as demonstrated in the example after this list.
http://bit.ly/14EpXv3[`standard` analyzer]::
Defaults to the empty stopwords list: `_none_`, essentially disabling
stopwords.
http://bit.ly/1u9OVct[`pattern` analyzer]::
Defaults to `_none_`, like the `standard` analyzer.
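
To see the effect of the default `_english_` list, you can run a sentence
through the `english` analyzer with the `analyze` API:

[source,json]
---------------------------------
GET /_analyze?analyzer=english
The quick and the dead
---------------------------------

The words `and` and `the` should be removed from the output because both
appear in the `_english_` stopwords list (the remaining tokens will also be
stemmed by the `english` analyzer).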
==== Stopwords and the Standard Analyzer
To use custom stopwords in conjunction with ((("standard analyzer", "stopwords and")))((("stopwords", "using with standard analyzer")))the `standard` analyzer, all we
need to do is to create a configured version of the analyzer and pass in the
list of `stopwords` that we require:
[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { <1>
          "type": "standard", <2>
          "stopwords": [ "and", "the" ] <3>
        }
      }
    }
  }
}
---------------------------------
<1> This is a custom analyzer called `my_analyzer`.
<2> This analyzer is the `standard` analyzer with some custom configuration.
<3> The stopwords to filter out are `and` and `the`.
TIP: This same technique can be used to configure custom stopword lists for
any of the language analyzers.
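
For example, here is a sketch of the `english` analyzer configured with the
same custom stopwords list (the analyzer name `my_english` is just an
illustration):

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}
---------------------------------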
[[maintaining-positions]]
==== Maintaining Positions
The output from the `analyze` API((("stopwords", "maintaining position of terms and"))) is quite interesting:
[source,json]
---------------------------------
GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead
---------------------------------
[source,json]
---------------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2 <1>
    },
    {
      "token": "dead",
      "start_offset": 18,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 5 <1>
    }
  ]
}
---------------------------------
<1> Note the `position` of each token.
The stopwords have been filtered out, as expected, but the interesting part is
that the `position` of the((("phrase matching", "stopwords and", "positions data"))) two remaining terms is unchanged: `quick` is the
second word in the original sentence, and `dead` is the fifth. This is
important for phrase queries--if the positions of each term had been
adjusted, a phrase query for `quick dead` would have matched the preceding
example incorrectly.
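
For example, assuming a field called `title` that is indexed with
`my_analyzer` (the field name is an assumption for illustration), a
`match_phrase` query for `quick dead` would not match the sentence above
unless it is relaxed with a `slop` of 2 or more, because the preserved
positions leave a gap where the stopwords used to be:

[source,json]
---------------------------------
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick dead",
        "slop": 2
      }
    }
  }
}
---------------------------------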
[[specifying-stopwords]]
==== Specifying Stopwords
Stopwords can be passed inline, as we did in ((("stopwords", "specifying")))the previous example, by
specifying an array:
[source,json]
---------------------------------
"stopwords": [ "and", "the" ]
---------------------------------
The default stopword list for a particular language can be specified using the
`_lang_` notation:
[source,json]
---------------------------------
"stopwords": "_english_"
---------------------------------
TIP: The predefined language-specific stopword((("languages", "predefined stopword lists for"))) lists available in
Elasticsearch can be found in the
http://bit.ly/157YLFy[`stop` token filter] documentation.
Stopwords can be disabled by ((("stopwords", "disabling")))specifying the special list: `_none_`. For
instance, to use the `english` analyzer((("english analyzer", "using without stopwords"))) without stopwords, you can do the
following:
[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english", <1>
          "stopwords": "_none_" <2>
        }
      }
    }
  }
}
---------------------------------
<1> The `my_english` analyzer is based on the `english` analyzer.
<2> But stopwords are disabled.
Finally, stopwords can also be listed in a file with one word per line. The
file must be present on all nodes in the cluster, and the path can be
specified((("stopwords_path parameter"))) with the `stopwords_path` parameter:
[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt" <1>
        }
      }
    }
  }
}
---------------------------------
<1> The path to the stopwords file, relative to the Elasticsearch `config`
directory
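
For instance, the `config/stopwords/english.txt` file referenced above might
simply contain:

[source,text]
---------------------------------
and
the
---------------------------------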
[[stop-token-filter]]
==== Using the stop Token Filter
The http://bit.ly/1AUzDNI[`stop` token filter] can be combined
with a tokenizer((("stopwords", "using stop token filter")))((("stop token filter", "using in custom analyzer"))) and other token filters when you need to create a `custom`
analyzer. For instance, let's say that we wanted to ((("Spanish", "custom analyzer for")))((("light_spanish stemmer")))create a Spanish analyzer
with the following:
* A custom stopwords list
* The `light_spanish` stemmer
* The <<asciifolding-token-filter,`asciifolding` filter>> to remove diacritics
We could set that up as follows:
[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "el", "la" ] <1>
        },
        "light_spanish": { <2>
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [ <3>
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
---------------------------------
<1> The `stop` token filter takes the same `stopwords` and `stopwords_path`
parameters as the `standard` analyzer.
<2> See <<algorithmic-stemmers>>.
<3> The order of token filters is important, as explained next.
We have placed the `spanish_stop` filter after the `asciifolding` filter.((("asciifolding token filter", "in custom Spanish analyzer"))) This
means that `esta`, `ésta`, and ++está++ will first have their diacritics
removed to become just `esta`, which will then be removed as a stopword. If,
instead, we wanted to remove `esta` and `ésta`, but not ++está++, we
would have to put the `spanish_stop` filter _before_ the `asciifolding`
filter, and specify both words in the stopwords list.
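
For reference, a sketch of that alternative ordering would look like this
(only the `analysis` section is shown; it replaces the one in the previous
example):

[source,json]
---------------------------------
"analysis": {
  "filter": {
    "spanish_stop": {
      "type": "stop",
      "stopwords": [ "si", "esta", "ésta", "el", "la" ]
    },
    "light_spanish": {
      "type": "stemmer",
      "language": "light_spanish"
    }
  },
  "analyzer": {
    "my_spanish": {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "spanish_stop",
        "asciifolding",
        "light_spanish"
      ]
    }
  }
}
---------------------------------

Here `spanish_stop` runs before `asciifolding`, and both `esta` and `ésta`
appear in the stopwords list, so ++está++ is left untouched.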
[[updating-stopwords]]
==== Updating Stopwords
A few techniques can be used to update the list of stopwords
used by an analyzer.((("analyzers", "stopwords list, updating")))((("stopwords", "updating list used by analyzers"))) Analyzers are instantiated at index creation time, when a
node is restarted, or when a closed index is reopened.
If you specify stopwords inline with the `stopwords` parameter, your
only option is to close the index and update the analyzer configuration with the
http://bit.ly/1zijFPx[update index settings API], then reopen
the index.
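
That sequence might look roughly like this, using the `my_analyzer` example
from earlier in this chapter (the extra stopword `or` is just an
illustration):

[source,json]
---------------------------------
POST /my_index/_close

PUT /my_index/_settings
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type": "standard",
        "stopwords": [ "and", "the", "or" ]
      }
    }
  }
}

POST /my_index/_open
---------------------------------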
Updating stopwords is easier if you specify them in a file with the
`stopwords_path` parameter.((("stopwords_path parameter"))) You can just update the file (on every node in
the cluster) and then force the analyzers to be re-created by either of these actions:
* Closing and reopening the index
(see http://bit.ly/1B6s0WY[open/close index]), or
* Restarting each node in the cluster, one by one
Of course, updating the stopwords list will not change any documents that have
already been indexed. It will apply only to searches and to new or updated
documents. To apply the changes to existing documents, you will need to
reindex your data. See <<reindex>>.