# Alertmanager Alerting Rules Explained
* * * * *
This article walks through the alerting and notification rules of Prometheus and Alertmanager. Prometheus is configured through prometheus.yml, and Alertmanager through alertmanager.yml.

* Alerting: Prometheus sends detected abnormal events to Alertmanager; this does not mean sending email notifications.
* Notification: Alertmanager sends out notifications for those abnormal events (email, webhook, and so on).
## Alerting Rules
Specify in prometheus.yml how frequently alerting rules are evaluated:

```yaml
# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
```
Specify the rule files in prometheus.yml (wildcards such as rules/*.rules are allowed):

```yaml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/alert.rules"
```
Rules are written from the following template (this is the legacy Prometheus 1.x rule syntax):

```
ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]
```
Where:

* Alert name is the alert's identifier. It does not need to be unique.
* Expression is the condition that is evaluated to decide whether the alert fires. It typically uses existing metrics, such as those returned by a /metrics endpoint.
* Duration is how long the condition must hold before the alert fires; for example, 5s means 5 seconds.
* Label set is a set of labels that can be used in the notification message template.
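Since Prometheus 2.0, rule files use a YAML format instead. As a rough sketch, the template above maps onto the newer format like this (the group name, expression, and threshold are illustrative only):

```yaml
groups:
  - name: example.rules
    rules:
      - alert: InstanceDown                 # ALERT <alert name>
        expr: up == 0                       # IF <expression>
        for: 5m                             # FOR <duration>
        labels:                             # LABELS <label set>
          severity: critical
        annotations:                        # ANNOTATIONS <label set>
          summary: "Instance {{ $labels.instance }} is down"
```

The ConfigMap shown later in this article uses this newer groups-based format.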
In prometheus-k8s-statefulset.yaml, define a ruleSelector that marks the alerting-rule role; the rule file prometheus-k8s-rules.yaml is then picked up through these labels:

```yaml
ruleSelector:
  matchLabels:
    role: prometheus-rulefiles
    prometheus: k8s
```
In prometheus-k8s-rules.yaml, the prometheus-rulefiles are provided as a ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:
    role: prometheus-rulefiles
    prometheus: k8s
data:
  pod.rules.yaml: |+
    groups:
    - name: noah_pod.rules
      rules:
      - alert: Pod_all_cpu_usage
        expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
        for: 5m
        labels:
          severity: critical
          service: pods
        annotations:
          description: Container {{ $labels.name }} CPU usage is above 75% (current value is {{ $value }})
          summary: Dev CPU load alert
      - alert: Pod_all_memory_usage
        expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 1024*10^3*2
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
          summary: Dev memory load alert
      - alert: Pod_all_network_receive_usage
        expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} network_receive usage is above 50M (current value is {{ $value }})
          summary: network_receive load alert
```
Once the configuration is in place, prometheus-operator reloads it automatically. If you modify the ConfigMap contents again, a simple apply is enough:

```
kubectl apply -f prometheus-k8s-rules.yaml
```

Compare the email notification against the rules (alertmanager.yml still has to be configured before any email is actually delivered).

## Notification Rules
Set up the route and receivers sections of alertmanager.yml:
```yaml
global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  resolve_timeout: 5m

  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'xxxxx'
  smtp_from: 'xxxxxxx'
  smtp_auth_username: 'xxxxx'
  smtp_auth_password: 'xxxxxx'
  # The API URL to use for Slack notifications.
  slack_api_url: 'https://hooks.slack.com/services/some/api/token'

# The directory from which notification templates are read.
templates:
  - '*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend it.
  #repeat_interval: 1m
  repeat_interval: 15m

  # The default receiver.
  # If an alert isn't caught by a child route, send it to the default.
  receiver: default

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  - match:
      severity: critical
    receiver: email_alert

receivers:
- name: 'default'
  email_configs:
  - to: 'yi.hu@dianrong.com'
    send_resolved: true
- name: 'email_alert'
  email_configs:
  - to: 'yi.hu@dianrong.com'
    send_resolved: true
```
### Terminology
### Route
The `route` block defines the dispatch policy for alerts. It is a tree structure that is matched depth-first, from left to right.
```go
// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
```
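To make the depth-first matching concrete, here is a small routing sketch (the receiver names and label values are illustrative, not taken from the configuration above): an alert labelled service=pods, severity=critical descends past the default receiver into the service=pods subtree and then into its severity=critical child, so the deepest matching node wins.

```yaml
route:
  receiver: default              # used when no child route matches
  routes:
  - match:
      service: pods
    receiver: team-pods          # service=pods alerts stop here...
    routes:
    - match:
        severity: critical
      receiver: team-pods-pager  # ...unless they are also critical, in which
                                 # case this deeper node is the final match
  - match:
      service: database
    receiver: team-db
```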
### Alert
An `Alert` is what Alertmanager receives from its clients; its type is shown below.
```go
// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
	// Label value pairs for purpose of aggregation, matching, and disposition
	// dispatching. This must minimally include an "alertname" label.
	Labels LabelSet `json:"labels"`

	// Extra key/value information which does not define alert identity.
	Annotations LabelSet `json:"annotations"`

	// The known time range for this alert. Both ends are optional.
	StartsAt     time.Time `json:"startsAt,omitempty"`
	EndsAt       time.Time `json:"endsAt,omitempty"`
	GeneratorURL string    `json:"generatorURL"`
}
```
> Only alerts carrying exactly the same labels (both keys and values) are considered the same alert. A single rule in a Prometheus rules file can therefore produce several distinct alerts.
### Group
Alertmanager groups alerts according to the group_by setting. With the rules below, three alerts fire when go_goroutines equals 4, and Alertmanager splits them into two groups before notifying the receivers (see the grouping sketch after the rules).
```
ALERT test1
  IF go_goroutines > 1
  LABELS {label1="l1", label2="l2", status="test"}

ALERT test2
  IF go_goroutines > 2
  LABELS {label1="l2", label2="l2", status="test"}

ALERT test3
  IF go_goroutines > 3
  LABELS {label1="l2", label2="l1", status="test"}
```
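For these three alerts to end up in two groups, the grouping key would have to be label1: test1 (label1="l1") lands in one group, while test2 and test3 (label1="l2") share the other. A minimal route sketch under that assumption:

```yaml
route:
  # Grouping on label1 alone puts test1 in one group and test2/test3 in another.
  group_by: ['label1']
  receiver: default
```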
### Main Processing Flow
1. An alert is received and, based on its labels, matched to one or more routes (an alert can belong to several routes; a route contains several groups, and a group contains several alerts).
2. The alert is assigned to a matching group, or a new group is created if none exists.
3. A new group waits for the duration given by group_wait (more alerts for the same group may arrive in the meantime), checks against resolve_timeout whether the alerts are resolved, and then sends the notification.
4. An existing group waits for group_interval, checks whether its alerts are resolved, and sends a notification when the time since the last notification exceeds repeat_interval or when the group has changed. The three timers are summarised in the sketch after this list.
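As a quick reference, a route sketch annotating how those timers interact (the values are the same ones used in the alertmanager.yml above):

```yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # step 3: a brand-new group waits 30s before its first notification
  group_interval: 5m    # step 4: an existing group batches new/changed alerts every 5m
  repeat_interval: 15m  # step 4: an unchanged, still-firing group is re-notified after 15m
  receiver: default
```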
## Alertmanager
Alertmanager acts as a buffer for alerts and has the following characteristics:

* It receives alerts through a dedicated endpoint (one that is not specific to Prometheus).
* It can route alerts to receivers such as HipChat, email, and others.
* It is smart enough to recognise that a similar notification has already been sent, so when something breaks you are not drowned in thousands of emails.
An Alertmanager client (in this case Prometheus) starts by POSTing all the alerts it wants handled to /api/v1/alerts. For example:

```json
[
  {
    "labels": {
      "alertname": "low_connected_users",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance play-app:9000 under lower load",
      "summary": "play-app:9000 of job playframework-app is under lower load"
    }
  }
]
```
### Alert Workflow
Once these alerts are stored in Alertmanager, they can be in any of the following states:

* Inactive: nothing is happening here.
* Pending: a client has told us that this alert must fire. However, the alert can still be grouped, inhibited, or silenced. Once all of these checks have passed, it moves to Firing (see the sketch after this list for how Pending typically arises).
* Firing: the alert is sent to the notification pipeline, which contacts all of the alert's receivers. Once the client reports that the alert is resolved, it transitions back to the Inactive state.
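On the Prometheus side, the Pending state typically comes from the `for` clause of a rule: the condition has to hold for the whole duration before the alert transitions to Firing. A minimal, purely illustrative sketch:

```yaml
groups:
  - name: example.rules
    rules:
      - alert: HighGoroutineCount
        expr: go_goroutines > 100   # while this holds but 10m has not yet elapsed,
        for: 10m                    # the alert is reported as Pending
        labels:
          severity: warning
```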
Prometheus exposes a dedicated endpoint that lets us list all the alerts and follow their state transitions. Each state shown by Prometheus, together with the condition causing the transition, is summarised below:

* The rule's condition is not met: the alert is not active.
* The rule's condition is met: the alert is now active, and some checks are performed to avoid flooding the receivers with messages.
* The alert is sent to the receivers.
### Inhibition
Inhibition is the mechanism that, once an alert has fired, stops the repeated sending of other alerts triggered by it.
For example, when an alert fires saying that an entire cluster is unreachable, Alertmanager can be configured to ignore all the other alerts triggered by that outage. This prevents notifications for hundreds or thousands of alerts that are unrelated to the real problem.
The inhibition mechanism is configured in Alertmanager's configuration file.
Inhibition lets us mute the notifications for some alerts while other alerts are already firing. For example, if the same alert (based on the alert name) is already firing at critical severity, we can configure an inhibition to mute any warning-level notification. The relevant part of alertmanager.yml looks like this:
```yaml
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['low_connected_users']
```
An inhibition rule mutes the alerts matching one set of matchers whenever another alert matching a second set of matchers is firing; the two alerts must agree on the labels listed under equal.
```yaml
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
```
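Note that equal takes label names rather than values. As a hedged sketch (the label names below are illustrative), a rule that mutes warning alerts while a critical alert with the same alertname and cluster is firing could look like this:

```yaml
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # only inhibit when the firing critical alert shares the same
  # alertname and cluster labels as the warning alert
  equal: ['alertname', 'cluster']
```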
### Silences
Silences are a way to quickly mute alerts for a period of time. They are configured directly through a dedicated page in the Alertmanager web console. This is useful for avoiding a flood of notifications while you are busy resolving a serious production issue.

[Alertmanager reference](https://mp.weixin.qq.com/s/eqgfd5_D0aH8dOGWUddEjg)
[Inhibition rules (inhibit_rule) reference](http://blog.csdn.net/y_xiao_/article/details/50818451)