# Alertmanager Alerting Rules Explained
* * * * *
This article walks through the alerting and notification rules of Prometheus and Alertmanager. Prometheus is configured through prometheus.yml, and Alertmanager through alertmanager.yml.

* Alerting: Prometheus sends detected abnormal events to Alertmanager; this does not mean sending email notifications.
* Notification: Alertmanager sends out notifications for those abnormal events (email, webhook, and so on).
## Alerting Rules
Specify in prometheus.yml how frequently alerting rules are evaluated:

```yaml
# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
```
Specify the rule files in prometheus.yml (wildcards such as rules/*.rules are allowed):

```yaml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/alert.rules"
```
Rules are written from the following template (this is the legacy Prometheus 1.x rule syntax):

```
ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]
```
Where:

* Alert name is the alert's identifier. It does not need to be unique.
* Expression is the condition that is evaluated to decide whether the alert fires. It typically uses existing metrics, such as those returned by a /metrics endpoint.
* Duration is how long the condition must hold before the alert fires; for example, 5s means 5 seconds.
* Label set is a set of labels that can be used in the notification message template.
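Since Prometheus 2.0, rule files use a YAML format instead. As a rough sketch, the template above maps onto the newer format like this (the group name, expression, and threshold are illustrative only):

```yaml
groups:
  - name: example.rules
    rules:
      - alert: InstanceDown                 # ALERT <alert name>
        expr: up == 0                       # IF <expression>
        for: 5m                             # FOR <duration>
        labels:                             # LABELS <label set>
          severity: critical
        annotations:                        # ANNOTATIONS <label set>
          summary: "Instance {{ $labels.instance }} is down"
```

The ConfigMap shown later in this article uses this newer groups-based format.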
In prometheus-k8s-statefulset.yaml, define a ruleSelector that marks the alerting-rule role; the rule file prometheus-k8s-rules.yaml is then picked up through these labels:

```yaml
ruleSelector:
  matchLabels:
    role: prometheus-rulefiles
    prometheus: k8s
```
In prometheus-k8s-rules.yaml, the prometheus-rulefiles are provided as a ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:
    role: prometheus-rulefiles
    prometheus: k8s
data:
  pod.rules.yaml: |+
    groups:
    - name: noah_pod.rules
      rules:
      - alert: Pod_all_cpu_usage
        expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
        for: 5m
        labels:
          severity: critical
          service: pods
        annotations:
          description: Container {{ $labels.name }} CPU usage is above 75% (current value is {{ $value }})
          summary: Dev CPU load alert
      - alert: Pod_all_memory_usage
        expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 1024*10^3*2
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
          summary: Dev memory load alert
      - alert: Pod_all_network_receive_usage
        expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} network_receive usage is above 50M (current value is {{ $value }})
          summary: network_receive load alert
```
Once the configuration is in place, prometheus-operator reloads it automatically. If you modify the ConfigMap contents again, a simple apply is enough:

```
kubectl apply -f prometheus-k8s-rules.yaml
```

Compare the email notification against the rules (alertmanager.yml still has to be configured before any email is actually delivered).

## Notification Rules
Set up the route and receivers sections of alertmanager.yml:
```yaml
global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  resolve_timeout: 5m

  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'xxxxx'
  smtp_from: 'xxxxxxx'
  smtp_auth_username: 'xxxxx'
  smtp_auth_password: 'xxxxxx'
  # The API URL to use for Slack notifications.
  slack_api_url: 'https://hooks.slack.com/services/some/api/token'

# The directory from which notification templates are read.
templates:
  - '*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend it.
  #repeat_interval: 1m
  repeat_interval: 15m

  # The default receiver.
  # If an alert isn't caught by a child route, send it to the default.
  receiver: default

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  - match:
      severity: critical
    receiver: email_alert

receivers:
- name: 'default'
  email_configs:
  - to: 'yi.hu@dianrong.com'
    send_resolved: true
- name: 'email_alert'
  email_configs:
  - to: 'yi.hu@dianrong.com'
    send_resolved: true
```
### Terminology
### Route
The `route` block defines the dispatch policy for alerts. It is a tree structure that is matched depth-first, from left to right.
```go
// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
```
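To make the depth-first matching concrete, here is a small routing sketch (the receiver names and label values are illustrative, not taken from the configuration above): an alert labelled service=pods, severity=critical descends past the default receiver into the service=pods subtree and then into its severity=critical child, so the deepest matching node wins.

```yaml
route:
  receiver: default              # used when no child route matches
  routes:
  - match:
      service: pods
    receiver: team-pods          # service=pods alerts stop here...
    routes:
    - match:
        severity: critical
      receiver: team-pods-pager  # ...unless they are also critical, in which
                                 # case this deeper node is the final match
  - match:
      service: database
    receiver: team-db
```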
### Alert
An `Alert` is what Alertmanager receives from its clients; its type is shown below.
```go
// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
	// Label value pairs for purpose of aggregation, matching, and disposition
	// dispatching. This must minimally include an "alertname" label.
	Labels LabelSet `json:"labels"`

	// Extra key/value information which does not define alert identity.
	Annotations LabelSet `json:"annotations"`

	// The known time range for this alert. Both ends are optional.
	StartsAt     time.Time `json:"startsAt,omitempty"`
	EndsAt       time.Time `json:"endsAt,omitempty"`
	GeneratorURL string    `json:"generatorURL"`
}
```
> Only alerts carrying exactly the same labels (both keys and values) are considered the same alert. A single rule in a Prometheus rules file can therefore produce several distinct alerts.
### Group
Alertmanager groups alerts according to the group_by setting. With the rules below, three alerts fire when go_goroutines equals 4, and Alertmanager splits them into two groups before notifying the receivers (see the grouping sketch after the rules).
```
ALERT test1
  IF go_goroutines > 1
  LABELS {label1="l1", label2="l2", status="test"}

ALERT test2
  IF go_goroutines > 2
  LABELS {label1="l2", label2="l2", status="test"}

ALERT test3
  IF go_goroutines > 3
  LABELS {label1="l2", label2="l1", status="test"}
```
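For these three alerts to end up in two groups, the grouping key would have to be label1: test1 (label1="l1") lands in one group, while test2 and test3 (label1="l2") share the other. A minimal route sketch under that assumption:

```yaml
route:
  # Grouping on label1 alone puts test1 in one group and test2/test3 in another.
  group_by: ['label1']
  receiver: default
```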
### Main Processing Flow
1. An alert is received and, based on its labels, matched to one or more routes (an alert can belong to several routes; a route contains several groups, and a group contains several alerts).
2. The alert is assigned to a matching group, or a new group is created if none exists.
3. A new group waits for the duration given by group_wait (more alerts for the same group may arrive in the meantime), checks against resolve_timeout whether the alerts are resolved, and then sends the notification.
4. An existing group waits for group_interval, checks whether its alerts are resolved, and sends a notification when the time since the last notification exceeds repeat_interval or when the group has changed. The three timers are summarised in the sketch after this list.
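As a quick reference, a route sketch annotating how those timers interact (the values are the same ones used in the alertmanager.yml above):

```yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # step 3: a brand-new group waits 30s before its first notification
  group_interval: 5m    # step 4: an existing group batches new/changed alerts every 5m
  repeat_interval: 15m  # step 4: an unchanged, still-firing group is re-notified after 15m
  receiver: default
```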
## Alertmanager
Alertmanager acts as a buffer for alerts and has the following characteristics:

* It receives alerts through a dedicated endpoint (one that is not specific to Prometheus).
* It can route alerts to receivers such as HipChat, email, and others.
* It is smart enough to recognise that a similar notification has already been sent, so when something breaks you are not drowned in thousands of emails.
An Alertmanager client (in this case Prometheus) starts by POSTing all the alerts it wants handled to /api/v1/alerts. For example:

```json
[
  {
    "labels": {
      "alertname": "low_connected_users",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance play-app:9000 under lower load",
      "summary": "play-app:9000 of job playframework-app is under lower load"
    }
  }
]
```
### Alert Workflow
Once these alerts are stored in Alertmanager, they can be in any of the following states:

* Inactive: nothing is happening here.
* Pending: a client has told us that this alert must fire. However, the alert can still be grouped, inhibited, or silenced. Once all of these checks have passed, it moves to Firing (see the sketch after this list for how Pending typically arises).
* Firing: the alert is sent to the notification pipeline, which contacts all of the alert's receivers. Once the client reports that the alert is resolved, it transitions back to the Inactive state.
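On the Prometheus side, the Pending state typically comes from the `for` clause of a rule: the condition has to hold for the whole duration before the alert transitions to Firing. A minimal, purely illustrative sketch:

```yaml
groups:
  - name: example.rules
    rules:
      - alert: HighGoroutineCount
        expr: go_goroutines > 100   # while this holds but 10m has not yet elapsed,
        for: 10m                    # the alert is reported as Pending
        labels:
          severity: warning
```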
Prometheus exposes a dedicated endpoint that lets us list all the alerts and follow their state transitions. Each state shown by Prometheus, together with the condition causing the transition, is summarised below:

* The rule's condition is not met: the alert is not active.
* The rule's condition is met: the alert is now active, and some checks are performed to avoid flooding the receivers with messages.
* The alert is sent to the receivers.
### Inhibition
Inhibition is the mechanism that, once an alert has fired, stops the repeated sending of other alerts triggered by it.
For example, when an alert fires saying that an entire cluster is unreachable, Alertmanager can be configured to ignore all the other alerts triggered by that outage. This prevents notifications for hundreds or thousands of alerts that are unrelated to the real problem.
The inhibition mechanism is configured in Alertmanager's configuration file.
Inhibition lets us mute the notifications for some alerts while other alerts are already firing. For example, if the same alert (based on the alert name) is already firing at critical severity, we can configure an inhibition to mute any warning-level notification. The relevant part of alertmanager.yml looks like this:
```yaml
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['low_connected_users']
```
An inhibition rule mutes the alerts matching one set of matchers whenever another alert matching a second set of matchers is firing; the two alerts must agree on the labels listed under equal.
```yaml
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
```
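Note that equal takes label names rather than values. As a hedged sketch (the label names below are illustrative), a rule that mutes warning alerts while a critical alert with the same alertname and cluster is firing could look like this:

```yaml
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # only inhibit when the firing critical alert shares the same
  # alertname and cluster labels as the warning alert
  equal: ['alertname', 'cluster']
```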
### Silences
Silences are a way to quickly mute alerts for a period of time. They are configured directly through a dedicated page in the Alertmanager web console. This is useful for avoiding a flood of notifications while you are busy resolving a serious production issue.

[Alertmanager reference](https://mp.weixin.qq.com/s/eqgfd5_D0aH8dOGWUddEjg)
[Inhibition rules (inhibit_rule) reference](http://blog.csdn.net/y_xiao_/article/details/50818451)