Integrating Prometheus with Alertmanager · Kubernetes

[TOC]

![](https://img.kancloud.cn/cf/6e/cf6e6d8b2b5d53cdb52fd8ea79ca9b4b_1036x148.png)

To quote the official documentation: **it is recommended to keep deploying rules inside the relevant local Prometheus servers.** If you really do need to install the rule component, see the [Rule article](Ruler.md).

This article shows how to connect Prometheus to Alertmanager. The **main steps** for setting up alerts and notifications are:

- Install and configure Alertmanager
- Associate Prometheus with Alertmanager
- Create alert rules in Prometheus

## Install and configure Alertmanager

1. Install Alertmanager

See [the previous chapter](alertmanager.md).

2. Configure Alertmanager email alerts

```yaml
global:
  # Email settings
  smtp_from: 'ecloudz@126.com'
  smtp_smarthost: 'smtp.126.com:25'
  smtp_auth_username: 'ecloudz@126.com'
  smtp_auth_password: 'FHWBDWBEUMQExxxx'  # authorization code of the mailbox
route:
  # Receiver for alerts that match no child route (the root route requires one)
  receiver: default
  # When a new alert group is created, wait at least group_wait before sending the first notification.
  # This leaves time for several alerts of the same group to arrive and be sent together.
  group_wait: 1m
  # How long to wait before re-sending a notification that was already sent successfully
  repeat_interval: 4h
  # How long to wait before notifying about new alerts added to a group that was already notified
  group_interval: 15m
  # Grouping keys; they correspond to the labels set in the Prometheus alert rules
  group_by: ["cluster", "team"]
  # Child routes
  # Alerts labelled team=hosts (passed on by Prometheus) are delivered by email.
  # Alerts that match none of the child routes go to the default receiver.
  routes:
  - receiver: email
    matchers:
    - team = hosts
receivers:
- name: default  # catch-all receiver; its notification settings are not shown here
- name: email
  email_configs:
  - to: "jiaxzeng@126.com"  # recipient address
    html: '{{ template "email.to.html" . }}'  # email body
    headers: { Subject: '{{ if eq .Status "firing" }}[Alert firing]{{ else if eq .Status "resolved" }}[Alert resolved]{{ end }} {{ .CommonLabels.alertname }}' }  # email subject
    send_resolved: true  # also send a notification when an alert is resolved
templates:
- "/data/alertmanager/email.tmpl"  # template path
```

3. Add the template

```shell
cat <<-'EOF' | sudo tee /data/alertmanager/email.tmpl > /dev/null
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts.Firing }}
=========start==========
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
=========end==========
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts.Resolved }}
=========start==========
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
=========end==========
{{ end }}{{ end -}}
{{- end }}
EOF
```

> The name given to `define` on the first line must match the template name referenced by `receivers.email_configs.html` in the Alertmanager configuration file; otherwise the alert email body is empty. `28800e9` adds 8 hours (in nanoseconds) to convert the UTC timestamps to UTC+8.

4. Check that the configuration file is valid

```shell
$ amtool check-config /data/alertmanager/alertmanager.yml
Checking '/data/alertmanager/alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 0 inhibit rules
 - 2 receivers
 - 1 templates
  SUCCESS
```

5. Hot-reload Alertmanager

```shell
systemctl reload alertmanager
```

## Associate Prometheus with Alertmanager

```yaml
alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: replica
  alertmanagers:
  - path_prefix: "/alertmanager"
    static_configs:
    - targets:
      - "192.168.31.103:9093"
```

> Note the following three points:
> - Every Prometheus node needs this configuration.
> - `alert_relabel_configs` is set because Prometheus adds an extra `replica` label. If that label is not dropped before the alert is sent, the same alert arrives from each replica with a different label set and duplicate alert emails are sent.
> - `path_prefix` is set because Alertmanager is served under a sub-path. If yours is not, omit that line.

## Create alert rules in Prometheus

1. Configure the alert rule path in Prometheus

```yaml
rule_files:
- "rules/*.yml"
```

2. Create the alert rules

```shell
mkdir -p /data/prometheus/rules
# The delimiter is quoted so the shell does not expand $labels and $value
cat <<-'EOF' | sudo tee /data/prometheus/rules/hosts.yml > /dev/null
groups:
- name: hosts
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      team: hosts
    annotations:
      summary: "Node memory usage is too high"
      description: "{{$labels.instance}} memory usage is above 80% (current value: {{ $value }})"
  - alert: NodeCpuUsage
    expr: (1 - (sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total[1m])) by(instance))) * 100 > 80
    for: 1m
    labels:
      team: hosts
    annotations:
      summary: "Node CPU usage is too high"
      description: "{{$labels.instance}} CPU usage over the last minute is above 80% (current value: {{ $value }})"
  - alert: NodeDiskUsage
    expr: ((node_filesystem_size_bytes{fstype !~ "tmpfs|rootfs"} - node_filesystem_free_bytes{fstype !~ "tmpfs|rootfs"}) / node_filesystem_size_bytes{fstype !~ "tmpfs|rootfs"})*100 > 40
    for: 1m
    labels:
      team: hosts
    annotations:
      summary: "Node disk partition usage is too high"
      description: "{{$labels.instance}} partition {{$labels.mountpoint}} is above 40% (current value: {{ $value }})"
EOF
```

## Hot-reload the alert rules

> The commands in this and the next section use the Thanos Rule paths and service (`/data/thanos/rule/rules`, `thanos-rule.service`). If the rules live under `/data/prometheus/rules` as created in the previous step, substitute that path and reload the Prometheus service instead.

```shell
promtool check rules /data/thanos/rule/rules/hosts.yml
sudo systemctl reload thanos-rule.service
```

## Sync the files to the other nodes

```shell
# Alert rule directory
scp -r /data/thanos/rule/rules ops@k8s-master02:/data/thanos/rule

# Check the rule file
ssh ops@k8s-master02 "promtool check rules /data/thanos/rule/rules/hosts.yml"

# Hot-reload the configuration
ssh ops@k8s-master02 "sudo systemctl reload thanos-rule.service"
```

## Verification

If Prometheus does not expose an address you can reach in a browser, verify through the API on the host instead:

```shell
# Names of the loaded alert rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'

# Alerts that are currently pending or firing
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[].labels'
```
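
The API checks above confirm that the rules are loaded, but not that an email actually goes out. One way to exercise the `team=hosts` email route without waiting for a real rule to fire is to push a synthetic alert directly to Alertmanager. The following is a minimal sketch, not part of the original setup: it assumes Alertmanager's v2 API endpoint `/api/v2/alerts`, reuses the address and `/alertmanager` prefix from the Prometheus configuration in this article, and the function name `build_test_alert` and the values `RouteSmokeTest`, `test-host:9100` and `severity: warning` are made up for illustration.

```shell
# Print a one-element alert list in the JSON shape that Alertmanager's
# v2 API accepts on POST /api/v2/alerts.
# $1 = alert name, $2 = instance (both hypothetical test values).
build_test_alert() {
  cat <<EOF
[
  {
    "labels": {
      "alertname": "$1",
      "team": "hosts",
      "severity": "warning",
      "instance": "$2"
    },
    "annotations": {
      "summary": "Test alert",
      "description": "Synthetic alert sent to check the email route"
    }
  }
]
EOF
}

# Example (not run automatically): send the payload to Alertmanager.
# Drop /alertmanager from the URL if no path prefix is configured.
# build_test_alert RouteSmokeTest test-host:9100 |
#   curl -s -X POST -H 'Content-Type: application/json' \
#     --data-binary @- http://192.168.31.103:9093/alertmanager/api/v2/alerts
```

Because the payload carries `team: hosts`, it should match the child route and arrive after `group_wait` (1m in the configuration above). No `endsAt` is set, so Alertmanager treats the alert as resolved once it stops being re-sent and its resolve timeout passes.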