**Installation**
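When Prometheus runs inside a container, its data is not persistent: it is lost when the container is recreated unless storage is mounted from outside.
When node-exporter runs inside a container, the data it collects about the physical node is inaccurate.
So we use federation: one Prometheus server runs inside the containers and scrapes the container metrics, and a second Prometheus server runs outside, scraping the physical nodes and also pulling in the data already collected by the in-container Prometheus.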

The in-container installation is not covered here; only the external one is.
The packages can be downloaded from prometheus.io:
node-exporter
Prometheus
Alertmanager
**Installing node-exporter**
Install node-exporter under /usr/local on every node that needs to be monitored.
\# tar -xf node\_exporter-0.16.0.linux-amd64.tar.gz
\# cd node\_exporter-0.16.0.linux-amd64
\# ./node\_exporter &
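Once node-exporter is running it serves its metrics over HTTP, on port 9100 by default, at the `/metrics` path. The following is a minimal Python sketch for checking that an exporter answers and for reading values out of its text output. The helper names `fetch_metrics` and `parse_samples` are made up for illustration, the sample text is invented rather than captured from a real node, and the parser assumes lines carry no trailing timestamp.

```python
from urllib.request import urlopen


def fetch_metrics(host, port=9100, timeout=5):
    """Return the raw text served at http://host:port/metrics."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Map each 'name{labels}' series string to its float value.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Assumes no trailing timestamp, so the value is the last token.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Illustrative exporter output, not taken from a real machine.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.5
node_cpu_seconds_total{cpu="0",mode="user"} 42.25
"""

if __name__ == "__main__":
    for series, value in parse_samples(SAMPLE).items():
        print(series, value)
```

Against a live node, `parse_samples(fetch_metrics("192.168.11.212"))` should return the same kind of mapping, provided the exporter is reachable from where the script runs.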
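**Installing Prometheus**
\# tar -xf prometheus-2.4.2.linux-amd64.tar.gz
\# cd prometheus-2.4.2.linux-amd64
Add a job that scrapes the node-exporters on the physical nodes, and a federation job that pulls the data from the in-container Prometheus.
\# vim prometheus.yml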
global:
  scrape\_interval: 15s  \# scrape targets every 15 seconds
  evaluation\_interval: 15s  \# evaluate rules every 15 seconds
alerting:
  alertmanagers:
  \- static\_configs:
    \- targets: \["localhost:9093"\]
rule\_files:
  \- "rule/\*.yml"  \# alerting rule files
scrape\_configs:
  \- job\_name: 'prometheus'
    static\_configs:
    \- targets: \['localhost:9090'\]
  \- job\_name: 'node-exporter'
    static\_configs:
    \- targets: \['192.168.11.212:9100',
                '192.168.11.213:9100',
                '192.168.11.214:9100',
                '192.168.11.215:9100',
                '192.168.11.216:9100'\]
  \- job\_name: 'federate'
    scrape\_interval: 15s
    honor\_labels: true
    metrics\_path: '/federate'
    params:
      'match\[\]':
        \- '{job=~"kubernetes.\*"}'
    static\_configs:
      \- targets:
        \- 'prometheus.pkbeta.com'
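The federate job amounts to an HTTP GET against the `/federate` endpoint of the in-container Prometheus, with the series selector passed as a `match[]` query parameter. As a rough illustration, the Python sketch below builds the URL that such a scrape would request. The function name `federate_url` is hypothetical, and the `http` scheme is an assumption based on Prometheus defaulting to it when no `scheme` is configured.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(target, matchers, scheme="http"):
    """Build the URL a federation scrape requests from `target`.

    `matchers` is a list of series selectors; each one becomes
    its own match[] query parameter.
    """
    query = urlencode({"match[]": matchers}, doseq=True)
    return f"{scheme}://{target}/federate?{query}"


url = federate_url("prometheus.pkbeta.com", ['{job=~"kubernetes.*"}'])
print(url)

# Decoding the query string gives back the original selector.
print(parse_qs(urlsplit(url).query))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, is a quick way to check that the inner server returns the expected series before the outer server is pointed at it.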
**Installing Alertmanager**
\# tar -xf alertmanager-0.15.2.linux-amd64.tar.gz
\# cd alertmanager-0.15.2.linux-amd64
\# vim alertmanager.yml
global:
  resolve\_timeout: 5m
  smtp\_smarthost: 'smtp.163.com:25'  \# a 163.com mailbox is used here
  smtp\_from: 'XXXXX@163.com'
  smtp\_auth\_username: 'XXXXX@163.com'
  smtp\_auth\_password: 'XXXXX'
  smtp\_require\_tls: false
route:
  group\_by: \['NODE'\]
  group\_wait: 10s  \# how long to wait before sending the first notification for a new group
  group\_interval: 10s  \# how long to wait before notifying about new alerts added to an existing group
  repeat\_interval: 1h  \# how long to wait before re-sending a notification that was already sent
  receiver: 'node'
receivers:
\- name: 'node'
  email\_configs:
  \- to: 'XXXXX@163.com'
inhibit\_rules:
\- source\_match:
    severity: 'critical'
  target\_match:
    severity: 'warning'
  equal: \['alertname', 'dev', 'instance'\]
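The inhibit rule mutes a `warning` alert while a `critical` alert with the same `alertname`, `dev` and `instance` labels is firing. The Python sketch below models that matching logic in simplified form. It is not Alertmanager's code, the function name `is_inhibited` is hypothetical, and alerts are represented as plain label dictionaries. Note that the alerting rule in this post sets `severity: page`, so this inhibit rule only affects alerts that carry `critical` or `warning`.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if `target` is muted by any alert in `firing`.

    Every argument holding an alert is a dict of label name -> value.
    """
    def matches(alert, matcher):
        return all(alert.get(k) == v for k, v in matcher.items())

    # Only alerts matching target_match can be muted at all.
    if not matches(target, target_match):
        return False
    for source in firing:
        if source is target or not matches(source, source_match):
            continue
        # A missing label and an empty label count as the same value.
        if all(source.get(k, "") == target.get(k, "") for k in equal):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "NodeCPUUsage", "instance": "192.168.11.212:9100",
            "severity": "critical"}
warning = {"alertname": "NodeCPUUsage", "instance": "192.168.11.212:9100",
           "severity": "warning"}
other_node = dict(warning, instance="192.168.11.213:9100")

print(is_inhibited(warning, [critical], **rule))     # same instance
print(is_inhibited(other_node, [critical], **rule))  # different instance
```

Neither sample alert has a `dev` label, which is why the comparison treats a missing label as an empty one: the two alerts on the same instance still count as equal.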
Start Alertmanager
\# ./alertmanager &
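**Writing Prometheus alerting rules**
The rule files go in the rule directory next to prometheus.yml, matching the rule\_files path; create it if it does not exist.
\# mkdir -p prometheus-2.4.2.linux-amd64/rule
\# cd prometheus-2.4.2.linux-amd64/rule
\# vim test.yml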
groups:
\- name: NODE  \# group name
  rules:
  \- alert: NodeCPUUsage 75%  \# alert name
    expr: (100 - (avg by (instance) (irate(node\_cpu\_seconds\_total{mode="idle"}\[5m\])) \* 100)) > 75  \# alert condition
    for: 1m  \# fire once the condition has held for 1 minute
    labels:
      severity: page
    annotations:  \# the text below is what the notification contains
      summary: "{{$labels.instance}}: High CPU usage detected"
      description: "{{$labels.instance}}: CPU usage is above 75% (current value is: {{ $value }})"
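The alert expression takes the per-second growth of each CPU's idle counter, averages it across the CPUs of an instance, and subtracts the resulting idle percentage from 100. The Python sketch below works through the same arithmetic on made-up numbers. The function name `cpu_usage_percent` and the sample values are purely illustrative, and counter resets, which Prometheus handles inside `irate()`, are ignored.

```python
def cpu_usage_percent(idle_samples):
    """Approximate 100 - avg(irate(idle)) * 100 for one instance.

    idle_samples maps a CPU id to its last two idle-counter samples,
    each given as (timestamp_seconds, counter_value).
    """
    rates = []
    for (t1, v1), (t2, v2) in idle_samples.values():
        # Idle seconds gained per wall-clock second, as irate() computes
        # from the last two points in the range.
        rates.append((v2 - v1) / (t2 - t1))
    avg_idle = sum(rates) / len(rates)  # avg by (instance)
    return 100 - avg_idle * 100


# Two CPUs sampled 15 seconds apart (invented values).
samples = {
    "0": ((0.0, 1000.0), (15.0, 1003.0)),  # idle 3.0 s out of 15 s
    "1": ((0.0, 1000.0), (15.0, 1001.5)),  # idle 1.5 s out of 15 s
}
usage = cpu_usage_percent(samples)
print(round(usage, 2), usage > 75)
```

Here the CPUs are idle 20% and 10% of the time, so average idle is 15% and usage is about 85%, which is above the 75% threshold. With `for: 1m`, the alert would only fire if that stayed true for a full minute.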
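Start Prometheus from its installation directory, so that it finds prometheus.yml
\# ./prometheus
Open Prometheus in a browser; the default port is 9090.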


