# Prometheus Stack Deployment Plan
## Introduction to the Prometheus Stack
_PS: For convenience in the rest of this document, I refer to the Prometheus-related components collectively as the "Prometheus Stack": Prometheus, Grafana, Alertmanager, the various \*\_exporter data collectors, and so on._
Prometheus is an open-source monitoring and alerting solution that originated at SoundCloud. Development started in 2012, and since being open-sourced on GitHub in 2015 it has attracted 9k+ stars and adoption by many large companies. In 2016 Prometheus became the second project hosted by the CNCF (Cloud Native Computing Foundation), after Kubernetes.
As a new-generation open-source solution, many of its ideas align closely with Google's SRE approach to operations.
### What are its features?
+ A multi-dimensional data model (a time series is identified by a metric name plus a set of key/value labels)
+ Very efficient storage: an average sample costs ~3.5 bytes; 3.2 million time series, scraped every 30 seconds and retained for 60 days, consume roughly 228 GB of disk
+ A flexible and powerful query language over those dimensions (PromQL)
+ No dependency on distributed storage; a single server node is self-contained
+ Time series are collected via an HTTP-based pull model
+ Pushing time series is supported through a push gateway
+ Scrape targets are found via service discovery or static configuration
+ Multiple modes of graphing and dashboard support
### Keyword: time series data
All storage in Prometheus is organized as time series: a metric name combined with a set of labels identifies one series, and different label values produce different series.
Each time series is therefore uniquely identified by its metric name plus a set of labels in key=value form.
Prometheus exposes data in a simple, human-readable text format. Lines starting with # are comments; every other line is one sample, with the metric name first and the value last. Labels, if any, appear inside {}, and a sample may carry several labels.
```
# HELP http_request_count The total number of HTTP requests received.
# TYPE http_request_count counter
http_request_count{endpoint="/a"} 10
http_request_count{endpoint="/b"} 200
http_request_count{endpoint="/c"} 3
```
+ Metric name
The metric name names the thing being measured, e.g. http_requests_total. Names follow a few rules (letters, digits, underscores, colons) and are conventionally built as application prefix, measured object, value type, and unit. For example:
```
- push_total
- userlogin_mysql_duration_seconds
- app_memory_usage_bytes
```
+ Labels
Labels identify the different dimensions of one time series, e.g. whether an HTTP request was a POST or a GET, and which endpoint it hit. The resulting series identifier looks like this:
```
http_requests_total{method="POST",endpoint="/api/tracks"}
```
Viewed in traditional database terms, http_requests_total is the table name, the labels are columns, the timestamp is the primary key, and there is one extra float64 column holding the sample value (Prometheus stores all sample values as float64).
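Querying then amounts to selecting series by metric name and labels, much like a WHERE clause. Two illustrative PromQL queries against the metric above:
```
# all series of http_requests_total whose method label is POST
http_requests_total{method="POST"}

# per-second request rate over the last 5 minutes, still split by label
rate(http_requests_total[5m])
```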
### Keyword: push vs. pull model

Most of the monitoring systems we are familiar with today are of the push type: the monitoring system passively receives health metrics that agents actively report; Open-Falcon is a typical example. The other kind is pull-based: the monitored system exposes its metrics, and the monitoring system actively fetches them by some means (usually HTTP); Prometheus is the most typical example. Systems supporting both push and pull also exist, the classic one being Zabbix.
### Core components
+ Prometheus Server: scrapes and stores time series data, and also provides querying and alert rule management.
+ client libraries: libraries such as client_python for instrumenting applications so that they can expose and report metrics to Prometheus Server.
+ push gateway: an aggregation point for short-lived, batch jobs, mainly used for reporting business-level data.
+ exporters of all kinds that report data, e.g. node_exporter for machine metrics, the MongoDB exporter for MongoDB, and so on.
+ alertmanager: manages alert notifications.
### Basic architecture
The official architecture diagram:

The overall flow is roughly as follows:
1. Prometheus server periodically pulls data from statically configured targets or from targets found via service discovery.
2. When newly pulled data exceeds the configured in-memory buffer, Prometheus persists it to disk (or to the cloud, if remote storage is used).
3. Prometheus can be configured with rules that it evaluates on a schedule; when a rule's condition fires, the alert is pushed to the configured Alertmanager.
4. On receiving alerts, Alertmanager can group, deduplicate, and silence them according to its configuration before finally sending out notifications.
## Existing monitoring items and their replacements
|Existing check|Current method|Replacement|
|:---:|:---|:---|
|Linux server info|zabbix_agentd|replace with node_exporter|
|nginx status|nginx_status.sh|replace with hnlq715/nginx-vts-exporter|
|sidekiq status|checksidekiqstatus.sh|replace with client_python|
|gitsrv status|ps -ef \| grep ...|replace with client_python|
|mysql|checkmysqlperformance.sh|replace with mysqld_exporter|
|redis|redis-status.sh|replace with redis_exporter|
## Alerting plan
Alertmanager delivers notifications by email through a 163 (NetEase) mailbox over SMTP.
Alert rules are defined on the Prometheus server and referenced from `rule_files`; see the sketch below.
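As a sketch of what such a rule could look like (Prometheus 1.7 still uses its own rule syntax rather than YAML; the rule name and threshold here are illustrative, not rules we already run), a file listed under `rule_files` in `prometheus.yml` might contain:
```
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }
```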
## Deployment
Deployment is broken down into:
+ go environment
+ Prometheus Server
+ alertmanager
+ Grafana
+ node_exporter
+ mysqld_exporter
+ redis_exporter
+ mission802/nginx_exporter
+ client_python
### go environment (all hosts)
1. Get the binary tarball `go1.8.3.linux-amd64.tar.gz`
2. Extract it and move it under `/usr/local/`
```bash
tar -zxf go1.8.3.linux-amd64.tar.gz
mv go /usr/local/
```
3. Add Go to the `PATH` in `/etc/profile`
```bash
export PATH=$PATH:/usr/local/go/bin
```
4. Log out of the terminal and log back in; run `go`, and output like the following means the environment is working:
```bash
git@app3:~/PrometheusStack$ go
Go is a tool for managing Go source code.

Usage:

	go command [arguments]

The commands are:

	build       compile packages and dependencies
	clean       remove object files
	doc         show documentation for package or symbol
	env         print Go environment information
	fix         run go tool fix on packages
	fmt         run gofmt on package sources
	generate    generate Go files by processing source
	get         download and install packages and dependencies
	install     compile and install packages and dependencies
	list        list packages
	run         compile and run Go program
	test        test packages
	tool        run specified go tool
	version     print Go version
	vet         run go tool vet on packages

Use "go help [command]" for more information about a command.

Additional help topics:

	c           calling between Go and C
	buildmode   description of build modes
	filetype    file types
	gopath      GOPATH environment variable
	environment environment variables
	importpath  import path syntax
	packages    description of package lists
	testflag    description of testing flags
	testfunc    description of testing functions

Use "go help [topic]" for more information about that topic.
```
### Prometheus Server (Prometheus Server host)
1. Get the binary tarball `prometheus-1.7.1.linux-amd64.tar.gz`
2. Extract it and move it to the install directory
```bash
tar -zxf prometheus-1.7.1.linux-amd64.tar.gz
mv prometheus-1.7.1.linux-amd64 /usr/local/prometheus-server
```
3. Configure prometheus-server
The default config file is `/usr/local/prometheus-server/prometheus.yml`:
```
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']
```
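To have Prometheus actually scrape the exporters deployed below, `scrape_configs` needs one job per exporter. A sketch, assuming everything runs on 192.168.1.243 and the exporters keep their default ports (9100 for node_exporter, 9104 for mysqld_exporter, 9121 for redis_exporter; the nginx_exporter and client_python ports depend on how they are started):
```
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.243:9100']
  - job_name: 'mysqld'
    static_configs:
      - targets: ['192.168.1.243:9104']
  - job_name: 'redis'
    static_configs:
      - targets: ['192.168.1.243:9121']
```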
Check that the installation works:
```bash
git@app3:/usr/local/prometheus-server$ /usr/local/prometheus-server/prometheus --version
prometheus, version 1.7.1 (branch: master, revision: 3afb3fffa3a29c3de865e1172fb740442e9d0133)
build user: root@0aa1b7fc430d
build date: 20170612-11:44:05
go version: go1.8.3
```
Start `prometheus-server`:
```bash
nohup /usr/local/prometheus-server/prometheus -config.file /usr/local/prometheus-server/prometheus.yml -alertmanager.url http://localhost:9093 > /usr/local/prometheus-server/prometheus.log 2>&1 &
```
### alertmanager
1. Get the binary tarball `alertmanager-0.8.0.linux-amd64.tar.gz`
2. Extract it and move it to the install directory
```bash
tar -zxf alertmanager-0.8.0.linux-amd64.tar.gz
mv alertmanager-0.8.0.linux-amd64 /usr/local/alertmanager
```
3. Configure alertmanager
The default config file is `/usr/local/alertmanager/simple.yml`:
```
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  # The auth token for Hipchat.
  hipchat_auth_token: '1234556789'
  # Alternative host for Hipchat.
  hipchat_url: 'https://hipchat.foobar.org/'

# The directory from which notification templates are read.
templates:
- '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # A default receiver
  receiver: team-X-mails

  # All the above attributes are inherited by all child routes and can
  # overwritten on each.

  # The child route trees.
  routes:
  # This routes performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails

    # The service has a sub-route for critical alerts, any alerts
    # that do not match, i.e. severity != critical, fall-back to the
    # parent node and are sent to 'team-X-mails'
    routes:
    - match:
        severity: critical
      receiver: team-X-pager

  - match:
      service: files
    receiver: team-Y-mails

    routes:
    - match:
        severity: critical
      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - match:
      service: database
    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]
    routes:
    - match:
        owner: team-X
      receiver: team-X-pager
    - match:
        owner: team-Y
      receiver: team-Y-pager

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname', 'cluster', 'service']

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org'

- name: 'team-X-pager'
  email_configs:
  - to: 'team-X+alerts-critical@example.org'
  pagerduty_configs:
  - service_key: <team-X-key>

- name: 'team-Y-mails'
  email_configs:
  - to: 'team-Y+alerts@example.org'

- name: 'team-Y-pager'
  pagerduty_configs:
  - service_key: <team-Y-key>

- name: 'team-DB-pager'
  pagerduty_configs:
  - service_key: <team-DB-key>

- name: 'team-X-hipchat'
  hipchat_configs:
  - auth_token: <auth_token>
    room_id: 85
    message_format: html
    notify: true
```
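For the alerting plan above (notifications via a 163 mailbox), only the `global` SMTP settings need to change. A sketch, where the account and the SMTP authorization code are placeholders (163 typically requires an SMTP authorization code rather than the login password):
```
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'your-account@163.com'
  smtp_auth_username: 'your-account@163.com'
  smtp_auth_password: 'your-smtp-auth-code'
```
Alertmanager can then be started in the same way as the other components (0.8.x still uses single-dash flags):
```bash
nohup /usr/local/alertmanager/alertmanager -config.file /usr/local/alertmanager/simple.yml > /usr/local/alertmanager/alertmanager.log 2>&1 &
```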
### Grafana
1. Create the directory `/usr/local/services/grafana`
```
sudo mkdir -p /usr/local/services/grafana
```
2. Get the binary tarball and extract it into `/usr/local/services/grafana`
```
sudo tar -zxf grafana-4.4.3.linux-x64.tar.gz -C /usr/local/services/grafana --strip-components=1
```
3. Edit the config file `/usr/local/services/grafana/conf/defaults.ini` and change the values of the two parameters under the `dashboards.json` section:
```
[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards
```
Install the dashboards:
```
sudo mkdir -p /var/lib/grafana/dashboards
# https://github.com/percona/grafana-dashboards.git
sudo tar -zxf grafana-dashboards.tar.gz -C /var/lib/grafana/dashboards --strip-components=1
```
Start grafana-server:
```
cd /usr/local/services/grafana/
sudo nohup /usr/local/services/grafana/bin/grafana-server --homepath /usr/local/services/grafana > /home/atompi/grafana.log 2>&1 &
```
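Before the dashboards can render anything, Grafana (listening on port 3000 by default) needs Prometheus registered as a data source. This can be done in the web UI, or via Grafana's HTTP API; a sketch assuming the default admin:admin credentials and Prometheus running on the same host:
```bash
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy","isDefault":true}'
```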
### node_exporter
1. Get the binary tarball `node_exporter-0.14.0.linux-amd64.tar.gz`
2. Extract it and move it under `/usr/local`
```
tar -zxf node_exporter-0.14.0.linux-amd64.tar.gz
sudo mv node_exporter-0.14.0.linux-amd64 /usr/local/node_exporter
```
3. Run node_exporter
```
sudo nohup /usr/local/node_exporter/node_exporter > /usr/local/node_exporter/node_exporter.log 2>&1 &
```
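A quick sanity check for any exporter is to fetch its metrics endpoint; node_exporter listens on port 9100 by default, and the same check works for the other exporters on their own ports:
```bash
curl -s http://localhost:9100/metrics | head
```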
### mysqld_exporter
1. Get the binary tarball `mysqld_exporter-0.10.0.linux-amd64.tar.gz`
2. Extract it and move it under `/usr/local`
```
tar -zxf mysqld_exporter-0.10.0.linux-amd64.tar.gz
sudo mv mysqld_exporter-0.10.0.linux-amd64 /usr/local/mysqld_exporter
```
3. Create the `promuser` MySQL user and grant privileges
```
mysql> CREATE USER 'promuser'@'127.0.0.1' IDENTIFIED BY '123456';
mysql> GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'promuser'@'127.0.0.1';
mysql> GRANT SELECT ON performance_schema.* TO 'promuser'@'127.0.0.1';
```
4. Create the credentials file `/usr/local/mysqld_exporter/.my.cnf`
```
[client]
user=promuser
password=123456
host=127.0.0.1
```
5. Run mysqld_exporter
```
sudo nohup /usr/local/mysqld_exporter/mysqld_exporter -config.my-cnf="/usr/local/mysqld_exporter/.my.cnf" > /usr/local/mysqld_exporter/mysqld_exporter.log 2>&1 &
```
### redis_exporter
1. Get the binary tarball `redis_exporter-v0.12.2.linux-amd64.tar.gz`
2. Extract it and move it under `/usr/local`
```
tar -zxf redis_exporter-v0.12.2.linux-amd64.tar.gz
sudo mkdir /usr/local/redis_exporter
sudo mv redis_exporter /usr/local/redis_exporter
```
3. Run redis_exporter
```
sudo nohup /usr/local/redis_exporter/redis_exporter -redis.addr redis://192.168.1.243:6379 > /usr/local/redis_exporter/redis_exporter.log 2>&1 &
```
### mission802/nginx_exporter
1. Get the binary tarball `nginx_exporter-1.0.linux-amd64.tar.gz`
2. Extract it and move it under `/usr/local`
```
tar -zxf nginx_exporter-1.0.linux-amd64.tar.gz
sudo mv nginx_exporter-1.0.linux-amd64 /usr/local/nginx_exporter
```
3. Run nginx_exporter
```
nohup /usr/local/nginx_exporter/nginx_exporter -nginx.scrape_uri=http://192.168.1.243/nginx_status > /usr/local/nginx_exporter/nginx_exporter.log 2>&1 &
```
### atompi/prompyclients
1. Get the `prompyclients` executables from `prompyclients-1.0.1.tar.gz`
2. Extract them and move them under `/usr/local`
```
tar -zxf prompyclients-1.0.1.tar.gz
sudo mv prompyclients-1.0.1 /usr/local/prompyclients
```
3. Run prompyclients
```
nohup /usr/local/prompyclients/firstclient.py > /usr/local/prompyclients/prompyclients.log 2>&1 &
```
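For reference, a minimal sketch of what a client_python based collector such as `firstclient.py` might look like; the metric name, port, and the sidekiq process check are illustrative assumptions, not the actual contents of prompyclients:
```python
#!/usr/bin/env python
# Minimal client_python example: expose one custom metric for Prometheus to scrape.
# Everything here (metric name, port 8000, the sidekiq check) is illustrative.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# 1 if a sidekiq process shows up in the process table, 0 otherwise
SIDEKIQ_UP = Gauge('sidekiq_up', 'Whether a sidekiq process is running')

def sidekiq_running():
    """Return 1 when `ps -ef` lists a sidekiq process, else 0."""
    out = subprocess.check_output(['ps', '-ef'])
    return 1 if b'sidekiq' in out else 0

if __name__ == '__main__':
    start_http_server(8000)  # metrics served at http://<host>:8000/metrics
    while True:
        SIDEKIQ_UP.set(sidekiq_running())
        time.sleep(15)
```
The corresponding scrape target would then be `<host>:8000` (or whatever port the real script binds).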
### systemd unit files
#### /etc/systemd/system/prometheus.service
```
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/prometheus-server/prometheus \
-config.file=/usr/local/prometheus-server/prometheus.yml \
-storage.local.path=/usr/local/prometheus-server/data \
-alertmanager.url=http://localhost:9093
[Install]
WantedBy=multi-user.target
```
#### /etc/systemd/system/grafana.service
```
# /etc/systemd/system/grafana.service
[Unit]
Description=Grafana Server
Documentation=http://docs.grafana.org
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/services/grafana/bin/grafana-server --homepath /usr/local/services/grafana
[Install]
WantedBy=multi-user.target
```
#### /etc/systemd/system/node_exporter.service
```
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter Server
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
```
#### /etc/systemd/system/mysqld_exporter.service
```
# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Mysqld Exporter Server
Documentation=https://github.com/prometheus/mysqld_exporter
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter \
-config.my-cnf="/usr/local/mysqld_exporter/.my.cnf"
[Install]
WantedBy=multi-user.target
```
#### /etc/systemd/system/redis_exporter.service
```
# /etc/systemd/system/redis_exporter.service
[Unit]
Description=Redis Exporter Server
Documentation=https://github.com/oliver006/redis_exporter
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/redis_exporter/redis_exporter \
-redis.addr=redis://192.168.1.243:6379
[Install]
WantedBy=multi-user.target
```
#### /etc/systemd/system/nginx_exporter.service
```
# /etc/systemd/system/nginx_exporter.service
[Unit]
Description=Nginx Exporter Server
Documentation=https://github.com/mission802/nginx_exporter
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/nginx_exporter/nginx_exporter \
-nginx.scrape_uri=http://192.168.1.243/nginx_status
[Install]
WantedBy=multi-user.target
```
#### /etc/systemd/system/client_python.service
```
# /etc/systemd/system/client_python.service
[Unit]
Description=Client Python Server
Documentation=http://gitee.com/atompi/prompyclients
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/prompyclients/firstclient.py
[Install]
WantedBy=multi-user.target
```
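Alertmanager itself can be managed the same way; a unit along the same lines, assuming the install path used above and alertmanager 0.8's single-dash `-config.file` flag:
#### /etc/systemd/system/alertmanager.service
```
# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager Server
Documentation=https://github.com/prometheus/alertmanager
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/alertmanager/alertmanager \
-config.file=/usr/local/alertmanager/simple.yml
[Install]
WantedBy=multi-user.target
```
After installing a unit file, reload systemd and enable the service, for example:
```bash
sudo systemctl daemon-reload
sudo systemctl enable prometheus.service
sudo systemctl start prometheus.service
```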