Getting Started · Prometheus Chinese Documentation

## Getting started

---

This is a `hello, world`-style guide that uses a simple example to show how to install, configure, and use Prometheus. You will download and run Prometheus locally, configure it to scrape the metrics it exposes about itself, and then use the collected data to run queries, define rules, and graph the values you care about.

### Download and run Prometheus

[Download the latest stable release](https://prometheus.io/download) for your platform, then extract it and change into the resulting directory:

> `tar xvfz prometheus-*.tar.gz`
>
> `cd prometheus-*`

Before starting Prometheus, you need to give it a configuration file.

### Configure Prometheus to monitor itself

Prometheus collects metrics by pulling (scraping) them from targets over HTTP. Because the server exposes metrics about itself in the same way, it can scrape and monitor its own health. A Prometheus server that collects only its own time series is of little practical use, but it makes a good first example. Save the following configuration to a file with a name of your choice, for example `prometheus.yml`:

```prometheus.yml
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
```

For the complete set of configuration options, see the [configuration documentation](https://prometheus.io/docs/operating/configuration).

### Start Prometheus

Change to the directory that contains the Prometheus binary and start the server, pointing it at the configuration file you just created:

> `./prometheus -config.file=${dir}/prometheus.yml`
>
> Here `${dir}` stands for the absolute or relative path of the directory that holds `prometheus.yml`. By default, Prometheus stores its database in `./data` (flag `-storage.local.path`).

Once Prometheus has started, open [http://localhost:9090](http://localhost:9090/) in a browser.
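The configuration above sets a global `scrape_interval` of 15 seconds and overrides it with 5 seconds for the `prometheus` job. As a rough illustration of that precedence, here is a minimal Python sketch. It is not Prometheus code: the helper names `parse_duration` and `effective_scrape_interval` are hypothetical, the configuration is mirrored as a plain dictionary rather than read from YAML, and only single-unit durations such as `15s` or `1m` are handled.

```python
"""Sketch: how a job-level scrape_interval overrides the global default."""

# Multipliers for the simple single-unit durations used in this guide.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a duration such as '15s' or '1m' into seconds (hypothetical helper)."""
    if len(text) < 2:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = text[:-1], text[-1]
    if unit not in UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(value) * UNIT_SECONDS[unit]


def effective_scrape_interval(global_cfg, job_cfg):
    """Return the interval in seconds that applies to one scrape job."""
    # A job-level setting wins; otherwise the global value is used.
    raw = job_cfg.get("scrape_interval", global_cfg["scrape_interval"])
    return parse_duration(raw)


# Plain-dict mirror of the prometheus.yml shown above (an assumption made
# for this sketch; Prometheus itself reads the YAML file).
config = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {
            "job_name": "prometheus",
            "scrape_interval": "5s",
            "static_configs": [{"targets": ["localhost:9090"]}],
        },
    ],
}

if __name__ == "__main__":
    for job in config["scrape_configs"]:
        seconds = effective_scrape_interval(config["global"], job)
        print(f"{job['job_name']}: scraped every {seconds}s")
```

Removing the job-level `scrape_interval` from the dictionary makes the same call fall back to the global 15 seconds.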
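After a few seconds the server will have started collecting time series about itself. You can also open [http://localhost:9090/metrics](http://localhost:9090/metrics) in a browser to see directly the metrics that Prometheus exposes about itself.

The number of operating system threads that execute Prometheus is controlled by the `GOMAXPROCS` environment variable. As of Go 1.5, the default value is the number of available CPUs. Blindly setting `GOMAXPROCS` to a high value can be counterproductive; see the [Go FAQ](http://golang.org/doc/faq#Why_no_multi_CPU).

Note: by default Prometheus uses around 3GB of memory. If your machine has less memory, you can tune Prometheus to use less. For details, see the [memory usage documentation](https://prometheus.io/docs/operating/storage/#memory-usage).

### Use the expression browser

Let's look at some of the data Prometheus has collected about itself. To use the built-in expression browser, open [http://localhost:9090/graph](http://localhost:9090/graph) and choose the "Console" view, which sits next to the "Graph" tab.

As you can see at [http://localhost:9090/metrics](http://localhost:9090/metrics), one of the metrics Prometheus exports about itself is `prometheus_target_interval_length_seconds` (the actual time between two scrapes). Enter it into the expression console:

> `prometheus_target_interval_length_seconds`

This should return a number of time series that all share the metric name `prometheus_target_interval_length_seconds` but carry different labels. The labels designate different latency percentiles and target group intervals.

If you are only interested in the 99th percentile latency, use this query:

> `prometheus_target_interval_length_seconds{quantile="0.99"}`

To count the number of returned time series, write:

> `count(prometheus_target_interval_length_seconds)`

For more about the expression language, see the [expression language documentation](https://prometheus.io/docs/querying/basics/).

### Use the graphing interface

To graph expressions, open [http://localhost:9090/graph](http://localhost:9090/graph) and switch to the "Graph" tab. For example, enter the following expression to graph the per-second rate of storage chunk operations in the self-scraped Prometheus:

> `rate(prometheus_local_storage_chunk_ops_total[1m])`

### Start some sample targets

Scraping other targets is more interesting than having Prometheus collect only its own time series. The Go client library ships with an example that exports fictional RPC latencies for three services with different latency distributions. We will start three instances of it.

Make sure you have a working Go environment, then download the Prometheus Go client library and run the three example processes:

```example
# Fetch the client library code and compile example.
git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build

# Start 3 example targets in separate terminals:
./random -listen-address=:8080
./random -listen-address=:8081
./random -listen-address=:8082
```

You should now be able to open [http://localhost:8080/metrics](http://localhost:8080/metrics), [http://localhost:8081/metrics](http://localhost:8081/metrics), and [http://localhost:8082/metrics](http://localhost:8082/metrics) in a browser and see the metrics these processes expose.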
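The `/metrics` pages mentioned above serve plain text, one sample per line, in the form `name{label="value",...} value`. To make the label selector `{quantile="0.99"}` and the `count(...)` query from the expression browser section more concrete, here is a minimal Python sketch that parses such text and filters it by label. It is an approximation, not the official parser: `parse_metrics` and `select` are hypothetical helpers, escaped characters inside label values are not handled, and the numbers in `SAMPLE` are made up for illustration.

```python
"""Sketch: parse /metrics text and filter samples by label."""
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def select(samples, name, **matchers):
    """Keep samples with this name whose labels equal every matcher."""
    return [
        s for s in samples
        if s[0] == name and all(s[1].get(k) == v for k, v in matchers.items())
    ]


# Illustrative input only; the values are invented for this sketch.
SAMPLE = """\
# TYPE prometheus_target_interval_length_seconds summary
prometheus_target_interval_length_seconds{interval="5s",quantile="0.5"} 5.0001
prometheus_target_interval_length_seconds{interval="5s",quantile="0.99"} 5.0012
prometheus_target_interval_length_seconds_count{interval="5s"} 42
"""

if __name__ == "__main__":
    samples = parse_metrics(SAMPLE)
    name = "prometheus_target_interval_length_seconds"
    # Rough analogue of: prometheus_target_interval_length_seconds{quantile="0.99"}
    print(select(samples, name, quantile="0.99"))
    # Rough analogue of: count(prometheus_target_interval_length_seconds)
    print(len(select(samples, name)))
```

To try it against a running server, the text could be fetched with `urllib.request.urlopen("http://localhost:9090/metrics")` and passed to `parse_metrics` after decoding.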
### Configure Prometheus to monitor the sample targets

Now configure Prometheus to scrape these three targets. We group all three endpoints into one job named `example-random`, and treat the processes on ports 8080 and 8081 as production targets and the one on port 8082 as a canary instance. To model this in Prometheus, we add two groups of endpoints to the single job and attach a label to each group: `group="production"` for the first group and `group="canary"` for the second.

To achieve this, add the following job definition to the `scrape_configs` section of `prometheus.yml` and restart Prometheus:

```example
scrape_configs:
  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'
```

Go to the expression browser and verify that Prometheus now has metrics from both groups of targets, such as the `rpc_durations_seconds` metric.

### Configure rules for aggregating scraped data into new time series

Queries that aggregate over thousands of time series can become slow when computed ad hoc. To make them more efficient, Prometheus lets you prerecord expressions into new persisted time series through configured recording rules. Suppose we are interested in the per-second rate of RPCs (`rpc_durations_seconds_count`), measured over a 5-minute window and averaged per job and service. We could write this as:

> `avg(rate(rpc_durations_seconds_count[5m])) by (job, service)`

Try graphing this expression.

To record the time series resulting from this expression into a new metric called `job_service:rpc_durations_seconds_count:avg_rate5m`, create a file with the following recording rule and save it as `prometheus.rules`:

> `job_service:rpc_durations_seconds_count:avg_rate5m = avg(rate(rpc_durations_seconds_count[5m])) by (job, service)`

To make Prometheus pick up this new rule, add a top-level `rule_files` statement to `prometheus.yml` and an `evaluation_interval` to its `global` section. The configuration should now look like this:

```example
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules'

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'
```

Restart Prometheus with the new configuration and verify that a time series with the metric name `job_service:rpc_durations_seconds_count:avg_rate5m` can now be queried in the Console view of the expression browser.
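To give a feel for what the recorded expression `avg(rate(rpc_durations_seconds_count[5m])) by (job, service)` computes, here is a simplified Python sketch. It is only an approximation under stated assumptions: `simple_rate` and `avg_rate_by` are hypothetical helpers, the rate is taken between the first and last point of the window without the counter-reset handling and extrapolation that PromQL's `rate()` performs, and the counter values and label values in `series` are invented for illustration.

```python
"""Sketch: a simplified avg(rate(...)) by (job, service)."""
from collections import defaultdict


def simple_rate(points):
    """Per-second increase between the first and last (timestamp, value) point.

    Simplification: no counter-reset handling and no extrapolation to the
    window edges, both of which PromQL's rate() performs.
    """
    (t0, v0), (t1, v1) = points[0], points[-1]
    if t1 <= t0:
        raise ValueError("need at least two points with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def avg_rate_by(series, group_labels):
    """Average simple_rate over all series sharing the same group label values."""
    groups = defaultdict(list)
    for labels, points in series:
        key = tuple(labels.get(name, "") for name in group_labels)
        groups[key].append(simple_rate(points))
    return {key: sum(rates) / len(rates) for key, rates in groups.items()}


# Invented counter samples covering a 5-minute (300 s) window.
series = [
    ({"job": "example-random", "service": "exponential",
      "instance": "localhost:8080"}, [(0, 0), (300, 3000)]),
    ({"job": "example-random", "service": "exponential",
      "instance": "localhost:8081"}, [(0, 0), (300, 6000)]),
    ({"job": "example-random", "service": "uniform",
      "instance": "localhost:8080"}, [(0, 100), (300, 1600)]),
]

if __name__ == "__main__":
    for key, value in avg_rate_by(series, ["job", "service"]).items():
        print(key, value)
```

Grouping by `job` and `service` drops the `instance` label, so the two `exponential` series collapse into one averaged value. That smaller result is what a recording rule stores as a new time series.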